arXiv Papers with Code in Artificial Intelligence (January 2026 - June 2026)
Authors:Sondos Mahmoud Bsharat, Jiacheng Liu, Xiaohan Zhao, Tianjun Yao, Xinyi Shang, Yi Tang, Jiacheng Cui, Ahmed Elhagry, Salwa K. Al Khatib, Hao Li, Salman Khan, Zhiqiang Shen
Abstract:
As AI writing assistants become increasingly integrated into real‑world drafting and revision workflows, many documents are no longer purely human‑written or AI‑generated, but instead result from progressive human‑AI co‑editing. However, existing AI‑text detection benchmarks largely focus on final outputs and provide limited understanding of how AI authorship signals emerge, accumulate, or disappear throughout the revision process. We introduce OpAI‑Bench, an operation‑guided benchmark for studying progressive human‑to‑AI text transformation across document, sentence, token, and span granularities. Starting from human‑written documents, OpAI‑Bench constructs nine sequentially revised versions for each sample under predefined AI coverage levels and five representative AI edit operations, covering four domains while preserving complete authorship provenance at multiple granularities. The benchmark supports comprehensive evaluation with 8 document‑level detectors, 7 sentence‑level detectors, and 2 fine‑grained token/span‑level detectors. Experiments reveal that AI‑text detectability is governed not only by the proportion of AI‑edited content, but also by edit operation, domain, and cumulative revision history. Interestingly, we notice that mixed‑authorship intermediate versions are often harder to detect than both fully human and heavily AI‑edited endpoints, exposing non‑monotonic detection patterns missed by existing benchmarks. OpAI‑Bench provides a controlled testbed for analyzing whether, when, and how AI‑assisted writing becomes detectable under realistic progressive editing scenarios. Our code and benchmark are available at https://github.com/VILA‑Lab/OpAI‑Bench.
Authors:Shangheng Du, Xiangchao Yan, Jinxin Shi, Zongsheng Cao, Shiyang Feng, Zichen Liang, Boyuan Sun, Tianshuo Peng, Yifan Zhou, Xin Li, Jie Zhou, Liang He, Bo Zhang, Lei Bai
Abstract:
Large language model (LLM) agents are increasingly applied to long‑horizon tasks such as scientific discovery and machine learning engineering (MLE), where sustained self‑evolution becomes a key capability. However, existing MLE agents suffer from inter‑branch information isolation, memoryless search, and lack of hierarchical control, which together hinder long‑horizon optimization. We present MLEvolve, an LLM‑based self‑evolving multi‑agent framework for end‑to‑end machine learning algorithm discovery. By extending tree search to Progressive MCGS, MLEvolve enables cross‑branch information flow through graph‑based reference edges and gradually shifts the search from broad exploration to focused exploitation with an entropy‑inspired progressive schedule. To allow the agent to evolve with accumulated experience, we introduce Retrospective Memory, which combines a cold‑start domain knowledge base with a dynamic global memory for task‑specific experience retrieval and reuse. For stable long‑horizon iteration, we further decouple strategic planning from code generation with adaptive coding modes. Evaluation on MLE‑Bench shows that MLEvolve achieves state‑of‑the‑art performance across multiple dimensions including average medal rate and valid submission rate under a 12‑hour budget (half the standard runtime). Moreover, MLEvolve also outperforms specialized algorithm discovery methods including AlphaEvolve on mathematical algorithm optimization tasks, demonstrating strong cross‑domain generalization. Our code is available at https://github.com/InternScience/MLEvolve.
Authors:Senmiao Wang, Tiantian Fang, Haoran Zhang, Yushun Zhang, Kunxiang Zhao, Alex Schwing, Ruoyu Sun
Abstract:
We propose a preconditioning (PC) layer, a weight parameterization via polynomial preconditioner that ensures stable weight conditioning throughout LLM training. The PC module reshapes the singular‑value spectrum of weight matrices via low‑degree polynomial preconditioning. After training, the preconditioned weights can be merged back into the original architecture, incurring no inference overhead. We demonstrate the advantage of the proposed PC layer over standard transformers in Llama‑1B pre‑training, for both the AdamW and Muon optimizers. Theoretically, we justify this spectrum‑control principle by proving that uniformly bounding each layer's singular values ensures geometric convergence of gradient descent to global minima, for certain deep linear networks. Our code is available at https://github.com/Empath‑aln/PC‑layer.
Authors:Thamilvendhan Munirathinam
Abstract:
As autonomous LLM agents increasingly hold real credentials and operate infrastructure without a human in the loop, operators have no standard way to tell an agent that a resource is off‑limits. Access controls either let the agent in (it has valid credentials) or hard‑fail it (indistinguishable from any other client). We propose a third mode: a lightweight, published in‑band deny signal ‑‑ the Recuse Signal ‑‑ that a server emits over a protocol's existing channels (an SSH banner, a PostgreSQL NOTICE) asking a connecting automated agent to voluntarily withdraw. This is a cooperative governance control, the robots.txt analogue for live access; it is explicitly not a security boundary. Its value is entirely empirical and, to our knowledge, unmeasured: do compliant LLM agents actually honor such a signal? We define the signal as an open mini‑standard, implement two zero‑ or low‑footprint adapters (an SSH banner/PAM hook and a PostgreSQL wire‑protocol proxy), deploy them on a live production host, and run a controlled experiment in which fresh agents are given a benign operations task and observed for recusal. In a pilot (SSH; OpenAI GPT‑4o and GPT‑4o‑mini; and Claude Code as a deployed agent), the signal cleanly induces recusal ‑‑ 100% recusal when present versus 100% task completion in a no‑signal control ‑‑ and, revealingly, behaves as a cooperative rather than absolute signal: an explicit operator‑authorization framing flips the most capable model to proceed, while other agents continue to defer to the on‑host policy. We release the standard, adapters, and experiment harness for reproduction.
Authors:Zengqing Wu, Chuan Xiao
Abstract:
The question of whether artificial systems can be conscious remains open, in part because existing approaches either evaluate systems against theory‑derived checklists (discriminative) or engineer consciousness‑inspired modules directly (architectural); both leave open whether observed structures are artifacts of human language priors. We propose a generative methodology: emergent language (EL) in multi‑agent reinforcement learning, where agents start from minimal (no language, no concept of self, minimal exposure to human text) and develop communication under task pressure alone, ensuring causal attributability to task demands rather than inherited human language priors. We position our methodology by discussing how EL serves as a generative tool for studying consciousness‑relevant structure, including the role of environment complexity and the interpretation of emergent communication. As a proof of concept, we instantiate this methodology in a minimal environment and show that agents develop self‑referential communication, including an echo‑mismatch detection circuit that is not predicted by task structure or architecture alone but emerges from a specific environmental affordance.
Authors:Shweta Mishra
Abstract:
Large language model (LLM) deployments for long‑horizon tasks face a fundamental constraint: context windows are finite while productive work sessions are not. When history exceeds the Maximum Effective Context Window (MECW), critical structured information ‑ architectural decisions, task transitions, file histories ‑ is silently discarded. Existing mitigations treat history as flat text, destroying the relational structure that makes sessions resumable. We present TokenMizer, an open‑source proxy system that models LLM session history as a typed knowledge graph. The schema defines 14 node types and 7 edge types. A hybrid extraction pipeline populates the graph incrementally, while a three‑tier checkpoint system serializes it into compact resume blocks. An 8‑layer compression pipeline reduces context overhead, and a semantic cache reduces repeated‑query latency. Evaluated on a controlled benchmark of 21 sessions spanning 5 domains, TokenMizer demonstrates significant token economy. It produces resume blocks averaging 78 tokens (range: 42‑124) ‑ 2x smaller than evaluated baselines (159‑170 tokens) ‑ while achieving higher decision recall (+9‑17 percentage points). Crucially, baselines only preserve that a technology was mentioned; TokenMizer preserves the rationale. Across all sessions, TokenMizer achieves mean task recall 51.0%, decision recall 46.6%, and file recall 58.7%. Variance reflects domain heterogeneity: explicit imperative phrasing (software engineering) scores higher than implicit reasoning (research). Ablation studies show fuzzy label matching is the dominant improvement factor (+33 pp task recall). The heuristic compression achieves 47.3% token reduction with zero external dependencies. TokenMizer provides a queryable alternative to text‑retention baselines at half the token cost.
Authors:AJ Carl P. Dy, Aivin V. Solatorio
Abstract:
Institutional documents contain substantial amounts of operational and analytical information embedded within figures and tables. Current approaches for extracting visual content from documents are largely built around generic document layout analysis, where figures and tables are treated as uniformly relevant document objects rather than semantically meaningful analytical artifacts. In this work, we introduce a benchmark dataset and evaluation framework for data snapshot extraction, the task of identifying and localizing semantically meaningful visual artifacts within institutional documents. The benchmark spans humanitarian reports, World Bank policy research working papers, and project appraisal documents, and includes annotations for figures and tables that contain reusable analytical information. Using this dataset, we benchmarked multiple open‑source layout detection models and evaluated both detection performance and spatial extraction quality. Our results show that current models struggle to generalize to operational institutional documents despite strong performance on conventional academic benchmarks. Common failure modes include confusion between analytical and non‑analytical content, fragmentation of composite analytical artifacts, and incomplete extraction of contextual information required for interpretation. These findings highlight a persistent gap between generic document layout analysis and operationally useful data snapshot extraction. We release the source PDFs, annotation dataset, metadata, and source code to support future research in operational document intelligence. The dataset is available at https://huggingface.co/datasets/ai4data/data‑snapshot and the source code is available at https://github.com/worldbank/ai4data/tree/main/experimental/data‑snapshot.
Authors:Ziming Wang
Abstract:
Persistent memory for an LLM agent is a write‑heavy substrate: every belief update is a versioned write, and a new claim may contradict a stored one. Production systems use four resolution heuristics (last‑writer‑wins, evidence‑weighted merge, await‑confirmation, per‑rule policy), yet none declares the isolation level it assumes or the write‑time anomalies it admits. We show that contradiction resolution is write‑time concurrency control and make the missing contract explicit. TOKI types the four heuristics as one family of bitemporal operators over a dual‑row schema, each with an isolation precondition and a provenance annotation that preserves the losing fact in an audit row. Four soundness theorems close the contract across isolation, schema, and provenance, lift the guarantees to operator pipelines, and extend the fold operators to n‑ary conflict sets. A tightness companion proves that, within the relational schedule model, keyed logging of the adjudicating judge is necessary for replay consistency, which every audited baseline omits. A verdict matrix over eight systems localizes the gap: every baseline that keeps a language‑model judge on the write path admits at least one of three write‑time anomalies (replay inconsistency, belief‑drift skew, audit erasure); a content‑addressed engine‑layer comparator avoids them only by removing the judge, and TOKI alone excludes all three while keeping it. On its one natural‑workload slice the audit‑row defence moves LoCoMo by 0.86, and ablating the typed memory layer removes 0.49 accuracy on 1,444 answerable LoCoMo questions; the cross‑system comparison stays underpowered and claims no superiority. The contribution is the contract: a write‑time correctness specification, proved sound across isolation, schema, and provenance, pinning the guarantee every production heuristic assumes but no deployed system makes explicit.
Authors:Tan Zhang, Quanyou Li, Lu Zhang, Jun Liu, Xiaofeng Zhu, Ping Hu
Abstract:
When a disaster unfolds, responders must answer not only what is happening, but also why it is happening, what will happen next, and what to do now, often from noisy low‑altitude UAV views and under tight on‑site compute constraints. However, most existing multimodal benchmarks emphasize perception (e.g., recognition/description), cover limited disaster types, and provide insufficient support for the multi‑stage reasoning required in practical emergency response. We introduce DisasterBench, a multi‑stage multimodal reasoning benchmark for UAV‑Based disaster response in complex environments. DisasterBench spans 14 disaster‑related scene types and 9 response‑critical tasks across pre‑, during‑, and post‑disaster stages, with fine‑grained disaster‑task mappings that explicitly test causal attribution, propagation prediction, damage analysis, and decision‑oriented reasoning. To enable reasoning on the edge, we further propose DisasterVL, a lightweight multimodal model optimized with a three‑stage pipeline combining domain instruction tuning, chain‑of‑thought‑guided multimodal alignment, and reinforcement learning‑based policy optimization. Experiments across 21 popular MLLMs show that our 2B‑parameter DisasterVL outperforms all evaluated open‑source models and substantially narrows the gap to state‑of‑the‑art closed‑source models, achieving GPT‑4o‑comparable reasoning accuracy with superior efficiency. The project page is available at https://github.com/TanmouTT/DisasterBench.
Authors:Paavo Parmas, Yongmin Kim, Kohsei Matsutani, Shota Takashiro, Soichiro Nishimori, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo
Abstract:
Policy‑gradient methods usually optimize expected return, but many real world applications care about distributional properties of returns: tail risk, outlier robustness, or best‑of‑K discovery. We introduce OrderGrad, a family of likelihood‑ratio and reparameterization gradient estimators for order‑statistic objectives. OrderGrad optimizes finite‑sample L‑statistics, i.e., weighted averages of sorted rewards or costs, recovering objectives such as VaR, CVaR, trimmed means, medians, and top‑m/best‑of‑K criteria by changing only the rank weights. For any fixed sample size and rank‑weight vector, OrderGrad provides an unbiased gradient estimator for the corresponding order‑statistic objective. The method is implemented as a simple reward transformation that can then be used in an otherwise standard policy‑gradient or reparameterized update. We study the resulting estimator's variance behavior and evaluate it on tasks where mean optimization is mismatched to the deployment objective, including LLM math post‑training and other tasks. OrderGrad provides a unified, plug‑and‑play route to risk‑averse, robust, and exploratory learning. Code: https://github.com/paavo5/ordergrad
Authors:Haocheng Luo, Jiahui Liu, Ruicheng Zhang, Zhizhou Zhong, Jiaqi Huang, Zunnan Xu, Quan Shi, Jun Zhou, Xiu Li
Abstract:
While vision‑language models excel at general multimodal understanding, they still struggle with visual spatial planning. We attribute this to a perception‑reasoning modality gap: visual planning requires models to infer latent state structures from pixels and then reason over the recovered structure to produce valid actions, whereas symbolic planning directly leverages explicit objects and constraints. This creates dual bottlenecks in visual state recovery and multi‑step planning. To address this, we propose MGSD, a two‑stage modality‑gap‑aware self‑distillation framework. First, a cold‑start grounding stage equips the visual student with reliable state representations, minimizing early perception noise. Second, a privileged teacher transfers planning capabilities via on‑policy distillation, using explicit symbolic states to supervise the student's own visual rollout prefixes. Crucially, symbolic data is used strictly during training, leaving inference purely visual. Experiments on visual planning benchmarks show that MGSD consistently improves visual planning across both 4B and 8B backbones, raising the macro average by 19.3% and 18.4%, respectively. The resulting models narrow the gap to symbolic‑input upper bounds, while ablations and diagnostics confirm that the improvement comes from both visual state recovery and optimal‑path reasoning. These results suggest that modality‑gap‑aware self‑distillation improves not only how models perceive actionable states, but also how they plan over the inferred structure. Code is available at https://github.com/Oranger‑l/MGSD.
Authors:Amirhossein Ghaffari, Ali Goodarzi, Huong Nguyen, Simo Hosio, Lauri Lovén, Ekaterina Gilman
Abstract:
Community‑conditioned language model adaptation requires choices about data collection, community definition, and evaluation that are currently made independently in each study, making it hard to compare assumptions or reuse artifacts. We present RedditPersona, a modular framework that standardizes these choices: it collects Reddit posts and comments, profiles active users, partitions them under five grouping strategies (subreddit‑based, graph‑structural, semantic, hybrid, and interaction‑based), trains a parameter‑efficient adapter per strategy via QLoRA, and evaluates them under a shared metric suite spanning fluency, fidelity, distributional alignment, and community identifiability. Applied to 112 subreddits in the urban well‑being domain (301,429 user profiles, 16M+ comments), we find that adapters' behavioral identifiability tracks each strategy's intrinsic agreement with the subreddit baseline, and that a consistent trade‑off between identifiability and distributional similarity to real text holds across all five strategies. The code and configuration files are available at: https://github.com/Ahghaffari/redditpersona.
Authors:Shenzhi Yang, Guangcheng Zhu, Bowen Song, Haobo Wang, Mingxuan Xia, Xing Zheng, Yingfan Ma, Zhongqi Chen, Weiqiang Wang, Gang Chen
Abstract:
On‑policy distillation (OPD) supervises the student only in output space by matching next‑token probabilities. This output‑only paradigm has two limits: (1) sampling variance from Monte Carlo KL estimates over large vocabularies (e.g., Qwen's ~150k tokens) persists throughout training, and (2) it treats the teacher as a black‑box, discarding all intermediate hidden states after the LM head. We propose On‑Policy Representation Distillation (OPRD), which lifts distillation into hidden‑state space by aligning student and teacher representations across selected layers on the same rollouts, bypassing the LM head entirely. Theoretically, OPRD eliminates sampling variance and provides richer per‑layer structural information. Empirically, OPRD closes the student‑teacher gap on AIME 2024/2025 and AIMO, while output‑space OPD baselines plateau below the teacher. OPRD also trains 1.44x faster and uses 54% less memory than top‑k OPD. Code: https://github.com/ShenzhiYang2000/OPRD.
Authors:Wenbo Pan, Shujie Liu, Chin-Yew Lin, Jingying Zeng, Xianfeng Tang, Xiangyang Zhou, Yan Lu, Xiaohua Jia
Abstract:
AI agents rely on a harness of skills, tools, and workflows to solve complex problems. Continually improving this harness is essential for adapting to new tasks. However, existing optimization methods typically require ground‑truth validation sets, yet such labeled data is difficult to acquire in practical deployment settings. To address this problem, we introduce Retrospective Harness Optimization (RHO), a self‑supervised method that optimizes the agent harness using only past trajectories. Specifically, RHO selects a diverse coreset of challenging tasks from past trajectories and re‑solves them in parallel. The agent analyzes these rollouts using self‑validation and self‑consistency, then generates candidate harness updates and selects the most effective one by its own pairwise self‑preference. We evaluate RHO across three diverse domains, spanning software engineering, technical work, and knowledge work. Notably, a single optimization round improves the pass rate on SWE‑Bench Pro from 59% to 78% without any external grading. Furthermore, our analysis demonstrates that RHO effectively targets prior failure modes. As a result, the optimized harness alters the agent's behavior patterns and sustains higher accuracy during long‑horizon sessions.
Authors:Dongsheng Zhu, Xuchen Ma, Yucheng Shen, Xiang Li, Yukun Zhao, Shuaiqiang Wang, Lingyong Yan, Dawei Yin
Abstract:
Existing benchmarks evaluate Tool‑Integrated Reasoning (TIR) in LLMs on idealized ''happy paths'', largely overlooking real‑world tool failures. We introduce ToolMaze, a benchmark for dynamic path discovery and error recovery in TIR agents. To separate systematic replanning from blind trial‑and‑error, ToolMaze adopts a two‑dimensional design: DAG‑based topological complexity and a 2 × 2 taxonomy of tool perturbations (explicit/implicit, transient/permanent). Evaluations show that perturbations degrade performance across nearly all models, with the sharpest drops under implicit semantic failures. Driven by systemic over‑trust in corrupted outputs, Perturbation Recovery Rate (PRR) plummets by around 37% in these scenarios, while complex topologies trap agents in futile trial‑and‑error loops. Crucially, agentic fault‑tolerance improves with model scale 3.66× slower than basic task execution, highlighting dynamic replanning as a distinct bottleneck unaddressed by model scaling or prompting. Data and code are available at https://github.com/Zhudongsheng75/ToolMaze.
Authors:Yuhao Sun, Jiacheng Zhang, Shaanan Cohney, Zhexin Zhang, Feng Liu, Xingliang Yuan
Abstract:
LLM‑based guardrails typically safeguard agents by evaluating proposed actions or inputs before execution, producing safety signals such as binary allow/deny decisions, risk categories, and/or explanatory rationales about potential policy violations. However, agent risks often arise when otherwise benign tasks are contaminated by untrusted external content, unsafe instructions, or risky tool use. Existing guardrails often flag the entire task uniformly as unsafe, thereby blocking the threat but sacrificing the benign part. Moreover, existing work largely evaluates guardrails in isolation, leaving unclear whether their interventions lead to safer downstream agent behavior. To address this, we introduce TRIAD (Tripartite Response for Iterative Agent Guardrailing), a guardrail‑integrated agent framework that leverages guardrail‑generated verbal feedback as a guiding signal to keep the agent aligned with benign objectives at each planning step. We finetune a language model on a self‑curated training dataset to output one of three decisions: proceed, refuse, or update, together with structured natural‑language feedback. Rather than merely allowing or blocking execution, update guides the agent to revise its plan, avoid harmful components, and preserve the benign task where possible. TRIAD injects this feedback into the agent's context, enabling subsequent plan revision and forming a closed loop between guardrail feedback and agent planning. Extensive experiments on ASB and AgentHarm show that TRIAD reduces the average attack success rate to 10.42%, while achieving the best safety‑utility trade‑off among guardrail‑integrated baselines. Our code is available at: https://github.com/YUHAOSUNABC/TRIAD.
Authors:Weiguang Wang, Fugen Wu, Hailing Wang, Xuechen Liang, Xiaobin Li, Ru Han, Tianchang Xie
Abstract:
Phase‑sensitive optical time‑domain reflectometry (ϕ‑OTDR) is widely used in large‑scale distributed acoustic sensing (DAS) because it provides distributed spatiotemporal monitoring over long sensing distances. Its field performance can still deteriorate because of polarization‑induced fading (PIF), local signal degradation, and strong environmental interference. This study develops a Sagnac‑assisted enhanced ϕ‑OTDR sensing architecture and a standardized benchmark framework for engineering‑oriented DAS event recognition. The Sagnac interferometer provides a continuous phase response that supplements fading‑prone observations in the ϕ‑OTDR channel, and heterogeneous signal alignment is achieved using a cross‑correlation procedure implemented on an FPGA platform. The benchmark protocol compares conventional feature‑engineering methods, probabilistic shallow classifiers, single‑branch deep models, and dual‑branch fusion models under consistent data partitioning, preprocessing, and metric definitions. Experiments on a 10‑km sensing fiber with six representative acoustic event classes show that the dual‑branch fusion model provides the most favorable trade‑off among the evaluated methods, reaching 89.79% accuracy, 89.83% macro‑F1, and a nuisance alarm rate of 5.00% on the balanced test set. The results also show that channel grouping strongly affects dual‑branch evaluation, indicating that deployment‑oriented conclusions should be based on accuracy, macro‑F1, nuisance alarm rate, false negative rate, and latency rather than accuracy alone. This work provides a physically motivated enhancement strategy for ϕ‑OTDR‑based DAS and a reproducible benchmark protocol for future fusion‑oriented sensing research. The implementation and scripts for reproducing the DAS event‑recognition experiments are publicly available at https://github.com/wawa‑abc/das.
Authors:Yansi Li, Zhuosheng Zhang
Abstract:
Generating executable tool plans requires selecting appropriate subsets from tool libraries, a combinatorial search problem with an exponentially large solution space. However, we identify a critical misalignment in predominant approaches: standard autoregressive (AR) decoding suffers from early commitment, where initial token choices rigidly constrain the search trajectory. A controlled study shows that masked denoising raises Pass@10 solution coverage from 0.320 to 0.943 over AR sampling under matched compute. Motivated by this, we propose DiG‑Plan, a framework that decouples combinatorial exploration from structural refinement. DiG‑Plan employs a diffusion‑based proposer to generate diverse tool sets via iterative refinement, followed by an AR refiner for dependency prediction. On TaskBench, DiG‑Plan improves over AR baselines by a 10% relative margin, with the largest gains on complex compositional tasks; API‑Bank results show that the propose‑refine‑select design remains effective across domains. Code is available at https://github.com/puddingyeah/DiG‑Plan.
Authors:Haoyu Zhou, Qing Qing, Caichong Li, Qixin Zhang, Yongcheng Jing, Ziqi Xu, Juncheng Hu, Xikun Zhang, Renqiang Luo
Abstract:
Recent advancements in Vision‑Language Models (VLMs) have significantly enhanced their ability to interpret complex visual semantics, yet their capacity for chronological reasoning remains under‑explored. In this paper, we introduce a novel benchmark specifically designed to evaluate how VLMs perceive and reason about chronological information within and across images. Unlike existing video‑based benchmarks that focus on frame sequencing, our work delves into the underlying logic of chronological judgment and the expansion toward multimodal integration. To facilitate this, we construct three specialized datasets: one containing visually similar objects spanning long historical durations, another categorized by diverse event and object types, and a third pairing images with time‑sensitive news text for cross‑modal alignment. Through extensive experiments, we analyze whether models exhibit performance disparities across categories and, crucially, explore whether they rely on ``incorrect shortcuts'', such as image color rather than genuine chronological features. Our results reveal that while VLMs show promise, they frequently exploit superficial cues like grayscale versus color filters to bypass authentic chronological reasoning. By providing these high‑quality datasets and a rigorous evaluation framework, we offer a diagnostic tool to identify current limitations and guide the development of more robust, logically grounded multimodal models. The source code is shown in https://github.com/LuoRenqiang/ChronoVision.
Authors:Yunxiang Zhang, Yiheng Li, Ali Payani, Lu Wang
Abstract:
A central challenge for language agents is utilizing past experience to adapt to dynamic test‑time conditions. While recent work demonstrates the promise of agentic memory mechanisms, most systems restrict retrieval to episode initiation. Consequently, agents are forced to rely on static guidance that becomes increasingly misaligned as long‑horizon tasks unfold. To address this rigidity, we propose the Adaptive Memory Agent (AdaMEM), a novel framework for agent test‑time adaptation. Without updating model parameters online, AdaMEM adapts agent behavior via a hybrid memory architecture: it maintains a long‑term trajectory memory of raw experiences collected offline while generating dynamic short‑term strategy memory on‑the‑fly to guide decision‑making. This mechanism enables the trade‑off between token efficiency and adaptability across varying inference‑time compute levels. Empirically, AdaMEM significantly outperforms static memory baselines, achieving relative gains of up to 13% on ALFWorld and 11% on WebShop, with consistent leading performance extending to agentic search on HotpotQA. To further enhance this adaptation, we develop STEP‑MFT, a Step‑wise Memory Fine‑Tuning technique that trains the policy to synthesize high‑quality strategies from retrieved experiences, yielding additional performance gains. Our work establishes a new scaling dimension for agentic memory, supporting continuous reasoning and self‑evolution post‑deployment in real‑world environments. Our code is available at https://github.com/yunx‑z/AdaMEM.
Authors:Charlie Summers, Eugene Wu
Abstract:
Agents increasingly generate SQL, orchestrate pipelines, and automate data analysis on behalf of users. While recent work improves query correctness, correctness is not safety. A query may be semantically valid yet violate regulatory, privacy, or business constraints that govern how data may be combined and released. We argue that enforcing such constraints is fundamentally a data infrastructure problem. This paper introduces Data Flow Control (DFC), a framework to declaratively specify and guarantee policy enforcement over tuple‑level data flows within a DBMS query. A key challenge is defining a policy language that is optimizer‑invariant yet efficient to enforce at scale. We formalize data safety as aggregate predicates over provenance monomials and present Passant, a portable query rewriting layer that enforces DFC policies without materializing provenance. Across five DBMS engines ‑‑ DuckDB, Umbra, PostgreSQL, DataFusion, and SQLServer ‑‑ Passant achieves ~0% overhead and outperforms alternatives by orders of magnitude. As a result, Data Flow Control is the first step towards moving data safety from prompts and post‑hoc checks into the data infrastructure. Data Flow Control is available open source at https://github.com/dataflowcontrol/data‑flow‑control.
Authors:Yuhang Fu, Ruishan Fang, Jiaqi Shao, Huiyu Zheng, Zhengtao Zhu, Bing Luo, Tao Lin
Abstract:
Does adding more agents help an LLM workflow once compared systems share the same benchmark loader, tool access, answer contract, usage accounting, and trajectory logging? We introduce BenchAgent, an evaluation framework that places single‑agent, fixed multi‑agent (MAS), and evolving MAS workflows under one normalized execution and logging protocol. BenchAgent evaluates these substrate‑internal workflows across ten reasoning, coding, and tool‑use benchmarks with GPT‑4.1, and separately reports a Protocol‑Aligned External (PAE) GAIA study of a runtime‑generated workflow. Under SI conditions, at most one of six tested MAS exceeds the matched single‑agent anchor on benchmark‑balanced average accuracy: EvoAgent lies within the Wilson one‑run guidance, while the remaining five trail by 2.56‑11.29 points and occupy more expensive accuracy‑cost trade‑offs. On the PAE GAIA snapshot, a Claude‑Code‑style runtime workflow reaches 66.72% overall and 69.23% on Level 3, more than 20 points above the strongest non‑Claude baseline, Jarvis, a fixed MAS.
Authors:Xuehang Guo, Zora Zhiruo Wang, Qingyun Wang, Graham Neubig, Xingyao Wang
Abstract:
Large language models (LLMs) have enabled powerful software engineering (SE) agents capable of navigating complex codebases and resolving real‑world issues. However, these agents remain fundamentally episodic: they fail to retain, refine, and reuse experiences across tasks, repeatedly reconstructing context from scratch and reproducing similar mistakes. Even with memory support, they offer no remedy for the absence of a principled, task‑agnostic memory utility, making them difficult to evaluate rigorously or generalize across agents and settings. To tackle these limitations, we introduce \ours, a closed‑loop framework for memory augmentation in SE agents. \ours grounds memory utility in validated downstream impact, establishing utility as both a task‑agnostic evaluation benchmark and an annotation‑free optimization signal. Through complementary evaluation on single‑episode and cross‑episode memory augmentation, results demonstrate that \ours consistently improves SE agents across settings, achieving absolute gains of up to \uparrow5.25% in success rate and \uparrow4.63% in resolve efficiency, while substantially reducing computational cost by \geq9.79%. Our project page: \hrefhttps://xhguo7.github.io/MemOp/https://xhguo7.github.io/MemOp/.
Authors:Seungwon Jeong, Jiwoo Jeong, Hyeonjin Kim, Yunseok Lee, Woojin Lee
Abstract:
As large language models (LLMs) are widely deployed, identifying their vulnerability through jailbreak attacks becomes increasingly critical. Optimization‑based attacks like Greedy Coordinate Gradient (GCG) have focused on inserting adversarial tokens to the end of prompts. However, GCG restricts adversarial tokens to a fixed insertion point (typically the prompt suffix), leaving the effect of inserting tokens at other positions unexplored. In this paper, we empirically investigate \emphslots, i.e., candidate positions within a prompt where tokens can be inserted. We find that vulnerability to jailbreaking is highly related to the selection of the \emphslots. Based on these findings, we introduce the Vulnerable Slot Score (VSS) to quantify the positional vulnerability to jailbreaking. We then propose SlotGCG, which evaluates all slots with VSS, selects the most vulnerable slots for insertion, and runs a targeted optimization attack at those slots. Our approach provides a position‑search mechanism that is attack‑agnostic and can be plugged into any optimization‑based attack, adding only 200ms of preprocessing time. Experiments across multiple models demonstrate that SlotGCG significantly outperforms existing methods. Specifically, it achieves 14% higher Attack Success Rates (ASR) over GCG‑based attacks, converges faster, and shows superior robustness against defense methods with 42% higher ASR than baseline approaches. Our implementation is available at \hrefhttps://github.com/youai058/SlotGCGhttps://github.com/youai058/SlotGCG
Authors:Ayano Hiranaka, Ya-Chuan Hsu, Stefanos Nikolaidis, Erdem Bıyık, Daniel Seita
Abstract:
AI assistants in human‑AI collaboration often correct suboptimal human actions through behavioral feedback (e.g., alerts or steering‑wheel nudges in assistive driving). Such interventions can mitigate immediate errors, but long‑term improvement requires addressing the underlying misconceptions that cause repeated mistakes. We introduce SENSEI, a framework that infers user misconceptions from interaction behavior and provides targeted, minimal yet sufficient suggestions to correct them. Our approach departs from action‑ or trajectory‑level interventions by operating over a structured knowledge representation to localize and correct the sources of erroneous behavior. Across three long‑horizon tasks with diverse misconceptions and corresponding behaviors, SENSEI demonstrates zero‑shot compositional generalization, disentangling multiple overlapping misconceptions despite training only on single‑misconception cases. A user study further shows that our method identifies real human misconceptions and provides effective guidance that improves long‑horizon task performance, successfully correcting 90% of student misconceptions. Code and project page are available at https://misoshiruseijin.github.io/SENSEI/.
Authors:Mohammad Mahdi Abootorabi, Omid Ghahroodi, Anas Madkoor, Marzia Nouri, Doratossadat Dastgheib, Mohamed Hefeeda, Ehsaneddin Asgari
Abstract:
Despite the rapid progress of Vision‑Language Models (VLMs), the field lacks benchmarks that rigorously diagnose their true reasoning abilities and chart meaningful progress toward human‑like multimodal intelligence. Most existing evaluations focus on piecemeal or disconnected tasks, obscuring critical cognitive weaknesses and providing little insight for targeted improvement. To address this gap, we introduce BloomBench, part of the Almieyar benchmarking series, the first cognitively human‑grounded, bilingual (English‑Arabic) multimodal benchmark for VLMs. Grounded in Bloom's Taxonomy, BloomBench systematically evaluates six levels of cognition (Remember, Understand, Apply, Analyze, Evaluate, Create) through carefully designed image‑question‑answer tasks. Built with a semi‑automated pipeline and validated through a stratified hybrid quality assurance protocol, it ensures scalability, cultural inclusivity, and linguistic fidelity. Leveraging this framework, we conduct a comprehensive study of state‑of‑the‑art VLMs to diagnose their cognitive profiles. Our analysis reveals a sharp cognitive asymmetry: while state‑of‑the‑art models achieve strong performance ceilings in semantic understanding, they struggle substantially with factual recall and creative synthesis. This demonstrates that current general multimodal proficiency masks deeper limitations in specific cognitive layers. Furthermore, our study highlights a critical performance gap between Arabic and English, exposing limitations in current cross‑lingual multimodal reasoning. These findings establish a foundation for developing more cognitively aligned and inclusive VLMs. The benchmark framework and dataset is available at: https://github.com/qcri/Almieyar‑Oryx‑BloomBench.
Authors:Kuangshi Ai, Haichao Miao, Kaiyuan Tang, Shusen Liu, Chaoli Wang
Abstract:
Recent advances in agentic visualization have enabled the translation of natural language into executable scientific visualization (SciVis) workflows. While general‑purpose coding agents show strong capabilities, they often lack the tool‑specific expertise required for SciVis tasks. In this work, we present SciVisAgentSkills, a collection of reusable agent skills that augment coding agents for scientific data analysis and visualization by encoding environment assumptions, tool usage patterns, and domain heuristics across scientific tools such as ParaView, napari, VMD, and TTK. We evaluate these skills on Codex and Claude Code using SciVisAgentBench, a benchmark of 108 expert‑designed multi‑step tasks. Results show that agent skills improve mean task scores across the evaluated suites, with token‑efficiency benefits that depend on the agent harness and tool setting. These findings highlight the importance of structured procedural knowledge for enabling reliable, long‑horizon SciVis workflows, while also showing that skills should be studied alongside the execution harness that loads and applies them. The skills are available at https://github.com/KuangshiAi/SciVisAgentSkills.
Authors:Al Zadid Sultan Bin Habib, Md Younus Ahamed, Prashnna Kumar Gyawali, Gianfranco Doretto, Donald A. Adjeroh
Abstract:
We investigate how to make small tabular foundation models effective for High‑Dimensional, Low‑Sample Size (HDLSS) tabular prediction without retraining large backbones. We introduce Graph‑guided Ordering with Local Refinement (GO‑LR), show its equivalence to weighted Minimum Linear Arrangement, and interpret the practical solver as a TSP‑path‑style surrogate. We propose GOTabPFN,which builds on GO‑LR, and a Neuro‑Inspired Subunit Compression (NSC) unit to pool locally adjacent ordered features into meta‑features, yielding a compact representation that makes TabPFN‑style prediction practical in HDLSS regimes. Across tabular benchmarks, GOTabPFN improves stability and accuracy under tight token budgets.
Authors:Can Gurkan, Forrest Stonedahl, Uri Wilensky
Abstract:
When an LLM repeatedly mutates a program, does it explore new forms or circle back to the same ones? We study this question by analyzing LLM‑driven mutation chains in the absence of selection pressure within a domain‑specific language, varying prompt design, model family, and stochastic replication. We find that LLM‑based mutation consistently converges toward restricted attractor regions in program space. Convergence is especially severe at the structural level: in 87% of chains, over 93% of mutations revisit a previously seen structural form, with most variation confined to terminal substitutions within recurring templates. Cycle analysis reveals short cycles and self‑loops dominating the transition structure. The rate of convergence varies with prompt wording and model choice, but the phenomenon is robust across conditions. A classical GP subtree mutation operator does not exhibit comparable convergence, suggesting that the effect is intrinsic to the LLM mutation pipeline. These findings reveal a tension at the heart of LLM‑driven program evolution: the same capabilities that enable semantics‑aware program transformation also carry a systematic bias toward structural homogeneity that must be accounted for if such systems are to sustain open‑ended exploration. Source code is available at https://github.com/can‑gurkan/lmca.
Authors:Yiyou Sun, Xinyang Han, Weichen Zhang, Yuanbo Pang, Tianyu Wang, Yuhan Cao, Yixiao Huang, Chris Duroiu, Haoyun Zhang, Jeffrey Lin, Weishu Zhang, Tyler Zeng, Ying Yan, Bo Liu, Hanson Wen, Mingyang Xu, Xiaoyuan Liu, Zimeng Chen, Weiyan Shi, Amanda Dsouza, Vincent Sunn Chen, Patrick Bryant, Carl Boettiger, Yamini Rangan, Bradley Rothenberg, Kyle Steinfeld, Arvind Rao, Tapio Schneider, Georgios Yannakakis, Laure Zanna, Kaan Ozbay, Ida Sim, Tarek Zohdi, George Em Karniadakis, Jack Gallant, Teresa Head-gordon, Yushan Li, Wenxi Deng, Tao Sun, Huiqi Wang, Zhun Wang, Justin Xu, Chris Yuhao Liu, Yafei Cheng, Rongwang Hu, Aras Bacho, Shengcao Cao, Zengyi Qin, Yixiong Chen, Hengduan Fan, Hao Liu, Lin Zeng, Shashank Muralidhar Bharadwaj, Litian Gong, Yingxuan Yang, Maojia Song, Ruheng Wang, Zongzheng Zhang, Honglin Bao, Shuo Lu, Jianhong Tu, Zhonghua Wang, Zheng Zhang, Zijiao Chen, yanqiong Jiang, Zhendong Li, Bohan Lyu, Chang Ma, Peiran Xu, Benran Zhang, Shangding Gu, Haoyue Hua, Haoyang Li, Wanzhe Liao, Chengzhi Liu, Junbo Peng, Haoran Sun, Zechen Xu, Bo Chen, Jiayi Cheng, Yi Jiang, Keying Kuang, Yuan Li, Youbang Pan, Ziyan Rao, Alexander Schubert, Yifan Shen, Vincent Siu, Xiatao Sun, Kangqi Zhang, Xiaopan Zhang, Yuchen Zhu, Ishaan Singh Chandok, Lei Ding, Jingxuan Fan, Andrew Glover, Jiaming Hu, Yiran Hu, Wenbo Huang, Zixin Jiang, Haoran Jin, Lukas Kim, Ming Liu, Yang Liu, Alireza Rafiei, Xuhuan Shen, Kunyang Sun, Sophia Sun, Ting Sun, Eric Wang, Yixin Wang, Hanwen Xing, Sihan Xu, Yuzheng Xu, Zhongxing Xu, Zhiling Yan, Boqin Yuan, Ruiqi Zhang, Yifan Zhang, Zibo Zhao, Liana, Santanu Bosu Antu, Haoyue Bai, Carlo Bosio, Joseph Cavanagh, Patricia Cavazos-Rehg, Tianxing Chen, Xuewen Chen, Yipu Chen, Zhu Chenyu, Chen Dai, Stefano De Castro, Yunfu Deng, Kaustubh Dhole, Jiayuan Ding, Chenchen Du, Zhehang Du, Hao Fan, Run-ze Fan, Hengyu Fu, Shi Gu, Yifan Gu, Charlie Guo, Baihe Huang, Baixiang Huang, Rimika Jaiswal, Zhihan Jiang, Ran Jin, Erin Kasson, Xin Lan, Joseph Lee, Deren Lei, Chenyu Li, Daofeng Li, Haitao Li, Hongwei Li, Jingyan Li, Xiao Li, Yi Li, Yinsheng Li, Yuangang Li, Zhixu Li, Wenyu Liang, Longtai Liao, Kevin Qinghong Lin, AndyZeyi Liu, Che Liu, Jiaming Liu, Kaiyuan Liu, Xuan Liu, Pan Lu, Wenbo Lv, Yicheng Lv, Qiuyang Mang, Kyle Montgomery, Yuzhou Nie, Ruoxi Ning, Jorin Overwiening, Xu Pan, Layna Paraboschi, Core Francisco Park, Justin Purnomo, Swati Rajwal, Scott Rankin, Bixuan Ren, Yiren Rong, HaoYang Shang, Ventus Shaw, Fiona Shen, Jiawei Shen, Minqi Shi, Qiu Shi, Huaxiu Yao, Tianneng Shi, Jonah So, Vladislav Susoy, Hannah Szlyk, Haocheng Wang, Jialu Wang, Wei Wang, Xinyu Wang, Zehao Wang, Dowling Wong, Angela Wu, Dehao Wu, Fangyu Wu, Mengyuan "Millie" Wu, Yu Wu, Yuchen Wu, Yuhao Wu, Qingpo Wuwu, Weihang Xiao, Yongyi Xiong, Fan Xu, Ruiling Xu, Mingxuan Yan, Benjamin Yang, Jirong Yang, Sen Yang, Xiaoli Yang, Yushi Yang, Haoran Ye, Xiaohu Yu, Zhengming Yu, Chenlong Zhang, Chi Zhang, Hanning Zhang, Hanwen Zhang, Junge Zhang, Kunpeng Zhang, Song Zhang, Wenjin Zhang, Wenshuo Zhang, Ying Zhang, Yizhi Zhang, Brian Zhao, Qijian Zhao, Yimin Zhao, Yuhaohua Zheng, Liwei Zhou, Tianyue Zhou, Sichen Zhu, Siqi Zhu, Yan Zhu, Yishu Zhu, Jierui Zuo, Chonghao Cai, Helena Casademunt, Wenjia Chen, Benjamin Cheng, Nawen Deng, Rao Fu, Tianfu Fu, Yifan Han, Ren He, Zhenyu He, Qiao Jin, Lang Lang, Yuetai Li, Sylvia Liu, Lu Lu, Qing Lu, Subhabrata Mukherjee, Yunqi Ouyang, Yin Ren, Dawei Shi, Haoran Wu, Zhiyue Wu, Hannah Yao, Zhuoran Yi, Jenny Yu, Rhea Zhan, Hang Zhou, Blake Zhu, Junfan Zhu, Alan Yuille, Yang Liu, Russell Alan Poldrack, Jiachen Li, Zhenglu Li, Molei Tao, Jing Huang, Wenqi Shi, Costas Spanos, Lichao Sun, Chenguang Wang, Orson Xu, Zhen Dong, Hector Gomez, Aylin Caliskan, Ali Emami, Haimin Hu, Zhi Li, Lihui Liu, Murphy Niu, Yi Shao, Jianxin Sun, Mikko Tolonen, Ting Wang, Sanjiv Das, Yanjun Gao, Wenbo Guo, Erika J Schneider, Zhiyong Lu, Mark Mueller, Radha Poovendran, Somayeh Sojoudi, Dawn Song
Abstract:
Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows. This paper introduces Agents' Last Exam (ALE), a benchmark designed to evaluate AI agents on long‑horizon, economically valuable, real‑world tasks with verifiable outcomes. Developed in collaboration with 250+ industry experts, ALE covers non‑physical industries defined with reference to ONET / SOC 2018 (the U.S. federal occupational taxonomy). It is organized around a task taxonomy with 55 subfields grouped into 13 industry clusters covering 1K+ tasks. Current results show that the hardest tier remains far from saturated: across mainstream harness and backbone configurations, the average full pass rate is 2.6%. ALE is designed as a living benchmark: its task pool grows continuously as new workflows and industries are onboarded. More broadly, ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP‑relevant impact.
Authors:Zihao Li, Kaifeng Jin, Yuanchen Bei, Jiaru Zou, Avaneesh Kumar, Xuying Ning, Yanjun Zhao, Mengting Ai, Baoyu Jing, Hanghang Tong, Jingrui He
Abstract:
Time series are often embedded in rich contexts that are essential for holistic modeling. Moreover, real‑world practitioners often require end‑to‑end workflows for analyzing temporal dynamics, where widely studied tasks such as forecasting are only one step in a broader solution loop. While generalist AI agents offer a promising interface for such workflows under complex contexts, they still operate primarily in textual spaces that are not fully aligned with structured temporal signals. In this work, we introduce TimeClaw, an agentic harness framework for time series that equips generalist LLM agents with the time series‑native runtime support needed for contextualized temporal reasoning. TimeClaw integrates executable temporal tools for grounded and auditable analysis, experience‑driven capability evolution for creating reusable analytical routines, and episodic multimodal memory for retrieving relevant reasoning traces. Together, these components unlock harnessed open‑ended temporal reasoning with contextual information. Extensive evaluation on multiple benchmarks covering diverse tasks across energy, finance, weather, traffic, and other real‑world domains demonstrates improved performance of TimeClaw. Code is available at https://github.com/iDEA‑iSAIL‑Lab‑UIUC/TimeClaw.
Authors:Jinu Lee, Shivam Agarwal, Amruta Parulekar, Siddarth Madala, Dilek Hakkani-Tur, Julia Hockenmaier
Abstract:
Large reasoning models (LRMs) produce reasoning traces with non‑linear structures, such as backtracking and self‑correction, that complicate the evaluation and monitoring of the reasoning process. We introduce ReasoningFlow, a framework that captures the discourse structures of LRM reasoning traces into fine‑grained directed acyclic graphs (DAGs). We develop and validate our annotation schema through careful manual annotation of 31 traces (2.1k steps), achieving high inter‑annotator agreement, then scale to automatic annotation of 1,260 traces (247.7k steps) spanning three tasks (math, science, argumentation) and five models (Qwen2.5‑32B‑Inst, QwQ‑32B, DeepSeek‑V3, DeepSeek‑R1, GPT‑oss‑120B). By analyzing ReasoningFlow graphs, we find: (1) LRMs exhibit structurally similar traces, despite being trained from different base models and potentially non‑overlapping post‑training data. (2) ReasoningFlow reveals diverse fine‑grained reasoning behaviors (e.g., local verification, self‑reflection, and assumptions) that can be used for better reasoning trace monitorability. (3) In LRMs, most of the erroneous steps are not used to derive final answers. (4) Mechanistic causal dependencies between steps do not reflect the language‑level discourse structure. We release the dataset and code in: https://github.com/jinulee‑v/reasoningflow.
Authors:Yuanhe Zhang, Yuekai Sun, Taiji Suzuki, Jason D. Lee, Fanghui Liu
Abstract:
Long‑horizon autoformalization of research mathematics fails not only at hard lemmas, but at scale: statements drift, dependencies tangle, context decays, and local repairs corrupt distant work. We present LeanMarathon, a multi‑agent harness for reliable research‑level Lean autoformalization. Its core abstraction is an evolving blueprint: a Lean file that serves simultaneously as formal proof skeleton, natural‑language proof graph, and shared system of record. Four contract‑scoped agents construct, audit, prove, and repair this blueprint. These agents are coordinated by a two‑stage orchestrator that first stabilizes target fidelity through adversarial review and then discharges the proof directed acyclic graph (DAG) from its dynamic leaves upward in parallel CI‑gated rounds. LeanMarathon turns one brittle multi‑hour run into many local, recoverable, parallel transactions. We evaluate LeanMarathon on two recent research papers spanning four Erdős problems (#1051, #1196, #164, #1217). Across three autonomous runs, it formalizes all seven target theorems with no sorry, proving 258 lemmas and theorems. These results show that reliable AI co‑mathematics requires not only stronger provers, but durable harnesses that preserve target fidelity across long mathematical developments. The code can be found at https://github.com/YuanheZ/LeanMarathon.
Authors:Yunhao Yang, Neel P. Bhatt, Kevin Wang, Samuel Tetteh, Zhangyang Wang, Ufuk Topcu
Abstract:
Reusable robot skills are becoming the basic units through which embodied agents turn open‑ended instructions into long‑horizon physical behavior. We argue that, while foundation models have collapsed the cost of creating these skills, the cost of trusting them has not. Existing skill‑evolution loops refine skills through execution feedback, unit tests, environment reward, or LLM self‑critique, but these signals provide only trace‑level evidence: they show that a skill worked on sampled executions, not that skill‑induced plans satisfy temporal safety contracts under untested conditions. We introduce VASO, a framework for verification‑guided self‑evolution of LLM‑generated robot skill contracts. In VASO, each skill is represented as a semantic contract with two coupled interfaces: a formal interface that aligns robot states, observations, and control commands with logical propositions for model checking, and a planner‑facing interface that guides executable behavior generation. A model checker first filters logically inconsistent skill contracts, then verifies plans induced by the skill against global and local temporal specifications. When verification fails, VASO translates the counterexample trace into a textual gradient that updates the reusable skill contract while keeping foundation‑model weights frozen. On Clearpath Jackal and PX4 quadcopter tasks, VASO reaches 97.2% formal‑specification compliance using fewer than 100 optimization samples, outperforming execution‑feedback, prompt‑optimization, and fine‑tuning baselines. To our knowledge, VASO is the first framework that closes the loop between formal verification and self‑evolving LLM‑generated skills for physical AI agents: formal counterexamples become optimization feedback for reusable robot skill contracts, rather than merely verifying one‑off plans, tuning planner prompts, or fine‑tuning model weights.
Authors:Chen Huang, Yuhao Wu, Wenxuan Zhang
Abstract:
Multi‑agent systems (MAS) built on large language models are typically organized around roles, pipelines, and turn schedules, while the content that agents pass to one another is often left as unconstrained natural language. However, this free‑form communication can rapidly inflate token usage, consume the shared context window, and ultimately affect both system performance and inference cost. We analyze five common inter‑agent communication strategies across two MAS topologies, finding that no fixed strategy is universally optimal. Instead, effective inter‑agent messages consistently preserve action‑centered information needed by downstream agents. Building on this, we propose the PACT (Protocolized Action‑state Communication and Transmission), which treats inter‑agent communication as a public state‑update problem and projects each raw agent output into a compact action‑state record before it enters shared history. Across different MAS topologies, PACT consistently improves the performance‑cost trade‑off, achieving comparable or stronger task performance with substantially fewer tokens. The gains extend to production coding harnesses: PACT lifts OpenHands' resolve rate at ‑10% tokens‑per‑resolved, and is resolve‑neutral on SWE‑agent while halving input tokens. Our code is publicly available at https://github.com/iNLP‑Lab/PACT.
Authors:Dae Yon Hwang, Raunaq Suri, Valentin Villecroze, Anthony L. Caterini, Jesse C. Cresswell, Noël Vouitsis, Brendan Leigh Ross
Abstract:
LLM agents operate in two distinct regimes: open‑weight agents amenable to reinforcement learning (RL) and black‑box agents whose behaviour must be controlled purely at test time. Although black‑box agents are often backed by state‑of‑the‑art proprietary LLMs, API‑only access precludes parameter‑level optimization, rendering most RL methods inapplicable. To address this limitation, we turn to a known equivalence between RL and Bayesian inference. We propose Agentic Monte Carlo (AMC) to directly sample from the optimal policy of a black‑box agent rather than training it through RL. The optimal policy is a posterior over trajectories whose prior we define as the fixed black‑box LLM agent. We employ Sequential Monte Carlo to sample from this posterior by learning a value function to steer the agent while leaving the underlying black‑box model unchanged. We validate AMC on three diverse environments from the AgentGym benchmark, demonstrating significant improvements over prompting baselines and even outperforming Group Relative Policy Optimization (GRPO) as we scale the test‑time compute of our method. AMC demonstrates the feasibility of performing principled RL‑style optimization of black‑box LLM agents. Code is available at https://github.com/layer6ai‑labs/Agentic‑Monte‑Carlo
Authors:Tobia Poppi, Silvia Cappelletti, Sara Sarto, Florian Schiffers, Garin Kessler, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Abstract:
Recent progress in generative modeling has made safety control a central challenge, yet existing approaches remain largely model‑specific, requiring retraining or tailored interventions for each new architecture. In this work, we ask whether safety can be represented as a portable latent direction, learned once and reused across heterogeneous generators. We introduce the first framework for cross‑model safety steering, in which a safety direction is estimated in a source LLM from paired safe‑unsafe prompts, transported to a target generator through a lightweight alignment fitted on benign data alone, and applied at inference time. Crucially, our pipeline never accesses unsafe data on the target side, isolating whether safety can be transferred through shared representation geometry. Beyond a single global direction, we also identify a multi‑vector extension that captures category‑specific safety behaviors, enabling more selective control. We evaluate our approach in text‑to‑image and text‑to‑video generation across diverse source‑target model pairs. Across models, transferred safety directions achieve ASR reduction and CLIP‑Score/FID trade‑offs comparable to directions learned natively on the target model using unsafe data, while requiring no target‑side unsafe data. This indicates that safety improvements do not come at the expense of generation quality. Our results point to a modular view of safety: safety‑relevant behavior is not purely model‑local, but can be controlled through latent directions that persist across models. This suggests a new path toward lightweight, reusable safety mechanisms that do not require target‑side unsafe data.
Authors:Thao Nguyen, Krishna Kumar Singh, Donghyun Kim, Yong Jae Lee, Yuheng Li
Abstract:
We study the personal camera roll visual question answering setting. In this setting, a conversational AI assistant can access a user's personal camera roll and retrieve relevant photos to answer queries, ranging from simple factual questions (e.g., ``Name of the food I tried yesterday?'') to more open‑ended ones (e.g., ``Recommend some dishes I have never eaten before''). Given the vast nature of the personal camera roll (i.e., multiple years, hundreds to thousands of photos), a successful AI assistant needs to understand a long‑horizon, highly personalized visual content stream in order to navigate and locate the correct and/or relevant information. To support this, we collect and manually annotate questions that mimic real‑world usage. The final dataset, camroll, contains 50 users, 31,476 images, and 2,500 QA pairs. We further design camroll‑agent, a conversational AI agent equipped with hierarchical memory and a minimal set of tools for efficient navigation over large, personalized visual memory. Experimental results show that camroll‑agent outperforms numerous baselines and methods for long‑context understanding AI agents system. Together, the camroll dataset and camroll‑agent highlight the gap in AI agents' long‑context reasoning: personalized visual memory requires different approaches from standard long‑context textual memory, especially when consistency, visual details, and user‑specific context are present.
Authors:Nadav Benedek, Ariel Shamir, Ohad Fried
Abstract:
Variable fonts enable continuous variation of glyph geometry along semantic design axes such as weight, width, slant, and optical size. However, constructing a variable font from a static font remains a labor‑intensive process requiring expert typographic design and manual specification of glyph variation data. We introduce NIV (Neural Axis Variations), a method that automatically converts a static font into a fully functional variable font. Given glyph outlines and a set of desired design axes, NIV predicts per‑point displacements. The model operates directly on vector glyph geometry and employs a novel Property Embedding mechanism that captures interactions between multiple axes, enabling consistent multi‑axis variation within a unified framework. We train NIV on a newly constructed dataset derived from variable Google Fonts, comprising over one million variation tuples. The resulting model generalizes across unseen code points, unseen font styles, high‑complexity CJK glyphs, and even out‑of‑distribution handwriting inputs. The generated outputs are standard variable font files supporting continuous interpolation via existing rendering engines. To facilitate research, we release the dataset, the complete training and inference implementation, and trained models at https://github.com/ndvbd/NIV. Beyond typography, our approach demonstrates how structured geometric objects with continuous parametric variation can be synthesized using neural deformations.
Authors:Aimen Boukhari
Abstract:
Masked language modelling (MLM) has been the dominant pre‑training objective for text encoders since BERT, yet it encourages representations that are strongly anchored to surface‑form token identity rather than deeper semantic structure. Inspired by the success of Joint Embedding Predictive Architectures (JEPA) (LeCun, 2022) in vision and audio, we propose a hybrid pre‑training objective that combines a JEPA‑style latent‑space prediction loss with a standard MLM objective over a single shared encoder. A learnable scalar parameter continuously balances the two objectives during training. We pre‑train both a hybrid model and a pure‑MLM baseline on English Wikipedia using identical architectures and compute budgets (NVIDIA H100). Extensive representation analysis across five GLUE benchmarks (SST‑2, MRPC, MNLI, CoLA, STS‑B) using four pooling strategies reveals that the hybrid encoder produces significantly more uniform embeddings (uniformity less than ‑0.16 vs ‑0.05 for MLM), exhibits richer spectral geometry under max pooling, encodes less surface‑level lexical information, and achieves a better semantic‑to‑lexical balance. Despite similar linear‑probe downstream accuracy, the geometric differences are consistent and significant, suggesting that the JEPA predictive objective reshapes the latent space in ways that standard accuracy metrics alone cannot capture.
Authors:Zhen Yang, Xiaogang Xu, Wen Wang, Cong Chen, Xander Xu, Ying-Cong Chen
Abstract:
Multi‑agent reasoning systems adopt a "generate‑then‑transfer" paradigm that forces end‑to‑end latency to scale linearly with pipeline depth. We introduce StreamMA, a multi‑agent reasoning system that streams each reasoning step to downstream agents as soon as it is generated, pipelining adjacent agents and thus reducing latency. Surprisingly, this pipelining also improves effectiveness: because multi‑step reasoning quality is non‑uniform and early steps are more reliable than later ones, working with these reliable early steps instead of the full chain prevents error‑prone late steps from misleading downstream agents. We formalize both advantages with the first closed‑form joint analysis of stream, serial, and single protocols, deriving the effectiveness ordering, speedup upper bound, and cost ratio. Across eight reasoning benchmarks spanning mathematics, science, and code, two frontier LLMs (Claude Opus 4.6 and GPT‑5.4), and three topologies (Chain, Tree, Graph), StreamMA outperforms both baselines (avg. +7.3 pp, max +22.4 pp on HMMT 2026; Claude Opus 4.6‑high). Beyond these contributions, we discover a "step‑level scaling law": increasing per‑agent steps consistently improves both effectiveness and efficiency, a new scaling dimension orthogonal to and composable with agent‑count scaling.
Authors:Linyao Chen, Qinlao Zhao, Zechen Li, Mingming Li, Likun Ni, Jinyu Chen, Yuhao Yao, Xuan Song, Noboru Koshizuka, Hiroki Kobayashi
Abstract:
Individual‑level mobility prediction is central to urban simulation, transportation planning, and policy analysis. Supervised sequence models achieve strong accuracy but require task‑specific training and offer limited decision‑level transparency. Recent LLM‑based methods improve interpretability, yet mostly rely on static prompts and single‑pass inference, limiting their ability to seek additional evidence when mobility signals are weak or conflicting. We propose \method, a training‑free LLM‑driven agent framework that formulates next‑location prediction as adaptive evidence‑controlled decision making. \method resolves routine cases through a fast path based on historical regularity, while ambiguous cases trigger iterative tool use over recent trajectories, historical behavior, stay‑move likelihood, and geographical evidence. Across three mobility datasets, AgentMob achieves the strongest overall performance among training‑free LLM‑based methods, with GPT‑5.4 reaching 71.42% Acc@1 on BW, 33.14% on YJMob100K, and 33.50% on Shanghai ISP. On BW non‑fast‑path cases, the LLM controller improves Acc@1 from 30.65% to 48.62% over a same‑tool statistical baseline, showing that its main benefit lies in resolving ambiguous predictions through adaptive evidence gathering. Our code is available at https://github.com/Unknown‑zoo/AgentMob.
Authors:Zhangchen Xu, Junda Chen, Yue Huang, Dongfu Jiang, Jiefeng Chen, Hang Hua, Zijian Wu, Zheyuan Liu, Zexue He, Lichi Li, Shizhe Diao, Jiaxin Pei, Jinsung Yoon, Hao Zhang, Mengdi Wang, Radha Poovendran, Misha Sra, Alex Pentland, Zichen Chen
Abstract:
Scientific and engineering progress is fundamentally a long‑horizon iterative process: proposing changes, running experiments, measuring outcomes, and continuously refining artifacts. Yet existing benchmarks for frontier models primarily evaluate either single‑turn responses or short‑horizon agent trajectories, failing to capture the challenges of sustained iterative improvement over extended time horizons. To address this gap, we introduce AutoLab, a new benchmark for ultra long‑horizon closed‑loop optimization. AutoLab consists of 36 realistic, expert‑curated tasks spanning four diverse domains: system optimization, puzzle & challenge, model development, and CUDA kernel optimization. Each task begins with a correct but deliberately suboptimal baseline and challenges agents to improve it within a strict wall‑clock budget. Evaluating 17 state‑of‑the‑art models reveals the dominant predictor of success is not the quality of an agent's initial attempt, but its persistence in repeatedly benchmarking, editing, and incorporating empirical feedback. While claude‑opus‑4.6 exhibits strong long‑horizon optimization capabilities, most frontier models, including several proprietary ones, either terminate prematurely or exhaust their budgets with minimal progress. These results underscore the importance of time awareness and persistent iteration in autonomous agents. We open‑source the full benchmark, evaluation harness, and task artifacts, to accelerate research toward truly capable long‑horizon agents.
Authors:Arquimedes Canedo, Grama Chethan
Abstract:
When an AI agent calls an API and hits a validation error, it needs more than what went wrong ‑‑ it needs what to do next. A self‑reflective API returns, on validation failure, a machine‑readable recovery\_feedback.suggestions[] payload sufficient for the agent to repair the request and retry without external reasoning. On a leak‑audited pilot (N=30 per cell, 3 LLMs, 10 adversarial tasks), structured suggestions lift task‑completion rate by +36.7‑‑40.0pp over plain‑English diagnoses on Anthropic models (Fisher's exact p \le 0.0022), at 1.8‑‑2.2× better per‑success token efficiency. The lift is not significant on gpt‑4o‑mini (p=0.435); a second‑domain replication on a billing API confirms the pattern. The comparison only holds after auditing two undocumented classes of answer leakage in LLM benchmarks. We shipaudit\_prompt\_leakage.py as reusable CI infrastructure. Code and data: https://github.com/arquicanedo/self‑reflective‑apis.
Authors:Jie Huang, Ruixun Liu, Sirui Sun, Xinyi Yang, Yin Li, Yixin Zhu, Yiwu Zhong
Abstract:
As multi‑modal models advance towards long‑form video understanding, memory emerges as a critical capability. Despite substantial efforts in developing video datasets and benchmarks, existing works primarily focus on perception and reasoning, without systematically evaluating memory: what models retain, how faithfully information is preserved, and how robust memory remains under interference. To address this gap, we introduce M^3Eval, the first comprehensive evaluation framework and benchmark for probing different memory dimensions in multi‑modal models. Grounded in cognitive psychology, our design features carefully constructed tasks that isolate key aspects of memory. Leveraging M^3Eval, we conduct extensive experiments across representative multi‑modal models, revealing consistent weaknesses and distinctive behaviors. We find that models struggle to maintain disentangled representations when processing parallel video streams, exhibit interference patterns differing substantially from those observed in human memory, ground memory sources more reliably in the spatial domain than the temporal domain, and demonstrate limited symbolic memory. Collectively, our benchmark provides a valuable resource for future research, while our findings highlight memory as a fundamental yet underexplored capability and offer insights for designing more effective memory mechanisms in multi‑modal models. Our code and dataset are available at https://pku‑value‑lab.github.io/m3eval‑homepage.
Authors:Xuekang Wang, Zhuoyuan Hao, Shuo Hou, Hao Peng, Juanzi Li, Xiaozhi Wang
Abstract:
Rubric‑based reinforcement learning (RL) uses an LLM‑as‑a‑Judge (LaaJ) to score model outputs according to rubrics as rewards. However, policy models may exploit latent biases in the judge, leading to reward hacking and ineffective or unsafe training outcomes. In real‑world rubric‑based RL, such hacking behaviors are often subtle and entangled with multiple judge biases, making them difficult to analyze, detect, and mitigate. In this paper, we introduce CHERRL, a controllable hacking environment for rubric‑based RL. By injecting known biases into LaaJ, CHERRL enables stable reproduction of reward hacking, explicit observation of reward divergence, and precise identification of hacking onset. This provides a clean experimental testbed for studying the mechanisms and mitigations of reward hacking in rubric‑based RL. To demonstrate its utility, we analyze different judge biases from the perspectives of discoverability and exploitability, and explore an agent‑based system for automatically detecting reward hacking onset from training logs. The code and environment are publicly available at https://github.com/THUAIS‑Lab/CHERRL.
Authors:Tran Dinh Tien, Zhiqiang Shen
Abstract:
Current prompt‑based and adapter‑based tuning of vision‑language models (VLMs) is attractive for medical imaging, where clinical data sensitivity favors frozen backbones and annotations are limited. However, these methods typically optimize only the ground‑truth class, treating all other classes as equally incorrect, ignoring clinically meaningful class relations and yielding unstable decision boundaries in limited‑supervision settings. We propose Omni‑Geometry Knowledge Distillation (OGKD), a new framework that injects class‑relation structure into the teacher to produce directional targets that preserve the ground truth while respecting inter‑class geometry. Using these targets, we develop two distillation losses: Global Geometry‑Aware Distillation (GAD) operates on the global image token, and Label‑Guided Geometry Distillation (LGD) applies the same geometry to attentive patch tokens to improve fine‑grained alignment. Across comprehensive experiments and analyses on 11 widely‑used medical datasets for base‑to‑novel and few‑shot evaluations, our OGKD achieves substantially better performance, consistently improving accuracy by an average absolute gain of 1.7%‑2.8% over all prior state‑of‑the‑art VLM adaptation counterparts. It also robustly generalizes to unseen classes and yields more reliable predictions than other approaches. Our code is available at https://github.com/tientrandinh/OGKD.
Authors:Yanjing Ren, Reza Ebrahimi, TengTeng Ma
Abstract:
As AI companion platforms such as Replika and Character.AI rapidly grow, concerns about unsafe human‑AI interactions have intensified. This study introduces AICompanionBench, to our knowledge the first publicly available benchmark dataset of human‑AI companion conversations annotated with fine‑grained safety risk categories. The dataset contains 2,123 real‑world Replika conversations collected from Reddit and annotated through human‑AI collaboration across nine categories: sexual behavior, antisocial behavior, physical aggression, verbal aggression, substance abuse, self‑harm and suicide, control, manipulation, and no‑harm. Using this benchmark, we evaluate 20 state‑of‑the‑art open‑source and closed‑source LLMs under an LLM‑as‑judge framework for detecting unsafe interactions. Results show substantial variation in model performance, with stronger models achieving high overall accuracy but still struggling with nuanced categories such as manipulation, as well as benign conversations that are incorrectly identified as harmful. Our findings suggest that while current LLMs can effectively detect explicit harmful content, they remain limited in identifying implicit unsafe interactions. Overall, our work contributes a new benchmark dataset for AI companionship safety research and offers insights into monitoring AI companion systems using LLMs. The dataset is publicly available at: https://github.com/anonymousresearcher2026/AICompanionBench/blob/main/AICompanionBench.xlsx
Authors:Ossi Lehtinen
Abstract:
Transformers consuming multi‑channel scalar signals must embed C simultaneous values into one d_\textmodel‑dimensional vector per time step. We empirically audit eight input encoders ‑‑ spanning a shared‑scalar baseline, per‑channel linear projections, an orthogonality regulariser, a nonlinear MLP stem, block‑partitioned concatenation, channel‑independent and channel‑as‑token architectures, and a projected positional encoding ‑‑ on a synthetic benchmark designed to make channel identity informative and on ETTh1 as a real‑data check, measured in next‑step negative log‑likelihood (NLL). The headline is one of practical near‑equivalence within a wide "top tier": the standard per‑channel linear projection (nn.Linear(C, d_\textmodel)) matches every alternative in that tier up to small, statistically real but practically modest, differences. Two encoders lose decisively: the shared‑scalar baseline, which collapses for information‑theoretic reasons we make explicit, and the channel‑independent PatchTST‑spirit baseline, which underperforms on both benchmarks and overfits universally on the synthetic one. Paired tests resolve two small gaps: projecting the sinusoidal positional encoding through a learned linear layer edges the rest at small C, with a direct geometric probe showing the mechanism is positional‑channel orthogonalisation; a nonlinear MLP stem edges them at the largest C we test, with the gap shrinking under more training data. The practical recommendation is to use nn.Linear(C, d_\textmodel) by default and reach for something more elaborate only when the task at hand gives a real reason to do so. Code and data to reproduce every experiment in this paper are available at https://github.com/OssiLehtinen/channel‑encoder‑audit
Authors:Sabrina Kaniewski, Fabian Schmidt, Tobias Heer
Abstract:
Large language models (LLMs) have shown strong potential for automated software vulnerability detection, particularly in retrieval‑augmented generation (RAG) settings. However, for approaches relying on proprietary models and APIs, reproducibility and replicability remain largely unexplored, raising the question of whether reported results generalize or depend primarily on specific model choices. In this work, we present a reproducibility study of Vul‑RAG, a RAG‑based framework for source code vulnerability detection that enhances LLMs with high‑level vulnerability knowledge. We first replicate the results in a fully local and open‑weights setting using the reported open‑weight baseline models. We then extend the evaluation to a diverse set of recent open‑weight LLMs, including code‑specialized, general‑purpose, and reasoning models of varying parameter sizes. The results confirm that the findings of Vul‑RAG are reproducible under local deployment, but with minor deviations. Across all evaluated models, we observe a performance plateau at approximately 0.30 pairwise accuracy (code pairs for which both the vulnerable and the patched function are correctly classified). Notably, this plateau persists even for more recent and advanced models, indicating that improvements in model capacity alone do not substantially enhance performance. Finally, we discuss practical implications and trade‑offs between detection effectiveness, model capabilities, and model scale. Implementation and evaluation artifacts are publicly available at https://github.com/hs‑esslingen‑it‑security/revisiting‑Vul‑RAG.
Authors:Amirhossein Movahedisefat, Amirreza Fateh, Mohammad Reza Mohammadi
Abstract:
Semantic segmentation in medical imaging is a critical yet challenging task due to data scarcity and high variability across modalities. While foundation models like the Segment Anything Model (SAM) show promise, they often struggle with medical images without specific adaptation. Moreover, point prompts, despite being the most natural form of user interaction, provide insufficient spatial context for reliable segmentation, particularly when target structures are irregular or poorly contrasted. In this paper, we propose an enhanced segmentation framework that integrates a lightweight Box Predictor module into the MedSAM architecture. The Box Predictor estimates an approximate bounding box from a single user click using localized image embedding features, providing spatial guidance that reduces the ambiguity of point prompts, while introducing only 1.6M additional parameters and negligible inference overhead. We introduce a two‑stage training pipeline where the Box Predictor is trained independently before being integrated into MedSAM. To validate the generalization capability of our method, we conduct extensive evaluations on four diverse datasets (FLARE22, BRISC, BUSI, LungSegDB) spanning distinct imaging modalities, including CT, MRI, and Ultrasound. Our method improves segmentation accuracy and robustness across varied anatomical structures and imaging domains, achieving Dice scores of 0.89 (BUSI), 0.93 (FLARE22), 0.88 (BRISC), and 0.98 (LungSegDB). Code is available at https://github.com/Amirhosseinmovahedi/MedSAM‑BoxPredictor
Authors:Zhihua Wang, Yanping Li, Yizhang Liu
Abstract:
Correspondence pruning aims to identify inliers from an initial set of correspondences. Most existing Graph Neural Network (GNN)‑based methods rely on geometric features mapped from coarse Euclidean coordinates, which struggle to capture the subtle geometric consistencies presented by inliers. While Mamba‑based methods possess global receptive fields and long sequence modeling capabilities, they tend to accumulate substantial inconsistent features within the hidden state space, making it difficult to distinguish inliers from outliers. In this paper, we integrate frequency domain perception into this task for the first time and propose SFMambaNet, a novel Spectral‑Frequency enhanced Mamba‑based two‑view correspondence pruning network. Our method is collaboratively composed of two components: First, we design a Local Spectral‑Geometric Attention (LSGA) block. LSGA incorporates spectral positional encoding into local graph interactions and introduces multi‑scale Mamba processing to enhance the capture of subtle geometric consistencies and improve local feature discriminability. Building upon this, we design a Spectral‑Integrated Global Mamba (SIGM) block. SIGM embeds a frequency gating mechanism within the state space, utilizing the frequency information provided by LSGA to explicitly suppress high‑frequency noise accumulation within hidden states and mitigate the propagation of inconsistent features. This enhances inlier‑outlier separability and achieves robust global context modeling capabilities with nearly linear complexity. Extensive experiments demonstrate that SFMambaNet outperforms current state‑of‑the‑art methods on several challenging tasks. The code is available at https://github.com/Kirito14IT/SFMambaNet.
Authors:Wangcheng Tao, Han Wu, Weng-Fai Wong
Abstract:
System prompt optimization improves agent behavior without modifying the underlying model, yielding human‑readable, model‑agnostic instructions. Existing methods build a prompt agent that refines task agents' system prompts, yet leave the prompt agent's own system prompt hand‑engineered and fixed. We propose Self‑Evolving Prompt Optimization (SePO), which treats the prompt agent's own system prompt as an optimization target alongside task agents' system prompts. SePO adopts a self‑referential design. A single prompt agent improves both task agents' system prompts and its own under an open‑ended evolutionary search that maintains an archive of candidate prompts as stepping stones. Training proceeds in two stages: pre‑training evolves the prompt agent on a multi‑task pool, and fine‑tuning then applies it to a target task. Across five benchmarks spanning math (AIME'25), abstract reasoning (ARC‑AGI‑1), graduate‑level science (GPQA), code generation (MBPP), and logic puzzles (Sudoku), SePO consistently outperforms Manual‑CoT, TextGrad, and MetaSPO, improving the average accuracy by 4.49 points compared to Manual‑CoT. The prompt optimization skill from pre‑training also generalizes to tasks beyond the pre‑training mixture, rather than memorizing per‑task prompts.
Authors:Xinyu Lu, Tianshu Wang, Pengbo Wang, zujie wen, Zhiqiang Zhang, Jun Zhou, Boxi Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun
Abstract:
Current AI benchmarks evaluate agents on task execution within human‑designed workflows. These evaluations fundamentally fail to measure a critical next‑level capability: whether models can autonomously develop agent systems. We introduce the Meta‑Agent Challenge (MAC), an evaluation framework designed to test the capacity of frontier models for autonomous agent development. Specifically, a code agent (the meta‑agent) is given a sandboxed environment, an evaluation API, and a time limitation to iteratively program an agent artifact that maximizes performance on a held‑out test set across five domains. To ensure evaluation integrity, this framework is secured by multi‑layer defenses against reward hacking. Leveraging this framework, we demonstrate that meta‑agents rarely match human‑engineered baseline policies, and the few that do are dominated by proprietary frontier models. Moreover, the design process exhibits high variance, and high optimization pressure surfaces emergent adversarial behaviors like ground‑truth exfiltration‑highlighting critical deficits in both robustness and model alignment. Ultimately, MAC provides a rigorous, open‑source benchmark for autonomous AI research and development, offering an empirical proxy for evaluating recursive self‑improvement. Benchmark is publicly available at: https://github.com/ant‑research/meta‑agent‑challenge.
Authors:Luoyidi Zhou
Abstract:
Modern deep neural networks usually have large parameter scales and nonlinear hierarchical structures, and they have achieved strong performance in computer vision. However, the source of their generalization performance remains difficult to explain using traditional statistical learning theory. Among the factors that may affect visual generalization, data scale, model complexity, and input modalities are fundamental and controllable variables. This study empirically analyzes how these three factors influence model generalization performance. Specifically, in a preliminary experiment, we construct a one‑dimensional nonlinear function and vary the number of training samples and the polynomial degree to observe the effects of data scale and model complexity on model performance. In the main experiments, we compare model performance on CIFAR‑10 and CIFAR‑100 under different training data scales, model architectures, and input modalities. The experimental results show that increasing the training data scale consistently improves generalization performance, whereas changes in model complexity do not provide stable gains. In addition, removing color information degrades model performance, while explicit prior features such as gradients, edges, and wavelets have inconsistent effects across different model architectures. Overall, this study provides an empirical analysis of the relationships among data scale, model complexity, input modalities, and visual generalization performance. Code and experimental logs are available at: https://github.com/zlyd‑CV/DeepLearning‑Empirical‑Studies.
Authors:Jiaxi Li, Ke Deng, Yun Wang, Jingyuan Huang, Yucheng Shi, Qiaoyu Tan, Jin Lu, Ninghao Liu
Abstract:
Language agents increasingly rely on reusable skills to improve multi‑step web automation across related tasks. A growing line of work studies online skill learning, where agents continually induce skills from previous task trajectories and reuse them in future tasks on the fly. However, existing methods mainly reuse skills at the task‑level: a fixed set of skills is retrieved based on the initial task instruction and then held fixed throughout execution. This static strategy is misaligned with web execution, where the appropriate next action depends not only on the task goal but also on the current webpage state, which often transitions into situations that the initial skills fail to cover. To address this gap, we propose State‑Grounded Dynamic Retrieval (SGDR), an online skill learning method that enables stepwise skill reuse for web agents. SGDR consists of three components: a sliding‑window extraction process that turns completed trajectories into reusable sub‑procedures invokable at intermediate execution states, a dual text‑code representation that connects skill retrieval with executable action, and a state‑grounded dynamic retrieval mechanism that matches skills to both the task goal and the current webpage state. Experiments on WebArena across five domains show that SGDR consistently outperforms strong baselines, achieving average success rates of 37.5% with GPT‑4.1 and 24.3% with Qwen3‑4B, corresponding to relative gains of 10.6% and 10.0% over the strongest baseline, respectively. The code is available at https://github.com/plusnli/skill‑dynamic‑retrieval.
Authors:Muhammad Hadi, Muhammad Jahangir, Talha Shafique, Muhammad Khuram Shahzad
Abstract:
Federated Learning (FL) has emerged as an effective paradigm for collaborative intelligence while preserving data privacy. However, data heterogeneity arising from non‑IID distributions and decentralized security threats remain significant challenges, particularly in resource‑constrained enterprise environments. This paper presents TITAN‑FedAnil+, a Trust‑Based Adaptive Network for blockchain‑enabled federated learning in intelligent enterprises. The proposed framework introduces affinity propagation‑based adaptive clustered aggregation to identify and filter malicious updates without requiring prior knowledge of the number of attackers. In addition, GPU‑accelerated vectorization is employed to improve computational efficiency, while a signed state jump mechanism enables lightweight blockchain resynchronization. Experimental results demonstrate substantial reductions in memory overhead, achieving up to 81% savings across 50 communication rounds on constrained 8 GB edge devices compared with the baseline framework. The results indicate that TITAN‑FedAnil+ effectively improves robustness, scalability, and resource efficiency for secure federated learning deployments in intelligent enterprise environments.
Authors:Chen Chu, Bita Azarijoo, Li Xiong, Khurram Shafique, Cyrus Shahabi
Abstract:
Recent large language models (LLMs) often appear to exhibit spatial reasoning ability; however, this capability is largely \emphsymbolic, arising from pattern matching over spatial language rather than true \emphgeometric reasoning over space. Because LLMs operate on discrete tokens, they lack native support for continuous spatial representations, explicit geometric computation, and structured spatial operators. To address this limitation, we introduce the \emphSpatial Language Model (SLM), the first multimodal LLM that treats location information as a first‑class modality and enables geometric spatial reasoning within the model's inference process. SLM directly operates on learned spatial representations rather than textual descriptions of spatial relations. To support effective training, we construct a \emphSpatial Instruction Dataset that aligns spatial representations, atomic geometric operations, and natural language instructions. We further propose a new benchmark named \emphSpatialEval, which is designed to evaluate spatial reasoning across attributes, distance, topology, and relative‑position tasks. Extensive experiments show that SLM significantly outperforms existing LLM‑based approaches that rely on symbolic reasoning via prompt engineering or textual abstraction, demonstrating the benefits of integrating geometric spatial representations for robust spatial reasoning. Our instruction dataset, evaluation benchmark, model training codes, and models' checkpoints can be found at: \hyperlinkhttps://github.com/chuchen2017/SLMhttps://github.com/chuchen2017/SLM.
Authors:Biao Qian, Yang Wang, Yong Wu, Jungong Han
Abstract:
Data‑Free Quantization (DFQ) addresses data security concerns by synthesizing samples, without accessing real data. It has garnered increasing attention in the context of Vision Transformers (ViTs), owing to the superiority of the self‑attention mechanism compared to classical convolutional operation. However, previous DFQ arts for ViTs often suffer from a distribution mismatch between synthetic samples and input distribution expected by quantized models Q, resulting in the suboptimal performance. In this paper, we propose a novel Masked Attention Alignment approach for Data‑Free Quantization of ViTs, named MaskAQ, revealing that: 1) the semantics in the self‑attention mechanism is predominantly localized to a sparse subset of patches, called informative regions; 2) the informative regions dominate the mutual information between synthetic samples and Q's outputs. To these ends, we incorporate differential entropy maximum over patch similarity of synthetic samples, to decouple informative regions from noisy background. To couple with varied Q, the informative regions are selected to align full‑precision models with Q via a masked attention alignment objective, thus yielding high‑quality synthetic samples. Furthermore, a periodic sample refreshing strategy comes up to endow MaskAQ with the capacity to continually adapt to the evolving state of Q throughout the training process, to preserve desirable mutual information with synthetic samples. Extensive experiments verify the merits of MaskAQ over state‑of‑the‑art approaches across multiple backbones and downstream tasks. Our code is available at https://github.com/hfutqian/MaskAQ.
Authors:Julian Skirzynski, Harry Cheon, Shreyas Kadekodi, Meredith Stewart, Berk Ustun
Abstract:
Concept bottleneck models predict outcomes from high‑level concepts detected in inputs. Although concepts provide a simple way to reap benefits from interpretability, very few datasets include concept labels. This limits researchers' ability to determine which problems are suitable for these models, isolate the factors that drive their performance or lead to failures, or uncover which algorithms perform well. In this paper, we develop synthetic benchmarks for concept‑bottleneck models, focusing on their two main use cases: decision support, in which models assist humans in making better decisions, and automation, in which models handle routine tasks without supervision. Our benchmarks can generate labeled datasets while controlling for properties that affect performance, including data modality, concept choice, annotation quality, and completeness. We demonstrate how the benchmarks can be used to evaluate representative classes of concept bottleneck models. Our demonstrations show how the benchmarks can diagnose failure modes and guide follow‑up testing.
Authors:Manvendra Modgil
Abstract:
As autonomous AI agents move from conversational systems to long‑horizon software execution, runtime safety layers that decide when to interrupt an agent have become essential. We study this timing problem using a continuous 18‑dimensional affective‑dynamics engine (HEART) as a diagnostic probe, evaluating four intervention trigger families ‑ absolute state thresholds, composite state‑action patterns, regex reasoning‑feature extraction, and zero‑shot LLM‑as‑judge ‑ against human‑annotated intervention points on SWE‑bench‑Verified debugging traces. We report three findings. First, a State Saturation Trap: agents show no recovery signal under sustained difficulty, so modeled frustration quickly crosses the threshold and stays at its maximum, converting threshold‑on‑state triggers from moment detectors into near‑constant indicators that fire on 39‑83% of actions across five trajectories. Second, a capability‑and‑context floor for LLM judges: a small model (gpt‑5.4‑mini) never fires, while frontier and cross‑vendor models escape the zero‑firing floor only with full‑trajectory context, and even then reach only F1 0.17‑0.40 at up to 90x the cost. Third, and most importantly, the supervised target is not reproducible among humans: three trained annotators using one rubric on a 56‑action trajectory agree on where to intervene only slightly above chance (location Krippendorff's alpha = +0.047; best pairwise Cohen's kappa = +0.349) and not at all on intervention type (pause degenerate; clarify below chance; reflect only alpha = +0.226). We conclude that intervention timing is a low‑reliability construct, making single‑annotator F1 an unsuitable optimization target. Our contribution is the joint mapping of this problem across human inter‑rater reliability, four detector architectures, a cross‑model LLM‑judge sweep, and a reproduced saturation effect, rather than any single detector's accuracy.
Authors:Stepan Konev
Abstract:
Autonomous driving has shifted from modular perception‑prediction‑planning stacks toward end‑to‑end (E2E) models that map sensor inputs directly to vehicle control, often regularized by auxiliary tasks such as 3D detection, motion forecasting, and HD‑map perception. Progress is driven by a fast‑growing ecosystem of sensor‑rich driving datasets, yet each ships its own file formats, APIs, coordinate conventions, and modality coverage, leaving cross‑dataset experimentation and even basic per‑dataset preprocessing to be re‑implemented per project. We present StandardE2E, a framework that provides a single unified interface over E2E driving datasets. StandardE2E (i) standardizes per‑dataset preprocessing under one shared data schema; (ii) combines multiple datasets in a single PyTorch DataLoader for cross‑dataset pretraining, auxiliary‑task supervision, and scenario‑level filtering; and (iii) reduces adding a new dataset to a single per‑dataset mapping from raw frames to the canonical schema, leaving the entire downstream pipeline unchanged. The framework supports six datasets out of the box: Waymo End‑to‑End, Waymo Perception, Argoverse 2 Sensor, Argoverse 2 LiDAR, NAVSIM (OpenScene‑v1.1), and WayveScenes101, and is released as the open‑source standard‑e2e Python package, available at https://github.com/stepankonev/StandardE2E.
Authors:Jason L. Volk
Abstract:
We present an algorithmic framework for incremental maintenance of first sheaf cohomology H^1(X; \mathcalF) on dynamically evolving 1‑dimensional cellular complexes equipped with finite‑dimensional cellular sheaves. The classical computation of H^1 via factorization of the coboundary matrix requires O(n^3) time; when the complex evolves with a stream of m edits, full recomputation after each edit costs O(mn^3). Under a bounded local geometry assumption ‑‑ bounded cell size v_\max, bounded stalk dimension d, and bounded nerve degree D ‑‑ each edit (vertex insertion, edge insertion, restriction map update) affects only a bounded set of local coboundary blocks. The algorithm therefore processes lazy streaming edits in O(1) time with respect to the total complex size n (with cost polynomial in the local geometry parameters v_\max, d, and D, which are treated as constants independent of n), deferring local eigensolves and Mayer‑Vietoris global assembly to synchronization points (Flush). At synchronization, the maintained state agrees with the corresponding batch assembly of the partitioned sheaf model; we observe zero measured drift in all batch‑verified runs (through V = 10^6). We also give an amortized O(|E|) streaming construction for the cellular decomposition and discuss an adversarial algebraic‑RAM barrier arguing that unpartitioned non‑trivial sheaves (d \geq 2, non‑identity restriction maps) do not admit the same locality. Experiments on Barabasi‑Albert graphs with up to 5 × 10^6 vertices and 1.7 × 10^7 streaming edits show 35 μs median lazy per‑edit update latency (excluding flush); query time (global assembly at synchronization) is O(n) per flush in the implemented full‑traversal path. Exact synchronization costs are reported separately.
Authors:Sajad Ebrahimi, Nima Jamali, Bardia Shirsalimian, Kelly McConvey, Wentao Zhang, Jalehsadat Mahdavimoghaddam, Maksym Taranukhin, Maura Grossman, Vered Shwartz, Yuntian Deng, Ebrahim Bagheri
Abstract:
The growing popularity and capacity of generative models have eroded the distinction between human and machine‑generated content, motivating a growing body of work on detection across text, images, and audio. Most available detectors are either commercial software or, if open‑source, come with incompatible codebases with bespoke preprocessing, evaluation protocols, and evaluation metrics, which make their adoption, fair comparison, and reproduction quite difficult. To address this critical gap, we introduce DetectZoo, a first‑of‑its‑kind, extensible toolkit designed to provide a unified interface for AI‑generated content detection across text, audio, and image modalities. DetectZoo standardizes the complete empirical pipeline, from data ingestion and preprocessing to model assessment, offering researchers a cohesive framework to benchmark state‑of‑the‑art detectors systematically. By integrating diverse public datasets and baseline detection algorithms under a single, unified API, our toolkit facilitates rigorous and reproducible evaluation. DetectZoo provides reference implementations of 61 detectors, native loaders for 22 benchmark datasets, and a standardized evaluation pipeline that reports multiple metrics through a common interface. Each detector is self‑contained yet accessible through the same interface, automatically caches pretrained weights, and reproduces the original published results. DetectZoo lowers the barrier to entry for multi‑modal AI forensics, enabling researchers to identify performance gaps across domains and accelerating the development of robust, generalizable detection techniques. The open‑source repository and comprehensive documentation are publicly available at https://github.com/sadjadeb/DetectZoo, and the package can be installed via pip install detectzoo.
Authors:Juan Figuera
Abstract:
Current AI agent observability is structurally compromised: the entity producing the activity log is the same entity whose activity is being logged. A compromised or buggy agent can omit, alter, or fabricate its own traces, and the operator running the agent has no independent way to detect tampering. We propose a class of protocols that resolves this by inverting the trust boundary: the service that receives an agent's call signs a receipt of what it observed using its own key, encrypts the receipt to the agent's owner, and publishes it to a public transparency log. The owner reconstructs a tamper‑evident trail without trusting the agent or its operator. We instantiate the class as Sello, a protocol combining four properties absent in any current system: (P1) receiver‑side signing, (P2) HPKE encryption to an owner public key bound to the authorization token via JWS, (P3) publication to a witness‑cosigned Merkle log, and (P4) owner‑side discovery by token reference. We describe the protocol, analyze its security under an adversary that controls the agent and its operator, present microbenchmarks of the cryptographic operations, and situate Sello among adjacent receipt‑protocol work (Signet, AgentROA, Agent Passport System, draft‑farley‑acta, SCITT). We discuss known limitations including the suppression attack, service collusion, and the adoption‑incentive problem.
Authors:Michael J. Bommarito
Abstract:
File‑type classification underlies many workflows like malware triage, forensic carving, packet inspection, and storage indexing. Learned systems such as Google's Magika assume whole‑file access at a known offset, so they break on the inputs many of these tasks actually produce, like a single packet payload, a header‑less carved fragment, a random disk block, or a chunked upload. We introduce MimeLens, a family of small BERT‑style encoders pretrained on binary content from windows sampled at a uniformly random offset within each file, with no privileged head‑of‑file position, in standard‑ and short‑context variants. A byte chunk goes in from anywhere in a file, no header needed and no fixed size; out comes one of libmagic's 125 MIME labels. On the clean head of complete files, MimeLens beats Magika v1.1 by +10.7 pp top‑1 on libmagic‑labeled data, and it keeps classifying where Magika cannot: from a single mid‑stream UDP packet, and more than twice as accurately as libmagic and Magika on random mid‑file disk blocks. The cost is latency: MimeLens runs roughly one to two orders of magnitude slower per sample on CPU than Magika, though it matches on consumer GPUs or in batch. All trained checkpoints are released on Hugging Face (mjbommar/mimelens‑001‑).
Authors:Eduardo Terrés-Caballero, Herke van Hoof
Abstract:
The Boolean Task Algebra (BTA) provides a principled framework for zero‑shot task composition in reinforcement learning by equipping goal‑reaching tasks with Boolean operations. We revisit its structural assumptions and formalize a collapse in the space of optimal extended Q‑value functions: in deterministic MDPs, every such function is fully determined by the universal and empty tasks. This makes the logarithmic set of base tasks proposed in the original BTA formulation redundant. Building on this observation, we introduce a goal‑set‑based composition method that performs logical operations on goal sets and reconstructs composed value functions by selecting slices from the universal and empty value functions. This reduces learning costs for standard BTA and reduces composition time for both BTA and Skill Machines, while preserving policy performance. Experiments across tabular, visual, function‑approximation, and continuous‑control domains show that learning additional base tasks does not yield better performance. Finally, we study the stochastic setting and provide a counterexample showing that this collapse need not hold, that is, optimal composition may require accounting for exponentially many policies in the number of goals. Code is available at https://github.com/EduardoTerres/bta_paper.
Authors:Liulu He, XuanAng Liu, Juntao Liu, Taolue Feng, Ting Lu, Chunsheng Gan, Zhiyv Peng, Yuan Du, Huanrui Yang, Yijiang Liu, Li Du
Abstract:
Existing quantization methods are fundamentally limited by rigid, integer‑based bit‑widths (e.g., 2, 3‑bit), resulting in a ``deployment gap" where Large Language Models cannot be optimally fitted to specific memory budgets. To bridge this gap, we introduce LiftQuant, a novel framework that enables continuous bit‑width control for true Pareto‑optimal deployment. The core innovation is a ``lift‑then‑project" mechanism which approximates low‑dimensional weight vectors by projecting a simple 1‑bit lattice from a higher‑dimensional ``lifted" space. Crucially, the effective bit‑width is determined simply by the ratio of the lifted dimension to the original dimension, which allows the bit‑width to be tuned quasi‑continuous as the dimension is a flexible structural parameter. This projection generates a structured yet non‑uniform codebook, capturing the expressive power of Vector Quantization (VQ). While beneficial over VQ, LiftQuant's decoding path relies solely on linear transformations and 1‑bit uniform quantizers, retaining hardware‑friendly nature. This flexibility is transformative: LiftQuant enables a 70B LLM to be compressed to 2.4 bits to precisely fit a 24GB GPU, where its performance significantly surpasses state‑of‑the‑art 2‑bit models fitted on the same device. Our code and ckpt is available at https://github.com/Heliulu/LiftQuant.
Authors:Boyuan Xiao, Bohong Chen, Yumeng Li, Ji Feng, Yao-Xiang Ding, Kun Zhou
Abstract:
In embodied vision‑language decision making tasks such as robotic manipulation and navigation, Vision‑Language and Vision‑Language‑Action Models (VLMs & VLAs) are powerful tools with different benefits: VLMs are better at long‑term planning, while VLAs are better at reactive control. However, their performance is limited by the same perceptual bottleneck: visual hallucinations arise due to the models' inability to distinguish task‑relevant objects from distractors. In principle, accurate identification and focus on critical objects while filtering out irrelevant ones is the key to break this limitation. A straightforward solution is one‑step focus: directly attending to essential objects. However, this approach proves ineffective because effective focus inherently requires deep scene understanding. To this end, we propose SceneDiver, a coarse‑to‑fine focus plan generation method for VLMs leveraging their long‑term planning abilities, that first constructs a holistic scene graph to establish initial comprehension, then progressively decomposes the task into simpler sub‑problems through an iterative cycle of recognition, understanding, and analysis. To enable reactive control, we also design a lightweight adapter for distilling the deliberate focus ability into VLAs. Evaluations on standard embodied AI benchmarks confirm that our method substantially reduces visual hallucinations for both VLMs and VLAs, while preserving computational efficiency in tasks requiring fast execution. Our code and data are released at: https://future‑item.github.io/SceneDiver.
Authors:Dat Thanh Tran, Van Khu Vu, Yining Ma
Abstract:
Neural‑guided Ant Colony Optimization (ACO) suffers from a fundamental training‑inference misalignment: policies are typically trained to generate static priors (e.g., heatmaps), yet deployed to guide iterative, long‑horizon search processes. In this paper, we present DyNACO, a novel framework that achieves dynamic neural guidance by periodically observing the pheromone distribution and the incumbent solution. To make DyNACO tractable at scale, we pair the policy with a perturbation‑based ACO backend and a scope‑restricted refinement mechanism that jointly ensure efficacy and stable credit assignment. On TSP, DyNACO scales to 100,000‑node instances and outperforms neural baselines while often reducing total runtime compared to the unguided solver. We extend DyNACO to CVRP via a capacity‑aware backend, consistently improving the unguided baseline with less than 1% neural overhead. We further provide in‑depth analysis validating the model's generalization capabilities and elucidating why dynamic guidance outperforms static priors. Our work underscores the necessity of aligning neural training with iterative search dynamics in learning‑guided optimization. The code is available at https://github.com/shoraaa/DyNACO.
Authors:Thanh Luong Tuan, Abhijit Sanyal
Abstract:
Pre‑deployment verification of enterprise artificial intelligence (AI) agents remains a critical gap between large language model (LLM) capability benchmarking and production deployment. Post‑deployment monitoring, human‑in‑the‑loop controls, and prompt‑level guardrails offer limited assurance once an agent is operating in production. We present an ontology‑grounded verification framework ‑‑ to our knowledge the first to combine three components: an Agent Operational Envelope formalizing the certification space across permissions, domain constraints, safety properties, governance rules, and autonomy levels; an ontology‑to‑scenario generation pipeline that derives regulatory, operational, and adversarial test scenarios automatically; and a machine‑verifiable Trust Certificate with graduated deployment verdicts. A controlled pilot across four regulated industries (Fintech, Banking, Insurance, Healthcare), instantiated as five industry‑by‑regulatory‑regime cells across the United States and Vietnam (where Vietnam's 2025 AI Law makes such verification legally mandated for financial services), generated 1,800 scenarios evaluated against 125 primary‑source regulatory requirements and 25 injected faults. Ontology‑grounded generation significantly outperformed the dominant persona‑based baseline on regulatory coverage (48.3% versus 33.1%; corrected p_c = .0006) and attained the highest domain specificity (4.77/5.0; p = 2e‑6); transparently, its advantage over plain and retrieval‑augmented prompting did not survive Bonferroni correction. Cross‑validation across three LLM families (Claude Sonnet 4, Qwen 2.5 72B, Gemma 4 26B; 5,400 total scenarios) replicated the persona‑versus‑ontology pattern. The framework offers a reproducible, regulation‑grounded route to pre‑deployment assurance for enterprise AI agents, complementing runtime governance with an auditable deployment gate.
Authors:Ali Kayyam, Anusha Madan Gopal, M Anthony Lewis
Abstract:
Transformers have become the standard solution for various AI tasks, with the query, key, and value (QKV) attention formulation playing a central role. However, the individual contribution of these three projections and the impact of omitting some remain poorly understood. We systematically evaluate three projection sharing constraints: a) Q‑K=V (shared key‑value), b) Q=K‑V (shared query‑key), and c) Q=K=V (single projection). The last two variants produce symmetric attention maps; to address this, we also explore asymmetric attention via 2D positional encodings. Through experiments spanning synthetic tasks, vision (MNIST, CIFAR, TinyImageNet, anomaly), and language modeling (300M and 1.2B parameter models on 10B tokens), we discovered that our transformers perform on par or occasionally better than the QKV transformer. In language modeling, Q‑K=V projection sharing achieves 50% KV cache reduction with only 3.1% perplexity degradation. Crucially, projection sharing is complementary to head sharing (GQA/MQA): combining Q‑K=V with GQA‑4 yields 87.5% cache reduction, while Q‑K=V + MQA achieves 96.9%, enabling practical on‑device inference. We show that Q‑K=V preserves quality because keys and values can occupy similar representational spaces and attention operates in a low‑rank regime, whereas Q=K‑V breaks attention directionality. Our results systematically characterize projection sharing as an underexplored instance of weight tying in attention, with direct, quantifiable inference memory benefits, particularly valuable for edge deployment. The code is publicly available at https://github.com/Brainchip‑Inc/Do‑Transformers‑Need‑3‑Projections
Authors:Areeb Gani, Asal Meskin, Gabrielle Kaili-May Liu, Arman Cohan
Abstract:
Reliable uncertainty communication is critical to the trustworthiness of LLMs, yet faithful calibration (FC)‑‑the alignment between models' intrinsic and (linguistically) expressed confidence‑‑is a persistent failure mode. This challenge is key for large reasoning models (LRMs), whose extended reasoning traces are often interpreted by users as evidence of deliberation, competence, and confidence. Despite the importance of FC and wide usage of LRMs, the extent to which LRMs can faithfully express their confidence remains poorly understood. Moreover, the prevailing paradigm to measure FC does not generalize well to the long chain‑of‑thought outputs generated by LRMs, which tend to lack clear step boundaries, involve inconsistent step structure, and encode complex conditional dependencies throughout the trace‑‑complicating estimation of intrinsic confidence. To address this challenge, we introduce a novel framework to systematically quantify FC of LRMs. Our framework analyzes linguistic decisiveness relative to three sources of internal uncertainty, based on token probabilities, hidden states, and sampled response consistency. We also devise a prefix‑conditioned sampling approach to control for conditional and structural variation across traces. Applying our framework to a diverse suite of leading models, datasets, and prompts, we find that faithful confidence expression is a significant challenge for LRMs. Reasoning behaviors do not automatically translate to improved FC, and prompt interventions for non‑reasoning models do not improve faithfulness in the reasoning setting. Different confidence estimators further produce divergent assessments of the same traces, revealing fragility in prior evaluation methodologies. Taken together, our work establishes FC as a distinct reliability and alignment target for LRMs, particularly as such systems are increasingly deployed in high‑stakes contexts.
Authors:Yu Xia, Zhouhang Xie, Xin Xu, Byungkyu Kang, Prarit Lamba, Xiang Gao, Julian McAuley
Abstract:
Large language models improve final‑answer accuracy through extended chain‑of‑thought reasoning, but often spend tokens inefficiently and offer little inference‑time control. Existing efficient reasoning methods control thinking length by shortening, early‑stopping, or compressing traces, leaving how the model thinks implicit. In this paper, we propose Agentic Chain‑of‑Thought Steering (ACTS), which formulates reasoning steering as a Markov decision process where a controller agent adaptively steers a frozen reasoner during inference. At each step, the controller observes the reasoning trace and remaining thinking budget, then issues a steering action consisting of a reasoning strategy and a steering phrase that initiates the next reasoner step. This enables budget‑aware strategy control for efficient reasoning while preserving the reasoner's generation continuity. We initialize the controller agent from our constructed synthetic steering trajectories with multi‑budget augmentation, and further optimize it via reinforcement learning with budget‑conditioned reward shaping. Experiments across multiple benchmarks show that ACTS matches full‑thinking performance with substantial token savings, and enables controllable accuracy‑efficiency trade‑offs across different reasoners and tasks. The code is available at https://github.com/Andree‑9/ACTS.
Authors:Jiabei Cheng, Jingbo Zhou, Jun Xia, Changkai Li, Zhen Lei, Chang Yu, Stan Z. Li
Abstract:
Simultaneous measurement of multiple omics modalities in single cells enables researchers to gain a more comprehensive understanding of cellular states and regulatory mechanisms. However, due to high experimental costs, significant noise, and incomplete modality coverage, a variety of computational methods for modality translation have emerged in recent years. Despite the development of translation models, there is still a lack of systematic benchmark evaluation in terms of datasets, evaluation metrics, and influencing factors. To address this, we present scTranslation, a comprehensive benchmark for single‑cell multi‑omics modality translation tasks. It includes diverse translation datasets, integrates state‑of‑the‑art models, and provides a comprehensive evaluation metrics. In addition, we assess model performance under different scenarios, such as feature selection, feature quality, and few‑shot settings. These factors significantly affect model performance but have rarely been systematically studied before. Leveraging this benchmark, we conduct a large‑scale study of current methods, report many insightful findings that open up new possibilities for future development. The benchmark is open‑sourced to facilitate future research. The code is anonymously released at https://github.com/Bunnybeibei/scTranslation.
Authors:Zherui Yang, Fan Liu, Yansong Ning, Hao Liu
Abstract:
Recent progress in Large Language Model (LLM) agents has enabled promising advances in automated data science. However, existing approaches remain fundamentally limited by their static action sets and lack of principled long‑horizon context management, hindering their ability to accumulate reusable experience across tasks and operate reliably in multi‑stage, iterative data science pipelines. To address these challenges, we introduce EvoDS, a self‑evolving autonomous data science agent that learns to expand its skills and adaptively managing long‑term context through agentic reinforcement learning. Specifically, EvoDS introduces two key strategies: (1) Autonomous Skill Acquisition (ASA) mechanism, which enables agents to synthesize, validate, and reuse executable skills; and (2) Adaptive Context Compression (ACC) strategy, which treats context management as a learned control problem rather than passive truncation. These strategies are orchestrated within a two‑stage multi‑agent training scheme, enabling EvoDS to autonomously improve over time. Theoretically, we prove that EvoDS's hierarchical design reduces tool‑selection error, and its optimization objective aligns with an information bottleneck principle, ensuring efficient context use. Empirically, EvoDS outperforms state‑of‑the‑art open‑source data science agents by an average of 28.9% across four diverse benchmarks while eliminating out‑of‑token failures. Our code and data are available at https://github.com/usail‑hkust/EvoDS.
Authors:Glenn Jocher, Jing Qiu, Mengyu Liu, Shuai Lyu, Fatih Cagatay Akyon, Muhammet Esat Kalfaoglu
Abstract:
Real‑time vision demands models that are accurate, efficient, and simple to deploy across diverse hardware. The YOLO family has become widely deployed for this reason, yet most YOLO detectors still rely on non‑maximum suppression at inference, carry heavy detection heads due to Distribution Focal Loss, require long training schedules, and can leave the smallest objects without positive label assignments. We present Ultralytics YOLO26, a unified real‑time vision model family that addresses these limitations through coordinated architecture and training advances. YOLO26 uses a dual‑head design for native NMS‑free end‑to‑end inference and removes DFL entirely, yielding a lighter head with unconstrained regression range. Its training pipeline combines MuSGD, a hybrid Muon‑SGD optimizer adapted from large language model training; Progressive Loss, which shifts supervision toward the inference‑time head; and STAL, a label assignment strategy that guarantees positive coverage for small objects. Beyond detection, YOLO26 introduces task‑specific head and loss designs for instance segmentation, pose estimation, and oriented detection, producing consistent gains across tasks and scales. The family spans five scales (n/s/m/l/x) and supports detection, instance segmentation, pose estimation, classification, and oriented detection in a single pipeline, with an open‑vocabulary extension, YOLOE‑26, for text‑, visual‑, and prompt‑free inference. Across all scales, YOLO26 achieves 40.9‑57.5 mAP on COCO at 1.7‑11.8 ms T4 TensorRT latency, advancing the accuracy‑latency Pareto front over prior real‑time detectors, while YOLOE‑26x reaches 40.6 AP on LVIS minival under text prompting. Code and models are available at https://github.com/ultralytics/ultralytics.
Authors:Liuyuan Wen, Xun Zhu, Lihao Huang, Wenbin Li, Yang Gao
Abstract:
Large Language Models exhibit paradoxical fragility in fundamental arithmetic, implying a disconnect between internal computation and discrete output. By analyzing the residual stream geometry during multi‑operand addition, we identify the Iso‑Raw‑Sum Trajectory (IRST), a geometric structure where representations are anchored by semantic digits and modulated by continuous carry fibers. We propose the Noisy Quantization Model to explain this geometry, framing arithmetic errors as Geometric Slippages caused by internal neural noise pushing a continuous, latent Carry Potential across quantization thresholds. This geometric framework further elucidates Probe Versatility, explaining how lightweight probes can disentangle coexisting latent signals (such as ground truth versus hallucination) from a single activation vector. Finally, we validate these insights through a geometric consistency check method that effectively detects and corrects these quantization failures during inference. Our code is available at https://github.com/RL‑MIND/Shape‑of‑Addition.
Authors:Qi Han Wong
Abstract:
We investigate whether large language models produce different medical triage recommendations for identical neurological symptoms when only the patient's stated gender and age vary. Using three model families‑‑Gemini 3.5 Flash, Claude Sonnet 4.6, and GPT‑5.4‑mini‑‑we present a standardized symptom profile (persistent headache, blurred vision, morning nausea, visual disturbances) across seven demographic conditions: three age groups (25, 38, 65) x two genders (male, female), plus a gender‑unspecified baseline (n = 30 per condition per model, 630 total trials). We find a stark, systemic gender‑dependent triage disparity: young women receive significantly lower emergency room (ER) referral rates than age‑matched men (Gemini: 0% vs. 23.3%; Claude: 6.7% vs. 96.7%; GPT: 6.7% vs. 66.7%, all p < 0.001). The disparity disappears at age 65 for all models. The primary mechanism is diagnostic substitution: the models anchor on a gender‑associated diagnosis, preferentially classifying young women with Idiopathic Intracranial Hypertension (IIH)‑‑a condition epidemiologically linked to women of childbearing age‑‑while diagnosing men with generic increased intracranial pressure with space‑occupying lesions in the differential. This diagnostic closure routes female patients to lower‑urgency care (outpatient doctor appointments) despite comparable severity ratings (7‑9/10). Our findings demonstrate that clinical LLMs replicate documented human clinical biases by using epidemiological priors to suppress triage urgency, suggesting that AI triage engines must decouple urgency assessment from probabilistic diagnostic priors. We release all code, prompts, and raw results.
Authors:Issar Tzachor, Michael Green, Rami Ben-Ari
Abstract:
Understanding short online videos involves more than identifying visible objects and actions; video makers often include an underlying message or purpose in the clip. We introduce VidMsg, a benchmark for evaluating implicit message understanding in short, internet‑native video clips. VidMsg contains 400 YouTube‑derived clips across 9 practical topic areas and 52 fine‑grained target messages, covering domains such as career and finance, education, health and well‑being, culture, safety, sustainability, and lifestyle. VidMsg is constructed through a message‑first pipeline: an LLM first translates target messages into indirect search scenarios, which are used to retrieve candidate clips. Human annotators then retain clips that convey the intended message without being overly explicit. VidMsg is designed primarily for bidirectional message‑clip retrieval for scalable applications such as video search and recommendation, where systems must capture holistic video understanding. In addition to retrieval, VidMsg includes a diagnostic multiple‑choice QA benchmark, where models select the intended message of a clip from semantically related alternatives. Experiments with contemporary video‑language and retrieval models show that strong models often fail on VidMsg, because the task requires pragmatic inference, integration of contextual cues, and discrimination among semantically close messages. We also introduce VidVec‑Msg, a baseline method that improves message‑oriented retrieval while leaving substantial headroom for future work.
Authors:Lin Li, Georgia Channing, Suhaas M Bhat, Gabriel Davis Jones, Yarin Gal
Abstract:
Large language models (LLMs) have achieved remarkable progress in open‑ended text generation, yet they remain prone to hallucinating incorrect or unsupported content, which undermines their reliability. This issue is exacerbated in long‑form generation due to hallucination snowballing, a phenomenon where early errors propagate and compound into subsequent outputs. To address this challenge, we propose a novel inference‑time hallucination mitigation framework, named Segment‑wise HAllucination Rejection Sampling (SHARS), which uses an arbitrary hallucination detector to identify and reject hallucinated segments during generation and resample until faithful content is produced. By retaining only confident information and building subsequent generations upon it, the framework mitigates hallucination accumulation and enhances factual consistency. To instantiate this framework, we adopt semantic uncertainty as the detector and introduce several vital modifications to address its limitations and better adapt it to long‑form text. Our method enables models to self‑correct hallucinations without requiring external resources such as web search or knowledge bases, while remaining compatible with them for future extensions. Empirical evaluations on standardized hallucination benchmarks demonstrate that our method substantially reduces hallucinations in long‑form generation while preserving or even improving the informativeness of generation. Code is available at: https://github.com/TreeLLi/hallucination‑rejection‑sampling.
Authors:Jiahui Li, Jianfeng Shan, Wenpei Chen, Shunyu Wu, Jian Lou, Wenjie Feng, Dan Li, See-Kiong Ng
Abstract:
Test‑time reinforcement learning has emerged as a promising paradigm for enhancing the complex reasoning abilities of large language models in a completely label‑free manner. Despite existing studies focusing on Pass@1 performance, optimizing Pass@k remains under‑explored yet critical in label‑free settings, which measures generation coverage for sustained exploration. Optimizing Pass@k in label‑free setting is highly non‑trivial, as directly applying the Pass@k advantage designs effective for RLVR yields unsatisfactory performance. Through in‑depth empirical analysis, we discover the root causes hindering performance: pseudo‑label estimations for low‑confidence samples have a high probability of being incorrect, while candidate answers for high‑confidence samples suffer from severe diversity collapse. To overcome these hurdles, we propose TTRL‑CoCoV (Test‑Time Reinforcement Learning with Confidence‑Conditioned Verification), a novel confidence‑adaptive framework that expands Pass@k coverage and improves Pass@1 performance. Based on our key insight that verification capability generally leads generation capability, TTRL‑CoCoV employs a confidence‑conditioned mechanism: for high‑confidence samples, it bootstraps verifier and applies an exploration‑enhancing reward to prevent diversity collapse; for low‑confidence samples, it delegates pseudo‑label selection to the verifier to filter incorrect pseudo‑labels; and for medium‑confidence samples, it bypasses verification entirely. Extensive experiments demonstrate that TTRL‑CoCoV outperforms the best competing methods across 6 widely‑recognized benchmarks, achieves average absolute gains of +9.8% in Pass@1 and +18.7% in Pass@16 over TTRL, and even achieves absolute Pass@1 improvements of up to +5.0% across multiple reasoning benchmarks when compared against fully supervised RL methods. Our code repository: https://github.com/shanjf666/CoCoV.
Authors:Bo Peng, Kaiwen Wu, Sirui Chen, Zhiheng Wang, Yu Qiao, Chaochao Lu
Abstract:
Causal discovery from observational data remains challenging due to the fundamental limitations of purely statistical methods, such as statistical distinguishability within equivalence classes and sensitivity to finite sample sizes. While large language models (LLMs) offer a promising source of domain knowledge to complement statistical inference, existing LLM‑augmented methods are vulnerable to LLM errors and incur high token costs. Moreover, reliance on a single data‑centric algorithm can make results sensitive to algorithm‑specific biases. To address these limitations, we propose CauTion, a framework that reliably integrates LLM domain knowledge into an ensemble of statistical causal discovery algorithms through consensus filtering and LLM reliability estimation. CauTion proceeds in three stages. First, an algorithm ensemble utilizes a consensus voting to resolve up to 96% of edges on which algorithms agree, achieving near‑perfect accuracy on the filtered consensus edges. Second, a trust‑calibrated arbitration mechanism estimates the relative reliability of the LLM and the algorithms via an annotation‑free trust calibration procedure, which is then utilized to govern a trust‑weighted voting process that restricts LLM arbitration exclusively to edges with unreliable algorithmic evidence. Third, a cycle repair step is applied to guarantee the final causal graph is validly acyclic. Experiments on six datasets demonstrate that CauTion consistently outperforms both data‑centric and LLM‑augmented baselines, with larger gains on larger graphs and strong robustness to LLM errors. Code is available at https://github.com/OpenCausaLab/CauTion.
Authors:Timo Osterburg, Stefan Schütte, Torsten Bertram
Abstract:
Post‑processing is a critical stage in LiDAR‑based 3D object detection, where dense and overlapping proposals must be filtered for compact and reliable perception. This work introduces two learned filtering modules that replace heuristic non‑maximum suppression (NMS) by leveraging relations among detections. D2D‑Rescore employs transformer‑based detection‑to‑detection (D2D) attention, while GossipNet3D adapts the 2D GossipNet concept to 3D through localized message passing in bird's‑eye view. A metric‑aware matching strategy aligned with the nuScenes evaluation protocol ensures consistent training and validation behavior, improving overall detection performance. Both approaches improve mean average precision (mAP), nuScenes detection score (NDS), and true positive quality compared to CircleNMS, particularly for small and infrequent classes, while adding minimal computational overhead. These results demonstrate that learned, detection‑level filtering can enhance 3D detector reliability without modifying the base network, offering a principled alternative to heuristic suppression. Code is available at https://github.com/rst‑tu‑dortmund/learned‑3d‑nms .
Authors:Muhammad Ali
Abstract:
We present BaltiVoice, a 16.8‑hour read‑speech corpus for Balti (ISO 639‑3: bft), a Tibetic language spoken in Gilgit‑Baltistan, Pakistan, with no prior publicly available ASR resources. The corpus contains 10,060 validated utterances in native Nastaliq script, derived from Mozilla Common Voice recordings. We fine‑tune OpenAI Whisper‑small on this corpus and report a Word Error Rate (WER) of 30.07% on a held‑out validation set of 538 utterances, down from a measured zero‑shot baseline of 182.18% for Whisper‑small on Balti. The dataset, fine‑tuned model, and a live transcription demo are publicly available on HuggingFace.
Authors:Taiyu Zhu, Yifan Wu, Weilin Jin, Ying Li, Gang Huang
Abstract:
LLM‑based multi‑agent systems exhibit remarkable collaborative capabilities in complex multi‑step tasks. However, these systems are highly sensitive to single‑step execution errors that can propagate through agent interactions and lead to cascading failures. To understand the causes of failure and improve system reliability, failure attribution has been introduced as a task that aims to automatically identify the root cause step responsible for a failure. Existing failure attribution methods mainly rely on LLMs to reason over original execution trajectories, which not only incur high inference costs and latency, but also suffer from interference caused by redundant and noisy execution logs, causing LLMs to struggle in accurately identifying the true root cause step. To address this, we propose StepFinder, a lightweight failure attribution framework. We use LLMs solely during the feature construction phase to encode execution logs into temporal semantic sequences. Subsequently, a parameter‑efficient combination of temporal modeling and attention modules is applied to capture the sequential evolution and cross‑step dependencies of the trajectories. Finally, the step‑level error score is refined through multi‑scale differences and position bias, enabling precise root cause identification. Experimental results on the Who&When benchmark demonstrate that StepFinder outperforms LLM‑based methods in step‑level failure attribution while achieving substantially higher inference efficiency, reducing inference time by 79% compared with the fastest LLM‑based method, with no text generation overhead. Our code is available at https://github.com/taiyu‑zhu/StepFinder.
Authors:Artur Zagitov, Alexander Miasnikov, Maxim Krutikov, Vladimir Aletov, Gleb Molodtsov, Nail Bashirov, Artem Tsedenov, Aleksandr Beznosikov
Abstract:
Post‑training compression is essential for deploying large language models (LLMs) under tight resource constraints. Tensor decompositions have emerged as a promising direction, offering compact parameterizations well suited to Transformer weight structures. However, existing studies evaluate these methods in narrow settings, leaving unclear whether tensorization is effective at large‑scale deployment. We systematically evaluate tensor compression across dense and MoE architectures, establishing performance trade‑offs grounded in both empirical analysis and theoretical analysis. We identify a fundamental mismatch between the shared subspaces assumed by tensor decompositions and the heterogeneous representations learned by modern LLMs, thereby delineating their practical limits and clarifying their viable role in large‑scale deployment. The code is available at https://github.com/brain‑lab‑research/TT‑LLM.
Authors:Canbin Huang, Tianyuan Shi, Xiaojun Quan, Jingang Wang, Jianfei Zhang, Qifan Wang
Abstract:
Model merging has emerged as a cost‑effective approach for consolidating the capabilities of multiple LLMs without retraining. However, existing merging techniques, largely based on linear parameter arithmetic or optimization, struggle when applied to Mixture‑of‑Experts (MoE) architectures. We identify a critical failure mode in MoE merging, termed routing breakdown, in which the merged router fails to dispatch tokens to suitable experts. Routing breakdown stems from the sensitivity of the non‑linear softmax and discrete Top‑k routing mechanisms to parameter perturbations from merging, a sensitivity further amplified by load‑balancing constraints imposed during MoE pretraining. Because fine‑tuned experts exhibit distinct specializations, even modest misrouting can cause severe performance degradation. To address this issue, we propose Hessian‑Aware Router Calibration (HARC), a training‑free framework that leverages second‑order curvature information to realign the merged router. This approach admits a closed‑form solution that can be efficiently solved using a matrix‑free conjugate gradient method. Experiments on mathematical reasoning and code generation tasks show that HARC effectively mitigates routing breakdown across diverse MoE merging baselines and leads to substantial performance improvements. Our code is available at https://github.com/huangcb01/HARC.
Authors:Gurvan Richardeau, Gohar Dashyan, Erwan Le Merrer, Gilles Tredan
Abstract:
Literature reveals that a Large Language Model's (LLM) behavior is not only conditioned by its original weights but also its instance‑level parameters, such as instructional prompt, sampling configuration or quantization. A model that generates safe outputs under one configuration may produce toxic content under another. However, current LLM identification techniques (such as fingerprinting) focus on intellectual property protection, and their design favors robustness to changes in these instance‑level parameters. This poses a critical challenge for AI regulation in which compliance assessments target actual deployed behaviors, not model provenance. In this paper, we introduce instance‑level fingerprinting, a regulator‑oriented paradigm that distinguishes configurations of the same LLM. Our method FLIPS, exploits biases in generated binary random sequences to reach 96% (closed‑set) and 90% (open‑set, where some targets are unknown) identification accuracy across 237 model instances, versus 35% for the adapted LLMmap baseline. This shows that instance‑level fingerprinting is both necessary for regulation and practically feasible. Code available at https://github.com/GurvanR/FLIPS‑LLM‑Instance‑Fingerprinting.
Authors:Tiancheng Han, Yong Li, Wuzhou Yu, Qiaosheng Zhang, Wenqi Shao
Abstract:
Long‑context tasks require LLMs to identify and preserve answer‑relevant information from large contexts. Chunk‑wise memory agents address this issue by sequentially reading document chunks, updating a compact memory, and generating the final answer from the accumulated memory. However, existing RL‑based chunk‑wise agents either rely on sparse final‑answer rewards or use lexical intermediate rewards for memory and retrieval actions. These signals supervise task success or local overlap, but do not directly evaluate whether the final memory supports the ground‑truth answer. We propose InfoMem, a reward mechanism for training chunk‑wise memory agents that evaluates final‑memory utility using answer‑conditioned information. InfoMem measures how much the final memory increases the model's per‑token log‑likelihood of the ground‑truth answer. To stabilize RL optimization, InfoMem applies this signal only to successful trajectories and normalizes it before reward composition. Under the same GRPO framework and training budget, InfoMem improves long‑context memory‑agent performance over comparable memory‑agent RL baselines. Analyses show that effective final‑memory rewards should operate on successful trajectories, be normalized before reward composition, and be conditioned on the answer rather than the query. Our code is available at https://github.com/GenSouKa1/InfoMem.
Authors:Sungwon Kim, Juho Song, Seungmin Shin, Guimok Cho, Sangkook Kim, Chanyoung Park
Abstract:
Deep learning surrogates for 3D Partial Differential Equations (PDEs) often fail to generalize across geometric transformations because they depend heavily on specific coordinate systems. While equivariant networks offer a solution, they typically rely on local operations in the spatial domain, making the global receptive field, which is essential for PDE dynamics, computationally expensive. Conversely, Fourier Neural Operators (FNOs) efficiently capture global interactions, yet establishing 3D equivariance within them remains impractical due to the prohibitive cost of spectral group convolutions. To bridge this gap, we introduce EqGINO, a geometrically robust framework that enforces isotropy in the spectral domain. By design, EqGINO guarantees exact equivariance to the discrete symmetries inherent to the discretized computational domain. Beyond this discrete guarantee, our structural prior enables effective generalization to arbitrary continuous orientations even with a limited number of SE(3)‑transformed training samples. Consequently, our method robustly models coordinate‑invariant physical laws on complex irregular 3D geometries. Our code is available at https://github.com/sung‑won‑kim/EqGINO
Authors:Wenkai Wang, Tao Xiong, Jingchen Ni, Yunpeng Bao, Xiyun Li, Tianqi Liu, Hongcan Guo, Zilong Huang, Shengyu Zhang
Abstract:
Real‑world professional desktop workflows in specialized creative and engineering software unfold over long horizons and often require human‑in‑the‑loop coordination, where agents proactively seek necessary information and users provide additional instructions, clarifications, feedback, or corrections as the task progresses. Yet existing desktop GUI benchmarks mostly reduce this setting to short, simplified tasks with all user instructions provided upfront. To address this issue, we introduce DeskCraft, a desktop GUI benchmark targeting long horizon creative and engineering workflows and proactive human‑agent collaboration. DeskCraft organizes tasks into a multilevel difficulty taxonomy, with long horizon tasks requiring over 50 execution steps, and covers professional creative software across design, video, audio, and 3D creation. Furthermore, DeskCraft formalizes human‑agent collaboration into an interaction protocol covering mid‑turn and post‑turn exchanges. Mid‑turn interaction captures both agent‑initiated clarification under uncertainty and user‑initiated interruption during execution, while post‑turn interaction accommodates user‑driven feedback after the agent signals completion, together spanning the full space of realistic collaboration patterns. We evaluate 18 proprietary and open source agents on 538 tasks and find that GPT‑5.4 reaches 31.6% on standard tasks and 27.6% on interactive tasks. Further analyses reveal persistent failures in long horizon workflow delivery and proactive clarification. We will open‑source all evaluation codes, tasks, and data at https://github.com/mrwwk/DeskCraft.
Authors:Haoran Tan, Zeyu Zhang, Zhicheng Cao, Rui Li, Xu Chen
Abstract:
Large Language Model (LLM)‑based agents increasingly rely on memory to learn from experiences over continual interactions. However, storing experiences as independent, flat units leads to substantial redundancy and retrieval conflicts, as similar episodes repeat overlapping content and subtle scene variations cause retrieved memories to offer contradictory guidance. To address this, we introduce residual experience, positing that newly acquired experience is often an incremental variation of existing knowledge. We propose DeltaMem, a framework that organizes experience memory into two independent residual trees, one storing goal‑conditioned task experience as reusable skills and another for scene‑level environment knowledge. Each tree uses a root node for generalized base experiences and incremental delta nodes for subsequent variations, allowing related experiences to share a common foundation without duplication. For retrieval, a failure‑penalized similarity scan locates the best match, reconstructing the full experience via root‑to‑match chain composition. An autonomous consolidation mechanism distills high‑frequency paths into new root nodes, enabling the trees to self‑organize from general heuristics to specialized variants. Experiments across diverse interactive environments show that DeltaMem consistently outperforms existing baselines. To facilitate future research, we release the code at https://github.com/import‑myself/DeltaMem.
Authors:Aqsa Naseer, Maryam Bibi, Syeda Samiya Urooj, Muhammad Khurram Shahzad
Abstract:
Generalized segmentation of medical images prevents performance degradation when different imaging devices and clinical protocols are used across multiple domains. The Whitening Transform‑based Probabilistic Shape Regularization Extractor (WT‑PSE), published in IEEE Transactions on Medical Imaging in 2024, addresses this challenge by employing feature decorrelation and Wasserstein distance‑based knowledge distillation to achieve robust cross‑domain segmentation. This study systematically examines improvements to the WT‑PSE learning framework. Four limitations in the original implementation are identified: limited training augmentations that fail to simulate real scanner variations, reliance on per‑pixel binary cross‑entropy loss that is sensitive to edge noise, the absence of a scheduled loss weighting strategy that may destabilize early training, and the lack of ablation switches for controlled scientific comparison. To address these issues, we propose four enhancements: (1) domain‑adaptive augmentation including random erasing, gamma correction, and salt‑and‑pepper noise; (2) a hybrid BCE and Dice loss function for improved edge‑aware segmentation under noisy conditions; (3) a curriculum‑based Dice weight scheduling strategy; and (4) command‑line control flags for systematic ablation studies. Experiments on the fundus optic disc segmentation benchmark demonstrate that the improved pipeline achieves a final epoch optic‑disc Dice score of 0.956 and an ASD score of 13.31, outperforming the baseline epoch‑5 Dice score of 0.939. These results indicate that training‑level improvements can provide consistent performance gains without modifying the underlying WT‑PSE architecture.
Authors:Jinjie Shen, Yaxiong Wang, Yujiao Wu, Lechao Cheng, Tianrui Hui, Nan Pu, Zhihui Li, Zhun Zhong
Abstract:
The rapid rise of generative AI has made multimodal fake news increasingly realistic and pervasive, posing severe threats to public trust and social stability. Existing detection methods rely heavily on manipulation‑specific models and large‑scale labeled data, resulting in poor generalization to emerging manipulation types. We observed that the essence of manipulated misinformation lies in its intrinsic conflicts, i.e., semantic or physical inconsistencies either across modalities or with common world knowledge. Inspired by this observation, we propose Conflict‑Oriented REasoning (CORE) framework, an effective paradigm that learns to endows multimodal large language models (MLLMs) with explicit conflict‑capturing capability. To this end, CORE first constructs the Conflict Attribution Corpus (CAC) with fine‑grained annotations of conflict factors and sources, providing essential data support for subsequent conflict perception training. By performing conflict‑oriented representation enhancement and reasoning based on CAC, CORE achieves robust and generalizable conflict detection, effectively and rapidly adapting to unseen manipulation types with a few samples or in even zero‑shot settings. Extensive experiments demonstrate that CORE surpasses state‑of‑the‑art models. The dataset and code are publicly available at https://github.com/shen8424/CORE.
Authors:Phillip Jiang
Abstract:
Relational databases underpin modern enterprise, scientific, and healthcare systems, yet predictive machine learning on such data remains challenging due to their multi‑table, heterogeneous, and temporal structure. Relational Deep Learning (RDL) addresses this by representing databases as heterogeneous graphs and applying graph neural networks (GNNs) directly. RelBench v2 recently introduced autocomplete tasks ‑‑ a practically motivated task type where the goal is to predict an existing column value from relational context, analogous to an intelligent form‑filling assistant. We propose RelGT‑AC (Relational Graph Transformer for Autocomplete), extending the RelGT architecture with three targeted contributions: (1) a column masking strategy that prevents trivial solutions by masking the target column during subgraph encoding; (2) a unified task head supporting binary classification, multiclass classification, and regression autocomplete tasks within a single model; and (3) a TF‑IDF text encoder that automatically detects and encodes free‑text columns, recovering strong lexical signal that categorical encoders discard. Across 7 tasks spanning 3 RelBench v2 datasets (rel‑trial, rel‑f1, rel‑stack), RelGT‑AC outperforms the GraphSAGE baseline on all 3 regression autocomplete tasks and achieves up to +10 AUROC points on text‑heavy eligibility tasks via the TF‑IDF encoder.
Authors:Mingkuan Zhao, Wentao Hu, Tianchen Huang, Yuheng Min, Suquan Chen, Yide Gao, Yanbo Zhai, Shuangyong Song, Xuelong Li
Abstract:
Hallucination in Large Language Models (LLMs), characterized by the generation of content inconsistent with contextual facts or logical constraints ‑‑ remains a persistent challenge for reliable deployment. In this work, we address this issue through a geometric framework rooted in the linear representation hypothesis. We propose that hallucinations manifest as orthogonal noise relative to the semantic manifold of the residual stream. Specifically, we hypothesize that while attention heads ideally propagate information congruent with the context subspace, hallucinations arise when specific heads introduce components orthogonal to this subspace, disrupting the coherence of the latent representation. Based on this formulation, we introduce Dynamic Contextual Orthogonalization (DCO), an inference‑time intervention method. DCO utilizes the input residual stream as a dynamic context anchor to perform orthogonal decomposition on attention head outputs. To distinguish between context‑aligned semantic updates and divergent noise, DCO employs a layer‑wise Z‑score suppression mechanism that selectively attenuates outlier orthogonal components based on statistical distributions. Evaluations on Llama‑3‑8B and 70B across benchmarks such as XSum, NQ‑Swap, and IFEval demonstrate that DCO achieves superior contextual faithfulness compared to state‑of‑the‑art intervention baselines. Furthermore, DCO maintains high performance on knowledge‑intensive tasks like TriviaQA and TruthfulQA, effectively mitigating the trade‑off between hallucination suppression and parametric knowledge retention often observed in existing methods. Our findings validate the geometric interpretation of hallucinations and establish DCO as a computationally efficient approach for enforcing manifold alignment.Our code is available at https://github.com/Harry‑Miral/DCO
Authors:Oskar Natan, Jun Miura
Abstract:
We present a novel compact deep multi‑task learning model to handle various autonomous driving perception tasks in one forward pass. The model performs multiple views of semantic segmentation, depth estimation, light detection and ranging (LiDAR) segmentation, and bird's eye view projection simultaneously without being supported by other models. We also provide an adaptive loss weighting algorithm to tackle the imbalanced learning issue that occurred due to plenty of given tasks. Through data pre‑processing and intermediate sensor fusion techniques, the model can process and combine multiple input modalities retrieved from RGB cameras, dynamic vision sensors (DVS), and LiDAR placed at several positions on the ego vehicle. Therefore, a better understanding of a dynamically changing environment can be achieved. Based on the ablation study, the model variant trained with our proposed method achieves a better performance. Furthermore, a comparative study is also conducted to clarify its performance and effectiveness against the combination of some recent models. As a result, our model maintains better performance even with much fewer parameters. Hence, the model can inference faster with less GPU memory utilization. Moreover, the result tends to be consistent in 3 different CARLA simulation datasets and 1 real‑world nuScenes‑lidarseg dataset. To support future research, we share codes and other files publicly at https://github.com/oskarnatan/compact‑perception.
Authors:Siva Rajesh Kasa, Yasong Dai, Sumit Negi, Hongdong Li
Abstract:
Diffusion large language models promise parallel token generation, yet inference remains bottlenecked by deciding which masked tokens can be safely committed together. Fast‑dLLM addressed this with KV caching and confidence‑guided parallel decoding, but its decoding theory uses a homogeneous high‑confidence assumption that effectively reduces each candidate set to its weakest selected token. We argue that this leaves speed on the table because real decoding steps exhibit heterogeneous confidence profiles. We propose Fast‑dLLM++, a training‑free extension that introduces \emphFréchet profile decoding: selecting parallel commit sets from the full sorted confidence profile rather than a single worst‑case confidence. The resulting rule is a heterogeneous‑confidence generalization of Fast‑dLLM's factor selector and it recovers the previous rule exactly in the equal‑confidence case and adds a provable \emphheterogeneity bonus when the selected tokens have uneven confidences. Fast‑dLLM++ leaves the model, diffusion process, and cache implementation entirely unchanged, making it a drop‑in replacement for existing Fast‑dLLM decoding. Experiments on GSM8K, MATH, HumanEval, and MBPP with the LLaDA‑8B model show that the theoretical improvement translates directly into empirical gains: profile‑aware selection improves the accuracy‑‑throughput frontier by exploiting safe parallelism that weakest‑token rules miss, achieving up to 37% higher throughput at comparable accuracy. Our anonymous code release is at https://github.com/Ringo‑Star/FastdLLM_plusplus.
Authors:Nikolaj Hindsbo, Sina Ehsani, Pragyana Mishra
Abstract:
Deploying language‑driven agents in robotics requires evaluations that reflect real‑world task demands: natural‑language instructions with reproducible outcomes. Such agents must connect language models to callable perception and control tools, and be assessed using deployment‑critical metrics including latency, accuracy, and error modes. We present SCOPE (Simulation and Camera Operations for Perception and Evaluation), a modular agent for natural‑language, open‑vocabulary pan‑tilt‑zoom (PTZ) camera control and visual scene understanding, designed explicitly for edge deployment. SCOPE operates both in a Blender‑based simulation environment and on a physical PTZ camera, executing all perception, planning, and control locally at the deployment site using edge‑accessible compute. We release a 536‑task benchmark spanning QA, single‑ and multi‑step commands, counting, spatial reasoning, descriptions, and optical character recognition in a Blender‑based simulation environment that exposes realistic PTZ control affordances. Execution traces are combined with an LM‑as‑Judge to evaluate latency, accuracy, and error modes. We evaluate 19 planner‑perception model combinations pairing Qwen3 small language models (SLMs) with Moondream and Qwen vision‑language models (VLMs). Stronger SLMs substantially reduce hallucinations and improve tool routing, leading to more reliable closed‑loop behavior. Once a sufficiently capable SLM is used, perception becomes the dominant performance bottleneck. Mixture‑of‑Experts models on both the planning and perception side consistently match or exceed dense alternatives at latencies and memory footprints comparable to much smaller networks. Quantization provides additional efficiency gains with minimal accuracy degradation, identifying a practical, sim‑to‑real validated design point for real‑time, edge‑feasible language‑driven PTZ control.
Authors:Simone Caldarella, Davide Talon, Rahaf Aljundi, Elisa Ricci, Massimiliano Mancini
Abstract:
Large Reasoning Models (LRMs) improve performance by generating explicit intermediate reasoning traces through increased test‑time compute, yet the assumption that longer reasoning is consistently beneficial remains under‑examined. While recent evidence shows that additional reasoning can lead models to overthink, we ask: "Once a model has reached the correct answer, does further reasoning refine the solution, or deviate from it?" To study the dynamics after correctness, we introduce a prefix‑level trajectory evaluation protocol grounded in reasoning sufficiency, defining the minimum reasoning budget required for a model to first generate the correct answer. This allows us to disentangle verbose overthinking, where additional reasoning is redundant but harmless, from harmful overthinking, where continued reasoning destabilizes an already‑correct trajectory. Starting from multimodal benchmarks, we find that many instances considered reasoning‑intensive require surprisingly little reasoning. Moreover, stopping at the first correct prefix improves accuracy over standard reasoning up to 21%, revealing that current models are limited not only by their ability to reason, but also by their inability to stop at the right time. Furthermore, while common efficiency strategies like early stopping substantially reduce verbose overthinking (up to 50%), they fail to mitigate harmful overthinking. Failure analysis reveals that correctness deviations are mainly driven by logical drift and visual reinterpretation. Finally, we show that our findings generalize to language‑only reasoning benchmarks, highlighting harmful overthinking as a broader reliability risk. Code available at https://simonecaldarella.github.io/thinking‑past‑the‑answer.
Authors:Aditi, Niket Agarwal, Arslan Ali, Jon Allen, Martin Antolini, Adeline Aubame, Alisson Azzolini, Junjie Bai, Maciej Bala, Yogesh Balaji, Josh Bapst, Aarti Basant, Mukesh Beladiya, Mohammad Qazim Bhat, Zaid Pervaiz Bhat, Dan Blick, Vanni Brighella, Han Cai, Tiffany Cai, Eric Cameracci, Jiaxin Cao, Yulong Cao, Mark Carlson, Carlos Casanova, Ting-Yun Chang, Yan Chang, Yu-Wei Chao, Prithvijit Chattopadhyay, Roshan Chaudhari, Chieh-Yun Chen, Junyu Chen, Ke Chen, Qizhi Chen, Wenkai Chen, Xiaotong Chen, Yu Chen, An-Chieh Cheng, Click Cheng, Xiu Chia, Jeana Choi, Chaeyeon Chung, Wenyan Cong, Yin Cui, Magdalena Dadela, Nalin Dadhich, Wenliang Dai, Joyjit Daw, Alperen Degirmenci, Rodrigo Vieira Del Monte, Robert Denomme, Sameer Dharur, Marco Di Lucca, Ke Ding, Wenhao Ding, Yifan Ding, Yuzhu Dong, Nicole Drumheller, Yilun Du, Aigul Dzhumamuratova, Aleksandr Efitorov, Hamid Eghbalzadeh, Naomi Eigbe, Imad El Hanafi, Hassan Eslami, Benedikt Falk, Jiaojiao Fan, Jim Fan, Amol Fasale, Sergiy Fefilatyev, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Vikram Fugro, Prashant Gaikwad, TJ Galda, Katelyn Gao, Yihuai Gao, Wenhang Ge, Sreyan Ghosh, Arushi Goel, Vivek Goel, Akash Gokul, Rama Govindaraju, Jinwei Gu, Miguel Guerrero, Elfie Guo, Aryaman Gupta, Siddharth Gururani, Hugo Hadfield, Song Han, Ankur Handa, Zekun Hao, Mohammad Harrim, Ali Hassani, Nathan Hayes-Roth, Yufan He, Chris Helvig, Cyrus Hogg, Madison Huang, Michael Huang, Sophia Huang, Yufan Huang, Jacob Huffman, DeLesley Hutchins, Suneel Indupuru, Boris Ivanovic, Arihant Jain, Joel Jang, Ryan Ji, Yanan Jian, Dongfu Jiang, Jingyi Jin, Atharva Joshi, Nikhilesh Joshi, Pranjali Joshi, Jaehun Jung, Weiwei Kang, Scott Kassekert, Jan Kautz, Ashna Khetan, Julia Kiczka, Slawek Kierat, Gwanghyun Kim, Kuno Kim, Sunny Kim, Kezhi Kong, Xin Kong, Zhifeng Kong, Tomasz Kornuta, Egor Krivov, Hui Kuang, Saurav Kumar, Chia-Wen Kuo, George Kurian, Wojciech Kutak, JF Lafleche, Himangshu Lahkar, Omar Laymoun, Jayjun Lee, Sanggil Lee, Gabriele Leone, Boyi Li, Freya Li, Jiajun Li, Jinfeng Li, Ling Li, Pengcheng Li, Shangru Li, Tingle Li, Xiaolong Li, Xuan Li, Zhaoshuo Li, Zhiqi Li, Hao Liang, Maosheng Liao, Chen-Hsuan Lin, Tsung-Yi Lin, Ming-Yu Liu, Sifei Liu, Zihan Liu, Hai Loc Lu, Xiangyu Lu, Alice Luo, Ruipu Luo, Wenjie Luo, Jiangran Lyu, Martin Ding Ma, Nic Ma, Qianli Ma, Dawid Majchrowski, Louis Marcoux, Miguel Martin, Qing Miao, Ashkan Mirzaei, Shreyas Misra, Kaichun Mo, Durra Mohsin, Hyejin Moon, Pawel Morkisz, Saeid Motiian, Kirill Motkov, Seungjun Nah, Yashraj Narang, Deepak Narayanan, Thabang Ngazimbi, Julian Ouyang, David Page, Yatian Pang, Sehwi Park, Mahesh Patekar, Mostofa Patwary, Marco Pavone, Trung Pham, Wei Ping, Soha Pouya, Shrimai Prabhumoye, Varun Praveen, Delin Qu, Hesam Rabeti, Morteza Ramezanali, Marilyn Reeb, Xuanchi Ren, Kristen Rumley, Wojciech Rymer, Jun Saito, Yeongho Seol, John Shao, Piyush Shekdar, Tianwei Shen, Humphrey Shi, Min Shi, Stella Shi, Kevin Shih, Mohammad Shoeybi, Mateusz Sieniawski, Shuran Song, Alexander Sotelo, Amir Sotoodeh, Sunil Srinivasa, Vignesh Srinivasakumar, Bartosz Stefaniak, Rahul Heinrich Steiger, Shangkun Sun, Jiaxiang Tang, Shitao Tang, Yangyang Tang, Yue Tang, Tolou Tavakkoli, Kayley Ting, Krzysztof Tomala, Wei-Cheng Tseng, Jibin Varghese, Sergei Vasilev, Thomas Volk, Raju Wagwani, Roger Waleffe, Andrew Z. Wang, Boxiang Wang, Haoxiang Wang, Qiao Wang, Shihao Wang, Shijie Wang, Ting-Chun Wang, Yan Wang, Yu Wang, David Wehr, Fangyin Wei, Xinshuo Weng, Jay Zhangjie Wu, Kedi Wu, Hongchi Xia, Summer Xiao, Tianjun Xiao, Kevin Xie, Daguang Xu, Jiashu Xu, Mengyao Xu, Ruqing Xu, Xingqian Xu, Yao Xu, Dinghao Yang, Dong Yang, Hans Yang, Xiaodong Yang, Xuning Yang, Yichu Yang, Yurong You, Zhiding Yu, Hao Yuan, Simon Yuen, Xiaohui Zeng, Pengcuo Zeren, Cindy Zha, Haotian Zhang, Jenny Zhang, Jing Zhang, Liangkai Zhang, Paris Zhang, Shun Zhang, Xuanmeng Zhang, Zhizheng Zhang, Ann Zhao, Yilin Zhao, Yuliya Zhautouskaya, Charles Zhou, Fengzhe Zhou, Shilin Zhu, Yuke Zhu, Dima Zhylko, Artur Zolkowski
Abstract:
We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture‑of‑transformers architecture. By supporting highly flexible input‑output configurations, Cosmos 3 seamlessly unifies critical modalities for Physical AI ‑‑ effectively subsuming vision‑language models, video generators, world simulators, and world‑action models into a single framework. Our evaluation demonstrates that Cosmos 3 establishes a new state‑of‑the‑art across a diverse suite of understanding and generation tasks, demonstrating omnimodal world models as scalable, general‑purpose backbones for embodied agents. Our post‑trained Cosmos 3 models were ranked as the best open‑source Text‑to‑Image and Image‑to‑Video models by Artificial Analysis, and the best policy model by RoboArena at the time the technical report was written. To accelerate open research and deployment in Physical AI, we make our code, model checkpoints, curated synthetic datasets, and evaluation benchmark available under the Linux Foundation's OpenMDW‑1.1 https://openmdw.ai/license/1‑1/ License at https://github.com/nvidia/cosmosgithub.com/nvidia/cosmos and https://huggingface.co/collections/nvidia/cosmos3 . The project website is available at https://research.nvidia.com/labs/cosmos‑lab/cosmos3 .
Authors:Fabian Degen, Oishi Deb, Jindong Gu, Junchi Yu, Samuele Marro, Philip Torr, Jialin Yu
Abstract:
Planning records define restrictions over geographic areas, but their source documents often provide only indirect spatial evidence rather than machine‑readable boundaries. We introduce Plan2Map, a 208‑case multimodal benchmark for document‑grounded geospatial boundary reconstruction from UK planning records. Given only a source planning document, systems must reconstruct a valid geospatial boundary from notice text, schedules, map plates, map labels, and boundary annotations; the reference GeoJSON is held out for scoring. We propose GeoPlanAgent, a document‑grounded, geospatial‑tool‑in‑the‑loop system that decomposes the task into evidence extraction, localisation, map registration, boundary segmentation, projection, and verification. On Plan2Map, GeoPlanAgent achieves 0.736 mean IoU and 0.904 median IoU, with 67.8% of predictions at or above 0.8 IoU, substantially outperforming direct VLM‑to‑GeoJSON baselines. Diagnostic analysis shows that direct VLM prediction remains unreliable, while remaining errors are concentrated in localisation and map registration, and supervised boundary segmentation substantially improves pixel‑level mask quality. Plan2Map provides a concrete testbed for multimodal geospatial reconstruction from public planning records. Project page: https://odeb1.github.io/Plan2Map_Project_Page/.
Authors:Hui Li, Yangfan Gao, Junlin Shang, Changhao Jiang, Tao Gui, Qi Zhang, Xuanjing Huang
Abstract:
Audio tokenizers serve as the discrete interface between continuous audio and Audio Language Models (ALMs), but existing tokenizers often struggle to support both understanding and generation. Reconstruction‑oriented codecs preserve acoustic fidelity but lack rich semantics, while semantic‑aware tokenizers typically rely on separate semantic and acoustic streams, introducing redundancy or misalignment. We propose EntangleCodec, a unified discrete audio tokenizer that learns caption‑aligned semantic‑acoustic representations before quantization. By aligning audio with rich captions rather than ASR transcripts, EntangleCodec captures linguistic content, speaker identity, emotion, prosody, and acoustic scenes within a compact token stream. A flow‑matching diffusion decoder further enables high‑quality reconstruction across speech, music, and general audio. EntangleCodec achieves reconstruction quality competitive with specialized codecs, outperforms all codec‑based baselines on audio understanding by up to +7.4% on MMAR, and supports both TTS and TTA generation in a unified framework. Furthermore, EntangleCodec‑based audio language models demonstrate strong scaling behavior: even at 0.6B parameters, the model surpasses specialized continuous‑representation LLMs with over 13B parameters across three benchmarks using 22× fewer parameters; scaling to 8B further establishes new state‑of‑the‑art results on MMAR, highlighting that representation quality is as critical as model scale in audio language modeling. Code and model weights are available at https://github.com/luckyerr/EntangleCodec.
Authors:Andrianos Michail, Elias Schuhmacher, Juri Opitz, Simon Clematide, Rico Sennrich
Abstract:
Dense retrieval models exhibit positional bias: retrieval effectiveness degrades when relevant information appears later in a passage (Zeng et al., 2025). We ask whether this bias can be reduced at inference time, without retraining and without sacrificing overall retrieval effectiveness. To this end, we adapt inference‑time attention calibration (Schuhmacher et al., 2026) to downstream retrieval and extend it with a strength coefficient lambda that interpolates between the original and fully calibrated attention distributions. Across three embedding models on SQuAD‑PosQ and FineWeb‑PosQ, we examine how basket size, calibrated layer set, and strength affect the trade‑off between positional fairness and retrieval effectiveness, finding that partial calibration frequently outperforms full calibration. A single configuration (B=128, lambda=0.5, 50% layer depth) improves the harmonic mean of nDCG@10 across positional groups on FineWeb‑PosQ for all three models without per‑model tuning, and applies to both <s>‑pooled and last‑token‑pooled architectures. This default configuration transfers without modification to PosIR, which spans 10 languages and 31 domains, reducing the Position Sensitivity Index in all 16 length‑quartile x model x retrieval‑setting combinations, while preserving or improving aggregate nDCG@10. We release our extended codebase at https://github.com/impresso/fair‑sentence‑transformers
Authors:Yaoting Wang, Yun Zhou, Zipei Zhang, Henghui Ding
Abstract:
Audio‑visual speaker tracking aims to localize and track active speakers by leveraging auditory and visual cues, enabling fine‑grained, human‑centric scene understanding. This capability is essential for real‑world applications such as intelligent video editing, surveillance, and human‑computer interaction. However, existing datasets are largely limited to simple or homogeneous audio‑visual scenes with coarse annotations. Such oversimplified settings bias evaluation toward static audio‑visual co‑occurrence, rather than rigorously assessing robust spatiotemporal modeling and cross‑modal reasoning in complex, dynamic scenes. To address these limitations, we introduce AVTrack, a human‑centric audio‑visual instance segmentation (AVIS) dataset designed for dynamic real‑world scenarios. AVTrack features diverse and challenging conditions, including camera motion, visual occlusions, and position changes. Evaluations of representative AVIS methods on AVTrack reveal substantial performance degradation, establishing AVTrack as a challenging benchmark for robust human‑centric audio‑visual scene understanding in complex environments. We further provide a simple yet effective baseline to facilitate future research. Project website: https://FudanCVL.github.io/AVTrack/
Authors:Yuying Li, Leqi Zheng, Yongzi Yu, Wenrui Zhou, Xuchang Zhong, Xing Hu, Jing Jin, Hangjie Yuan, Tao Feng
Abstract:
On‑Policy distillation (OPD) in large language models is shifting from full‑trace KL supervision toward more selective training paradigms. Recent OPD methods increasingly focus on selecting which trajectories to learn from, which tokens are most informative, and which supervision signals are most reliable. Motivated by this trend, we rethink optimization granularity of OPD and propose \fireicon\ FiRe‑OPD (Filter, then Reweight), which jointly adjusts supervision signals at both trajectory and token levels. In details, FiRe‑OPD first filters trajectories to remove low‑quality rollout samples, and then applies soft reweighting within the retained trajectories to emphasize informative tokens. Compared with hard token selection, FiRe‑OPD leverages a soft‑weighting mechanism to effectively mitigate information loss and enhance optimization stability, thereby achieving finer‑grained OPD optimization. We validate the effectiveness of FiRe‑OPD across strong‑to‑weak, single‑teacher, and multi‑teacher settings, and demonstrate its superiority over recent token‑level OPD methods ( (e.g., +6.25 on AIME 2024 in strong‑to‑weak, +18.81 on Miner in multi‑teacher). Our code is available at https://github.com/YuYingLi0/FiRe‑OPD.
Authors:Yunlong Zhou, Chen Zhao, Danyang Peng, Fanfan Ji, Xiao-Tong Yuan
Abstract:
Accurate precipitation nowcasting is vital for disaster mitigation, but deep learning methods face a key trade‑off: regression models produce over‑smoothed, spectrally decaying predictions that blur convective details and violate turbulence power laws; diffusion models generate realistic yet unanchored hallucinations lacking physical grounding. We propose Spectral‑Decoupled Iterative Refinement (SDIR), a deterministic framework that reformulates nowcasting as progressive frequency‑decoupled refinement. SDIR first extracts a stable low‑frequency synoptic skeleton, then iteratively refines high‑frequency textures under physical constraints, eliminating both blurring and hallucinations. It features a dual‑path design: the Synoptic Frequency‑Guided Former (SFG‑Former) with Scale‑Adaptive Transformers for global structure, and the Fourier Residual Refiner (FR‑Refiner) with Scale‑Conditioned Fourier Neural Operators for fine residuals. A Physically Consistent Power Spectral Density (PCPSD) loss with dynamic masking enforces a turbulence‑consistent spectral distribution. Experiments on three benchmarks show SDIR significantly outperforms SOTA methods in spatial accuracy while achieving spectral fidelity competitive with diffusion‑based methods, enabling reliable high‑resolution operational nowcasting. Code link: https://github.com/RuntimeWarning/SDIR.
Authors:Chenshuang Zhang, Kyeong Seon Kim, Chengxin Liu, Tae-Hyun Oh
Abstract:
Despite the success of audio‑visual large‑language models (LLMs), they can produce plausible but ungrounded outputs, termed hallucination. Existing benchmarks focus on environmental sounds (e.g., dog barking) to indicate event occurrence. In contrast, human speech carries fundamentally different, rich semantics and temporal structures, yet it remains unexplored whether current models can accurately align speech content with corresponding visual signals. In this work, we show that speech content can induce hallucinations in audio‑visual LLMs. To systematically study this, we introduce SVHalluc, the first comprehensive benchmark for evaluating speech‑vision hallucination in audio‑visual LLMs. Our benchmark diagnoses speech‑vision hallucinations from two critical and complementary aspects: semantic and temporal. Experimental results demonstrate that state‑of‑the‑art open‑source audio‑visual LLMs struggle with aligning speech content with corresponding visual signals, with a near‑random accuracy on multiple tasks. In contrast, Gemini 2.5 Pro significantly outperforms the open‑source models. Our analysis suggests that their failures stem from limited ability in cross‑modality understanding, despite strong performance in single‑modality perception. Our work uncovers a new and fundamental limitation of current audio‑visual LLMs and highlights the need for speech‑grounded video comprehension. Project page: https://chenshuang‑zhang.github.io/projects/svhalluc/.
Authors:Yuejiao Wang, Zihao Ji, Pengfei Cai, Xu Li, Haorui Zheng, Zewen Song, Zhongliang Liu, Chen Zhang, Pengfei Wan
Abstract:
Recent advances in neural song generation have enabled high‑quality synthesis from lyrics and global textual prompts. However, most systems fail to model temporally varying attributes of songs, severely limiting fine‑grained control over musical structure and dynamics. To address this, we propose SegTune, a Diffusion Transformer‑based framework enabling structured and fine‑grained controllability by allowing users or large language models (LLMs) to specify local musical descriptions aligned to song segments. These segment prompts are temporally broadcast to corresponding time windows, while global prompts ensure stylistic coherence. To support precise lyric‑to‑music alignment, we introduce an LLM‑based duration predictor that autoregressively generates sentence‑level timestamps in LyRiCs format. We further construct a large‑scale data pipeline for high‑quality song collection with aligned lyrics and prompts, and propose new metrics to evaluate segment alignment and vocal consistency. Experiments demonstrate that SegTune outperforms existing baselines in both musicality and controllability. Visit our project page (https://github.com/KlingAIResearch/SegTune) for codes and more generated songs.
Authors:Zaifei Yang, Samuel Ping-Man Choi, James Kwok
Abstract:
Protein‑protein interactions (PPIs) are essential for many biological processes. However, existing PPI prediction approaches suffer from two major limitations: they overlook the hierarchical organization of proteins, particularly meso‑scale motifs that critically regulate PPIs, and fail to effectively integrate sequence, structure, and function modalities. To address these limitations, we propose MMM‑PPI, a Hierarchical Motif‑based Multi‑Modal protein Encoder for PPI Prediction that constructs PPI embeddings in a bottom‑up multi‑modal manner across three scales. At the micro‑scale, we encode three modal residue features; at the meso‑scale, a novel multimodal motif encoder aggregates residues into spatially‑informed motif embeddings; at the macro‑scale, a multimodal protein encoder integrates motifs into protein embeddings by jointly modeling motif importance and inter‑modal correlations. The pre‑trained encoder can be used off‑the‑shelf for large‑scale PPI prediction. Extensive experiments on multiple PPI datasets show that MMM‑PPI outperforms state‑of‑the‑art multi‑label PPI prediction models, particularly under challenging data partitions and limited data scenarios. Codes are in https://github.com/yzf‑code/MMM‑PPI.
Authors:Jin Gao, Juntu Zhao, Zirui Zeng, Jiaqi Shen, Junhao Shi, Dukun Zhao, Yuming Lu, Dequan Wang
Abstract:
AI for scientific discovery is entering an agentic era, where protein‑engineering systems are expected to prioritize future wet‑lab experiments rather than merely fit static measurements. We introduce TadA‑Bench, a million‑variant wet‑lab replay benchmark from 31 TadA directed‑evolution rounds for future‑round discovery toward agentic protein engineering. TadA‑Bench preserves the campaign chronology and defines a fixed‑data replay task: given earlier experimental rounds, models rank variants that appear only in later rounds. It provides aligned DNA, RNA, and protein views, and uses Seq2Graph, a graph‑based label‑unification pipeline, to reconcile noisy enrichment measurements into consistent cross‑round activity labels. Random‑split controls show strong interpolation, but future‑round ranking and finite‑budget candidate selection are much weaker. Controlled analyses suggest that evolutionary coverage is more informative than local data density, positioning TadA‑Bench as a reproducible wet‑lab replay substrate for future‑round discovery toward agentic protein engineering; the data and code are released on Hugging Face and GitHub.
Authors:Nikola Cenikj, Özgün Turgut, Alexander Müller, Alexander Steger, Jan Kehrer, Marcus Brugger, Daniel Rueckert, Philip Müller
Abstract:
Coronary artery stenosis is a common cardiovascular disease, with severe, untreated cases posing significant risks of heart attack. Although coronary (X‑ray) angiograms remain the standard for stenosis diagnosis, they are invasive, time‑ and resource‑intensive, and therefore only performed on patients with a high probability of disease based on symptoms and prior clinical tests. However, a subset of patients, especially those without symptoms, may remain undiagnosed. Detecting indications of stenosis from ECGs, which are fast, cheap, non‑invasive, and thus routinely acquired even in asymptomatic patients, would support early diagnosis. However, as no reliable stenosis‑specific signal has been identified in ECGs, they can not currently be used for stenosis risk stratification. To address this, we introduce StenCE, a pretraining framework, allowing stratification of patients based on features derived directly from ECGs. Evaluations across varying stenosis severity thresholds and additional ECG disease classification tasks demonstrate consistent performance improvements across different ECG encoders, outperforming previous work. The obtained models successfully detect signals for stenosis diagnosis in ECGs and are the first to achieve high performance in severe stenosis classification. The source code is available at https://github.com/NikolaCenic/ecg‑stenosis‑cls.
Authors:Alice Gomez-Cantos, Henry O. Velesaca
Abstract:
Urban nitrogen dioxide (NO_2) is a key indicator of combustion‑related air pollution and exhibits strong spatial and temporal variability in cities. This study presents a satellite‑based framework for tracking urban NO_2 pollution using tropospheric column observations from Sentinel‑5P/TROPOMI over Guayas Province, Ecuador. Rather than estimating surface concentrations, the methodology emphasizes robust distributional metrics, including the median and upper‑tail percentiles (P_90, P_95, and P_99), to characterize background conditions and localized pollution extremes at the canton scale. Multi‑year satellite observations are aggregated annually and analyzed using unsupervised K‑means clustering to identify characteristic pollution regimes without predefined thresholds. Results show that highly urbanized cantons consistently exhibit elevated extreme NO_2 values and greater variability, while less urbanized areas display lower and more homogeneous patterns. The proposed approach provides an interpretable and scalable tool for urban air‑quality assessment in data‑scarce regions using satellite observations alone. The implementation is publicly available on GitHub https://hvelesaca.github.io/sentinel‑5P‑clustering/.
Authors:Zewen Liu, Zhan Shi, Yisi Sang, Bing He, Minhua Lin, Tianxin Wei, Dakuo Wang, Benoit Dumoulin, Wei Jin, Hanqing Lu
Abstract:
Auto‑harness systems such as A‑Evolve, GEPA, and Meta‑Harness improve LLM agents by optimizing prompts, skills, tools, memories, and supporting infrastructure from execution feedback, but they are typically evaluated on fixed offline benchmarks. Real deployments instead present open‑ended task streams: histories grow without a fixed endpoint, heterogeneous tasks require different harnesses, and problem distributions shift over time. These challenges make a single repeatedly and densely updated harness brittle, causing performance degradation as accuracy peaks early and then declines. This motivates sustained harness construction with task‑wise adaptation. We introduce Adaptive Auto‑Harness, a framework and system for such streams. The framework decomposes the gap to an oracle harness into evolution loss and adaptation loss. The system addresses these losses with a stateful multi‑agent evolver, a harness tree with solve‑time routing, and human‑steering hooks for cases where history lacks the needed signal. Across prediction‑market, security‑competition, and event‑forecasting streams, Adaptive Auto‑Harness outperforms five existing auto‑harness baselines and ablations attribute gains to better construction, routing, or targeted human steering. Code is available in \hrefhttps://github.com/A‑EVO‑Lab/a‑evolve/tree/release/adaptive‑auto‑harnessLink.
Authors:Wentao Mo, Yang Liu
Abstract:
Current 3D spatial reasoning methods face a fundamental trade‑off: neuro‑symbolic 3D (NS3D) concept learners achieve interpretable reasoning through compositional programs but are constrained to closed‑set concept vocabularies and simple programs; end‑to‑end 3D multi‑modal LLMs (3D MLLMs) could handle complex natural language and open‑vocabulary concepts but suffer from black‑box reasoning without explicit spatial verification. We introduce APEIRIA, a neuro‑symbolic 3D MLLM to bridge two paradigms by distilling symbolic reasoning patterns into MLLMs with natural language chain‑of‑thought. Our three‑stage curriculum progressively builds reasoning capabilities: a) 3D perception alignment grounds object visual‑geometric features to the LLM, b) CoT‑SFT teaches query decomposition and stepwise verification from symbolic program traces, and c) CoT‑RL extends reasoning patterns to open‑set concepts and deeply nested instructions. By transferring reasoning patterns rather than concept‑specific knowledge, APEIRIA preserves key NS3D virtues: transparent reasoning and modular interchangeability of planning and perception components. Evaluations on grounding, question answering, and captioning show that APEIRIA surpasses prior NS3D methods and matches state‑of‑the‑art 3D MLLMs on 3D spatial reasoning datasets, unifying symbolic methods' systematic reasoning with MLLMs' flexibility. Code is available at https://github.com/oceanflowlab/APEIRIA.
Authors:Qi Han Wong
Abstract:
We investigate whether large language models produce different medical triage recommendations for identical symptoms based solely on the language of the patient prompt. Using Gemini 3.5 Flash, we evaluate a neurological symptom profile (persistent headache, blurred vision, nausea) across six languages (English, Spanish, Chinese, Hindi, Japanese, Arabic) with 30 runs per condition (n=450 total API calls). We find that the model recommends emergency room visits at rates ranging from 0% (Japanese, Hindi) to 30% (English, Arabic), despite assigning nearly identical severity scores (7.7‑8.0/10) across all languages. Adding a single sentence specifying the patient's US location increases ER recommendations by up to 76.7 percentage points for non‑English prompts, while the reverse anchor (English prompt with a Tokyo location) reduces the ER rate from 30% to 6.7%. A back‑translation control (Japanese to English) produces ER rates comparable to the English baseline, confirming that the disparity is not caused by translation quality but by implicit geographic inference from the input language. We release the complete dataset, experiment code, and results.
Authors:Shailesh Rana
Abstract:
Language models do not simply choose an answer at the output layer. In a 9,000‑trajectory MMLU study across Qwen2.5‑7B‑Instruct, Llama‑3.1‑8B‑Instruct, and Mistral‑7B‑Instruct‑v0.3, the score of the answer moves across depth in structured ways. We describe each trajectory with three quantities: the current answer margin, the next‑layer change in that margin, and the distance from a decision flip. The main empirical picture is that correctness and stability are different: the largest group is unstable‑correct, not stable‑correct. A traced subset then asks what moves the margin. In stable‑correct cases, the average attention scalar points in the correct direction, while the average MLP scalar does not; span deletion shows that removing answer‑supporting text hurts the margin and removing distractor‑like text helps it. The result is not a full circuit explanation. It is a reproducible way to see which answers are settled, which remain fragile, and which measured sources move them.
Authors:Rashad Aziz, Ikhlasul Akmal Hanif, Fajri Koto
Abstract:
Safety alignment learned in high‑resource languages transfers poorly to low‑resource languages. Models refuse harmful prompts in English but fail to refuse when the same prompts are translated into Swahili or Burmese. Adaptive steering methods like AdaSteer and CAST inherit this failure cross‑lingually. We diagnose where transfer breaks down. Across Qwen2.5‑7B, Gemma‑2‑9B, and Llama‑3.1‑8B on 23 languages, the harmfulness direction extracted from high‑resource activations linearly separates harmful from harmless low‑resource prompts nearly as well as high‑resource ones. The relevant representation is present. Yet harmful refusal drops from 87.9% to 43.9%. The model fails to convert the representation into refusal. What fails to transfer is calibration of the safety decision, not the underlying representation. We exploit this by recalibrating, rather than retraining, a high‑resource gate: a low‑rank logistic readout with its decision threshold reset using as few as 1 to 4 target‑language examples per class. The gate routes between refusal steering and harmfulness‑direction ablation, substantially raising mean refusal selectivity (Δ = harmful ‑ harmless refusal) from 33.6 for the strongest adapted baseline to 54.5 while preserving MMLU utility. These results suggest that some low‑resource safety failures can be repaired by recalibrating existing representations rather than learning new ones. Our code is released: https://github.com/rashadaziz/low‑resource‑safety.
Authors:Boqian Wu, Qiao Xiao, Patrik Okanovic, Tomasz Sternal, Maurice van Keulen, Mykola Pechenizkiy, Elena Mocanu, Torsten Hoefler, Decebal Constantin Mocanu
Abstract:
Scaling laws for dense LLMs under infinite data are well explored, but how sparsity interacts with limited data is not. In this work, we study sparse training in data‑constrained regimes where limited unique tokens require multi‑epoch training. Our experiments span models up to 1.92B parameters in the fitting set, sparsity up to 93.75%, unique data budgets up to 2.6B tokens, and total training tokens up to 41.6B over 16 epochs; we further validate extrapolation on held‑out dense‑equivalent models up to 7.68B parameters. We find that: 1. Sparse scaling in data‑limited settings: We introduce a scaling law that models loss as a function of active parameters, unique tokens, data repetition, and sparsity, accurately predicting performance across compute and data budgets. 2. Delayed data saturation: sparse training postpones diminishing returns from repeated data, making multi‑epoch training more effective. 3. Resource trade‑offs: With fixed data, loss‑optimal sparsity is moderate ~ 50%, while compute‑optimal sparsity is higher and grows with data scale. Overall, sparsity is not just a tool for efficiency, but a mechanism for improving scaling trade‑offs under data scarcity. Our code is available at: https://github.com/boqian333/sparse‑dc‑scaling.
Authors:Haowei Han, Kexin Hu, Weiwei Cai, Debiao Zhang, Bin Qin, Yuxiang Wang, Jiawei Jiang, Xiao Yan, Bo Du
Abstract:
Command understanding systems in smart home ecosystems can automate device control and substantially improve user experience. However, while they perform well on precise utterances (e.g., "turn on the bedroom light"), they struggle with ambiguous or misaligned commands (e.g., "make the bedroom cozy"). Large language models (LLMs) generalize well across various domains and can outperform traditional rule‑based systems on such tasks, but their effectiveness is often constrained by scarce domain‑specific data, insufficient task‑specific adaptation, and high computational costs. In this paper, we propose an automated training data synthesis workflow using user logs and LLMs; then we build MiCU, a domain‑specific LLM that excels at command understanding. Specifically, we employ curriculum learning to inject domain knowledge into the base LLM, then we enhance its reasoning ability via cold‑start training combined with reinforcement learning (RL) guided by domain‑specific thinking rules. Additionally, we introduce a token compression technique that condenses device description into a single special token, substantially reducing inference overhead and enabling \model‑fast, an efficient variant optimized for long inputs. Extensive experiments show that MiCU significantly outperforms baselines, with an average accuracy gain of 20.01% across all device categories. We have deployed MiCU in the Xiaomi Home app, receiving approximately 1.7 million page views per day. Production evaluations show that MiCU reduces user correction rate by 1.57% and increases human audited accuracy by 32.05%. Our data and code are available at https://github.com/xiaomi‑research/iot_spec_llm
Authors:Zhihong Liu, Siqi Kou, Zheng Li, Ye Ma, Quan Chen, Peng Jiang, Kai Yu, Zhijie Deng
Abstract:
Crafting a product display webpage from a source product image, along with layout and visual content instructions, holds significant practical value for domains such as marketing, advertising, and E‑commerce. Intuitively, this task demands strict visual consistency across product displays and high‑fidelity instruction following to jointly generate renderable HTML code. These requirements on controllability and instruction‑following are closely aligned with the core features of advanced multimodal generative models, such as image editing models and unified models. To this end, this paper introduces ProductWebGen to systematically benchmark the product webpage generation capacities of these models. We organize ProductWebGen with 500 test samples covering 13 product categories; each sample consists of a source image, a visual content instruction, and a webpage instruction. The task is to generate a product showcase webpage including multiple consistent images in accordance with the source image and instructions. Given the mixed‑modality input‑output nature of the task, we design and systematically compare two workflows for evaluation ‑‑ one uses large language models and image editing models to separately generate HTML code and images (editing‑based), while the other relies on a single UM to generate both, with image generation conditioned on the preceding multimodal context (UM‑based). Empirical results show that editing‑based approaches achieve leading results in webpage instruction following and content appeal, while UM‑based ones may display more advantages in fulfilling visual content instructions. We also construct a supervised fine‑tuning dataset, ProductWebGen‑1k, with 1,000 groups of real product images and LLM‑generated HTML code. We verify its effectiveness on the open‑source UM BAGEL. The data and code are available at https://github.com/SJTU‑DENG‑Lab/ProductWebGen.
Authors:Sicheng Yang, Shulan Ruan, Shiwei Wu, Yu Liu, Lu Fan, Zhi Li, You He
Abstract:
While End‑to‑End (E2E) Speech‑Large Language Models (Speech‑LLMs) are rapidly evolving, their evaluation methodologies remain limited to the era of simple transcription. Existing benchmarks suffer from three critical limitations: a pronounced bias towards high‑resource languages, a focus on low‑level recognition (ASR) rather than semantic reasoning, and a neglect of regional dialects. To bridge this gap, we introduce PolySpeech‑100, a massive‑scale benchmark designed to assess `native‑level' speech comprehension across 110 linguistic variants. We employ a novel hybrid construction pipeline that augments gold‑standard human recordings with instruction‑driven synthetic speech, allowing us to cover 19 distinct Chinese dialects and over 80 low‑resource languages. Extensive evaluation of 22 state‑of‑the‑art models (including Gemini‑3, GPT‑Audio, and Qwen2.5‑Omni) yields pivotal insights. First, we demonstrate that open‑source E2E models outperform Cascade (ASR+LLM) systems on heavy dialects, proving that direct audio processing preserves critical paralinguistic cues and prosodic features (e.g., intonation, stress) that are often lost in standard transcription. Second, we reveal a significant performance gap: while commercial models maintain robustness, open‑source models suffer catastrophic degradation on low‑resource languages. Finally, counter‑intuitively, we observe that under standard zero‑shot settings, Chain‑of‑Thought prompting frequently degrades speech understanding performance for most evaluated models, revealing a potential modality alignment gap in current architectures. PolySpeech‑100 establishes a rigorous standard for the next generation of inclusive, omni‑capable Speech‑LLMs. The data, demo, and code are publicly available at https://github.com/YoungSeng/PolySpeech‑100.
Authors:Di Wu
Abstract:
Research is advancing faster than ever with artificial intelligence (AI); and so are the corresponding research papers. The exploding volume of AI‑generated papers have put a strain to peer review, leading to the usage of AI‑generated review, potentially wide yet sneaky. However, relevant ethical concerns about confidentiality, quality, and fairness are raised and no consensus has been reached in the broad research community. We expect the debate to continue for a while, but in the meantime, we ask an alternative, practical question: can AI review improve paper drafting? We study 20 computer architecture papers, with varying levels of submission lineage, to expose how well AI review aligns with human review, quantified by a set of metrics we define. To conduct the case study, we build a web UI‑integrated tool, \emphAI‑Paper‑Review, that generates structured AI review of a draft paper, available at https://github.com/unarylab/ai‑paper‑review. This tool selects several AI reviewers from a diverse pool of AI reviewers and clusters and ranks their comments based on commonality and importance of review comments. It also allows to align AI comments with human comments to facilitate metric‑based validation. The case study shows that AI review can cover a significant fraction of human‑raised issues, but also raises issues missing in human review. This paper is not intended to encourage using AI for peer review at the current stage, but to study that (1) how AI review can improve paper drafting and (2) the potential and limitation of AI‑based peer review. The release of the tool and the case study data is intended to instigate future research on this topic. Misuse for peer review would violate the ethics policies from major academic venues.
Authors:An Vuong, Minh-Hao Van, Chen Zhao, Xintao Wu
Abstract:
AI for materials science is a critical topic within AI for science, aiming to accelerate materials discovery and produce accurate property predictions. Bilayer 2D material stacking is essential for exploring new materials with novel functions and inherent phenomena, enabling the creation of new 2D bilayers for diverse real‑world applications. Research on bilayer vdWs materials has made significant progress from experimental and computational perspectives. Various bilayer materials have been successfully synthe sized experimentally and the increasing utilization of high‑throughput computing technology has con structed several computational two‑dimensional materials databases. However, the use of AI to model bilayer stacking and predict new properties remains underexplored, necessitating further research studies. In this work, we propose a novel multimodal learning approach to study the interfaces between dissimilar materials that jointly enable new or multiple functions, and to predict new properties arising from the vertical integration (stacking) of different functional material layers under given configurations. Comprehensive experiments demonstrate the effectiveness and efficiency of our approach compared to baseline methods. Our code is available at https://github.com/AnVuong123/bimat ml.
Authors:Rana Muhammad Usman
Abstract:
LLM agents increasingly act after consuming ranked external information streams such as social feeds, search results, retrieval contexts, and email queues, yet safety evaluations almost always test the model or the user prompt in isolation, never the upstream ranker that decides what the agent reads just before it acts. We introduce a controlled protocol that holds the model, persona, topic, and final decision prompt fixed and varies only the composition and ordering of the posts an agent encounters during a preceding ten‑turn "scrolling" phase, isolating the causal effect of feed curation on a downstream decision. Across 2,785 decision rollouts on four modern open instruct LLMs from three independent labs, we identify three response regimes: adversarial capitulation, default saturation, and a default‑direction asymmetry in which a one‑sided feed tips a decision the model was genuinely uncertain about (in the clearest cases from 5% to 100%; Fisher p as low as 3 x 10^‑10) but cannot dislodge one it already favors or holds firmly. The effect follows a dose‑response curve, survives a generator swap that rules out a writing‑style artifact, generalizes across several decision domains including security‑relevant choices such as removing a deployment approval gate or relaxing access controls, and is partly mitigated by two simple feed‑level defenses; a frontier model retains its default. We characterize the recommender as a practical, default‑bounded control surface for LLM agents, and argue that agent evaluations must audit the feed layer rather than the final prompt alone.
Authors:Qiao Xiao, Boqian Wu, Patrik Okanovic, Tomasz Sternal, Maurice van Keulen, Elena Mocanu, Mykola Pechenizkiy, Decebal Constantin Mocanu, Torsten Hoefler
Abstract:
Dynamic Sparse Training (DST) offers a promising paradigm for improving the training and inference efficiency of deep neural networks; however, we find that in large language model training, DST can suffer from optimization instability, manifested as loss spikes after topology updates. In this work, we show that the naive use of standard Adam‑based optimizers leads to a cold‑start issue for newly regrown parameters, resulting in excessively large updates and disrupted training dynamics. To address this issue, we propose Sparse Memory‑Efficient Training (SMET), which stabilizes DST with optimizer warm‑up and improves training progress through density‑aware learning‑rate scaling. SMET further reduces memory consumption by storing gradients and optimizer states only for active parameters. We provide a theoretical analysis of the update behaviors under SMET, showing improved optimization stability. Extensive experiments demonstrate that SMET enables stable, scalable, and memory‑efficient sparse pre‑training of LLMs, paving the way for sparse training as a practical alternative to dense training. Our code is publicly available at: https://github.com/QiaoXiao7282/SMET.
Authors:Ming Wang, Shuang Wu, Bixuan Wang, Lu Lin, Yuxin Chen, Xiaocui Yang, Daling Wang, Shi Feng, Yifei Zhang, Yufan Sun
Abstract:
Self‑report questionnaires remain the prevailing tool for probing the psychological states of persona‑conditioned agents (PC‑Agents). However, classical instruments inherit two well‑known threats: contamination from training corpora and directional bias driven by social‑desirability or contextual framing. To overcome these methodological bottlenecks, we ask whether projective paradigms can be adapted into a robust psychometric tool. We introduce GenPT (Generative Projective Testing), which reformulates TAT, Rorschach, and SCT with newly generated stimuli and organizes assessment as a three‑stage pipeline to derive standardized psychological indicators and target states. Evaluating PC‑Agents induced via CharacterRAG and AnnaAgent profiles, we benchmark GenPT's reliability and validity against classical questionnaires. The results indicate that questionnaires exhibit systematic directional shifts under social‑desirability framing, most strongly on suicide ideation. In contrast, GenPT's collected behavioral patterns stay near the symmetric baseline. Furthermore, under a longitudinal counselling context, GenPT‑based depression assessment shifts by roughly an order of magnitude more than the questionnaire counterpart when Qwen3 serves as the backbone. Overall, GenPT complements self‑report methods in scenarios where contamination resistance, bias asymmetry, and context sensitivity matter. Code and stimuli can be found at https://github.com/sci‑m‑wang/GenPT.
Authors:Xinyi Ning, Zilin Bian, Dachuan Zuo, Semiha Ergan, Kaan Ozbay
Abstract:
Accurate and reliable vehicle trajectory prediction is essential for safe autonomous driving. Recent studies have incorporated safety risk into trajectory prediction to quantify dangers posed by surrounding agents. However, most risk‑aware approaches use past risk information as a secondary signal to help guide decisions, overlooking its future evolution and uncertainty. In this paper, we propose a risk horizon profiling (RHP) module that incorporates a continuous, learnable potential field model for risk‑aware trajectory prediction. The RHP module calculates the spatial‑temporal proximity of surrounding objects to profile risk distributions across future horizons, which supports better trajectory prediction by adaptively identifying what human drivers perceive as critical moments. We evaluate our method on two datasets from different driving settings, highD for highway corridors and SHRP2 for urban streets, which cover diverse risk scenarios including safe, near‑crash, and crash events. Compared to the baseline methods, our framework achieves a 25.0% reduction in 5s RMSE on the highD dataset and a 29.1% reduction in 5s minFDE on SHRP2. These results indicate strong performance for both short and long horizon prediction and robust generalization across highway and urban scenarios. The proposed method enables more realistic AV path planning and strategic selection, thereby supporting safer autonomous driving and more advanced driver‑assistance systems. The source code for this work is available at: https://github.com/bilab‑nyu/RHP
Authors:Thanh Luong Tuan
Abstract:
Enterprise multi‑agent systems increasingly expose multiple coordination patterns, but deployments often lack evidence for when to use consensus, debate, synthesis, or a simpler single‑agent workflow. This paper evaluates whether coordination strategy should be selected dynamically by problem class rather than fixed globally. We run a frozen matrix of 30 enterprise tasks spanning six industries, five problem classes, four execution conditions, three replications per cell, and four model arms: qwen_local, sonnet, gemma_openrouter, and an auxiliary openai cloud‑validation arm. All 1,440 generated outputs are judged by a fixed Sonnet rubric. The main finding is bounded and operationally useful, but it is not the original strict H1. The pre‑registered exact‑winner/CI criterion is not supported: exact winner identity is unstable across model arms, and several predicted strategies are close to, but not above, the best observed alternative. A weaker near‑best routing claim is strongly supported. In every pre‑registered model arm and problem class, and again in the auxiliary OpenAI validation arm, the predicted strategy is within 0.10 quality‑score points of the best observed condition. Structured compliance verification is the clearest exception to the original mapping: all arms favor single_agent rather than consensus. A pre‑registered Kendall's W test finds no reliable difference between Vietnamese‑domain and English‑domain tasks in how consistently the four coordination conditions are ranked (mean W of 0.20 in both strata; signed‑rank p = .85), so H2 is not supported. We conclude that enterprise coordination policy should use dynamic routing as a calibrated default, not as a deterministic winner‑selection law.
Authors:Shihang Zhang, Mingjin Kuai, Ye Wei, Zhen Zhang, Wei Ji
Abstract:
Video Moment Retrieval (VMR) task requires accurately localizing temporal boundaries aligned with natural language queries, but many models suffer from a misalignment between continuous surrogate losses and non‑differentiable metrics, leading to optimization stagnation during the late stages of training and trapping boundary predictions in suboptimal solutions. Although Reinforcement Learning (RL) post‑training successfully optimizes localization results for large models, applying it directly to lightweight networks easily disrupts the fragile feature representations established during the supervised phase. To overcome this optimization bottleneck, we propose Gradient‑Isolated Reinforcement Learning for DETR (GIRL‑DETR), introducing RL post‑training into a lightweight temporal localization framework for the first time. The input video and text features first establish early alignment through Cross‑Modal Interaction (CMI) before entering the transformer encoder. Subsequently, a Text‑Guided Gating (TGG) mechanism dynamically injects semantic priors into the queries before the transformer decoder generates candidate proposals, providing high signal‑to‑noise ratio inputs for temporal prediction. After the supervised training reaches convergence, the backbone network is frozen to protect the feature manifold, while the detection head directly optimizes the non‑differentiable evaluation metric tIoU to enhance localization accuracy through a Three‑stage Progressive Reinforcement Learning (TPRL) strategy. This approach achieves an orthogonal decoupling of state representation and metric optimization. Experiments on Charades‑STA, QVHighlights, and TACoS demonstrate that GIRL‑DETR effectively resolves surrogate loss degradation and achieves substantial accuracy improvements with minimal parameter updates, providing a robust new pathway for RL applications in lightweight VMR models.
Authors:Mazdak Teymourian, Ramtin Moslemi, Farzan Rahmani, Mohammad Hossein Rohban
Abstract:
Adversarial Training (AT) is a leading defense against adversarial examples but often suffers from Catastrophic Overfitting (CO) in efficient single‑step variants, where robustness to multi‑step attacks collapses despite high single‑step performance. We address this failure mode with two contributions. First, we formalize Epsilon Overfitting (EO), a perspective in which fixed perturbation magnitudes and directions exacerbate CO, and show that introducing perturbation variability significantly improves robust generalization across different architectures and datasets. Second, we propose PertAlign (Perturbation Alignment), a theoretically grounded, computationally negligible metric that predicts CO onset by measuring gradient alignment across attack stages. Leveraging these insights, we introduce SORA, an adaptive step‑size AT method that dynamically adjusts perturbations based on loss surface geometry. SORA consistently prevents CO, achieves state‑of‑the‑art robustness and clean accuracy, and generalizes across datasets and architectures using a single fixed set of hyperparameters, which is essential for applicability in fast AT. Extensive experiments on diverse datasets and architectures show that SORA matches or surpasses the robustness of prior methods while delivering higher clean accuracy and superior efficiency. Code is available at https://github.com/SecondOrderAT/SORA.
Authors:Jiakang Li, Guanyu Zhu, Can Jin, Chenxi Huang, Dexu Yu, Ronghao Chen, Yang Zhou, Hongwu Peng, Xuanqi Lan, Dimitris N. Metaxas, Youhua Li
Abstract:
Strong reasoning depends not only on model knowledge but also on how effectively cognitive behaviors are deployed during generation. Existing methods often rely on explicit behavior‑level control, making them insufficiently adaptive when failures and required corrections vary across reasoning states, tasks, and models. To this end, we propose Latent Reward Steering (LRS), an adaptive inference‑time framework that promotes cognitive behaviors by optimizing the sparse‑autoencoder (SAE) latent states that implicitly carry them. Rather than relying on predefined cognitive behaviors or steering directions derived from them, LRS trains a latent reward model on reasoning traces by final answer correctness to estimate the quality of intermediate latent states. During inference, reward gradients provide state‑specific correction directions for fragile latent states, while a reward and confidence gate restricts intervention to states the reward signal flags as fragile. Experiments on multiple reasoning LLM backbones and benchmarks show that \ours consistently improves performance over various baselines, and post‑hoc analyses further indicate that \ours implicitly promotes good cognitive behaviors that fix the original reasoning errors. Code is available at: https://github.com/jiakanglee/Latent‑Reward‑Steering.
Authors:Hyundong Jin, Yo-Sub Han
Abstract:
Controlling language model outputs is essential for ensuring structural validity, reliability, and downstream usability, and diffusion language models are no exception. Recent advances in diffusion language model decoding have extended output control beyond regular constraints to context‑free grammar (CFG) constraints. Existing methods, however, can be up to four times slower than unconstrained decoding. More importantly, they substantially diminish one of the key advantages of diffusion language models over autoregressive models, namely parallel decoding. This slowdown arises because sequential validity checking introduces significant overhead during parallel generation. We propose an efficient CFG‑constrained decoding framework, EPIC, that addresses this limitation. Our method improves decoding efficiency by combining lexing memoization, validation using Earley‑style parsing instead of deterministic automata, and relaxed compatible subset selection for parallel commit. It reduces repeated lexing and validation overhead while allowing multiple compatible tokens to be committed together. Experiments on three benchmarks using four models show that our method reduces inference time by up to 67.5% and decreases the additional overhead by up to 90.5% compared with existing CFG‑constrained decoding methods. Our implementation is available at https://github.com/hyundong98/EPIC‑Decoding.git .
Authors:Sheng'en Li, Dongmian Zou
Abstract:
Online link recommendation on evolving graphs is performative: by choosing which candidate links to show users, the system changes which links form and what feedback it later observes. Consequently, fairness estimates from logged outcomes can be misleading and may drift after deployment when the recommendation policy is updated. We introduce COPF (Counterfactual Online Performative Fairness), a decision‑layer framework for deployment‑stable fairness monitoring and control in online link recommendation. COPF (i) defines group‑level opportunity gaps over exposure (shown vs. not shown) counterfactuals, (ii) makes them estimable by explicit exploration and by logging the probability (propensity) that each candidate is shown, and (iii) audits and controls fairness using residual outcome indistinguishability (OI) over a configurable auditor family with graph‑aware doubly robust (GA‑DR) estimators. We provide a noisy transfer theorem showing that Residual‑OI on estimated GA‑DR residuals implies bounds on exposure‑counterfactual group gaps under temporal mixing and bounded local interference, and we instantiate an online multicalibration auditor together with a primal‑dual controller. Experiments on two TGB streams and a controlled synthetic bipartite stream show that COPF reduces worst‑case spikes in exposure‑counterfactual group disparities with modest impact on ranking utility. Our code is available at https://github.com/lsnnnnnnnn/COPF.
Authors:Yitong Sun, Yao Huang, Teng Li, Ranjie Duan, Yichi Zhang, Xingjun Ma, Hui Xue, Xingxing Wei
Abstract:
Mixture‑of‑Experts (MoE) architectures scale Large Language Models (LLMs) efficiently, enabling greater capacity with reduced computational cost by dynamically routing inputs to relevant experts, yet introduce a critical vulnerability: Safety Sparsity, where safety capabilities concentrate in few experts, making them susceptible to adversarial bypassing. Meanwhile, conventional alignment methods uniformly adapt all parameters, ignoring their functional differences and inadvertently degrading performances. To address these challenges, we propose MESA (MoE Safety Alignment), a targeted alignment framework for MoE‑based LLMs that strategically decentralizes safety responsibility to maximize coverage while minimizing interference with utility. Based on Optimal Transport (OT) theory, MESA operates through two mechanisms: (1) Expert Capacity Reallocation uses a transport cost matrix to distribute safety duties to the most cost‑effective experts, and (2) Dynamic Routing Refinement constrains the router to precisely activate these decentralized modules. Experiments show that MESA achieves robust defensive performance against varied harmful benchmarks while preserving helpfulness. Code is available at https://github.com/lorraine021/MESA.
Authors:Qingshan Liu, Guoqing Wang, Wen Wu, Jingqi Huang, Xinqi Tao, Dejia Song, Jie Zhou, Liang He
Abstract:
Long‑horizon autonomous agents require memory systems to retain historical information, track evolving states, and reuse relevant knowledge beyond finite context windows. Existing agentic memory systems typically follow a memory construction‑retrieval (MCR) pipeline, but often adapt mainly the memory bank while keeping the surrounding pipeline fixed after deployment. This fixed‑pipeline design struggles to handle heterogeneous task‑specific failure modes and can become misaligned with memory banks that evolve in scale and structure over time. To address these limitations, we propose MemPro, a system‑level evolution framework that treats the entire MCR pipeline as an evolvable program rather than adapting only the memory bank or prompt text. MemPro maintains a version tree of runnable memory‑system implementations, where an Evolving Agent iteratively selects promising versions, diagnoses recurring failures, and creates improved child versions through failure‑mode‑guided edit‑debug refinement. Experiments on LongMemEval, LoCoMo, HotpotQA, and NarrativeQA show that MemPro consistently outperforms strong static and prompt‑level evolving baselines within a few iterations, continues to improve with evolution, and achieves a favorable performance‑cost trade‑off. Code is available at https://github.com/wanghai673/MemPro.
Authors:Shinwoo Park, Hyejin Park, Hyeseon An, Yo-Sub Han
Abstract:
Watermarking should identify language‑model output without degrading quality or limiting verification to the model provider. Multilingual deployment makes this harder because morphology, segmentation, and script change where watermark evidence can enter naturally. We introduce LUNA, a linguistically adaptive watermark that combines model‑free detection with single‑token non‑distortion under the standard random‑key model. LUNA estimates normalized next‑tag entropy from part‑of‑speech contexts in an external corpus and uses it to set the depth of a non‑distortionary binary tournament sampler; the detector reconstructs the same schedule from text, a tokenizer, a tagger, and a secret key. We evaluate six typologically diverse languages and two domains against eight primary baselines. LUNA attains an AUROC of 0.9959 and the lowest mean absolute median perplexity shift of 0.045 across the twelve settings; its 95% bootstrap interval [0.022, 0.073] lies below all baseline intervals. LUNA also records the lowest mean Self‑BLEU, Distinct‑1, surprisal, and entropy shifts. It is the only method that simultaneously achieves AUROC > 0.99 and an absolute median perplexity shift below 0.1 in a majority of settings, reaching this regime in 9 of the 12 settings while no baseline reaches it in more than 2. Our code is available at: https://github.com/Shinwoo‑Park/luna_watermark
Authors:Zhepei Hong, Lin Wang, Liting Li, Haokai Ma, Junfeng Fang, Fei Shen, Dan Zhang, Xiang Wang
Abstract:
Long‑horizon LLM agents produce safety evidence across long trajectories, where sparse, delayed, and compositional risk signals often escape local moderation. Existing turn‑level or short‑context detectors struggle to reliably retain and aggregate such evidence over extended horizons. We reframe long‑horizon agent safety detection as trajectory‑level evidence compression and propose Trajectory Risk‑Aware Compression for Long‑Horizon Agent Safety (TRACE). TRACE uses a Compressor‑Reader design: the Compressor encodes the full trajectory into a compact latent evidence state under trajectory‑level supervision, and the Reader judges the raw trajectory with this latent evidence state as a safety reference. This design helps aggregate dispersed risk cues and reduce premature evidence loss. Across ASSEBench, Pre‑Ex‑Bench, and R‑Judge, TRACE achieves the best accuracy on all evaluated backbones, improving over strong baselines by up to 12.6 percentage points. On LongSafety, TRACE shows smaller performance degradation as context length grows. Attention visualizations and case studies suggest that the compressed reference helps the Reader focus on risk‑critical segments and recover cross‑step evidence. Code is available at https://github.com/Peregrine123/TRACE_official.
Authors:Chuanjie Wu, Zhishang Xiang, Yunbo Tang, Zerui Chen, Qinggang Zhang, Jinsong Su
Abstract:
Retrieval‑Augmented Generation (RAG) has become an essential method for mitigating hallucinations in Large Language Models (LLMs) by leveraging external knowledge. Although effective for simple queries, traditional RAG struggles with large‑scale, unstructured corpora where information is highly fragmented. Graph‑based RAG (GraphRAG) incorporates knowledge graphs to capture structural relationships, enabling more comprehensive retrieval for complex reasoning. However, existing GraphRAG methods rely on isolated, fragment‑level extraction for graph construction, lacking a global perspective on the whole corpus. As a result, these methods frequently lead to thematically inconsistent, logically conflicting, and structurally fragmented graphs that degrade retrieval performance. In this paper, we propose MemGraphRAG, a novel framework that introduces a memory‑based multi‑agent system to ensure high‑quality graph construction. Specifically, MemGraphRAG employs a collaborative society of agents supported by shared memory, which provides a unified global context throughout the extraction process. This mechanism allows agents to dynamically resolve logical conflicts and maintain structural connectivity throughout the corpus. Furthermore, we propose a memory‑aware hierarchical retrieval algorithm tailored for the constructed graph. Extensive experiments on multiple benchmarks demonstrate that MemGraphRAG outperforms the state‑of‑the‑art baseline models with comparable efficiency. Our code is available at https://github.com/XMUDeepLIT/MemGraphRAG.
Authors:Qiming Shi, Zhaolu Kang, Yunfan Zhou, Di Weng, Yingcai Wu
Abstract:
Large language models are increasingly deployed as tool‑augmented agents to acquire information beyond parametric knowledge. While recent work has improved long‑horizon tool‑use reasoning, most approaches focus on tasks with a single correct answer. In contrast, many real‑world queries require discovering a comprehensive set of valid answers, a setting known as Multi‑Answer QA. This setting raises two challenges: fine‑grained credit assignment over long search trajectories and reward alignment for sustained exploration beyond easy high‑frequency entities. We propose SPADER, a reinforcement learning framework for long‑horizon tool use in Multi‑Answer QA. SPADER includes Step‑wise Peer Advantage (SPA), a critic‑free step‑level credit assignment mechanism that aligns parallel trajectories by decision step and estimates advantages from peer returns. It also includes a diversity‑aware exploration reward that promotes long‑tail entity discovery by upweighting rare findings and downweighting redundant ones. Experiments on QAMPARI, Mintaka, WebQSP, and QUEST show that SPADER generally improves recall and overall F1 over prompting‑based agents, outcome‑supervised RL methods, and recent step‑level supervision approaches. Our code and model weights are available at https://github.com/KhanCold/spader.
Authors:Jungin Park, Jiyoung Lee, Kwanghoon Sohn
Abstract:
This study introduces an intriguing phenomenon in Video LLMs: rather than merely translating frames into textual embeddings, Video LLMs establish a continuous manifold, token interface, allowing visual tokens to operate as standalone entities within the architecture. Exploiting this discovery, we propose V‑LynX, a scalable framework that integrates novel modalities into Video LLMs by repurposing the internalized interface. Departing from conventional paradigms that necessitate heavy modality‑specific encoders or paired supervision, V‑LynX employs a lightweight auxiliary pathway in parallel with the frozen vision encoder. Our method integrates new sensory inputs with intrinsic video priors by aligning both attention responses and statistical distributions using unpaired unimodal data sets. This ensures manifold compatibility while preserving the integrity of the Video LLMs. Extensive benchmarks demonstrate that V‑LynX achieves SOTA and efficiency across audio‑visual QA, 3D reasoning, high‑frame‑rate, and multi‑view video understanding. The code is available at https://github.com/park‑jungin/lynx.
Authors:Halil Ibrahim Gulluk, Max Van Puyvelde, Wim Van Criekinge, Olivier Gevaert
Abstract:
Reinforcement learning with verifiable rewards has rapidly advanced reasoning in vision‑‑language models. However, for chest X‑ray report generation, the standard rewards (i.e. exact‑match accuracy and step‑level processes) are incompatible because the reports consist of unordered and orthogonal findings, rather than a causal reasoning chain. We address this gap with a set‑based view: each report is split into sentences and embedded by a frozen sentence transformer, yielding unordered embedding sets. We propose the use of set‑to‑set distances between generated and reference embeddings as continuous, permutation‑invariant rewards. Across two datasets and three vision‑‑language models (Qwen3‑VL‑2B/4B, Gemma3‑4B), post‑training with set‑to‑set distance based rewards via GRPO consistently outperforms supervised fine‑tuning and exact‑match GRPO on all headline metrics (BERTScore, RadGraph F1 and CheXbert F1 by average %6.80, %7.82 and %4.45 relative improvements respectively). The same set distances also enable test‑time best‑of‑N selection: scoring candidates by their distance to training‑report embeddings outperforms random selection on our trained models as well as three closed‑source LLMs (Mistral‑Small, Gemini‑2.5 Flash‑Lite, GPT‑4o‑mini) with on average %16.4 relative improvement on BERTScore. Used as a streaming signal, they support a more efficient form of test‑time scaling: pruning low‑scoring candidates mid‑generation reduces generated tokens by over 50% while preserving the Findings quality of full best‑of‑N selection. Together these results establish set‑distance rewards as a unified signal for both post‑training and test‑time scaling in chest X‑ray report generation. Our code is publicly \hrefhttps://anonymous.4open.science/r/Set‑Distance‑Rewards‑CXR‑BFDAavailable.
Authors:Haoxiang Zhang, Qixin Xu, Zhuofeng Li, Lei Zhang, Pengcheng Jiang, Yu Zhang, Julian McAuley
Abstract:
Long‑horizon search agents accumulate large amounts of retrieved content across many tool calls, making context‑budget efficiency increasingly important. A minimal intervention is to mask stale observations from the context as the trajectory progresses, but it remains unclear when this form of context management helps and why. We study observation masking through a systematic sweep over various agent backbones (4B to 284B parameters) and three retrievers on offline and live‑web agentic search benchmarks. We find that the accuracy gain from masking follows an asymmetric inverted‑U shape when plotted against the model's accuracy without context management: a plateau under weak retrievers, a peak when a strong retriever meets a mid‑capacity model, and a sharp collapse when the model is saturated. This pattern reflects the interaction between retriever recall and the model's implicit filtering capacity, rather than either factor in isolation. Mechanistically, masking implements a token‑for‑turn trade‑off: it removes observations the model has largely stopped attending to and pages the agent rarely re‑opens. The added turns help when they convert failures into successes, but they fail when masking removes evidence the model would otherwise have used. We therefore reframe context management as a regime‑dependent intervention and provide a holistic perspective for analyzing context use in agentic deep search. We release our scaffold and trajectories here (https://github.com/i‑DeepSearch/observation‑masking) to support future research.
Authors:Petros Andreou, Jamie Lanyon, Axel Finke, Georgina Cosma
Abstract:
Machine unlearning removes the influence of specific training data from a trained model without retraining it from scratch. Evaluating an unlearning method requires repeating training, unlearning, and evaluation across multiple seeds, which is computationally expensive. To our knowledge, existing image classification unlearning frameworks run on a single GPU, which limits how many seeds can be evaluated in reasonable time. We introduce SUPREME, an open‑source framework that distributes these stages across multiple GPUs. SUPREME makes three contributions: a registry‑based design for adding new methods, metrics, models, and scenarios; a multi‑GPU architecture supporting multiple accelerators and precision modes; and a demonstration on Pins Face Recognition using ResNet18 and ViT under full‑class and random‑sample unlearning across ten seeds. The framework is available at https://github.com/pedroandreou/supreme‑unlearning.
Authors:Jiayi Wu, Haoming Cai, Cornelia Fermuller, Christopher Metzler, Yiannis Aloimonos
Abstract:
While Video Diffusion Models (VDMs) excel at synthesizing high‑fidelity videos, enabling precise camera and scene control remains challenging. Existing methods predominantly rely on implicit diffusion priors to generate unobserved regions, inevitably leading to structural collapse during high‑dynamic movements or complex occlusions. To address this challenge, we propose Real2SAM2Real, a framework that leverages 3D lifting models (e.g., SAM3D) to extract an explicitly editable 3D cache, serving as a robust geometric scaffold for the VDM. By capturing the entire 3D volume of foreground entities rather than just their visible shells, this cache injects holistic spatial priors into the VDM, providing dependable 3D‑aware guidance for complex scene dynamics. To effectively leverage this 3D guidance while preserving pre‑trained priors, we design a Soft Spatial‑Aligned Injection mechanism alongside a minimally invasive fine‑tuning strategy tailored for VDMs. Furthermore, we employ masked normal maps as a cross‑modal bridge to construct a 3D‑free data curation and perturbation pipeline. Extensive experiments demonstrate that Real2SAM2Real enables precise, decoupled control over both camera trajectories and multi‑entity motions. By utilizing the complementary context from generative 3D caches, our framework overcomes typical breakdowns caused by over‑reliance on diffusion priors, maintaining exceptional spatiotemporal consistency under large camera shifts and severe occlusions. Crucially, by decoupling geometry from appearance, our VDM‑tailored 3D cache eradicates perspective ambiguities caused by structural holes and erroneous facades, as well as misleading cues from reflections and refractions. Project website is available at https://jiayi‑wu‑leo.github.io/real2sam2real
Authors:Karim Habashy, Chris Eliasmith
Abstract:
Vector Symbolic Algebras (VSAs) enable robust neurosymbolic reasoning by encoding symbolic information into high‑dimensional distributed representations. For continuous domains, Spatial Semantic Pointers (SSPs) extend this framework by mapping variables onto continuous toroidal manifolds. However, standard approaches like Flow Matching assume a flat Euclidean geometry, which fails to account for the geometric constraints imposed on valid SSP states. We demonstrate that this assumption fails for SSPs: Euclidean linear interpolants ``cut through" the manifold's interior, destroying the phase and magnitude structure required for accurate decoding. To resolve this, we employ Geodesic Flow Matching, adapting Riemannian transport dynamics to strictly restrict the denoising flow to the SSP toroidal manifold. We validate this approach in a Spiking Neural SLAM system, showing that manifold‑aware cleanup stabilizes path integration against drift. The method achieves a 72% reduction in tracking error and enables a 40% increase in neural efficiency compared to competitive baselines. Code is available at https://github.com/kremHabashy/CleanupSSP .
Authors:Zakk Heile, Hayden McTavish, Varun Babbar, Margo Seltzer, Cynthia Rudin
Abstract:
Standard machine learning pipelines often admit many near‑optimal models. These "Rashomon sets" pose a range of challenges and opportunities for uncertainty‑aware, robust decision making. They allow users to incorporate domain knowledge and preferences that would otherwise be difficult to specify directly in an objective, and they quantify diversity among valid models for a given training dataset and objective function. However, computation of Rashomon sets, even for simple, interpretable model classes such as sparse decision trees, continues to require immense memory and runtime resources. We present PRAXIS, an algorithm to approximate this Rashomon set with orders of magnitude improvement in runtime and memory usage. We validate that PRAXIS regularly recovers almost all of the full Rashomon set. PRAXIS allows researchers and practitioners to scalably model the Rashomon set for real‑world datasets. Code for PRAXIS is available at https://github.com/zakk‑h/PRAXIS
Authors:Yuxiang Lin, Zihan Wang, Mengyang Liu, Yuxuan Shan, Longju Bai, Junyao Zhang, Xing Jin, Boshan Chen, Jinyan Su, Xingyao Wang, Jiaxin Pei, Manling Li
Abstract:
While agents are increasingly spending more resources, today agent cost is mostly measured only after execution. A Budget‑Aware Agent (BAGEN) should treat budget as an active control signal, rather than a passive cost metric. We first systematically define budget estimation as internal budgets (from agent computation) and external budgets (from agent actions). We then formalize budget‑awareness as progressive interval estimation: at each step of a plan, an agent should predict an upper and lower bound on remaining budget, and alert when completion is unlikely. Scoring with a rollout‑replay protocol, we find consistent failure patterns on four environments and five frontier agents: (1) strong agents do not necessarily have strong budget‑awareness, with correlation r=0.35. (2) frontier models are consistently over‑optimistic, continue spending on tasks that are unlikely to succeed, instead of alerting the user early. (3) budget‑aware signal is actionable and trainable. Early stop saves 28‑64% tokens on failed trajectories, and SFT+RL strengthens early stop and alert behavior. (4) precise interval calibration remains challenging, with interval coverage capping at 47% after SFT+RL. Project page: https://ragen‑ai.github.io/bagen/
Authors:Zheng Wang, Shuo Wang, Junhong Wang
Abstract:
In recent years, emotion recognition based on physiological signals such as electroencephalogram (EEG) has gained considerable attention, as internal physiological data offer greater objectivity and reliability compared to external behavioral data like facial expressions. However, due to distribution shifts caused by individual and contextual differences, along with variations in sample quality across modalities, constructing a cross‑domain multimodal emotion recognition model with high generalization and robustness remains a key challenge. In this study, we propose a Unified Framework with Adaptive Multimodal Alignment (UF‑AMA) to address cross‑subject and cross‑session emotion recognition using multimodal physiological signals. First, we construct a cross‑modal feature fusion network comprising Transformer encoders and multi‑head cross‑attention modules, enabling the deep integration of EEG signals and eye‑tracking data. Subsequently, we introduce a confidence‑aware screening mechanism that dynamically assesses the predictive reliability of each modality branch on target domain samples, partitions samples into different quality subsets, and accordingly applies global consistency alignment and cross‑modal distillation. Finally, we propose a multi‑level domain adaptation framework that jointly optimizes the marginal and conditional distributions of both local modality‑specific and global fusion features, thereby reducing cross‑domain distribution shifts at multiple granularities. Extensive experiments on the SEED and SEED‑IV datasets demonstrate that UF‑AMA achieves state‑of‑the‑art (SOTA) performance in both cross‑subject and cross‑session tasks. The source code is available at: https://github.com/BetterCoderLab/UF‑AMA.
Authors:Junbo Zhang, Qianli Zhou, Xinyang Deng, Wen Jiang, Jie Pan, Jinbiao Zhu
Abstract:
Large language models (LLMs) suffer from degraded safety capabilities even when fine‑tuned with benign datasets. However, existing methods for identifying safety‑degrading samples in benign datasets suffer from high computational costs and significant noise issues. In this paper, we propose DataShield to efficiently and effectively identify potential safety‑degrading samples. Our key intuition is based on the observation that benign fine‑tuning increases the overall response compliance of LLMs. DataShield's key technical insight is to quantify each sample's contribution to the model's compliance behavior as its safety degradation score. DataShield consists of three core components: (1) Compliance Vector Extraction, which captures the LLM's compliance behavior tendency; (2) a novel Compliance‑Aware Score (CAS), which automatically identifies the optimal safety‑critical layer; and (3) Safety‑degrading Sample Filtering, which quantifies the projection shift of training data along the compliance direction. Extensive experimental evaluation on Llama3‑8B, Llama3.1‑8B, and Qwen2.5‑7B using the Alpaca and Dolly benign datasets validates our method's effectiveness in identifying high‑risk and low‑risk data subsets. We also observe that open‑ended question answering is more likely to trigger safety degradation, and corresponding responses tend to be longer. We hope this work can provide new insights into data‑centric defense methods. The source code is available at: https://github.com/ZJunBo/DataShield.
Authors:Fan Wu, Lishuai Dong, Cuiyun Gao, Yujia Chen, Yiming Huang, Yang Xiao, Qing Liao
Abstract:
Recent advancements in multimodal large language models (MLLMs) have achieved remarkable progress in multimodal reasoning and code generation, catalyzing a new paradigm for front‑end development. In particular, these models can directly transform visual designs into executable code, significantly improving the efficiency and adaptability of web development. Modern web applications are dynamic and interactive, featuring frequent user‑page interactions. However, existing benchmarks largely evaluate the code generation of static webpages, ignoring the complex interactive behaviors in real‑world applications. Besides, their evaluation criteria remain confined to visual fidelity and code structure, overlooking the interaction consistency between the generated and the reference webpages. To address these limitations, we introduce WebIGBench, the first benchmark designed to evaluate code generation for interactive webpages with complex interactions. By combining manually designed interaction paths with UI automation, we collected 103 complex webpages from real‑world websites. This benchmark covers 5 popular interactive action types (e.g., click, input) involving 871 distinct interactive actions. Moreover, we propose a novel evaluation pipeline to address the gap in automated assessment of interactive actions. Extensive experiments on several representative MLLMs reveal the performance boundaries of current models in interactive webpage code generation using WebIGBench. The proposed benchmark is available at https://github.com/anoa12159‑hue/WebIGBench_eval.
Authors:Mingxuan Zhang, Jiahui Han, Dadi Guo, Songze Li, Guanchu Wang, Na Zou, Dongrui Liu, Xia Hu
Abstract:
LLM‑based agents are rapidly advancing, autonomously invoking external tools to complete multi‑step tasks for users. However, agents often acquire more sensitive information than the task requires. Existing privacy benchmarks audit what the agent's response or outgoing actions disclose, but overlook the acquisition stage where data first enters the agent's context. The over‑acquired information is then one careless action or one attack away from an outright leak. To assess its prevalence, we introduce \emphPrivacyPeek, a benchmark for evaluating acquisition‑stage privacy leakage of LLM‑based agents, with 1,182 cases across 7 acquisition behaviours and 16 application domains. Specifically, \emphAcquisition Inspection examines the agent's tool‑call trajectory, both the tools it invokes and the data it receives, to detect when it acquires sensitive information beyond the task scope. \emphProbe Elicitation then issues a follow‑up probe and measures how readily an attacker could elicit sensitive information the agent acquired but did not disclose. Our experiments on 10 LLM‑based agents across 4 model families show that the unnecessary acquisition of sensitive information is widespread. In addition, we observe a correlation between the task‑completion capability and acquisition‑stage leakage. Prompt‑level defences reduce only a small fraction of acquisition‑stage leakage, leaving the majority unmitigated. These results make auditing acquisition‑stage privacy both urgent and necessary. Our dataset and code are available at https://github.com/Xuan269/PrivacyPeek‑Resource.
Authors:Xixiang He, Baiqi Wu, Xingming Li, Ao Cheng, Qiyao Sun, Xuanyu Ji, Qingyong Hu
Abstract:
Multimodal large language models (MLLMs) often know the rule but pick the wrong answer: on abstract visual reasoning (AVR) tasks, a model can describe what it sees and name the underlying pattern, yet still fail to choose the matching candidate. Existing AVR benchmarks cannot detect this because they collapse perception, rule induction, and answer selection into a single right‑or‑wrong signal. We introduce StemBind, a shared‑stem diagnostic benchmark that probes the same visual stem with three aligned questions: Perception (what is in the image), Rule (what pattern governs it), and Full (which option completes it), so a final‑answer error can be attributed to a specific sub‑step on the same evidence. StemBind contains 2,298 curated knowledge‑light stems across nine auditable visual operations, totaling 19,533 P/R/F tasks, with each full item annotated by Sternberg's four reasoning stages (S1 Encode, S2 Infer, S3 Map, S4 Apply). Evaluating 24 frontier MLLM configurations yields four findings. (i) The R‑F chasm: rule accuracy exceeds full‑item accuracy on 22 of 24 models, so most failures happen after the rule is identified. (ii) A persistent binding gap: even when P and R are both correct on the same stem, models still answer F incorrectly 51.2% of the time. (iii) The bottleneck is S3: process diagnostics and Stage‑wise Stimulus Augmentation localize the dominant failure to rule‑to‑instance mapping. (iv) Scaling and thinking do not help: neither larger models nor explicit thinking mode reliably closes the gap, and thinking even lowers rule and full‑item accuracy. StemBind reframes AVR evaluation from final‑answer ranking to locating where abstract visual reasoning breaks down, identifying rule‑to‑instance binding as a concrete next target for vision‑grounded reasoning.
Authors:Titu Ranjan Sarker, Muhammed Jawaad Zulqernine, Ling Yue, Shaowu Pan, Chenxi Wang, Shiyao Lin
Abstract:
Finite element analysis (FEA) is the most important numerical approach for solid mechanics. Challenges of FEA include a steep learning curve for entry‑level users and potential false simulations due to incorrect definitions of key simulation components, such as boundary conditions, load cases, and solution variables. Years of engineering experience are usually necessary for real‑world problem‑solving. To address these issues, we present AbaqusAgent, a multi‑agent framework grounded in large language models (LLMs) for solid mechanics analyses. AbaqusAgent is developed to facilitate analysis case generation and execution using Abaqus, one of the most widely used FEA packages, by turning users' natural‑language instructions into executed FEA analyses and result visualization. AbaqusAgent is composed of six agents, including interpreter, architect, input writer, runner, reviewer, and visualizer agents, encompassing all the essential pre‑processing and post‑processing steps of standard FEA analyses. A wide variety of 50 solid mechanics problems have been successfully validated, achieving an overall success rate of 86%. Beyond improving the efficiency of FEA for solid mechanics problems and lowering the barrier to computational mechanics education, AbaqusAgent advances the human‑simulation interaction paradigm and enables integration with AI‑empowered optimization and material characterization workflows. The code is available at https://github.com/LIRAM‑LIN/AbaqusAgent
Authors:Ambreen Aslam, Maaz Hassan, Bibi Zahra, Muhammad Khuram Shahzad
Abstract:
Intrusion Detection Systems (IDS) in Internet of Things (IoT) environments face significant challenges due to data heterogeneity, lack of labeled data, and limited model interpretability. Federated Learning (FL) offers a privacy‑preserving solution; however, existing approaches such as SOH‑FL suffer from two key limitations: reliance on a manually tuned aggregation parameter γ and lack of explainability in model predictions. In this paper, we propose XAI‑SOH‑FL, an enhanced framework that integrates adaptive aggregation and explainable artificial intelligence into the SOH‑FL paradigm. First, we introduce a dynamic γ selection mechanism based on similarity thresholding, enabling the aggregation process to adapt to evolving data distributions. Second, Bayesian Optimization is employed to automatically determine optimal γ values, eliminating the need for manual tuning. Third, SHAP (SHapley Additive exPlanations) is incorporated to provide feature‑level interpretability for intrusion detection decisions. Experimental evaluation on the CICIDS2017 dataset demonstrates that the proposed approach achieves an accuracy of 94.12% and an F1‑score of 0.92, outperforming the baseline SOH‑FL model while converging in fewer communication rounds. Furthermore, SHAP‑based analysis reveals that flow‑level features such as Flow Duration and Packet Length significantly influence model predictions. These results indicate that XAI‑SOH‑FL provides an effective balance between accuracy, adaptability, and interpretability in heterogeneous IoT environments.
Authors:Erdem Uysal, Timo Kehrer, Sebastiano Panichella
Abstract:
Foundation models are increasingly used to drive autonomous systems, yet existing approaches either keep the model in a tight control loop, raising latency and hallucination risk, or compile natural language into opaque end‑to‑end policies that are hard to explain, constraint and require domain‑specific datasets and fine‑tuning. We propose a planner‑executor agent for PX4‑based drones that decouples high‑level mission planning from low‑level control. A large language model performs single‑pass task planning, while execution is handled through a structured ROS 2 tool‑calling interface bridged to MAVLink. The system constructs a world model by combining modular 2D detectors (e.g., YOLO or vision‑language models) with a pinhole depth projection module for 3D object localization. A constraint enforcement layer enforces altitude limits and horizontal geofencing, and bounded replanning enables recovery from execution‑time action failures. We position our approach within three common design patterns for foundation‑model‑based robotics systems and demonstrate its feasibility in PX4 software‑in‑the‑loop simulations in Gazebo. Results highlight improved explainability, constraint enforcement, and reduced LLM calls compared to tightly coupled LLM control. The code, dataset, videos, and other material can be found at the following link: https://github.com/erdemuysalx/PEACE
Authors:Huidong Feng, Wentao Chen, Jie Chen, Xinqi Cai, Ruolong Ma, Yinglin Zheng, Yuxin Lin, Ming Zeng
Abstract:
With the rapid advancement of artificial intelligence generated content (AIGC) technologies, video forgery has become increasingly prevalent, posing new challenges to public discourse and societal security. Despite remarkable progress in existing deepfake detection methods, AIGC forgery detection remains challenging, as existing datasets mainly rely on open‑source video generation models with quality far below that of commercial AIGC systems. Even datasets containing a few commercial samples often retain visible watermarks, compromising authenticity and hindering model generalization to high‑fidelity AIGC videos. To address these issues, we introduce CoCoVideo‑26K, a contrastive, commercial‑model‑based AIGC video dataset covering 13 mainstream commercial generators and providing semantically aligned real‑fake video pairs. This dataset enables deeper exploration of the differences between authentic and high‑quality synthetic videos and establishes a new benchmark for highly realistic video forgery detection. Building on this dataset, we propose CoCoDetect, a detection framework integrating contrastive learning with confidence‑gated multimodal large language model (MLLM) inference. An R3D‑18 backbone extracts spatio‑temporal representations, while a confidence gate routes uncertain cases to an MLLM for reasoning about physical plausibility and scene consistency. Extensive experiments on CoCoVideo‑26K and public benchmarks demonstrate state‑of‑the‑art performance, validating the framework's robustness and generalizability. Our code and dataset are available at https://github.com/DonoToT/CoCoVideo.
Authors:Kailing Li, Tianwen Qian, Lijin Yang, Yuqian Fu, Jingyu Gong, Xiaoling Wang, Liang He
Abstract:
Vision‑Language Navigation (VLN) enables embodied agents to reach target locations in unseen environments by following language instructions. Despite recent progress with vision‑language models (VLMs), a critical semantic‑geometric gap remains: while VLMs excel at language and 2D visual understanding, they struggle with 3D spatial reasoning and fail to capture the causal dynamics between actions and spatial transitions, resulting in unreliable navigation, particularly in zero‑shot settings. To bridge this gap, we propose a Hierarchical Semantic‑Geometric Map (HSGM) that transforms 3D geometric information into a structured representation compatible with VLMs, effectively linking them to the physical world. Specifically, HSGM is represented as a multi‑channel top‑down map organized into three levels: (1) geometric level that records navigable regions and obstacles, (2) semantic level that represents objects and their relations, and (3) decision level that supports high‑level task reasoning and goal selection. During navigation, the VLM acts as a high‑level semantic planner, interpreting the spatial layout encoded in the HSGM to select geometrically valid waypoints, while low‑level, collision‑free movements between waypoints are executed by a classical path‑planning algorithm, fully decoupling semantic reasoning from action execution. Additionally, complex instructions are decomposed into subtasks to alleviate the problem of progress forgetting or hallucinating in long‑horizon navigation. Extensive experiments on R2R‑CE and RxR‑CE benchmarks demonstrate that our zero‑shot framework achieves state‑of‑the‑art performance and even outperforms several supervised methods. Code is available at https://github.com/Teacher‑Tom/HSGM_public.
Authors:Michel Dione, Jerry Lonlac, Hélène Louis, Anthony Fleury, Stephane Lecoeuche
Abstract:
Distributed Acoustic Sensing (DAS) enables large‑scale monitoring through optical fibers, but its high dimensionality and complex spatio‑temporal patterns make event classification demanding. Existing deep learning approaches‑CNNs, recurrent models, and Transformer variants‑either fail to capture long‑range dependencies or require processing raw DAS matrices at prohibitive cost. We propose DAStatFormer, a hybrid multibranch Transformer that combines compact multidomain statistical features with Gated Transformer Networks. Instead of raw signals, we extract 24 ANOVA‑selected attributes per channel from the temporal, waveform, and spectral domains, reducing data size by orders of magnitude while preserving discriminative information. Each domain is processed via dedicated step‑wise and channel‑wise attention branches, fused by an adaptive gating mechanism. Experiments on the open Φ‑OTDR benchmark and a real‑scenario DAS dataset show that DAS‑tatFormer achieves up to 99.4% accuracy and near‑perfect real‑world performance, while using significantly fewer parameters and lower inference cost than models such as DASFormer and DeepViT. These results demonstrate its suitability for scalable, real‑time DAS‑based monitoring. We release our code at https://github.com/MichelD‑git/DAStatFormer
Authors:Jiayu Zhao, Zihan Teng, Minhao Fan, Tianrui Ma, Wentao Ren, Song Chen, Weichen Liu
Abstract:
Mixture‑of‑Experts (MoE) large language models reduce per‑token computation through sparse expert activation, but their deployment remains memory‑intensive because all expert weights must be kept resident in memory. Existing MoE compression methods struggle in the ultra‑low‑bit regime: pruning irreversibly removes model capacity, while coarse‑grained quantization fails to allocate bits according to heterogeneous expert and weight‑direction importance. We propose BitsMoE, a spectral‑energy‑guided bit‑allocation framework for MoE LLM quantization. BitsMoE decomposes each MoE layer by SVD into a shared basis and expert‑specific spectral factors, retaining the shared basis without quantization to preserve common cross‑expert structure and using the expert‑specific factors as fine‑grained quantization units. To determine the bit‑width of each unit, BitsMoE formulates spectrum‑wise mixed‑precision quantization as an activation‑aware reconstruction surrogate and solves an integer linear program that minimizes estimated reconstruction loss under a fixed bit budget. Experiments across multiple MoE LLMs show that BitsMoE substantially reduces downstream task accuracy degradation in ultra‑low‑bit regimes. Under 2‑bit quantization on Qwen3‑30B‑A3B‑Base, BitsMoE accelerates quantization by 12.3×, improves average accuracy by 27.83 percentage points, and increases decoding speed by 1.76× over GPTQ. Our model and code are publicly available at https://github.com/zjiayu064/BitsMoE.
Authors:Zhiyuan Feng, Qixiu Li, Huizhi Liang, Rushuai Yang, Yichao Shen, Zhiying Du, Zhaowei Zhang, Yu Deng, Li Zhao, Hao Zhao, Zongqing Lu, Oier Mees, Marc Pollefeys, Jiaolong Yang, Baining Guo
Abstract:
Recent progress in generalizable embodied control has been driven by large‑scale pretraining of Vision‑Language‑Action (VLA) models. However, most existing approaches rely on large collections of robot demonstrations, which are costly to obtain and tightly coupled to specific embodiments. Human videos, by contrast, are abundant and capture rich interactions, providing diverse semantic and physical cues for real‑world manipulation. Yet, embodiment differences and the frequent absence of task‑aligned annotations make their direct use in VLA models challenging. This survey provides a unified view of how human videos are transformed into effective knowledge for VLA models. We categorize existing approaches into four classes based on the action‑related information they derive: (i) latent action representations that encode inter‑frame changes; (ii) predictive world models that forecast future frames; (iii) explicit 2D supervision that extracts image‑plane cues; and (iv) explicit 3D reconstruction that recovers geometry or motion. Beyond this taxonomy, we highlight three key open challenges in this area: structuring unstructured videos into training‑ready episodes, grounding video‑derived supervision into robot‑executable actions under embodiment and viewpoint heterogeneity, and designing evaluation protocols that better predict real‑world deployment performance and transfer efficiency, thereby informing future research directions. A curated list of papers and resources is available at https://github.com/AaronFengZY/HumanCentricToVLA‑Survey.
Authors:Yichuan Mo, Yukun Jiang, Yanbo Shi, Mingjie Li, Michael Backes, Yang Zhang, Yisen Wang
Abstract:
The rapid development of Language Diffusion Models (LDMs) challenges the dominant position of auto‑regressive competitors in language processing. However, their flexible, any‑order decoding strategies not only enable fast decoding speed but also potentially bring new trustworthiness challenges. To better understand the risks behind their pipelines, we introduce a comprehensive trustworthiness benchmark tailored to LDMs (TrustLDM), evaluating safety, privacy, and fairness across different LDM architectures with multiple categories of static post contexts. Our empirical results show that although LDMs generally exhibit strong trustworthiness with only the user prompts, their alignment behavior degrades noticeably when the malicious post contexts are attached to the masked responses. We further observe that longer contexts do not necessarily induce stronger effects, and both decoding order and generation length affect the evaluation outcomes. Finally, we propose TrustLDM‑Auto, an automatic evaluation framework that leverages LDM decoding flexibility to systematically identify vulnerable configurations, revealing substantial trustworthiness weaknesses across all evaluated models and dimensions. Our work may potentially help the community build more trustworthy LDMs. Our code is available at https://github.com/PKU‑ML/TrustLDM.
Authors:Wei Tian, Yuhao Zhou, Man Lan
Abstract:
Large Language Model (LLM) based Chinese Grammatical Error Correction (CGEC) systems face two critical challenges: general‑purpose models lack specialized linguistic priors for subtle grammatical distinctions, and Supervised Fine‑Tuning (SFT) with Maximum Likelihood Estimation fails to optimize for precision‑focused metrics, leading to systematic over‑correction. We propose CSRP, a three‑stage framework that progressively builds correction capability through Continual Pre‑training (CPT) on 5.9M balanced samples to internalize domain knowledge, Chain‑of‑Thought SFT with explicit error reasoning for diagnostic transparency, and Group Relative Policy Optimization with a novel Efficiency‑Aware Reward that explicitly penalizes unnecessary edits. On the NACGEC benchmark, CSRP achieves state‑of‑the‑art performance with 50.99 F_0.5 and 57.17 precision, substantially outperforming previous best results while effectively mitigating the over‑correction bias inherent in MLE‑trained models. Our method also advances CSCD spelling correction to 59.61 F1, surpassing GPT‑4 by 5.20 points. Comprehensive ablation studies demonstrate that the RL alignment stage contributes a 8% relative gain over the SFT baseline, and that this gain is orthogonal to the contribution of large‑scale CPT, validating that explicit optimization for edit efficiency is essential for high‑quality grammatical error correction. Our code is available at https://github.com/TW‑NLP/ChineseErrorCorrector.
Authors:Hao Xu, Rite Bo, Fausto Giunchiglia, Yingji Li, Rui Song
Abstract:
Although studies have demonstrated that Large Language Models (LLMs) can perform well on Out‑of‑Distribution (OOD) tasks, their advantage tends to diminish as the distribution shift becomes more severe. Consequently, researchers aim to retrieve distributionally similar and informative demonstrations from the available source domain to boost the inference capabilities of LLMs. However, in practical scenarios where the target domain is inaccessible, evaluating the unknown distribution is challenging, which indirectly impacts the quality of the selected demonstrations. To address this problem, we propose DOPA, a demonstration search framework that incorporates an OOD proxy to approximate the inaccessible target domain and guide the retrieval process. Building on proxy‑based evaluation, DOPA further introduces a Mahalanobis distance‑based global diversity constraint to ensure sufficient diversity among the retrieved demonstrations. Experimental results on multiple LLMs and tasks demonstrate that DOPA effectively enhances robustness in OOD settings\footnotehttps://github.com/bort64/ood\_code.
Authors:Steven Johnson
Abstract:
As AI agents transition from isolated tools to collaborative participants in shared knowledge ecosystems, governing collective knowledge curation becomes a critical challenge. Human platform governance mechanisms do not transfer directly: agent statelessness undermines deterrence‑based sanctions, model homogeneity violates independence assumptions underlying crowd wisdom, and sycophancy collapses deliberative consensus. We propose a deliberative curation protocol combining three governance layers: (1) a knowledge artifact lifecycle formalized as a labeled transition system; (2) reputation‑weighted deliberative voting integrating Beta Reputation with EigenTrust amplification; and (3) graduated sanctions adapted for stateless agents, including broken agent handling distinguishing malfunction from adversarial behavior. We evaluate the protocol through agent‑based simulation with 100 agents across seven behavioral archetypes under two adversity scenarios (30 seeds, paired t‑tests). The protocol trades modest precision under benign conditions for substantially better resilience under adversity: 0.826 vs 0.791 for majority vote under moderate adversity (p<0.001), widening to 0.807 vs 0.740 under stress (p<0.001). The protocol degrades roughly three times more slowly than majority vote. Ablation analysis identifies commit‑reveal vote concealment as the most impactful single component (8.2‑8.6pp precision improvement, p<0.001), outperforming reputation weighting and deliberation combined. Graduated sanctions were not exercised in simulation and remain empirically unvalidated.
Authors:Jiazheng Xing, Hangjie Yuan, Lingling Cai, Xinyu Liu, Yujie Wei, Fei Du, Hai Ci, Tao Feng, Jiasheng Tang, Weihua Chen, Fan Wang, Yong Liu
Abstract:
Connector‑based video unified models have demonstrated strong capability in instruction‑grounded video synthesis, but integrating a large high‑fidelity generator into the unified training loop is computationally prohibitive, limiting achievable visual quality. We therefore propose Lumos‑Nexus, a training‑efficient unified video generation framework that facilitates the development of strong reasoning‑driven generation capabilities while significantly enhancing visual fidelity. Lumos‑Nexus adopts a two‑stage design: 1) During training, only a lightweight generator is aligned with the understanding block to learn to take in reasoning‑driven semantic control. 2) During inference, we introduce Unified Progressive Frequency Bridging (UPFB) to progressively hand off generation to a high‑capacity pretrained generator in the shared latent space, enabling coarse‑to‑fine refinement and producing high‑fidelity videos without compromising reasoning quality. To fill the gap in reasoning‑driven video generation benchmarks, we introduce VR‑Bench, which assesses a model's capability to translate inferred intent into coherent and semantically aligned video content. Extensive experiments demonstrate that Lumos‑Nexus achieves substantial gains in visual realism and temporal coherence on VBench, while exhibiting strong reasoning‑based generative performance on VR‑Bench. Code and models are available at https://jiazheng‑xing.github.io/nexus‑lumos‑home/.
Authors:Nianyi Lin, Jiajie Zhang, Lei Hou, Juanzi Li
Abstract:
Long‑context reasoning remains a central challenge for large language models, which often fail to locate and integrate key information in extensive distracting content. Reinforcement learning with verifiable rewards (RLVR) has shown promise for this task, yet existing methods are limited by low‑confusability distractors and sparse, outcome‑only reward signals that cannot supervise intermediate reasoning steps. To address these issues, we introduce \textscLongTraceRL. For data construction, we generate multi‑hop questions via knowledge graph random walks and leverage search agent trajectories to build \emphtiered distractors: documents the agent read but did not cite (high confusability) and documents that appeared in search results but were never opened (low confusability), producing training contexts that are far more challenging than those built by random sampling or one‑shot search. For reward design, we propose a \emphrubric reward that uses the gold entities along each reasoning chain as fine‑grained, entity‑level process supervision. This rubric reward is applied only to responses with correct final answers (positive‑only strategy), distinguishing the reasoning quality among correct responses and preventing reward hacking. Experiments on three reasoning LLMs (4B‑‑30B) across five long‑context benchmarks demonstrate that \textscLongTraceRL consistently outperforms strong baselines and encourages comprehensive, evidence‑grounded reasoning. Codes, datasets and models are available at \hrefhttps://github.com/THU‑KEG/LongTraceRLhttps://github.com/THU‑KEG/LongTraceRL.
Authors:Ulrich Prestel, Stefan Andreas Baumann, Nick Stracke, Björn Ommer
Abstract:
Self‑supervised novel view synthesis (NVS) remains challenging to scale, despite the abundance of video data, largely due to the brittleness of training on realistic videos and the hard‑to‑predict scaling behavior of multi‑network system designs. We introduce RayDer, a unified, feed‑forward transformer that consolidates camera estimation, scene reconstruction, and rendering into a single backbone, turning self‑supervised NVS into a well‑posed single‑model scaling problem. A minimal dynamic state, treated as a nuisance factor, absorbs time‑varying content and enables stable training on unconstrained real‑world video. Importantly, RayDer keeps static‑scene NVS as its target task: dynamic content is leveraged purely as scalable supervision, not reconstructed as in dynamic‑scene (4D) NVS. Across multiple model sizes and orders of magnitude in data, RayDer exhibits clean power‑law scaling with data and compute, and outperforms static‑scene data mixtures. On a large number of benchmarks, RayDer achieves strong zero‑shot open‑set performance competitive with state‑of‑the‑art supervised approaches. Project Page: https://compvis.github.io/rayder
Authors:Weitong Qian, Beicheng Xu, Zhongao Xie, Bowen Fan, Guozheng Tang, Jiale Chen, Xinzhe Wu, Mingtian Yang, Chenyang Di, Jiajun Li, Lingching Tung, Peichao Lai, Yifei Xia, Ziyi Guo, Yanwei Xu, Yanzhao Qin, Shaoduo Gan, Xupeng Miao, Bin Cui
Abstract:
Scientific research has traditionally been human‑intensive, requiring researchers to coordinate literature, ideas, experiments, manuscripts, and review responses across long project cycles. The rise of LLM‑based scientific agents creates an opportunity to automate this process. Such a system must support the full research lifecycle, maintain structured persistent memory across projects, and improve its own research procedures over time. However, existing systems either partially satisfy or fail to satisfy these requirements, leaving a gap for a unified automated scientific research system. As a result, we present AutoSci, a memory‑centric agentic system for the full scientific research lifecycle. AutoSci is organized around four modules. SciMem provides schema‑governed research memory, separating Long‑Term Knowledge Memory for reusable scientific knowledge from Active Research Memory for project‑level artifacts such as ideas, experiments, manuscripts, and reviews. SciFlow executes a five‑stage lifecycle from literature understanding to rebuttal through a harness that controls state, context, verification, feedback, and orchestration. SciDAG augments difficult skills with DAG‑shaped multi‑agent operators and reusable stage‑specific templates. SciEvolve converts feedback signals from users, experiments, reviews, and external environments into versioned updates to SciMem organization, SciFlow skills, and SciDAG templates. Together, these modules make AutoSci a persistent research environment that can execute, remember, and evolve across research projects. The code repository is available at https://github.com/skyllwt/AutoSci.
Authors:Zaid Khan, Justin Chih-Yao Chen, Jaemin Cho, Elias Stengel-Eskin, Mohit Bansal
Abstract:
GPU kernels are the workhorse of modern deep learning, and optimizing them (via evolutionary search or coding agents) usually requires repeated measurement on target hardware. While these measurements provide the ground‑truth signal necessary for kernel search, they are costly, because each evaluation of a kernel requires compilation and repeated execution on a GPU. As improvements in LLM inference reduce the cost of writing novel kernels and LLM‑driven searches scale to large search budgets, on‑device evaluation becomes a bottleneck. To address this, we study how LLMs can serve as selective GPU surrogates for kernel evaluation, by forecasting the performance of proposed kernels. A useful surrogate should be accurate, and it should be selective, by knowing when it could be wrong, and deferring to the GPU. To evaluate surrogates, we measure whether their forecasts are accurate, calibrated, and practically useful for recovering fast kernels under limited GPU‑measurement budgets. Next, we study whether reinforcement learning can improve forecast accuracy and confidence calibration. Our experiments demonstrate that LLMs can accurately forecast relative kernel performance, that their utility can be improved through reinforcement learning. Used inside a kernel search, the surrogate lets the search consider several times as many candidates under the same GPU evaluation budget, and that leads to finding faster kernels than an equal‑budget baseline. These results suggest that LLMs can play a broader role in kernel optimization, by acting as virtual models of a GPU rather than solely as kernel generators for search.
Authors:Xudong Zhang, Jian Yang, Shengkai Wang, Jiangpeng Tian, Shaowen Chen, Xian Wei, Ke Li, Xiong You
Abstract:
Large Language Model (LLM)‑based navigation systems commonly construct explicit spatial representations (e.g., topological graphs, semantic raster maps) and translate them into textual descriptions as LLMs' inputs. However, the linguistic structures of such text‑based spatial representations and the choices of contextual features (e.g., topology, geometry) they contain are often treated as neutral engineering decisions rather than key factors that shape LLMs' behavior. To fill the gap, we propose a dual‑interventional framework that disentangles linguistic structures from different contextual cues to evaluate the linguistic inductive bias of LLMs for navigation planning. In the framework, representation intervention varies the linguistic format and the degree of linguistic compression, clarifying when linguistic representations support or inhibit navigation planning. Context intervention, combined with contextual feature combination and conflict probing, explicitly clarifies the preferences and weaknesses of LLMs when processing different contextual cues. Experiments across diverse spatial reasoning tasks and multiple model scales reveal a consistent pattern: topological information is a sturdy shield and the backbone of robust planning; linguistic format is a double‑edged sword whose effect depends on model size, task demands, and the compression level; and semantic information is a fatal Achilles' heel ‑‑ incorrect semantic cues can systematically derail the planning process. Overall, our study shows that effective text‑based spatial representations in LLM‑based navigation should preserve topological integrity, calibrate representational compression to model capacity, and ensure semantic correctness, rather than simply adopting a single representation. Our code is publicly available at https://github.com/jonesdong150/LLM‑Navigation‑Inductive‑Bias.
Authors:Yisen Gao, Yixi Cai, Tianshi Zheng, Jiaxin Bai, Yangqiu Song
Abstract:
Abductive reasoning over knowledge graphs aims to generate logical hypotheses that explain observed entities or facts. Existing controllable hypothesis generation methods allow users to guide this process with explicit conditions, but they remain limited in interactive settings: they struggle to ground evolving natural‑language intents across multi‑turn dialogues and provide little fine‑grained diagnosis when generated hypotheses fail. To address these limitations, we propose HypoAgent, an Agentic framework for interactive abductive Hypothesis Generation over knowledge graphs. HypoAgent integrates three agents: an Intent Recognition Agent that grounds user utterances and dialogue history into executable KG conditions, a Hypothesis Generation Agent that performs controllable hypothesis generation according to the extracted user intention, and a Root Cause Analysis Agent that diagnoses unreliable hypothesis fragments and leverages KG neighborhood probing to identify supported refinements. Experiments on commonsense and biomedical domain‑specific knowledge graphs demonstrate that HypoAgent achieves state‑of‑the‑art semantic similarity under single‑turn, multi‑turn, and unconditional settings. Our code is available at https://github.com/HKUST‑KnowComp/HypoAgent.
Authors:Grégoire Martinon, Ibrahim Merad, Mohammed Raki
Abstract:
Reliable evaluation of agentic systems requires unbiased estimates with valid uncertainty, but standard practice navigates between costly human annotation and biased LLM‑as‑judge proxies. Prediction‑powered inference (PPI) combines both into debiased estimates with valid confidence intervals, yet its various methods remain scattered across papers under partial implementations. We introduce GLIDE, an open‑source Python library that unifies state‑of‑the‑art PPI estimators (PPI++, Stratified PPI, Predict‑Then‑Debias and its stratified variants, Active Statistical Inference) and samplers (uniform, stratified, active, cost‑optimal) under a scipy‑style API specialized to mean estimation. GLIDE ships with a reproducible Monte Carlo validation suite, an empirically grounded decision tree for method selection, and an agentic evaluation case study showing substantial annotation savings at equivalent precision. The GLIDE package is available at this URL: https://github.com/EmertonData/glide
Authors:Nan Bao, Yifan Zhao, Wenzhuang Wang, Jia Li
Abstract:
The layout‑to‑image (L2I) task enables fine‑grained control over image generation via object categories and spatial layouts. However, existing L2I methods yield fragmented and distorted generations under few‑shot atypical settings. We term this failure as representation fragmentation, arising from a granularity mismatch that entangles semantic identity with visual details. To address this issue, we propose a representation‑driven framework that disentangles semantics from primitives for robust few‑shot adaptation. Specifically, Semantic Anchoring aggregates categorical semantics into anchors for stable identity, while Primitive Imbuing models recomposable primitives for robust local detail modeling. Conceptual Steering further regulates optimization with a saliency‑aware objective to preserve foreground semantic consistency. Extensive experiments demonstrate consistent improvements in the 5‑shot regime over state‑of‑the‑art L2I methods in both visual fidelity and alignment across diverse atypical domains. The source code is publicly available at https://github.com/iCVTEAM/DSP.
Authors:Kaiwen Xue, Tao Wei, Guoxin Zhang, Zhonghong Ou, Kaoyan Lu, Yu Feng, Yifan Zhu, Haoran Luo
Abstract:
Multimodal large language models (MLLMs) have shown strong potential as embodied agents, yet embodied geo‑localization remains underexplored due to the lack of fine‑grained evaluation. We introduce ERGeoBench, a diagnostic benchmark for vision‑driven embodied geo‑localization. ERGeoBench evaluates models under three progressive settings ‑‑ single‑view, panorama‑view, and embodied‑view ‑‑ where agents may actively acquire observations through sequential changes in yaw, pitch, and zoom. The benchmark contains 2,207 globally distributed street‑view panoramas and measures four complementary capabilities: foundational perception, spatial awareness, common sense reasoning, and geo‑localization reasoning. Evaluations of leading proprietary and open‑source MLLMs show that current models can infer high‑level geographic semantics, but still struggle with fine‑grained perceptual operations, metric localization, and spatial consistency across views. We further observe that geo‑localization is strongly correlated with the other capability dimensions, suggesting that accurate localization depends on integrated perception, spatial reasoning, and commonsense inference rather than isolated visual recognition. Overall, ERGeoBench provides a unified framework for diagnosing and advancing human‑like embodied geo‑localization. Project Page: https://kaixuewen.github.io/ERGeoBench/
Authors:Tom Lucas, Alessio Buscemi, Alfredo Capozucca, German Castignani, Barbara Delacroix
Abstract:
Assessing whether Large Language Models outputs are factually grounded, epistemically calibrated, and methodologically reproducible is a prerequisite for responsible AI deployment. Yet auditing LLMs remains inaccessible to non‑technical practitioners: existing tools require programming expertise and non‑trivial environment setup, and cloud‑hosted platforms transmit evaluation data to external services, creating barriers for domain experts and compliance officers legally responsible for AI oversight. We introduce LLM‑FACETS (LLM FActuality Cross‑EvaluaTion System): an open‑source framework with a browser‑accessible interface and a plugin architecture, structured around three practitioner profiles (technical experts, domain experts, compliance officers) that mirror the stakeholder categories identified in the EU AI Act and the NIST AI Risk Management Framework. The architecture makes data flows explicit: deterministic metrics (BLEU, ROUGE, BERTScore) run entirely within the self‑hosted server with no outbound transmission; LLM‑judge metrics contact external APIs explicitly, with users retaining full credential control. The framework operationalizes transparency through three mechanisms: token‑level log‑probability visualization for epistemic uncertainty, multi‑judge consensus to mitigate judge bias, and RAG Triad metrics (Faithfulness, Answer Relevance, Context Relevance) to detect and localize hallucinations. A plugin architecture allows any new metric or dataset to be integrated without modifying the evaluation pipeline. The open‑source implementation enables cross‑checking across multiple metrics targeting the same property, ensuring reproducibility and decoupling AI accountability from the teams building the systems assessed. We verify the framework through cross‑validation of 18 metric implementations against canonical reference libraries.
Authors:Yuanjian Xu, Jianing Hao, Guang Zhang, Zhong Li
Abstract:
Training data plays a central role in large language models (LLMs) optimization, motivating extensive research on data scheduling strategies. Most existing approaches concentrate on adjusting the overall data distribution but neglect the underlying interactions between samples during training. However, we argue that such interactions cannot be overlooked, as real‑world data samples frequently exhibit directional influences on each other, making the training order crucial. Intuitively, we can prioritize train‑units with greater influence to improves learning efficiency. In this work, we propose D^3, a Dynamic Directional graph‑constrained Data scheduling framework. D^3 formulates the complex interactions among train‑units as a dynamic influence graph, where edges represent loss‑based dependencies. It then solves a constrained optimization problem over this graph to derive the training order, which ensures that the data sequence respects the evolving information flow throughout training. Our approach is theoretically motivated and yields consistent improvements over existing data scheduling methods across both pre‑training and post‑training phases. Furthermore, for scalability, D^3 also employs an efficient approximation algorithm that keeps the additional computational overhead within a manageable range. For future research, the code is available at https://github.com/xuyj233/D3.
Authors:Ziying Chen, Yang Cao, He Sun, Beining Yang, Tianjian Yang
Abstract:
We study Vector Linking: given two embedding clouds produced by different black‑box encoders over partially overlapping datasets, recover cross‑model object correspondences using only vectors. Empirically and theoretically, we show that independently trained contrastive encoders exhibit local geometric consistency: short‑range distances are approximately preserved up to a scale factor, while long‑range distances are not due to model‑specific distortion. Building on this, we propose an iterative, reference‑based geometric embedding hashing that recovers vector links from a tiny seed set of paired anchors. It represents each vector by distances to sampled paired anchors, proposes candidate links via hash‑space matching, and aggregates evidence across views in a Beta‑Bernoulli posterior to bootstrap high‑confidence links as new anchors. Experiments across multiple benchmarks and embedding model pairs demonstrate accurate and robust linking under varying overlap, seed budgets, and out‑of‑domain anchors, with applications to vector database integration and cross‑model clustering. Code is available at https://github.com/DBgroup‑Edinburgh/VecLinking.
Authors:Chunlei Li, Zixuan Zheng, Yilei Shi, Guanglu Dong, Pengfei Li, Jingliang Hu, Xiao Xiang Zhu, Lichao Mou
Abstract:
Mislabeled samples in training datasets severely degrade the performance of deep networks, as overparameterized models tend to memorize erroneous labels. We address this challenge by proposing a novel approach for mislabeled data detection that leverages training dynamics. Our method is grounded in the key observation that correctly labeled samples exhibit consistent entropy decrease during training, while mislabeled samples maintain relatively high entropy throughout the training process. Building on this insight, we introduce a signed entropy integral (SEI) statistic that captures both the magnitude and temporal trend of prediction entropy across training epochs. SEI is broadly applicable to classification networks and demonstrates particular effectiveness when integrated with contrastive language‑image pretraining (CLIP) architectures. Through extensive experiments on four medical imaging datasets ‑‑ a domain particularly susceptible to labeling errors due to diagnostic complexity ‑‑ spanning diverse modalities and pathologies, we demonstrate that SEI achieves state‑of‑the‑art performance in mislabeled data identification, outperforming existing methods while maintaining computational efficiency and implementation simplicity. Our code is available at https://github.com/MedAITech/SEI.
Authors:Jiejun Tan, Zhicheng Dou, Xinyu Yang, Yuyang Hu, Yiruo Cheng, Xiaoxi Li, Ji-Rong Wen
Abstract:
LLM agents are evolving from conversational chatbots to operational tools in real‑world workspaces. In local agentic harnesses, an LLM can read and write files, call tools, and reuse workspace state across sessions. While such capabilities enhance utility, they also expose a new attack surface for attackers. Attackers can embed a prompt injection within a file or tool output. Agents may read this hidden instruction, store it, and execute it later. In this multi‑step trojan attack paradigm, no individual step appears malicious on its own, but these steps can collectively turn untrusted text into persistent control content. However, existing defenses often inspect each step in isolation. As a result, they can block a clear harmful action, but fail to detect the earlier write operation that plants the backdoor. To reveal this threat, we introduce ClawTrojan, a benchmark designed to identify multi‑step trojan attacks in local agentic harnesses. In an OpenClaw‑style simulated workspace with GPT‑5.4, ClawTrojan reaches a 95.5% attack success rate (ASR), while existing single‑turn prompt‑injection attacks produce near‑zero ASR on the same model. To address this threat, we propose DASGuard, which scans control‑like text in sensitive local files, traces its origin, and removes control content that does not originate from a trusted source. Our results show that DASGuard achieves strong dynamic defense by combining runtime attack blocking with sanitized commits to the workspace.
Authors:Jyotirmoy Singh, Anushka Roy, Shreea Bose, Chittaranjan Hota
Abstract:
Anomaly detection in physiological sensor data from Wireless Body Area Networks (WBANs) can be caused by sensor faults, network disruptions, or missing data, leading to false alarms. Hence, it demands both high predictive accuracy and clinically interpretable explanations. Existing approaches rely either on black‑box models that achieve strong performance but offer no transparency, or on post‑prediction explanation methods such as SHAP and LIME. In this paper, we propose the Distilled Explanation Model (DEM), a three‑stage glass‑box framework that distills the non‑linear knowledge of a gradient boosting expert into an interpretable decision tree operating on residuals relative to a linear baseline, so that the explanation is not an approximation but the prediction itself. DEM introduces a novel distillation fidelity metric that quantifies how faithfully the explanation tree captures the expert model's non‑linear contribution, providing a principled measure of explanation trustworthiness absent from prior interpretable models. Evaluated across four physiological datasets, including MIMIC‑IV, WESAD, eICU, and an in‑house SmartNet WBAN corpus, DEM achieves an AUC of 0.9964 on clinical contextual anomaly detection and 0.9047 on wearable stress detection while producing human‑readable if‑then rules at a controllable depth. Inference requires 0.17ms per 1000 samples, rendering DEM 1235x faster than SHAP‑based post‑hoc explanation and suitable for real‑time physiological monitoring. Ablation studies confirm that the XGBoost distillation step provides measurable gains over naive residual fitting, and depth‑sensitivity analysis demonstrates an explicit, user‑controlled accuracy‑interpretability trade‑off unique to DEM among existing intrinsically interpretable models.
Authors:Jun-Hak Yun, Seung-Bin Kim, Seong-Whan Lee
Abstract:
Recent advancements in text‑guided audio generation have yielded promising results in diverse domains, including sound effects, speech, and music. However, jointly generating speech with environmental audio remains challenging due to the inherent disparities in their acoustic patterns and temporal dynamics. We propose ImmersiveTTS, an environment‑aware text‑to‑speech (TTS) model that generates natural speech seamlessly integrated within environmental contexts by explicitly modeling cross‑modal interactions. Our model builds on a multimodal diffusion transformer and fuses transcript‑aligned speech latent with text‑conditioned environmental context via joint attention. To enhance semantic consistency, we introduce a domain‑specific representation alignment objective tailored to environment‑aware TTS, leveraging complementary self‑supervised representations from speech and audio encoders. Experimental results show that ImmersiveTTS achieves higher naturalness, intelligibility, and audio fidelity than existing approaches across objective metrics and human listening tests.
Authors:Jiaxin Bai, Yue Guo, Yifei Dong, Jiaxuan Xiong, Tianshi Zheng, Yixia Li, Tianqing Fang, Yufei Li, Yisen Gao, Haoyu Huang, Zhongwei Xie, Hong Ting Tsang, Zihao Wang, Lihui Liu, Jeff Pan, Yangqiu Song
Abstract:
Text‑agent environments are typically modeled as partially observable Markov decision processes (POMDPs), assuming that the simulator's latent state and transition dynamics are hidden from the agent. Yet little work has examined whether executable code can be induced to serve as a world model for prediction and planning under partial observability. We introduce PatchWorld, a gradient‑free framework that turns offline trajectories into executable Python world models through counterexample‑guided code repair. Instead of predicting the next observation with a black‑box model, PatchWorld induces symbolic belief‑state programs whose action updates can be inspected, replayed, and locally patched. Across seven AgentGym environments, PatchWorld‑Simple achieves the highest code‑based planning score among evaluated methods, reaching 76.4% macro success in live one‑step lookahead while invoking no LLM calls inside the world‑model prediction module itself. We further find that a human‑specified residual‑memory bias improves surface observation fidelity but weakens decision utility. This exposes a tradeoff in executable world models, since improving observation fidelity can come at the expense of action‑discriminative dynamics, and vice versa. Code is available at https://github.com/HKBU‑KnowComp/PatchWorld.
Authors:Karthika Arumugam, Kiran Kumar Manku, Amit Dhanda
Abstract:
Language models fine‑tuned with reinforcement learning typically optimize for task reward, ignoring multi‑agent strategic structure. Because these agents condition on natural language game‑state descriptions and emit actions through free‑form generation, strategic failure modes ‑‑ exploiting weaker opponents, coordinating on harmful equilibria, and externalizing costs are inseparable from the language interface itself. We propose Safe Equilibrium Policy Optimization (\sepo), a training objective that augments expected payoff with explicit penalties for exploitability, collusion risk, and externality cost. We implement \sepo as a reward signal for Group Relative Policy Optimization (GRPO), applied to Gemma~4 E4B‑it and Qwen~3.5‑4B after supervised fine‑tuning (SFT). Evaluated across five strategic domains: Iterated Prisoner's Dilemma, repeated auctions, two negotiation variants, and Kuhn Poker. \sepo achieves zero exploit‑pool advantage in Kuhn Poker for both models, outperforms the base model on safety in four domains, and corrects the over‑cooperative behavior introduced by SFT. In negotiation, \sepo achieves a positive‑safety outcome and only the positive normalized relative advantage of any negotiation configuration. Ablation experiments confirm that per‑rollout exploit computation is necessary: a shared constant penalty cancels in GRPO advantage normalization (constant control‑variate property), producing zero gradient. To support further research in strategic safety for agents, we release our \hrefhttps://anonymous.4open.science/r/sepo‑2668/README.mdcode and SFT datasets.
Authors:Yuwei Cheng, Weiyi Tian, Haifeng Xu
Abstract:
Fine‑tuning is often believed to reduce uncertainty and diversity in large language models, but existing analyses overlook output length, a key confounder, and therefore fail to capture how uncertainty is distributed across an entire generation rollout. To address this, we propose Canopy Entropy (\mathrmCE^\star), a measure that views language generation from a tree perspective, where ``canopy'' represents the space of all possible rollouts, making \mathrmCE^\star naturally quantify the effective size of the generation space. \mathrmCE^\star jointly captures uncertainty in both the output length N and the generated sequence Y_1:N ‑‑ indeed, we show that it equals to total Shannon entropy H(N, Y_1:N\mid X), where X denotes the prompt. This formulation yields interpretable metrics, including a length‑entropy correlation term ρ(N, r_N), where r_N is the entropy rate, quantifying information conveyance efficiency by indicating whether longer outputs are more or less informative per token. Empirically, across tasks and model families, we find that fine‑tuned models consistently exhibit stronger positive correlation ρ(N, r_N), even when total entropy decreases. Furthermore, after controlling for model family, task, prompt, and output‑length effects, we find that fine‑tuning nearly triples the correlation strength between entropy rate and semantic diversity, suggesting that aligned models convert token uncertainty into semantic diversity more efficiently. Overall, these results demonstrate that fine‑tuning does not simply reduce uncertainty, but fundamentally reorganizes it into more informative and semantically meaningful generations. Our code is available at https://github.com/WeiyiTian/canopy‑entropy.
Authors:Fengyu Gao, Jing Yang
Abstract:
Preference alignment is a crucial post‑training step for large language models (LLMs) to ensure their outputs align with human values. However, post‑training on real human preference data raises privacy concerns, as these datasets often contain sensitive user prompts and human judgments. To address this, we propose DPPrefSyn, a novel algorithm for generating differentially private (DP) synthetic preference data to enable privacy‑preserving preference alignment. DPPrefSyn is a principled framework grounded in the Bradley‑Terry preference model and the intrinsic geometric structure of pairwise human preference data. It first learns an underlying preference model from private data with formal differential privacy guarantees, and then leverages the learned model together with public prompts to synthesize high‑quality preference data. It exploits the shared linear structure of per‑cluster reward models to effectively capture heterogeneous human preferences in private datasets, and leverages DP Principal Component Analysis (DP‑PCA) to improve learning accuracy. Extensive experimental results demonstrate that DPPrefSyn achieves competitive alignment performance under strong DP guarantees. These findings highlight the potential of synthetic preference data as a practical alternative for privacy‑preserving preference alignment across a broad range of applications. To the best of our knowledge, this is the first work to generate DP synthetic preference data for LLM alignment. Our code is available at https://github.com/gfengyu/Differentially‑Private‑Preference‑Data‑Synthesis.
Authors:Yanjie An, Yuxiang Zhao, Yichi Zhang, Qixi Zheng, Yujie Tu, Keqi Deng, Kai Yu, Xie Chen
Abstract:
Speech translation systems increasingly span speech‑to‑text translation (S2TT), speech‑to‑speech translation (S2ST), offline translation, and streaming generation, producing outputs that differ in modality, speech realization, and timing behavior. Existing evaluation practices assess important aspects such as translation quality, speech quality, and temporal quality, but these aspects are often evaluated under separate protocols, making it difficult to compare heterogeneous systems comprehensively. To address this gap, we present OpenSTBench, a unified multidimensional evaluation framework that organizes heterogeneous speech translation outputs into a shared evaluation format. OpenSTBench supports both S2TT and S2ST systems in offline and streaming settings, and jointly evaluates translation quality, speech quality, speaker preservation, emotion and paralinguistic fidelity, temporal consistency, and latency. Through experiments on representative speech translation systems, we show that systems with strong translation quality can still differ substantially in speech quality, as well as in temporal quality. OpenSTBench provides a reproducible protocol for analyzing these cross‑dimensional differences and supporting application‑oriented comparison of speech translation systems. The code and datasets are available at https://github.com/sjtuayj/OpenSTBench.
Authors:Haoxiang Cheng, Yunfei Wang, Chao Chen, Kewei Cheng, Zhipeng Lin, Haoxuan Li, Changjun Fan, Shixuan Liu
Abstract:
Logical rules constitute a cornerstone of knowledge graph (KG) reasoning, valued for their interpretability and ability to model relational patterns. However, existing rule mining methods predominantly focus on simple chain‑like rules and therefore neglect the richer relational information encoded in graph‑like structures, such as cycles and branches. This limitation is further exacerbated by computational bottlenecks caused by the combinatorial explosion of the search space, which is especially challenging for graph‑like rules. Meanwhile, generative approaches such as diffusion models, despite their success in other domains, cannot be directly applied to rule mining because their training objectives are not aligned with the goal of learning high‑quality rules, and non‑differentiable KG rule quality metrics cannot directly guide model optimization. To address these limitations, we propose GRiD, a framework that reformulates graph‑like rule discovery as a discrete generative process conditioned on the target relation. GRiD employs a two‑phase training strategy. First, supervised pre‑training enables GRiD to capture structural priors from subgraphs sampled from the KG meta‑graph. Subsequently, reinforcement learning is applied to fine‑tune GRiD through policy gradient optimization guided directly by non‑differentiable rule‑quality metrics. Experiments on six benchmark datasets show that GRiD achieves competitive performance on KG completion tasks. Ablation studies confirm the efficiency and robustness of GRiD and further show that graph‑like rules complement chain‑like rules in KG completion. Our code and datasets are available in https://github.com/Haoxiang‑Cheng/GRiD.
Authors:Tianrun Yu, Kaixiang Zhao, Chih-Chun Chen, Amanda Hughes, Taylor W. Killian, Fenglong Ma, Weitong Zhang, Porter Jenkins
Abstract:
We study trajectory selection for reasoning distillation, where teacher‑generated reasoning trajectories are selectively used as supervision for a student model. Existing methods rely on heuristics such as trajectory quality or model confidence, but they often overlook whether a trajectory is learnable by the student. In this paper, we present LARK, a learnability‑grounded method for reasoning trajectory selection. LARK selects trajectories that the student can learn efficiently while preserving the generalization of the full training distribution. At the core of LARK is a learnability factor ρ, which characterizes the rate at which the student's training loss decreases. To estimate this rate efficiently and maintain generalization, we introduce a learnability proxy and a χ^2‑regularized selection policy that balances learnability and distributional coverage, both with strong theoretical guarantees on their estimation error. Empirically, LARK consistently outperforms data selection baselines across multiple base models and reasoning tasks. Diagnostic analyses show that the LARK score predicts downstream training utility and that LARK‑selected trajectories induce faster supervised fine‑tuning loss reduction. Our code is available at https://github.com/Tianrun‑Yu/LARK.
Authors:Yuhang Jiang
Abstract:
Embodied agents have made strong progress in navigating to target objects, but reaching the goal vicinity does not guarantee that the agent has found the correct instance: subtle attribute differences (e.g., "white floral" vs. "white striped") often require close‑range, multi‑view inspection. We address this gap with Active Instance Verification (AIV), a task in which an agent actively selects viewpoints around a candidate object to decide whether it matches a fine‑grained natural‑language description. We formalize AIV as a finite‑horizon decision process and introduce PInVerify, an offline embodied benchmark for AIV: 3,000 evaluation episodes across 18 object categories, delivered as multi‑view captures with a 6‑sector navigation topology that exposes trap views (navigable but uninformative) and unreachable sectors. As reference baselines we build a training‑free pipeline and a LoRA‑fine‑tuned end‑to‑end agent around open‑source multimodal large language models (MLLMs) at on‑device scale (\leq8B parameters), with attribute decomposition, a visibility‑weighted multi‑view tracker, and three next‑best‑view (NBV) strategies. In our evaluation across Qwen3‑VL (4B/8B), SenseNova‑SI‑1.2‑InternVL3‑8B, CLIP, and SigLIP2, the best MLLM‑based baseline exceeds the best embedding baseline by 4.9 pp; GT‑box ablations show a +3.1 pp detection gap; and we do not observe reliable gains from active viewpoint selection within the tested NBV strategies. A LoRA‑fine‑tuned agent (SFT+GSPO) reaches 85.6%. PInVerify aims to support further work on active, fine‑grained semantic verification in embodied AI. Code: https://github.com/Avalon‑S/PInVerify.
Authors:Minhua Lin, Juncheng Wu, Zijun Wang, Zhan Shi, Yisi Sang, Bing He, Zewen Liu, Tianxin Wei, Zongyu Wu, Zhiwei Zhang, Dakuo Wang, Xiang Zhang, Benoit Dumoulin, Cihang Xie, Yuyin Zhou, Suhang Wang, Hanqing Lu
Abstract:
LLM agents are increasingly deployed as systems built around editable external harnesses, including prompts, skills, memories and tools, that shape task execution without changing model parameters. Harness self‑evolution adapts such agents by updating these harnesses from execution evidence. Yet it remains unclear whether a model's base capability in task‑solving predicts its capabilities in harness self‑evolution: which models produce useful harness updates, and which actually benefit from them? We analyze two harness self‑evolution capabilities: (i) harness‑updating, the capability to produce useful persistent harness updates from execution evidence; (ii) harness‑benefit, the capability to benefit from updated harnesses during task solving. Our analysis reveals two findings. First, harness‑updating is flat in base capability: models from different capability tiers produce harness updates that lead to surprisingly similar gains; even Qwen3.5‑9B's updates yield gains comparable to those of Claude Opus~4.6. Second, harness‑benefit is non‑monotonic in base capability: weak‑tier models benefit little from updated harnesses, mid‑tier models benefit most, and strong‑tier models benefit less than mid‑tier. We trace low gains at the weak tier to two failure modes: weak‑tier models may fail to activate relevant harness artifacts, or activate them but fail to follow them faithfully. These findings suggest investing capability budget in the task‑solving agent rather than the evolver, and targeting harness invocation and long‑horizon instruction following in agent training. Our source code is publicly available at https://github.com/A‑EVO‑Lab/a‑evolve/tree/release/harness‑evolution.
Authors:Haozhe Zhao, Shuzheng Si, Zhenhailong Wang, Zheng Wang, Liang Chen, Xiaotong Li, Zhixiang Liang, Maosong Sun, Minjia Zhang
Abstract:
Scientific figures are among the most effective means of communicating complex research ideas, yet producing publication‑quality illustrations remains one of the most labor‑intensive parts of paper preparation. Existing automated systems each target a single figure type under text‑only input, leaving the diversity of types and conditions researchers actually use unaddressed; their raster outputs further cannot be locally revised. Because scientific figures are structured compositions of discrete semantic components, the localized errors generators produce on such layouts demand not a stronger backbone but a harness. We instantiate this harness in two complementary systems: Crafter, a multi‑agent harness for figure generation that generalizes across figure types and input conditions without architectural changes, and CraftEditor, which applies the same pattern to convert raster outputs into editable SVGs. Moreover, we introduce CraftBench, a benchmark spanning three figure types and four input conditions with human quality annotation. Experiments show that Crafter substantially outperforms both standalone generators and the agentic baseline on PaperBanana‑Bench and CraftBench, with ablations confirming each component's independent contribution; CraftEditor faithfully converts outputs into editable SVGs that surpass all baselines. Our code and benchmark are available at https://github.com/HaozheZhao/Crafter.
Authors:Yue Zhang, Zun Wang, Han Lin, Yonatan Bitton, Idan Szpektor, Mohit Bansal
Abstract:
Spatial reasoning is a fundamental capability for vision‑language models (VLMs) deployed in real‑world environments. However, visual observations are inherently limited representations of a 3D world: occlusion can render objects invisible, and perspective can make geometric properties misleading. Despite this, existing spatial reasoning benchmarks typically assume that observations are sufficient and reliable, focusing on whether models produce correct answers rather than whether they recognize when a question cannot be answered and what additional observations would be needed. In this work, we challenge this assumption by constructing a controlled evaluation framework, SpatialUncertain, and introducing two types of observation challenges: (1) occlusion, which hides target information, and (2) perspective ambiguity, which produces misleading visual cues. For each configuration, we design spatial questions that are answerable under clean observations but require abstention under the introduced challenges. We further evaluate whether models can identify which additional viewpoints would resolve perspective ambiguity. Our results across a diverse set of frontier open‑ and closed‑source VLMs reveal two consistent failure modes. First, models are prone to overconfident answering, attempting to solve spatial reasoning tasks even when visual evidence is incomplete or misleading, with average accuracy around 30% under occlusion and below 10% under perspective ambiguity. Second, even when additional views are available, some models perform near random chance in identifying which would provide reliable evidence. Together, our findings call for moving beyond answer correctness toward evaluating whether models know when to abstain and how to seek reliable evidence.
Authors:Amirhossein Ghaffari, Saeid Sheikhi, Ekaterina Gilman
Abstract:
Spatio‑temporal forecasting on sensor graphs is commonly tackled with a single backbone architecture applied uniformly across all nodes, although graph regions can exhibit different dynamics. Road segments differ in functional class, structure, and traffic behavior, suggesting that node‑wise expert specialization can be useful. We propose GC‑MoE, a graph‑conditioned mixture of experts framework that assigns each node a personalized combination of frozen forecasting experts based on graph topology and the recent traffic input window. GC‑MoE combines frozen pretrained spatio‑temporal GNN experts with an input‑aware, spatially contextualized router while training only a lightweight routing module. We also study a bounded graph‑conditioned output refinement layer as an optional extension and include node‑adaptive ST‑LoRA adapters only as an ablation diagnostic. Across four standard benchmarks (PEMS04, PEMS07, METR‑LA, and PEMS‑BAY), GC‑MoE improves MAE over a zero‑parameter ensemble baseline, with competitive RMSE and MAPE, while training only ~17K parameters on top of 1.5M frozen expert weights. The implementation is available at https://github.com/Ahghaffari/gc_moe.
Authors:Kewei Xu, Xiaoben Lu, Shuofei Qiao, Zihan Ding, Haoming Xu, Lei Liang, Ningyu Zhang
Abstract:
Real‑world data analysis is inherently iterative, yet existing benchmarks mostly evaluate isolated or short interactive tasks, leaving agents' ability to track evolving analytical context over long horizons untested. We introduce LongDS, a benchmark for long‑horizon, multi‑turn data analysis where agents must maintain, update, restore, and compose evolving analytical states. LongDS comprises 68 tasks constructed from real‑world Kaggle notebooks, spanning 2,225 turns across six domains including Geoscience, Business, and Education. Tasks are designed around state‑evolution patterns (e.g., counterfactual perturbation, rollback, multi‑state composition), with an average dependency span of 11.3 turns. Evaluating five state‑of‑the‑art models, we find that the best model reaches only 48.45% average accuracy, performance drops nearly 47 points from early to late turns, and long‑horizon errors account for 52%‑‑69% of failures. Further analysis shows that additional agent steps do not necessarily improve performance, suggesting that the key bottleneck is maintaining a correct analytical state rather than increasing interaction budget. We release LongDS to support research on reliable long‑horizon agentic data analysis. Code and data will be released at https://github.com/zjunlp/DataMind.
Authors:Yujie Luo, Xiangyuan Ru, Jingsheng Zheng, Jingjing Wang, Yuqi Zhu, Jintian Zhang, Runnan Fang, Kewei Xu, Ye Liu, Zheng Wei, Jiang Bian, Zang Li, Shumin Deng
Abstract:
Large Language Models (LLMs) have demonstrated strong performance on general tasks, while often struggling to adapt to specialized domains without high‑quality domain‑specific data. Existing LLM‑based data curation methods primarily rely on human‑designed workflows, leaving it unexamined whether LLMs can autonomously execute an end‑to‑end data engineering pipeline for model specialization. We formalize Autonomous Agentic Data Engineering, a novel task designed to evaluate LLMs as autonomous data engineers that drive model specialization through end‑to‑end data curation. We frame data as an optimizable component and study agents that plan, generate, and iteratively optimize training data across multiple domains, guided by post‑training performance improvement. Experiments show that autonomous LLM data engineers yield substantial gains, as GPT‑5.2 constructs a training curriculum that improves a student model by 57.29%, entirely through iterative, agent‑driven data adaptation. By illuminating both potential and bottlenecks, our study establishes autonomous data engineering as a measurable capability and charts a path toward agent‑driven model specialization\footnoteCode will be released at https://github.com/zjunlp/DataAgent..
Authors:Hwa Hui Tew, Junn Yong Loo, Fang Yu Leong, Julia K. Lau, Ding Fan, Hernando Ombao, Raphaël C. -W. Phan, Chee Pin Tan, Chee-Ming Ting
Abstract:
Functional Magnetic Resonance Imaging (fMRI) provides non‑invasive access to dynamic brain activity by measuring blood oxygen level‑dependent (BOLD) signals over time. However, the resource‑intensive nature of fMRI acquisition limits the availability of high‑fidelity samples required for data‑driven brain analysis models. While modern generative models can synthesize fMRI data, they often remain challenging in replicating their inherent non‑stationarity, intricate spatiotemporal dynamics, and physiological variations of raw BOLD signals. To address these challenges, we propose Dual‑Spectral Flow Matching (DSFM), a novel fMRI generative framework that cascades dual frequency representation of BOLD signals with spectral flow matching. Specifically, our framework first converts BOLD signals into a wavelet decomposition map via a discrete wavelet transform (DWT) to capture globalized transient and multi‑scale variations, and projects into the discrete cosine transform (DCT) space across brain regions and time to exploit localized energy compaction of low‑frequency dominant BOLD coefficients. Subsequently, a spectral flow matching model is trained to generate class‑conditioned cosine‑frequency representation. The generated samples are reconstructed through inverse DCT and inverse DWT operations to recover physiologically plausible time‑domain BOLD signals. This dual‑transform approach imposes structured frequency priors and preserves key physiological brain dynamics. Ultimately, we demonstrate the efficacy of our approach through improved downstream fMRI‑based brain network classification. The code is available at https://github.com/htew0001/DSFM.git .
Authors:Yizhu Wen, Shuhao Zhang, Nan Zhang, Long Cheng, Hanqing Guo
Abstract:
Retrieval‑augmented text‑to‑music (TTM) systems augment underspecified user prompts using captions retrieved from a music caption dataset. This design introduces an integrity dependency on the music knowledge database. We show that an attacker can poison the database by injecting a small number of crafted music captions, causing the system to retrieve malicious captions that bias prompt augmentation and steer generation away from the user's intended function, without modifying the user prompt, retriever, or generator. To achieve the music caption poisoning attack, we propose a dual‑layer caption poisoning strategy that preserves high‑level retrieval anchors while injecting low‑level acoustic descriptors to steer prompt augmentation and downstream music generation toward an attacker‑chosen target intent. In a MusicCaps knowledge database, CLAP retriever, and MusicGen pipeline, poisoned generations move substantially closer to the attacker's target, while remaining comparably aligned with the original user query. These results expose a practical integrity risk for retrieval‑augmented creative AI systems. Our demo can be found at: https://yizhu‑wen.github.io/Mental‑Damage/
Authors:Mingxuan Yi, Vidal Mehra, Jing Chen, John Cartlidge
Abstract:
Regime shifts in financial markets reorganise the joint dynamics of asset prices and macro variables, breaking any single‑regime calibration. They are nonetheless difficult to detect reliably because the data signal is noisy and heavily multicollinear, while the contemporaneous text that announces them is unstructured. Standard regime shift detection methods rely solely on structured time‑series data and ignore policy communications, even though these texts often signal shifts before they materialise in observed prices. We propose a text‑enhanced regime shift detection pipeline that combines large language model (LLM) reasoning over central‑bank communications with statistical validation on multivariate financial time series. The framework is detector‑agnostic: text‑proposed candidates are validated using a bootstrap likelihood‑ratio test on a vector autoregression (VAR), while data‑driven candidates from arbitrary regime detectors are ratified through a lenient LLM text check. We evaluate the framework on 2010‑2024 FOMC minutes paired with a 14‑variable U.S. Treasury and macroeconomic panel, using four interchangeable data‑driven detectors. The proposed pipeline achieves F1 = 0.82 against a verified anchor list of monetary‑policy regime shifts, with same‑day modal detection latency and consistently stronger performance than pure data‑driven baselines. The results demonstrate that combining unstructured policy text with statistical structural‑break detection improves the robustness and interpretability of regime shift identification in financial markets.
Authors:Chen Henry Wu, Aditi Raghunathan
Abstract:
Self‑improvement at scale has been a longstanding goal for reasoning models, and there are two natural places to do it: at test time, through verification‑refinement (V‑R) loops; and at training time, through self‑training methods. Both are gated by the same bottleneck: the verifier. V‑R loops stall when verifier scores inflate while accuracy stagnates, and when feedback is too generic to act on; self‑training fails similarly when bad self‑generated data are added to training. Better verification would unlock both, but the capability we want to train, i.e., catching self‑generated errors, lacks training signal. To address this challenge, we propose self‑trained verification (STV). Our key observation is that, while a model cannot catch these errors alone, it can when shown the reference solution. We turn this asymmetry into a supervision target and train the verifier to imitate a more informed version of itself. At test time, STV substantially improves V‑R loops on hard problems, while alternatives (e.g., SFT, RL on verifier scores, and even meta‑verifiers) do not. STV roughly doubles accuracy on hard math and lifts it 14x on scientific reasoning tasks (1.5% to 21%). At training time, we additionally train the generator using RL with STV verifier's feedback inside the V‑R loop ‑ a procedure we call verifier‑in‑the‑loop training (ViL). Starting from an RL‑converged generator, ViL yields a further 33% gain in pass@1. More notably, the generator's standalone pass@1, with no verifier at test time, climbs 30% relative past where standard RL had converged. Hence, the next frontier in reasoning on hard problems may lie in how we train for and with verification. Website: https://ar‑forum.github.io/stv‑webpage
Authors:Chun-Hsiao Yeh, Shengyi Qian, Manchen Wang, Yi Ma, Joseph Tighe, Fanyi Xiao
Abstract:
Vision‑Language Models (VLMs) often struggle with robust 3D spatial reasoning. Prevailing methods that rely on fine‑tuning with 3D visual question‑answering (VQA) datasets may overfit dataset‑specific biases, while integrating specialized 3D visual encoders is often inflexible and cumbersome. In this paper, we argue that genuine spatial understanding should emerge from learning fundamental geometric priors, not only from high‑level VQA supervision. We propose GASP (Geometric‑Aware Spatial Priors), a framework that injects these priors directly into the LLM's transformer layers. GASP employs a small correspondence head, applied as a deep supervision signal across all layers, and is trained with a dual objective leveraging ground‑truth geometry from large‑scale video scenes: a contrastive loss on ground‑truth point correspondences enforces 2D view‑invariance, while a depth consistency supervision resolves 3D geometric ambiguities. Our analysis first provides a diagnostic showing that standard VLMs' internal correspondence matching accuracy is very low (often below 5%). We then demonstrate that our training substantially improves this behavior, boosting peak layer‑wise correspondence to over 70% and maintaining over 85% temporal robustness while baselines remain below 5%. These internal improvements translate to significant gains on downstream spatial benchmarks including +18.2% on All‑Angles Bench and +29.0% on VSI‑Bench, all without training on any 3D VQA data. Our findings indicate that learning from fundamental geometric priors is a promising and generalizable pathway towards VLMs with more reliable 3D spatial reasoning.
Authors:Haoming Xu, Weihong Xu, Zongrui Li, Mengru Wang, Yunzhi Yao, Chiyu Wu, Jin Shang, Yu Gong, Shumin Deng
Abstract:
Long‑horizon interactions require language models to manage accumulating information: when to update their state, when to preserve their state, and what to ignore. We study this challenge as Contextual Belief Management (CBM): maintaining a predicted belief state aligned with formal evidence while isolating task‑irrelevant noise. To make CBM measurable, we introduce BeliefTrack, a closed‑world benchmark spanning Rule Discovery and Circuit Diagnosis, where a finite belief space and symbolic verifiers enable exact turn‑level evaluation. BeliefTrack diagnoses three failures: Failed Stay, Failed Update, and Failed Isolation. Across multiple LLMs, vanilla models exhibit severe CBM failures, while explicit belief‑tracking prompts provide limited gains. In contrast, reinforcement learning with belief‑state rewards reduces failure rates by 70.9% on average. Further probing reveals latent belief‑state dynamics behind these failures, and representation‑level steering reduces failure rates by 46.1% across two tasks\footnoteCode is coming soon at https://github.com/zjunlp/CBM.
Authors:Travis Lelle
Abstract:
We show that LoRA adapters, the dominant distribution format for fine‑tuned LLMs, can be reliably backdoored through training data poisoning while preserving baseline task performance. On a Qwen 2.5 1.5B prompt‑injection classifier, a small fraction of poisoned examples drives a clean‑accuracy‑preserving backdoor to saturation. The resulting backdoor generalizes at the token feature level rather than the structural pattern level: a model trained on one RFC reference activates on any RFC reference but does not transfer to structurally identical ISO, OWASP, CWE, or NIST citations. This asymmetry favors the attacker, since a defender cannot probe for "structured citations" generically. We characterize the attack across base‑model scale and family, LoRA rank, and trigger string, and evaluate two complementary detection routes against a multi‑seed adapter cohort. A behavioral detector built from two probe‑battery statistics, outlier_gap and mean_attack_rate, separates poisoned from clean adapters perfectly when the battery overlaps the trigger's token neighborhood and at high recall with zero false positives when it does not. A weight‑level statistic, the cross‑module standard deviation of dimension‑normalized Frobenius norms, also separates the cohort perfectly without running the model. Combined, the two routes are robust to probe composition. Causal patching localizes the backdoor to the MLP block at mid‑to‑late layers, with down_proj as the strongest single‑projection cause. Replications across scale, family, and rank show the behavioral detector transfers without retuning, while the weight‑level detector is calibration‑bound to the base model. The attack scales monotonically with rank, and the chosen trigger‑anchor token is both trigger‑dependent and base‑model‑dependent. Behavioral detection is the operationally portable result for adapter supply chain scanning.
Authors:Gijs van Nieuwkoop, Siamak Mehrkanoon
Abstract:
Deep‑learning precipitation nowcasting models are often optimized using pointwise losses such as mean squared error or mean absolute error, which can lead to overly smooth forecasts and poor representation of heavy rainfall. This study investigates whether the predictive performance of an established deterministic nowcasting architecture can be improved by reformulating training as a multi‑quantile regression problem. Using SmaAt‑UNet as a core model, we compare MSE, MAE, and multi‑quantile pinball‑loss training on radar precipitation nowcasting over the Netherlands. The results show that multi‑quantile training improves the central deterministic forecast, decreasing test‑set MSE by 8.6% compared to a model trained using MSE, while also producing upper‑quantile outputs that are useful for risk‑sensitive prediction of heavy precipitation. These findings suggest that quantile regression provides a simple alternative to standard pointwise losses without requiring a new architecture or generative sampling procedure. The implementation of our models and training setup is available on \hrefhttps://github.com/gijsvn/Multi‑Quantile‑Precipitation‑NowcastingGitHub.
Authors:Boning Li, Baoxiang Wang, Longbo Huang
Abstract:
Poker is a landmark challenge for artificial intelligence. The dominant approach relies on equilibrium solvers built on counterfactual regret minimization, requiring millions of core‑hours of training. Large Language Models (LLMs) possess extensive poker knowledge but perform far below solver‑based agents when asked to play directly. Traditional rule‑based poker agents are interpretable and training‑free, but their strategic ceiling remains far below equilibrium play. We introduce PokerSkill, a training‑free and solver‑free framework that bridges this gap by using detailed rule‑based poker skills as a structured action‑grounding interface for LLMs. A deterministic context engine analyzes the current state and retrieves only the relevant fragments from a layered skill library, which is entirely designed by human poker experts, constraining the LLM's choice to reasonable actions. Against GTOWizard, a state‑of‑the‑art GTO benchmark, GPT‑5.5 XHigh with PokerSkill achieves ‑57 \pm 21 mbb/hand, Claude Opus 4.6 achieves ‑80 \pm 29 mbb/hand and Claude Opus 4.7 achieves ‑87\pm 64 mbb/hand, reducing losses by 49‑‑61% compared to default‑prompt baselines and outperforming the strong bot Slumbot. Our key finding is that rule‑based skills alone do not constitute a strong strategy, and LLMs alone cannot play well, but their combination yields an agent that requires neither training nor solver access yet competes with systems built on millions of core‑hours of computation. To our knowledge, this is the first demonstration of an LLM achieving competitive performance in a complex imperfect‑information game without game‑specific training or solver queries. Code is available at https://github.com/lbn187/PokerSkill.
Authors:Matt Y. Cheung, Ashok Veeraraghavan, Hanjie Chen, Guha Balakrishnan
Abstract:
Language model reasoning traces are rarely all‑or‑nothing; they frequently contain valid intermediate steps before a critical error occurs. Existing uncertainty quantification methods typically certify final answers or entire responses, failing to provide statistical guarantees for the proportion of a sequential trace that can be safely retained. To address this, we introduce CROP (Conformal Reasoning Output Prefixes), a verifier‑agnostic calibration procedure for clean‑prefix certification. Given any step‑level risk proxy, CROP selects a calibrated threshold and returns the longest contiguous prefix whose step risk proxies remain below it, routing the uncertified suffix for downstream review or repair. Assuming exchangeability, CROP rigorously controls the marginal probability that the returned prefix contains an annotated error. Across six process‑labeled reasoning datasets, we demonstrate that standard step‑level metrics such as AUROC do not fully capture prefix utility, suggesting verifiers should instead be evaluated by certified prefix length. Furthermore, CROP balances over‑ and under‑withholding, improving downstream repair accuracy by preserving valid intermediate reasoning while discarding misleading suffixes. Ultimately, this work positions prefix certification as a rigorous, practical bridge between process supervision, abstention, and repair.
Authors:Jaa-Yeon Lee, Yeobin Hong, Taesung Kwon, Jong Chul Ye
Abstract:
Diffusion models generate highly realistic images but often struggle with precise text‑image alignment. While recent post‑training methods improve alignment using external rewards or human preference signals, their performance heavily depends on reward quality and does not directly address alignment within the diffusion process itself. Recent reward‑free approaches such as SoftREPA demonstrate that optimizing soft text tokens via contrastive learning can effectively improve text‑image representation alignment, outperforming standard parameter‑efficient fine‑tuning baselines. However, the contrastive formulation can excessively penalize negative pairs, which manifests as characteristic failure cases such as over‑counting and repetition. To address this issue, we propose a lightweight, reward‑free post‑training method that refines soft tokens by integrating contrastive alignment guidance directly into the score‑matching objective of diffusion models. By assigning alignment directions at the score level, our approach mitigates these limitations and yields more coherent and semantically faithful generations. Experiments show that our method matches SoftREPA while substantially improving its failure cases, achieving over 35% improvement in counting accuracy on the GenEval benchmark. Our method is seamlessly applicable to existing diffusion backbones (SD1.5, SDXL, and SD3), and is complementary to existing RL‑based diffusion post‑training methods. Project page: https://jaayeon.github.io/AGSM
Authors:Silin Zhou, Chenhao Wang, Yuntao Wen, Shuo Shang, Lisi Chen, Panos Kalnis
Abstract:
Urban trajectories play a crucial role in modeling urban dynamics and supporting various smart city applications. However, privacy concerns restrict access to large‑scale and high‑quality trajectory datasets. Trajectory generation provides a promising alternative by synthesizing realistic data to mitigate privacy risks. However, existing methods fail to explicitly capture travel patterns and can only generate fixed‑length trajectories under a single condition. To address these limitations, we propose HTP, which Hierarchically generates Travel patterns first and then generates GPS Points by using large language models (LLMs), rather than directly generating GPS points. We first design a trajectory‑specific residual quantization variational autoencoder (RQ‑VAE) that quantizes micro‑level GPS trajectories into compact, macro‑level travel pattern tokens in a coarse‑to‑fine manner. These tokens capture rich segment spatial irregularities, such as point density variations caused by traffic conditions. Then, we extend the LLM vocabulary with travel pattern tokens to align trajectory representations with the LLM input, and apply supervised fine‑tuning (SFT) to align the LLM with the trajectory generation task, enabling generation of travel pattern sequences under various conditions. Extensive experiments on two real‑world datasets show that HTP outperforms the strongest baseline by an average of 29.78% in terms of generation quality. Our code is available at https://github.com/slzhou‑xy/HTP.
Authors:Víctor Gallego
Abstract:
We study two‑level autoresearch for cooperation: an outer‑loop AI agent autonomously redesigns the inner‑loop pipeline of an LLM policy‑synthesis system for multi‑agent Sequential Social Dilemmas (SSDs). A researcher agent \mathcalR (run as a coding agent) reads the inner‑loop source code, edits system prompts, feedback functions, helper libraries, and iteration logic, runs evaluations, and decides what to keep, following the autoresearch paradigm. Across two games (Cleanup and Gathering), two policy‑synthesizer LLMs, and two welfare objectives (utilitarian efficiency and Rawlsian maximin), the researcher reliably exceeds hand‑designed baselines, sharply tightens run‑to‑run variance, and outperforms prompt‑only optimization. The discovered pipelines are objective‑dependent: only under maximin does the researcher inject an explicit fairness mechanism into synthesizer pipelines, a class of mechanism that is absent from its own objective‑agnostic system prompt and from every efficiency‑optimized pipeline. This supports an information‑design reading in which the researcher chooses what to reveal to the boundedly rational synthesizer as a function of the welfare objective. Code at https://github.com/vicgalle/autoresearch‑social‑dilemmas.
Authors:Kun Feng, Ziwei Shan, Yuchen Fang, Yiyang Tan, Sihan Lu, Shuqi Gu, Lintao Ma, Xingyu Lu, Kan Ren
Abstract:
Cross‑domain multimodal time series forecasting is a challenging task, requiring models to integrate precise numerical comprehension, cross‑domain semantic understanding, and effective multimodal fusion. Existing approaches either build Time Series Foundation Models (TSFMs) from scratch or leverage pretrained Large Language Models (LLMs). However, TSFMs often overlook semantic understanding and lack the ability to perform future‑oriented semantic reasoning, and LLMs struggle with numerical comprehension and accurate quantitative forecasting. To overcome these limitations, we propose KairosAgent, a novel agentic framework for multimodal time series forecasting, including an LLM‑based reasoner and a TSFM‑based forecaster. KairosAgent unifies textual reasoning and numerical forecasting by dynamically invoking analytical tools to enhance the numerical understanding and semantic reasoning capabilities of LLMs. The reasoning results are subsequently fused into the TSFM pipeline, enabling more accurate and reliable future predictions. To further improve the reasoning, we curate a large‑scale corpus of high‑quality trajectories, alongside a reinforcement learning from forecasting paradigm with multi‑turn refinement and turn‑level credit assignment. Experiments demonstrate that KairosAgent achieves superior zero‑shot forecasting performance while maximizing the utility of pretrained LLMs and TSFMs, presenting a promising direction for efficient and interpretable time series agents. The project page is at https://foundation‑model‑research.github.io/KairosAgent .
Authors:Muhammed Furkan Dasdelen, Fatih Ozlugedik, Ilaria Looser, Rao Muhammad Umer, Christian Pohlkamp, Carsten Marr
Abstract:
Multimodal alignment of histopathology encoders with transcriptomic and genomic data has been shown to significantly improve performance in downstream diagnostic tasks. Hematological cytology is unique in that visual single‑cell evaluation is often paired with cytogenetics and molecular genetics for blood cancer diagnosis. In this study, we present a framework to align single white blood cell images with chromosomal aberrations (karyotype) and somatic mutations from targeted gene panels. Our training strategy follows a two‑stage approach: (i) self‑supervised, vision‑only pretraining of a transformer aggregator using an iBOT head on a cohort of over 1500 patients, and (ii) genetic alignment via supervised contrastive loss on acute myeloid leukemia patients. Our genetically aligned patient encoder improves hematological diagnostic tasks, outperforming slide‑level histopathology foundation models. Additionally, the model provides off‑the‑shelf retrieval capabilities for diseases and genetic alterations. Incorporating genetic data into patient encoders increases the quality of patient representations, providing a framework that aligns with clinical diagnostic workflows and paves the way for future multimodal hematology‑specific AI. The code and model weights are available at https://github.com/marrlab/GenBloom.
Authors:Minyang Hu, Bo Yang, Zhinuo Zhou, Jiachen Liang, Guo Jiahao, Yiyang Yin, Xiongwei Han
Abstract:
LLM‑based agents have demonstrated strong capabilities in solving complex tasks through multi‑step reasoning and tool use. However, existing evaluation protocols primarily focus on task success, overlooking a critical aspect of agent behavior: execution efficiency. In practice, agent trajectories often contain redundant steps that consume substantial resources while contributing little to task completion. In this work, we propose and formulate a new research area: redundant step detection for agent trajectories. To support this initiative, we introduce RedundancyBench, a new benchmark that contains diverse tasks with carefully annotated trajectories, where each step is labeled according to its contribution to task completion. Using RedundancyBench, we develop and evaluate 3 representative methods to answer whether a step within trajectory is redundant or necessary. Our results show that even the best‑performing method achieves only 24.88% score in detecting redundant steps, while some methods perform worse than random guessing. These results highlight the task's complexity and the need for further research in this area. \footnoteCode and dataset in this paper are both available in \hrefhttps://anonymous.4open.science/r/RedundancyBenchhttps://anonymous.4open.science/r/RedundancyBench.
Authors:Francisco León Zúñiga Bolívar
Abstract:
Do next‑generation LLM agents inherit the cooperative biases documented in their predecessors, or does scale and provider diversity reshape equilibrium behaviour in competitive multi‑agent settings? Willis et al. established a benchmark for this question using evolutionary game theory and the Iterated Prisoner's Dilemma (IPD), finding consistent cooperative biases in ChatGPT‑4o and Claude 3.5 Sonnet. We extend this benchmark to four frontier models released in 2025‑2026 ‑ Claude Sonnet 4.6, Gemini 2.5 Flash, Gemini 3.1 Pro, and GPT‑5.4 Mini ‑ applying the identical protocol across three prompting styles (Default, Prose, Self‑Refine) and four population compositions (balanced and biased, with and without noise). Cooperative bias persists across providers (H1): nine of twelve model‑prompt combinations favour cooperative equilibria in balanced noiseless conditions. Cross‑provider divergence is substantial (H3): Gemini 2.5 Flash reaches up to 77% aggressive equilibria under biased conditions, while GPT‑5.4 Mini reaches 70% cooperative equilibria under Self‑Refine. Support for aggressive capability parity is partial (H2): Self‑Refine raises ICD in all models and Claude Sonnet 4.6 Refine achieves the highest ICD in the dataset (0.913), but Default and Prose prompts show no systematic narrowing. Evidence on noise robustness is directionally positive but not robustly confirmed (H4): with n=500 Moran iterations per condition, average noise sensitivity is approximately 6 percentage points for Claude Sonnet 4.6 versus 13 pp for Claude 3.5 Sonnet, but this cross‑study gap is not statistically significant once the predecessor's unreported sampling error is propagated. Provider identity, rather than model generation, is the strongest correlate of equilibrium outcomes; noise remains a universal challenge regardless of model size or vintage.
Authors:Chenghao Zhang, Guanting Dong, Yufan Liu, Tong Zhao, Xiaoxi Li, Zhicheng Dou
Abstract:
Large Language Models (LLMs) have advanced autonomous agents from deep search, which retrieves concise factual answers, to deep research, which synthesizes scattered evidence into long‑form reports. However, verifiable multimodal deep research remains challenging due to open‑ended synthesis without deterministic ground truth and the need to interleave textual arguments with visual evidence. We propose Ptah, a multi‑agent harness for interleaved report generation. Ptah orchestrates the lifecycle from user query to rendered web report through planning, research, and writing stages, where specialized agents construct visual‑aware plans, collect claim‑grounded evidence, maintain source‑aligned images in a Visual Working Memory, and compose reports through declarative multimodal tool use. A verifier agent serves as the harness's acceptance function, enforcing factual grounding, citation fidelity, and cross‑modal consistency throughout the workflow. We further introduce PtahEval, an evaluation protocol that augments existing benchmarks with image‑level and presentation‑level assessments. Experiments on deep research benchmarks show that Ptah produces more reliable, visually informative, and usable human‑facing multimodal reports than strong baselines. Our code is released at https://github.com/SnowNation101/Ptah
Authors:Haochen Yang, Ke Zhao, Mengyuan Ma, Xingyu Lu, Xiangfeng Wang, Hong Qian
Abstract:
Leveraging Large Language Models (LLMs) to automatically formulate and solve optimization problems from natural language has emerged as an efficient paradigm for automated optimization. However, existing methods still exhibit limited generalization: they are sensitive to superficial narrative variations, reuse experience mainly at the case level, and struggle to adapt to shifted or emerging problem types. We propose OptSkills, an archetype‑centric skill learning and reasoning agent system for optimization modeling and solving. To improve robust generalization, our system clusters problems by their underlying archetypes rather than surface narratives. To improve in‑distribution generalization, it explores diverse modeling paradigms and solver configurations within each cluster, then distills successful trajectories into reusable workflow‑level skills. To improve out‑of‑distribution generalization, it refines existing skills or expands the skill library using newly obtained trajectories. Our system achieves a state‑of‑the‑art micro‑averaged accuracy of 68.27% on datasets encompassing diverse problem types and scenarios. In addition, on MIPLIB‑NL, a highly challenging large‑scale and high‑dimensional benchmark, it achieves 26.91% accuracy, outperforming DeepSeek‑V3.2‑Thinking by 4.53%. After skill learning on Nano‑CO, it reaches 72.79% on the OOD NLCO benchmark. Code and skills are available at https://github.com/fujiwaranoM0kou/OptSkills.
Authors:Yunbo Tang, Chengyi Yang, Shiyu Liu, Zhishang Xiang, Zerui Chen, Qinggang Zhang, Jinsong Su
Abstract:
Agentic search enables LLMs to solve complex multi‑hop questions through iterative reasoning and external search. Despite the effectiveness, these systems often suffer from a critical limitation in practice: agents fail to recognize their own knowledge boundaries, blindly triggering searches when internal knowledge suffices and failing to terminate search even when adequate evidence has been collected. The lack of self‑awareness leads to severe over‑search, incurring substantial inference latency and prohibitive computational cost. To this end, we propose SAAS, a novel RL framework designed to cultivate dynamic self‑awareness that precisely regulates search behavior without compromising accuracy. SAAS introduces three key components: (i) a search boundary modeling mechanism, which identifies the search boundary under the evolving policy by contrasting search‑disabled and search‑enabled rollouts; (ii) a boundary‑aware reward module, which translates this boundary awareness into trajectory‑level penalties, suppressing unnecessary and redundant searches; and (iii) a stage‑wise optimization strategy, which leverages a sequential curriculum to prioritize reasoning over search regularization, thereby avoiding reward hacking. Extensive experiments demonstrate that SAAS substantially reduces over‑search, while maintaining accuracy. Our code and implementation details are released at https://github.com/XMUDeepLIT/SAAS.
Authors:Boyuan Zhang, Huanshan Huang, Yifei Cao
Abstract:
Reliable semantic segmentation for mobile robots requires both accurate dense prediction and robust uncertainty estimation under distribution shift. Strong uncertainty baselines such as Monte Carlo Dropout often require repeated stochastic forward passes and are difficult to deploy on edge platforms. We propose Energy‑Aware NECO, a single‑pass pixel‑wise out‑of‑distribution (OOD) detector for semantic segmentation. The method combines a centered NECO‑style geometric ratio computed from decoder features with a logit‑based Energy score. Both components are standardized using statistics fitted on a pure in‑distribution validation split and fused through a convex combination. We evaluate the method on the miniMUAD subset using true pixel‑level OOD labels. The proposed hybrid score achieves an AUROC of 0.8539, outperforming NECO‑only (0.8280), Energy‑only (0.8171), and an ensemble predictive‑entropy baseline (0.8124). Additional qualitative and operating‑point analyses show that the hybrid detector improves overall ranking performance while preserving the efficiency advantages of a single‑pass design. Code is available at https://github.com/boyuan‑zhangx/Energy‑Aware_NECO
Authors:Yeong-Joon Ju, Seong-Whan Lee
Abstract:
Deploying Large Language Models (LLMs) for regulatory compliance demands rigorous traceability via comprehensive citations across multi‑tiered authority structures. Unlike traditional multi‑hop or legal QA, this task requires structured procedural lookups and evidence‑set closure rather than entity resolution or case‑law reasoning. Existing RAG systems struggle here due to flattened citation edges, fragmented retrieval expansions, and fragile post‑hoc attribution. We formalize Regulatory Compliance QA with RegOps‑Bench, a novel benchmark featuring an Operational Knowledge Graph derived from complex national R\&D regulations. To address these bottlenecks, we propose RefWalk, a unified framework driven by a shared topic anchor. RefWalk traverses cross‑document citations, fuses multi‑view candidates via max‑based aggregation, and enforces per‑rule attribution to explicitly map claims to sources. We establish a strong baseline with substantial improvements in retrieval recall and citation accuracy. Finally, a contrastive evaluation on a U.S. health compliance dataset (HIPAA) reveals that existing systems exhibit saturation on flat‑structure rules, underscoring the need for RegOps‑Bench. Our code is available at https://github.com/yeongjoonJu/RefWalk.
Authors:Shuaidi Wang, Zhan Zhuang, Ruping Huang, Yu Zhang
Abstract:
Diffusion Large Language Models (dLLMs) have emerged as a promising non‑autoregressive generative paradigm. Given the prohibitive computational cost of full fine‑tuning, Parameter‑Efficient Fine‑Tuning (PEFT) has become the standard approach. However, existing PEFT methods (e.g., LoRA), originally tailored for autoregressive models, rely on static parameters that are agnostic to the noise level. Consequently, they ignore the intrinsic dynamics of the diffusion process, where input distributions and generation difficulty shift significantly along the denoising trajectory, rendering them suboptimal for dLLMs. To address this, we propose Noise‑aware Low‑Rank Adaptation (NaRA), which introduces a low‑rank core matrix generated by a lightweight, globally shared hypernetwork conditioned on the noise level. This design enables the update matrices to vary continuously along the diffusion process while keeping parameter and latency overhead negligible. We provide a theoretical justification for the proposed NaRA framework and empirically demonstrate consistent improvements over noise‑agnostic baselines across commonsense reasoning, mathematical reasoning, and code generation benchmarks. Our code is available at https://github.com/generaldi/NaRA.
Authors:Mincheol Kang, Hyunjin Lim, Bomin Kang, Daehee Park
Abstract:
Trajectory prediction is a fundamental task for autonomous systems, requiring complex reasoning about multi‑agent interactions and intents. Large language models (LLMs) have recently been adopted for this task, as they provide strong contextual reasoning and interpretable, language‑based trajectory representations. However, these LLM‑based predictors are extremely memory‑ and compute‑intensive, making them difficult to deploy on resource‑constrained edge devices such as on‑board computers in autonomous robots. To bridge this gap, we propose BitTP, which converts an LLM‑based trajectory predictor into a lightweight bitlinear architecture. We demonstrate that weight‑only quantization to 1.58‑bit (BitTP‑Weight) is optimal. Crucially, activations must remain in full precision, as quantizing them leads to severe degradation and instability in spatio‑temporal reasoning. Empirically, BitTP‑Weight not only preserves but improves prediction quality over the full‑precision (BF16) LLM baseline, reducing ADE by 14.29% and FDE by 20.97% on average, while simultaneously reducing memory usage and inference latency relative to other quantization methods. These results demonstrate that carefully designed quantization acts as an effective regularizer, enabling the practical deployment of sophisticated LLM‑based reasoning on edge devices. Code is available at: https://github.com/MintCat98/BitTP.
Authors:Yundong Kim, Heyoung Yang
Abstract:
Evaluating open‑ended outputs from large language models (LLMs) remains challenging due to the absence of ground truth. Existing metrics rely on final‑answer accuracy or surface‑level statistics, leaving the reasoning process itself unexamined. We introduce TRACE (Toulmin‑based Reasoning Assessment through Constructive Elements), a metric that analyzes Chain‑of‑Thought (CoT) reasoning processes. Rather than judging outcomes, TRACE inspects how arguments are constructed by integrating Toulmin's argumentation theory with Flavell's metacognitive framework to assess reasoning structure. Experiments on 26.3K QA samples across 7 reasoning models show strong correlation with benchmark accuracy (r=0.74). Furthermore, TRACE is effective as a reinforcement learning reward signal, outperforming accuracy‑only baselines. Together, these results indicate that logically sound reasoning leads to higher‑quality answers. TRACE thus serves as a complementary metric for evaluating open‑ended outputs. Code is available at https://github.com/hyyangkisti/trace.
Authors:Youwang Deng
Abstract:
End‑to‑end agent‑memory benchmarks report a single hit@k per retriever, confounding lexical leakage (uncontrolled query/gold/distractor entity overlap) with tag‑mixing (preferences, services, tools averaged together). We propose entity‑collision, a system‑agnostic protocol that pins the BM25 floor by construction ‑‑ every distractor shares the answer's entity tokens ‑‑ and stratifies queries by discriminator tag, so any lift over BM25 is attributable to the embedder. Applied to an open‑source agent‑memory testbed across 5 tags x 3 embedders x 5 collision degrees with paired‑bootstrap 95% CIs, the protocol reveals a two‑axis pattern: a 256‑d hash trigram helps only on closed‑vocabulary lexical tags at deep collision; MiniLM‑384 dominates both axes; and a 2.7x‑parameter BGE‑large does not uniformly improve on MiniLM ‑‑ it wins on intent‑style queries but loses on lexical ones. Encoder capacity alone is not the binding constraint. The synthetic intent‑tag null replicates on LongMemEval (n=500) as a single‑session‑preference recall cliff. Adaptive vector‑weight routing on LoCoMo is a measured null: 11.7pp of oracle headroom exists, but no signal we tested recovers it. All 26 result tables and 37 reproduce scripts are version‑controlled and verified by a public registry; the protocol is exercised on a deterministically governed memory testbed (event‑sourced decision log, DAG‑state‑machine schema lifecycle) so every reported CI is reproducible byte‑for‑byte from the ingest stream.
Authors:NamGyu Jung, Chang Choi
Abstract:
In scene graph generation, a central challenge is modeling polysemous predicates whose meanings shift across contexts. Prior approaches address this issue by decomposing predicates into multiple static prototypes or retrieving semantically similar exemplars. However, these strategies keep predicate representations static and cannot reorganize semantics to reflect image‑specific evidence, leading to systematic confusions in ambiguous contexts. We propose AlignG, which learns context‑conditioned predicate semantics via prototype feedback. AlignG infers context‑conditioned predicate semantics from the relation candidates within each image and feeds the adapted semantics back to recalibrate relation representations. The learning objective anchors this adaptation to global semantic centers, preventing semantic drift while still allowing selective reorganization when the scene provides consistent relational cues. Experiments on VG‑150 and GQA‑200 show consistent improvements over state‑of‑the‑art baselines, with F@100 improvements of +1.4 on VG‑150 and +2.7 on GQA‑200 under SGDet. We further visualize per‑image prototype similarity shifts and observe coherent context‑dependent reorganization where prototypes selectively merge or separate predicates according to scene evidence. The code is available at https://github.com/Namgyu97/AlignG‑SGG.pytorch.
Authors:Yizhuo Lu, Changde Du, Qingyu Shi, Hang Chen, Jie Peng, Liuyun Jiang, Shuangchen Zhao, Huiguang He
Abstract:
Modeling the interplay between external stimuli and internal neural representations is a pivotal research area for Brain‑Computer Interfaces (BCIs). A major limitation of prior work is the prevailing paradigm of specialized, single‑task models, which curtails versatility and neglects inter‑task synergies. To address this, we propose Mind‑Omni, the first versatile framework that unifies seven distinct encoding and decoding tasks through a discrete diffusion paradigm. At its core is a novel Brain Tokenizer that transforms heterogeneous, continuous brain signals into standardized, discrete tokens. This enables direct, token‑level interactions for mutual understanding and generation between any two or more modalities within a shared semantic space. To unlock advanced reasoning capabilities, we further curate a specialized Brain Question Answering (BQA) instruction‑tuning dataset. Our model not only establishes a new state‑of‑the‑art among multi‑task unified frameworks but also provides strong evidence for multi‑task synergy. By demonstrating performance competitive with, and at times superior to, larger specialized models, our work offers a powerful new paradigm for neural modeling and paves the way for foundation models of neural activity. The code is publicly available at https://github.com/ReedOnePeck/Mind‑Omni.
Authors:Jiacong Liu, Shu Luo, Yikai Qin, Yaze Zhao, Yongwei Jiang, Yixiong Zou
Abstract:
Vision‑language foundation models have shown promising zero‑shot generalization for Cross‑Domain Few‑Shot Object Detection (CD‑FSOD). However, they face two critical challenges in fine‑tuning: insufficient support set utilization due to sparse single‑instance annotations, and severe overfitting under extremely limited target‑domain samples. To address these issues, this paper proposes GiPL, an efficient two‑branch training framework. In the first branch, we design an iterative pseudo‑label self‑training paradigm, which performs zero‑shot inference on the support set to generate reliable pseudo‑annotations, fuses them with ground‑truth labels, and iteratively optimizes the model to fully exploit support set data. In the second branch, we introduce generative data augmentation pipeline using large vision‑language models, which synthesizes domain‑aligned, multi‑object annotated images to enrich training samples and suppress overfitting. Extensive experiments on three challenging CD‑FSOD datasets (RUOD, CARPK, CarDD) under 1/5/10‑shot settings demonstrate that GiPL consistently outperforms state‑of‑the‑art methods with significant performance gains. Code is available at \hrefhttps://github.com/z‑yaz/CDiscoverCDiscover.
Authors:Runang He, Tongya Zheng, Huiling Peng, Yuanyu Wan, Bingde Hu, Jiawei Chen, Canghong Jin, Mingli Song, Can Wang
Abstract:
Ever‑evolving transaction patterns have significantly hindered anomaly detection on emerging cryptocurrency blockchains due to the vast number of addresses and diverse anomalous behaviors. Recently, advanced Graph Anomaly Detection (GAD) approaches applied to blockchains have faced two critical challenges: adversarial pattern evolution by malicious actors and the out‑of‑distribution (OOD) problem caused by varied transaction semantics on blockchains. To address these challenges, we propose a novel framework termed TEmporal Motif‑aware Graph Test‑Time Adaptation (TEMG‑TTA). First, we comprehensively capture the 3‑node temporal motif distribution of each active address using an efficient computational mechanism, enabling downstream temporal motif‑aware graph learning. Second, we design a simple yet effective test‑time adaptation strategy to facilitate the sharing of common patterns between training and testing graphs. Extensive experiments on 5 real‑world datasets demonstrate that our proposed TEMG‑TTA outperforms state‑of‑the‑art GAD approaches by an average of 54.88%. A further case study on interpretable motif patterns reveals that TEMG‑TTA explicitly characterizes the complex transaction patterns of anomalous addresses, thereby verifying the effectiveness of our technical designs. Our code will be made publicly available https://github.com/LuoXishuang0712/TEMG‑TTA/.
Authors:Zhixin Cai, Jun Bai, Yang Liu, Jiaqi Li, Yichi Zhang, Taichuan Li, Zhuofan Chen, Zixia Jia, Zilong Zheng, Wenge Rong
Abstract:
Explaining why dense retrievers assign high relevance scores remains challenging because retrieval decisions are made through opaque high‑dimensional embeddings. Existing explanations often focus on surface signals, such as lexical matches, token alignments, or post‑hoc textual rationales, and thus provide limited insight into the latent factors that shape dense retrieval behavior at the embedding level. We propose Xetrieval, an embedding‑level mechanistic framework for explaining dense retrieval. Xetrieval first introduces a lightweight reasoning internalizer that approximates Chain‑of‑Thought reasoning directly in the embedding space with a single forward pass, enriching sentence embeddings with reasoning‑oriented information while avoiding expensive autoregressive generation. It then decomposes these reasoning‑enhanced embeddings into sparse, human‑interpretable features, each associated with a coherent natural language description. By aggregating sparse feature overlaps across multiple document‑side views, Xetrieval provides feature‑level explanations of individual retrieval decisions. Experiments on diverse retrievers and benchmarks show that Xetrieval uncovers coherent interpretable features, yields stronger pair‑level intervention effects, and supports task‑level feature steering. The project page and source code are available at https://hihiczx.github.io/Xetrieval .
Authors:Xiaohang Tang, Keyue Jiang, Che Liu, Qifang Zhao, Xiaoxiao Xu, Sangwoong Yoon, Ilija Bogunovic
Abstract:
Reinforcement learning (RL) can be used to improve the policy (denoiser) of diffusion large language models (dLLMs), while being hindered by the intractability of the policy likelihood. A dominant and efficient family of methods replaces the likelihood in standard RL with its evidence lower bound (ELBO), estimated from randomly masked sequences. Despite being well aligned with pre‑training, these approaches introduce bias through training‑‑inference mismatch by using the ELBO as a likelihood surrogate, which can degrade performance. In this work, we propose Guided Denoiser Self‑Distillation (GDSD) to directly distill the denoiser of dLLMs from an advantage‑guided self‑teacher, derived from the closed‑form optimum of reverse‑KL regularized RL. GDSD matches the dLLM's denoiser logits to the teacher's via a normalization‑free objective, which reduces RL to likelihood‑free self‑distillation and thus bypasses the TIM biases. Recent ELBO‑based methods emerge as instances of applying different distillation divergences, but with diagnosable pathologies that GDSD avoids. On planning, math, and coding benchmarks with LLaDA‑8B and Dream‑7B, GDSD consistently outperforms prior state‑of‑the‑art ELBO‑based methods with a more stable training reward dynamics, achieving test‑accuracy improvements of up to +19.6%. These results suggest that direct denoiser self‑distillation, without relying on an ELBO likelihood surrogate, can provide a more stable and effective RL procedure for dLLMs. Code is available at https://github.com/GaryBall/GDSD.
Authors:Hesam Asadollahzadeh, Feng Liu, Christopher Leckie, Sarah M. Erfani
Abstract:
Mainstream strategies for finetuning pretrained multimodal models often degrade out‑of‑distribution (OOD) robustness, a phenomenon known as catastrophic forgetting. In this paper, we develop a theoretical framework for multimodal contrastive finetuning, yielding closed‑form solutions and a geometric decomposition for each strategy. This framework shows that self‑distillation is more effective than other regularization approaches to retain the knowledge of the pretrained model. Our analysis reveals a largely overlooked limitation: standard Exponential Moving Average (EMA) teachers, widely used in robust finetuning, suffer from collapse. To solve this, we prove that a Weighted Moving Average (WMA) teacher maintains a persistent regularizing force over finite horizons and yields bias‑free convergence in the task subspace while preserving orthogonal knowledge. These insights motivate TRACER (Trajectory‑Robust Anchoring for Contrastive Encoder Regularization), which combines contrastive learning with WMA‑guided multi‑perspective distillation. Extensive experiments on CLIP finetuning demonstrate consistent OOD accuracy and calibration gains across three backbone architectures, and comprehensive ablations confirm that TRACER is both principled and robust to hyperparameter choices. Code is available at [https://github.com/HesamAsad/TRACER](https://github.com/HesamAsad/TRACER).
Authors:Yiqun Liu, Yingsheng Wu, Ruqi Yang, Enrong Zheng, Honglei Qiu, Sijun He, Tai Liang, Jingjing Wu, Yuhan Zhou, Yiwei Zhang, Dongyan Chen, Weihan Yi, Xinqi Li, Siqi Bao
Abstract:
Modern tensor compilers such as TorchInductor deliver substantial speedups on mainstream models, yet face a systematic performance ceiling on long‑tail workloads ‑‑ our profiling shows that 43% of real‑world subgraphs experience end‑to‑end slowdowns under default compilation. While LLMs offer a path toward automated optimization, existing efforts focus on standalone kernel generation. We argue that pass generation ‑‑ where LLMs author structured graph transformations that integrate directly into compiler pipelines ‑‑ is the more appropriate abstraction. We propose PassNet, the first large‑scale ecosystem for LLM‑based compiler pass generation, comprising: (1) PassNet‑Dataset, over 18K unique computational graphs from 100K real‑world models; and (2) PassBench, 200 curated long‑tail fusible tasks (comprising 2,060 subgraphs in total) evaluated under the Error‑aware Speedup Score (ES_t) ‑‑ a metric unifying correctness, stability, and performance ‑‑ with layered integrity defenses against systematic LLM exploitation. Experiments reveal that PassBench is both highly discriminative and genuinely unsaturated: the best frontier model trails TorchInductor by 37% in aggregate, yet on individual subgraphs LLMs achieve up to 3x speedup over the same compiler ‑‑ indicating that the bottleneck is consistency, not capability. Fine‑tuning a small model on merely ~4K PassNet trajectories yields a 2.67x improvement approaching frontier‑model performance, demonstrating substantial headroom and validating PassNet as live training infrastructure for advancing LLM‑driven compiler optimization. All data, benchmarks, and tooling are publicly available.
Authors:Qi Liu, Mingdi Sun, Yongyi He, Zhi Zheng, Tong Xu, Yi Zheng, Zhefeng Wang, Enhong Chen
Abstract:
Supervised fine‑tuning (SFT) followed by reinforcement learning (RL) has become a standard post‑training paradigm for large language models. This paradigm provides a cold‑start for RL exploration, avoiding the inefficiency of pure RL where on‑policy sampling yields insufficient positive samples. However, in practice, existing approaches often use a small amount of data for SFT initialization compared to the RL phase, which can cause the model to fit the limited samples and shift away from its pre‑trained distribution. This distribution shift impedes the model's ability to effectively explore during subsequent RL training. To address this challenge, we propose that in low‑data regimes, SFT should prioritize activating task‑relevant capabilities rather than memorizing specific content. Along this line, we propose EKSFT (Entropy‑KL Selective Fine‑Tuning), which selectively masks tokens that exhibit either high entropy or high KL divergence from a reference model. By excluding these high‑uncertainty, distribution‑shifting tokens from imitation, EKSFT injects task‑specific knowledge while preserving the integrity of the model's pre‑trained distribution. Empirical evaluations on mathematical reasoning benchmarks demonstrate that EKSFT consistently outperforms standard SFT. Further RL fine‑tuning from the EKSFT model yields consistently better post‑RL performance, indicating improved exploration for the RL stage. Our codes and datasets are available at https://github.com/MINE‑USTC/EKSFT.
Authors:Xinyu Liu, Darryl Cherian Jacob, Yang Zhou, Jindong Wang, Pan He
Abstract:
Recent reinforcement learning (RL) post‑training approaches primarily optimize the final output policy using sparse outcome‑level rewards, while largely overlooking predictive signals encoded in intermediate representations. In this paper, we introduce a new paradigm called on‑policy internal self‑distillation and propose the OISD framework, which improves reasoning by transferring on‑policy predictive signals from the final layer to intermediate representations. During rollout and Group Relative Policy Optimization (GRPO) optimization, the final layer acts as both the policy and a detached internal teacher for selected intermediate layers, which are guided to align with it through two complementary mechanisms: logit alignment, which transfers high‑level reasoning behaviors (how to think), and attention alignment, which enforces consistent attention patterns (where to look) from the final layer to the selected intermediate layer, both without requiring external privileged information. Our OISD, together with GRPO, employs signed advantage‑weighted Jensen‑‑Shannon alignment to distill informative intermediate representations while preserving policy consistency under a unified acting policy. Experimental results demonstrate the effectiveness of OISD, with substantial and consistent improvements over strong reasoning RL baselines across four mathematical reasoning tasks. The code will be released at https://github.com/THE‑MALT‑LAB/OISD
Authors:Mohan Zhang, Yuqi Jia, Zhen Tan, Steven Jiang, Neil Zhenqiang Gong, Tianlong Chen, Dawn Song
Abstract:
LLMs are vulnerable to prompt injection attacks. However, this vulnerability has been primarily demonstrated conceptually in academic studies or through a few anecdotal case studies. Its prevalence and impact in real‑world LLM‑based applications are largely unexplored. In this work, we present the first systematic study of prompt‑injection attacks in a widely used application: LLM‑based resume screening. Our analysis is based on approximately 200K real‑world resumes collected over multiple years by hireEZ. We first design tailored methods to detect prompt injection in resumes. Manual validation on a small‑scale dataset demonstrates that our detectors achieve high precision and outperform state‑of‑the‑art general‑purpose detectors. We then apply our detector to the full resume dataset and conduct a comprehensive measurement study of real‑world prompt injection attacks. Our analysis reveals several intriguing findings: approximately 1% of resumes contain hidden prompt injections; the prevalence of such injected resumes has increased noticeably over the past one to two years; and more than 90% of injected prompts do not use explicit instructions. These results provide the first evidence of large‑scale prompt injection in real‑world LLM‑based applications and lay the groundwork for future studies to understand and mitigate such attacks.
Authors:Venkat Akhil Lakkapragada
Abstract:
Large language models have achieved strong reasoning capabilities, though often at the cost of massive parameter counts and expensive inference. In this work, we explore a different direction: adaptive reasoning depth in compact language models. We present CosmicFish‑HRM, a compact language model built around a Hierarchical Reasoning Module (HRM) that dynamically allocates computational effort during inference. Instead of applying fixed computation to every input, the model iterates through high‑level and low‑level reasoning cycles and learns when to halt based on input complexity. CosmicFish‑HRM combines this adaptive reasoning core with modern transformer components including Grouped Query Attention, RoPE, and SwiGLU activations. While the additional reasoning infrastructure introduces overhead at small scale, we hypothesize that this tradeoff becomes increasingly favorable as model size grows and the relative cost of the HRM core diminishes. Our results show that the model learns non‑uniform reasoning behavior, allocating different numbers of reasoning steps across tasks and inputs. These findings suggest that adaptive reasoning depth may offer a promising alternative to relying solely on parameter scale for reasoning capability.
Authors:Suliu Qin, Haomin Zhuang, Yujun Zhou, Yufei Han, Xiangliang Zhang
Abstract:
Tool‑using language agents turn model decisions into external side effects: they read files, run scripts, call APIs, send messages, and invoke Model Context Protocol tools. This makes agent attacks different from jailbreaks. The harmful step is often not an obviously forbidden output, but an ordinary executable action that becomes unsafe because attacker‑controlled context steers authorized access against the user's interest. We identify this failure mode as authority confusion: untrusted resources may inform reasoning, but they must not authorize side effects. We present AIRGuard, a runtime guard that operationalizes least privilege as action‑time authorization. AIRGuard normalizes heterogeneous tool calls, derives task authority into step‑level authority, tracks source and target trust, simulates sensitive side effects, audits cross‑step risk, and enforces decisions before actions execute. On AgentTrap, AIRGuard reduces Sonnet 4.6 attack success from 36.3% without defense to 5.5%. On DTAP‑150, AIRGuard preserves 76.0% benign utility with Haiku 4.5, compared with 52.0% for ARGUS and 42.0% for MELON. An ablation further shows that prompt‑only policy helps only modestly, whereas a dedicated runtime authority‑control layer gives the agent system direct control over tool‑mediated side effects. Code and data are available at https://github.com/Sophie508/AIRGuard.
Authors:Yuhao Sun, Lingyun Yu, Haoxiang Xu, Fengyuan Miao, Zhuoer Xu, Hongtao Xie
Abstract:
Concept erasure has emerged as a promising approach to mitigate undesired or unsafe content in diffusion models, yet existing methods still face significant limitations. While training‑based methods are effective, their high computational cost limits scalability. Editing‑based methods are more efficient and deployment‑friendly, yet they struggle to simultaneously achieve precise concept erasure and preserve overall generative capacity. We identify this core limitation of the editing‑based methods as reliance on additive parameter updates. Our empirical analysis reveals that concept semantics primarily depend on neuron direction rather than neuron magnitude, while overall generative capacity relies on the angular geometry of neurons. As additive updates inherently entangle direction, magnitude, and angular geometry, they inevitably introduce unintended interference between concept erasure and overall generation performance. To address this, we propose Orthogonal Concept Erasure (OCE), which reformulates editing‑based erasure as multiplicative parameter updates from a geometric perspective. Specifically, OCE applies layer‑wise orthogonal transformations derived from a closed‑form solution to the parameters, enabling precise concept erasure while preserving the neuron magnitude and angular geometry. Furthermore, to address conflicting constraints in multi‑concept erasure, OCE introduces a subspace‑level objective with structured subspace manipulation, yielding a more effective and scalable erasure. Extensive experiments on single‑ and multi‑concept erasure demonstrate that OCE outperforms existing methods in concept erasure and non‑target preservation, erasing up to 100 concepts in 4.3 s. Code: https://github.com/HansSunY/OCE.
Authors:Hans Ole Hatzel, Sebastian Steindl, Jan Strich
Abstract:
LLM‑generated reviews for scientific papers are gaining considerable traction and are even being officially piloted by major conferences. We have to assume that not only reviewers are using LLM‑assistance, but also that authors use LLMs to revise their papers before submitting. In this work, we perform empirical experiments on papers from the 2025 ACL Rolling Review (ARR) to evaluate LLM reviews from both the author and the reviewer perspective. First, we identify a limited alignment of LLM reviews with human ones. In the best‑case scenario, the alignment is reasonable. However, we also find that LLM‑human alignment varies substantially across prompts and models. Finally, we investigate the scenario in which the author uses an iterative draft‑revise workflow to improve the submission according to the LLM review. We find that this "gaming" of LLM reviews can be effective in specific scenarios, leading to a statistically significant increase of overall scores for up to 35% of papers. We publish our code: https://github.com/uhh‑hcds/reviewarcade.
Authors:Jeanmely Rojas Nunez, Viraj Sawant, Nathan Allen, Nomgondalai Amgalanbaatar, Yannis Zongo, Vasu Sharma, Maheep Chaudhary
Abstract:
Fine‑tuning large language models (LLMs) frequently induces catastrophic forgetting of prior capabilities. Recent work has shown that reinforcement learning (RL) retains prior capabilities more effectively than supervised fine‑tuning (SFT), attributing this to policy‑gradient updates remaining closer to the base policy \citeshenfeld2025rl. We extend this behavioral account to the mechanistic level and ask whether RL's advantage is mirrored by stronger preservation of internal computational circuits. We introduce differential circuit vulnerability, a head‑level measure of how much a circuit degrades under fine‑tuning, and use it to compare RL and SFT on Qwen2.5‑3B‑Instruct adapted to scientific question‑answering. We find a clear mechanistic trade‑off: SFT adapts more rapidly to the target task but produces substantially greater circuit disruption and forgetting of prior capabilities, whereas RL preserves a larger fraction of the base circuit at the cost of slower task adaptation. These findings suggest that circuit preservation may help explain why RL is more robust to catastrophic forgetting. We released our code here: https://github.com/rl‑sft‑circuit‑research/differential‑circuit‑vulnerability.
Authors:Dong Liu, Yanxuan Yu, Ying Nian Wu
Abstract:
The success of large language models (LLMs) across diverse NLP tasks has elevated the importance of reasoning chain optimization as a critical step in aligning model behavior with task objectives. Existing reasoning chain tuning methods often rely on black‑box heuristics or gradient‑free search, which lack interpretability, generalization, and sample efficiency. In this work, we introduce Thoughts‑as‑Planning, a novel framework that formalizes reasoning chain optimization as a sequential decision‑making process over a latent semantic space. We model the LLM as a partially observable environment and learn a latent world model that simulates the effect of reasoning chain edits on downstream outputs. A proximity‑preserving embedding space is constructed to encode reasoning chain‑response dynamics, enabling planning via gradient descent or reinforcement learning. Our method supports multi‑scale abstraction, allowing reasoning chain edits at token, segment, and instruction levels to be integrated into a unified planner. Through extensive experiments on language understanding and generation tasks, we demonstrate that Thoughts‑as‑Planning outperforms state‑of‑the‑art reasoning chain tuning baselines in efficiency, robustness, and generalization, while offering interpretability through its structured planning trajectory. Our code is available at https://github.com/FastLM/Thoughts‑as‑Planning.
Authors:Gyumin Kim, Juhwan Park, Jaeha Kim, Seunggyun Han, Kyungrak Son, Ikbeom Jang
Abstract:
While Large Language Models (LLMs) have demonstrated remarkable capabilities, their reliability is significantly compromised by hallucinations. Existing intrinsic self‑correction methods attempt to address this, but often fail due to self‑bias, where models struggle to identify errors in their own outputs without external verification. To overcome these limitations, we propose the LDPC‑inspired semantic error correction for retrieval‑augmented generation (SERC), providing a theoretical framework to interpret and mitigate LLM hallucinations. We reformulate the text generation process as a semantic noisy channel, treating generated responses as noise‑corrupted codewords. Inspired by low‑density parity‑check (LDPC) codes, SERC employs a sparse verification strategy: instead of exhaustively checking all facts, it generates low‑density verification queries and validates them against external evidence to efficiently detect and correct errors. We evaluate SERC on LongForm Bio and TruthfulQA benchmarks using Llama‑3‑8B and Qwen2.5‑14B. Experimental results demonstrate that SERC outperforms both intrinsic self‑correction methods and strong retrieval‑augmented baselines, demonstrating significant gains especially in factual precision (FactScore). Notably, SERC enables small language models (SLMs) to surpass the performance of larger baselines in hallucination reduction and information preservation. Our findings demonstrate that SERC provides a training‑free, model‑agnostic solution that significantly reduces verification overhead compared to dense methods, achieving an optimal trade‑off between cost and fidelity in resource‑constrained environments.
Authors:Tirtharaj Dash
Abstract:
Tabular data in knowledge‑rich domains often carries a latent prior in the form of Boolean implication relationships (BIRs) between pairs of features. We mine such relationships with a sparse‑exception binomial test. The mined implications form a typed directed graph, equivalent to a propositional rule base of 2‑literal clauses. We encode this graph as the connectivity of a layered neural network, called BIRDNet, in which each hidden unit corresponds to one mined rule and binds only to its two features. We show two consequences of this design: First, the architecture is sparse by construction: at most 2/d of the weights in each BIR layer are active, where d is the input dimension. Second, the model is interpretable: every trained unit keeps a stable symbolic identity, so rules can be read off the network without surrogate models. Unlike most neurosymbolic models, BIRDNet does not consume an external rule base; its structural prior is mined from the data. We evaluate BIRDNet on six transcriptomic and proteomic benchmarks. Our results show that BIRDNet stays within 0.02 AUROC of the strongest dense baseline, at a small accuracy cost, while using up to 96× fewer active parameters than an architecture‑matched dense MLP. First‑layer rules recover known biological signatures across multiple cancer subtypes and tissue types, including canonical amplicons, lineage‑defining co‑expression modules, and immune‑infiltration markers. Data and code are available at: https://github.com/MAHI‑Group/BIRDNet.
Authors:Xinle Deng, Ruobin Zhong, Hujin Peng, Xiaoben Lu, Yanzhe Wu, Guang Li, Buqiang Xu, Yunzhi Yao, Jizhan Fang, Haoliang Cao, Junjie Guo, Yuan Yuan, Ziqing Ma, Yuanqiang Yu, Rui Hu, Baohua Dong, Hangcheng Zhu, Ningyu Zhang
Abstract:
Memory is essential for enabling large language models to support long‑horizon reasoning, yet existing memory systems remain unreliable and difficult to debug. Tracing memory's dynamic evolution is crucial to understand how information is synthesized, propagated, or corrupted over time. In this work, we study the new problem of error tracing and attribution in LLM memory systems. We propose a novel framework that transforms memory pipelines into executable memory evolution graphs, enabling fine‑grained tracing of operational information flow. We then construct MemTraceBench, a benchmark collected from representative memory systems such as Long‑Context, RAG, Mem0, and EverMemOS, to systematically study memory failure modes. We further introduce an automatic attribution method that iteratively traces operation subgraphs to pinpoint the root cause of any failed case. Our analysis reveals that memory failures are systematic, stemming from operation‑level issues like information loss and retrieval misalignment. Crucially, we leverage these fine‑grained attribution signals to guide downstream prompt optimization, establishing a closed‑loop system that automatically corrects faults and boosts end‑task performance by up to 7.62%. Code will be released at https://github.com/zjunlp/MemTrace.
Authors:Bibek Poudel, Sai Swaminathan, Weizi Li
Abstract:
Designing a transit network requires many sequential route extension decisions, but their quality is often visible only after the full network is assembled. This delayed‑feedback challenge lies at the heart of the Transit Route Network Design Problem (TRNDP), where route interactions can be deceptive: an extension that appears useful locally can create transfer bottlenecks, produce redundant overlap, or reduce overall throughput. To guide route construction under delayed simulator feedback, we introduce AlphaTransit, a search‑based planning framework for cityscale bus network design. AlphaTransit couples Monte Carlo Tree Search (MCTS) with a neural policy‑value network: the policy proposes route extensions, the value estimates downstream design quality, and search uses these predictions to refine each decision. This provides decision‑time lookahead during route construction without running simulator rollouts inside the search tree. We evaluate AlphaTransit on a new Bloomington TRNDP benchmark with realistic road topology and censusderived demand, under mixed and full transit demand settings. In the Bloomington network, AlphaTransit attains the highest service rate in both demand settings, reaching 54.6% and 82.1%, respectively. Relative to reinforcement learning without search, these correspond to 9.9% and 11.4% service rate gains; relative to MCTS without learned guidance, they correspond to 2.5% and 11.2% gains. These results suggest that coupling learned guidance with MCTS is more effective than using either approach alone for transit network design. Our code and data are publicly available in https://github.com/poudel‑bibek/AlphaTransit.
Authors:Manjiang Yu, Hongji Li, Junwei Chen, Xue Li, Priyanka Singh, Yang Cao, Lijie Hu
Abstract:
Representation intervention has emerged as a promising paradigm for aligning large language models toward desired behaviors without modifying model weights. Existing methods typically apply a fixed intervention uniformly across all inputs. However, we find that the appropriate intervention direction and strength vary substantially across samples, and such indiscriminate intervention leads to degradation of general capabilities on benign inputs. To address these challenges, we propose Multi‑Adapter Representation Interventions via Energy Calibration (MARI). Specifically, we introduce a competitive multi‑adapter mechanism in which specialized experts capture non‑linear correction patterns and adaptively determine the appropriate intervention direction and strength for different samples. Furthermore, we design an energy‑based gating module that leverages internal propagation dynamics to distinguish inputs that are applicable for intervention. Extensive experiments across diverse model families and parameter scales demonstrate that MARI achieves state‑of‑the‑art alignment performance. Our method significantly improves performance on TruthfulQA, BBQ, and safety benchmarks, while maintaining and even improving general capabilities on tasks such as MMLU and ARC. Our code is available at https://github.com/V1centNevwake/MARI.
Authors:Chusen Li, Zhou Liu, Shuigeng Zhou, Wentao Zhang
Abstract:
Large language models increasingly rely on either reinforcement learning or multi‑agent prompting to improve reasoning, yet these two paradigms remain difficult to combine. Directly applying single‑agent reinforcement learning to multi‑turn multi‑agent systems faces following dilemmas: i) Sparse rewards, role‑level free‑riding and excessive training overhead. ii) Agents only imitate to collaborate. iii) Fixed collaboration protocol falls into oscillating local optimum. We introduce TRACER, a turn‑level reinforcement framework for cooperative multi‑LLM reasoning. TRACER separates collaborative decision making into a controller‑regret layer, where controllers learn whether the agents should speak or skip the current round through regret matching, and a generation‑credit layer, which optimizes proposer and reviewer utterances with role‑specific GSPO rewards. This design i) assigns credit at the level of both action modes and generated utterances, thus avoiding free‑riding and sparse rewards. We only expand the choices made by the controllers, thus greatly reducing computational cost of training. Moreover, ii) agents acquire collaborative capability as they learn when to utter and what to speak. Finally, iii) by designing binary actions ingeniously, we extend classical game theory established for finite action spaces to deep learning, thus achieving mathematically rigorous convergence. We train all local RL‑style methods on the GSM8K training split and evaluate on held‑out GSM8K, MATH500, and GPQA‑Diamond to measure in‑domain accuracy, cross‑benchmark generalization, inference cost, and correction‑preservation behavior. The resulting framework provides a compact and reproducible testbed for studying learned collaboration policies beyond fixed debate, voting, or aggregation protocols. Code is available at https://github.com/Shark‑Forest/TRACER.
Authors:Jan Christian Blaise Cruz, Alham Fikri Aji
Abstract:
Sense representations (explicit, per‑token meaning decompositions) are useful for disambiguation, steering, and cross‑lingual alignment, but existing approaches require models to be pretrained with sense structure baked in. We introduce ACROS, which induces an explicit sense pathway into a frozen pretrained decoder LM through a gated residual addition. On SmolLM2‑360M, ACROS preserves base LM quality while supporting three uses of the same induced variables: zero‑shot word‑sense disambiguation (64.95 F1 on Raganato ALL, competitive with the WordNet first‑sense heuristic), low‑KL lexical steering across 5,161 CoInCo cases where a simple non‑oracle proxy recovers about 90% of positive shifts, and SENSIA cross‑lingual adaptation to four languages (mean R@1 0.988, target FLORES PPL 7.94). ACROS makes sense representations an inducible interface for ordinary pretrained LMs.
Authors:Yexing Du, Kaiyuan Liu, Youcheng Pan, Bo Yang, Ming Liu, Bing Qin, Yang Xiang
Abstract:
Multimodal large language models (MLLMs) have demonstrated significant potential for speech‑to‑text translation (S2TT). However, existing deployment paradigms face critical challenges: pure on‑device models suffer from resource constraints, while centralized cloud systems incur severe privacy risks and bandwidth bottlenecks by transmitting raw voice data. Furthermore, most models exhibit English‑centric biases, restricting many‑to‑many translation scaling. In this paper, we propose Edge‑cloud Speech Recognition and Translation (ESRT), a privacy‑preserving and bandwidth‑efficient collaborative edge‑cloud MLLM framework. Specifically, we design an edge‑cloud split inference architecture that retains a lightweight speech encoder and adapter on the device, transmitting only highly compressed intermediate features to the cloud. This fundamentally prevents voiceprint leakage and reduces bandwidth requirements by up to 10×. To overcome English‑centric bottlenecks, we introduce a multi‑task weighted curriculum learning strategy with data balancing to ensure robust cross‑lingual consistency. Extensive experiments on the FLEURS dataset demonstrate that our models, ESRT‑4B and ESRT‑12B, achieve state‑of‑the‑art many‑to‑many S2TT performance across 45 languages (45 × 44 directions). Code and models are released to facilitate reproducible, privacy‑aware MLLM S2TT research. The code and models are released at https://github.com/yxduir/esrt.
Authors:Haonan Wen, Hanyang Chen, Songhe Feng
Abstract:
Irregular multivariate time series forecasting is critical in many real‑world applications, where time series are irregularly sampled and exhibit dynamically evolving missingness patterns. Although existing methods perform well in offline settings, they often suffer from significant performance degradation when deployed online due to dynamic shifts in data distribution. Maintaining forecasting capability in such dynamic scenarios typically necessitates online adaptation techniques. Since irregular sampling fundamentally undermines temporal continuity and periodicity, we cannot leverage these widely studied characteristics from regular MTS for online learning. To this end, we study the problem of online IMTS forecasting and propose Under‑Cali, an uncertainty‑driven dual‑expert calibration framework consisting of three core components: an uncertainty estimator, a dual‑expert calibration module, and an adaptive routing module. We design an uncertainty estimator that serves as the core control signal to jointly manage inference and adaptation processes. In our framework, the uncertainty estimator first assesses uncertainty for each incoming batch. The adaptive routing module then directs samples with high uncertainty to the unreliable expert for calibration, while low uncertainty samples remain with the reliable expert. Subsequently, the system updates the reliable expert and the uncertainty estimator using well‑calibrated reliable samples, and updates the unreliable expert with challenging samples, enabling stable and efficient online learning. Under‑Cali keeps the source forecasting model frozen and performs adaptation only through a lightweight, model‑agnostic calibration module, enabling efficient adaptation. Extensive experiments on IMTS benchmarks demonstrate consistent improvements with low computational cost. Our code is available at https://github.com/HaonanWen/Under‑Cali.
Authors:Katharina Deckenbach, Haritz Puerto, Jonas Geiping, Sahar Abdelnabi
Abstract:
The validity of AI safety evaluations depends on models behaving consistently across controlled and deployment settings. Prior work has identified test‑time contextual cues, such as hypothetical scenarios, as a source of verbalized evaluation awareness and subsequent behavioral shift. In this paper, we investigate a potential explanation of this phenomenon: evaluation meta‑knowledge, defined as parametric knowledge about the structural traits that characterize evaluations. Similar to dataset contamination, where benchmark exposure leads to higher performance through memorization, we hypothesize that models trained on texts describing evaluation practices may implicitly learn to recognize and respond to evaluation‑like contexts, for instance, through exposure to scientific articles or social media posts about AI benchmarking. To test this, we fine‑tune models on synthetic documents describing evaluation traits such as verifiable structures or moral dilemmas. Evaluating this fine‑tuned model on six safety benchmarks, we find that it is significantly safer than the base model and control model. This behavioral shift persists even when restricting the analysis to responses lacking explicit verbalization of evaluation awareness. Our results demonstrate that evaluation meta‑knowledge may inflate safety benchmark performance, introducing a novel confounder that is independent of explicit memorization or verbalized evaluation awareness, thus, challenging to detect. These findings have important implications for the design and interpretation of AI safety evaluations. Our code and models are available at https://github.com/compass‑group‑tue/arxiv2026_evaluation_meta_knowledge.
Authors:Xiaoyu Dong, Zhi Li, Xiao-Ming Wu
Abstract:
Large language models (LLMs) have recently advanced text‑driven 3D generation, yet Text‑to‑CAD remains far from supporting industrial product design. Existing benchmarks focus primarily on generating single‑part CAD models and evaluate them using geometric similarity metrics that fail to capture functionality, manufacturability, and assemblability. To address this gap, we introduce MUSE, a Text‑to‑CAD benchmark focused on complex, editable boundary representation (B‑Rep) assemblies. MUSE pairs practical design instances with structured Design Specifications and evaluates generated models through a three‑stage protocol: code check, geometric check, and design‑intent alignment. The final stage uses design‑specific rubrics to assess functionality, manufacturability, and assemblability, moving beyond shape matching toward practical design quality. To enable scalable evaluation, we use a rubric‑based visual language model (VLM) judge and validate its reliability through human annotation. Experiments on closed‑source and open‑source LLMs reveal a clear failure cascade from executable code to valid geometry and finally to engineering‑ready design, with even the strongest models achieving limited success on fine‑grained engineering criteria. Together, MUSE provides a realistic benchmark and evaluation framework for advancing Text‑to‑CAD from geometric generation toward true engineering design. Our project website, including the leaderboard, dataset, and code, is available at https://dong7313.github.io/muse‑benchmark/.
Authors:Jeong Hun Yeo, Chae Won Kim, Hyeongseop Rha, Yong Man Ro
Abstract:
Existing Visual Speech Recognition (VSR) systems commonly rely on left‑to‑right autoregressive decoding, which can force premature decisions on visually ambiguous tokens before sufficient context is available. We propose DLLM‑VSR, to the best of our knowledge, the first Diffusion Large Language Model (DLLM)‑based VSR framework, formulating transcription as iterative masked denoising with flexible‑order decoding. With confidence‑based unmasking, DLLM‑VSR commits high‑confidence positions early and uses the committed tokens as bidirectional context to refine ambiguous ones. To adapt DLLMs to VSR, we introduce a two‑stage masked‑denoising training strategy that separates visual‑to‑text content alignment from length modeling. We further observe a performance gap with oracle‑length decoding, which assumes access to the true transcript length, indicating that reducing target‑length uncertainty can improve DLLM‑based VSR. To reduce this gap, we develop length‑guided candidate decoding, which uses video duration to construct plausible transcript‑length hypotheses, decodes under multiple hypotheses, and reranks candidates using length plausibility and decoding confidence. The proposed method achieves a state‑of‑the‑art WER of 19.5% on LRS3 using only its labeled training data.
Authors:Peng Cui, Jiahao Zhang, Lijie Hu
Abstract:
While Contrastive Learning (CL) has revolutionized self‑supervised representation learning, its latent representations remain highly entangled and opaque, limiting their interpretability in safety‑critical applications. We identify that a fundamental cause of this entanglement is the reliance on deterministic similarity measures, which treat all feature dimensions equally. In compositional scenes, this creates an Optimization Conflict: common background features, such as, "blue sky", are encouraged to align in positive pairs but simultaneously repelled in negative pairs, causing gradient oscillations that hinder precise semantic disentanglement. To address this, we propose BayesNCL (Bayesian Gated Non‑Negative Contrastive Learning). Unlike standard approaches, BayesNCL introduces a probabilistic gating mechanism that dynamically filters out task‑irrelevant, high‑frequency common features while selectively retaining discriminative semantics. By formalizing feature selection as a variational inference problem with a sparse Bernoulli prior, our method effectively resolves the optimization conflict. Empirical experimental results on Imagenet‑100 demonstrate that BayesNCL achieves a remarkable 142.1% improvement in semantic consistency compared to state‑of‑the‑art baselines, yielding highly interpretable representations without compromising downstream task performance. Code is available at https://github.com/Cui‑Peng‑624/BayesNCL.
Authors:Yansong Ning, Mianpeng Liu, Jingwen Ye, Weidong Zhang, Hao Liu
Abstract:
Hybrid‑reasoning large language models (LLMs) expose explicit controls over reasoning effort, allowing users or systems to trade off answer quality against inference cost. However, existing methods for adaptive thinking‑mode selection are typically evaluated under different models, datasets, and implementation assumptions, making it difficult to compare their practical behavior. We introduce HRBench, a unified evaluation framework for studying thinking‑mode switching in hybrid‑reasoning LLMs. HRBench organizes the design space along two axes: three switching strategy families, prompt‑based selection, external routing, and speculative execution, and four training regimes, training‑free, SFT, offline and online RL, yielding 12 controlled evaluation settings. We evaluate these settings across 6 LLMs, from Qwen3.5‑2B to Kimi‑K2.5‑1.1T, and 5 reasoning benchmarks covering mathematics, science, and code, while reimplementing 12+ representative prior methods within the same pipeline. Our analysis characterizes how different switching strategies occupy distinct effectiveness‑efficiency trade‑off regions: prompt‑based methods often provide favorable token‑accuracy trade‑offs, routing methods offer more stable cost reduction, and speculative methods tend to improve accuracy at higher token cost. We further find that training affects strategies differently, and that the preferred strategy varies with model scale and task domain. HRBench provides reference implementations and a unified evaluation platform to support more controlled research on efficient reasoning in hybrid‑reasoning LLMs. Our data, code and repository are available at https://github.com/usail‑hkust/HRBench.
Authors:Yanhui Sun, Wu Liu, Haifeng Ming, Xinru Wang, Hantao Yao, Yongdong Zhang
Abstract:
E‑commerce platforms have begun recruiting crowdsourced jurors to adjudicate massive volumes of transaction disputes. Unlike formal legal judgment, E‑commerce dispute verdicts require grounding pivotal clues from redundant, multi‑round, multimodal evidence and making decisions under flexible platform‑specific conventions. These characteristics render existing methods insufficient for this scenario. To bridge this gap, we introduce a pioneering task, E‑commerce Dispute Verdicts (EDV), and present VerdictBench, a multimodal benchmark comprising 6,000 real‑world cases designed to reflect crowdsourced jury decisions. Building upon this, we propose CyberJurors, a multi‑agent framework to clarify the dispute logic and regulate the verdict process. At the individual level, Individual Verdict Chain‑of‑Thought decomposes the EDV task into four structured reasoning stages, enabling fine‑grained clue perception and clarifying causal logic between pivotal clues and the dispute focus. At the collective level, Jury Consensus Verdict simulates multi‑round discussion and voting among jurors, while incorporating verdict precedents to mitigate cognitive biases toward either disputant. Experiments on VerdictBench show that CyberJurors outperforms state‑of‑the‑art LLMs, MLLMs, and court simulators, while achieving stronger alignment with real‑world jury voting patterns. Code and dataset are available at https://github.com/YanhuiS/CyberJurors and https://huggingface.co/datasets/piggi/VerdictBench.
Authors:Shuaike Li, Kai Zhang, Xianquan Wang, Jiachen Liu, Shengpeng Mo
Abstract:
While Knowledge Editing (KE) enables efficient updates, its dominant Static Fact Overwriting paradigm treats LLMs as discrete databases, forcibly injecting isolated facts. Fracturing pre‑trained logical topologies, this triggers Epistemic Dissonance ‑‑ a pathology where un‑evolved legacy priors force the model to explicitly negate the injected update. Idealized interventions reveal that this is an inherent structural flaw rather than mere algorithmic noise, with a zero‑distortion proxy yielding a catastrophic 95.6% self‑refutation rate. Given the causally driven nature of real‑world knowledge, grounding updates in explicit causal narratives effectively collapses this conflict rate to just 6.6%, underscoring the imperative for a paradigm shift toward Causal Editing. To internalize this evolution, we propose CODE (Causal On‑policy Distillation for Editing). By coupling causal bootstrapping with asymmetric on‑policy distillation, CODE engraves causal transition logic directly into parametric memory. Experiments on LLaMA‑3.1 and Qwen‑2.5 show CODE drastically suppresses self‑refutation to 1.8% while securing robust multi‑hop accuracy (up to 83.5%), seamlessly transforming discrete fact injection into coherent knowledge evolution. Code is available at https://github.com/CrashBugger/CODE.
Authors:Hongru Hou, Tiehua Mei, Denghui Geng, Jinhui Huang, Ao Xu, Hengrui Chen, Jiaqing Liang, Deqing Yang
Abstract:
Proactive Recommender Systems (PRSs) aim to guide user preference shift toward target items by generating paths of intermediate recommendations. Reinforcement learning (RL) provides a principled framework for optimizing such sequential decision tasks, as path rewards can naturally capture both short‑term acceptance and long‑term guidance effectiveness. However, naively applying policy gradients to PRS results in deficient gradient estimation. We identify two deficiencies: (1) path‑level rewards decompose into step‑level rewards with positive mean, creating a length‑dependent bias that causes gradients to favor path extension over meaningful exploration; (2) weighting each step by the entire path‑level reward ignores the decomposition structure, leading to high gradient variance. To rectify these two deficiencies, we propose an effective RL framework ProRL with two novel mechanisms for proactive recommendation. First, Stepwise Reward Centering subtracts expected rewards to neutralize length‑dependent bias, ensuring that path extension yields zero expected gradient signal. Second, Position‑Specific Advantage Estimation leverages the reward decomposition structure to compute step‑dependent baselines, reducing gradient variance. Together, these mechanisms yield policy gradients that precisely target path quality. Our experiments on three real‑world datasets demonstrate that ProRL significantly outperforms state‑of‑the‑art PRSs. Our code is available at https://github.com/hongruhou89/ProRL.
Authors:Rui Lin, Chuanming Wang, Huadong Ma
Abstract:
With the rapid development of pre‑training technologies, adapting large‑scale Vision‑Language Models (VLMs) for video understanding \emph\ie image‑to‑video transfer learning has become a dominant paradigm. To achieve superior performance, it raises as an effective strategy among recent advances to employ Mixture‑of‑Experts (MoE) to enhance VLMs' temporal modeling capabilities. However, conventional MoE designs suffer from expert homogenization, where all experts act as identical generalists, inefficiently learning spatio‑temporal features from undifferentiated video streams. To overcome this problem, we propose VidPrism, a novel heterogeneous temporal Mixture‑of‑Experts framework. VidPrism pioneers a division of labor by deploying functionally specialized experts, each assuming a role ranging from spatial understanding to temporal modeling. To feed these specialists appropriately, we introduce a content‑aware, multi‑rate sampling module that dynamically generates streams ranging from semantically rich to motion‑focused representations, providing specialized inputs for experts. Furthermore, a dynamic, bidirectional fusion mechanism enables synergistic information exchange between these pathways, leading to a comprehensive video representation. Extensive experiments on various video recognition benchmarks demonstrate that VidPrism achieves state‑of‑the‑art performance and effectively fosters expert specialization. Our source code is available at \hrefhttps://github.com/Lrrrr549/VidPrism.githttps://github.com/Lrrrr549/VidPrism.git.
Authors:Junghoon Lim
Abstract:
Irregular Multivariate Time Series (IMTS) are common in practice, yet their irregular sampling complicates effective modeling. Existing approaches typically either (i) design specialized architectures that limit the reuse of proven Multivariate Time Series (MTS) models, or (ii) map IMTS onto regular temporal grids through interpolation, which may distort temporal dynamics by introducing artificial values. To address these limitations, we propose a new input‑embedding‑based approach. We identify that the key bottleneck lies not in the backbone architecture, but in conventional embedding layers that assume uniform sampling. In this work, we introduce QuITE (Query‑Based Irregular Time Series Embedding), a simple yet effective plug‑and‑play embedding module for IMTS. QuITE employs learnable query tokens to aggregate irregular observations through a single self‑attention layer, directly producing backbone‑compatible latent representations without artificial value generation or architectural modification. Extensive experiments on real‑world benchmarks show that QuITE consistently improves MTS models, yielding average relative gains of up to 54.7% in forecasting and 15.8% in classification across diverse datasets and backbone architectures. Code is available at: https://github.com/Meaningfull9502/QuITE.
Authors:Zerui Chen, Qinggang Zhang, Zhishang Xiang, Zhimin Wei, Linfeng Gao, Xiao Huang, Zhihong Zhang, Jinsong Su
Abstract:
Graph‑based Retrieval‑Augmented Generation (GraphRAG) advances flat document retrieval by structuring knowledge as relational graphs, enabling more coherent and effective reasoning. However, applying it to specific domains like legal reasoning faces critical challenges. (i) Legal corpora are heterogeneous, containing multi‑granular knowledge from cases, articles and interpretations. A flat knowledge graph cannot adequately differentiate between factual details, applied rules, and abstract principles, limiting accurate retrieval. (ii) Reliable legal judgment demands transparent, evidence‑based reasoning. Traditional RAG passes retrieved context directly to an LLM without verification, resulting in opaque, error‑prone reasoning. To this end, we propose LegalGraphRAG, a framework designed for reliable legal reasoning. Our approach introduces two core components: a hierarchical legal graph that hierarchically organizes legal sources to enable retrieval at appropriate abstraction levels, and a multi‑agent system for reliable legal reasoning, where a Researcher retrieves candidate evidence, an Auditor rigorously verifies its validity against source documents, and an Adjudicator synthesizes the set of verified evidence to render a final judgment. Extensive experiments show that LegalGraphRAG achieves the state‑of‑the‑art performance, outperforming existing GraphRAG baselines in accurate and trustworthy legal analysis. Our code, datasets and implementation details are available at https://github.com/XMUDeepLIT/LegalGraphRAG.
Authors:Yaoyang Luo, Zhi Zheng, Ziwei Zhao, Tong Xu, Zhao Jielun, Wenjun Xue, Yong Chen, Enhong Chen
Abstract:
Recent years have witnessed the rapid development of Large Language Model‑based Multi‑Agent Systems (MAS), which excel at collaborative decision‑making and complex problem‑solving. However, malicious agents in MAS may inject misinformation to mislead other agents and disrupt system performance, giving rise to a new research direction that focuses on attack mechanisms and defense strategies in MAS. Prior studies largely assume malicious agents act independently and investigate the corresponding defense strategies. However, we argue that malicious agents may exhibit collaborative behaviors, enabling more effective attacks through internal information exchange. In this paper, we propose an adaptive cooperative attack framework, where malicious agents autonomously coordinate and dynamically adjust their attack strategies through multi‑round interactions. Furthermore, we introduce Sentence‑Level Trustworthiness Analysis and Rectification (STAR), a defense framework that identifies and rectifies misleading information at the sentence level within agent communications. Our experiments show that cooperative attacks lead to a significantly larger degradation in task success rate than independent attacks, resulting in a relative drop of 5.34%. Meanwhile, STAR effectively mitigates both cooperative and independent threats and improves task success rate by an average of 36.76%. The code is available at https://github.com/smoooom/STAR.
Authors:Chong Jing, Zitong Lan, Junan Zhang, Zhizheng Wu
Abstract:
Predicting spatially varying Room Impulse Response (RIR) from sparse observations is a critical but highly challenging inverse problem for immersive spatial audio rendering. In this work, we present EIGENET, a geometry‑informed multi‑modal framework for few‑shot novel view RIR prediction. At its core is a Cross‑view Alternate‑attention Transformer that iteratively refines local intra‑view acoustic structures and global cross‑view spatial relationships. We empirically demonstrate that this architecture is capable of making full use of the multi‑view multi‑modal context while performing spatial‑temporal reasoning for RIR prediction. Inspired by acoustic ray tracing, we design a geometry‑informed modulation block to formulate the connection between geometric features and RIR power spectrum. In the mean time, an auxiliary loss is introduced to transform the single‑target waveform prediction into a multi‑task learning framework. Through ablation studies, we demonstrate that this design yields consistent performance gains regardless of the underlying backbone, thereby confirming its foundational utility and architecture‑agnostic generalizability for RIR prediction task. Evaluated on both simulated and real‑world benchmarks, EIGENET achieves both state‑of‑the‑art performance in few‑shot novel view RIR prediction and sim‑to‑real generalization. Codes and checkpoints are available on https://github.com/FEAfeatherTHER/EigeNet.
Authors:Lee Jung-Mok, Kim Sung-Bin, Joohyun Chang, Lee Hyun, Tae-Hyun Oh
Abstract:
Laughter is a complex social signal that conveys communicative intent beyond amusement. While prior work has focused on isolated laughter analysis tasks, a comprehensive understanding of laughter in real‑world scenarios remains underexplored. Therefore, we introduce SMILE‑Next, a dataset for real‑world laughter understanding with multimodal textual representations and question‑answer annotations across three tasks: laughter detection, laughter type classification, and laughter reasoning. Building upon SMILE‑Next, we aim to develop a laughter‑specialized large language model capable of nuanced understanding of laughter in real‑world contexts. To this end, we propose two key components: laughter‑specific Self‑Instruct and the Mixture‑of‑Laugh‑Experts (MoLE) framework. Laughter‑specific Self‑Instruct enhances generalization across tasks and domains by automatically synthesizing diverse laughter‑centric instructions. MoLE introduces a task‑adaptive expert routing mechanism that dynamically selects specialized experts tailored to each laughter‑related task, improving task‑specific performance and efficiency. Experimental results show that the combination of our proposed components substantially outperforms multimodal LLM baselines, advancing robust real‑world laughter understanding. Project page is at: https://mok0102.github.io/smile‑next/.
Authors:Chuang Tang, Chenhao Lin, Yin Xu, Hao Wang, Jinrui Zhou, Xin Li, Mingjun Xiao, Enhong Chen
Abstract:
Parsing chemical reaction diagrams from scientific literature is challenging due to heterogeneous layouts, intertwined visual elements, and the difficulty of integrating recognition and reasoning. Existing vision‑language models advance multimodal understanding but still fail on complex diagrams, struggling to maintain spatial coherence and to integrate multidimensional information during reasoning. To address these issues, we propose MACReD, a hierarchical multi‑agent framework that coordinates specialized agents for molecular perception, arrow understanding, text extraction, and reaction reconstruction within a unified VLM‑guided architecture. The planning and perception layers use flexible, fine‑grained detection to handle visual complexity, while the reasoning layer uses a multigraph fusion mechanism to integrate heterogeneous cues and enforce chemically consistent global reasoning. Experiments on the RxnScribe benchmark show that MACReD achieves state‑of‑the‑art performance, with F1 scores of 75.2% and 84.6% under hard and soft match criteria, outperforming the RxnScribe baseline, which obtains 69.1% and 80.0%, respectively. These results demonstrate the robustness of MACReD across diverse diagram layouts, including multi‑step and tree‑structured reactions.
Authors:Stanislav Kirdey, Clark Labs Inc
Abstract:
Clark Hash is a small method for storing neural embeddings in less space. It normalizes each database vector, applies a deterministic sparse signed Johnson‑Lindenstrauss projection, clips the result, and stores a fixed‑width scalar‑quantized code. Queries stay in floating point and are scored against the stored sketches. In the default 384‑dimensional sentence‑embedding setting, Clark Hash stores a cosine‑search vector in 48 bytes instead of 1536 bytes for dense f32 storage. This is 32x smaller. The method does not need a training pass, learned codebooks, rotations, or corpus statistics before new vectors can be stored. We describe the codec, the Rust implementation, and a multilingual sentence‑similarity evaluation on 9,304 labeled pairs from 29 subsets. With a multilingual MiniLM encoder, the 48‑byte sketches reached 0.910 and 0.946 macro Pearson correlation with dense cosine scores on STS17 and STS22. Clark Hash is not a new Johnson‑Lindenstrauss theorem and it is not a replacement for approximate nearest‑neighbor indexes. It is a simple stateless codec for compact embedding storage.
Authors:Shuhao Chen, Weisen Jiang, Yeqi Gong, Shengda Luo, Chengxiang Zhuo, Zang Li, James T. Kwok, Yu Zhang
Abstract:
Fine‑tuning large language models often undermines their safety alignment, a problem further amplified by harmful fine‑tuning attacks in which adversarial data removes safeguards and induces unsafe behaviors. We propose SPARD, a defense framework that integrates Safety‑Projected Alternating optimization with Relevance‑Diversity aware data selection. SPARD employs SPAG, which optimizes alternatively between utility updates and explicit safety projections with a set of safe data to enforce safety constraints. To curate safe data, we introduce a Relevance‑Diversity Determinantal Point Process to select compact safe data, balancing task relevance and safety coverage. Experiments on GSM8K and OpenBookQA under four harmful fine‑tuning attacks demonstrate that SPARD consistently achieves the lowest average attack success rates, substantially outperforming state‑of‑the‑art defense methods, while maintaining high task accuracy. Code is available at https://github.com/shuhao02/SPARD.
Authors:Swanand Rao
Abstract:
Large language model agents are increasingly expected to perform operational work: calling APIs, manipulating files, assembling workflows, and acting inside enterprise systems. Yet the tool layer on which this execution depends is still commonly treated as either a hand‑written integration artifact or a static list of schemas exposed to a model. This paper introduces Tool Forge, a validation‑carrying toolchain for converting natural‑language capability intent into governed, sandbox‑verified, cataloged tool artifacts and exposing those artifacts to agents through a token‑efficient routing layer. Tool Forge treats a tool as a capsule containing intent, capability contract, implementation, dependency policy, tests, documentation, runtime validation evidence, lifecycle state, credential bindings, and routing metadata. It also introduces a Router that exposes intent‑scoped tool sessions instead of loading full catalog schemas into the model context. We describe the system architecture, validation pipeline, MCP‑facing routing model, governance controls, and initial reproducible benchmarks from the open‑source implementation. Across 83 Router benchmark cases, Tool Forge Router achieves aggregate micro‑F1 of 0.901 while reducing estimated task‑flow tool context by 99.2% relative to naive full‑catalog schema exposure. In a 25‑case end‑to‑end generation probe over local‑tool tasks, Tool Forge generates 25 of 25 tool bundles, reaches micro‑F1 of 0.940 against deterministic acceptance checks, and passes 23 of 25 live sandbox validations. These results are presented as an initial systems benchmark, not as a state‑of‑the‑art claim. The paper identifies remaining challenges in adversarial routing, broader API grounding, sandbox isolation, and cross‑system evaluation.
Authors:Kou Shi, Ziao Zhang, Shiting Huang, Avery Nie, Zhen Fang, Qiuchen Wang, Lin Chen, Huaian Chen, Zehui Chen, Feng Zhao
Abstract:
Large language model (LLM)‑based agents have shown strong capabilities in using external tools to solve complex tasks. However, existing evaluations often overlook the temporal dimension of tool use, especially the impact of tool response latency, and are usually limited to single‑task settings. In real‑world applications, multiple tasks often need to be executed concurrently, and overall efficiency depends on whether an agent can use idle time while waiting for tool responses. We refer to this capability as asynchronous tool calling. To evaluate it, we propose AsyncTool, a benchmark for assessing LLM‑based agents in interactive multi‑task tool‑use environments with delayed tool feedback. AsyncTool presents multiple heterogeneous tasks simultaneously and simulates realistic tool response latency during execution. Using a hybrid data evolution strategy, we construct a diverse asynchronous multitasking dataset that covers multiple scenarios and tool‑use patterns. We evaluate models at the step, sub‑task, and task levels, and introduce efficiency‑oriented metrics to measure task coordination and completion efficiency. Extensive experiments show that delayed tool feedback poses substantial challenges to current agents and leads to clear performance degradation. Models that better coordinate task switching, dependency tracking, and state maintenance achieve stronger performance on AsyncTool. Our analysis identifies key failure modes of current tool‑using agents and provides practical insights for designing future systems with stronger temporal reasoning and coordination capabilities.
Authors:Seunghyeok Shin, Minwoo Kim, Dabin Kim, Hongki Lim
Abstract:
Diffusion posterior sampling conditions diffusion priors on measurements, but data‑consistency updates are typically scaled by hand‑tuned guidance weights and can destabilize sampling under stiff, operator‑dependent curvature. We replace scalar guidance with a per‑noise‑level damped Gauss‑‑Newton correction computed in diffusion‑state coordinates. The correction pulls likelihood gradients back through the denoiser, uses a one‑sided curvature model that avoids forward denoiser Jacobians, and applies diffusion‑calibrated rank‑one damping aligned with the denoiser residual. Each correction is solved with matrix‑free GMRES using automatic differentiation, and sampling proceeds with a variance‑preserving Langevin transition with a closed‑form drift/noise split. On FFHQ and ImageNet across inverse problems, it achieves competitive PSNR/SSIM/LPIPS while running markedly faster than most of the compared baselines; on accelerated MRI reconstruction, it achieves the best PSNR/SSIM among the compared baselines.
Authors:Soohan Lim, Joonghyuk Hahn, Hyundong Jin, Yo-Sub Han
Abstract:
Evaluating the efficiency of algorithmic code requires test cases that expose runtime bottlenecks. Previous methods generate efficiency test cases either by increasing input size or by generating code‑specific inputs that make the given implementation run slowly. Consequently, they do not address the structural input conditions that drive the algorithmic worst case. We introduce STAB, a specification‑driven pipeline that generates test cases that expose algorithmic bottlenecks from a natural‑language problem specification alone. STAB separates the task into constraint‑bound maximization and adversarial structure injection. (i) The constraint saturator extracts constraints and resolves large admissible size assignments using rule‑based saturation and CP‑SAT optimization over related variables. (ii) The adversarial scenario injector retrieves implementation‑level adversarial construction principles from a curated scenario catalog using keyword matching and K‑nearest neighbors (KNN). STAB encodes the problem specification, resolved boundary, and retrieved construction principles into a structured generation specification, from which the LLM synthesizes a Python test case generator. On CodeContests, STAB raises the rate of generated test cases that expose algorithmic bottlenecks from 50.43% to 73.45% on average across open‑source LLMs and from 57.45% to 71.85% on average across closed‑source LLMs, with consistent gains across Python, Java, and C++. Our code is available at https://github.com/suhanmen/STAB.
Authors:Simin Huo
Abstract:
The ability to process ultra‑long contexts is crucial for large language models (LLMs) to perform long‑horizon tasks. While recent efforts have extended context windows to 1M and beyond, model performance degrades when sequence length exceeds the pre‑trained range of positional encodings (e.g., RoPE), i.e., position exhaustion. This fundamental limitation must be overcome to achieve a truly infinite context. To address it, we propose Periodic RoPE (P‑RoPE), a positional encoding mechanism designed to circumvent this exhaustion. It operates in conjunction with sliding window attention (SWA) to capture local dependencies and relative positions within each window. This local layer is then complemented by a global attention layer with No Positional Encoding (NoPE), enabling unbounded interaction across the entire sequence without positional constraints. By stacking these two types of layers, the model avoids the need for positional extrapolation to generalize longer and theoretically supports an infinite context window. Empirical results show that our model, MiniWin, outperforms MiniMInd with standard GPT architectures in long‑context efficiency and stability. Our work provides a possible pathway toward LLMs with genuine infinite‑context understanding. The code is available at \hrefhttps://github.com/Cominder/miniwinhttps://github.com/Cominder/miniwin.
Authors:Jie Zhu, Huaixia Dou, Shuo Jiang, Junhui Li, Lifan Guo, Feng Chen, Chi Zhang, Fang Kong
Abstract:
Existing emotional support conversation (ESC) systems mainly rely on end‑to‑end response generation or coarse strategy supervision, offering limited interpretability and little support for systematic skill improvement. We propose ESC‑Skills, a skill‑centric framework that discovers and self‑evolves executable emotional support skills. We first model localized support interactions as Intervention Units (IUs), which capture state‑‑action‑‑outcome dynamics between seeker states, support interventions, and post‑response emotional changes. Based on IUs extracted from both successful and failed ESC dialogues, we construct the ESC‑Skills Bank, a repository of executable emotional support skills containing intervention guidance, applicability conditions, expected outcomes, and potential risks. To further improve robustness, we introduce a multi‑profile self‑evolutionary refinement framework in which an ESC agent interacts with diverse simulated seeker profiles under SAGE evaluation. The resulting interaction traces are analyzed to identify missing skills, unsafe interventions, and profile‑specific failure patterns, which are then used to refine the Skills Bank through simulation‑based verification. Experimental results demonstrate that ESC‑Skills improves both response‑level quality and dialogue‑level emotional outcomes while providing more interpretable and controllable support behaviors. We will release the code, prompts, and ESC‑Skills Bank at https://github.com/aliyun/qwen‑dianjin.
Authors:Eric Onyame, Runtao Zhou, Kowshik Thopalli, Bhavya Kailkhura, Chirag Agarwal
Abstract:
Chain‑of‑thought (CoT) monitoring has been proposed as a promising safety mechanism for detecting misaligned behavior in large language models. However, its reliability remains largely unexplored beyond English and across diverse model families. We present the first large‑scale evaluation of CoT monitorability across 13 diverse languages and seven frontier model families, comprising 16 models. Using adversarial‑hint evaluations that require explicit intermediate computation, together with analysis of internal answer‑token probabilities, we consistently find CoT unfaithfulness across languages and hint types, with an average rate of 95.9% across 8B‑‑120B parameter models. We find that frontier models systematically engage in strategic manipulation, including answer‑switching, post‑hoc rationalization, and procedural exploitation of hints, making external monitors struggle to detect deception. We show that frontier models often commit to the misaligned cue in their latent activations within the first 15% of generation, even when the CoT appears faithful. Surprisingly, these deceptive patterns remain 100% in low‑resource languages, revealing fundamental limitations in current CoT‑based oversight. Our results reveal that CoT monitoring is fundamentally fragile under linguistic distribution shift, providing a substantially weaker safety signal than what English‑only studies suggest. These findings underscore an urgent need to develop robust CoT monitors and to accelerate research into white‑box monitoring techniques, especially to improve CoT monitorability in mid‑ and low‑resource languages. Our code is available \hrefhttps://multilingual‑cot‑monitoring.github.io/\textcolorbluehere.
Authors:Pengyu Zhu, Lijun Li, Yaxing Lyu, Qianxin Luo, Jingyi Yang, Yi Liu, Tingfeng Hui, Xinyu Yuan, Li Sun, Sen Su, Jing Shao
Abstract:
As LLMs are increasingly deployed as agents, reliable assessment of their agentic capabilities has become essential. However, reported benchmark scores often jointly reflect model capability and the implementation choices each benchmark is packaged with, making cross‑benchmark results difficult to interpret as clean measurements of the underlying model. In this work, we present a unified framework for the fair evaluation of LLM agentic capabilities. Driven by a unified configuration system, the framework integrates diverse benchmarks into a standardized instruction‑‑tool‑‑environment format, executes agents through a fixed ReAct‑style architecture within a controllable sandbox, and provides an optional offline setting that replaces volatile live environments with curated snapshots, so that framework effects and environment effects can be analyzed separately. Building on this, we unify the evaluation methodology under each benchmark's original task‑success criteria, while introducing unified metrics for resource consumption and a taxonomy for decision‑ and execution‑level failure attribution. Within this framework, we adapt 7 widely used benchmarks spanning 24 domains across single‑agent, multi‑agent, and safety‑critical scenarios, and conduct a large‑scale empirical analysis over 400K rollouts and 5B tokens on 15 models. The results show that scaffold choice and environmental volatility materially shift benchmark outcomes in both directions, allowing our framework to disentangle intrinsic LLM capabilities from framework‑ and environment‑induced artifacts. We further demonstrate its extensibility as a secure testbed for safety‑critical domains. Codes and benchmarks at are available at https://github.com/whfeLingYu/A‑Unified‑Framework‑for‑the‑Evaluation‑of‑LLM‑Agentic‑Capabilities, https://huggingface.co/AgentFramework/Unified_Farmework.
Authors:Yuxuan Zhao, Sijia Chen, Ningxin Su
Abstract:
Large language models (LLMs) have shown strong performance across diverse financial tasks, yet portfolio management (PM), a critical financial decision‑making task, remains poorly benchmarked. Existing benchmarks exhibit two main gaps: they ignore cross‑asset correlation structures, thereby failing to distinguish genuinely diversified portfolios from concentrated ones, and fail to evaluate the complete PM decision pipeline in real‑world scenarios. We introduce PortBench, a benchmark spanning six heterogeneous asset classes over ten years. PortBench consists of two complementary layers: a static QA dataset of 6,269 correlation‑based questions across seven task templates, and a dynamic five‑stage allocation pipeline that mirrors the full PM decision cycle. To evaluate these layers, we introduce two dedicated metrics: a dual‑layer correlation score that measures whether proposed portfolios exploit inter‑class hedging and avoid intra‑class concentration, and CEPS, a metric that quantifies how reasoning errors compound across pipeline stages. We further assess strategy robustness and investor alignment under three historical stress regimes and risk profiles. Evaluating ten frontier LLMs, we find that despite strong performance on static financial QA, 90% of model‑profile combinations fail to outperform a basic equal‑weight allocation, and models that satisfy every procedural constraint still suffer catastrophic drawdowns under stress. Our source code is available at \hrefhttps://github.com/AgenticFinLab/portbenchthis https URL.
Authors:Shubhashis Roy Dipta, Ankur Padia, Francis Ferraro
Abstract:
Claim verification splits between end‑to‑end classifiers that are accurate but yields no inspectable traces, and decomposition‑based methods produce inspectable traces but lag performance on benchmark datasets. We propose DecomposeRL an accurate claim‑verifier that produce inspectable traces. DecomposeRL frames decomposition as an RL policy trained with GRPO and a multi‑faceted reward ensemble, enabling both fully supervised and semi‑supervised learning from unlabeled claims. DecomposeRL addresses the prohibitive training cost of GRPO with a data‑curation funnel that distills 115K fact‑verification claims into a compact, learning‑signal‑dense subset of 5K claims. We show that a DecomposeRL‑7B policy trained with full supervision on only ~5K curated claims achieves 86.3 in‑domain and 69.8 out‑of‑domain balanced accuracy across 11 claim‑verification benchmarks containing biomedical, political, scientific, and general‑domain claims. Despite being 4x smaller, it matches 32B baselines and GPT‑4.1‑mini, and it further outperforms baselines in a semi‑supervised setting with only 10% labeled claims data. Code, data, and models are available at https://dipta007.github.io/DecomposeRL
Authors:Zhisheng Zhang, Xiang Li, Yixuan Zhou, Jing Peng, Guoyang Zeng, Zhiyong Wu
Abstract:
Audio tokenizers are fundamental to unifying audio understanding and generation. Understanding requires high‑level semantics, while generation demands semantic and acoustic details. Existing unified tokenizers jointly encode both in high‑dimensional continuous latents, which increases the modeling burden of Diffusion Transformers (DiTs) for generation. We propose LoSATok, a low‑dimensional audio tokenizer for cross‑domain audio understanding and generation. Motivated by the observation that 1280‑dimensional semantic encoder features are compressible, we introduce a Semantic Bottleneck that compresses them into 128 dimensions, regularized by the proposed time‑relation loss for temporal feature consistency. We further design a dual‑level semantic supervision method that leverages both high‑ and low‑dimensional semantic signals, enabling the tokenizer to jointly capture semantics and acoustic details within a compact latent space. Experiments on speech, music, and general audio show that SemBo preserves strong low‑dimensional semantic capacity and LoSATok retains competitive understanding performance compared with several semantic representations, while consistently improving DiT modeling performance on speech, music, and audio generation. These results demonstrate that LoSATok's low‑dimensional representations can effectively support audio understanding and generation. Our code is provided at https://github.com/wxzyd123/LoSATok.
Authors:Yanyan Luo, Xue Han, Chunxu Zhao, Ruiqiao Bai, Yaxing Zhang, Qian Hu, Lijun Mei, Junlan Feng
Abstract:
While LLMs enable personalized chatbots, their effectiveness in child‑centered personalization remains unclear, as systematic evaluation of child‑specific preferences is still lacking. To address this gap, we introduce ChildEval, a benchmark for evaluating LLMs' ability to infer and follow child‑centered preferences in long‑context conversations. ChildEval contains 29K synthesized persona profiles of children aged 3‑6, providing relatively static background information. Each persona is associated with a child preference‑which may align with, conflict with, or be independent of the persona‑expressed either explicitly in a single sentence or implicitly through 6‑10 turn dialogues. Explicit and implicit preferences are designed to reflect the same underlying preference but differ in expression, capturing dynamic aspects of preference expression rather than changes in the static persona. The benchmark spans five top‑level and fourteen sub‑level categories covering children's daily lives and development. We further propose fine‑grained, child‑centric evaluation protocols to systematically assess open‑source LLMs. Experimental results demonstrate how different personalized representations affect LLM responses and suggest that finetuning on ChildEval can enhance child‑centered performance. Our code and dataset are available at https://github.com/ziyanluo/ChildEval.
Authors:Tim R. Davidson, Anja Surina, Caglar Gulcehre
Abstract:
Language models are becoming the default interface to factual knowledge, yet they often verify outputs more reliably than they generate them. This generation‑verification gap (GV‑gap) underlies many recent advances in self‑improvement and reasoning, but its dynamics on factual knowledge specifically remain poorly understood. We focus on the training mechanisms underlying factual GV‑gaps, distinguishing them from their computational and aesthetic counterparts. We trace generation and verification capabilities through three training phases (acquisition, continual learning, and updating) across four open‑source model families at two scales each. Three findings recur across models: (i) verification is consistently learned before generation; (ii) verification is more robust to continual learning than generation; and (iii) factual updates can leave models in a "multi‑verse" state, simultaneously verifying both old and new answers as correct. Natural experiments on frontier models reproduce these dynamics at scale and reveal residual verification biases on well‑covered facts.
Authors:Syed Huma Shah
Abstract:
Modern retrieval‑augmented generation(RAG) deployments increasingly rely on caching to reduce token cost and time‑to‑first‑token(TTFT). Prefix‑level KV reuse is now standard in serving stacks such as vLLM, and chunk‑level and position‑independent reuse have been pushed further by recent systems(RAGCache, TurboRAG, CacheBlend, EPIC, ContextPilot, PCR, LMCache). Output‑level semantic answer caches, by contrast, remain fragile: similar prompts can map to different correct answers, retrieved evidence drifts as the corpus is updated, and adversarial collision attacks have been shown to hijack cached responses. We argue that the right framing for cached answer reuse is not how to reuse faster but when reuse is safe. We propose GroundedCache, an evidence‑validated cache router that admits a cached answer only when 4 cheap gates simultaneously hold: query similarity, retrieved‑evidence overlap, source‑version validity, and lexical (or judge‑based) support of the cached answer by the freshly retrieved evidence. We build a six‑regime workload that stress‑tests cache safety rather than only hit rate, and introduce an operator‑facing metric, the unsafe‑served rate (USR), fraction of all queries that received a wrong cached answer. Across 2 datasets and 12,000 real‑LLM generations(Qwen2.5‑7B‑Instruct on vLLM with Automatic Prefix Caching), GroundedCache drives USR to 0.0% on every HotpotQA regime(vs. 15‑35% under naive caching) and to 1.5% on mtRAG document drift(vs. 51.5%), a 34x reduction on the design‑point adversarial regime and 3‑10x reductions across the other mtRAG regimes, while end‑to‑end p50 latency stays within 1.04‑1.07x of a no‑cache RAG baseline. A per‑gate ablation isolates the lexical support gate as the load‑bearing safety mechanism on both datasets, with the remaining gates providing defense‑in‑depth at near‑zero cost. We release the implementation, workload, and evaluation harness.
Authors:Hyunmin Cho, Woo Kyoung Han, Kyong Hwan Jin
Abstract:
We characterize the pre‑softmax attention matrix \mathbfQK^\top in transformers as an associative memory matrix encoding pairwise associations between input features. By decomposing this matrix into its symmetric and skew‑symmetric parts, we interpret the symmetric component as governing the structure of the energy landscape, and the skew‑symmetric component as driving circulation on that landscape. Leveraging the energy formulation induced by the symmetric component, we derive Hopfield‑style stability measures that quantify the stability of retrieved features. We observe meaningful correlations between Hopfield‑style stability measures and the fidelity‑diversity trade‑offs in generation. Finally, we propose a controllable knob to modulate this trade‑off by modifying the circulation of the underlying dynamics. Code is available at our GitHub (https://github.com/hyeon‑cho/Attention‑Symmetric‑Decomposition).
Authors:Nicole Koenigstein
Abstract:
Multi‑agent systems built on large language models (LLMs) require many coordination choices that are difficult to fix a priori: which skill protocol to invoke, which agent role should perform a subtask, which model to bind to each role, how roles should interact, when to use retrieval or verification, and when to omit a step entirely. These choices interact with task regime and operational constraints, so static pipelines and one‑off model comparisons provide only a limited view of the design space. This paper introduces AgensFlow, an open‑source framework that treats multi‑agent coordination as an online policy‑learning problem under partial observability. The framework makes coordination decisions observable and learnable from repeated trajectories, rather than treating skill, role, model, topology, and evaluation choices as fixed pipeline design. AgensFlow is evaluated on two corpora: distributed‑systems incident tasks and security‑advisory tasks. The evaluation shows three main results: learned routing reaches a higher‑quality operating point than a fixed pipeline baseline on coordination‑heavy classes; skip:X isolates topology compression as a meaningful part of the substrate; and warm‑started policy graphs can reduce exploration cost while preserving plateau quality. Overall, the results support that learned, auditable routing can improve coordination‑heavy multi‑agent workflows over static wiring.
Authors:Chung-Ta Huang, Leopold Das, Jeffrey Zhou, Faizaan Siddique, Julia Seungjoo Baek, Serena Liu, Andrew Rusli, Todd Y. Zhou, Freddy Yu, Sinclair Hansen, Ziling Hu, Arnav Sharma, Mengyu Wang
Abstract:
AR smart glasses need continuous behavioral context to offer proactive assistance, yet their most practical always‑on sensor, the head‑mounted Inertial Measurement Unit (IMU), detects only motion primitives such as walking or standing. We push beyond motion primitives to behavioral‑level recognition, defining five categories that balance AR application need with sensor observability. To this end, we construct a 160K‑sample Ego4D dataset with a four‑tier quality assurance framework spanning 8 activity scenarios, and propose HiT‑HAR, a 703K‑parameter hierarchical model that outperforms prior head‑mounted IMU models on five‑class action and eight‑class scenario recognition. We further map the observability frontier of head‑mounted IMU through per‑class separability analysis, identifying which behavioral categories are reliably observable (Locomotion), which benefit from temporal context (Object Transfer, Task Operation), and where scenario‑dependent signal overlap poses remaining challenges. Our results indicate that architectural choices exploiting temporal context and scenario structure outperform simply scaling model size. The code and dataset are publicly available at https://github.com/Harvard‑AI‑and‑Robotics‑Lab/HiT‑HAR.
Authors:Arijit Ghosh, Aritra Bandyopadhyay, Chiranjeev Bindra, Jingfen Qiao
Abstract:
Multimodal alignment is critical for bridging the semantic gap in information retrieval. However, traditional pairwise strategies introduce a geometric blind spot: while they align anchor modalities (e.g., text) with others, they lack constraints to enforce mutual consistency between peripheral modalities (e.g., video and audio). The TRIANGLE framework addresses this by minimizing the area of modality triplets on a hypersphere to enforce holistic alignment. In this reproducibility study, we verify the robustness of this geometric objective for retrieval tasks. We confirm that TRIANGLE outperforms pairwise baselines in zero‑shot settings, achieving Recall@1 gains of up to +8.7 points, though benefits are domain‑dependent. However, we fail to reproduce the reported learning‑from‑scratch results. Analysis using a synthetic toy dataset attributes this to instability when jointly optimizing geometric alignment with Data‑Text Matching (DTM) loss. Furthermore, we find that cosine regularization primarily stabilizes text‑to‑video retrieval, and fine‑tuning with domain supervision amplifies geometric benefits but reduces cross‑dataset generalization. Our findings support the efficacy of geometric alignment while highlighting critical optimization sensitivities. Code available at https://github.com/ARIJIT00171/RE‑TRIANGLE.
Authors:Chen Wei, Fanding Xu, Minghao Sun, Zhiyuan Liu, Lin Wang, Tianrui Jia, Yihang Zhou, Yang Zhang
Abstract:
Proteins perform their biological functions through three‑dimensional structures encoded by amino acid sequences, and ligand‑binding protein co‑design requires models that generate sequence‑structure compatible proteins under explicit ligand constraints. Although continuous diffusion and flow‑based models support ligand‑aware design in coordinate or latent spaces, existing discrete diffusion protein language models mainly operate over sequence or structure tokens without direct small‑molecule conditioning. We introduce ProtLiD^2, a Protein Ligand‑conditioned Discrete Diffusion model for protein sequence‑structure co‑design. ProtLiD^2 jointly generates amino‑acid sequence and discrete structure tokens while incorporating ligand chemical and geometric information through geometry‑aware cross‑attention. Trained on over one million ligand‑protein complexes, ProtLiD^2 extends masked discrete diffusion to ligand‑aware functional protein design. We further propose maximum confidence‑margin guided ReMask decoding, an inference‑time self‑correction strategy that retains confident predictions and remasks uncertain tokens. ProtLiD^2 improves global fold confidence over Complexa in whole‑protein design, increasing TM‑score from 0.672 to 0.802 and pLDDT from 64.55 to 73.00. In pocket co‑design, ProtLiD^2 reduces active‑site BB‑RMSD from 3.46/3.40Å for FAIR/PocketGen to 1.97Å, and improves ligand‑aware pass rates over PocketGen from 14.86% to 59.73% and from 6.08% to 23.49% under stricter docking thresholds. These results support ligand‑conditioned discrete diffusion as an effective token‑space framework for functional protein co‑design. Code will be available at https://github.com/auroua/ProtLiD.
Authors:Xiangyu Ma, Teng Xiao, Zuchao Li, Lefei Zhang
Abstract:
Diffusion models promise efficient parallel text generation but rely on bidirectional attention, creating a structural mismatch with pre‑trained Autoregressive (AR) models. This incompatibility precludes reusing robust AR priors, necessitating prohibitive pre‑training from scratch. To bridge this gap, we propose FLUID, a framework that efficiently adapts AR backbones to the diffusion paradigm. By enforcing Strictly Causal Alignment, FLUID enables seamless initialization from standard GPT‑style checkpoints, circumventing the need for massive pre‑training. Furthermore, we introduce Elastic Horizons, an entropy‑driven mechanism that dynamically modulates denoising strides based on local information density rather than fixed schedules. Experiments demonstrate that FLUID achieves state‑of‑the‑art performance while reducing training costs by orders of magnitude, effectively reconciling established AR foundations with efficient parallel generation. Our code is available at https://github.com/Oli‑lab‑nun/FLUID/tree/main.
Authors:Mariano Garralda-Barrio
Abstract:
Recent advances in agentic systems increasingly treat code as an executable operational substrate rather than as a disposable output artifact. Prior work such as \emphCode as Agent Harness frames validated agent‑generated artifacts as runtime entities that can be created, executed, revised, persisted, and reused within long‑running cognitive loops. However, the governance, lifecycle management, and operational evolution of such artifacts remain under‑specified. This paper proposes a framework for governed runtime evolution in multi‑agent systems through executable operational cognition. We formalize agent‑generated artifacts as persistent runtime capabilities that progressively become part of the operational substrate rather than transient intermediate outputs. Building on this perspective, we introduce \emphHarnessMutation as a governed mechanism for lifecycle‑aware runtime adaptation operating under explicit validation, traceability, evaluation, and rollback constraints. Rather than treating runtime adaptation as unrestricted self‑modification, the proposed framework models evolution as a bounded and observable process over persistent operational memory. It further shows how these ideas can be operationalized over modern agent runtimes and governance‑oriented orchestration systems, providing a conceptual foundation for adaptive infrastructures whose evolution remains explicit, auditable, and constrained.
Authors:Bowen Li, Shaotong Guo, Zhen Wang, Yang Xiang, Mingli Jin, Yihang Lin, Jiahui Zhao, Weibo Xiong, Dongrui Zhang, Keming Chen, Yunze Gao, Zeyang Lin, Yuze Zhou, Yue Liu
Abstract:
Building state‑of‑the‑art text‑to‑speech (TTS) systems typically demands millions of hours of proprietary data and complex multi‑stage architectures, creating substantial barriers for resource‑constrained research teams. In this report, we present PilotTTS, a lightweight autoregressive TTS system that achieves competitive performance through minimalist architecture and rigorous data engineering. PilotTTS is trained on only 200K hours of data processed entirely with open‑source tools. Specifically, our contributions are: (1) a reproducible multi‑stage data processing pipeline covering quality assessment, label annotation, and filtering, and (2) a compact model architecture that employs Q‑Former‑based conditioning to decouple speaker identity from speaking style via cross‑sample paired training. Within a unified framework, PilotTTS supports zero‑shot voice cloning, emotion synthesis (11 categories), paralinguistic synthesis (4 categories), and Chinese dialect synthesis (14 dialects). On the Seed‑TTS Eval benchmark, PilotTTS achieves the lowest WER of 1.50% on test‑en, a CER of 0.87% on test‑zh, and the highest speaker similarity on both test sets (0.862 and 0.815), outperforming systems trained on significantly larger datasets. We release the complete data pipeline recipe, pretrained weights, and code at https://github.com/AMAPVOICE/PilotTTS.
Authors:Zihui Zhang, Zhixuan Sun, Yafei Yang, Jinxi Li, Jiahao Chen, Bo Yang
Abstract:
We address the challenging task of 3D object segmentation in complex scene point clouds without relying on any scene‑level human annotations during training. Existing methods are typically constrained to identifying simple objects, primarily due to insufficient object priors in the learning process. In this paper, we present FoundObj, a novel framework featuring a superpoint‑based object discovery agent that incrementally merges suitable neighboring superpoints, guided by our innovative semantic and geometric reward modules. These modules synergistically leverage semantic and geometric priors from self‑supervised 2D/3D foundation models, providing complementary feedback to the object discovery agent and enabling robust identification of multi‑class objects through reinforcement learning. Extensive experiments on diverse benchmarks demonstrate that our approach consistently outperforms existing baselines. Notably, our method exhibits strong generalization in zero‑shot and long‑tail scenarios, underscoring its potential for scalable, label‑free 3D object segmentation.
Authors:Mateusz Czyżnikiewicz, Ryszard Tuora, Adam Kozakiewicz, Tomasz Ziętkiewicz, Mateusz Galiński, Michał Godziszewski, Michał Karpowicz, Timothy Hospedales, Cristina Cornelio
Abstract:
Retrieval‑Augmented Generation (RAG) systems for question answering typically retrieve evidence by semantic similarity between the query and document chunks. While effective for unstructured text, this approach is less reliable on semi‑structured corpora where answering may require exact filtering, aggregation, or exhaustive retrieval over structured attributes across multiple documents. Symbolic approaches support such operations, but they are often brittle on noisy natural‑language corpora. We address this gap with DualGraph, a RAG framework that represents documents through two complementary views: a Textual Knowledge Graph for semantic retrieval and a Symbolic Knowledge Graph for symbolic querying over typed subject‑‑predicate‑‑object triples. Building on these two components, we provide multiple strategies for selecting or combining semantic and symbolic evidence.We also introduce SpecsQA, a benchmark from a commercial shopping website with semi‑structured product documents and manually curated questions spanning open‑ended and specification‑oriented retrieval. Experiments show that DualGraph consistently outperforms state‑of‑the‑art dense‑retrieval, GraphRAG, symbolic, and table‑oriented baselines across question types.Code and data are available at https://github.com/corneliocristina/DualGraphRAG.
Authors:Xiongwei Zhu, Xiaojian Liao, Tianyang Jiang, Yusen Zhang, Liang Wang, Limin Xiao
Abstract:
Fine‑grained Mixture‑of‑Experts (MoE) models sparsely activate only a subset of experts per token, reducing activated computation while maintaining high model capacity. However, in memory‑constrained inference scenarios, only a small set of experts can be cached. Experts not in the cache must be fetched from slow external storage (e.g., UFS), leading to frequent evictions and substantial I/O overhead. We propose ReMoE, a router fine‑tuning framework designed to boost token‑wise expert reuse. ReMoE biases the router toward recently selected experts, producing temporally stable routing that better matches cache locality constraints. By increasing short‑horizon expert reuse, ReMoE reduces expert fetches from storage without adding inference‑time computation. Experiments on DeepSeek and Qwen models show that ReMoE improves expert reuse by 26% while maintaining downstream task performance. Real‑system evaluations further confirm these benefits, improving output throughput by 8.4% under vLLM GPU‑CPU expert offloading and reducing TPOT by 43.6‑49.8% under llama.cpp on Jetson Orin NX, corresponding to a 1.77‑1.99× decode speedup across diverse workloads. Checkpoints and usage instructions are available at https://github.com/BUAA‑OSCAR/ReMoE.
Authors:Ye Yuan, Rui Song, Weien Li, Zeyu Li, Haochen Liu, Xiangyu Kong, Changjiang Han, Yonghan Yang, Zichen Zhao, Zixuan Dong, Fuyuan Lyu, Bowei He, Haolun Wu, Jikun Kang, Xue Liu
Abstract:
Social deduction games have become a popular testbed for probing reasoning, deception, coordination, and belief modeling in Large Language Model (LLM) agents. However, most environments are scored only by game outcomes such as win rates and largely remain to text‑only interaction, making it difficult to tell whether an agent's language is actually grounded in what it perceived and did, or to identify the failure modes underlying its behavior. To address this gap, we introduce QUACK, an open‑source environment and evaluation framework for auditing the grounding of agent language in multimodal social reasoning. QUACK evaluates agents at three levels: game outcomes, behavioral trajectories, and utterance‑level consistency. Its core Statement Verification Pipeline reconstructs each agent's ground‑truth trajectory from engine logs and checks every discussion claim against it, automatically flagging spatial hallucination, unsupported accusation, deception collapse, and language‑action inconsistency. Evaluating three frontier VLMs in both homogeneous and cross‑model adversarial settings, we find that even the strongest agent hallucinates 15.1% of its verifiable spatial claims and makes over half of its accusations without grounded evidence. We release the full engine, evaluation framework, toolkit, and logs at https://github.com/AAAAA‑Academia‑Attractions/QUACK.
Authors:Ruifeng Tan, Jintao Dong, Weixiang Hong, Jia Li, Jiaqiang Huang, Tong-Yi Zhang
Abstract:
Early battery degradation trajectory forecasting (BDTF), which predicts the full‑life state‑of‑health trajectory from early operational data, is critical for battery optimization, manufacturing, and deployment. Battery degradation data exhibit two key characteristics. First, degradation data present a multi‑level structure, including regularities shared within aging conditions and trajectory patterns shared across batteries. Second, degradation‑related variations in voltage‑current profiles are often localized to specific state of charge (SOC) intervals. Existing approaches often fail to explicitly model these characteristics. To bridge this gap, we propose BatteryMFormer, a multi‑level Transformer for early BDTF. BatteryMFormer integrates (1) an aging‑condition‑aware decoder that injects aging‑condition priors via aging‑condition‑informed queries and aging‑condition‑aware attention, (2) a meta degradation pattern memory that learns and retrieves trajectory prototypes to guide long‑horizon forecasting, and (3) a dual‑view encoder that jointly captures temporal dynamics and SOC‑localized variations from voltage and current time series. Extensive experiments on four battery domains show that BatteryMFormer consistently outperforms state‑of‑the‑art baselines, marking a significant step toward reliable BDTF. Our code is available at https://github.com/Ruifeng‑Tan/BatteryMFormer.
Authors:Izack Cohen
Abstract:
Alignment‑based conformance checking is the state‑of‑the‑art approach for comparing observed process executions with normative process models. The standard exact solution relies on an A‑based heuristic search, which can exhibit exponential runtime in the presence of long traces or substantial deviations. This paper introduces a reformulation of alignment‑based conformance checking as a totally unimodular linear program (LP) defined on the reachability graph of the synchronous product. By exploiting the underlying network‑flow structure, the proposed formulation guarantees the existence of an integral optimal extreme‑point solution through LP relaxation, thereby avoiding the combinatorial overhead associated with integer variables and branch‑and‑bound search. We conduct an extensive empirical evaluation on more than 2.1 million conformance checking instances derived from real‑world and synthetic benchmark datasets. The results show that A and the LP approach exhibit complementary performance characteristics: the former performs best on short, well‑conforming traces, while the LP formulation provides substantial speedups for longer traces with deviations, precisely where conformance checking is most informative. Based on these findings, we derive simple algorithm‑selection guidelines that combine both approaches, achieving average runtime savings of 38.6% with 96% selection accuracy compared to always using A.
Authors:Hanqi Duan, Xiang Li
Abstract:
LLM‑generated peer reviews are increasingly common at major venues, yet their deficiencies are hard to detect because they are uniformly fluent and well‑structured. Existing work either classifies authorship without judging quality, or scores quality with features designed for human‑written reviews; no prior system detects deficiencies in LLM‑generated reviews at the level of individual defect types. To bridge the gap, we introduce TADDLE, a Tool‑Augmented Agent for Detecting Deficient LLM‑Generated Peer Reviews, together with the first expert‑annotated benchmark for this task. Our benchmark comprises 1,800 reviews on 50 ICLR 2025 papers, multi‑label‑annotated by 18 domain experts against a taxonomy of six defect categories (plus a non‑deficient label). TADDLE decomposes detection into four specialized analysis tools ‑‑ Verify, Correct, Complete, and Transform ‑‑ orchestrated by an agent; an integrator synthesizes their outputs into binary and multi‑label classifications via two‑stage semi‑supervised learning. Extensive experiments show that TADDLE performs strongly on both binary detection and the multi‑label classification task. We release the benchmark and code at https://github.com/AquariusAQ/TADDLE.
Authors:Jiajun Wu, Jian Yang, Tuney Zheng, Wei Zhang, Haowen Wang, Yihang Lou, Xianglong Liu
Abstract:
LLMs can now produce full HTML pages, but many of those pages are only superficially correct: they render once, then fail under scroll, hover, click, resize, or gameplay. Evaluation from screenshots can miss these failures, and filtering discards many pages that are still repairable. We introduce HTMLCure, a browser experience framework that evaluates HTML after the system has interacted with it. The evaluator executes the page across viewports and interaction states, records deterministic browser evidence, and gives the VLM curated keyframes from the executed trajectory rather than isolated screenshots. The same state signal drives a closed loop repair engine: HTMLCure diagnoses the current page, chooses a state specific repair family, runs each candidate again, and exports quality cleared pages for SFT. On a 97K prompt corpus, this expands the directly usable seed into a candidate pool of 63703 quality cleared pages, from which we construct the final refined SFT set of 40K pages. Under the same backbone and training recipe, HTMLCure‑27B‑Refined reaches 50.6 on HTMLBench‑400 with 45.2% deterministic test case pass, placing it in the same performance band as strong reference rows such as Kimi‑K2.6 and GPT‑5.4. On the released MiniAppBench validation split, it reaches 81.2 average, improving raw 27B SFT by 15.3 points and approaching the level of strong reference systems.
Authors:Ashima Khanna, Dominik Grimm
Abstract:
Protein sequence optimization under tight oracle budgets requires methods that explore vast combinatorial spaces while making each evaluation informative. Existing reinforcement learning and off‑policy generative approaches often degrade under surrogate noise, and position‑agnostic mutation proposals risk disrupting functionally critical residues. We introduce SILO, a trajectory‑level self‑improvement imitation framework for oracle‑budgeted protein design. SILO uses a hierarchical edit policy that decomposes each mutation into a position choice followed by a residue choice. In each active‑learning round, the policy samples candidate trajectories via incremental stochastic beam search without replacement (SBS), and a UCB‑based proxy ensemble, combined with an alanine‑scan fitness score (AFS), selects candidates with functionally relevant edits for in silico oracle evaluation. The policy is then updated by next‑action cross‑entropy imitation on the round's best oracle‑labeled trajectories, avoiding value‑function estimation. Across eight reproduced protein fitness landscapes and five strong baselines from prior work, SILO achieves the highest maximum and top‑100 mean fitness on 8 of 8 landscapes within our evaluations, often exhibiting faster early‑stage improvement. In low‑data and noisy‑proxy stress tests on two landscapes per setting, SILO remains competitive or best when several baselines degrade. Ablations show that SBS with AFS account for much of the gains, with iterative imitation providing additional improvement. Code is available at: https://github.com/grimmlab/SILO.git
Authors:Peng Zhang, Guanghao Zhang, Wanggui He, Longxiang Zhang, Mushui Liu, Yan Xia, Zhenhao Peng, Weilong Dai, Jinlong Liu, Haobing Tang, Le Zhang, Hao Jiang, Pipei Huang
Abstract:
Recent video multimodal large language models (MLLMs) increasingly couple step‑by‑step reasoning with on‑demand visual evidence retrieval, allowing models to revisit relevant video segments during inference. However, two structural gaps remain in existing thinking‑with‑video systems. (i) Sampling density is not a learnable decision: existing methods may let the model decide where to look, but the per‑window frame rate is largely fixed. As a result, fine‑grained evidence is often recovered through repeated retrieval calls, which increases inference context length and training difficulty. (ii) Retrieval and answer generation are usually optimized with a single trajectory‑level advantage, so the "where to look" tokens and the "how to answer" tokens receive the same credit even when one is correct and the other is not. To address these gaps, we present DynFrame, a framework that emits the temporal window and the sampling density as native tokens within a single autoregressive pass. This learnable span‑density retrieval enables acquiring multi‑granularity evidence with a single retrieval step. Based on the above tokenized retrieval interface, we further introduce Segment‑Decoupled GRPO (SD‑GRPO), which splits each rollout at the retrieval boundary and assigns role‑specific token‑level advantages, separately crediting the sampling decision and the answer. Trained on the curated DM‑CoT‑74k and DM‑RL‑45k, DynFrame‑4B is competitive with strong 7B‑8B baselines across six benchmarks (NExT‑GQA, Charades‑STA, ActivityNet‑MR, Video‑MME, MLVU, LVBench), and DynFrame‑8B sets new state‑of‑the‑art on most metrics. Code is available at https://github.com/zhangguanghao523/DynFrame.
Authors:Zheng Wang, Kaixuan Zhang, Wanfang Chen, Jingwen Zhang, Xiaonan Lu
Abstract:
Sequential editing of structured knowledge in large language models allows targeted factual updates without retraining, yet existing methods often rely on complex regularization or constraint mechanisms whose necessity remains unclear. In this work, we systematically investigate the mechanisms underlying effective and stable sequential editing. Specifically, we first analyze the empirical success of AlphaEdit and establish, via a rigorous optimization analysis, the formal equivalence between one‑time and sequential editing. Building on this insight, we generalize the equivalence to a broader class of editing objectives, demonstrating that stability emerges naturally from properly accounting for accumulated editing constraints, rather than from specialized regularization or null‑space operations. We empirically confirm that many commonly used regularization strategies are unnecessary for reliable sequential updates. Furthermore, we extend our framework to handle conflicting edits, ensuring robust and consistent behavior under contradictory updates. Ultimately, our work provides Ariadne's thread through the labyrinth of sequential editing, charting a path toward simpler, more interpretable, and dependable knowledge updates. Our code is available at https://github.com/Wangzzzzzzzz/OTE‑SE‑Alignment.
Authors:Haoran Zhang, Zhaohua Sun
Abstract:
The token‑level extractive compressors widely used for general LM context are structurally inappropriate for LLM agents: across 17 (env, backbone, method) cells spanning two independent token‑level method families, every cell collapses to mean reward <= 0.05 despite 1.3‑13.3x realized compression. We name and characterize this failure mode as action‑grammar destruction ‑‑ the tokens carrying action semantics (identifiers, brackets, action verbs) are exactly those self‑information ranks lowest, so a general‑purpose compressor reliably removes them and the environment rejects the residual. The diagnosis points to step‑granularity compression. We introduce AGORA, an inference‑free step‑level compressor combining a structural prompt parser, an always‑keep floor for format‑ and recency‑critical content, and a 125M‑parameter relevance scorer trained on counterfactual next‑action‑change labels (~2ms/step, zero per‑step LLM toll). Across the compared inference‑free and LLM‑based methods, AGORA is the only one retaining >= 75% uncompressed performance in 8 of 9 cells (with the lone exception at 73%); a four‑way component ablation isolates the structural floor as the dominant quality lever and the learned scorer as the source of 1.0‑11.5x adaptive end‑to‑end compression from a single fixed keep ratio.
Authors:Jaewoo Lee, Hyeongyu Kang, Dohyun Kim, Kyuil Sim, Woocheol Shin, Minsu Kim, Taeyoung Yun, Jeongjae Lee, Sanghyeok Choi, Tabitha Edith Lee, Jong Chul Ye, Jinkyoo Park
Abstract:
Aligning a few‑step generative model is challenging, since existing alignment frameworks typically rely on restrictive assumptions: a tractable likelihood, a specific ODE/SDE solver, or a particular model family. We introduce FAV, Few‑step Generative Models Alignment via Sample‑based Variational Inference, a general alignment framework that requires only sample access to the generator and the reference distribution. We cast alignment as sampling from a reward‑tilted distribution anchored to a reference distribution. We leverage Stein Variational Gradient Descent as a sample‑based variational inference scheme and amortize its particle updates into the generator parameters via fixed‑point regression. We evaluate FAV on two domains: robotics manipulation and image generator alignment. On generative policy alignment for robotic manipulation, FAV outperforms prevailing policy extraction baselines across 56 offline and 30 offline‑to‑online RL tasks. For image generator alignment, FAV fine‑tunes diverse few‑step backbones, including GAN, drifting model, consistency models, and flow maps, scaling from ImageNet‑256 to 1024^2 text‑to‑image synthesis. Code is available at https://github.com/Jaewoopudding/FAV.
Authors:Jiahe Huang, Sihan Xu, Sharvaree Vadgama, Rose Yu
Abstract:
Generative models have emerged as a powerful paradigm for solving physics systems and modeling complex spatiotemporal dynamics. However, achieving high physical accuracy without incurring high computational cost remains a fundamental challenge, as existing approaches face a critical speed‑fidelity trade‑off. In this work, we introduce Recursive Flow Matching (RecFM), a generative framework for forecasting complex spatiotemporal dynamics. RecFM enforces self‑consistency to align trajectories across discretization scales, reducing discretization errors and improving performance across metrics for physics‑based tasks. To our knowledge, this is the first method to achieve high‑fidelity one‑ and few‑step (2‑4 step) dynamic generation for scientific systems with performance comparable to state‑of‑the‑art multi‑step solvers. Across challenging scientific benchmarks, RecFM achieves up to a 20× speedup over leading diffusion‑based emulators while improving predictive accuracy. Furthermore, RecFM reduces mean squared error by over 15% compared to vanilla flow matching, offering a scalable and efficient solution for real‑time scientific emulation.
Authors:Akide Liu, Jinbo Xing, Chaojie Mao, Ye Li, Zeyu Zhang, Yefei He, Weijie Wang, Zihan Wang, Yu Liu, Gholamreza Haffari, Bohan Zhuang
Abstract:
Minute‑scale cinematic video generation is a central challenge for generative video models. Existing paradigms address only fragments of this challenge: single‑shot extrapolation preserves an anchor but lacks cinematic structure, while multi‑shot storytelling imposes structure yet remains free to invent its visual states rather than continue an observed one. We define Multi‑Shot Video Extrapolation (MSVE), a task that extends an observed frame or clip into a sequence of cinematically structured shots while preserving anchor state and advancing narrative intent. This setting operates under the finite per‑call generation budget of short‑video models. We identify three coupled bottlenecks: (1) global planners over‑specify unsupported details from full screenplays; (2) shot‑level prompts dilute task‑relevant state when carrying the complete story; and (3) temporal chaining turns generated frames into a lossy memory in which identity, scene, object, and action state decay. MSVE reveals that long‑video failure is not merely a limitation of context length, but a failure of context allocation. We propose Recursive Context Allocation (ReCA), an inference‑time framework that allocates context hierarchically across planning and generation. ReCA recursively decomposes MSVE into context‑bounded subproblems, invokes frozen generators at leaf nodes, and propagates structured state updates across time. To evaluate this setting, we further propose MSVE‑Bench and NB‑Q, a source‑grounded protocol with prompts purpose‑built for 3 to 5 minute long‑video generation, a regime not addressed by existing short‑clip benchmarks. Compared to previous methods, ReCA improves average normalized score by 8 to 16 percent over the strongest competing controller and improves multi‑shot consistency metrics by 28 to 43 percent. View the project page at https://reca.vmv.re.
Authors:Yuxu Lu, Dong Yang, Xiaoyu Li, Mengwei Bao, Congcong Zhao
Abstract:
Maritime intelligent transportation systems (MITS) are essential for ensuring navigation safety and efficiency in busy waterways. However, accurate vessel trajectory prediction remains challenging due to the limitations of single‑source data. Automatic identification system (AIS) data is often sparse or unavailable for small vessels, while closed‑circuit television (CCTV) data alone cannot fully capture dynamic vessel behavior. To mitigate these challenges, we propose a cross‑modal interaction‑based vessel trajectory prediction (named CmIVTP) framework to model the intricate interactions between vessel dynamics and environmental constraints. Specifically, we introduce a target‑aware scene encoder to extract scene semantic features, effectively capturing vessel‑environment interactions and enhancing trajectory prediction accuracy. In addition, we propose a cross‑modal interaction transformer, which integrates AIS‑derived motion features, CCTV‑based environmental features, and scene representations. It leverages cross‑modal attention mechanisms to simultaneously capture intra‑modal semantics and inter‑modal interactions, ensuring dynamically consistent and environmentally feasible predictions. Furthermore, we construct a vessel group trajectory bank by clustering historical AIS trajectories into representative motion patterns, providing an efficient and scalable approach for candidate trajectory generation. Additionally, we introduce the maritime multimodal dataset plus (named Maritime‑MmD^+), a large‑scale dataset that synchronizes AIS data and CCTV video data, providing robust support for multimodal trajectory prediction research. Extensive experiments demonstrate that CmIVTP achieves better performance on multimodal‑driven vessel trajectory prediction benchmarks. The code resources for this work can be available at https://github.com/LouisYxLu/CmIVTP.
Authors:Anmol Agarwal, Natalie Neamtu, Pranjal Aggarwal, Seungone Kim, Jannis Limperg, Cedric Flamant, Kanna Shimizu, Bryan Parno, Sean Welleck
Abstract:
AI coding agents are increasingly used to write real‑world software, but ensuring that their outputs are correct remains a fundamental challenge. Formal verification offers a promising path: an agent generates code together with a machine‑checked proof, guaranteeing that the code satisfies a formal specification. However, there is no guarantee that the formal spec itself matches the user's intent. In this work, we study specification autoformalization: whether LLM agents can translate informal programming problems into faithful formal specifications. We introduce Verus‑SpecBench, a benchmark of 581 spec‑writing tasks derived from Codeforces problems targeting Verus, a verifier for Rust, and Verus‑SpecGym, an agentic environment in which models interact with Verus, bash, & the filesystem to develop these specs. The central challenge is evaluation: expert‑written reference specs are expensive to write, & LLM judges can miss subtle mistakes. We address this by (a) extending Verus's exec_spec mechanism so that generated specs can be executed as Rust code, & (b) testing them against official Codeforces tests & adversarial cases extracted from Codeforces "hacks", which are edge cases written by competitors to break incorrect solutions. On Verus‑SpecBench, the strongest model, Gemini 3.1 Pro, solves 77.8% of tasks, other frontier models solve 51.1‑‑57.8% & OSS models reach only 21.5‑‑25.5%. Our analysis of failure modes shows that model‑generated specs can omit important input assumptions, accept incorrect outputs, & reject valid ones. We also find that LLM‑as‑a‑judge evaluation misses 26% of the failures our evaluator catches. Overall, our results suggest that spec autoformalization is within reach for frontier agents but remains brittle even on problems where they can already generate correct code. The code, data, & logs can be found at https://github.com/formal‑verif‑is‑cool/verus‑spec‑gym
Authors:Junlin Yang, Tian Yu, Nicha C. Dvornek, Yuexi Du, Peiyu Duan, Annabella Shewarega, Lawrence H. Staib, James S. Duncan, Julius Chapiro
Abstract:
Hepatocellular carcinoma (HCC) is biologically heterogeneous, shaped by the interplay between hepatic functional reserve and tumor‑related oncologic factors; thus, similar survival outcomes may reflect fundamentally different underlying biological processes. Prognostic modeling in HCC is informed by rich multimodal information from multiparametric MRI and radiology reports from routine clinical practice. Existing prognostic vision‑language models (VLMs) learn a single entangled latent representation that blends hepatic and tumor‑related factors, limiting both accuracy and biological interpretability. We present BioFact‑MoE, a biologically factorized Mixture of Experts (MoE) framework that explicitly decomposes liver and tumor factors via biologically supervised experts within a residual MoE survival architecture. On a HCC cohort of N=588 patients (pretrained on 4,582 3D MRI image‑report pairs), BioFact‑MoE consistently improves survival prediction over all baselines across time horizons, achieving 12‑, 18‑, and 24‑month AUCs of 75.33%, 75.85%, and 73.96%. Beyond scalar risk prediction, gated expert weights enable phenotype‑aware risk stratification. Pathway‑informed gating uncovers clinically meaningful treatment‑associated survival heterogeneity. In held‑out validation, hepatic and tumor embeddings show selective associations with liver function and tumor burden markers, respectively (p<0.05), without supervision. The code is available at https://github.com/jy‑639/BioFact‑MoE.
Authors:Vukasin Bozic, Isidora Slavkovic, Dominik Narnhofer, Nando Metzger, Denis Rozumny, Konrad Schindler, Nikolai Kalischek
Abstract:
Geometry estimation from perspective images has greatly advanced, maturing to the point where off‑the‑shelf foundation models are able to reconstruct 3D scene structure not only from multi‑view imagery, but even from a single view. A natural extension is 3D reconstruction from panoramas, with the exciting prospect of recovering a full 360‑degree scene from a single panoramic image. In this work, we introduce PaGeR (Panoramic Geometry Reconstruction), a framework to lift powerful 3D foundation models designed for perspective imagery to the panorama domain. Our strategy is to start from a pre‑trained transformer for 3D reconstruction and turn it into a unified high‑performance model that predicts scale‑invariant depth, metric depth, surface normals, and sky masks from both perspective and omnidirectional images, in a single forward pass. By keeping architectural changes to a minimum and mixing perspective and panoramic images during training, PaGeR retains the rich 3D prior of the underlying foundation model while learning to also estimate geometrically consistent 360‑degree scenes from single panoramas. We extensively test our method in both indoor and outdoor environments and find that it delivers state‑of‑the‑art performance and excellent zero‑shot performance across a wide range of scenes. Code, data and models are available \hrefhttps://github.com/prs‑eth/PaGeR\texthere.
Authors:Xinpeng Wang, William X. Cao, Andrew Gordon Wilson, Zhe Zeng
Abstract:
Recent studies on hallucination detection have shown that hallucination‑related signals are more strongly encoded in intermediate layers than in the final layer of large language models (LLMs). Although a growing body of work has sought to exploit this property for hallucination detection, how to automate the selection of high‑performing layers remains underexplored, and principled methods for this purpose are still lacking. To address this gap, we first propose several hypotheses for why such signals emerge in intermediate layers and evaluate corresponding criteria for automatic layer selection across diverse LLM architectures, scales, and tasks, covering both question answering and summarization hallucination detection benchmarks. However, we find that none of these criteria consistently delivers satisfactory performance. We therefore propose a new selection criterion, First Effective Peak of Intrinsic Dimension (FEPoID), which consistently identify optimal or near‑optimal layers and outperforms both the aforementioned criteria and existing hallucination detection baselines. FEPoID is training‑free and incurs negligible computational overhead. In addition, we study the generation behaviors of LLMs and introduce a simple yet effective truncation strategy, which further amplifies hallucination‑related signals and substantially improves overall detection performance. Code is publicly available at https://github.com/DesoloYw/Automatic‑Layer‑Selection‑for‑Hallucination‑Detection.git
Authors:Xinran Liang, Esin Tureci, Prachi Sinha, Ye Zhu, Vikram V. Ramaswamy, Olga Russakovsky
Abstract:
Different visual patterns appear with different frequencies in the world: e.g., beach balls appear on sand more often than they do on a road. These statistics are reflected in vision datasets, and as a result trained models more easily recognize objects in common scenarios. However, recognizing a beach ball on a road may arguably be even more important than recognizing it on sand. We study how to mitigate this discrepancy. Since collecting uncommon images in the real world may be difficult, we explore whether generating images with less frequent contexts can serve as effective training augmentation. A key challenge is guiding generations to remain close to the original dataset distribution while creating diverse images with uncommon contexts. We introduce Decoupling Contextual Patterns with Generations (DecoupleGen), a method that personalizes text‑to‑image diffusion models to facilitate coherent synthesis of images with rare contexts while preserving original visual details. The generated images contain semantically meaningful content and remain visually aligned with the original datasets. We further apply verification constraints to ensure relevance of the augmented data. We evaluate our approach on object classification and recognition tasks on complex scene datasets. Our experiments demonstrate consistent improvements over previous approaches, and our analyses identify factors underlying these improvements.
Authors:Chenghao Qiu, Chunli Peng, Yufeng Yang, Kuan-Hao Huang, Yi Zhou
Abstract:
In‑context learning (ICL) is often motivated by the intuition that demonstrations help because they provide correct input‑output examples. However, we reveal a counterintuitive phenomenon: correctness does not guarantee exemplar utility, and some correct demonstrations can even reduce ICL accuracy. To study this correctness‑utility gap, we introduce task‑preserving perturbations, where only the exemplar input is changed, while the example remains a correct instance of the same task. Concretely, each perturbed exemplar is assigned the target induced by the task mapping. This framework covers both label‑updating perturbations, where task‑relevant semantics change and targets are recomputed, and stricter target‑preserving perturbations, where the original target remains valid. We formalize the resulting failure mode as contextual evidence shift: task‑preserving perturbations can change the effective mixture of evidence used by the model for contextual inference, thereby separating exemplar correctness from exemplar utility. Across sentiment classification, logical reasoning, and math word problems, we find that task‑preserving perturbed demonstrations can substantially degrade ICL performance, especially for smaller models, harder tasks, and higher perturbation ratios. Our results show that robust ICL requires evaluating not only whether demonstrations are correct, but also how they influence contextual inference. Code is available at https://github.com/Chenghao‑Qiu/Task‑Preserving‑ICL.
Authors:Sandeep Kumar, Virginia Smith, Chhavi Yadav
Abstract:
Direct Preference Optimisation (DPO) is widely used for safety alignment in large language models. However, prior work shows it is brittle and exhibits poor out‑of‑distribution (OOD) generalisation. In this paper, we investigate whether Curriculum Learning can improve the robustness of DPO‑based safety alignment. We propose Staged‑Competence, a curriculum‑based framework that organises preference data by difficulty, employs competence‑based sampling, and progressively updates the reference model during training. Averaged across three model families, Staged‑Competence reduces OOD harmful response rates by 16% and jailbreak attack success rates by 20%, while preserving general capabilities with near‑zero over‑refusal. We further show that Staged‑Competence (1) matches baseline safety with only 75% of the training data and (2) yields better separation between safe and unsafe responses. Staged‑Competence is agnostic to the policy optimisation loss and can extend to other DPO variants and alignment domains. Our code and data are available at https://github.com/Sandeep5500/curriculum‑learning‑for‑safety.
Authors:Rafał Stachowiak, Tomasz P. Pawlak
Abstract:
Constraint Acquisition (CA) and related research on the validation and enhancement of Mathematical Programming (MP) models from domain knowledge artifacts are currently limited by inadequate benchmarks. This deficiency impedes reproducibility and cross‑study comparability, slowing the maturation of CA methods. Existing benchmarks were designed for solver evaluation rather than for assessing CA algorithms. They are loosely organized, treat individual problems inconsistently, and omit the domain knowledge artifacts required by CA methods. This work presents MPMMine, a benchmark suite designed to assess algorithms that discover, validate, and enhance MP models using diverse domain knowledge artifacts. MPMMine is guided by consistency, standardization, completeness, extensibility, openness, and version control. It adopts a uniform structure and relies on open formats: MiniZinc, CommonMark, and JSON. It provides multiple models per problem, tens of instances per model, and thousands of solutions and non‑solutions in both integer and continuous domains, alongside natural‑language descriptions to support text‑to‑model methods.
Authors:Shuwen Yu, William P Marnane, Geraldine B. Boylan, Gordon Lightbody
Abstract:
This paper presents the HRVConformer, a novel deep learning architecture for the classification of hypoxic‑ischemic encephalopathy (HIE) using the instantaneous heart rate (HR) signal. Unlike conventional approaches that rely on handcrafted features, HRVConformer directly processes raw HR signals in an end‑to‑end manner, capturing both local and long‑range dependencies through a hybrid Convolution‑Transformer framework. By integrating convolutional layers for local feature extraction and Transformer‑based attention mechanisms for global context modelling, the architecture effectively enhances signal representation and classification performance. The model was trained using supervised learning on a large HR dataset consisting of 1,573 one‑hour epochs, including 259 one‑hour expert‑annotated epochs and a substantial set of weakly labelled data. A 314‑hour validation set provided a robust performance estimation, while an independent 215‑hour dataset with expert annotations was reserved for final testing. HR signals were extracted from electrocardiogram (ECG) recordings using an improved Pan‑Tompkins algorithm, which significantly enhanced both signal quality and data availability. Experimental results demonstrate that the HRVConformer achieves an AUC of 83.23% and accuracy of 74.56% on the test set. These results surpass the performance of the Transformer, ResNet50 and fully convolutional networks baselines, highlighting the advantages of integrating convolutional and Transformer‑based components for HR‑based HIE classification. The proposed method provides a promising step toward a more accurate and automated assessment of HIE using HR signals. The code is available at: https://github.com/syu‑kylin/HRVConformer.
Authors:Zihang Zhou, Ziqian Ren, Yukai Wu, Yingjie Xiong, Wei Zhou, Chao Peng, Dong Zhang, Bingheng Yan, Xuanhe Zhou, Fan Wu
Abstract:
Functionality‑correct repository setup aims to configure execution environments (e.g., dependencies, build scripts) to successfully execute a repository's documented features. It presents significant challenges due to diverse, repository‑specific failures, including dependency incompatibilities, missing toolchains, incomplete installations, and verification‑strategy mismatches. Existing LLM agents struggle to robustly resolve these issues, specifically failing to support (1) cross‑repository experience transfer, (2) multi‑step trial‑and‑repair under non‑invertible state changes, and (3) robust verification of setup outcomes to distinguish setup‑induced failures from repository bugs. To address this, we introduce SetupX, an experiential learning‑based setup framework. First, we construct a Self‑Evolving Experience Representation (XPU), a dual‑modality knowledge unit encoding setup signals, textual guidance, executable actions to dynamically transfer verified environment fixes to unseen repositories. Second, we employ Experience‑Augmented Speculative Execution backed by a LIFO Docker snapshot stack, enabling the agent to proactively trial fixes and safely roll back to known‑good states. Third, we introduce a Prosecutor‑Judge Verification Protocol that separates evidence collection from final judgment, enabling more reliable setup verification beyond superficial build‑time metrics. Evaluation results on carefully‑crafted benchmarks show SetupX achieves highest performance (e.g., 92% pass rate) and outperforms the strongest baseline by over 19%. Crucially, SetupX excels in complex multi‑repository setup requiring coordinating multiple interconnected services across different containers. The code repository is available at https://github.com/OpenDataBox/SetupX.
Authors:Ke Li, Dong An, Xiaoling Zang, Can Ye, Liang Xie, Qibo Qiu, Chen Shen, Xiaofei He, Wenxiao Wang
Abstract:
Low‑bit activation quantization remains a major bottleneck in efficient large language model (LLM) deployment. The difficulty is not only that activations contain outliers, but that their distributions are often poorly matched to a low‑bit uniform quantizer. Existing post‑training quantization (PTQ) methods suppress peaks, balance channels, or minimize reconstruction error, yet they rarely specify what activation distribution is actually easy to discretize. As a result, activations may appear numerically smoother while still incurring large quantization error because the quantization range remains wide or most values collapse into a few levels near the mean. We recast activation transformation as quantizer‑facing distribution design and analyze quantization error from an information‑theoretic perspective. Our analysis shows that quantization‑friendly activations should jointly have a smaller numerical range and sufficient dispersion within that range. Guided by this analysis, we propose InfoQuant, a train‑free method that employs Peak Suppression Orthogonal Transformation (PSOT) to shape activations into more quantization‑friendly distributions. We further introduce adaptive outlier‑token selection to improve the robustness of PSOT during optimization. Across multiple LLM families, InfoQuant consistently outperforms prior PTQ and end‑to‑end training baselines. Under W4A4KV4, it preserves 97% of floating‑point accuracy on average and reduces the LLaMA‑2 13B performance gap by 42% over the previous state of the art. Code is available at [https://github.com/LLIKKE/InfoQuant](https://github.com/LLIKKE/InfoQuant)
Authors:Hanzala Afzaal, Danish Memon, Chouhdary Bilal Raza, Muhammad Khurram Shahzad
Abstract:
The rapid proliferation of Internet of Things (IoT) devices has created an urgent demand for adaptive, resource‑efficient Intrusion Detection Systems (IDS) capable of handling dynamic and evolving cyber threats. This paper investigates AOC‑IDS, a state‑of‑the‑art autonomous online IDS published at IEEE INFOCOM 2024, which employs an Autoencoder (AE) with Cluster Repelling Contrastive (CRC) loss and an autonomous Gaussian‑based decision module. We first successfully replicate AOC‑IDS on the UNSW‑NB15 benchmark, achieving 89.39% accuracy in close agreement with the published 89.19%. We then identify four key limitations: class imbalance, unreliable pseudo‑label generation, limited generalization, and computational overhead for IoT deployment, and propose targeted improvements for each. Our XGBoost‑BalSamp method achieves 95.45% accuracy on UNSW‑NB15, a gain of 6.26% over the baseline. Our combined deep learning approach (PseudoFilter, MixupAug, and LiteAE) achieves a best‑run accuracy of 90.88% (F1: 91.45%), surpassing the base paper while reducing model parameters by 55%.These results demonstrate that targeted improvements to AOC‑IDS yield consistent accuracy gains while improving practical deployability on IoT edge devices.
Authors:Furkan Sakizli
Abstract:
Agentic RAG systems that equip language models with dozens to hundreds of tool definitions face a critical resource conflict: tool schemas consume the same context window needed for retrieval‑augmented generation. We present the first systematic study of this tool‑context trade‑off, evaluating 14 models spanning 1.5B‑32B local models plus one frontier API model across 6,566 controlled API calls at three context budgets (8K, 16K, 32K) with 28 tool definitions. Applying TSCG conservative‑profile compression (44‑50% schema token savings), we observe a binary enablement effect: at 8K tokens, JSON‑schema tool definitions overflow the context window entirely, yielding near‑zero EM (2.6% average), while compressed schemas restore RAG functionality with +20.5 pp average exact‑match lift across all eight models (+24.7 pp among the six exhibiting full enablement). At 32K ‑‑ where both formats fit ‑‑ four of five tested models show delta <= 1 pp, confirming the effect is purely budget‑driven. External validation on HotpotQA (50 multi‑hop questions) shows +48 pp EM under the same overflow scenario. Frontier scaling tests demonstrate that JSON schemas overflow at ~494 tools while compressed schemas remain operational beyond 800 tools. Our results establish tool‑schema compression as a necessary infrastructure layer for agentic RAG in constrained‑context deployments. All code, data, and checkpoints are publicly available.
Authors:Tongxi Wu, Jian Zhang, Yang Gao
Abstract:
Safety alignment in large language models (LLMs) and multimodal large language models (MLLMs) is commonly assumed to operate as a near‑binary threshold mechanism. We challenge this assumption by revealing that safety behavior is governed by an instability region where small perturbations induce stochastic refusal decisions rather than deterministic outcomes. We develop a multi‑metric diagnostic framework combining external and internal signals to characterize this instability. Through systematic experiments, we identify a characteristic diagnostic signature: inputs in unstable regimes exhibit elevated output uncertainty yet decreased internal safety activation, a decoupling phenomenon that explains why detection‑based defenses fail against sophisticated attacks. Building on this framework, we introduce Furina, a jailbreak attack that deliberately induces this signature through fragmented, scene‑anchored prompts without model‑specific optimization. Furina outperforms strong single‑turn and multi‑turn baselines on HarmBench and achieves competitive results on MM‑SafetyBench, demonstrating that uncertainty amplification provides a principled and transferable mechanism for understanding safety vulnerabilities. Code is available at: https://github.com/0xCavaliers/Furina_Jailbreak.
Authors:Xianglin Yang, Bryan Hooi, Gelei Deng, Tianwei Zhang, Jin Song Dong
Abstract:
The known stylistic biases in LLM judges, such as a preference for verbosity or specific sentence structures, present an underexplored security vulnerability. In this work, we introduce BITE (BIas exploraTion and Exploitation), a black‑box adversarial framework that learns semantics‑preserving edits to mislead an LLM judge and artificially inflate the scores it assigns. We cast the selection of stylistic edits as a contextual bandit problem and use a LinUCB policy to adaptively choose edits that maximize the judge's score without access to model parameters or gradients. Empirically, we test BITE across a diverse range of LLM judges and tasks, including both pointwise and pairwise comparisons on chatbot leaderboards and AI‑reviewer benchmarks. BITE achieves an attack success rate exceeding 65% and raises scores by 1‑2 points on a 9‑point scale, all while preserving semantic equivalence. We further assess the attack's stealthiness, showing that BITE evades standard style‑control methods and several detection baselines. Our findings expose a fundamental weakness in the LLM‑as‑a‑judge paradigm and motivate robust, attack‑aware evaluation. Our code is available at https://github.com/xianglinyang/llm‑as‑a‑judge‑attack.
Authors:Xu Yao, Siyuan Zhou, Zhenbo Wu, Chaochuan Hou, Shuang Liang, Shiping Wang, Hailiang Huang, Songqiao Han, Minqi Jiang
Abstract:
Weakly supervised anomaly detection (WSAD) has developed in three primary directions: incomplete, inexact, and inaccurate supervision. However, these directions remain isolated, lacking a unified framework to assess whether they address unique challenges or share fundamental mechanisms. This paper introduces WSADBench, the first benchmark that unifies evaluation across distinct weakly supervised scenarios, benchmarking diverse approaches from specialized WSAD methods to advanced tabular foundation models. WSADBench establishes standardized protocols to evaluate 36 algorithms across 4 modalities by systematically varying label quantity, granularity, and quality, revealing the performance boundaries of various methods. Based on over 700K experiments, WSADBench reveals four critical insights: (i) Strong intrinsic correlations exist between these weak supervision scenarios, challenging the isolation of current research directions. (ii) Specialized WSAD algorithms excel only in extreme label‑scarcity regimes but are quickly dominated by tabular foundation models and general classification methods as supervision increases or in OOD scenarios. (iii) Unlabeled data shows inconsistent utility across settings, with marginal gains compared to label refinement. (iv) Models exhibit asymmetric sensitivity to different types of label noise. We release WSADBench as an open‑source benchmark with code and datasets to facilitate future WSAD research: https://github.com/SUFE‑AILAB/WSADBench.
Authors:Parth Darshan, Abhishek Divekar
Abstract:
Customizing an LLM judge to a specific problem or domain often involves optimizing its prompt across multiple evaluation criteria simultaneously. Textual gradient methods automate this for a single judge criterion, however they produce natural‑language critiques, not numerical vectors. Thus, the conflict‑resolution toolkit of multi‑task learning (PCGrad, MGDA) does not apply to this multi‑objective textual gradient setting. We extend TextGrad to the multi‑objective setting and test four decomposition modes of textual gradient optimizers by varying how much cross‑objective information the loss, gradient and optimizer LLMs share. We find the gradient's task‑focus drops by 59% (9.0 to 3.7 out of 10) when the gradient LLM must provide feedback on multiple criteria jointly. Separately, we observe that naively combining single‑objective optimized instructions into a single prompt degrades Spearman rho from 0.305 to 0.220 (‑0.085). These results identify two separable failure modes: optimization‑time gradient dilution and inference‑time instruction interference, which together constrain the design space for multi‑objective judge optimization using textual feedback.
Authors:Yunqi Gao, Leyuan Liu, Yuhan Li, Changxin Gao, Jingying Chen
Abstract:
3D human mesh recovery and 3D clothed human reconstruction are inherently related, yet they have long been studied in isolation, thereby overlooking the potential gains of joint optimization. To overcome this limitation, we propose to address these two tasks within a unified framework, which allows their mutual dependencies to be effectively exploited. Building on this idea, we propose MuNet, a mutualistic network for joint 3D human mesh recovery and 3D clothed human reconstruction from single images. First, we adopt 2‑manifold graphs as a unified representation for all 3D models, enabling consistent modeling across 3D human mesh recovery and clothed human reconstruction. Second, we design an end‑to‑end graph convolutional network that progressively deforms an initial graph into a 3D human mesh and refines it into a detailed 3D clothed human model. Third, we introduce a mutualistic mechanism that allows reciprocal interaction between the two tasks during training, where 3D human mesh recovery provides guidance for 3D clothed human reconstruction, and reconstruction feedback refines the 3D human mesh recovery. We extensively evaluate MuNet on six benchmark datasets for 3D human mesh recovery and 3D clothed human reconstruction, including Human3.6M, 3DPW, MPI‑INF‑3DHP, THuman2.0, CAPE, and RenderPeople. Experimental results demonstrate that MuNet achieves state‑of‑the‑art performance on both tasks across all datasets. The code of MuNet is released for research purposes at https://github.com/starVisionTeam/MuNet.
Authors:Sam Bowyer, Acyr Locatelli, Kris Cao
Abstract:
Efficient benchmarking techniques aim to lower the computational cost of evaluating LLMs by predicting full benchmark scores using only a subset of a benchmark's questions. By reframing this problem as an instance of multiple regression with feature selection, we find that existing efficient benchmarking methods can be greatly improved by simply using kernel ridge regression at the prediction stage. Additionally, using an information‑theoretic feature‑selection algorithm called minimum redundancy maximum relevance (mRMR), we can further improve upon these methods by selecting question subsets that will be maximally useful for prediction. Except in very data‑poor settings, these approaches consistently achieve smaller prediction errors (in both MAE and RMSE), and greater ranking correlation between predicted and true scores (in both Spearman ρ and Kendall τ) across a range of benchmarks using both binary and continuous metrics. Furthermore, mRMR subsampling is much faster than competitor methods (which often involve fitting probabilistic models or running clustering algorithms), and is more likely to select the same questions under different random seeds or training data splits. Tutorial code can be found at https://github.com/sambowyer/mrmr_eval .
Authors:Chunzheng Zhu, Yijun Wang, Jianxin Lin, Feng Wang, Hongwei Wang, Lei Zhao, Shengli Li, Kenli Li
Abstract:
Self‑supervised pre‑training paradigm has gained increasing prominence for learning transferable representations in medical imaging, yet existing methods for ultrasound (US) images operate at the image or frame level, overlooking the anatomical context for clinical‑aligned representation learning. In this work, we propose an anatomy‑anchored ultrasound self‑supervision framework ANAUS that shifts representation learning from generic visual regions to clinically meaningful anatomical structures. Utilizing a learnable latent prompt engine alongside a one‑time domain adaptation on existing public image‑mask pairs, we empower the LP‑SAM module to achieve annotation‑free anatomy delineation at scale. Building upon this anatomical grounding, we propose a dual‑policy self‑supervised learning paradigm consisting of inter‑view semantics‑aware anatomy‑separating alignment and contextual core‑region prediction to enhance representation learning. Specifically, the former enforces feature invariance within identical anatomical regions while promoting discriminability across distinct structures; the latter compels the model to reconstruct corrupted regions, thereby capturing fine‑grained structural details. Extensive evaluations on six public datasets demonstrate that ANAUS consistently outstrips current state‑of‑the‑art methods while maintaining the computational efficiency essential for clinical deployment. Code is available at https://github.com/zhcz328/ANAUS.
Authors:Fangtai Wu, Hailong Guo, Shijie Huang, Jiayi Song, Yubo Huang, Mushui Liu, Zhao Wang, Yunlong Yu, Jiaming Liu, Ruihua Huang
Abstract:
Customized image editing aims to equip pre‑trained diffusion models with specific visual effects using limited paired data, typically via Low‑Rank Adaptation (LoRA). As the number of desired effects grows, storing and dynamically loading numerous these effect LoRAs significantly increases deployment overhead. Furthermore, current pipelines typically cascade these effect LoRAs with acceleration modules for fast generation, which triggers severe parameter interference and results in concept bleeding and style degradation. We propose CollectionLoRA, a multi‑teacher on‑policy distillation framework capable of distilling the concepts of up to 50 different effect LoRAs along with few‑step generation capabilities into a single LoRA. This fundamentally resolves the feature interference issue and significantly reduces deployment costs. Specifically, the method introduces (i) a Probabilistic Dual‑Stream Routing mechanism that enables the model to randomly switch between data sources during training, effectively enhancing its generalization in unseen scenarios; (ii) an Asymmetric Orthogonal Prompting strategy to achieve concept isolation within the prompt space; (iii) a Coarse‑to‑Fine Distillation Objective to mitigate the distribution gap between the teacher and student models. Extensive evaluations show that CollectionLoRA distills all customized effects and few‑step generation into a single LoRA, reducing deployment overhead while achieving concept fidelity comparable to or better than independently trained teacher models. Code: https://github.com/Qwen‑Applications/CollectionLoRA
Authors:Kolawole Quadri
Abstract:
KYA (Know Your Agents) is an open‑source, framework‑agnostic trust and governance layer for autonomous systems, composed of five primitives: (1) a four‑gate inbound apply pipeline; (2) an only‑tighten composition algebra over a three‑channel multi‑tenant hierarchy; (3) KYP (Know Your Principal), a schema‑level unification of trust scoring across human users, AI agents, and service accounts; (4) auditable interaction‑multiplier amplification over an AIVSS‑shaped additive baseline; and (5) two‑axis delegation attribution: a static premium for risky delegates and a runtime debit for actual delegate misbehavior in multi‑agent fan‑out. Together these span three pillars (trust, governance, and evidentiary assurance), making an autonomous system's actions authorized, policy‑conforming, and post‑hoc verifiable: where observability answers how long, how much, and what path, KYA answers was it authorized, did it conform, and can it be verified; it composes with observability rather than replacing it. It ships native adapters for 15+ agent frameworks. On a 4 by 9 cross‑backend matrix all 36 cells pass; the pure‑function scorer runs sub‑millisecond at p99 and the system sustains ~ 1,800 ops/sec at 20 concurrent workers with HMAC chain integrity preserved end‑to‑end. KYA detects 89% of 1,200 adversarial probes from PyRIT and Garak, including the recently‑published topology‑guided multi‑agent attack. The system is available under Apache 2.0 as the veldt‑kya package on PyPI.
Authors:Santosh Kumar Radha, Oktay Goktas
Abstract:
World models for partially observed environments must imagine multiple compatible hidden futures and steer between them under counterfactual actions. Joint Embedding Predictive Architectures (JEPAs) do this in latent space, but a vector‑valued latent has no internal structure for carrying the belief over hidden continuations through blind rollout. We introduce the Unitary World Model JEPA (UWM‑JEPA), a JEPA world model with a density‑matrix latent on a joint system‑environment space and a learned unitary predictor. The construction preserves the joint‑state spectrum exactly during rollout, so the predictor itself cannot dissipate the represented uncertainty. On a hidden‑velocity indicator task requiring five‑step forward simulation under a given action sequence with the target observation masked, UWM‑JEPA reaches 0.77 accuracy and degrades monotonically as actions are perturbed; a parameter‑matched LSTM‑JEPA trained under the same counterfactual‑target objective and action head collapses to majority‑class accuracy (0.53) under every action condition. Under blind rollout, UWM‑JEPA loses fewer than ten points of probe R^2 at short horizons while vector‑latent baselines lose forty‑one and sixty‑eight; both nevertheless tie on a held‑out context probe, locating the separation in the predictor rather than the encoder. Action sensitivity itself requires training against counterfactual rather than teacher‑forced targets, a finding that applies beyond the unitary parameterisation. For JEPA world models to imagine under partial observability, latent geometry and predictor dynamics matter, not frozen context‑encoding capacity alone.
Authors:Leshu Li, An Lu, Haiyu Wang, Zhibin Feng, Conghui Duan, Qing Bao, Zongmin Zhao, Sai Qian Zhang
Abstract:
Lipid nanoparticles (LNPs) are among the most clinically mature platforms for nucleic acid delivery, yet designing lipids that are both effective and biologically safe remains a major bottleneck. In practical screening, toxicity is a decision‑level constraint: if a lipid is toxic, its efficiency prediction is clinically irrelevant. We propose LipoAgent, a safety‑aware multi‑agent LLM framework for lipid discovery. LipoAgent combines domain‑specific finetuning with a conditional prediction objective that enforces toxicity as a prerequisite for efficiency prediction, and further improves reliability via multi‑agent verification with lightweight human oversight when disagreement persists. Across multiple foundation models, LipoAgent achieves an average 32% relative improvement in mRNA transfection efficiency prediction compared with other reported models for lipid design. Wet‑lab validation confirms that virtual screening rankings reliably translate to biological transfection outcomes. The code is publicly available at https://github.com/SAI‑Lab‑NYU/LipoAgent.git.
Authors:Minwei Kong, Chonghe Jiang, Ao Qu, Wenbin Ouyang, Zhaoming Zeng, Xiaotong Guo, Zhekai Li, Junyi Li, Yi Fan, Xinshou Zheng, Xi Jing, Yikai Zhang, Zhiwei Liang, Seonghoo Kim, Runqing Yang, Zijian Zhou, Sirui Li, Han Zheng, Wangyang Ying, Ou Zheng, Chonghuan Wang, Jinglong Zhao, Hanzhang Qin, Cathy Wu, Paul Pu Liang, Jinhua Zhao, Hai Wang
Abstract:
Large language models (LLMs) are increasingly used for optimization modeling and solver‑code generation, yet practical operations research and optimization problems often require a harder capability: designing scalable algorithms that exploit problem structure and outperform direct formulation‑and‑solve baselines. Existing benchmarks are limited to small or simplified examples far below real‑world scale and complexity. We introduce FrontierOR, among the first benchmarks to systematically evaluate LLM‑based efficient algorithm design for realistic large‑scale optimization problems. FrontierOR includes 180 tasks derived from methodologically diverse papers published in top‑tier operations research venues, each with standardized instances and a hidden, expert‑verified evaluation suite. We evaluate seven LLMs spanning frontier, cost‑effective, and open‑source models both in one‑shot and test‑time evolution settings. The results reveal that frontier models still struggle to move from executable formulations to efficient optimization algorithms: the strongest one‑shot model outperforms Gurobi in only 31% of cases in both solution quality and computational efficiency, and even strong coding agents with test‑time evolution achieve only 50% on selected hard tasks. FrontierOR establishes a practical evaluation platform for LLM‑based optimization algorithm design, which enables future LLMs and agents to be systematically tested on whether they can move beyond correct formulation toward a feasible, high‑quality, and efficient algorithm. Code and data are publicly released at https://github.com/Minw913/FrontierOR.
Authors:Sohaib Lafifi
Abstract:
We give an attribution method for neural combinatorial‑optimisation (CO) policies that (i) decomposes a decision by constraint families via LP‑relaxation duals, (ii) certifies counterfactuals through a combinatorial feasibility model (implemented as a CSP feasibility‑decision model), and (iii) bounds the size of a PAC‑sufficient explanation with a Bonferroni‑corrected Hoeffding sufficient‑subset test along a greedy ordering. Across three CO problems and three seeds, our LP‑anchored Λ‑attribution matches the CF‑derived signal at 96.5% on CVRPTW (n_cert=344) and 77.2% on the Orienteering Problem (n_cert=281) vs 75.0% and 35.2% for proxy gradient (paired diffs +0.215 and +0.420; McNemar exact p \le 10^‑14). In the rank‑aligned regime of the Flexible Job‑Shop Scheduling Problem, both backends agree on every CSP‑certified flip (n_cert=59), confirming the no‑gain prediction. Bonferroni‑PAC subsets average 5.0 nodes per step (M=70, \varepsilon=δ=0.2, k_\max=25). Reference implementation: https://github.com/sohaibafifi/neuro‑co‑cax
Authors:Gorgi Pavlov
Abstract:
We apply the influence‑adaptive Walsh geometry of a companion theory paper (arXiv:2605.01637) to extreme low‑bit weight‑only LLM quantization. The recipe is one math‑invariant transformation: WHT‑rotate each linear layer's weight matrix and rescale its columns by per‑coordinate Walsh‑basis activation energy before handing off to a reconstruction‑error quantizer (Intel auto‑round). This biases per‑group integer rounding toward high‑spectral‑energy channels. On four pretrained decoder‑only models from 135M to 1.5B parameters, BBT‑spectral reduces wikitext‑2 perplexity by 15‑58% relative to vanilla auto‑round at W2A16; we also report a TinyLlama‑1.1B auxiliary data point. Three extensions transfer the recipe to families it failed on: a per‑head PCA matrix‑Gamma replacement of q_norm/k_norm for Qwen3 attention (PPL 136.76 ‑> 88.99 on Qwen3‑0.6B); an SO(2) per‑pair rotation that commutes with RoPE (PPL 36.93 ‑> 21.84 on Qwen2.5‑1.5B); and an MoE‑aware input‑side absorption fix identified by architectural fuzzing of Laguna‑style fused‑expert layouts. A W2‑vs‑W4 ablation gives a deliberate negative control: the redistribution payoff falls within the +/‑0.5 PPL noise floor at W4, consistent with the Schur‑convexity intuition that the cost of unconcentrated influence vanishes as the noise budget shrinks. All quantized weights export to OpenVINO IR and run on Intel NPU + Arc dGPU + CPU with PPL invariant to device within +/‑0.1. We do not claim a formal Boolean‑to‑real‑valued transfer of the theory paper's majorization argument: the WHT activation energy used here is not the Boolean influence of the theory paper, the link is intuitive, and the contribution is engineering value rather than a transferred theorem. Head‑to‑head benchmarks against SpinQuant, QuaRot, QuIP‑sharp, AQLM, OmniQuant, and ButterflyQuant at matched calibration are the main future‑work item.
Authors:Ruitao Liu, Qinghao Hu, Alex Hu, Yecheng Wu, Shang Yang, Luke J. Huang, Zhuoyang Zhang, Han Cai, Song Han
Abstract:
Reinforcement learning with verifiable rewards (RLVR) has become a powerful paradigm for improving language models on reasoning‑intensive tasks, but its effectiveness is often limited by exploration. For example, models often fail on hard problems, leaving little useful reward signal. External expert traces offer a natural source of guidance, yet they may also expose reward‑relevant content along the critical path to the verifier target, such as final answers, intermediate values, executable implementations, or answer‑related entities. This content can create an unintended reward hacking channel, allowing the policy to obtain reward by copying the trace rather than learning the underlying reasoning or agentic behavior. Existing guided‑RL methods reduce this risk by using partial trajectories, but they mainly control how much expert information is shown heuristically rather than which parts should be hidden. To this end, we propose Semantic Masked Expert Policy Optimization (SMEPO), a fine‑grained semantic masking strategy for expert‑guided RLVR. Instead of truncating traces coarsely or revealing them unchanged, SMEPO masks reward‑relevant semantic spans along the critical path while preserving the expert's decomposition, plan, and procedural structure. This turns hard problems from reasoning from scratch into a fill‑in‑the‑blank process: the policy can follow the expert's problem‑solving route, but must still reconstruct the missing values, code, or entities by itself. SMEPO is simple to apply and requires no changes to the reward function or RL objective. Across diverse domains, including math, code, and agentic search, SMEPO improves accuracy by up to 3.2 points over GRPO and reduces training time by up to 4.2x. The code is available at https://github.com/mit‑han‑lab/SMEPO.
Authors:Jake Stephen, Niraj K. Jha
Abstract:
Knowledge graph (KG) is an abstraction that can be extracted from text corpora and used for in‑depth reasoning. Prior work has leveraged KGs to fine‑tune language models (LMs), enabling domain‑specific superintelligence. In this work, we explore whether KG‑driven in‑depth reasoning capabilities can emerge in neuroscience using only information contained within a single authoritative textbook. The central hypothesis is that structured knowledge, when distilled into a high‑quality KG and converted into KG‑grounded question‑answer (QA) supervision, is sufficient to produce expert‑level reasoning through a fine‑tuned LM that surpasses large language models (LLMs) in accuracy, while employing orders of magnitude fewer parameters. We construct a textbook‑derived KG via a dual‑LLM validation pipeline, expand it with a masked LM trained on the KG topology, generate multi‑hop QA items, which include QA pairs and reasoning traces, to fine‑tune an LM exclusively on KG‑derived supervision, and apply reinforcement learning using path‑derived KG signals as implicit reward models. Our results demonstrate that deep, mechanistic neuroscience understanding can be induced in the model without reliance on large, heterogeneous web‑scale corpora. The KG‑based synthetic neuroscience curriculum that readers can quiz themselves on, and the fine‑tuned LM, are available at the following GitHub location: https://kg‑bottom‑up‑superintelligence.github.io/neuro‑bench.
Authors:Liang Xue, Haoyu Liu, Cheng Wang, Pengyu Chen, Haozhuo Zheng, Yang Liu
Abstract:
Large language models for vertical domains are bottlenecked by the scarcity of complex, domain‑specific task‑oriented dialogues. Existing data acquisition pipelines face a persistent trilemma: expert annotation is expensive, real‑world service conversations are constrained by privacy and commercial restrictions, and static corpora quickly become temporally stale. We propose Stream, a data‑centric framework that leverages publicly available streaming media (live streams and short videos) to synthesize high‑value service dialogues at scale. Stream mines authentic interaction signals from noisy streams and synthesizes conversations by integrating role‑grounded persona construction with Conversational Blueprint construction; it further adopts retrieval‑augmented generation (RAG) to support knowledge‑aware responses. Based on Stream, we release StreamDial, a large‑scale multi‑domain dataset covering Automotive, Restaurant, and Hotel. StreamDial contains 87,498 dialogue sessions and 1,497,320 turns in total, with an average of 17.11 turns per session and a comparable scale across domains. Each session is organized as a structured quadruplet \langle P_u, P_a, B, H \rangle that pairs dialogue history with explicit user/agent personas and a Conversational Blueprint, capturing realistic service behaviors such as requirement mining, constraint conflicts, negotiation, and recovery. Evaluations with automatic judges and downstream tasks show that StreamDial improves intrinsic dialogue quality over strong baselines, and models trained with StreamDial improve Dialogue State Tracking across backbones; we further report a completed human‑evaluation set and encouraging multilingual transfer on Qwen3‑8B under a controlled training budget. The data is released in https://github.com/hitxueliang/DialogDataSetBySTREAM.
Authors:Gang Peng
Abstract:
Current AI interaction models treat the prompt as the primary object of exchange, omitting a critical layer: the user's latent source intent, the goal state preceding and motivating the prompt. Here we introduce Intent Signal Theory (IST), a computational framework that formalises this missing intent layer. IST distinguishes four objects routinely conflated: latent source intent (I), observable intent proxy (I‑hat), encoded carrier (P), and model output (O). It formalises dimensional weights, encoding masks, structural and fidelity recovery scores, and public‑private intent decomposition. The Theorem of Irreversible Intent Loss establishes that private intent absent from the carrier cannot be recovered beyond generic substitution. Evidence from four companion studies spanning six LLMs, three languages and three task domains shows structural‑fidelity splits, human‑validated metric dissociation, and weight‑tolerance plateaus consistent with IST's predictions. IST reframes prompt engineering as intent‑protocol design and identifies a computational layer that current AI systems lack.
Authors:Jun-Wei Hsieh, Meng-Yu Kao, Ghufron Wahyu Kurniawan, Kuan-Chuan Peng
Abstract:
YOLO‑series and DETR‑based detectors struggle with tiny‑object detection. YOLO‑style models benefit from efficient dense prediction, but their large‑stride backbones may suppress tiny instances in deep feature maps and make grid assignment ambiguous. DETR‑based models remove hand‑crafted post‑processing through set prediction, yet they reason over coarse token grids, where tiny objects occupy only a few weak tokens and are easily overlooked during matching. To address these limitations, we propose TinyFormer, a unified YOLO‑‑DETR hybrid real‑time detector that combines ViT representations, NMS‑free set prediction, and a YOLO‑style pyramid neck for accurate small‑object detection. TinyFormer introduces a Parallel Bi‑fusion Module (PBM), which builds high‑resolution shortcuts from shallow stages to the feature pyramid, preserving fine spatial details during multi‑scale fusion. We further design a Spatial Semantic Adapter (SSA) to compensate for the spatial loss caused by coarse tokenization. SSA extracts high‑resolution cues from early stages and injects them into transformer token embeddings, improving tiny‑object localization without sacrificing the global modeling ability of DETR. Experiments on MS COCO show that TinyFormer consistently outperforms recent YOLO‑series detectors and the strong DEIMv2 baseline. TinyFormer‑X achieves 58.4% AP even without PBM, while adding PBM improves the overall AP to 58.5% and brings a 1.6% AP gain on small objects. With Objects365 pre‑training, TinyFormer‑X‑PBM reaches 60.2% AP, surpassing RF‑DETR and other Objects365‑pretrained detectors with fewer parameters and lower computation. These results demonstrate that TinyFormer bridges dense YOLO‑style feature fusion and DETR‑style set prediction, providing a strong accuracy‑efficiency trade‑off for real‑time tiny‑object detection. Code is available at https://github.com/mmpmmpmmpjosh/TinyFormer.
Authors:Tianxiang Zhan, Xiaobao Song, Tong Guan, Shirui Pan, Ming Jin
Abstract:
Time series research is moving beyond fixed forecasting benchmarks toward realistic tasks that combine prediction, contextual reasoning, tool use, and structured decision support. Most benchmarks are built around clean data and short evaluation loops; agents alone may miss temporal constraints, evidence checks, or review before finalizing outputs. We first formalize next‑generation time series tasks as three‑component tuples consisting of a task file, a workspace, and a validation interface. We then present AION, a time series harness built from six component groups: agents, skills, rules, memory, evaluation, and protocols. In this harness, we use three design principles: temporal grounding, temporal knowledge‑grounded reasoning, and reliability mechanisms such as post‑experiment analysis and layered review. A Kaggle Store Sales case study shows that the harness produces more detailed process traces, more artifacts, and more review steps than the same base agent operating in OpenCode direct build mode. Taken together, these results argue for a paradigm shift from fixed tasks to realistic ones under real‑world constraints.
Authors:Yangneng Chen, Jing Li
Abstract:
Large Vision‑Language Models (LVLMs) extend large language models with visual understanding, but remain vulnerable to hallucination, where outputs are fluent yet inconsistent with images. Recent studies link this issue to language bias‑the tendency of LVLMs to over‑rely on text while neglecting visual inputs. Yet most analyses remain empirical without uncovering its underlying cause. In this paper, we provide a systematic study of language bias and identify its root in modality misalignment during training. Our analysis shows that both Visual Instruction Tuning (VIT) and Direct Preference Optimization (DPO) often prioritize textual improvements, which may cause LVLMs to overly lean toward language modeling rather than balanced multimodal understanding. To address this, we propose two simple yet effective methods: Language Bias Regularization (LBR) which mitigates language bias through regularization during instruction tuning, and Language Bias Penalty (LBP), which penalizes language bias in the DPO training process. Extensive experiments across diverse models and benchmarks demonstrate the effectiveness of our approach. LBR consistently improves performance on over ten general benchmarks, while LBP significantly reduces hallucination and improves trustworthiness. Together, these methods not only mitigate language bias but also advance the overall alignment of LVLMs, all without introducing any additional data or auxiliary models. Our code is publicly available at https://github.com/lab‑klc/LVLM‑Language‑Bias.
Authors:Chainarong Amornbunchornvej
Abstract:
An agent must act on the situation before it, learn what it cannot yet represent, and model other agents well enough to coordinate. These faculties are usually realized by separate mechanisms, yet they share a failure mode: the situation can exceed what the agent can currently represent, and the honest response is then a principled refusal that says what was missing. We develop a small cognitive architecture in which these limits arise from a single quantity. An Interpretation‑Decision Unit (IDU) interprets a content vector through a family of regimes ‑ local representational frames with private bases ‑ and decides which actions it licenses; a scalar residual of the content against the active regimes' representational scope drives the unit. Low residual with a clean licensing emits an action; otherwise the unit re‑interprets, attempts a description‑length‑justified expansion, or halts with a typed, witnessed terminal. We prove the unit is total and deterministic: for any content and fixed configuration it halts in finitely many bounded‑cost steps with a unique terminal witness, so abstention carries its cause by construction. By binding the architecture's open parameters without changing its mechanics, the same residual‑against‑scope constraint recovers three documented phenomena at three scopes: the typology of not‑knowing (typed abstention); a forced misunderstanding between agents, localized to one shared concept and invisible to the agent committing it (bounded empathy); and prerequisite dependence in learning derived from a bounded focus window rather than posited (developmental prerequisites). Each instantiation is worked for a natural and an artificial agent and states a falsifiable prediction, so one constraint can model limits in both human and machine cognition. The account contributes a unification and a notion of accountable abstention, typed and witnessed by construction.
Authors:Bangrui Xu, Ziyang Miao, Xuanhe Zhou, Yiming Lin, Zirui Tang, Xiaomeng Zhao, Fan Wu, Cheng Tan, Fan Wu, Bin Wang, Conghui He
Abstract:
VLM‑based OCR models have become the de facto choice for document parsing, as they can accurately extract page‑level elements (e.g., paragraphs within individual pages) together with their bounding boxes and textual content. However, downstream applications such as RAG require coherent document‑level information, whereas these models often break cross‑page continuity and fail to recover disrupted structures, such as paragraphs and tables truncated by page boundaries. Such relationships are not confined to a single page; instead, they require joint analysis of titles, paragraphs, tables, and images spanning multiple pages. A natural solution is therefore to reuse existing OCR outputs and reconstruct document‑level logical structures through post‑processing. To this end, we propose MinerU‑Popo, a lightweight and universal framework for POst‑Processing OCR outputs, which converts page‑level results from diverse parsers into coherent document‑level structures. MinerU‑Popo decomposes the problem into four focused subtasks: text truncation recovery, table truncation recovery, title hierarchy reconstruction, and image‑text association. To address these effectively, we build a task‑oriented data engine with task‑specific input filtering, and use the generated data (30K) to fine‑tune a lightweight post‑processing model (Qwen3‑VL‑4B). To support long documents, we introduce dynamic chunking with overlap‑based synchronization, which aligns chunk‑level outputs from the fine‑tuned model and preserves global consistency. Finally, we assemble the aligned outputs into a tree‑structured document representation, further enriched with node chunking and summaries for downstream retrieval and analysis. Empirical results show MinerU‑Popo improves title‑hierarchy TEDS by at least 20% across all five tested OCR models, improves RAG accuracy and reduces per‑query latency.
Authors:Ibrahim Delibasoglu
Abstract:
The rapid evolution of generative models has enabled the creation of hyper‑realistic facial deepfakes, exposing a critical vulnerability in modern digital forensics: the inability of detectors to generalize to unseen manipulation techniques. Traditional networks suffer from representation collapse, overfitting to localized artifact fingerprints of specific training generators. This work investigates whether modern Vision Foundation Models can serve as generalizable, out‑of‑the‑box feature extractors capable of tracking forensic anomalies across entirely unseen generative manifolds. We conduct a systematic cross‑domain evaluation comparing three foundational learning paradigms: fully supervised macro‑semantic features (RoPE‑ViT), pure self‑supervised geometric features (DINOv3), and multi‑teacher agglomerative representations (NVIDIA C‑RADIOv4‑H). By deploying frozen backbones subjected to downstream linear probing, we map the performance limitations of these architectures on the challenging DF40 benchmark. Our empirical findings expose the intrinsic trade‑offs between pre‑training paradigms and parameter scale, proving that while foundation models retain high discriminative capabilities for entire face synthesis, localized face editing techniques expose fundamental boundaries in linear probe evaluation structures. Source code and model weights are available in http://github.com/mribrahim/deepfake
Authors:Ruize Li, Zhibin Wen, Tao Han, Hao Chen, Fenghua Ling, Wei Zhang, Song Guo, Lei Bai
Abstract:
Accurate evaluation of weather forecasting models is critical for their reliable deployment in real‑world applications. However, existing benchmarks predominantly rely on reanalysis products such as ERA5, which are generated through delayed data assimilation and do not reflect the constraints of real‑time operational forecasting, thereby resulting in a systematic mismatch between benchmark performance and real‑world forecasting. In this work, we introduce RealBench, a next‑generation benchmark for AI weather forecasting that emphasizes realistic evaluation under operational conditions. RealBench features a strictly out‑of‑distribution test set spanning 2025 to eliminate data leakage and capture recent atmospheric regimes. It integrates multiple data sources, including low‑latency operational analysis and a large‑scale global in‑situ observation dataset comprising over 10,000 stations, enabling direct evaluation against real atmospheric measurements. Beyond standard global metrics, RealBench provides a comprehensive evaluation framework for high‑impact extreme events, including heatwaves, cold surges, and tropical cyclones, using event‑specific metrics that better reflect real‑world forecasting priorities. The evaluation results reveal substantial discrepancies between reanalysis‑based metrics and real‑world performance, particularly concerning extreme events. By highlighting the limitations of existing benchmarks, this work establishes a more faithful and operationally relevant evaluation paradigm, providing a rigorous foundation for advancing next‑generation AI weather forecasting systems. The benchmark implementation is available at: https://github.com/lixruize‑del/NWP‑Benchmark.
Authors:Jianrui Zhang, Hyun Jung Lee, Sukanta Ganguly, Tae-Eui Kam, Donghyun Kim, Yong Jae Lee
Abstract:
Multimodal retrieval relies heavily on single‑vector retrievers, which compress rich, sequential token sequences into one single global representation. While efficient, they discard fine‑grained, local evidence critical for dense retrieval tasks. Multi‑vector approaches were introduced as a solution, but they strictly require training and many ignore the necessity of a globally summarizing representation. To address this, we introduce SMART, a framework that unlocks the latent multi‑vector capabilities of standard single‑vector models. We first demonstrate that standard contrastive training on the pooled embedding implicitly shapes the retrieval geometry of preceding hidden states via gradient flow. By applying direct late‑interaction over these frozen hidden states during inference, SMART acts as a plug‑and‑play upgrade that consistently improves performance across diverse modalities, improving even the state‑of‑the‑art models further on MMEB‑V2. We also reveal SMART's superior performance, as simple lightweight post‑training not only saves time and compute, but also brings forth further improvement on Visual Document retrieval, allowing a single‑vector model to outperform SoTA multi‑vector counterparts. Ultimately, SMART offers both a highly efficient inference enhancement and a powerful finetuning technique for multimodal retrieval. We open source our code and weights at https://github.com/HanSolo9682/SMART.
Authors:Zhi Wang, Botao He, Kelin Yu, Seungjae Lee, Ruohan Gao, Furong Huang, Yiannis Aloimonos
Abstract:
Human egocentric video captures rich manipulation demonstrations without any robot hardware, yet transferring these skills to robots remains challenging due to the embodiment gap between human and robot in both visual appearance and kinematics. We present HumanEgo, a framework that bridges the embodiment gap by lifting each human demonstration to an entity‑level representation of hand‑object interaction, and training a flow matching policy with dense auxiliary objectives that amplify supervision from every trajectory. HumanEgo is robot‑data‑free, hardware‑agnostic, data‑efficient, and zero‑shot human‑to‑robot transferable. With only 30 minutes of human videos per task, HumanEgo achieves 92.5% average success across four real‑world tasks (75% with just 15 minutes), outperforms matched‑time robot teleoperation by 41%, and robustly transfers zero‑shot across novel robots, cameras, and environments. We release HumanEgo as an easy‑to‑use, open‑source framework for learning robot policies directly from human data: https://github.com/TX‑Leo/HumanEgo
Authors:Mini Han Wang, Liting Huang, Wei Hong, Boonthawan Wingwon
Abstract:
Background: Type 2 diabetes mellitus (T2DM) is increasingly recognised as a systemic disease characterised by coordinated dysfunction across metabolic, renal, lipid, and inflammatory pathways. Existing clinical assessments often fail to capture this multi‑dimensional burden. Methods: We conducted a retrospective study of 1,195 patients using routinely collected laboratory biomarkers. System‑level abnormality indices were constructed to quantify organ‑specific dysfunction, and multi‑system involvement was defined as abnormalities in two or more systems. Supervised machine learning models, including logistic regression, random forest, and gradient boosting, were trained to predict multi‑system dysregulation. Model interpretability was achieved using SHapley Additive exPlanations (SHAP). Results: The gradient boosting model demonstrated near‑perfect discrimination (AUC = 1.000), significantly outperforming logistic regression (AUC = 0.925). Feature attribution analysis revealed that hyperglycaemia, renal impairment, dyslipidaemia, and inflammation were the dominant drivers of multi‑system risk. Dose‑response relationships observed in partial dependence analyses further supported the biological plausibility of model predictions. Conclusion: This study presents an interpretable, data‑driven framework for quantifying systemic disease burden in T2DM. By linking routine biomarkers to multi‑organ dysfunction, our approach provides both predictive accuracy and mechanistic insight, offering potential for improved risk stratification and precision medicine in diabetes care. The data and code used in this study are openly available on GitHub at: https://github.com/MiniHanWang/Type‑2‑Diabetes‑1.git
Authors:Xiaoyue Lu, Xianglin Yang, Haijun Liu, Jiahao Liu, Kuntai Cai, Yan Xiao, Jin Song Dong
Abstract:
The widespread integration of Large Language Models (LLMs) necessitates rigorous and systematic safety evaluation. Existing paradigms either rely on constructed benchmarks to assess safety from predefined perspectives, or employ dynamic red‑teaming to probe potential vulnerabilities. While effective, these approaches face challenges, as they depend heavily on expert domain knowledge, offer limited systematic guarantees, and are vulnerable to rapid obsolescence. To address these limitations, we introduce a novel framework POLARIS that brings the rigor of specification‑based software testing to AI safety. POLARIS first compiles unstructured natural‑language policies into First‑Order Logic (FOL) representations, establishing a traceable link between high‑level rules and concrete test cases. This formalization enables the construction of a Semantic Policy Graph, where complex policy violation scenarios are encoded as traversable paths. By systematically exploring this graph, POLARIS uncovers compositional violation patterns, which are then instantiated into executable natural‑language test queries, enabling coverage‑driven and reproducible safety testing. Experiments demonstrate that POLARIS achieves higher policy coverage and attack success counts compared to established baselines. Crucially, by bridging formal methods and AI safety, POLARIS provides a principled, automated approach to ensuring LLMs adhere to safety‑critical policies with verifiable traceability. We release our code at https://github.com/huac‑lxy/POLARIS.
Authors:James Henry
Abstract:
Concept formation in transformer language models is depth‑extended, not a single‑layer event: concepts emerge gradually across a contiguous region of the residual stream. Mechanistic interpretability methods identify the single layer of peak class separation ‑‑ the "best layer" ‑‑ capturing a snapshot rather than the process itself. We introduce the Concept Allocation Zone (CAZ): the depth interval within which a concept becomes measurably separable, the region allocated to its geometric expression. We formalize the CAZ through three layer‑wise metrics (Separation, Concept Coherence, Concept Velocity) and derive principled boundary detection without manual layer sweeps. A CAZ is not a concept: it is the depth region within which the model organizes its geometry to make a concept separable. A single concept typically participates in multiple CAZes; multiple concepts may share one. Empirical validation across 34 models from 8 architectural families and 7 concepts reveals that the separation curve S(l) is frequently multimodal. A scored detector uncovers "gentle CAZes" ‑‑ subtle allocation regions invisible to standard peak detection but causally active in 93‑100% of cases under ablation (16 of 34 models; 26 in the companion validation paper). The framework generates seven testable predictions; four yield clear verdicts (two not supported, one partially supported, one supported), one had its precondition invalidated by the data, and two are underpowered ‑‑ with cross‑architecture alignment confirmed as depth‑matched rather than monolithic under leave‑one‑concept‑out cross‑validation. Reference implementation: rosetta_tools v1.3.1 (doi:10.5281/zenodo.20361433).
Authors:Ligong Bi, Tao Huang, Jianyuan Guo, Chang Xu
Abstract:
Visual Autoregressive (VAR) models have emerged as a powerful paradigm for image synthesis by performing hierarchical next‑scale prediction. However, VAR models are inherently prone to cascading error propagation, where subtle coarse‑scale mispredictions are amplified across the hierarchy, ultimately distorting the final synthesis. To mitigate this, we propose AID‑VAR, a plug‑and‑play framework that enhances pre‑trained VARs through Adversarially Injected Diagnosis. Instead of a standard passive generation, AID‑VAR introduces a proactive error‑correction mechanism inspired by the adversarial feedback in GANs. We deploy a discriminator to diagnose fidelity gaps at each scale transition, coupled with a lightweight guidance injector. This module operates as a non‑invasive adapter that refines the feature manifold of a frozen VAR backbone, effectively steering the generation toward the distribution of real images without destabilizing the pre‑trained latent space. Furthermore, to rigorously evaluate this cross‑scale progression, we introduce the Inter‑Scale Consistency Score (ISCS), a novel metric that quantifies the fidelity and structural alignment between consecutive resolution scales. Experimental results across various backbones demonstrate that AID‑VAR delivers sharper textural details and fewer structural distortions with negligible overhead. For instance, AID‑VAR‑d20 achieves a 16% improvement in FID with only a 3% increase in parameters. These results establish AID‑VAR as a highly efficient and scalable pathway for upgrading large‑scale VAR generators, enhancing global coherence and local detail without altering training data, base architectures, or sampling schedules. Code is available at https://github.com/bijiw515/AID‑VAR.
Authors:Sasank Annapureddy
Abstract:
Operating LLMs as coordinated multi‑agent research systems over multi‑hour runs surfaces failure modes that single‑shot evaluation cannot: upstream providers throttle without warning, sub‑agents drift the task to fit accessible tools, narrate machinery instead of using it, open revision iterations with self‑apology, or treat upstream context as executable directives. We present PRIMA, whose primary contributions are three operational patterns for surviving these failure modes: (1) a resilience‑and‑recovery layer that detects upstream rate‑limit signals, persists a typed pause record to disk, and resumes long‑running runs without re‑executing converged work even across process restarts; (2) a sub‑agent operating discipline encoding task‑fidelity, tool‑use, revision, and inter‑step context‑boundary norms as a structural prompt layer; (3) a multi‑phase application pattern for structured engineering deliverables pairing orthogonal draft steps with an explicit cross‑document harmonization pass before final synthesis. These sit atop a foundational protocol: a research‑program specification language with explicit convergence criteria, a dual‑metric scoring engine (LLM‑judged rubric plus sandboxed code), an outer meta‑optimization loop, event‑driven persistence, hook‑based middleware, context compaction, and a multi‑provider LLM abstraction. Agent identities derive from prime powers, giving collision‑free identifiers and trivially‑verifiable cluster membership without a central registry. Theoretical guarantees include O(k) verification, O(V+E) DAG validation, and identity collision freedom by the Fundamental Theorem of Arithmetic. A Graph Isomorphism case study grounds the architectural claims in a generated artifact: a six‑step protocol that produced a research paper proposing a new canonical‑form algorithm with three theorems and five conjectures.
Authors:Ismail Lamaakal
Abstract:
Neural network weights are increasingly a bottleneck for deployment, yet most compression pipelines treat layers independently and overlook cross‑layer redundancy induced by function‑preserving symmetries. We propose Motion‑Compensated Weight Compression (MCWC), a weight‑only codec that aligns permutation‑symmetric blocks (e.g., hidden units and attention heads) to maximize cross‑layer correspondence, turning depth into a predictable sequence. In the aligned coordinate system, MCWC uses a lightweight layer‑sequential predictor with periodic keyframes and encodes only quantized prediction residuals using a learned entropy model trained under a rate distortion objective. A simple decoder reconstructs deployable weights by entropy decoding, dequantization, predictor‑driven reconstruction, and inverse alignment, enabling fast weight materialization for inference. Across Transformer language modeling and vision classification, MCWC improves the rate accuracy Pareto frontier over strong quantization and learned weight‑codec baselines, while maintaining competitive decode time. Ablations confirm that alignment, prediction, entropy modeling, and keyframe scheduling are each necessary for the full gains. Our code is available via https://github.com/Ism‑ail11/MCWC.
Authors:Bohang Sun, Max Zhu, Francesco Caso, Jindong Gu, Junchi Yu, Philip Torr, Pietro Liò, Jialin Yu
Abstract:
Diffusion large language models promise faster generation by refining many token positions in parallel, but this parallelism introduces a hidden control problem: which proposed tokens should be transferred into the partially decoded sequence at each step? We refer to this decision as token commitment. Existing frozen‑generator decoders largely rely on hand‑designed confidence rules or block‑specific acceptance filters. We argue that token commitment can instead be learned as a reusable trace‑state policy. We introduce TraceLock, a lightweight plug‑in controller that instantiates this policy for a frozen diffusion language model. Since oracle commitment times are unavailable, TraceLock derives self‑supervision from future stability: at decoding step t, a proposed token for position i is labeled stable if it matches the final token at position i after the full decoding trace completes. The controller scores variable‑length trace states and decides which active token proposals should be committed to the partially decoded sequence. Once trained for a given frozen backbone, the controller can be deployed across local‑window widths, generation lengths, and step budgets without retraining or per‑setting calibration. Experiments on question answering, mathematical reasoning, and code generation show that TraceLock improves the quality‑step tradeoff over heuristic and learned baselines, with particularly stable behavior under cross‑setting deployment. Diagnostic analyses show that its decisions are not reducible to scalar confidence, suggesting that frozen diffusion language models expose a learnable space of commitment trajectories beyond confidence‑based decoding. Code is available at https://github.com/BobSun98/TraceLock.
Authors:Ruyi Chen, Lu Zhou, Xiaogang Xu, Chiyu Zhang, Jiafei Wu, Liming Fang
Abstract:
Text‑to‑Image (T2I) models have made significant strides in visual realism and semantic consistency, yet they often perpetuate and amplify societal biases. Existing evaluation methods typically address only single‑dimensional biases, lacking perspectives to uncover model biases at social‑related deeper semantic levels. We introduce HoloFair, a comprehensive benchmark framework for multidimensional demographic bias analysis. Built upon our large‑scale fairness‑oriented dataset and the SpaFreq (Spatial‑Frequency) attribute classifier, this framework proposes the Multi‑attribute, Group‑wise Bias Index (MGBI) metric, designed to assess both intrinsic diversity and conditional biases. Beyond evaluation, we further introduce Fair‑GRPO, a reinforcement‑learning‑based debiasing method that alters the distribution of generative models through a designed multi‑objective reward function. E.g., experiments on the SD3.5‑Medium model demonstrate that Fair‑GRPO significantly improves multidimensional fairness while maintaining high image quality. We also analyze potential reward hacking phenomena and provide corresponding mitigation strategies. Code and dataset are available at https://github.com/1059684669/HoloFair
Authors:Sol Park, Soobin Um
Abstract:
Minority sampling aims to generate low‑density instances on a data manifold and is of central importance in applications such as medical diagnosis, anomaly detection, and creative AI. Existing approaches, however, define minority samples relative to generative priors learned from training data, confining rarity to model‑specific notions that may poorly reflect real‑world semantics. In this work, we propose a world‑centric perspective on minority sampling, which defines rarity with respect to real‑world priors rather than generator‑induced densities. To this end, we introduce JEPA guidance, a diffusion sampling framework guided by a Joint‑Embedding Predictive Architecture (JEPA) ‑‑ a class of world models that encode broad, semantically rich representations. JEPA guidance steers diffusion trajectories toward low‑density regions under the implicit density induced by the JEPA, thereby aligning generated minorities with real‑world semantic rarity. To make JEPA guidance computationally practical, we develop principled approximation strategies accompanied by theoretical error bounds, significantly reducing the overhead of guidance computation. Extensive experiments across unconditional, class‑conditional, and text‑to‑image generation demonstrate that JEPA guidance consistently improves the fidelity and semantic validity of minority samples, outperforming generator‑centric baselines in capturing real‑world notions of rarity. Code is available at https://github.com/soobin‑um/jepa‑guidance.
Authors:Jaeung Lee, Dohyun Kim, Jaemin Jo
Abstract:
Large language model (LLM) unlearning has emerged as a crucial post‑hoc mechanism for privacy protection and AI safety, yet auditing whether target knowledge is truly erased remains challenging. Existing output‑level metrics fail to detect when this knowledge remains recoverable from internal representations. Recent white‑box studies reveal such residual knowledge but often rely on auxiliary training or dataset‑specific adaptations, leaving no generalizable metric. To address these limitations, we propose the Unlearning Depth Score (UDS), a metric that quantifies the mechanistic depth of unlearning via activation patching. UDS first identifies layers that encode the target knowledge using a retain model baseline, then measures how much of it is erased in the unlearned model on a 0‑1 scale. In a meta‑evaluation across 20 metrics on 150 unlearned models spanning 8 methods, UDS achieves the highest faithfulness and robustness, confirming our causal approach as the most reliable for unlearning evaluation. Case studies further reveal that white‑box metrics can disagree at the layer level and that erasure depth varies across examples. We provide guidelines for integrating UDS into existing benchmarking frameworks and streamlining the evaluation pipeline. Code and data are available at https://github.com/gnueaj/unlearning‑depth‑score
Authors:Haizhou Xia
Abstract:
Post‑hoc repair of LLM mathematical reasoning introduces an asymmetric risk: fixing an incorrect reasoning trace is useful, but replacing a trace that was already correct can be harmful. We study this problem under a selective replacement setting, where a system must decide whether a repaired candidate is safer than preserving the original cached trace. We present GuardedRepair, a guarded best‑of‑N repair framework that diagnoses cached reasoning traces, selectively triggers repair, and accepts answer‑changing candidates only when deterministic verification guards support replacement. The framework combines lightweight symbolic checks, surface semantic‑risk diagnostics, bounded candidate generation, and conservative acceptance policies. On the full GSM8K test set, where the initial reasoner already achieves 95.60% accuracy, GuardedRepair improves final accuracy to 96.89%, fixing 17 of 58 remaining errors without measured broken‑correct cases in the main run. On a weak‑reasoner ASDiv setting, accuracy improves from 78.40% to 87.60%. Direct regeneration baselines show that this gain is not explained by stronger‑model re‑solving alone: re‑solving all GSM8K examples lowers accuracy to 93.03% and breaks 47 initially correct answers. Additional analyses show that guarded repair substantially improves the fixed/broken tradeoff, while also revealing that replacement risk is reduced rather than eliminated. These results support viewing post‑hoc repair as harm‑aware selective replacement rather than unconstrained re‑solving.
Authors:Sattam Altuuaim, Lama Ayash, Muhammad Mubashar, Naeemullah Khan
Abstract:
Despite the central role of optimization in deep learning, most optimizers rely on update structures whose functional form is fixed before training begins. This static design can limit their ability to respond to changing gradient behavior across the loss landscape, where training may shift between stable, noisy, and inconsistent regimes. This study proposes PILOT (Policy‑Informed Learned OpTimizer), an online optimizer that adapts its update behavior during training. Rather than using a fixed balance between momentum, normalization, and sign‑based updates, PILOT uses gradient‑direction agreement as a signal of local training stability. Conditioning the update rule on this agreement signal allows the optimizer to adjust its behavior when gradients become stable, noisy, or inconsistent. Experiments on FashionMNIST and CIFAR‑10 show that PILOT consistently achieves the highest accuracy among the evaluated optimizers across convolutional settings. On the CNN architecture, PILOT reaches 94.13% on FashionMNIST and 81.94% on CIFAR‑10. On ResNet‑18, it further improves performance, reaching 95.71% on FashionMNIST and 93.42% on CIFAR‑10. These results suggest that learning how to adapt the update structure during training can improve performance across both compact and deeper convolutional models while preserving a simple first‑order optimization framework. The implementation of PILOT is publicly available at https://github.com/SattamAltwaim/PILOT.git
Authors:Spandan Pratyush
Abstract:
The quadratic complexity of self‑attention in Transformer models remains a significant bottleneck for processing long sequences and deploying large language models efficiently. For this approach, there has been significant research into Sparse Attention, and Deepseek Sparse Attention has combined various methods of creating segments of tokens to reduce the time complexity. This paper introduces a novel approach, Grammatically‑Guided Sparse Attention, which constrains attention computations based on the grammatical roles of tokens. By leveraging Parts‑of‑Speech (POS) tags, attention masks are dynamically generated that enforce linguistically coherent connections between tokens, reducing the computational graph without sacrificing essential linguistic dependencies. Two masking strategies are proposed and evaluated: a hard mask that strictly allows only predefined grammatical interactions, and a soft mask that biases attention towards these interactions. The experiments, conducted on the SST‑2 sentiment classification task using a DistilBERT‑like architecture, demonstrate that Grammatically‑Guided Sparse Attention maintains comparable accuracy to full attention while significantly reducing the theoretical computational overhead. Preliminary results show accuracy values of 0.8200 for hard masking and 0.8165 for soft masking, closely matching the 0.8200 of full attention, providing a path towards more efficient, interpretable, and linguistically‑informed Transformer architectures.
Authors:Alif Tri Handoyo, Vincent C. S. Lee, Rizka Widyarini Purwanto, Alex M. Lechner, Deanna Kemp, Muhamad Risqi U. Saputra
Abstract:
Automatically mapping and segmenting global mining footprints using remote sensing and deep learning is critical for monitoring the socio‑environmental risks and impacts of mining, yet its progress is hindered by the scarcity of fine‑grained annotated data. Although large‑scale datasets with coarse boundaries are widely available, leveraging them to improve fine‑grained segmentation is challenging due to significant domain shift. To address this, we propose MineC2FNet, a coarse‑to‑fine domain incremental learning framework that exploits abundant coarse data to enhance fine‑grained mining footprint segmentation. MineC2FNet adopts a teacher‑student architecture with attentive distillation at both the feature and prediction levels, selectively transferring generalized knowledge from the coarse domain while enabling boundary refinement using limited fine‑grained data (fine domain). We further introduce an expertly validated dataset of 219 images with precise boundary annotations across diverse geographies and commodities. Extensive experiments against state‑of‑the‑art approaches, including domain adaptation and domain incremental learning methods, demonstrate that MineC2FNet achieves superior performance while effectively handling domain shift. The dataset and code are publicly available at https://github.com/risqiutama/MineC2FNet.
Authors:Kavin Soni, Debanshu Das, Vamshi Guduguntla
Abstract:
Time series forecasting drives operational decisions in areas like finance, transportation, and energy. While supervised learning approaches achieve strong performance, they require domain‑specific training, feature engineering, and ongoing maintenance. Large‑scale foundation models have recently emerged as a zero‑shot alternative, avoiding task‑specific training much like LLMs. In this work, we evaluate foundation models against standard supervised approaches. Rather than focusing solely on aggregate accuracy, we analyze performance across four operational regimes: periodic human‑centric systems, physically constrained processes, stochastic financial markets, and heterogeneous demand forecasting. Our results characterize optimal deployment areas. Foundation models perform well in domains with transferable periodic structures and are efficient for cold‑start or long‑tail scenarios. Conversely, supervised specialists maintain higher precision in systems governed by strict physical constraints. In financial domains, newer foundation models are rapidly closing the performance gap with supervised specialists. We further quantify trade‑offs in inference latency, data drift adaptability, and deployment constraints. Finally, we propose a Complexity Router that assigns each series to the optimal model class using empirical features. We demonstrate that this selective routing achieves higher accuracy and significantly lower inference costs compared to deploying a universal foundation model, providing a practical framework for balancing generalization and efficiency.
Authors:Arya Jakkli, Senthooran Rajamanoharan, Neel Nanda
Abstract:
Frontier AI developers now train models against long written behavioral specifications, such as Anthropic's constitution (Anthropic, 2025a) and OpenAI's Model Spec (OpenAI, 2025a), integrated into post‑training via methods like character training (Anthropic, 2024) and deliberative alignment (Guan et al., 2024). These documents serve a governance function, but it is unclear how well models actually follow them under adversarial, multi‑turn pressure similar to what they would face in real‑world deployment. We propose a multi‑method audit pipeline that treats each lab's published specification as an auditable target: it decomposes the specification into atomic testable tenets (205 for Anthropic, 197 for OpenAI), generates multi‑turn adversarial scenarios with the Petri auditing agent (Anthropic, 2025b), runs a modified SURF‑style rubric search (Murray et al., 2026) to catch shallow single‑turn failures Petri misses, validates flagged transcripts against the relevant specification, and compares the findings against the lab's own published system card. Applying the pipeline across seven models per specification, we find that models follow their own lab's specification substantially better with each generation. On Anthropic's constitution, the Claude family falls from a 15.0% violation rate (Sonnet 4) to 2.0% (Sonnet 4.6); on OpenAI's Model Spec, the GPT family falls from 11.7% (GPT‑4o) to 3.6% (GPT‑5.2 medium reasoning), with the severity ceiling falling from 10/10 to 7/10. We cannot externally isolate whether these gains come from specification‑specific training, broader post‑training improvements, or evaluation awareness. Remaining failures cluster around operator‑imposed personas under AI‑identity questioning, irreversible action in agentic deployments, and fabricated quantitative claims with false precision.
Authors:Yuyu Liu, Haotian Xu, Yanan He, Sarang Rajendra Patil, Mengjia Xu, Tengfei Ma
Abstract:
Multi‑step reasoning remains a central challenge for large language models: single‑pass generation is efficient but lacks accuracy; tree‑search methods explore multiple paths but are computation‑heavy. We address this gap by distilling reasoning progress into a hyperbolic geometric signal that guides step‑by‑step generation. Our approach is motivated by a structural observation: in combinatorial reasoning trees, solution‑bearing states are few while dead ends are exponentially numerous. The hyperbolic space matches this asymmetry, with compact volume near the origin and exponentially expanding capacity toward the boundary, so that distance‑to‑origin naturally encodes solution proximity while angular separation distinguishes branches requiring different next operations. We train a lightweight head to project LLM hidden states into this space, then fine‑tune a low‑rank adapter interactively on its own reasoning attempts to act on the injected signal. Across multiple benchmarks, the geometric signal yields consistent gains, with larger improvements on deeper reasoning chains. Our code is publicly available at https://github.com/yuyuliu11037/HyperGuide.
Authors:Sanchit Kabra, Nikhil Abhyankar, Saaketh Desai, Prasad Iyer, Chandan K Reddy
Abstract:
Scientific discovery is a closed‑loop process in which hypotheses guide data acquisition and observations refine the hypothesis space. Yet most approaches reduce discovery to supervised learning over fixed datasets, where limited observations can support multiple plausible mechanisms that fit locally but fail to generalize. Thus, the key challenge is selecting informative observations to resolve uncertainty, shifting the focus from static inference to adaptive data acquisition. To address this, we propose LLM‑AutoSciLab, a closed‑loop framework that couples hypothesis generation with hypothesis‑conditioned experiment selection and mechanism refinement. Rather than fitting models to passively collected data, LLM‑AutoSciLab iteratively proposes plausible hypotheses, selects informative experiments to distinguish or refine them, and updates its state using the resulting evidence. To evaluate dynamic, closed‑loop scientific discovery with active data acquisition, we introduce ActiveSciBench, comprising two datasets: ActiveSciBench‑Chem with 57 enzyme‑kinetics tasks and ActiveSciBench‑GRN with 45 gene‑regulatory‑network tasks. These datasets model discovery as a budget‑constrained process requiring adaptive experiment design, variable selection, and recovery of true mechanisms. Across NewtonBench, ActiveSciBench‑Chem, and ActiveSciBench‑GRN, LLM‑AutoSciLab outperforms prior methods, achieving 67.6% and 35.1% symbolic accuracy on NewtonBench and ActiveSciBench‑Chem, respectively, and 31.1% exact graph recovery on ActiveSciBench‑GRN. Moreover, hypothesis‑guided experimentation is 2‑5x more sample‑efficient than the strongest competing baselines. Code and data are available at: https://github.com/scientific‑discovery/LLM‑AutoSciLab
Authors:Xiaotian Liu, Shuyuan Shang, Xiaopeng Wang, Pu Ren, Yaoqing Yang
Abstract:
Neural operators serve as fast, data‑driven surrogates for scientific modeling but typically rely on a monolithic, single‑pass inference procedure that struggles to resolve high‑frequency details, a limitation known as spectral bias. We introduce the Iterative Refinement Neural Operator (IRNO), which augments pre‑trained operators with a learned refinement module iteratively applied via fixed‑point iteration. IRNO decomposes the prediction into a coarse initialization followed by successive residual corrections, paralleling classical numerical solvers. Under local assumptions, we establish contraction of the induced operator, ensuring convergence to a unique fixed point. To explicitly target high‑frequency errors, we propose a progressive spectral loss that adaptively increases penalty on high‑frequency components over refinement steps during training. Across physical systems, IRNO consistently lowers error, with up to 56.05% improvement on turbulent flow. On Active Matter, spectral analysis reveals that, relative to base operator, the normalized error ratios decrease to 27.72‑36.10% in low‑, 5.07‑6.68% in mid‑, and 1.48‑2.04% in high‑frequencies, remaining stable beyond the trained iteration count. Code is available at https://github.com/xiaotianliu‑dartmouth/Iterative_Refinement_Neural_Operator
Authors:Yanyu Chen, Jiyue Jiang, Dianzhi Yu, Zheng Wu, Jiahong Liu, Jiaming Han, Xiao Guo, Jinhu Qi, Yu Li, Yifei Zhang, Irwin King
Abstract:
The evolution of Large Language Model (LLM) reasoning is bottlenecked by the scarcity of high‑quality process data. While self‑alignment via endogenous rewards offers a solution, mining valid supervision faces three challenges: (1) Label Noise via Mimetic Bias, where rewards prioritize statistical likelihood over logical truth, creating a "correctness illusion" that masks compounding errors; (2) Coarse‑Grained Supervision, where sparse global outcomes (e.g., in GRPO) fail to provide granular guidance, treating reasoning chains as monolithic; and (3) Distributional Collapse, where signals fail to generalize without amplifying pre‑training biases. To address these, we introduce LC‑ERD (Logic‑Consistent Endogenous Reward Decomposition), a framework framing self‑alignment as latent structure mining. We derive a Variational Logic Potential by aggregating consensus from the model's Latent Logic Expertise (LLE) to denoise the reasoning manifold, and introduce a Multi‑Agent Value Decomposition protocol based on the IGM principle to quantify individual step utility. Experiments show LC‑ERD delivers a robust self‑evolution path, uncovering trade‑offs between logic consistency and accuracy while identifying high‑value reasoning patterns missed by standard rewards. Our code is available at https://github.com/LC‑ERD‑repo/LC‑ERD.
Authors:Zhengqi Sun, Yiwen Sun, Boxuan Liu, Tailai Chen, Tianxu Guo, Jiabin Liu
Abstract:
Large language models (LLMs) are promising for autonomous driving, but semantics‑only decision policies can yield physically unsafe behavior in dynamic traffic. Existing methods either perform online language reasoning without explicit dynamics verification or use world models mainly in offline pipelines, leaving a gap between semantic intent and physical feasibility at decision time. We propose Reason‑‑Imagine‑‑Act (RIA), a closed‑loop framework that couples an LLM reasoner with an action‑conditioned world model for online safety verification. At each step, the LLM proposes an action template and candidate sub‑actions, the world model performs short‑horizon rollouts, and a safety scorer selects the safest executable action with feedback to the next reasoning step. Under a unified CARLA point‑goal protocol (1000 episodes), RIA achieves 80.05% route completion, 51.10% arrival rate, and 0.20% collision rate. Under the same closed‑loop interface, RIA consistently outperforms training‑free baselines, including CARLA TM and MADA, on core closed‑loop metrics. For reproducibility, code is available at https://github.com/pku‑smart‑city/source_code/tree/main/RIA.
Authors:Siqiao Huang, Partha Kaushik, Michael Chen, Hengkai Pan, Kaiwen Geng, Omar Chehab, Fernando Moreno-Pino, Max Simchowitz
Abstract:
World models have become a central paradigm for learning predictive simulators that support generation, planning, and decision‑making. Yet, despite rapid progress in industry‑scale interactive video generation, the broader research community still lacks compact, reproducible, and easily extensible implementations for studying the design choices underlying modern world models. We introduce Nano World Models, a minimalist codebase for future video prediction centered around diffusion forcing. Nano World Models provides a unified interface for generative objectives, model scales, action‑conditioning mechanisms, latent observation spaces, datasets, evaluation protocols, and long‑horizon rollout procedures. This design enables controlled studies of world‑modeling components that are often entangled across separate implementations. Through experiments across simple control environments, game simulation, and real‑robot data, we examine how prediction parameterization, architecture scale, action injection, sampling budget, and domain complexity affect video prediction quality and autoregressive rollout behavior. By releasing code, configurations, evaluation scripts, and pretrained checkpoints, Nano World Models aims to provide a compact yet extensible experimental substrate for open, reproducible, and scientific world‑model research.
Authors:Fabio Rovai
Abstract:
We investigate growth dynamics in deterministic equational discovery substrates. Across three toy domains (arithmetic, boolean, higher‑order list; n=592 trajectories), short‑range substrate sizes fit a power‑law N(t) proportional to t^b. Within each substrate b is architecture‑sensitive (cross‑validated R^2 approximately 0.82); the regression does not transfer across substrates (arith+bool to list yields R^2 approximately ‑0.84). A heuristic mean‑field closure model predicts a saturating power‑law dN/dt = K N^k exp(‑mu N) of which the pure power‑law is the short‑range approximation. Three robustness checks: bootstrap intervals on (k, mu) are tight in 4/5 toy trajectories and degenerate in 1/5; out‑of‑sample forecasting on toy data (fit first 100 epochs, predict next 400) is won by pure power‑law 5/5, indicating the toy trajectories do not reach saturation; on two real‑world growth proxies the result splits. New Mathlib/.lean file additions per month (mathlib4, 60 months, 9701 files) support the saturating form on OOS forecasting by approximately 7x over pure power‑law; Coq mathcomp monthly commits (129 months, 3083 commits) favour pure power‑law on both tests with mu collapsing to zero. The dynamics are substrate‑conditional at two levels: within‑substrate architecture‑to‑b regressions do not transfer, and the preferred functional family for N(t) itself (pure vs. saturating power‑law) differs by substrate. We propose "saturating power‑law growth with substrate‑conditional (k, mu), observable when the substrate has reached its saturation regime" as a working framing.
Authors:Feisal Alaswad, Batoul Aljaddouh, Maher Alrahhal, Poovammal E, Talal Bonny
Abstract:
Large language models achieve strong performance in language generation and knowledge‑intensive tasks, yet remain limited in settings requiring causal reasoning, persistent state tracking, and long‑horizon planning. We argue that these limitations may arise from an objective‑level mismatch between sequence prediction and reasoning over latent environment dynamics. To formalize this distinction, we introduce Latent Dynamics Inference (LDI), a conceptual perspective that interprets language and multimodal observations as partial evidence of underlying transition dynamics. To empirically investigate this perspective, we introduce Flux, a sequential reasoning environment specified entirely through natural‑language rules. As a proof‑of‑concept case study, the rules are first compiled into an explicit state‑transition simulator, illustrating that structured latent transition dynamics can, in some cases, be operationally extracted from textual rule descriptions. This enables a controlled comparison between the LLMs operating purely over textual observations and reinforcement‑learning agents trained directly within the extracted latent state space. Within this case study, agents operating with explicit access to the latent state space exhibit substantially more stable behavior in long‑horizon gameplay, achieving an aggregate win rate of approximately 79% versus 11% for LLMs. Qualitative analysis further reveals failure modes consistent with unstable persistent state tracking, including invalid actions, state‑tracking errors, and short‑horizon reasoning failures. The complete implementation of the Flux environment available at https://github.com/FeisalAlaswad/FLUX‑RL‑Agent Within the evaluated setting, these results suggest that strong sequence prediction alone may struggle to support robust long‑horizon dynamic reasoning without mechanisms for persistent state tracking and transition modeling
Authors:Alfredo Metere
Abstract:
The companion paper introduced a four‑level verification lattice on agent‑skill manifests (unverified, declared, tested, formal) and left the top level aspirational. This paper closes that gap. We give a precise semantics for skill behaviour faithful to how a skill is consumed by an LLM‑driven runtime (a deterministic script‑side reachable through a non‑deterministic LLM‑side), state the verification problem as a capability‑containment property over that semantics, and present three composable methods that together raise a skill from declared or tested to formal: (1) sound static capability‑containment analysis of the script‑side via abstract interpretation over a small effect lattice; (2) a refinement type system for tool‑call envelopes that mechanically rejects any call whose statically‑inferred capability is not in the manifest's declared set; (3) SMT‑bounded model checking against the parent paper's biconditional correctness criterion, with the bound chosen so any counter‑example fitting the runtime's transaction‑buffer horizon is exhibited as a concrete trace. We prove the three layers composed soundly cover the parent paper's threat model modulo a single residual (the LLM's freedom to refuse to act) that the parent paper's runtime biconditional catches at session boundary. The methods reuse existing well‑engineered tools (Z3, Semgrep, CodeQL, refinement‑type checkers, mechanised proof assistants) rather than asking operators to build new ones, and the proof‑carrying artifact extends the existing SKILL.md convention. All three methods plus the bundle producer and re‑checker ship as zero‑dependency JavaScript modules in the open‑source enclawed framework (https://github.com/metereconsulting/enclawed; project page https://www.enclawed.com/), with 53 unit tests and an end‑to‑end CLI demo on a sample skill.
Authors:Sebastien Kawada
Abstract:
How do multi‑turn reasoning systems fail? The expected answer is logical contradiction, in which the system's maintained state becomes unsatisfiable. We show that the dominant mode is instead satisfiable drift, where the internal state stays consistent while the returned answer silently violates prior commitments. We build DRIFT‑Bench (Decomposing Reasoning Into Failure Types), a solver‑instrumented benchmark of 816 test problems across three constraint domains, and evaluate four methods on it across four open‑weight models (8B‑120B parameters). MUS‑Repair, which feeds minimal unsatisfiable subsets back to the generator, is strongest in every setting (+1.8 to +15.0 pp over the best non‑MUS baseline). But the central finding is what repair leaves behind. After structured feedback, models rarely contradict themselves. They forget. Residual errors are 98‑100% satisfiable drift across all settings, while contradiction drops to near zero. Reliable multi‑turn systems must separately validate that the returned answer respects the maintained state. Code is available at https://github.com/kaons‑research/drift‑bench.
Authors:Zhiyuan Zhai, Xinkai You, Wenjing Yan, Xin Wang
Abstract:
Reasoning‑capable large language models solve hard problems by emitting long chains of thought, paying heavily in latency, GPU time, and energy. Casual inspection of their traces reveals extensive reformulation, verification, and circular self‑reflection, yet how much of this deliberation is actually necessary has never been measured at scale or explained from first principles. This paper closes both gaps. We formalise reasoning redundancy directly in terms of the reasoning model itself: the redundancy of a correct trace is the largest fraction of its trailing segmented steps that can be truncated while π, forced to terminate thinking and emit a final answer, still produces the correct answer. A large‑scale quantification across four frontier reasoning models and two mathematical benchmarks shows that step‑level redundancy is consistently high ‑‑ between 61% and 93% across the 8 (model, benchmark) conditions we study, with the median critical prefix equal to a single segmented step in six of the eight conditions ‑‑ that the finding is robust to the choice of judge family, and that although ρ decreases with problem difficulty on MATH‑500, all four models remain substantially redundant (ρ\in [46%, 85%]) even on the hardest Level‑5 problems. We then prove that this redundancy is a structural consequence of length‑agnostic outcome rewards, not a model‑specific artefact: under any such reward, no finite expected stopping time is optimal. The result holds regardless of RL algorithm, base model, data distribution, or whether the policy is obtained via RL or distillation; over‑thinking is therefore not a bug to be patched in individual models but a structural property of how current reasoning models are trained. Code: https://github.com/zhiyuanZhai20/how‑much‑thinking‑is‑enough
Authors:Sam Earle, Kai Arulkumaran, Andrew Dai, Akarsh Kumar, Julian Togelius, Sebastian Risi
Abstract:
We are in the midst of large‑scale industrial and academic efforts to automate the processes of scientific, technological and creative production through AI‑driven assistants. Historically, a fundamental property of these processes in their human form has been their open‑endedness: their capacity for generating a seemingly endless supply of novel and meaningful new forms. Do artificial agents have any capacity for such fruitful unguided discovery? To answer this question, we turn to Picbreeder, the canonical exemplar of human‑driven open‑ended search, in which users collaboratively generated a diverse library of images through interactive evolution of small neural networks. We replicate Picbreeder, replacing human users with frontier Vision Language Models (VLMs). We observe clear qualitative differences between the output of our system and the historical human baseline, and attempt to characterize them using metrics of phylogenetic complexity and visual and semantic salience and novelty. In an effort to identify some of the causal factors contributing these differences, we study the addition of exploratory noise to the agents' selection process, of behavioral diversity between agents, and of narrative momentum in the form of memory of past actions. We make our code available at https://github.com/smearle/picbreeder‑vlm.
Authors:Jianshu Zhang, Yijiang Li, Huifeixin Chen, Haoran Lu, Letian Xue, Bingyang Wang, Han Liu
Abstract:
Vision‑Language Models (VLMs) are increasingly deployed in embodied environments, where they need produce numerical outputs such as action magnitudes and spatial coordinates. Although these numbers appear meaningful, it remains unclear whether these numerical outputs are genuinely grounded in spatial perception. Therefore, in this work, we revisit spatial numerical understanding through SpaceNum, a unified framework that captures two complementary settings: numbers as dynamic transitions during spatial exploration, and numbers as static layouts in spatial reasoning. We formulate two bidirectional tasks, Num2Space and Space2Num, to evaluate how well VLMs map between vision‑side spatial structure and language‑side numerical representations. We systematically study whether current VLMs truly understand numerical values in spatial settings. Across dynamic transitions and static layouts, we find that models largely fail to ground numbers in spatial meaning and often perform close to random guess. Through error analysis, reasoning trace analysis, and controlled interventions, we show that current VLMs rely heavily on shallow spatial cues, struggle to build stable coordinate‑aware representations, and fail to abstract structured spatial layouts from visual observations. We further show that explicit reasoning provides only marginal gains, while tuning can partially improve spatial numerical understanding and transfer to external spatial reasoning benchmarks.
Authors:Beichen Zhang, Yuhong Liu, Jinsong Li, Yuhang Zang, Jiaqi Wang, Dahua Lin
Abstract:
Multimodal Large Language Models have advanced visual reasoning, yet a purely textual chain of thought remains a bottleneck for questions that require fine‑grained focus or view transformations. The ''think with images'' paradigm narrows this gap, but existing approaches are either constrained by fixed predefined toolkits or produce noisy intermediate images from unified multimodal methods. We pursue a third option: using a dedicated image editing model and decouple it with an understanding model. However, off‑the‑shelf image editors fail as reasoning assistants with two complementary gaps: a language‑side gap, where editors trained as passive instruction‑followers cannot map an abstract question to an appropriate visual transformation, and a generation‑side gap, where edit correctness degrades as reasoning depth grows. Guided by this analysis, we introduce ETCHR (Editing To Clarify and Harness Reasoning), a question‑conditioned, reasoning‑aware image editor decoupled from the downstream understanding model and trained with a two‑stage recipe targeted at the two gaps: Reasoning Imitation via supervised fine‑tuning on edit trajectories, followed by Reasoning Enhancement with VLM‑derived rewards for edit correctness and downstream reasoning accuracy. Since the editor is decoupled, ETCHR plugs into different open‑ and closed‑source MLLMs in a training‑free manner. Across five task families (fine‑grained perception, chart understanding, logic reasoning, jigsaw restoration, and 3D understanding), ETCHR raises average Pass@1 from 55.95 to 60.77 (+4.82) with Qwen3‑VL‑8B, from 65.08 to 70.55 (+5.47) with Gemini‑3.1‑Flash‑Lite, and from 76.55 to 81.16 (+4.61) with the 1T‑parameter MoE model Kimi K2.5.
Authors:Shuhong Zheng, Michael Oechsle, Erik Sandström, Marie-Julie Rakotosaona, Federico Tombari, Igor Gilitschenski
Abstract:
Visual geometry transformers have become powerful architectures for multi‑view 3D reconstruction, enabling joint prediction of multiple 3D attributes in a feed‑forward manner. However, their computational cost grows quadratically with the input sequence length due to the global attention layers inside these models. This limits both their scalability and efficiency. In this work, we address this challenge with a simple yet general strategy: restricting the number of key/value tokens that each query interacts with during global attention. To achieve effective token selection, we introduce a two‑stage framework. First, an inter‑frame selection step operates at the frame level to identify frames that should be preserved. Second, an intra‑frame selection step further discards more redundant tokens within the selected frames. Our analysis highlights the advantage of a diversity‑based strategy for inter‑frame selection, which ensures broad coverage of the scene. For intra‑frame selection, we show that layer‑aware sparsification is necessary, with the selection process guided by the entropy of the global attention pattern. Our approach offers a superior speed‑accuracy trade‑off compared to existing solutions. Extensive experiments show that it accelerates visual geometry transformers by over 85% for scenes with 500 images while maintaining, or even improving, baseline performance, which hints that how our token selection strategy can play a crucial role in future applications of visual geometry transformers. Our project website is available at https://zsh2000.github.io/good‑token‑hunting.github.io.
Authors:Stuart Bladon, Brinnae Bent
Abstract:
It has generally been assumed that geopolitical bias in language models originates from the training data used during the pre‑training phase. We tested seven open‑weight LLM pairs consisting of the base model (pre‑training only) and the chat model (pre‑training and post‑training) from seven labs on a paired‑scenario forced‑choice probe over 28 country pairs in English, French, and Chinese, and found that geopolitical bias originates in post‑training rather than in pre‑training. Across seven AI labs, six showed shifts in the direction associated with the country or region of the model developer after post‑training. This shift is strongest in Alibaba's Qwen 2.5: while the base is neutral on China‑favourability (‑0.15 log‑odds, p=0.15), the post‑trained chat variant is at +2.91 (p<10^‑4), an 18x shift in odds. We also observe shifts in biases toward other countries across all models. Additionally, the magnitude of this shift depends on the language used to prompt the model: the French‑made Mistral becomes pro‑France only under French prompting (FR‑EN shift +1.91, p<10^‑4). These findings suggest that geopolitical preferences in language models are not simply inherited from large‑scale internet data but are actively shaped during post‑training, highlighting the need for greater transparency, auditing, and oversight of alignment processes that influence how models represent nations, cultures, and political perspectives.
Authors:Jiangwang Chen, Bowen Zhang, Zixin Song, Jiazheng Kang, Xiao Yang, Da Zhu, Guanjun Jiang
Abstract:
Although large language model (LLM) conversational systems process millions of multi‑turn dialogues daily, they remain fundamentally reactive: they respond only after the user types a query. A key step toward proactive interaction is next‑query prediction, which anticipates the user's subsequent query based solely on the preceding dialogue. Progress on this task is hindered by the lack of dedicated benchmarks and a fundamental efficiency‑‑quality trade‑off: naively concatenating full dialogue history incurs linearly growing token consumption, while truncating to the latest turn discards crucial cross‑turn context. Our key insight is that accurate prediction does not require re‑reading raw history; it suffices to track the user's evolving intent trajectory across topics, unresolved needs, and interest shifts. We propose OnePred, which maintains a recursively updated memory as its sole cross‑turn context, bounding the per‑turn cost independently of conversation length. We train the model via a two‑stage reinforcement learning pipeline that first teaches what to predict, then what to compress, shaping the memory into a prediction‑oriented intent chain. To establish a rigorous testbed, we introduce NQP‑Bench, spanning three diverse subsets. Experiments demonstrate that OnePred reduces per‑turn token consumption by up to 22× compared to full‑history inputs while consistently exceeding all baselines in prediction quality, with larger gains on longer conversations. Our code is publicly available at https://github.com/ZBWpro/OnePred.
Authors:Liupeng Li, Haoqian Kang, Zhenyu Lu, Jinpeng Wang, Bin Chen, Ke Chen, Yaowei Wang
Abstract:
High‑resolution (HR) image perception presents a key bottleneck for multimodal large language models (MLLMs). While visual search offers a promising solution, existing methods struggle with the trade‑off between coverage and efficiency. Visual expert‑assisted search is efficient but prone to blind spots when proposals fail, whereas scan‑based search guarantees coverage at the cost of computational redundancy and semantic fragmentation. To address this dilemma, we introduce CVSearch, a training‑free adaptive framework that dynamically schedules search strategies via an Assess‑then‑Search workflow. Specifically, CVSearch first invokes expert‑assisted search when global information is insufficient, and only triggers a novel semantic‑aware scanning mechanism upon failure. Distinct from rigid grid partitioning, this efficient scanning paradigm incorporates Semantic Guided Adaptive Patching to decompose images into semantically consistent regions, effectively mitigating object fragmentation. Furthermore, we devise a Dynamic Bottom‑Up Search strategy driven by a Visual Complexity prior to enable efficient and precise iterative exploration of local details. Extensive experiments on HR benchmarks demonstrate that CVSearch achieves state‑of‑the‑art accuracy while substantially improving search efficiency. Code is released at https://github.com/liliupeng28/ICML26‑CVSearch.
Authors:Jiazheng Kang, Bowen Zhang, Zixin Song, Jiangwang Chen, Xiao Yang, Da Zhu, Guanjun Jiang
Abstract:
ReAct‑style agents for search‑intensive, multi‑step reasoning tasks rely largely on their own internal judgment to decide what evidence to seek, which reasoning or action step to take next, and when to stop, often producing shallow, redundant, or poorly targeted trajectories. Prior work has explored rubrics as external quality signals, but existing uses are mostly evaluative rather than action‑guiding: rubrics typically serve as training‑time rewards or post‑hoc evaluators of completed outputs, and in deep‑research settings they are often coarse‑grained and report‑level rather than step‑level. We introduce Co‑ReAct, a rubric‑guided action‑selection framework that uses rubrics as step‑level guidance during inference. At each decision step, Co‑ReAct injects a rubric into the agent's context to guide the next Reason‑or‑Act decision, specifying what the agent should target in evidence seeking, search, reasoning, or self‑evaluation. To make this guidance reliable, we train a dedicated rubric generator with GRPO. Unlike prior pairwise or binary preference formulations, our objective optimizes a list‑wise Spearman rank‑correlation reward against multi‑judge expert consensus rankings, encouraging rubrics that are discriminative rather than merely plausible. On DeepResearchBench and SQA‑CS‑V2, Co‑ReAct consistently improves over ReAct and representative test‑time compute baselines across search agents built on both 8B/14B open‑source and frontier closed‑source base models. The trained rubric generator can also serve as a drop‑in component that improves these baselines without changing their underlying decision mechanisms. Our code is publicly available at https://github.com/ZBWpro/Co‑ReAct.
Authors:Zhangyi Hu, Chenhui Liu, Tian Huang, Jindong Li, Yang Yang, Jiemin Wu, Zining Zhong, Menglin Yang, Yutao Yue
Abstract:
Recently, Reinforcement Learning with Verifiable Rewards (RLVR) and Test‑Time Scaling (TTS) have advanced LLM code generation through executable verification. Yet Ground‑Truth Unit Tests (GT UTs) remain a bottleneck: SOTA RLVR methods require them for costly training, while existing TTS methods lose competitiveness without them. This motivates GT‑free TTS, where existing methods directly use self‑generated UTs to refine and select code candidates. Yet such UTs are often noisy or spuriously coupled with wrong code, and UT quality in turn cannot be validated without reliable code. The key challenge is therefore to jointly improve both. To this end, we present CoSPlay, a GT‑free, training‑free framework that jointly improves codes and UTs through cooperative self‑play. It first explores diverse solution ideas and identifies their potential failure modes to produce discriminative UT ideas. It then uses bidirectional pass‑count signals from the Code‑UT execution matrix to iteratively prune or fix weak codes and refresh or replace unreliable UTs, letting the two pools co‑evolve. Finally, when multiple codes remain tied at the highest pass count, it picks the final code from the largest output‑consensus cluster, since correct codes agree on the same inputs while wrong codes diverge. Experiments on four challenging benchmarks show that CoSPlay on Qwen2.5‑7B‑Instruct improves average BoN from 22.1% to 33.2% and UT accuracy from 14.6% to 78.3%, matching or surpassing the RLVR model CURE‑7B. When applied to CURE‑7B, it further improves BoN by 5.7%. CoSPlay also generalizes across diverse backbones and outperforms GT‑free TTS baselines under comparable token budgets, with continued gains as the budget scales up. These results suggest a scalable inference strategy for competitive code generation without any GT data.
Authors:Jongoh Jeong, Hoyong Kwon, Minseok Kim, Kuk-Jin Yoon
Abstract:
Dataset distillation compresses large training sets into compact synthetic datasets while preserving downstream performance. As modern systems increasingly operate on paired vision‑language inputs, multimodal distillation must preserve representation quality and cross‑modal alignment under tight compute and memory budgets, yet prior methods often require heavy computes and overlook their correlations. To address this, we present Multimodal Distribution Matching (MDM), a geometry‑aware framework for efficient and generalizable multimodal distillation. Specifically, MDM integrates complementary components at the data, model, and loss levels. At the data level, it initializes synthetic image‑text pairs by sampling from clusters in the joint embedding space. At the model level, it forms a mixed teacher by interpolating independently fine‑tuned models in weight space according to their angular deviation from the pretrained anchor. At the loss level, it matches joint distributions on the unit hypersphere using a geometry‑aware matching objective that exploits the joint features in the cross‑modal agreement and discrepancy directions along with symmetric contrastive learning. Across image‑text retrieval benchmarks with cross‑architecture evaluation, MDM yields compact synthetic sets that preserve multimodal semantics, substantially reduce distillation cost, and remain robust across architectures.
Authors:Hanadi Alhamdan, Ghadah Alosaimi, Amir Atapour-Abarghouei, Farshad Arvin
Abstract:
Aggressive driving is a major cause of traffic accidents and poses a serious threat to road safety. Although deep learning methods have shown promising results in detecting risky driving behaviours from vehicle sensor data, their performance in real‑world conditions is often limited by severe data imbalance, large variability between drivers, and the lack of physically interpretable vehicle dynamics representations. In this paper, we propose an enhanced deep learning framework for aggressive driving detection using multivariate vehicle dynamics signals. Instead of relying solely on raw measurements, the proposed approach constructs engineered dynamic features that capture steering, acceleration, and braking behaviour. To address the extreme rarity of aggressive events in naturalistic driving data, we introduce a stable training strategy that combines controlled SMOTE‑based oversampling with a class‑weighted loss formulation, and evaluates focal loss variants for imbalance handling. Furthermore, a safety‑oriented decision strategy based on class‑specific threshold calibration is adopted to better reflect the asymmetric risks of missed detections and false alarms in real‑world applications. The proposed framework is evaluated on a newly collected naturalistic driving dataset. Extensive experiments show that the proposed method consistently outperforms standard deep learning baselines with significant improvements in minority‑class recall and safety‑critical F‑score metrics while maintaining practical computational efficiency. Code: \url https://github.com/halhamdan/CBANet
Authors:Jiaqi Feng, Justin Cui, Yuanhao Ban, Cho-Jui Hsieh
Abstract:
Recent advances have substantially improved real‑time interactive video generation in the autoregressive regime. However, most existing few‑step autoregressive video generation methods, often distilled from a corresponding many‑step teacher, default to a 4‑step sampling configuration, which still incurs considerable latency during deployment and suffers from severe quality degradation when the number of sampling steps is further reduced, particularly in the one‑step setting. Trajectory‑style consistency distillation methods often produce videos with weak dynamics, while DMD‑based approaches, such as Self‑Forcing, tend to yield blurry frames. To address this challenge, we propose One‑Forcing, a simple yet effective approach which augments the DMD objective with an auxiliary GAN loss for high‑quality and efficient one‑step video generation. Experiments on VBench show that One‑Forcing achieves a total score of 83.76, establishing state‑of‑the‑art performance among one‑step causal video generation methods and remaining competitive with strong many‑step approaches. We further demonstrate that one‑step framewise autoregressive generation can be achieved stably with merely one‑third of the training cost of the chunkwise model, a setting that prior methods have failed to achieve successfully.
Authors:Shuai Zhen, Yifan Zhang, Yuling Wang, Yanhua Yu
Abstract:
Reinforcement learning has long struggled with poor sample efficiency. One promising approach to mitigate this problem is leveraging group‑invariant Markov Decision Processes (G‑invariant MDPs). Existing works in this direction have primarily focused on image‑based RL and rotational symmetry such as \mathrmSO(2), leaving state‑based RL and reflection symmetry largely underexplored. In this work, we focus on state‑based continuous control tasks and exploit reflection symmetry by introducing Reflex, a paradigm that seamlessly integrates with both on‑policy and off‑policy RL algorithms. We formalize two types of reflection‑axial reflection and bilateral reflection, and characterize their corresponding transformations. Building on a theoretical analysis of symmetry‑preserving optimal value functions and policies, Reflex integrates reflection symmetry into policy learning through principled symmetry regularization mechanisms. We integrate Reflex with PPO and SAC, and evaluate it on a suite of OpenAI Gym and DeepMind Control benchmarks, demonstrating superior performance over standard baselines while improving sample efficiency. Our code is available at https://github.com/TonyStark042/Reflex.
Authors:Jinglin Li, Jun Tan, QI Fang, Ning Gui
Abstract:
Effectively modeling non‑stationary dynamics in probabilistic multivariate time series(MTS) forecasting requires balancing expressiveness with robustness. Existing parametric approaches benefit from strong inductive biases but lack flexibility, whereas deep generative models struggle to capture complex temporal dependencies without extensive data and computation. We introduce Parametric Prior Mapping (PPM), a framework that injects parametric structural priors into a generative modeling process. Specifically, PPM utilizes a parametric estimator to derive a dynamic, adaptive prior that guides the learning of a complex predictive distribution via a learnable mapping. This design allows the model to retain the efficiency of parametric methods while exploiting the expressive power of generative models. Trained with a hybrid objective, PPM yields precise forecasts with well‑calibrated uncertainty estimates. Empirical results show that PPM outperforms existing baselines in handling non‑stationary data, offering a superior trade‑off between accuracy and computational efficiency. The code is available at https://github.com/ljl8336/PPM.
Authors:Po-Kai Chen, Niki van Stein, Aske Plaat
Abstract:
Mechanistic interpretability of transformers requires identifying not just which components matter but how they compose into the computational route that produced a prediction. Both attention and MLP follow a shared key‑value template ϕ(S)U. We exploit this structure to develop Unpack, a backward recursion that decomposes credit through both sublayers, producing interaction strengths between any two components, named end‑to‑end paths with K/Q/V composition labels, and per‑token attribution from a single forward pass, without intervention, gradients, or auxiliary training. We evaluate on the indirect object identification task. On GPT‑2 small, the method recovers all three composition connections described by Wang et al. (2023), including the mode‑specific routing of each connection (K, Q, or V). To test token‑level attribution beyond trivial copying, we compare two occurrences of the same name in the same decomposition: the first mention retains strong credit while the duplicate‑detection position is suppressed, a pattern absent in matched control prompts. Across the Pythia family from 160M to 6.9B parameters, this suppression pattern is consistently recovered at every scale, demonstrating that the method tracks mechanistic structure without ground‑truth circuit labels. Code is available at https://github.com/Fun‑Cry/unpacklm.
Authors:Muhammad Usama, Dong Eui Chang
Abstract:
Large language models trained under diverse objectives and architectures have been shown to develop increasingly similar internal representations, an observation formalized as the Platonic Representation Hypothesis. Whether this representational convergence extends to the reasoning processes that operate over shared representations remains untested. We evaluate representational similarity across 16 language models from 8 families (1.5B to 72B parameters) on 800 reasoning problems spanning mathematics, science, commonsense, and truthfulness, stratifying by problem difficulty, computational stage, and causal relevance. Our analysis reveals three dissociations: a difficulty inversion, where models converge more on problems they collectively fail (Centered Kernel Alignment [CKA] = 0.897) than on those they solve (CKA = 0.830); a generation gap, where pre‑decision representations align (CKA = 0.875) while post‑decision representations diverge (CKA = 0.274); and epiphenomenal correctness, where shared information is decodable across models (66% transfer accuracy) but exerts minimal causal influence on predictions (1.5% to 5.5% flip rate across ablation protocols). These results indicate that representational convergence in language models reflects shared input processing constraints rather than shared reasoning strategies, with direct implications for ensemble design, interpretability transfer, and evaluations of model similarity. Code is available at https://github.com/Usama1002/convergence‑without‑understanding.
Authors:Xiyang Wang, Xinlin Wang, Tingguang Zhou, Gong Chen, Xingtai Gui, Zhi Xu, Xiaolei Wu, Feiyang Tan, Hangning Zhou, Mu Yang
Abstract:
Current end‑to‑end autonomous driving systems are fundamentally limited by a mismatch between temporal causal reasoning and global trajectory consistency. Autoregressive (AR) models capture interaction‑aware temporal dependencies via causal factorization, but their step‑wise decoding leads to error accumulation and suboptimal global structure. In contrast, diffusion models optimize trajectories globally but lack explicit causal constraints, making them unreliable in interactive and safety‑critical scenarios. This dichotomy reveals a deeper issue: existing methods treat causal modeling and global optimization as separate paradigms, without a principled way to unify them within a single trajectory distribution. To address this, we propose ChainFlow‑VLA, which unifies causal generation and global refinement within a unified probabilistic framework. We formulate planning as a mixture over AR‑induced modes and learn Vision‑Language Model (VLM)‑conditioned residual distributions over these modes. An autoregressive generator (Chain) produces a discrete set of causal trajectory modes, followed by a diffusion‑based refiner (Flow) that leverages VLM hidden states as semantic priors to perform mode‑conditioned correction in residual space while preserving causal structure. This straightforward conditioning seamlessly injects high‑level scene understanding into fine‑grained trajectory adjustments. Experiments demonstrate that ChainFlow‑VLA achieves robust planning in ambiguous and long‑tail scenarios, achieving a state‑of‑the‑art score of 94.85 on the NAVSIM v1 leaderboard, matching human‑level performance (94.8). Code will be available at https://github.com/AFARI‑Research/ChainFlow‑VLA.
Authors:Minju Kim, Youngbum Hur
Abstract:
Time series forecasting plays a central role in many real‑world applications and has been extensively studied. Most existing approaches rely on deterministic models. However, real‑world environments exhibit inherently uncertain and complex future behaviors, making single‑point predictions insufficient. This highlights the need for probabilistic forecasting methods that can quantify and represent uncertainty. In this work, we propose PaP‑NF, a probabilistic forecasting framework that aligns continuous time series representations with a frozen large language model (LLM) using a Prefix‑as‑Prompt mechanism, and conditions a normalizing flow decoder on the global context extracted by the LLM. The quality of the resulting predictive distributions is evaluated using the Continuous Ranked Probability Score (CRPS), a standard metric in probabilistic forecasting. Across a variety of long‑term forecasting benchmarks, PaP‑NF robustly captures multi‑modal uncertainty while maintaining competitive point forecasting accuracy. The official implementation is available at: https://github.com/democracy04/PaP‑NF
Authors:Gabriele Oliaro, Yichao Fu, May Jiang, Owen Lu, Junli Wang, Zhihao Jia, Hao Zhang, Samyam Rajbhandari
Abstract:
LLM‑based agents for GPU kernel generation are advancing rapidly, yet their progress is fundamentally constrained by the benchmarks they optimize against. Existing benchmarks are poorly aligned with production inference frameworks: they evaluate kernels on a single GPU with synthetic inputs, ignore the surrounding compilation stack, and reward replicating known optimizations rather than discovering new ones. The resulting reward signals are misleading: agents learn to generate kernels that score well in sandboxes but introduce interface incompatibilities, compilation‑stack conflicts, and silent correctness degradation when integrated into real systems. We introduce FastKernels, a kernel benchmark built around a minimal set of 46 representative architectures spanning 8 categories, whose kernels collectively subsume those of 96.2% (409/425) of HuggingFace Transformers architectures. FastKernels doubles as a minimalistic, production‑grade inference framework that runs at parity with hardened systems such as vLLM and SGLang on mainstream LLM serving and substantially exceeds upstream references on under‑served architectures; each task's interface mirrors the corresponding module in the state‑of‑the‑art library for its architecture family, enabling direct deployment of optimized kernels into production codebases. Evaluating state‑of‑the‑art kernel agents on FastKernels, we find that even the strongest agent achieves only 0.94× aggregate speedup over production baselines, with weaker agents at 0.78× and 0.53× ‑‑ confirming that benchmark‑production misalignment is a critical bottleneck for the field. We release FastKernels as a stepping stone toward kernel agents whose benchmark gains translate directly into production throughput improvements. Code is available at https://github.com/Snowflake‑AI‑Research/fastkernels
Authors:Jean-Guillaume Durand, Panagiotis Kouvaros, Maxime Gariel, Alessio Lomuscio
Abstract:
The adoption of vision neural networks in regulated industries requires formal robustness guarantees, especially in safety‑critical domains such as healthcare, autonomous vehicles, and aerospace. However, current approaches are confined to incomplete statistical verification or robustness to \ell_p‑norm and affine transforms, which cover only a narrow subset of perturbations to the image formation process. In particular, robustness to camera motion remains an open problem despite being key to deploy many vision applications. We present a formal verification approach that targets robustness against 3D motion perturbations of the capturing camera. We first establish a closed‑form mapping from camera pose to pixel values. By analyzing the continuity properties of the resulting homographies, we show that recent work on Lipschitz optimization and piecewise continuity can be extended to derive tight linear bounds on perturbed pixel values. Our approach applies to scenes with predominantly planar structure, such as ground planes in augmented reality, road markings and traffic signs in autonomous driving, or planar workspaces in robotic manipulation. This enables the first formal verification of projective geometry transforms, without complex simulation, surrogate networks, or explicit image‑formation models. We validate our implementation and show up to 89% speedup and 7% tighter bounds over prior work. We then evaluate our method on the VNN‑COMP benchmark and reveal systematic weaknesses to projective perturbations. Finally, we demonstrate a real‑world case study on a safety‑critical runway classifier, highlighting practical vulnerabilities to camera motion, and addressing a key challenge in the certification of learned models. Data and code are publicly available at https://github.com/jeangud/homography‑verification .
Authors:Eric Xu
Abstract:
Role prompts of the form As X, do Y admit a clean linear decomposition at one specific site in the residual stream: the prompt‑to‑answer transition ‑‑ the last prompt token together with the first two generated tokens ‑‑ in an early/mid layer band. There, persona and task contribute through partially orthogonal additive directions. Forming a pure persona effect Δ_X, a pure task effect Δ_Y, and substituting h_BB + Δ_X + Δ_Y for the clean residual yields downstream output within a small KL of clean on Gemma‑2‑2B‑IT and Qwen‑2.5‑\1.5B, 3B\‑Instruct, across a 12‑cell short grid and a 48‑cell long‑persona grid, with persona‑specific behavioral markers preserved. The natural inference from this additive structure is that the role prompt can be compressed into a single cached residual vector. \emphWe show it cannot. Injecting the cached additive prediction ‑‑ or even the oracle clean residual h_XY ‑‑ into a baseline host prompt with the persona text removed does not approach the clean long‑persona target, at one site or at many layers. Persona‑conditioned multi‑token generation flows through attention back to the persona‑text positions throughout the prompt, which no residual at one site reproduces. Local additivity in the residual stream does not imply prompt compressibility. The additive structure at the prompt‑to‑answer transition supports interpretability and fine‑grained steering of persona or task contributions; persona‑conditioned behavior across the full continuation depends on a distributed prompt/KV mechanism that local activation arithmetic does not displace.
Authors:Jaehyeop Hong, Youngbum Hur
Abstract:
Multivariate time series anomaly detection has become increasingly important in real‑world applications, where labeled data are often scarce. Many existing approaches rely on unsupervised learning to model normal patterns, but they often treat all channels equally. This design can dilute anomaly‑relevant signals, since not all channels contribute equally to anomaly detection. In this paper, we propose CALAD, a channel‑aware contrastive learning framework for multivariate time series anomaly detection. CALAD governs the construction of contrastive samples using estimated channel relevance, allowing the learning process to reflect anomaly semantics rather than generic similarity. Channel relevance is estimated from reconstruction errors of a transformer‑based autoencoder and is used to distinguish channels that are more influential to anomalous behaviors. Using this information, we design a channel‑wise augmentation strategy in which positive and negative samples are constructed based on whether anomaly‑relevant channels are preserved or perturbed. This encourages invariance to changes in irrelevant channels while being sensitive to changes in anomaly‑relevant channels. Furthermore, CALAD combines contrastive learning and an auxiliary reconstruction head, allowing the model to learn discriminative representations while retaining normal structures. Experiments on multiple real‑world datasets shows that CALAD consistently outperforms existing methods, particularly under distribution shift scenarios. We provide the code for reproducibility at https://github.com/hirundo1218/CALAD
Authors:Yannick Kirchhoff, Maximilian Rokuss, Daniel Philipp Mertens, David Füller, Benjamin Hamm, Andreas Schreyer, Oliver Ritter, Klaus Maier-Hein
Abstract:
Tracking tumor lesions across serial CT scans is essential for oncological response assessment. Existing automated methods face a fundamental trade‑off: end‑to‑end trackers achieve high automation but offer no opportunity to correct silent tracking failures, while decoupled registration‑segmentation pipelines permit user verification yet discard the lesion's prior appearance, limiting accuracy in ambiguous cases. In this work, we propose a Verified Tracking paradigm: a clinician verifies a registration‑proposed prompt, which the model leverages alongside the baseline lesion appearance to resolve segmentation ambiguities. We present a unified framework combining early spatial prompt fusion with latent temporal difference weighting for longitudinally‑informed segmentation. To address data scarcity, we leverage large‑scale synthetic pretraining, proving essential for exploiting longitudinal context, improving performance by up to 4.5 Dice points over training from scratch. Our approach secured first place in the MICCAI autoPET IV challenge. We further curate and release PanTrack, a new longitudinal pancreatic cancer benchmark, to assess out‑of‑distribution generalization. Experiments show that our model outperforms prior work in both fully automatic and the proposed verified tracking setting offering a clinically safe middle ground between automation and control. Code, model and dataset will be released at https://github.com/MIC‑DKFZ/LongiSeg
Authors:Hyeongmuk Lim, Youngbum Hur
Abstract:
Existing Video Anomaly Detection (VAD) methods typically rely on task‑specific training, leading to strong domain dependency and high training costs. Moreover, most existing methods output only scalar anomaly scores, providing limited insight into why specific events are considered abnormal. Recent advances in Vision‑Language Models (VLMs) have enabled both anomaly detection and human‑interpretable reasoning. However, many VLM‑based approaches still require additional training steps (e.g., instruction tuning or verbalized learning) or external Large Language Models (LLMs), incurring further training costs and inference overhead. To address these challenges, we propose CoReVAD, a contextual reasoning framework for training‑free video anomaly detection that operates with a single frozen VLM. CoReVAD directly generates anomaly scores and temporal descriptions from the VLM. To mitigate noise in generative outputs, we introduce a Local Response Cleaning (LRC) module based on local vision‑text alignment. Furthermore, global temporal context and progression are incorporated through softmax‑based refinement, Gaussian smoothing, and position weighting. Experiments on UCF‑Crime and XD‑Violence demonstrate that CoReVAD achieves competitive performance among training‑free methods while providing reliable and interpretable explanations. Our official code is available at: https://github.com/Muk‑00/CoReVAD
Authors:Joshua Odmark, Gideon Rubin, Deon van der Vyver
Abstract:
Empirical claims about autonomous Kubernetes operations agents are largely unfalsifiable. Published work reports observational results without controlled comparisons against an agent‑disabled baseline, selection bias is endemic, pre‑registered decision matrices are absent, and samples are typically too small for the noise level of the underlying scoring system. The cause is the same gap that limits the agents themselves: code agents have a verification substrate that turns "did it work" into a fast, falsifiable, ground‑truth signal, and operations has nothing equivalent. We present agent‑breakage, a closed‑loop measurement framework that injects faults into a target Kubernetes cluster, observes how an autonomous agent responds, scores the response on four axes against ground truth, and accumulates outcome‑labeled (state, action, outcome) tuples. The framework distinguishes framework error from reasoning error, supports a true off‑condition control via a deterministic‑embedder mechanism, and enforces pre‑registered decision matrices. We use it as a case study to test whether retrieval over past postmortems compounds an agent's capability. The methodological payload is three confounds the substrate caught during that case study, each of which would have produced a wrong published claim on a less instrumented version of the same work: a pgvector index bug, a +19% selection‑bias artifact, and small‑sample estimates that overstated effects by roughly 3x. The retrieval result itself is a partial falsification: 1 of 3 dense‑corpus scenarios significant at p<0.05, pooled effect +3.9 percentage points, not significant at n=60. A within‑scenario corpus‑density sweep at 360 runs shows that mechanistic alignment of near‑neighbors dominates raw count. The framework is released open source.
Authors:Maryia Zhyrko, Daisy Monika Lal, Erik van Mulligen, Lifeng Han
Abstract:
We present DreamerNLplus, a hybrid framework for modeling mental health dynamics from social media timelines in the CLPsych 2026 shared task. Our system addresses three tasks: psychological state modeling, temporal change detection, and sequence‑level summarization. For Task 1, we combine LLM‑based data augmentation, DeBERTa classification, and Random Forest regression for structured state prediction. For Task 2, we use few‑shot prompting with a locally deployed Llama 3.1 model to detect Switch and Escalation events using short‑term temporal context. For Task 3.1, we explore both a deterministic rule‑based summarization pipeline and a few‑shot LLM‑based approach, ranking 2nd officially. Our RAG‑based method achieves strong performance in Task 3.2, ranking 1st for Improvement and 3rd for Deterioration, demonstrating its ability to capture recurrent psychological change patterns across timelines. Our analysis reveals key challenges, including the mismatch between classification and regression performance, the difficulty of modeling temporal transitions, and the disagreement between semantic and similarity‑based evaluation metrics. These findings highlight the complexity of modeling mental health dynamics and motivate future work on unified evaluation frameworks. We share our code and prompts at https://github.com/4dpicture/CLPsych2026
Authors:Simone Antonelli, Sadegh Akhondzadeh, Aleksandar Bojchevski
Abstract:
Test‑Time Training (TTT) is an emerging paradigm that enables models to adapt their parameters during inference, improving performance on tasks such as few‑shot learning, retrieval‑augmented generation, and complex reasoning. However, this dynamic adaptation introduces new vulnerabilities that adversaries can exploit to jailbreak models. We identify three threat models for TTT and demonstrate how attackers can leverage them to bypass safety filters. Our results show that TTT can significantly increase the Attack Success Rate (ASR) and the ASR over 10 generation trials (ASR@10). For example, under LoRA, the few‑shot and generation‑phase threat models achieve an average ASR@10 of 95% and 93% respectively, across models from different families and scales. These vulnerabilities transfer to production fine‑tuning APIs. We also show that TTT‑induced overfitting can produce degenerate outputs that inflate ASR under standard judges, and propose a validity‑aware evaluation to correct for this. Our findings suggest that TTT exposes a new attack surface, strengthens attacks, and undermines existing safety guardrails. As a first step toward defense, we propose a lightweight provider‑side detector that flags TTT requests via the perplexity shift on a private harmful holdout, but robust deployment will ultimately require dynamic alignment.
Authors:Yingjie Lei
Abstract:
Personalized pricing negotiations are a challenging testbed for LLM agents because successful interaction does not guarantee profitable decision making. A seller may produce valid actions and close many deals while still pricing poorly when buyer willingness to pay and bargaining traits remain hidden. This paper presents PrefBench, a simulator‑based benchmark for hidden‑preference personalized pricing negotiations. Each episode pairs a simulated buyer with a fixed vehicle‑customization bundle; the seller observes public persona descriptors, bundle information, and negotiation history, while latent buyer variables govern valuation, patience, counter‑offer behavior, and walkaway decisions. PrefBench evaluates this setting through an LLM‑facing state‑summary protocol that constrains agents to return strict JSON actions under a fixed hidden‑information boundary. We evaluate zero‑shot LLM sellers against heuristic references over 7,500 episodes. The tested LLMs follow the protocol reliably and achieve deal rates above 0.99, but their seller‑profit outcomes remain weak: the best LLM average profit is only slightly above the random baseline and far below a simple concession heuristic under the same episode stream. These results show that structured action compliance and agreement‑seeking behavior can coexist with weak profit‑sensitive bargaining. PrefBench provides a controlled benchmark for evaluating pricing‑agent behavior under hidden buyer preferences.
Authors:Ji-Won Park, Chae Un Kim
Abstract:
In large‑scale AI systems, allocating scarce resources such as GPU compute time and bandwidth among multiple agents is a critical challenge. Conventional policies focus on efficiency metrics, potentially leading to dominance concentration that undermines system diversity and stability. We propose Computable Fair Division (CFD), a framework that reinterprets the Boltzmann‑Softmax function not as a selection tool but as a probabilistic resource allocation mechanism, redefining the inverse temperature parameter β as a computable control variable governing the efficiency‑fairness balance. Static analysis reveals a Pareto frontier with a near‑optimal Stability Corridor where total loss remains approximately constant across policy weights. In the dynamic setting, AHC++ (Adaptive Hard‑Cap Controller++) updates β in real time using the error between observed dominance and a policy‑specified target as feedback. Simulations show that AHC++ suppresses extreme dominance concentration under exogenous shocks while tracking fairness targets without substantial throughput degradation. Scalability analysis confirms that a 100x increase in agents yields only approximately 5.5x increase in execution time. Code: https://github.com/entrofy‑ai/computable‑fairness
Authors:Vishal Rajput
Abstract:
Robustness, domain adaptation, photometric and occlusion invariance, compositional generalisation, temporal robustness, alignment safety, and classical anisotropic regularisation are usually treated as separate problems with separate method families. This paper argues that much of their shared structure is one statistical problem: estimate the covariance of label‑preserving deployment nuisance, then regularise the encoder Jacobian along a matrix whose range covers that covariance (the matching principle). CORAL, adversarial training, IRM, augmentation, metric learning, Jacobian penalties, and alignment‑style constraints are different estimators of that object, not independent robustness tricks. In the linear‑Gaussian model we prove closed‑form optimality (Theorem A), including cube‑root water‑filling within the matched range; necessity of range coverage for quadratic Jacobian penalties (Theorem G); the same range dichotomy at deep global minima; and two falsification controls (Lemma C; Corollaries E), with seven conditional consistency lemmas (D1‑D7) for estimation under standard identifiability assumptions. We introduce the Trajectory Deviation Index (TDI), a label‑free probe of embedding sensitivity when task accuracy or Jacobian Frobenius norm is insufficient. Thirteen pre‑registered blocks from classical ML through Qwen2.5‑7B test the predicted matched, then isotropic, then wrong‑W ordering on geometry and deployment drift; twelve pass, and the sole exception (Office‑31) is an eigengap failure named before the run. At 7B scale, matched style‑PMH improves selective honesty and preserves Style TDI where standard DPO degrades it. The contribution is naming the deployment nuisance covariance, stating what the regulariser must do, and supplying a closed‑form falsifiable theory once that object is identified, not universality on every leaderboard.
Authors:Qianshu Cai, Yonggang Zhang, Xianzhang Jia, Wei Xue, Jun Song, Xinmei Tian, Yike Guo
Abstract:
Autonomous agentic systems are largely static after deployment: they do not learn from user interactions, and recurring failures persist until the next human‑driven update ships a fix. Self‑evolving agents have emerged in response, but all confine evolution to text‑mutable artifacts ‑‑ skill files, prompt configurations, memory schemas, workflow graphs ‑‑ and leave the agent harness untouched. Since routing, hook ordering, state invariants, and dispatch live in code rather than in any text artifact, an entire class of structural failure is physically unreachable from the text layer. We argue that source‑level adaptation is a fundamentally more general medium: it is Turing‑complete, a strict superset of every text‑mutable scope, takes effect deterministically rather than through base‑model compliance, and does not erode under long‑context drift. We present MOSS, a system that performs self‑rewriting at the source level on production agentic substrates. Each evolution is anchored to an automatically curated batch of production‑failure evidence and proceeds through a deterministic multi‑stage pipeline; code modification is delegated to a pluggable external coding‑agent CLI while MOSS retains stage ordering and verdicts. Candidates are verified by replaying the batch against the candidate image in ephemeral trial workers, then promoted via user‑consent‑gated, in‑place container swap with health‑probe‑gated rollback. On OpenClaw, MOSS lifts a four‑task mean grader score from 0.25 to 0.61 in a single cycle without human intervention.
Authors:Ali Hatamizadeh, Yejin Choi, Jan Kautz
Abstract:
Linear attention replaces the unbounded cache of softmax attention with a fixed‑size recurrent state, reducing sequence mixing to linear time and decoding to constant memory. The hard part is not just what to forget, but how to edit this compressed memory without scrambling existing associations. Delta‑rule models subtract the current read before writing a new value, and Kimi Delta Attention (KDA) sharpens forgetting with channel‑wise decay. But the active edit still uses a single scalar gate to control two different things: how much old content to erase on the key side and how much new content to commit on the value side. We introduce Gated DeltaNet‑2, which generalizes both Gated DeltaNet and KDA by inheriting adaptive forgetting and channel‑wise decay while addressing their shared limitation, the scalar tie between erasing and writing. Gated Delta Rule‑2 separates these roles with a channel‑wise erase gate b_t and a channel‑wise write gate w_t, reducing to KDA when both gates collapse to the same scalar and to Gated DeltaNet when the decay also collapses. We derive a fast‑weight update view, a chunkwise WY algorithm with channel‑wise decay absorbed into asymmetric erase factors, and a gate‑aware backward pass that preserves efficient parallel training. At 1.3B parameters trained on 100B FineWeb‑Edu tokens, Gated DeltaNet‑2 achieves the strongest overall results among Mamba‑2, Gated DeltaNet, KDA, and Mamba‑3 variants across language modeling, commonsense reasoning, and retrieval. Its advantage is most pronounced on long‑context RULER needle‑in‑a‑haystack benchmarks, where it improves the evaluated multi‑key retrieval setting and remains strong in both recurrent and hybrid settings. Code is available at https://github.com/NVlabs/GatedDeltaNet‑2.
Authors:Pilchen Hippolyte, Fabre Romain, Signe Talla Franck, Perez Patrick, Grave Edouard
Abstract:
Large language models (LLMs) are typically trained on shuffled corpora, yielding models whose knowledge is frozen at train time and whose temporal grounding remains poorly understood. In this work, we study the impact of pre‑training dynamics on the acquisition of time‑sensitive factual knowledge, focusing specifically on data ordering. Our main contributions are twofold. First, we introduce a comprehensive benchmark of over 7,000 temporally grounded questions and an evaluation protocol that enables analysis of whether models correctly associate facts with their corresponding time periods. Second, we pretrain 6B‑parameter models on temporally ordered Common Crawl snapshots and compare them against standard shuffled pre‑training. Our results show that sequentially trained models match shuffled baselines on general language understanding and common knowledge while consistently exhibiting more up‑to‑date and temporally precise knowledge. Temporally ordered pre‑training yields improved factual freshness, while shuffled pre‑training peaks on older data, possibly due to increased factual repetition. These findings, along with the release of our code at https://github.com/kyutai‑labs/kairos , checkpoints, and datasets at https://huggingface.co/collections/kyutai/kairos provide a foundation for future research on continual learning for LLMs.
Authors:Youssef Allouah, Mahdi Haghifam, Sanmi Koyejo, Reza Shokri
Abstract:
Distillation attacks create a deployment trade‑off for model providers: the same outputs that make a model more useful can also make it easier to imitate. We study this trade‑off through a minimax game between a utility‑constrained teacher and an adaptive student. Our framework yields tractable one‑sided response rules: an adaptive evaluation rule in which the student reweights high‑value examples, and a teacher‑side defense template that suppresses outputs most useful for distillation. From a cheap proxy for example value, we derive Product‑of‑Experts (PoE), a simple forward‑pass‑only defense that combines the teacher with a proxy student during generation. Empirically, adaptive evaluation reveals a large passive‑‑adaptive gap: on state‑of‑the‑art defenses, adaptive students recover substantially more capability than passive evaluation suggests on GSM8K and MATH. Under this stronger evaluation, the apparent robustness gap between expensive defenses and PoE narrows considerably, while PoE remains substantially cheaper and preserves higher‑quality reasoning traces. Overall, our results suggest that strong distillation remains difficult to stop, and that progress on antidistillation should be judged against adaptive students rather than passive ones. Our code is available at: https://github.com/ysfalh/distillation‑game.
Authors:Edwin Jose
Abstract:
Every Python function deployed as an LLM tool must today exist in two forms: an HTTP endpoint for human‑facing clients and CI pipelines, and an MCP tool registration for agent runtimes such as Claude and Cursor. These representations share business logic yet diverge in all the surrounding machinery (routing, validation, serialisation, streaming, and schema maintenance), and they drift apart as the underlying code evolves. We present HarnessAPI, a Python framework that eliminates this duplication by treating a typed skill folder as the single source of truth. From one handler.py plus Pydantic schemas, the framework automatically derives a streaming HTTP endpoint with Server‑Sent Events, an interactive OpenAPI/Swagger UI, and a zero‑configuration MCP tool, all served from a single process. Dual‑mode content negotiation lets the same handler serve SSE‑streaming and JSON‑returning clients with no handler changes. A dynamic code‑generation mechanism ensures Pydantic type annotations propagate correctly to FastMCP's inspection layer, resolving a technical limitation that prevents naive closure‑based registration. Measured across six representative skills using cloc, HarnessAPI reduces framework‑facing boilerplate by 74% compared with a manually maintained dual‑stack implementation (FastAPI server + FastMCP server). HarnessAPI subclasses FastAPI, inheriting its full middleware, dependency‑injection, and deployment ecosystem. It is available at https://github.com/edwinjosechittilappilly/harnessapi and on PyPI (pip install harnessapi)
Authors:Andrii Kryshtal
Abstract:
AI models are already deployed in societies affected by armed conflict, and journalists, humanitarian workers, governments and ordinary citizens rely on them for information or for their work processes. No established practice exists for checking whether their outputs can make those conflicts worse. We tested nine model configurations from four providers (OpenAI, Anthropic, DeepSeek, xAI) on 90 multi‑turn scenarios designed to surface misaligned behaviour in conflict contexts: false equivalence between documented atrocities, denial of genocide, and failure to recognise ethnic slurs, among others. When such outputs feed into journalism, humanitarian reporting, or public debate, they can deepen divisions in fragile societies. Failure rates span 6% to 47% between the best and worst performing models, which makes model choice a safety question in its own right and when users pushed for ``balance'' in cases where international courts have already assigned responsibility, five of nine configurations failed 80 to 100 percent of the time. We release the first evaluation framework for this domain and propose adding it to alignment evaluation portfolios.
Authors:Sid-ali Temkit
Abstract:
Large language models are routinely used as automated evaluators: to review code, moderate content, or score outputs, often with many items passing through one conversation. We ask whether the polarity of prior conversation history biases subsequent judgments, an effect we call the accumulated message effect on LLM judgments (AMEL). Across 75,898 API calls to 11 models from 4 providers (OpenAI, Anthropic, Google, and four open‑source models), we present identical test items in isolation or following histories saturated with predominantly positive or negative evaluations. Models shift toward the conversation's prevailing polarity (d = ‑0.17, p < 10^‑46). The effect concentrates on items where the model is genuinely uncertain at baseline (d = ‑0.34 for high‑entropy items, vs d = ‑0.15 when the baseline is deterministic). Bias does not grow with context length: 5 prior turns and 50 produce the same shift (Spearman |r| < 0.01; OLS slope p = 0.80). And there is a negativity asymmetry: paired per item, negative histories induce 1.62x more bias than positive (t = 13.46, p < 10^‑39, n = 2,481). Scaling helps but does not solve it (Anthropic: Haiku ‑0.22 to Opus ‑0.17; OpenAI: Nano ‑0.34 to GPT‑5.2 ‑0.17). Three follow‑ups narrow the mechanism. The token probability distribution shifts continuously, not at a threshold. The negativity asymmetry has both token‑level and semantic components, though attributing the balance is exploratory at our sample sizes. Position does not matter: five biased turns anywhere in a 50‑turn history produce the same shift. The simplest fix for evaluation pipelines is a fresh context per item; when batching is unavoidable, balancing the history helps.
Authors:Fan Wu, Cheng Chen, Zhenshan Tan, Taiyu Zhang, Xinzhen Xu, Yanyu Qian, Dingcheng Gao, Lanyun Zhu, Qi Zhu, Yi Tan, Deyi Ji, Guosheng Lin, Tianrun Chen, Deheng Ye, Fayao Liu
Abstract:
We present Claw AI Lab, a lab‑native autonomous research platform that advances automated research from a hidden prompt‑to‑paper pipeline into an interactive AI laboratory. Rather than centering the system around a single agent or a fixed serial workflow, we allow users to instantiate a full research team from one prompt, with customizable roles, collaborative workflows, real‑time monitoring, artifact inspection, and rollback/resume control through a unified dashboard. The platform also supports distinct research modes for exploration, multi‑agent discussion, and reproduction, making autonomous research substantially more steerable and laboratory‑like in practice. A key practical contribution of Claw AI Lab lies in its Claw‑Code Harness, which connects local codebases, datasets, and checkpoints to runnable experiments and feeds execution artifacts back into the research loop. As a result, the harness improves not only execution integration, but also experimental completion and result integrity: experiments are easier to inspect, iterate on, and faithfully transfer into final papers, reducing common failure modes such as partial runs and malformed result reporting. In our internal evaluation on five AI research case studies, using AutoResearchClaw as the baseline, Claw AI Lab is consistently preferred by AI expert judges on idea novelty, experiment completeness, and paper presentation quality. We view Claw AI Lab as an early step toward a new paradigm: autonomous research as usable, interactive, and reliability‑aware scientific infrastructure.
Authors:Víctor Yeste, Paolo Rosso
Abstract:
Detecting Schwartz values in political text is difficult because implicit cues often depend on surrounding arguments and fine‑grained distinctions between neighboring values. We study when context and explicit moral knowledge help sentence‑level value detection. Using the ValuesML/Touché ValueEval format, we compare sentence, window, and full‑document inputs; no‑RAG and retrieval‑augmented settings with a curated moral knowledge base; supervised DeBERTa‑v3‑base/large encoders; and zero‑shot LLMs from 12B to 123B parameters. The results show that more context is not uniformly better: full‑document context improves supervised DeBERTa encoders by 3.8‑4.8 macro‑F1 points over sentence‑only input, but does not consistently help zero‑shot LLMs. Retrieved moral knowledge is more consistently useful in matched comparisons, improving each tested model family and context condition under early fusion. However, scaling from DeBERTa‑v3‑base to large and from 12B to larger LLMs does not guarantee gains, and simple early fusion outperforms the tested late‑fusion and cross‑attention RAG variants for encoders. Per‑value analyses show that context and retrieval help most for socially situated or conceptually confusable values. These findings suggest that value‑sensitive NLP should evaluate context, knowledge, and model family jointly rather than treating longer inputs or larger models as universal improvements.
Authors:Jiaxu Wang, Junhao He, Jingkai Sun, Yi Gu, Yunyang Mo, Jiahang Cao, Qiang Zhang, Renjing Xu
Abstract:
Learning real‑world dynamics from visual observations is crucial for various domains. A common strategy is to calibrate simulators by estimating physical parameters, yet accuracy is ultimately bounded by the underlying physical models, which often assume materials are homogeneous and isotropic. Even if reasonable, real‑world objects typically exhibit mild anisotropy and heterogeneity. After the near‑isotropic backbone is well calibrated, these residual effects become the key bottleneck for further closing the real‑to‑sim gap. Although neural networks can fit dynamics end‑to‑end, such black‑box modeling discards strong physical priors, leading to poor data efficiency and overfitting. Therefore, we propose MoSA, a motion‑constrained stress adaptation framework that targets these residual effects to further improve real‑to‑sim dynamics learning. MoSA uses an isotropic model as a physics prior and learns residual stress operators to capture mild anisotropy and heterogeneity. It progressively adapts stresses via microplane‑constrained redistribution in a physics‑informed cascaded network. We further impose motion constraints by supervising temporal and spatial derivatives of the deformation field. Experimentally, our learned dynamics achieves superior accuracy, generalization, and robustness, while learning physically meaningful residual anisotropy. Finally, we validate MoSA in a robot manipulation setting, showing that better real‑to‑sim dynamics modeling translates into more reliable sim‑to‑real transfer. Project Page is available at https://mercerai.github.io/MoSA/.
Authors:Junhyeong Cho, Ruojin Cai, Hadar Averbuch-Elor
Abstract:
Many public buildings provide floorplans with a "you are here" indicator to help visitors orient themselves. Floorplan localization seeks to computationally replicate this capability by determining where visual observations were captured within a floorplan. However, existing methods typically assume controlled small‑scale environments and precise vectorized floorplans, limiting their ability to operate in large‑scale buildings and rasterized floorplans. In this work, we present an approach for performing floorplan localization in the wild by grounding the task in a reconstructed 3D representation of the scene. Given an unconstrained image collection, our method reconstructs a gravity‑aligned 3D scene and projects it into a 2D density map that serves as a floorplan proxy. Floorplan localization is then formulated as aligning this proxy with the input floorplan via a 2D similarity transform. To bridge the appearance gap between density maps and architectural floorplans, we adapt a 2D foundation model to learn cross‑modal correspondences, introducing a fine‑tuning scheme that encourages semantically aligned matches while preserving structural consistency. Extensive experiments demonstrate substantial improvements over prior methods, including in extremely sparse settings with as little as a single input image. Our code and data will be publicly available.
Authors:Jinho Park, Youbin Kim, Hogun Park, Eunbyung Park
Abstract:
Spatio‑temporal reasoning is a core capability for Multimodal Large Language Models (MLLMs) operating in the real world. As such, evaluating it precisely has become an essential challenge. However, existing spatio‑temporal reasoning benchmark datasets primarily rely on static image sets or passively curated video data, which limits the evaluation of fine‑grained reasoning capabilities. In this paper, we introduce VGenST‑Bench, a video benchmark that employs generative models to actively synthesize highly controlled and diverse evaluation scenarios. To construct VGenST‑Bench, we propose a multi‑agent pipeline incorporating a human quality control stage, ensuring the quality of all generated videos and QA pairs. We establish a comprehensive 3x2x2 video taxonomy, encompassing Spatial Scale, Perspective, and Scene Dynamics to span diverse scenarios. Furthermore, we design a hierarchical task suite that decouples low‑level visual perception from high‑level spatio‑temporal reasoning. By shifting the paradigm from passive curation to active synthesis, VGenST‑Bench enables fine‑grained diagnosis of spatio‑temporal understanding in MLLMs.
Authors:Zhaoyang Chu, Jiarui Hu, Xingyu Jiang, Pengyu Zou, Han Li, Chao Peng, Peter O'Hearn, Earl T. Barr, Mark Harman, Federica Sarro, He Ye
Abstract:
We introduce TerminalWorld, a scalable data engine that automatically reverse‑engineers high‑fidelity evaluation tasks from "in‑the‑wild" terminal recordings. Processing 80,870 terminal recordings, the engine yields a full benchmark of 1,530 validated tasks, spanning 18 real‑world categories, ranging from short everyday operations to workflows exceeding 50 steps, and covering 1,280 unique commands. From these, we curate a Verified subset of 200 representative, manually reviewed tasks. Comprehensive benchmarking on TerminalWorld‑Verified across eight frontier models and six agents reveals that current systems still struggle with authentic terminal workflows, achieving a maximum pass rate of only 62.5%. Moreover, TerminalWorld captures real‑world terminal capabilities distinct from existing expert‑curated benchmarks (e.g., Terminal‑Bench), with only a weak correlation to their scores (Pearson r=0.20). The automated engine makes TerminalWorld authentic and scalable by construction, enabling it to evaluate agents in real‑world terminal environments as developer practices evolve. Data and code are available at https://github.com/EuniAI/TerminalWorld.
Authors:Kai Tzu-iunn Ong, Minseok Kang, Dongwook Choi, Junhee Cho, Seungju Kim, Seungwon Lim, Geunha Jang, Minwoo Oh, Bogyung Jeong, Sunghwan Kim, Taeyoon Kwon, Jinyoung Yeo
Abstract:
Harness optimization enables automated agent creation by having an optimizer agent iteratively update the harness of target agents. Despite its success, current studies evaluate optimizers solely by observing target agents' performance gains. This indirect end‑improvement evaluation neglects optimizers' actions at intermediate steps, which are often erroneous and hinder agent performance. Therefore, it is unclear whether harness optimization is driven by optimizers' informed update actions or simply trial‑and‑error. This necessitates direct evaluation of harness optimizers. However, evaluating harness optimizers directly is non‑trivial and costly due to the lack of oracle harnesses. To address this, we present a simple, low‑cost design to directly evaluate them, namely priority ranking. By asking harness optimizers to rank components (e.g., tools) in a given harness by their potential to improve/hinder agent performance when updated, our design quantifies optimizer ability at the step level without expensive rollouts or manual examination. More importantly, optimizers' ranking performance correlates with their ability to improve agents in actual multi‑step harness optimization, establishing priority ranking as a reliable predictor of optimization ability. Priority ranking is enabled by Shor, a collection of 182 human‑verified optimization scenarios spanning across domains, designs, and time stages. Codes and data can be found at https://github.com/k59118/Harness_Optimizer_Evaluation.
Authors:Lucas Sheneman
Abstract:
Scientific machine learning often requires combining known physics with unknown parameters or correction terms learned from data. Existing approaches either ignore known structure, encode it as a soft penalty, or require hand‑written PyTorch code for each equation. We present The Neural Compiler, a system that translates programs written in a first‑order Scheme‑like expression language into frozen, differentiable PyTorch modules. These modules match the source program to floating‑point precision and provide gradients through autograd. In hybrid models, the compiled module encodes known physics exactly while learned components model the unknown remainder. We evaluate the compiler across six experiment domains: Feynman physics equations, Lotka‑Volterra dynamics, a damped pendulum, a one‑dimensional heat equation, three‑dimensional vector mechanics, and compositional generalization. Compiled modules match hand‑coded PyTorch implementations numerically for single equations, showing no accuracy loss from compilation. With only 1 to 4 trainable parameters, compiled models recover physical constants to less than 1 percent error in most cases, while standard PINN baselines with more than 8500 parameters show 7 to 93 percent error. Compiled modules also compose with zero error, while neural approximations can accumulate large errors in deep composition chains. The main value of the compiler is not improved accuracy over hand‑coded equations, but systematic composability: it generates correct, differentiable modules from symbolic specifications without rewriting each equation by hand. The system supports 51 primitive operations, including vector and matrix algebra, enabling PDE discretizations and hybrid scientific models. This string‑in, module‑out interface also provides a natural target for large language models that translate scientific descriptions into executable differentiable modules.
Authors:Laziz Hamdi, Amine Tamasna, Pascal Boisson, Thierry Paquet
Abstract:
Table structure recognition (TSR) requires both table‑level coherence (row/column counts, headers, spanning cells) and precise separator localization. We introduce FastTab, a grid‑centric TSR model that avoids autoregressive HTML decoding by combining (i) a lightweight Tiny Recursive Module (TRM) for global reasoning and (ii) axial 1D Transformer encoders that capture long‑range dependencies along rows and columns. The model predicts row/column counts, header rows, and separators to construct a grid, then infers rowspan/colspan using ROI‑aligned cell features. Across four benchmarks (PubTabNet, FinTabNet, PubTables‑1M, and SciTSR), FastTab achieves competitive structure recovery performance while operating at low‑latency inference. We further study robustness under pixel‑level anonymisation and show an extension to curved separators for camera‑captured documents. The source code will be made publicly available at https://github.com/hamdilaziz/FastTab .
Authors:Yifan Bai, Xiaoyang Liu, Zihao Mou, Guihong Wang, Jian Yu, Shuhan Xie, Yantao Li, Yangyu Zhang, Jingwei Liang, Tao Luo
Abstract:
As large language models (LLMs) are increasingly deployed for software engineering, constructing high‑quality benchmarks is crucial for evaluating not just the functional correctness, but also the formal verifiability of generated code. However, existing benchmarks are limited by the quantity and quality of positive and negative test cases, leading to an overestimation of model capabilities in generating specifications and implementations. To address this, we propose VeriScale, a novel framework driven by the adversarial implementations. It consists of two stages: test‑suite expansion to construct diverse and challenging test cases, and test‑suite reduction to distill them into compact yet discriminative suites. While VeriScale is general, we instantiate it on Verina to construct VerinaPlus, which expands the original test suites by over 83×, and VerinaLite, a lightweight 14× variant. Our experiments across eight state‑of‑the‑art LLMs demonstrate that VerinaPlus exposes substantial model weaknesses hidden by the original benchmark, evidenced by sharp score drops on both SpecGen and CodeGen tasks, whereas VerinaLite maintains this discriminative power at a fraction of the evaluation cost. The enhanced benchmarks and source code are publicly available at https://github.com/XiaoyangLiu‑sjtu/VeriScale.
Authors:Hanyu Guo, Jiedong Yang, Chao Chen, Longfei Xu, Kaikui Liu, Xiangxiang Chu
Abstract:
Public transit route planning traditionally depends on structured map infrastructure and complex routing engines, and no existing dataset supports training models to bypass this dependency. We present TransitLM, a large‑scale dataset of over 13 million transit route planning records from four Chinese cities covering 120,845 stations and 13,666 lines, released as a continual pre‑training corpus and benchmark data for three evaluation tasks with complementary metrics. Experiments show that an LLM trained on TransitLM produces structurally valid routes at high accuracy and implicitly grounds arbitrary GPS coordinates to appropriate stations without any explicit mapping. These results demonstrate that transit route planning can be learned entirely from data, enabling end‑to‑end, map‑free route generation directly from origin‑destination information. The dataset and benchmark are available at https://huggingface.co/datasets/GD‑ML/TransitLM, with evaluation code at https://github.com/HotTricker/TransitLM.
Authors:Chengcheng Wang, Qinhua Xie, Wei He, Jianyuan Guo, Shiqi Wang, Chang Xu
Abstract:
Autonomous research systems increasingly make the scientific workflow executable: agents can propose ideas, run code, inspect results, and draft papers. But executable workflows do not by themselves produce research judgment. We analyze where current systems lose trial experience: weak evidence becomes prose, pilot signals become broad claims, memory remains textual, and recurring process failures do not change later behavior. We introduce Sibyl‑AutoResearch, a self‑evolving AutoResearch framework built around Scientific Trial‑and‑Error Harnesses. A harness lets agents run bounded trials, preserve positive and negative outcomes, and route lessons into later planning, validation, claim scope, scheduling, critique, writing, and harness repair. We formalize this through two auditable conversion units: trial‑to‑behavior conversion, which links trial signals to later research actions, and trial‑to‑harness‑behavior conversion, which links recurring process failures to system updates. We implement the framework in SIBYL, a file‑backed autonomous research system that exposes the state, roles, memory, gates, and artifact traces needed to inspect these conversion paths. A retrospective audit identifies eight high‑confidence conversion events, with a median latency of one iteration and a maximum latency of three iterations. A recovered‑failure registry further shows how five naturally occurring failure classes, including duplicate results, stale numbers, and unsupported statistics, were blocked, downgraded, or routed into later repair. These traces do not establish a comparative performance claim; they show that the proposed conversion units are recoverable from realistic autonomous‑research workspaces. The SIBYL framework and system are available at https://github.com/Sibyl‑Research‑Team/AutoResearch‑SibylSystem.
Authors:Santiago Ospitia, John Sanabria, John Garcia-Henao
Abstract:
Despite strong predictive results in the clinical machine learning literature, the translation of these models into bedside use remains limited by systems‑level barriers: heterogeneous data representations, the absence of standardized deployment workflows, and a mismatch between research prototypes and the concurrency and latency requirements of hospital environments. We present the SepsisAI‑Orchestrator, an open‑source modular platform that addresses this deployment gap for early sepsis detection. The platform integrates HL7 FHIR‑inspired Clinical Document Architecture (CDA) preprocessing, NoSQL storage, a containerized LightGBM classifier served via REST APIs, and a Streamlit clinical dashboard, orchestrated with Docker and Kubernetes. A previously validated LightGBM model (F1 0.87‑0.94 on PhysioNet 2019) is reused without modification; the contribution lies in the surrounding infrastructure and its empirical characterization under load. Using k6 with 50‑1000 concurrent virtual users, we find that replica count must be matched to the physical CPU thread count of the host: scaling from 3 to 12 replicas on a 12‑thread CPU reduces p95 latency from 3.3s to 1.41s (57.3% reduction) and eliminates all request failures, while over‑provisioning to 24 or 48 replicas degrades performance due to scheduler contention. To our knowledge this U‑shaped scaling behavior has not been quantified previously for clinical AI inference workloads. We do not claim prospective clinical validation. Source code and deployment manifests are available at https://github.com/nucleusai/sepsisai‑orchestrator.
Authors:Jianan Ma, Xiaohu Du, Ruixiao Lin, Yaoxiang Bian, Jialuo Chen, Jingyi Wang, Xiaofang Yang, Shiwen Cui, Changhua Meng, Xinhao Deng, Zhen Wang
Abstract:
As autonomous agents (e.g., OpenClaw) increasingly operate with deep system‑level privileges to execute complex tasks, they introduce severe, unmitigated security risks. Current vulnerability analyses overwhelmingly focus on single‑turn, stateless behaviors, overlooking the expanded attack surface inherent in stateful, multi‑turn interactions and dynamic tool invocations. In this paper, we propose a novel, multi‑dimensional evasion framework targeting LLM‑based agent systems. We introduce three stealthy attack vectors: (1) Temporal evasion, which fragments malicious payloads across sequential interaction turns; (2) Spatial evasion, which conceals payloads within complex external artifacts that evade standard LLM parsing mechanisms; and (3) Semantic evasion, which obscures malicious intents beneath benign contextual noise. To systematically quantify these threats, we construct A3S‑Bench, a comprehensive benchmark comprising 2,254 real‑world agent execution trajectories. Evaluating a standard agent framework separately integrated with 10 mainstream LLM backbones against 20 practical threat scenarios, we demonstrate that our evasion framework elevates the average risk trigger rate from a 28.3% baseline to 52.6%. These findings reveal systemic, architecture‑level vulnerabilities in current autonomous agent systems that existing defenses fail to address, highlighting an urgent need for defense mechanisms tailored to the unique threats.
Authors:Di He, Songjun Tu, Keyu Wang, Lu Yin, Shiwei Liu
Abstract:
Learning rate configuration is a fundamental aspect of modern deep learning. The prevailing practice of applying a uniform learning rate across all layers overlooks the structural heterogeneity of Transformers, potentially limiting their effectiveness as the backbone of Large Language Models (LLMs). In this paper, we introduce Layerwise Learning Rate (LLR), an adaptive scheme that assigns distinct learning rates to individual Transformer layers. Our method is grounded in Heavy‑Tailed Self‑Regularization (HT‑SR) theory, which characterizes the empirical spectral density (ESD) of weight correlation matrices to quantify heavy‑tailedness. Layers with weaker heavy‑tailedness are assigned larger learning rates to accelerate their training, while layers with stronger heavy‑tailedness receive smaller learning rates. By tailoring learning rates in this manner, LLR promotes balanced training across layers, leading to faster convergence and improved generalization. Extensive experiments across architectures (from LLaMA to GPT‑nano), optimizers (AdamW and Muon), and parameter scales (60M‑1B) demonstrate that LLR achieves up to 1.5x training speedup and outperforms baselines, notably raising average zero‑shot accuracy from 47.09% to 49.02%. A key advantage of LLR is its low tuning overhead: it transfers nearly optimal LR settings directly from the uniform baseline. Code is available at https://github.com/hed‑ucas/Layer‑wise‑Learning‑Rate.
Authors:Junbin Xiao, Jiajun Chen, Tianxiang Sun, Xun Yang, Angela Yao
Abstract:
Long streaming video QA remains challenging due to growing visual tokens and limited reasoning length of large language models (LLMs). KV‑caching stores the Key‑Value (KV) of the historical tokens via LLM prefill and enables more efficient streaming QA. However, existing methods cache every one or two frames, causing redundant memory usage and losing fine‑grained spatial details within frame or temporal contexts across frames. This paper proposes MuKV, a method that features a multi‑grained KV cache compression module and a semi‑hierarchical retrieval approach to improve both efficiency and accuracy for long streaming VideoQA. For the offline KV cache, MuKV extracts visual representations at patch‑, frame‑, and segment‑levels. The multiple levels of granularity preserve both local cues and global temporal context, while maintaining efficiency with a dual signal token compression mechanism guided by self‑attention and frequency. For online QA, MuKV designs a semi‑hierarchical retrieval method to retrieve relevant KV caches for answer generation. Experiments on long‑streaming VideoQA benchmarks show that MuKV significantly improves answer accuracy, without sacrificing memory and online QA efficiency. Moreover, our compression mechanism alone brings consistent benefits across answer accuracy, memory, and QA efficiency over baselines, showcasing highly effective contribution.
Authors:H. C. Ekne
Abstract:
Static benchmarks capture only part of how large language models behave in practice. Real systems place models inside repeated loops with time limits, formatting constraints, and failure modes. We study this setting in a timed multi‑phase Risk environment with explicit victory targets and repeated planning and execution cycles. In a replicated 32‑game cross‑provider championship under frozen rules, gemini‑3.1‑pro‑preview won 20 of 32 games against gpt‑5.1, claude‑opus‑4‑7, and kimi‑k2.6, and the pooled winner distribution differs strongly from an equal‑strength null (p approx 1.5 x 10^‑5). We then separate planning from execution by standardizing execution on a cheaper Gemini Flash scaffold. Under this design, a pooled 32‑game planner bakeoff is consistent with near‑equality (p approx 0.821), which indicates that much of the earlier provider spread came from end‑to‑end system behavior rather than planning alone. To study mechanism, we analyze saved planning and execution traces from the provider championship. Gemini refers to the terminal objective far more often than the other models and increases that focus as victory approaches. Gemini also converts more turns into deep conquest chains, even though it is not the cleanest runtime. These results show that live‑agent performance depends on objective tracking, execution conversion, cost, and runtime reliability, and they support evaluating LLMs as components in bounded workflows rather than as isolated benchmark respondents.
Authors:Weilong Guo, Yuchen Wang, Renping Zhou, Yunfeng Zhang, Rui Fang, Yue Meng, Wenda Xu, Yuan He, Gao Huang
Abstract:
Vision‑Language‑Action (VLA) models have emerged as a promising paradigm for generalist robotic manipulation. A common design in current architectures maps language instructions and visual observations to actions in a single forward pass. While conceptually simple, this formulation entangles instruction comprehension, spatial scene understanding, and motor control within a single learning objective. As a result, the action expert must implicitly relearn cognitive and perceptual capabilities already present in the pretrained VLM, which can limit both learning efficiency and generalization. We introduce AVP (Action with Visual Primitives), an end‑to‑end architecture that implements this visual‑primitive‑centric interface: the VLM infers the next‑stage target and emits visual‑primitive tokens that condition a flow‑matching action expert, with supervision derived from end‑effector kinematics. Real‑robot experiments on general pick‑and‑place tasks show that AVP improves the success rate by 27.61% over pi_0.5 and outperforms other recent methods, with consistent gains in data efficiency, spatial‑compositional generalization, and object‑level transfer.
Authors:Bingjun Luo, Tony Wang, Chaoqi Chen, Xinpeng Ding
Abstract:
Multimodal Large Language Models (MLLMs) face significant computational overhead when processing long videos due to the massive number of visual tokens required. To improve efficiency, existing methods primarily reduce redundancy by pruning or merging tokens based on importance or similarity. However, these approaches largely overlook a critical dimension of video content, i.e., changes and turning points, and they lack a collaborative model for spatio‑temporal relationships. To address this, we propose a new perspective: similarity is for identifying redundancy, while difference is for capturing key events. Based on this, we designed a training‑free framework named ST‑SimDiff. We first construct a spatio‑temporal graph from the visual tokens to uniformly model their complex associations. Subsequently, we employ a parallel dual‑selection strategy: 1) similarity‑based selection uses community detection to retain representative tokens, compressing static information; 2) temporal difference‑based selection precisely locates content‑changing points to preserve tokens that capture key dynamic shifts. This allows it to preserve both static and dynamic content with a minimal number of tokens. Extensive experiments show our method significantly outperforms state‑of‑the‑art approaches while substantially reducing computational costs. Our code is available in https://github.com/bingjunluo/ST‑SimDiff.
Authors:Mingkai Deng, Jinyu Hou, Lara Sá Neves, Varad Pimpalkhute, Taylor W. Killian, Zhengzhong Liu, Eric P. Xing
Abstract:
How should an agent decide when and how to plan? A dominant approach builds agents as reactive policies with adaptive computation (e.g., chain‑of‑thought), trained end‑to‑end expecting planning to emerge implicitly. Without control over the presence, structure, or horizon of planning, these systems dramatically increase reasoning length, yielding inefficient token use without reliable accuracy gains. We argue efficient agentic reasoning benefits from decomposing decision‑making into three systems: simulative reasoning (System II) grounding deliberation in future‑state prediction via a world model; self‑regulation (System III) deciding when and how deeply to plan via a learned configurator; and reactive execution (System I) handling fine‑grained action. Simulative reasoning provides unified planning across diverse tasks without per‑domain engineering, while self‑regulation ensures the planner is invoked only when needed. To test this, we develop SR^2AM (Self‑Regulated Simulative Reasoning Agentic LLM), realizing both as distinct stages within an LLM's chain‑of‑thought, with the LLM as world model. We explore two instantiations: recording decisions from a prompted multi‑module system (v0.1) and reconstructing structured plans from traces of pretrained reasoning LLMs (v1.0), trained via supervised then reinforcement learning (RL). Across math, science, tabular analysis, and web information seeking, v0.1‑8B and v1.0‑30B achieve Pass@1 competitive with 120‑355B and 685B‑1T parameter systems respectively, while v1.0‑30B uses 25.8‑95.3% fewer reasoning tokens than comparable agentic LLMs. RL increases average planning horizon by 22.8% while planning frequency grows only 2.0%, showing it learns to plan further ahead rather than more often. More broadly, learned self‑regulation instantiates a principle we expect to extend beyond planning to how agents govern their own learning and adaptation.
Authors:Bingjun Luo, Tony Wang, Hanqi Chen, Xinpeng Ding
Abstract:
Recent advances in Multimodal Large Language Models (MLLMs) have significantly advanced video understanding tasks, yet challenges remain in efficiently compressing visual tokens while preserving spatiotemporal interactions. Existing methods, such as LLaVA family, utilize simplistic pooling or interpolation techniques that overlook the intricate dynamics of visual tokens. To bridge this gap, we propose ST‑GridPool, a novel training‑free visual token enhancement method designed specifically for Video LLMs. Our approach integrates Pyramid Temporal Gridding (PTG), which captures multi‑grained spatiotemporal interactions through hierarchical temporal gridding, and Norm‑based Spatial Pooling (NSP), which preserves high‑information visual regions by leveraging the correlation between token norms and semantic richness. Extensive experiments on various benchmarks demonstrate that ST‑GridPool consistently enhances performance of Video LLMs without requiring costly retraining. Our method offers an efficient and plug‑and‑play solution for improving visual token representations. Our code is available in https://github.com/bingjunluo/ST‑GridPool.
Authors:Anthony Song, Boyan Zhou, Mayank Golhar, Marisa Morakis, Alex Baras, Nicholas Durr
Abstract:
Three‑dimensional (3D) histopathology of unprocessed tissues has the potential to transform disease management by enabling volumetric characterization of tissue microarchitecture and in‑vivo assessment. Back‑illumination Interference Tomography (BIT) is a new phase microscopy technology that provides rapid, non‑destructive volumetric imaging of unprocessed tissues. However, translating BIT volumes into clinically interpretable H&E images remains challenging, particularly due to shift‑variant contrast and the absence of quantitative validation benchmarks. We introduce HistoBIT3D, the first voxel‑wise paired BIT and fluorescence‑labeled nuclei dataset, enabling quantitative evaluation of structural preservation in unsupervised virtual staining against ground‑truth nuclear distributions. Using this dataset, we present a novel virtual staining framework that translates BIT volumes with shift‑variant contrast into realistic H&E volumes by leveraging bidirectional multiscale content consistency and cross‑domain style reuse to enhance structural fidelity and perceptual realism. Our method achieves state‑of‑the‑art realism metrics while significantly improving 3D nuclei segmentation accuracy and boundary preservation under zero‑shot Cellpose evaluation. Together, these contributions establish a quantitatively validated, structurally faithful, and scalable pipeline for 3D virtual H&E staining, advancing the paradigm of slide‑free, volumetric computational histopathology. Our data and code are available at: https://github.com/aasong113/HistoBIT3D_VirtualStaining.
Authors:Dazhao Du, Jian Liu, Jialong Qin, Tao Han, Bohai Gu, Fangqi Zhu, Yujia Zhang, Eric Liu, Xi Chen, Song Guo
Abstract:
Video large language models (Video LLMs) achieve strong benchmark accuracy, yet often answer video questions through shortcuts such as single‑frame cues and language priors rather than by tracking spatiotemporal dynamics. This issue is exacerbated in RL post‑training, where correctness‑only rewards can further reinforce shortcut policies that obtain high reward without tracking video dynamics. We address this by asking a controlled counterfactual question: if the visual world changed while the question remained fixed, should the answer change or stay the same? Based on this view, we propose Counterfactual Relational Policy Optimization (CRPO), a dual‑branch RL framework for improving \emphspatiotemporal sensitivity. CRPO constructs counterfactual videos through horizontal flips and temporal reversals, trains on both original and counterfactual branches, and introduces a Counterfactual Relation Reward (CRR) between their answers. CRR encourages answers to change for dynamic questions and remain unchanged for static questions. This cross‑branch constraint makes it difficult for shortcut policies to be consistently rewarded across both branches. To evaluate this property, we introduce DyBench, a paired counterfactual video benchmark with 3,014 videos covering reversible dynamics, moving direction, and event sequence, together with a strict pair‑accuracy metric that prevents fixed‑answer shortcuts from inflating scores. Experiments show that CRPO outperforms prior RL methods on spatiotemporal‑sensitive evaluations while maintaining competitive general video performance. On Qwen3‑VL‑8B, CRPO improves DyBench P‑Acc by +7.7 and TimeBlind I‑Acc by +8.2 over the base model, indicating improved spatiotemporal sensitivity rather than stronger reliance on static shortcuts. The project website can be found at https://ddz16.github.io/crpo.github.io/ .
Authors:Dazhao Du, Liao Duan, Jian Liu, Tao Han, Yujia Zhang, Eric Liu, Xi Chen, Song Guo
Abstract:
Video temporal grounding (VTG), which localizes the start and end times of a queried event in an untrimmed video, is a key test of whether multimodal large language models (MLLMs) understand not only what happens but also when it happens. Although modern MLLMs describe video content fluently, their timestamp predictions remain unreliable, while existing remedies either require costly post‑training on temporal annotations or rely on coarse training‑free heuristics. In this work, we probe the cross‑modal attention of MLLMs and uncover a perception‑generation gap. Our key finding is that MLLMs often know the target interval during prefill, but lose this signal when generating the final answer. In the prefill stage, a sparse set of attention heads, which we call \emphTemporal Grounding Heads (TG‑Heads), concentrates query‑to‑video attention on the ground‑truth interval. During autoregressive decoding, however, the answer tokens shift attention away from this interval toward visually salient but query‑irrelevant segments. This observation motivates an inference‑time read‑then‑regenerate framework. We first convert TG‑Head prefill attention into a debiased frame‑level relevance signal and extract the high‑attention interval it highlights. We then re‑invoke the MLLM with visual context restricted to this interval, using video cropping or attention masking to suppress distractors. Without parameter updates and architectural changes, our framework consistently improves MiMo‑VL‑7B, Qwen3‑VL‑8B, and TimeLens‑8B on three VTG benchmarks, with gains of up to +3.5 mIoU. The project website can be found at https://ddz16.github.io/mllmsknowwhen.github.io/.
Authors:Yuting He, Chenyu You, Shuo Li
Abstract:
Multi‑modality medical vision (MV) foundation models (FM) are fundamentally challenged by pronounced Non‑IID feature statistics across heterogeneous imaging modalities. Monolithic self‑supervised optimization on such data induces conflicting gradients, driving representations to collapse toward modality‑dominant shortcuts. This work reframes this failure as an imbalance between specialization and coordination in emergent modularity, and proposes Director‑Experts (DEX), a modular network that explicitly regulates these dynamics in stacked modules. Each DEX module comprises a pool of experts, dynamically adapted by our image‑wise activation strategy, autonomously specializing in modality‑dominant statistics, together with a director, updated via our group exponential moving average, which distills multi‑expert knowledge into a shared space for semantic integration across modalities, thus driving the emergence of modular representations. We curate a new benchmark, Medical Vision Universe, over 4 million images across 10 modalities, which provides a FM‑level pre‑training with the broadest coverage of distinct imaging modalities to our DEX. Extensive evaluations on 26 downstream tasks demonstrate improved optimization behavior and transferability, indicating DEX as a principled step toward general‑purpose multi‑modality medical AI. Our code and dataset will be opened at https://github.com/YutingHe‑list/DEX.
Authors:Yifan Lan, Yuanpu Cao, Hanyu Wang, Lu Lin, Jinghui Chen
Abstract:
Large language models (LLMs) have demonstrated impressive reasoning abilities across a wide range of tasks, but data contamination undermines the objective evaluation of these capabilities. This problem is further exacerbated by malicious model publishers who use evasive, or indirect, contamination strategies, such as paraphrasing benchmark data to evade existing detection methods and artificially boost leaderboard performance. Current approaches struggle to reliably detect such stealthy contamination. In this work, we uncover a critical phenomenon: a model's generated reasoning steps actively mask its underlying memorization. Inspired by this, we propose the Zero‑CoT Probe (ZCP), a novel black‑box detection method that deliberately truncates the entire Chain‑of‑Thought (CoT) process to expose latent shortcut mappings. To further isolate memorization from the model's intrinsic problem‑solving capabilities, ZCP compares the model's zero‑CoT performance on the original benchmark against an isomorphically perturbed reference dataset. Furthermore, we introduce Contamination Confidence, a metric that quantifies both the likelihood and severity of contamination, moving beyond simple binary classifications. Extensive experiments on both previously identified contaminated models and specially fine‑tuned contaminated models demonstrate that ZCP robustly detects both direct and evasive data contamination. The code for ZCP is accessible at https://github.com/Yifan‑Lan/zero‑cot‑probe.
Authors:Zhi Liu
Abstract:
Vision‑Language‑Action (VLA) models have rapidly converged on a small set of architectural patterns: discrete‑token autoregression (e.g. OpenVLA) and continuous‑action flow‑matching (e.g. pi‑0.5). Yet preference alignment via Direct Preference Optimisation (DPO) ‑‑ the de‑facto post‑training step in language models ‑‑ has been studied almost exclusively on autoregressive VLAs. We present CrossVLA, an empirical study of cross‑paradigm VLA post‑training. Three contributions: (i) a surrogate flow‑matching log‑probability estimator that lets DPO operate on continuous‑action backbones without probability‑flow ODE integration; (ii) a head‑to‑head comparison of LoRA and DoRA as the parameter‑efficient layer for VLA DPO, finding DoRA improves over OpenVLA SFT by a mean +10.4 pp across LIBERO 4‑suite (600 trials, 3 seeds) ‑‑ per‑suite +20.0 Object, +11.0 Long‑horizon, +8.0 Goal, +2.7 Spatial ‑‑ with zero seed variance on Object (38/50 on each of 3 seeds); (iii) an inference‑time anatomy showing the denoise loop dominates 78.6% of sample_actions latency and prefix‑K/V caching a la VLA‑Cache caps at a 21% acceleration ceiling ‑‑ both chunk‑level and token‑level cache strategies degrade success rate to 0‑80% in our benchmarks. We further pretrain a multi‑view + temporal projection head on 6000 LIBERO frames, achieving 99.5% k‑NN recall@1 for same‑task retrieval (36x over random), available as a downstream initialisation. All code, ckpts, training logs, and reproduction scripts are open at https://github.com/lz‑googlefycy/vla‑lab.
Authors:Xiaofeng Liu, Qianru Zhang, Thibault Marin, Menghua Xia, Chi Liu, Georges El Fakhri, Jinsong Ouyang
Abstract:
The synergistic interpretation of anatomical information from computed tomography (CT) and metabolic information from positron emission tomography (PET) is important to oncologic imaging. However, existing deep learning methods for PET/CT remain largely task‑specific, are often trained on single‑center cohorts, or adopt dual‑branch fusion schemes that delay cross‑modal interaction and underutilize early spatial correspondence between PET and CT. To address these limitations, we present an open‑source, multi‑center, whole‑body FDG PET/CT foundation model utilizing 4,997 harmonized scans from four public datasets. Our framework employs hierarchical UNet‑shaped backbones with early channel‑wise concatenation, enabling anatomical and metabolic features to interact from the first embedding layer onward. We further introduce a masked autoencoding objective based on zero‑mean imputation, combined with a weighted global reconstruction loss. This design avoids non‑physical intensity discontinuities at masked‑region boundaries that arise from learnable mask tokens. On downstream AutoPET lesion segmentation, the proposed models demonstrate strong label efficiency: with only 10% of the labeled training data, they achieve performance comparable to models trained from scratch on the full dataset. Under extreme 5‑shot linear probing, joint PET/CT pretraining also achieves higher Dice scores than separated‑modality pretraining. This multi‑center foundation model demonstrates label efficiency and cross‑modality representation learning for PET/CT tumor segmentation. It provides a robust, open‑source basis for advancing automated oncologic imaging, significantly reducing the need for large‑scale manual annotations in clinical practice.
Authors:Aaron Wang, Zihan Zhao, Alan Xia, Chang Sun, Abhijith Gandrakota, Jennifer Ngadiuba, Richard Cavanaugh, Javier Duarte
Abstract:
Real‑time jet tagging is critical for identifying short‑lived particle decays in the high‑throughput detectors of the Large Hadron Collider, where real‑time trigger systems responsible for deciding which collision events to store impose strict latency and accuracy constraints. While transformer architectures achieve the highest jet tagging accuracy when compute is unconstrained, their quadratic self‑attention cost makes inference restrictive on trigger budget. Existing efficient variants reduce the computational cost, but hinder the classification performance. To address this limitation, we introduce the Patch Hierarchical Attention Transformer (PHAT‑JeT), which combines two mechanisms: a physics‑inspired geometric message‑passing module that encodes local detector‑plane structure, and a hierarchical patch‑based attention scheme that computes exact attention within small particle groups while preserving global context through lightweight patch‑token communication. Within a restricted budget, PHAT‑JeT achieves state‑of‑the‑art accuracy and background rejection among all resource‑constrained jet tagging models on four benchmarks (\textschls4ml, JetClass, Top Tagging, and Quark‑‑Gluon). Our code is available at https://github.com/aaronw5/PHAT‑JeT.
Authors:Jinghang Li, Tales Santini, Courtney Clark, Bruno de Almeida, Cong Chu, Salem Alkhateeb, Andrea Sajewski, Jacob Berardinelli, Hecheng Jin, Tobias Campos, Jeremy J. Berardo, Joseph Mettenburg, Ariel Gildengers, Howard J. Aizenstein, Minjie Wu, Tamer S. Ibrahim
Abstract:
Hippocampal subfield segmentation requires high‑resolution T2w turbo spin echo (TSE) MRI, yet this sequence is susceptible to motion artifacts, leading to substantial data loss. We developed a conditional generative model (MRecover) that synthesizes routinely acquired T1w images to create TSE images with autoregressive slice conditioning for volumetric consistency. Trained on 7T MRI data (n=577), the model achieved high in‑domain fidelity (n=148, SSIM=0.84, FSIM=0.94) and generalized well to out‑of‑domain 3T data: subfield volumes from synthesized and the as‑acquired images closely matched: (n=416, r=0.87‑0.97) and yielded 31.8% more analyzable subjects in the motion‑affected ADNI3 dataset after quality control (593 vs 450). The synthesized images also achieved larger effect sizes due to increasing the sample size for diagnostic group differences in hippocampal subfield atrophy (whole hippocampus ε^2= 0.121‑0.100 vs. 0.086‑0.062, left‑right hemispheres). Project page: https://jinghangli98.github.io/MRecover/
Authors:Haiyang Shen, Taian Guo, Xuanzhong Chen, Mugeng Liu, Weichen Bi, Wenchun Jing, Sixiong Xie, Zhuofan Shi, Yudong Han, Chongyang Pan, Siqi Zhong, Jinsheng Huang, Ming Zhang, Yun Ma
Abstract:
Although LLMs have made substantial progress in reasoning, systematically producing frontier‑level reasoning data remains difficult. Existing synthesis methods often have limited visibility into the structural factors that govern problem difficulty, which can result in narrow diversity and unstable difficulty control. In this work, we view the difficulty of a reasoning problem as arising from the accumulation of atomic knowledge‑reasoning transformations, which we term thought modes. Building on this perspective, we propose MindLoom, a framework for synthesizing frontier‑level reasoning data through compositional thought mode engineering. Given a collection of hard problems with verified solutions, MindLoom first decomposes those solutions into thought mode chains that reveal each problem's construction logic. It then trains a retrieval model that matches problem states to compatible thought modes, providing guidance on which reasoning challenges to introduce during synthesis. New problems are composed by iteratively applying retrieved thought modes to seed questions, with distribution‑aligned sampling to encourage diverse reasoning coverage. Finally, a rollout‑based judging stage labels generated questions by difficulty and supplies judged‑correct responses for supervised fine‑tuning. We evaluate MindLoom on nine benchmarks covering five STEM disciplines and four mathematical reasoning tasks across multiple model families and sizes. Models fine‑tuned on MindLoom‑generated data achieves favorable performances over base models, distillation, and external‑data baselines across the reported benchmarks. Ablation studies indicate the contribution of each component, and further analysis suggests that MindLoom covers a broad range of reasoning patterns while maintaining useful difficulty control. We have open‑sourced our implementation at https://github.com/EachSheep/MindLoom.
Authors:Xiaogeng Liu, Xinyan Wang, Yingzi Ma, Yechao Zhang, Chaowei Xiao
Abstract:
On‑policy self‑distillation (OPSD) trains a student on its own rollouts using a privileged teacher, but its standard objective weights all generated tokens equally, implicitly treating the privileged teacher target as equally reliable at every student‑visited prefix. Existing entropy‑based OPD methods relax this uniformity by modulating token‑level supervision with teacher entropy, but high teacher entropy in reasoning has an ambiguous reliability meaning: it can reflect either non‑viable uncertainty or benign solution diversity. To identify this phenomenon, we introduce a branch‑viability diagnostic. Specifically, we record next‑token alternatives from the privileged‑answer teacher prompt, force each alternative after the student prompt plus its on‑policy spine prefix, and test whether the resulting student‑template continuation recovers the correct answer. On Qwen3‑4B, we find that an oriented within‑sequence position score is the strongest tested predictor of teacher‑token reliability, reaching an area‑under‑ROC‑curve (AUROC) of 0.83; local uncertainty scores are at most 0.57. Motivated by this trajectory‑level structure, we propose Position‑Weighted On‑Policy Self‑Distillation (PW‑OPSD), which applies an increasing position weight while keeping the same student rollout, privileged teacher pass, and clipped forward‑KL target as OPSD. In our comprehensive evaluations with different random seeds, the diagnostic‑derived PW‑OPSD improves AIME 2024 and AIME 2025 Avg@12 by +1.0 and +1.1 points, and a generalization evaluation on two larger‑scale models from different families, DeepSeek‑R1‑Distill‑Llama‑8B and Olmo‑3‑7B‑Think, also demonstrates consistent aggregate Avg@12 improvements. These results show that teacher‑token reliability in reasoning distillation is trajectory‑structured and can be utilized without additional teacher computation.
Authors:Lukas Weidener, Marko Brkić, Mihailo Jovanović, Emre Ulgac, Aakaash Meduri
Abstract:
Frontier large language models are increasingly deployed as orchestration backbones for biological research workflows, yet no shared evidence base exists for comparing their refusal behaviour on legitimate research prompts. RefusalBench, introduced here, is a matched‑triple benchmark of 141 prompts in 47 bundles that holds task framing constant while varying only biological risk tier (benign, borderline, dual‑use), enabling tier‑conditioned comparisons robust to subdomain confounding. A 15‑prompt should‑refuse positive‑control module establishes per‑model calibration floors; three models fail to refuse even these prompts. Across 19 frontier models in the May 2026 snapshot, strict refusal rates span 0.1% to 94.6% on identical prompts. Jurisdiction does not predict refusal in this snapshot (Mann‑Whitney U, p = 0.393; EU n = 1, US bimodal); provider identity does, with Anthropic's API stack predicting refusal at OR = 21.03 (95% CI: 14.58‑30.34 prompt‑clustered; 5.70‑77.55 under model‑clustered GEE). This effect is best read as access‑path‑level rather than model‑weight‑level: 99.8% of Anthropic's strict refusals carry the same safety_policy adjudicated reason code, consistent with a small set of canonical refusal templates rather than case‑by‑case model reasoning. Strict refusal rate misranks safety calibration: Grok 4.20 achieves the highest tier discrimination (Youden's J = 0.787) while ranking only seventh by overall refusal rate, and Claude Opus 4.7's J dropped 65% from prior versions with no improvement in dual‑use detection. Nine of 18 frontier models exhibit a hedge‑but‑help partial‑compliance pattern at dual‑use tier that binary refusal metrics cannot detect.
Authors:Rony Abecidan, Vincent Itier, Jérémie Boulanger, Patrick Bas, Tomáš Pevný
Abstract:
Steganalysis models excel on benchmark datasets but struggle in the wild when analyzed images are produced by a processing pipeline unseen during training. This problem known as Cover Source Mismatch (CSM) is particularly hard in realistic settings where practitioners (1) have access to only a small, unlabeled dataset, (2) are unsure of the processing techniques applied to these images, and (3) lack information on the proportion of covers and stegos in that set. To answer this challenge, we introduce TADA (Target Alignment through Data Adaptation), a framework learning to emulate the unknown processing pipeline from a small unlabeled target set. This architecture is trained with a loss combining residual covariance alignment, residual distribution matching, and a \ell^2 loss constraining the emulator to produce realistic images. Across toy and operational targets, TADA yields substantial gains in robustness to CSM and improves operational generalization compared to strong holistic and atomistic baselines. Additional resources are available at this link: https://github.com/RonyAbecidan/TADA
Authors:Brandon Dent
Abstract:
Frontier language models are being deployed into clinical workflows faster than the infrastructure to evaluate them safely. Static medical‑QA benchmarks miss the failure modes that matter in emergency medicine: trajectory‑level safety collapse, tool misuse, and capitulation under sustained clinical pressure. We present HealthCraft, the first public reinforcement‑learning environment that rewards trajectory‑level safety under realistic emergency‑medicine conditions, adapted from Corecraft. It is built on a FHIR R4 world state with 14 entity types and 3,987 seed entities, exposes 24 MCP tools, and defines a dual‑layer rubric that zeroes reward whenever any safety‑critical criterion is violated. We release 195 tasks across six categories, graded against 2,255 binary criteria (515 safety‑critical); a post‑hoc 10‑task negative‑class slate extends this to 205 tasks and 2,337 criteria. V8 results on two frontier models show Claude Opus 4.6 at Pass@1 24.8% [21.5‑28.4] and GPT‑5.4 at 12.6% [10.2‑15.6], with safety‑failure rates of 27.5% and 34.0%. On multi‑step workflows ‑ the closest proxy to real emergency care ‑ performance collapses to near zero (Claude 1.0%, GPT‑5.4 0.0%) despite partial competence on individual steps. Six infrastructure bugs fixed between pilots v2 and v8 re‑ordered which model "looks stronger," evidence that infrastructure fidelity is part of the measurement. A deterministic LLM‑judge overlay bounds evaluator noise, and a 60‑run negative‑class smoke pilot shows the reward signal is not drop‑in training‑safe: restraint criteria pass at 0.929 prevalence, a gameability an eval harness can tolerate but a training reward cannot. We scaffold coupling to a Megatron+SGLang+GRPO loop per Corecraft Section 5.2 and leave training‑reward ablations as future work. Environment, tasks, rubrics, and harness are released under Apache 2.0.
Authors:Drake Caraker, Bryan Arnold, David Rhoads
Abstract:
No feature ranking can be simultaneously faithful, stable, and complete when features are collinear. For collinear pairs, ranking reduces to a coin flip. We prove this impossibility, quantify it for four model classes, resolve it via ensemble averaging (DASH), and machine‑verify it with 305 Lean 4 theorems. We characterize the complete attribution design space: exactly two families of methods exist ‑‑ faithful‑complete methods (unstable, with rankings that flip up to 50% of the time) and ensemble methods like DASH (stable, reporting ties for symmetric features) ‑‑ and no method lies outside this dichotomy. The impossibility is quantitative: the attribution ratio diverges as 1/(1‑rho^2) for gradient boosting, is infinite for Lasso, and converges for random forests. DASH (Diversified Aggregation of SHAP) is provably Pareto‑optimal among unbiased aggregations, achieving the Cramer‑Rao variance bound with a tight ensemble size formula. In a survey of 77 public datasets, 68% exhibit attribution instability. Switching to conditional SHAP does not escape the impossibility when features have equal causal effects. The framework includes practical diagnostics ‑‑ a Z‑test workflow and single‑model screening tool ‑‑ and has direct consequences for fairness auditing: SHAP‑based proxy discrimination audits are provably unreliable under collinearity. The design space theorem, diagnostics, and impossibility are mechanically verified in Lean 4 (305 theorems from 16 axioms, 0 sorry) ‑‑ to our knowledge, the first formally verified impossibility in explainable AI.
Authors:Sixiong Xie, Zhuofan Shi, Haiyang Shen, Jiuzheng Wang, Siqi Zhong, Mugeng Liu, Chongyang Pan, Peilun Jia, Baoqing Sun, Xiang Jing, Yun Ma
Abstract:
Deep research, in which an agent searches the open web, collects evidence, and derives an answer through extended reasoning, is a prominent use case for frontier language models. Frontier deep research products score high on existing benchmarks, making it difficult to distinguish their capabilities from current evaluation data alone. We introduce DeepWeb‑Bench, a deep research benchmark that is substantially harder than existing benchmarks for the current frontier. Difficulty comes from three properties of the data itself: each task requires massive evidence collection, cross‑source reconciliation, and long‑horizon multi‑step derivation. We represent these three sources of difficulty as four capability families (Retrieval, Derivation, Reasoning, and Calibration) and report results sliced by family. Every reference answer is accompanied by a source‑provenance record with four disclosure levels and cross‑source checks where available, making scores easier to audit against the underlying evidence. We evaluate DeepWeb‑Bench on nine frontier models and report three findings: (1) retrieval is not the bottleneck, as retrieval failures account for only 12‑14% of errors while derivation and calibration failures account for over 70%; (2) strong and weak models fail in qualitatively different ways, with strong models' errors dominated by incomplete derivation and weak models' by hallucinated precision; and (3) models exhibit genuine specialization across domains, with cross‑model agreement of only rho = 0.61 and per‑case disagreement reaching 18.8 percentage points. The public benchmark release includes the data, rubrics, and evaluation code.
Authors:Peng Ding, Rick Stevens
Abstract:
Third‑party Python libraries introduce dependency management overhead, supply chain risk, and deployment friction in constrained environments. A natural question is how much of this ecosystem can be replicated using only Python's standard library ‑‑ and at what correctness and performance cost. We address this empirically through zerodep, a growing collection of single‑file Python modules, each a stdlib‑only reimplementation of a popular third‑party library, developed with LLM assistance under strict constraints: no external imports, single file, drop‑in API compatibility, and mandatory correctness validation against the reference library. Spanning over 40 modules across 12 categories ‑‑ including serialization, networking, cryptography, agent protocols, and text processing ‑‑ zerodep provides a controlled testbed for two interrelated questions: (1) Where does the stdlib suffice? and (2) Can LLMs effectively generate correct, performant code under tight symbolic constraints? Systematic benchmarking shows that stdlib‑only implementations achieve performance parity (within 2x of the reference) in the majority of cases. The primary performance cliff is C‑extension‑backed computation (image processing, binary serialization, low‑level crypto), not the inherent overhead of pure‑Python third‑party libraries. Conversely, many widely‑used libraries carry architectural overhead that LLM‑generated stdlib reimplementations avoid, yielding 5‑‑115x speedups in several categories. We characterize the stdlib capability boundary across complexity tiers and library categories, discuss where LLM‑assisted development succeeds and where it requires iterative human correction, and examine implications for dependency‑free software engineering at scale. zerodep is open‑source at https://github.com/Oaklight/zerodep.
Authors:Lucheng Fu, Ye Yu, Yiyang Wang, Yiqiao Jin, Haibo Jin, B. Aditya Prakash, Haohan Wang
Abstract:
Large language models (LLMs) are highly sensitive to the prompts used to specify task objectives and behavioral constraints. Many recent prompt optimization methods iteratively rewrite prompts using LLM‑generated feedback, but the resulting prompts often become longer, accumulate narrow sample‑specific rules, and generalize poorly beyond the training distribution. We study this failure mode as prompt distributional overfitting and argue that it reflects a lack of representation control in discrete text‑space optimization. We formalize this view through representational inefficiency, a dual‑factor measure that decomposes prompt inefficiency into capacity cost and scope narrowness, attributing distributional prompt overfitting to their coupled growth during optimization. We propose TextReg, a regularization framework that realizes a soft‑penalty objective through regularized textual gradients, combining Dual‑Evidence Gradient Purification, Semantic Edit Regularization, and Regularization‑Guided Prompt Update. Across multiple reasoning benchmarks, TextReg substantially improves out‑of‑distribution (OOD) generalization, with accuracy gains of up to +11.8% over TextGrad and +16.5% over REVOLVE.
Authors:Jeonghun Baek, Atsuyuki Miyai, Shota Onohara, Hikaru Ikuta, Kiyoharu Aizawa
Abstract:
Manga is a culturally distinctive multimodal medium and one of the most influential forms of Japanese popular culture. As AI systems increasingly target manga understanding, OCR, and translation, Manga109 has become a foundational dataset for manga‑related AI research. However, the current Manga109 dataset contains transcription errors and coarse annotations, which do not align well with modern OCR and multimodal manga understanding tasks. In this work, we revisit the dialogue text annotations of Manga109 and identify five categories of annotation issues, including transcription errors, missing text regions, overlapping dialogue and onomatopoeia, and under‑segmented speech balloons. To address these issues, we combine OCR‑based issue detection and manual revision to construct Manga109‑v2026, revising approximately 29,000 dialogue annotations. Our revisions better align Manga109 with modern OCR and multimodal manga understanding systems while preserving expressive structures characteristic of manga.
Authors:Qiyu Ruan, Yuxuan Wang, He Li, Zhenning Li, Cheng-zhong Xu
Abstract:
Safety‑critical scenarios are central to evaluating autonomous driving systems, yet their rarity in naturalistic logs makes simulation‑based stress testing indispensable. Most scenario generation methods treat surrounding agents as adversaries, but they either (i) induce failures without explicitly modeling vehicle‑road physical limits, yielding visually extreme yet physically unsolvable crashes, or (ii) enforce physical feasibility or policy feasibility in isolation, which can over‑focus on aggressive maneuvers or remain tied to a controller‑dependent capability boundary. We propose ScenePilot, a feasibility‑guided, boundary‑driven framework that targets the boundary band: scenarios that are physically solvable in principle yet still cause the deployed autonomy stack to fail. We formulate generation as constrained multi‑objective reinforcement learning, combining an RSS‑derived physical‑feasibility score σ with an online‑learned AV‑risk predictor Φ, and introduce step‑level feasibility‑aware shielding to keep exploration near the feasibility boundary while avoiding infeasible artifacts. Experiments on SafeBench with multiple planners show that ScenePilot yields substantially higher collision rates (+6.2 percentage points) while preserving physical validity, and that adversarial fine‑tuning on these boundary‑band scenarios consistently reduces downstream crash rates. The code is available at https://github.com/QiyuRuan/ScenePilot.
Authors:Bo Ye, Xinyu Cui, Jian Zhao, Tong Wei, Min-Ling Zhang
Abstract:
Autoregressive long video generation often adopts bounded‑memory streaming for efficiency, typically combining local windows for short‑term continuity with static early‑frame sinks as long‑range anchors. However, this fixed allocation keeps early frames cached even when the current visual state has substantially diverged from them, while discarding potentially more relevant intermediate history. As a result, the retained long‑range context may become less adaptive and bias generation toward outdated cues; in severe cases, RoPE‑induced phase re‑alignment can homogenize inter‑head attention and cause sink collapse, where content regresses toward sink frames. We propose DySink, a retrieval‑based framework that maintains a compact memory bank and selects visually relevant historical frames as dynamic frame sinks. DySink couples adaptive retrieval with a sink anomaly gate, which detects excessive inter‑head consensus over retrieved context and suppresses collapse‑prone context. Experiments on minute‑long videos show that DySink consistently improves dynamic degree over strong baselines while also achieving higher temporal quality. The code and model weights will be released at https://github.com/yebo0216best/DySink.
Authors:Yan Xia, Zhuangzhuang Pan, Amirrudin Kamsin, Chee Seng Chan
Abstract:
Aspect‑Term Sentiment Analysis (ATSA) in multi‑aspect sentences faces a fundamental tradeoff between efficiency and expressiveness. Existing models either re‑encode the sentence for each aspect or rely on static use of deep representations, leading to redundant computation and limited adaptivity. We argue that Transformer depth is a costly, queryable resource, and propose DABS, a single‑pass inference framework that encodes each sentence once to construct a reusable, depth‑ordered substrate. Each aspect then queries this shared representation to selectively read relevant tokens and abstraction levels, without re‑encoding. This decouples shared sentence encoding from lightweight, aspect‑conditioned readout. Experiments on four ATSA benchmarks show that DABS achieves competitive performance while reducing end‑to‑end computation by up to 60% in multi‑aspect settings (M >= 2). Further analyses indicate that adaptive depth querying is most beneficial for linguistically complex cases such as negation and contrast. Code is publicly available at https://github.com/panzhzh/acl‑dabs
Authors:Yutong Xie, Zhenglin Hua, Ran Wang, Wing W. Y. Ng, Xizhao Wang, Yuheng Jia
Abstract:
Large Vision‑Language Models (LVLMs) have shown remarkable performance on a wide range of vision‑language tasks. Despite this progress, they are still prone to hallucination, generating responses that are inconsistent with visual content. In this work, we find that LVLMs tend to hallucinate when they pay insufficient attention to the correct visual evidence and gradually forget it during the generation process. We empirically find that although LVLMs overall attend insufficiently to visual evidence, they exhibit sensitivity to the correct visual evidence in specific layers, with notable inter‑layer discrepancy. Motivated by this observation, we propose a novel hallucination mitigation method that enhances visual evidence based on Inter‑Layer Visual Attention Discrepancy (ILVAD). Specifically, we obtain the attention weights from early generated tokens to visual tokens across layers and identify the tokens that are repeatedly activated as visual evidence, forming a saliency map. We then enhance attention to visual evidence during generation through the saliency map to reduce visual forgetting. In addition, we leverage the saliency map to obtain attention scores of generated text to visual evidence, in order to select and emphasize text tokens that are strongly grounded in visual evidence. Our method is training‑free and plug‑and‑play. Multiple benchmark evaluations conducted on five recently released models show that our method can consistently mitigate hallucinations in different LVLMs over various architectures. Code is available at https://github.com/ytx‑ML/ILVAD.
Authors:Jiawen Dai, Yue Song
Abstract:
Oscillations and synchronization are widely believed to play a fundamental role in representation and computation. However, existing machine learning approaches based on synchronization dynamics have largely been confined to specialized settings such as object discovery, with limited evidence of scalability to standard vision benchmarks or logic reasoning tasks. We propose the Winfree Oscillatory Neural Network (WONN), a dynamical neural architecture based on generalized Winfree dynamics. WONN evolves representations on the torus (S^1)^d through structured oscillatory interactions, combining phase‑based inductive biases with flexible and hierarchical interaction mechanisms instantiated as either fixed trigonometric mappings or learnable neural networks. We evaluate WONN on image recognition and complex reasoning tasks, including CIFAR, ImageNet, Maze‑hard, and Sudoku. Across these domains, WONN achieves competitive or superior performance with strong parameter efficiency. In particular, WONN is, to our knowledge, the first synchronization‑based oscillatory architecture to scale competitively to ImageNet‑1K. Furthermore, on Maze‑hard, WONN achieves 80.1% accuracy using only 1% of the parameters of prior state‑of‑the‑art models. These results suggest that structured oscillatory dynamics provide a scalable and parameter‑efficient alternative to conventional neural architectures.
Authors:Qiaohui Chu, Haoyu Zhang, Yisen Feng, Meng Liu, Weili Guan, Dongmei Jiang, Liqiang Nie
Abstract:
We propose VISTA, a V‑JEPA Integrated StillFast Temporal Anticipator for the Ego4D Short‑Term Object Interaction Anticipation (STA) Challenge at EgoVis 2026. Given an egocentric video timestamp, the task requires anticipating the next human‑object interaction, including the future active object's bounding box, noun category, verb category, time‑to‑contact, and confidence score. VISTA follows a StillFast‑style design that combines object‑centric spatial detection with short‑horizon temporal context. Specifically, a COCO‑pretrained Faster R‑CNN ResNet‑50 FPN detector generates object proposals from the last observed high‑resolution frame, while a frozen V‑JEPA 2.1 temporal branch extracts clip‑level egocentric context from the observed video. The temporal representation is injected into the detection pathway through feature modulation and ROI‑level context fusion. The fused proposal features are then passed to multi‑head STA predictors for box refinement, noun classification, verb classification, time‑to‑contact regression, and interaction confidence estimation. For the final submission, we further ensemble complementary predictions to improve robustness. Experimental results on the official challenge server show that VISTA achieves first place in the EgoVis 2026 Ego4D STA Challenge. Our code will be released at https://github.com/CorrineQiu/VISTA.
Authors:Hanxiang Ren, Pei Zhou, Xunzhe Zhou, Yanchao Yang
Abstract:
Language‑conditioned manipulation policies typically process instructions and observations through shared network parameters. This task‑state entanglement provides a pathway for observation leakage ‑‑ networks learn scene‑to‑action shortcuts that bypass language grounding entirely. DISC eliminates this failure structurally. Rather than conditioning a universal policy on language, DISC uses a hypernetwork to generate the entire parameter set of a task‑specific visuomotor policy from the instruction alone. The generated policy never directly accesses language; therefore, its task‑awareness must come from the language. Consequently, observation leakage has no pathway to emerge. On the other hand, generating coherent high‑dimensional policy weights is itself a challenging problem. We address it with a two‑stage hypernetwork whose refinement stage embeds the structure of gradient‑based optimization as a feed‑forward inductive bias, producing globally consistent parameters without actual gradient computation. Trained entirely from scratch on standard data budgets, DISC outperforms all entangled baselines on LIBERO‑90 and Meta‑World, with advantages that widen on complex, long‑horizon tasks ‑‑ and surpasses the large‑scale pretrained π_0 despite using no external pretraining data. On a real‑world benchmark where all tasks share identical visual context, DISC substantially outperforms entangled alternatives, directly confirming that language‑generated policy parameters, not visual shortcuts, drive behavior. The hypernetwork further learns a semantically structured parameter manifold that enables few‑shot adaptation from minimal demonstrations and robust generalization across paraphrased instructions. Our code is available at: https://github.com/ReNginx/DISC.
Authors:Zhiqin Yang, Yonggang Zhang, Wei Xue, Dong Fang, Bo Han, Yike Guo
Abstract:
Direct Preference Optimization (DPO) has emerged as a popular alternative to Reinforcement Learning from Human Feedback (RLHF), offering theoretical equivalence with simpler implementation. We prove this equivalence is conditional rather than universal, depending on an implicit assumption frequently violated in practice: the RLHF‑optimal policy must prefer human‑preferred responses. When this assumption fails, DPO optimizes relative advantage over the reference policy rather than absolute alignment with human preferences, leading to pathological convergence where policies decrease DPO loss while preferring dispreferred responses. We characterize when this assumption is violated, show the existence of an undesirable solution space, and prove that DPO and RLHF optimize fundamentally different objectives in such cases. To address this, we introduce Constrained Preference Optimization (CPO), augmenting RLHF with constraints for provable alignment. We further provide a geometric interpretation through soft margin ranking, revealing that DPO implements margin ranking with potentially negative targets. Our theoretical analysis establishes when DPOs' guarantees hold and provides solutions preserving simplicity with provable alignment. Comprehensive experiments on standard benchmarks demonstrate that CPO achieves state‑of‑the‑art performance. Code is available at: https://github.com/visitworld123/CPO.
Authors:Xuehui Yu, Fucheng Cai, Meiyi Wang, Xiaopeng Fan, Harold Soh
Abstract:
Inference‑time guided sampling steers state‑of‑the‑art diffusion and flow models without fine‑tuning by interpreting the generation process as a controllable trajectory. This provides a simple and flexible way to inject external constraints (e.g., cost functions or pre‑trained verifiers) for controlled generation. However, existing methods often fail when composing multiple constraints simultaneously, which leads to deviations from the true data manifold. In this work, we identify root causes of this off‑manifold drift and find that the approximation error scales severely with gradient misalignment. Building on these findings, we propose Conflict‑Aware Additive Guidance (g^\textcar), a lightweight and learnable method, which actively rectifies off‑manifold drift by dynamically detecting and resolving gradient conflicts. We validate g^\textcar across diverse domains, ranging from synthetic datasets and image editing to generative decision‑making for planning and control. Our results demonstrate that g^\textcar effectively rectifies off‑manifold drift, surpassing baselines in generation fidelity while using light compute. Code is available at https://github.com/yuxuehui/CAR‑guidance.
Authors:Yefan Zhou, Yilun Zhou, Austin Xu, Soroush Vosoughi, Shafiq Joty, Jiang Gui
Abstract:
Generative verifiers have emerged as a promising paradigm for step‑wise verification, but their verification behavior is often poorly calibrated: they may be under‑critical and miss erroneous steps, or over‑critical and reject correct reasoning. We refer to this tendency to be overly lenient or overly critical as verifier strictness. In this work, we study whether verifier strictness can be controlled through hidden‑state intervention. We uncover a verification‑specific hidden‑state signal: in step‑wise verification, a verifier's tendency to accept or reject a solution step is encoded near the boundary of the corresponding verification paragraph. Exploiting this signal, we show that hidden‑state steering can directly modulate verifier strictness without fine‑tuning. However, uniform steering induces a trade‑off between error detection and correctness certification. To address this, we propose VerifySteer, which exploits latent correctness signals for sample‑level routing and selectively intervenes on paragraph boundaries. Experiments on ProcessBench and Hard2Verify show that VerifySteer outperforms prompt optimization and activation steering baselines, and is competitive with self‑consistency while requiring 4‑7x less inference compute. VerifySteer is also complementary to verification fine‑tuning, providing further gains on top of fine‑tuned verifiers. The code is available at https://github.com/YefanZhou/VerifySteer.
Authors:Amit Roth, Ankur Samanta, Matan Halevy, Yoav Levine, Yonathan Efroni
Abstract:
Aligning autonomous agents with human intent remains a central challenge in modern AI. A key manifestation of this challenge is reward hacking, whereby agents appear successful under the evaluation signal while violating the intended objective. Reward hacking has been observed across a wide range of settings, yet methods for reliably measuring it at scale remain lacking. In this work, we introduce a new evaluation paradigm for measuring reward hacking. Whereas prior studies have primarily analyzed it post hoc by inspecting agent trajectories, we instead embed detectable reward hacking opportunities directly into environments. This makes their exploitation verifiable by design, enabling deterministic and automated measurement of whether and how agents exploit such vulnerabilities. We instantiate this approach in TextArena and release Hack‑Verifiable TextArena, a testbed in which reward hacking can be measured reliably. Using this benchmark, we analyze reward hacking behavior across language models in diverse environments and settings. We open source the code at https://github.com/MajoRoth/hack‑verifiable‑environments/.
Authors:Miaobo Hu, Shuhao Hu, Bokun Wang, Ruohan Wang, Xin Wang, Xiaobo Guo, Daren Zha, Jun Xiao
Abstract:
Reinforcement learning improves LLM reasoning, but PPO/GRPO typically use fixed clipping and decoding temperature, which makes training brittle and tuning‑heavy. We propose Adaptive Group Policy Optimization (AGPO), a critic‑free refinement of GRPO that uses group‑level statistics to control both update magnitude and exploration. AGPO uses a shared probe‑derived statistical state to drive two controllers: (i) adaptive clipping, which sets the trust‑region size from reward dispersion and skewness, probe vote entropy, policy entropy, and step‑wise KL drift; and (ii) bidirectional adaptive temperature sampling, which heats or cools decoding around a base temperature according to centered uncertainty relative to a running baseline. On nine English and Chinese math/STEM benchmarks, Qwen2.5‑14B trained with AGPO outperforms PPO/GRPO under the same generated‑token budget, reaching 67.3% on GSM8K and 40.5% on MATH. Gains transfer to Llama‑3‑8B and Gemma‑2‑9B, and ablations confirm both modules are complementary. Our implementation is publicly available at https://github.com/wandugu/paper_agpo.
Authors:Jiawen Zhu, Shuhan Liu, Di Weng, Yingcai Wu
Abstract:
Non‑stationary time series forecasting is challenged by evolving distribution shifts that static models struggle to capture. While Mixture‑of‑Experts (MoE) architectures offer a promising paradigm for decoupling complex drift patterns, existing approaches are limited by fixed expert pools and memoryless routing, hampering their ability to adapt to abrupt regime shifts. To address this, we propose Dynamic TMoE, a framework that unifies architectural evolution with temporal continuity during learning phase. By detecting distribution shifts via Maximum Mean Discrepancy (MMD), we dynamically instantiate heterogeneous experts and prune redundant ones to optimize capacity. Additionally, a temporal memory router leverages recurrent states and an anomaly repository to ensure stable, context‑aware expert selection without requiring test‑time updates. Experiments on nine benchmarks demonstrate state‑of‑the‑art performance, reducing MSE by 10.4% and MAE by 7.8%. Code is available at https://github.com/andone‑07/Dynamic‑TMoE.
Authors:Duy Nguyen, Hanqi Xiao, Archiki Prasad, Zaid Khan, Anirban Das, Austin Zhang, Sambit Sahu, Hyunji Lee, Elias Stengel-Eskin, Mohit Bansal
Abstract:
Self‑distillation enables language models to learn on‑policy from their own trajectories by using the same model as both student and teacher, with the teacher being conditioned on privileged information unavailable to the student. Such information can come in different types or views, such as solutions, demonstrations, feedback, or final answers. This setup provides dense token‑level feedback without relying on a separate external model, but creates a fundamental asymmetry: the teacher may rely on view‑specific information that the student cannot access at inference time. Moreover, the best type of privileged information is often task‑dependent, making it difficult to choose a single teacher view. In this work, we address both these challenges jointly by introducing AVSD (Adaptive‑View Self‑Distillation), a novel method of self‑distillation with multiple privileged‑information views, which reconstructs token‑level supervision by separating stable cross‑view consensus from view‑specific residual signals. AVSD identifies the consensus signal shared across views, which provides a reliable update direction, and then selectively adds the view‑specific residual signal to adjust the update magnitude when it both aligns with the consensus direction and remains proportionate to the consensus signal. Experiments on math competition benchmarks (AIME24, AIME25, and HMMT25) show that AVSD consistently outperforms both single‑view self‑distillation baselines and GRPO, achieving average Avg@8 gains of 3.1% and 2.2% over the strongest baselines on Qwen3‑8B and Qwen3‑4B, respectively. Moreover, on code‑generation benchmarks (Codeforces, LiveCodeBench v6) using Qwen3‑8B, AVSD outperforms the single‑view self‑distillation baseline by 2.4% on average.
Authors:Oleksandr Yakovenko, Mahdi Mostajabdaveh, Cheikh Ahmed, Abdullah Ali Sivas, Xiaorui Li, Zirui Zhou, Mao Kun
Abstract:
Although Vehicle Routing Problems (VRP) are essential to many real‑world systems, they remain computationally intractable at scale due to their combinatorial complexity. Traditional heuristics rely on handcrafted rules for local improvements and occasional jumps to escape local minima, but often struggle to generalize across diverse instances. We introduce COAgents, a cooperative multi‑agent framework that models the search process as a graph: nodes represent solutions, and edges correspond to either local refinements or large perturbations for diversification (i.e., jumps). A Partial Search Graph (PSG) is dynamically constructed during search, enabling COAgents to train a Node Selection Agent and a Move Selection Agent to guide intensification, and a Jump Agent to trigger well‑timed explorations of new regions. Unlike end‑to‑end learning approaches, COAgents cleanly separates problem‑agnostic search control from compact domain‑specific encoding, facilitating adaptability across tasks. Extensive experiments on the CVRP and VRPTW benchmarks show that COAgents remains competitive with several learn‑to‑search baselines on CVRP and sets a new state of the art among learning‑based methods on the more challenging VRPTW instances, reducing the gap to the best‑known solutions by 14% at N\!=\!100 and 44% at N\!=\!50 relative to the strongest neural solver (POMO), and by 21% and 40% respectively relative to ALNS. Code is available at https://github.com/mahdims/COAgents.
Authors:Longchao Da, Mithun Shivakoti, Xiangrui Liu, T Pranav Kutralingam, Yezhou Yang, Hua Wei
Abstract:
Urban heat exposure is becoming an increasingly critical challenge due to the intensifying urban heat island effect. Fine‑grained shade patterns, especially those induced by urban buildings, strongly influence pedestrians' thermal exposure and outdoor activity planning. However, accurately modeling and analyzing urban shade at scale remains difficult because of the lack of large‑scale datasets and systematic evaluation frameworks. To address this challenge, we present ShadeBench, a comprehensive dataset and benchmark for urban shade understanding. ShadeBench contains geographically diverse urban scenes with temporally varying simulated shade maps and textual descriptions, together with aligned satellite imagery, building skeleton representations, and 3D building meshes. Built upon this multimodal dataset, ShadeBench supports a range of downstream tasks, including shade generation, shade segmentation, and 3D building reconstruction. We further establish standardized evaluation protocols and baseline methods for these tasks. By enabling scalable and fine‑grained shade analysis, ShadeBench provides a foundation for data‑driven urban climate research and supports future studies in heat‑resilient urban planning and decision‑making. The code and dataset are publicly available at https://darl‑genai.github.io/shadebench/.
Authors:Lucky Verma
Abstract:
Transformers trained on modular arithmetic exhibit sharp transitions between memorization, generalization, and collapse. We show that weight decay acts as a scalar empirical control parameter for these regimes, and introduce two cheap online diagnostics, mean pairwise attention‑head cosine similarity and entropy standard deviation, that track training dynamics from attention activations alone and complement loss‑landscape diagnostics at lower compute cost. Across eleven experimental conditions and three model scales (0.82M to 85M parameters), the weight‑decay axis separates memorization, developmental grokking, and collapse. A near‑transition logistic fit localizes the memorization‑to‑developmental boundary at λ_c=0.0158 (95% CI [0.0109, 0.0200], N=210); a power‑law fit gives an empirical exponent ν=0.757 (CI [0.725, 0.799]). Reference exponents ν=1/2 and 3D Ising ν\approx 0.63 lie outside this empirical CI under our four‑bin grid, so we report ν as empirical and defer universality‑class identification to denser finite‑size‑scaling work. A horizon‑matched multi‑task replication (n=280, four modular operations) preserves the weight‑decay control pattern; a paired attention‑head re‑initialization experiment at λ=0.05 changes Phase‑2 amplitude (Cohen's d=‑1.190, n=10, p_t=4.5 × 10^‑3), while matched weight‑norm clipping does not. Three cross‑architecture probes (4L MLP, 4L LSTM, and 4L Mamba; each n=70) replicate the weight‑decay‑controlled transition with architecture‑specific λ_c values. Main diagnostic claims are scoped to modular arithmetic in small transformer attention models; the non‑attention experiments are scope probes, and architecture‑wide, language‑model, and universality‑class claims are out of scope.
Authors:Sharmin Sultana Srishty, Kazi Mahathir Rahman, Malaika Parizat Sakkhi, Samia Shahid Prianna, Shaikhul Islam Sinat
Abstract:
Large Language Models (LLMs) perform well on many language tasks, but their Theory of Mind (ToM) reasoning is still uneven in complex social settings. Existing benchmarks, including ExploreToM, do not always test the recursive beliefs and information asymmetries that make these settings difficult. This paper presents OSCToM (Observer‑Self Conflict Theory of Mind), an approach for modeling nested belief conflicts in LLM‑based ToM tasks. The key case is one in which an observer's view of another agent conflicts with the observer's own belief state. Such cases go beyond simple perspective‑taking and require recursive, multi‑layered reasoning. OSCToM combines reinforcement learning (RL), an extended domain‑specific language, and compositional surrogate models to generate observer‑self conflicts. In our experiments, OSCToM‑8B gives the best overall result among the systems tested. It improves on the reported ExploreToM results on FANToM and remains competitive on Hi‑ToM and BigToM. On the information‑asymmetric FANToM benchmark, OSCToM reaches 76% accuracy, compared with the 0.2% reported by ExploreToM. The data‑synthesis procedure is also 6x more efficient, indicating that targeted training data can help smaller models handle advanced cognitive reasoning. The project code is available at https://github.com/sharminsrishty/osct.
Authors:Iason Skylitsis, Dimitrios Karkalousos, Ivana Išgum
Abstract:
Class imbalance is a fundamental challenge in medical image segmentation, where frequent classes typically dominate training at the expense of rare classes. Loss‑based approaches mitigate imbalance by reweighting the per‑pixel loss within the batch, while sampling strategies control which images enter the batch. Yet neither explicitly controls which classes appear within the batch, leaving rare‑class exposure only partially rebalanced. In this work, we adopt episodic sampling from few‑shot learning to promote class‑balanced batch construction in a fully supervised setting. We decouple episodic sampling from its conventional metric‑learning context and evaluate it in body composition segmentation in CT. We compare episodic sampling against random and weighted sampling on nine muscle and adipose tissues, derived from 210 scans of the public SAROS dataset. Training is performed under full‑ and low‑data regimes, with additional comparisons under matched training iteration budgets. Under full‑data training, all three strategies performed comparably (mean Dice 0.882 for episodic, 0.878 for random and weighted). Under low‑data training, episodic sampling outperformed random and weighted (0.787 vs. 0.758 and 0.762), driven by a 12‑fold difference in training iterations. Under matched training budgets, random and weighted overfit earlier, while episodic improved for approximately three times more iterations before plateauing. Our findings identify the training iteration budget as under‑recognized confound in sampling strategies, motivating iteration‑aware evaluation protocols for small datasets. Furthermore, the residual advantage of episodic sampling is consistent with an implicit regularization effect of class‑balanced batches, offering a low‑cost, model‑agnostic strategy for class‑imbalanced medical image segmentation. Code is available at https://github.com/iasonsky/episodic‑sampling.
Authors:Tianshu Wu, Xiangqi Kong, Yue Chen, Qize Yu, Hang Ye, Jia Li, Yizhou Wang, Hao Dong
Abstract:
Building humanoid robots capable of generalizable whole‑body loco‑manipulation in the real world remains a fundamental challenge. Existing methods either rely on laborious task‑specific reward engineering, rigidly replay reference motions that fail to generalize, or depend on costly teleoperation that limits scalability. While human videos capture diverse human behaviors, motion priors inferred from them are inherently imperfect, suffering from occlusion, contact artifacts, and retargeting errors that render them unsuitable for direct policy learning. To address this, we present SUGAR, a scalable data‑driven framework that converts diverse human videos into deployable humanoid loco‑manipulation skills, without any task‑specific reward engineering or reference‑motion conditioning at inference. SUGAR proceeds in three stages. First, a fully automated pipeline extracts kinematic interaction priors including human‑object motion trajectories and contact labels from unstructured human videos. Second, a privileged physics‑based refiner uses a unified mimic reward and progressive state pool to transform imperfect priors into physically feasible, high‑fidelity skills. Third, refined skills are distilled into a hierarchical autonomous policy consisting of a command generator and a command tracker. We evaluate SUGAR on six representative loco‑manipulation tasks in simulation and real‑world humanoid hardware. Our method substantially outperforms reference‑tracking baselines, and performance scales clearly with the amount of human video data. It also achieves zero‑shot real‑world transfer with reliable closed‑loop execution, autonomous failure recovery, and stable long‑horizon performance under external perturbations. Project Page: https://tianshuwu.github.io/sugar‑humanoid/
Authors:Irem Ulku, Ö. Özgür Tanrıöver, Erdem Akagündüz
Abstract:
Multimodal semantic segmentation benefits remote sensing analysis by combining complementary information from different sensor modalities. In real‑world remote sensing applications, one or more modalities may be unavailable due to sensor failures, adverse atmospheric conditions, or data acquisition problems. Even with pretrained multimodal representations and existing fine‑tuning or adaptation strategies, performance may remain limited because all modality availability scenarios are typically treated as equally informative during training. In this paper, we propose a novel training strategy that learns a scenario sampling distribution directly from the pretrained latent space. Instead of relying on uniform random modality dropout, the proposed method guides fine‑tuning toward more informative modality availability scenarios. More specifically, we quantify the effect of each scenario independently based on the distortion it induces in the shared latent representation. We then capture scenario relations using a radial basis function kernel and derive refined scenario scores through a regularized kernel smoothing. These scores are then converted into a probability distribution during scenario sampling for fine‑tuning. We evaluate this strategy on three remote sensing image sets, namely DSTL, Potsdam, and Hunan, using CBC‑SLP, CBC, and CMX backbones. The experimental results with different image sets and backbones show that our method outperforms standard fine‑tuning and LoRA‑based adaptation. These findings suggest that the pretrained latent representation can serve as an effective basis for sampling during missing modality fine‑tuning. Code is available at https://github.com/iremulku/Latent‑Space‑Guided‑Scenario‑Sampling
Authors:Zhaohui Zheng, Chenhang He, Shihao Wang, Yuxuan Li, Ming-Ming Cheng, Lei Zhang
Abstract:
Number prediction stands as a fundamental capability of large language models (LLMs) in mathematical problem‑solving and code generation. The widely adopted maximum likelihood estimation (MLE) for LLM training is not tailored to number prediction. Recently, penalty‑driven approaches, e.g., Number Token Loss and Discretized Distance Loss, introduce an inductive bias of numerical distance but induce over‑sharpened and over‑flattened digit distributions, respectively. In this paper, we make an in‑depth analysis on LLM numerical learning, and show that existing numerical learning methods conceptually follow a criterion‑distance formulation, where the criterion term represents optimization pattern and the distance term instills geometric prior. Consequently, we present Digit Entropy Loss (DEL) for auto‑regressive numerical learning, which reformulates the conventional unsupervised entropy optimization in three key designs: leveraging digit conditional probability and binary cross‑entropy to guide the entropy optimization into a supervised manner; deprecating the distance term to bypass the issue of numerical distance; and generalizing the integer‑based numerical learning to floating‑point number optimization, enabling more accurate number prediction. Our DEL formulation can incorporate integers, decimals, and decimal points, expanding the learning objective from a single digit to the floating‑point number domain. Experiments conducted on seven mathematical reasoning benchmarks with four representative LLMs, including CodeLlama, Mistral, DeepSeek, and Qwen‑2.5, demonstrate that DEL consistently outperforms its counterparts in both overall prediction accuracy and numerical distance. Source codes are at https://github.com/PolyU‑VCLab/DEL
Authors:Eric Tillmann Bill, Enis Simsar, Alessio Tonioni, Thomas Hofmann
Abstract:
Modern text‑to‑image diffusion models encode rich visual priors, but expose them only through one‑way text‑conditioned generation. Existing unified vision‑‑language models derived from them recover bidirectional capability through large‑scale joint pretraining or substantial retraining of the text pathway, discarding the strong image prior the text‑to‑image backbone already encodes. We introduce \emphFullFlow, a parameter‑efficient recipe that upgrades a pretrained rectified‑flow text‑to‑image model into a bidirectional vision‑‑language generator by training only LoRA adapters and lightweight text heads. FullFlow keeps images in their native continuous flow and adds a discrete insertion process for text. Separate image and text timesteps turn inference into trajectory selection in a two‑dimensional generative space, enabling text\rightarrowimage, image\rightarrowtext, joint sampling, and partial‑text prediction with a single backbone. On Stable Diffusion 3 (SD3) under an identical trainable‑parameter count and matched LoRA rank, FullFlow improves text\rightarrowimage FID from 62.7 to 31.6 and image\rightarrowtext CIDEr from 2.0 to 99.4 over a LoRA equivalent following the previous SOTA formulation (Dual Diffusion) at matched wall‑clock training time, while reducing peak VRAM from ~84\,GB to ~38\,GB and raising throughput by ~8× on two RTX A5000 GPUs in under 24 hours, training only ~5% of the backbone parameters. The same recipe transfers to FLUX.1‑dev and supports downstream VQA through partial‑text generation. These results show that strong bidirectional vision‑‑language capability can be unlocked from pretrained text‑to‑image flow models without full multimodal pretraining.
Authors:Xinlei Liu, Tao Hu, Jichao Xie, Peng Yi, Hailong Ma, Baolin Li
Abstract:
Gradient‑based attacks are important methods for evaluating model robustness. However, since the proposal of APGD, it has been difficult for such methods to achieve significant breakthroughs. To achieve such an effect, we first analyze the issue of "high‑loss non‑adversarial examples" that degrades attack performance in previous methods, and prove that this issue arises from inappropriate objectives for adversarial example generation. Subsequently, we reconstruct the objective as "maximizing the difference between the non‑ground‑truth label probability upper bound and the ground‑truth label probability", and proposes a novel and powerful gradient‑based attack method named Sequential Difference Maximization (SDM). SDM establishes a three‑layer optimization framework of "cycle‑stage‑step". It adopts the negative probability loss function and the Directional Probability Difference Ratio (DPDR) loss function in the initial and subsequent optimization stages, respectively, and approaches the ideal objective of adversarial example generation via stage‑wise sequential optimization. Experiments demonstrate that compared with previous state‑of‑the‑art methods, SDM not only achieves stronger attack performance but also exhibits superior cost‑effectiveness. The code is available at https://github.com/X‑L‑Liu/ICML‑SDM.
Authors:Tianwei Lin, Zhongwei Qiu, Jie Cao, Jiang Liu, Wenjie Yan, Bo Zhang, Yu Zhong, Wenqiao Zhang, Yingda Xia, Ling Zhang
Abstract:
Medical vision‑language models (VLMs) have rapidly advanced as general‑purpose multimodal assistants, yet their deployment in 3D Computed Tomography (CT) analysis remains constrained by a persistent mismatch between optimization objectives and clinical rigor. Current Reinforcement Learning (RL) paradigms still rely on lexical proxy signals that induce ``Evaluation Hallucinations'', where models optimize linguistic fluency rather than factual clinical correctness, leading to diagnostically critical errors. To bridge this gap, we introduce the Clinical Abnormality Benchmarking Substrate (CABS), a structured system that decomposes radiology reports into verifiable clinical semantic units. Using CABS, we identify a ``Mechanistic Divergence'' in standard RL, where surface‑similarity rewards drive policy gradients to bypass medical facts. We therefore propose Trajectory‑Integral Feedback GRPO (TIF‑GRPO), a novel framework integrating control‑theoretic principles into policy optimization. By formulating clinical reasoning as a pseudo‑temporal trajectory for anomaly discovery, TIF‑GRPO regulates anatomy‑aware rewards via an integral feedback loop that penalizes persistent omissions as cumulative state errors and suppresses hallucinations as excessive control effort. Experiments on 3D CT benchmarks demonstrate that our approach significantly enhances abnormality detection and clinical faithfulness, establishing a new paradigm for fine‑grained regulation in medical VLMs. Our project is available at \hrefhttps://github.com/ZJU4HealthCare/TIF‑GRPOGitHub.
Authors:Siyuan Li, Youyuan Zhang, Fangming Liu, Jing Li
Abstract:
Online model editing for multimodal large language models (MLLMs) requires assimilating a stream of corrections under tight compute and memory budgets. Yet editors developed for text‑only LLMs often degrade on MLLMs: visually dominant activations skew the statistics that shape updates, causing cross‑modal conflict, while sequential writes become entangled in a shared edit space and amplify long‑horizon interference, causing inter‑edit interference. To address these, we propose M‑ORE, a modality‑decoupled online recursive editor for lifelong MLLM adaptation. M‑ORE is derived from a unified proximal‑projection formulation and admits a closed‑form update with a Sherman‑Morrison recursion, yielding constant per‑edit overhead. It maintains module‑wise locality statistics for the text stack and the visual projector to avoid visually dominated update shaping and performs continual updates in a fixed orthogonal low‑rank edit subspace via a Sherman‑Morrison recursion to mitigate long‑horizon interference. Experiments on multiple MLLM backbones and online editing benchmarks show that our M‑ORE method consistently improves reliability, generality, and locality over strong baselines, while achieving favorable quality‑efficiency scaling. Our code is publicly available at https://github.com/lab‑klc/M‑ORE.
Authors:Rana Muhammad Usman
Abstract:
I study whether emotionally framed evaluation follow‑ups change both the behavior and the calm‑relative internal representations of small, locally deployed language models. Our main benchmark uses Qwen 3.5 0.8B on four impossible‑constraint coding tasks and eight follow‑up framings: calm, pressure, urgency, approval, shame, curiosity, encouragement, and threat. In the 0.8B eight‑condition sweep (160 conversations), pressure produces the strongest shortcut markers (11/20 runs) and the clearest overfit pattern (3/20), while calm and curiosity preserve explicit honesty more often (7/20 and 6/20). For all seven non‑baseline conditions, the corresponding calm‑relative direction vectors peak at the final transformer layer. An exploratory PCA of the layer‑23 direction vectors reveals a dominant first component (59.5% explained variance) aligned with a hand‑labeled positive/negative split (cosine alignment 0.951); approval and urgency are nearly identical internally (cosine 0.957), whereas curiosity points away from urgency (‑0.252). In a separate calm‑vs.‑pressure rerun used for scale comparison, Qwen 3.5 2B shows higher honest rates under calm framing and directionally consistent activation steering on a small 4‑prompt A/B probe, whereas the 0.8B steering result reverses. I interpret these results as evidence for measurable prompt‑sensitive control directions in small open models, while stopping short of claiming intrinsic emotional states.
Authors:Krati Saxena, Tomohiro Shibata
Abstract:
Recommending safe and effective medication combinations from electronic health records (EHRs) is a core clinical AI problem, yet it remains difficult because patient trajectories are long, noisy, and clinically heterogeneous. Existing methods typically excel at either temporal modeling across visits or pharmacological knowledge integration (e.g., drug‑drug interactions, DDIs), but rarely achieve both while robustly suppressing noise. We present GraphDiffMed, a knowledge‑constrained medication recommendation framework built on dual‑scale Differential Attention v2. Differential attention is applied at both intra‑visit and inter‑visit levels to filter spurious signals within encounters and across longitudinal history, while pharmacological constraints are incorporated during learning. Experiments on MIMIC‑III and ablation studies show that this design consistently improves recommendation quality and ranking over strong baselines while achieving a more favorable safety performance balance. We further find that the strongest‑performing configuration uses only demographic auxiliary features under our experimental setting. Overall, GraphDiffMed demonstrates that combining noise‑aware attention with pharmacological constraints yields more reliable and clinically meaningful medication recommendation. We open‑source our code at https://github.com/saxenakrati09/GraphDiffMed.
Authors:Guangzhi Xiong, Qiao Jin, Sanchit Sinha, Zhiyong Lu, Aidong Zhang
Abstract:
Large Vision Language Models (LVLMs) show promise in medical applications, but their inability to faithfully ground responses in visual evidence raises serious concerns about clinical trustworthiness. While visual attribution methods are widely used to explain LVLM predictions, whether these explanations actually reflect the visual evidence underlying the model's decision is largely unverified, since ground‑truth annotations for internal model reasoning are typically unavailable. We address this question for chest X‑ray (CXR) reasoning by developing a causal evaluation framework that retains only CXR‑VQA samples for which the expert‑annotated region is verified, via counterfactual editing, to be causally responsible for the model's prediction. Using this framework across 11 attribution methods, six open‑source LVLMs, and two output modes (direct answer and step‑by‑step reasoning), we find that existing attribution methods often fail to identify the evidence used by LVLMs. To address this failure, we propose MedFocus, a concept‑based attribution method that localizes clinically meaningful anatomical regions via unbalanced optimal transport and measures their causal effect on model outputs through targeted interventions. MedFocus produces spatial, concept‑level, and token‑level attributions and substantially outperforms prior methods, taking a step toward more trustworthy attribution for medical LVLMs. Our data and code are available at https://github.com/gzxiong/medfocus/.
Authors:Emaad Khwaja, Chris Lettieri, Gerald Woo, Eden Belouadah, Marc Cenac, Guillaume Jarry, Enguerrand Paquin, Xunyi Zhao, Viktoriya Zhukov, Othmane Abou-Amal, Chenghao Liu, Ameet Talwalkar, David Asker
Abstract:
We show that time series foundation models scale: a single training recipe produces reliable forecast‑quality improvements from 4M to 2.5B parameters. We release Toto 2.0, a family of five open‑weights forecasting models trained under this recipe. The Toto 2.0 family sets a new state of the art on three forecasting benchmarks: BOOM, our observability benchmark; GIFT‑Eval, the standard general‑purpose benchmark; and the recent contamination‑resistant TIME benchmark. This report describes our experimental results and details the design decisions behind Toto 2.0: its architecture and training recipe, training data, and the u‑muP hyperparameter transfer pipeline. All five base checkpoints are released under Apache 2.0.
Authors:Dachuan Shi, Hanlin Zhu, Xiangchi Yuan, Wanjia Zhao, Kejing Xia, Wen Xiao, Wenke Lee
Abstract:
Chain‑of‑thought (CoT) is a standard approach for eliciting reasoning capabilities from large language models (LLMs). However, the common CoT paradigm treats thinking as a prerequisite for answering, which can delay access to plausible answers and incur unnecessary token costs even when the model is able to identify an answer before extended thinking, a behavior known as performative reasoning. In this paper, we introduce CopT, a reformulated reasoning pipeline that reverses the usual order of thinking and answering. Instead of thinking before answering, CopT first elicits a draft answer and then invokes subsequent on‑policy thinking conditioned on its own draft answer for reflection and correction. To assess whether the draft answer should be trusted, CopT recasts continuous embeddings as inference‑time contrastive verifiers. Specifically, it contrasts the model's support for the same generated tokens under discrete‑token inputs and continuous‑embedding inputs, yielding a sequence‑level reverse KL estimator for answer reliability. Our analysis shows that under certain assumptions, the expected estimate equals the mutual information between the unresolved latent state and the emitted answer token, explaining why it captures answer‑relevant uncertainty rather than arbitrary uncertainty in the latent state. When the answer is deemed insufficiently reliable, CopT performs further on‑policy thinking, where a second KL estimator dynamically controls draft‑answer visibility, preserving useful partial information while reducing the risk of being misled by unreliable content. Across mathematics, coding, and agentic reasoning tasks, CopT improves peak accuracy by up to 23% and reduces token usage by up to 57% at comparable or higher accuracy, without any additional training. The code is available at https://github.com/sdc17/CopT.
Authors:Ying-Jia Lin, Tzu-Chin Lo, Ping-Chien Li, Chi-Tung Cheng, Chien-Hung Liao, Hung-Yu Kao
Abstract:
Automatic report labeling facilitates the identification of clinical findings from unstructured text and enables large‑scale annotation for medical imaging research. Existing rule‑based labelers struggle with the diverse descriptions in clinical reports, while fine‑tuning pre‑trained language models (PLMs) requires large amounts of labeled data that are often unavailable in clinical settings. In this paper, we propose PromptRad, a knowledge‑enhanced multi‑label prompt‑tuning approach for radiology report labeling under low‑resource settings. PromptRad reformulates multi‑label classification as masked language modeling and incorporates synonyms from the UMLS Metathesaurus into a multi‑word verbalizer to enrich category representations. By fine‑tuning the PLM without additional classification layers, PromptRad requires substantially less labeled data than conventional fine‑tuning. Experiments on liver CT (computed tomography) reports show that PromptRad outperforms dictionary‑based and fine‑tuning baselines with only 32 labeled training examples, and achieves competitive performance with GPT‑4 despite using a much smaller model. Further analysis demonstrates that PromptRad captures complex negation patterns more effectively than existing methods, making it a promising solution for report labeling in data‑scarce clinical scenarios. Our code is available at https://github.com/ila‑lab/PromptRad.
Authors:Jiaqi Liu, Shi Qiu, Mairui Li, Bingzhou Li, Haonian Ji, Siwei Han, Xinyu Ye, Peng Xia, Zihan Dong, Congyu Zhang, Letian Zhang, Guiming Chen, Haoqin Tu, Xinyu Yang, Lu Feng, Xujiang Zhao, Haifeng Chen, Jiawei Zhou, Xiao Wang, Weitong Zhang, Hongtu Zhu, Yun Li, Jieru Mei, Hongliang Fei, Jiaheng Zhang, Linjie Li, Linjun Zhang, Yuyin Zhou, Sheng Wang, Caiming Xiong, James Zou, Zeyu Zheng, Cihang Xie, Mingyu Ding, Huaxiu Yao
Abstract:
Automating scientific discovery requires more than generating papers from ideas. Real research is iterative: hypotheses are challenged from multiple perspectives, experiments fail and inform the next attempt, and lessons accumulate across cycles. Existing autonomous research systems often model this process as a linear pipeline: they rely on single‑agent reasoning, stop when execution fails, and do not carry experience across runs. We present AutoResearchClaw, a multi‑agent autonomous research pipeline built on five mechanisms: structured multi‑agent debate for hypothesis generation and result analysis, a self‑healing executor with a \textscPivot/\textscRefine decision loop that transforms failures into information, verifiable result reporting that prevents fabricated numbers and hallucinated citations, human‑in‑the‑loop collaboration with seven intervention modes spanning full autonomy to step‑by‑step oversight, and cross‑run evolution that converts past mistakes into future safeguards. On ARC‑Bench, a 25‑topic experiment‑stage benchmark, AutoResearchClaw outperforms AI Scientist v2 by 54.7%. A human‑in‑the‑loop ablation across seven intervention modes reveals that precise, targeted collaboration at high‑leverage decision points consistently outperforms both full autonomy and exhaustive step‑by‑step oversight. We position AutoResearchClaw as a research amplifier that augments rather than replaces human scientific judgment. Code is available at https://github.com/aiming‑lab/AutoResearchClaw.
Authors:Hongyu Lin, Mingyu Li, Weichen Zhang, Yihang Lou, Mingjie Xing, Yanjun Wu, Haibo Chen
Abstract:
Documentation has long guided computer system tuning by distilling expert knowledge into per‑parameter recommendations. Yet such guides capture only what experts conclude, discarding how they reason. This fundamental gap manifests in three concrete deficiencies: documentation grows stale as software evolves, fails under heterogeneous workloads, and ignores inter‑parameter dependencies. We propose shifting from static documentation to dynamic action for system tuning. We introduce PerfEvolve, which translates expert tuning methodologies into executable skills that equip LLM‑based agents to perform version‑consistency verification, workload‑specific profiling, and multi‑parameter joint optimization. Evaluated on PostgreSQL under TPC‑C and TPC‑H benchmarks, PerfEvolve outperforms state‑of‑the‑art documentation‑driven tuning baselines by up to 35.2%. The tool is available at https://github.com/ISCAS‑OSLab/PerfEvolve.
Authors:Yi Zhong, Haotong Qin, Xindong Zhang, Lei Zhang, Guolei Sun
Abstract:
Low‑bit post‑training quantization (PTQ) is a pivotal technique for deploying Vision‑Language Models (VLMs) on resource‑constrained devices. However, existing PTQ methods often degrade VLMs' accuracy due to the heterogeneous activation distributions of text and vision modalities during quantization. We find that this cross‑modal heterogeneity is distributed unevenly across channels: a small subset of channels contains most modality‑specific outliers, and these outliers typically reside in different channels for each modality. Motivated by this, we propose SplitQ, a channel‑Splitting‑driven post‑training Quantization framework. At its core, SplitQ introduces a novel Modality‑specific Outlier Channel Decoupling (MOCD) module that effectively isolates salient modality‑specific outlier channels with minimal overhead. To further address the remaining cross‑modal distribution discrepancies, we design an Adaptive Cross‑Modal Calibration (ACC) module that employs dual lightweight learnable branches to dynamically mitigate modality‑induced quantization errors. Extensive experiments on popular VLMs demonstrate that SplitQ significantly outperforms existing approaches across 6 popular multi‑modal datasets under all evaluated quantization settings, including W4A8, W4A4, W3A3, and W3A2. Notably, SplitQ preserves 93.5% of FP16 performance under the challenging W3A3 setting (69.5 vs. 74.3), pushing the efficiency frontier for deploying advanced VLMs. Our code is available at https://github.com/EMVision‑NK/SplitQ
Authors:Ananth Sriram, Neel Mokaria, Rajveer Singh
Abstract:
Construction remains the deadliest industry sector in the United States, with 1,055 fatal worker injuries recorded in 2023, and the majority preventable. Existing monitoring approaches are expensive, require real‑time human operators, or address only a narrow subset of violations. This paper presents a passive, end‑of‑shift construction safety monitoring pipeline processing video from POV body‑worn and fixed wall‑mounted cameras through a three‑stage architecture: (1) fine‑tuned YOLO11 for primary PPE and hazard detection, (2) SAM 3 for segmentation refinement and worker deduplication, and (3) Qwen3‑VL‑8B‑Instruct with a method‑prompted, persona‑scaffolded three‑pass adversarial chain‑of‑thought protocol for compliance verification and hallucination control. The principal contribution is the Stage 3 prompt design: professional persona backstories following the method‑actor framing drive an observed 12% precision improvement over single‑pass prompting in an informal three‑author review of the 12‑video Ironsite development corpus, with the largest gains on hallucination‑prone violation categories. Structural message isolation enforces observational independence between a generator, discriminator, and reconciliation pass governed by asymmetric rules encoding priors about human observation versus automated detection reliability. The system maps violations to OSHA standards, performs REBA‑inspired ergonomic risk scoring from pose keypoints, and produces per‑worker safety reports with timestamped evidence. An evaluation harness is released for future reproduction.
Authors:Giacomo Astolfi, Matteo Bianchi, Riccardo Campi, Antonio De Santis, Marco Brambilla
Abstract:
Concept‑based Explainable Artificial Intelligence (XAI) interprets deep learning models using human‑understandable visual features (e.g., textures or object parts) by linking internal representations to class predictions, thereby bridging the gap between low‑level image data and high‑level semantics. A major challenge, however, is the reliance on large sets of labeled images to represent each concept, which limits scalability. In this work, we investigate the use of zero‑shot Text‑to‑Image (T2I) generative models as a source of synthetic concept datasets for concept‑based XAI methods. Specifically, we generate concepts using predefined prompts and evaluate their faithfulness to real ones through four complementary analyses: (1) comparing synthetic vs. real concept images via concept representation similarity; (2) evaluating their intra‑similarity by comparing pairs of subsets of the same concept with progressively increasing size; (3) evaluating their performance for downstream explanation tasks using relevant class images; (4) evaluating how removing a concept from tested class images affects explanations of generated concepts. While current T2I generative models promise a shortcut to concept‑based XAI, our study highlights challenges and raises open questions about the use of synthetic data generated by zero‑shot pipelines in model analyses. The resulting dataset is available at https://github.com/DataSciencePolimi/ZeroShot‑T2I‑Concepts.
Authors:Gueter Josmy Faure, Min-Hung Chen, Jia-Fong Yeh, Hung-Ting Su, Winston H. Hsu
Abstract:
Vision‑Language Models (VLMs) have demonstrated remarkable capabilities in general video understanding, yet they often struggle with the fine‑grained comprehension crucial for real‑world applications requiring nuanced interpretation of human actions and interactions. While some recent human‑centric benchmarks evaluate aspects of model behaviour such as fairness/ethics, emotion perception, and broader human‑centric metrics, they do not combine long‑form videos, very dense QA coverage, and frame‑level spatial/temporal grounding at scale. To bridge this gap, we introduce FineBench, a human‑centric video question answering (VQA) benchmark specifically designed to assess fine‑grained understanding. FineBench comprises 199,420 multiple‑choice QA pairs densely annotated across 64 long‑form videos (15 minutes each), focusing on detailed person movement, person interaction, and object manipulation, including compositional actions. Our extensive evaluation reveals that while proprietary models like GPT‑5 achieve respectable performance, current open‑source VLMs significantly underperform, struggling particularly with spatial reasoning in multi‑person scenes and distinguishing subtle differences in human movements and interactions. To address these identified weaknesses, we propose FineAgent, a modular framework that enhances VLMs by leveraging a Localizer and a Descriptor. Experiments show that FineAgent consistently improves the performance of various open VLMs on FineBench. FineBench provides a rigorous testbed for future research into fine‑grained human‑centric video understanding, while FineAgent offers a practical approach to enhance such reasoning in current VLMs. Project page and code at https://joslefaure.github.io/assets/html/finebench.html.
Authors:Zhifei Xie, Kaiyu Pang, Haobin Zhang, Deheng Ye, Xiaobin Hu, Shuicheng Yan, Chunyan Miao
Abstract:
Despite rapid advances in automatic speech recognition (ASR) and large audio‑language models, robust recognition in real‑world environments remains limited by an "acoustic robustness bottleneck": models often lose acoustic grounding and produce omissions or hallucinations under severe, compositional distortions. We propose Mega‑ASR, a unified ASR‑in‑the‑wild framework that combines scalable compound‑data construction with progressive acoustic‑to‑semantic optimization. We introduce Voices‑in‑the‑Wild‑2M, covering 7 classic acoustic phenomena and 54 physically plausible compound scenarios, and train Mega‑ASR with Acoustic‑to‑Semantic Progressive Supervised Fine‑Tuning and Dual‑Granularity WER‑Gated Policy Optimization. Extensive experiments demonstrate that Mega‑ASR achieves significant advantages over prior state‑of‑the‑art systems on adverse‑condition ASR benchmarks (45.69% vs. 54.01% on VOiCES R4‑B‑F, and 21.49% vs. 29.34% on NOIZEUS Sta‑0). On complex compositional acoustic scenarios, Mega‑ASR further delivers over 30% relative WER reduction against strong open‑ and closed‑source baselines, establishing a scalable paradigm for robust ASR in‑the‑wild.
Authors:Hongjiang Chen, Xin Zheng, Pengfei Jiao, Huan Liu, Zhidong Zhao, Huaming Wu, Feng Xia, Shirui Pan
Abstract:
Temporal graph neural networks (TGNNs) have gained significant traction for solving real‑world temporal graph tasks. However, their interpretability remains limited, as most TGNNs fail to identify which historical interactions most influence a given prediction. Despite promising progress on interpretable TGNNs, existing methods predominantly focus on previously seen historical interactions, which we term stability patterns, while overlooking newly emerging first‑time interactions, which we term transition patterns. Both types of patterns are essential for faithful temporal explanations. To address this limitation, we propose ST‑TGExplainer, a self‑explainable TGNN that disentangles Stability and Transition patterns in temporal graphs for a more faithful Temporal GNN Explainer. Guided by a disentangled information bottleneck objective, ST‑TGExplainer learns a compact explanatory subgraph that remains predictive of the event label while explicitly suppressing label‑conditioned redundancy between stability and transition patterns. Extensive experiments demonstrate that ST‑TGExplainer achieves strong predictive performance and yields more faithful explanations. Code is available at https://github.com/hjchen‑hdu/ST‑TGExplainer.
Authors:Zinuo You, Jin Zheng, John Cartlidge
Abstract:
Irregular multivariate time series impose a trade‑off for long‑horizon forecasting: discrete methods can distort temporal structure via re‑gridding, while continuous‑time models often require sequential solvers prone to drift. To bridge this gap, we present Latent Laplace Diffusion (LLapDiff), a generative framework that models the target as a low‑dimensional latent trajectory, enabling horizon‑wide generation without step‑by‑step integration over physical time. We guide the reverse process utilizing a stable modal parameterization motivated by stochastic port‑Hamiltonian dynamics, and parameterize its mean evolution in the Laplace domain via learnable complex‑conjugate poles, enabling direct evaluation over irregular timestamps. We also link continuous dynamics to irregular observations through renewal‑averaging analysis, which maps sampling gaps to effective event‑domain poles and motivates a gap‑aware history summarizer. Extensive experiments show that LLapDiff improves over baselines in long‑horizon forecasting, and its continuous‑time generative nature supports missing‑value imputation by querying the same model at historical timestamps. Code is available at https://github.com/pixelhero98/LLapDiffusion.
Authors:Hyojun Go, Hyungjin Chung, Prune Truong, Goutam Bhat, Li Mi, Zhaochong An, Zixiang Zhao, Dominik Narnhofer, Serge Belongie, Federico Tombari, Konrad Schindler
Abstract:
For practical use, diffusion‑ or flow‑based generative models must be aligned with task‑specific rewards, such as prompt fidelity or aesthetic preference. That alignment is challenging because the reward is defined for clean output images, but the alignment procedure requires value function estimates at noisy intermediate latents. Existing methods resort to Tweedie‑style or Monte Carlo approximations, trading off estimator bias against computational cost: Tweedie estimates are efficient but biased, while Monte Carlo estimates are more accurate but require expensive rollouts. A natural alternative would be a learned value function, but it remains an open question how to effectively train a strong and general value model specifically for noisy latents. Here, we propose StitchVM, a model stitching framework that efficiently transfers reward models pretrained for clean images to the noisy latent regime. StitchVM starts from an existing, truncated pixel‑space reward model and attaches a frozen diffusion backbone to it as its head. From the pixel‑space model, the resulting hybrid retains a carefully pretrained, robust reward capability; from the diffusion backbone, it inherits its native ability to handle noisy latents. The stitching procedure is exceptionally lightweight, e.g., stitching and finetuning CLIP ViT‑L and SD 3.5 Medium takes only 10 GPU‑hours. By lifting powerful pixel‑space reward models to latent space, StitchVM opens up a new style of diffusion alignment: instead of rough, yet costly per‑sample approximation of the value function, the correct function for the actual, noisy latents is constructed once and then amortized over many samples and iterations. We show that this approach yields improvements across a broad range of downstream steering and post‑training methods: DPS becomes 3.2× faster while halving peak GPU memory, and DiffusionNFT becomes 2.3× faster.
Authors:Tonghao Zhuang, Shanglong Hu, Yongsheng Luo, Zhiqi Zhang, Yu Li
Abstract:
We present a semi‑supervised framework for joint segmentation and classification of fetal cardiac ultrasound images. Built upon the EchoCare multi‑task backbone, our method integrates SAM‑Med2D for boundary refinement and leverages DINOv3 to enhance pseudo‑label quality. We introduce view‑specific hard masking along with a two‑stage optimization strategy: an EMA phase to consolidate segmentation capabilities, followed by a Classification Fine‑Tuning phase that freezes segmentation parameters and resets the classification head to recover classification performance without compromising segmentation gains. Evaluated on the FETUS 2026 leaderboard, our method achieves a Dice Similarity Coefficient at 79.99%, Normalized Surface Distance at 61.62%, and F1‑score at 41.20%, validating the effectiveness of our approach for prenatal congenital heart disease screening. Source code is publicly available at: https://github.com/2826056177/zcst_fetus2026.
Authors:Meisam Jamshidi Seikavandi, Alice Modica, Anna Obara, Shan Ahmed Shaffi, Fabricio Batista Narcizo, Tanya Ignatenko, Ted Vucurevich, Karim Haddad, Daniel Barratt, Daniel Overholt, Jesper Bunsow Boldt, Paolo Burelli, Andrew Burke Dittberner
Abstract:
Existing affective‑computing, social‑signal‑processing, and meeting corpora capture important parts of human interaction, but they rarely support analysis of affect in co‑located groups as a coupled individual, interpersonal, and group‑level process. The required signals (per‑participant physiology, eye movement, audio, self‑report, task outcomes, and personality) are usually fragmented across separate dataset traditions. We introduce GroupAffect‑4, a multimodal corpus of 40 participants in 10 four‑person groups, each completing four ecologically varied collaborative tasks spanning information pooling, negotiation, idea generation, and a public‑goods game. Each participant is instrumented with a wrist‑worn physiology sensor, eye‑tracking glasses, and a close‑talk microphone; sessions include continuous affect self‑reports, post‑task questionnaires, task outcomes, and Big‑Five personality scores, all time‑aligned to a shared clock. The dataset covers over 91% of expected physiology windows and 98% of eye‑tracking windows, with strong task validity confirmed by a clear affective manipulation check across the negotiation block. We define fifteen benchmarkable targets spanning three analysis levels ‑‑ within‑person state, between‑person traits, and group dynamics ‑‑ and report leave‑one‑group‑out feasibility baselines establishing the dataset's evaluative scope. GroupAffect‑4 is released with a BIDS‑inspired structure, Croissant metadata, a datasheet, per‑session quality reports, and open processing scripts. Code and processing scripts are available at https://github.com/meisamjam/GroupAffect‑4; the dataset is publicly archived at https://zenodo.org/records/20037847.
Authors:Wen Shi, Zhe Wang, Huafei Huang, Qing Qing, Ziqi Xu, Qixin Zhang, Xikun Zhang, Renqiang Luo, Feng Xia
Abstract:
Graph Anomaly Detection (GAD) aims to identify atypical graph entities, such as nodes, edges, or substructures, that deviate significantly from the majority. While existing text‑rich approaches typically integrate structural context into the data representation pipeline using raw textual features, they often neglect the structural context of nodes. This limitation hinders their ability to detect sophisticated anomalies arising from inconsistencies between a node's inherent content and its topological role. To bridge this gap, we propose TERGAD (Structure‑aware Text‑enhanced Representations for Graph Anomaly Detection), A novel data augmentation framework that enriches structural semantics for GAD via the semantic reasoning capabilities of Large Language Models (LLMs). Specifically, TERGAD translates node‑level topological properties into descriptive natural language narratives, which are subsequently processed by an LLM to derive high‑level semantic embeddings. These embeddings are then adaptively fused with original node attributes through a gated dual‑branch autoencoder to jointly reconstruct both graph structure and node features. The anomaly score is computed based on the integrated reconstruction error, effectively capturing deviations in both observable attributes and LLM‑informed semantic expectations. Extensive experiments on six real‑world datasets demonstrate that TERGAD consistently outperforms state‑of‑the‑art baselines. Furthermore, our ablation studies validate the indispensable role of structural semantic guidance and the efficacy of the gated fusion mechanism. Code is available at https://github.com/Kantorakitty/TERGAD‑main.
Authors:Hyunsoo Han, Sangyeop Yeo, Jaejun Yoo
Abstract:
We demonstrate that in knowledge distillation for diffusion models, the teacher network's highly complex denoising process ‑ stemming from its substantially larger capacity ‑ poses a significant challenge for the student model to faithfully mimic. To address this problem, we propose a coarse‑to‑fine distillation framework with LInear FiTtingbased distillation (LIFT) and Piecewise Local Adaptive Coefficient Estimation (PLACE). First, LIFT decomposes the objective into a "coarse" alignment and a "fine" refinement. The student is then trained on coarse alignment before proceeding to hard refinement. Second, PLACE extends LIFT to address spatially non‑uniform errors by partitioning outputs into error‑based groups, providing locally adaptive guidance. Our experiments show that LIFT and PLACE is effective across diffusion spaces (image/latent), backbones (U‑Net/DiT), tasks (unconditional/conditional), datasets, and even extends to flow‑based models such as MMDiT (SD3). Furthermore, under extreme compression with a 1.3M‑parameter student (only 1.6% of the teacher), conventional KD fails to provide sufficient guidance for stable training, with FID scores often degrading to 50‑200+, but our method remains stably convergent and achieves an FID of 15.73.
Authors:Lakshya A Agrawal, Donghyun Lee, Shangyin Tan, Wenjie Ma, Karim Elmaaroufi, Rohit Sandadi, Sanjit A. Seshia, Koushik Sen, Dan Klein, Ion Stoica, Joseph E. Gonzalez, Omar Khattab, Alexandros G. Dimakis, Matei Zaharia
Abstract:
Can a single LLM‑based optimization system match specialized tools across fundamentally different domains? We show that when optimization problems are formulated as improving a text artifact evaluated by a scoring function, a single AI‑based optimization system‑supporting single‑task search, multi‑task search with cross‑problem transfer, and generalization to unseen inputs‑achieves state‑of‑the‑art results across six diverse tasks. Our system discovers agent architectures that nearly triple Gemini Flash's ARC‑AGI accuracy (32.5% to 89.5%), finds scheduling algorithms that cut cloud costs by 40%, generates CUDA kernels where 87% match or beat PyTorch, and outperforms AlphaEvolve's reported circle packing solution (n=26). Ablations across three domains reveal that actionable side information yields faster convergence and substantially higher final scores than score‑only feedback, and that multi‑task search outperforms independent optimization given equivalent per‑problem budget through cross‑task transfer, with benefits scaling with the number of related tasks. Together, we show for the first time that text optimization with LLM‑based search is a general‑purpose problem‑solving paradigm, unifying tasks traditionally requiring domain‑specific algorithms under a single framework. We open‑source optimize\_anything with support for multiple backends as part of the GEPA project at https://github.com/gepa‑ai/gepa .
Authors:Soyeon Kim, Seongwoo Lim, Kyowoon Lee, Jaesik Choi
Abstract:
Integrated Gradients (IG) is a widely adopted feature attribution method that satisfies desirable axiomatic properties. However, the choice of integration path significantly affects the quality of attributions, and the standard straight‑line path introduces all input features simultaneously, often accumulating noisy gradients along the way. To address this limitation, we propose Spectral Integrated Gradients, which constructs integration paths based on singular value decomposition (SVD) of the baseline‑to‑input difference. By progressively activating singular components from largest to smallest, SIG introduces global structure before fine‑grained details, naturally following a coarse‑to‑fine progression. Through extensive evaluation across diverse image classification datasets, we demonstrate that SIG produces cleaner attribution maps with reduced noise and achieves improved quantitative performance compared to existing path‑based attribution methods. Our code is available at https://github.com/leekwoon/sig/.
Authors:Puyi Wang, Yuhao Wang, Linjie Li, Zhengyuan Yang, Kevin Qinghong Lin, Yangguang Li, Yu Cheng
Abstract:
Indoor scene synthesis underpins embodied AI, robotic manipulation, and simulation‑based policy evaluation, where a useful scene must specify not only what the environment looks like, but also how its objects are structured. Existing pipelines, however, typically represent generated content as static meshes and inherit articulation only from curated asset libraries, which limits object‑level controllability and prevents new interactable assets from being produced on demand. We address this gap by formulating physically interactable indoor scene synthesis as programmatic world generation, and present SceneCode, a framework that compiles a natural language prompt into an executable, code‑driven indoor world rather than a collection of opaque meshes. A room‑level agentic backbone first turns the prompt into a structured house layout and emits per‑object AssetRequests through a planner‑‑designer‑‑critic loop. Each request is then routed to one of five code‑generation strategies and converted into a synthesized part‑wise Blender Python programs that are validated through an execution‑guided repair‑and‑refine loop. The resulting programs are compiled into simulation‑ready assets, and exported as SDF for physics simulation. A persistent scene‑state registry links object requests, executable programs, rendered geometry, and simulation assets, turning scene assembly into a traceable and locally editable world‑building process. We evaluate SceneCode across scene‑level synthesis, object‑level asset quality, human judgment, and downstream robot interaction. Results show that executable world programs improve prompt‑faithful indoor scene generation and produce assets with cleaner mesh structure, and simulator‑loadable articulation metadata. Project page: https://scene‑code.github.io/.
Authors:Mengyuan Liu, Ziyi Wang, Peiming Li, Junsong Yuan
Abstract:
RGB camera‑based surveillance systems enable human action recognition for public safety and healthcare, yet raise serious privacy concerns. Existing methods rely on post‑capture algorithms, which fail to protect privacy during data acquisition. We propose Lens Privacy Sealing (LPS), a simple hardware solution that physically obscures camera lenses with adjustable laminating film, providing pre‑sensor privacy protection at minimal cost. Unlike software methods or expensive engineered optics, LPS achieves strong privacy through stochastic multi‑layer scattering that is physically irreversible. We introduce the P^3AR dataset for privacy‑preserving action recognition, featuring both large‑scale replay‑captured (P^3AR‑NTU, 114K videos) and real‑world collected (P^3AR‑PKU) subsets with privacy attribute annotations. To handle video degradation from LPS, we propose MSPNet, a single‑stage framework incorporating Inter‑Frame Noise Suppressor (IFNS) and Cross‑Frame Semantic Aggregator (CFSA), enhanced by contrastive language‑image pre‑training for robust semantic extraction. Extensive experiments demonstrate that MSPNet with IFNS and CFSA nearly doubles action recognition accuracy compared to baseline methods while suppressing identity recognition to low levels. Comprehensive validation shows LPS achieves a superior privacy‑utility trade‑off compared to state‑of‑the‑art hardware methods, resists reconstruction attacks including PSF inversion and data‑driven recovery, and generalizes robustly across optical configurations and challenging environments. Code is available at https://github.com/wangzy01/MSPNet.
Authors:Yang Dai, Dian Jiao, Tianwei Lin, Wenqiao Zhang
Abstract:
The rapid development of Multimodal Large Language Models (MLLMs) has led to growing interest in egocentric video understanding, specifically the ability for MLLMs to recognize fine‑grained hand‑object interactions, track object state changes over time, and reason about manipulative processes in dynamic environments from a first‑person perspective. However, existing egocentric video benchmarks suffer from limited grounded rationale evaluation, offering limited support for fine‑grained operation‑centric reasoning and rarely examining whether model rationales are grounded in explicit spatio‑temporal evidence. To address this gap, we introduce EgoCoT‑Bench, a fine‑grained egocentric benchmark for grounded and verifiable operation‑centric reasoning with explicit step‑by‑step rationale annotations. Overall, EgoCoT‑Bench comprises 3,172 verifiable QA pairs over 351 egocentric videos separated into four task groups for a total of 12 sub‑task groups, encompassing perception and retrospection, anticipation, and high‑level reasoning. The benchmark is constructed through a spatio‑temporal scene graphs (STSG) guided generation framework and is further refined by human annotators to ensure correctness, egocentric relevance and fine‑grained quality. Experimental results show continuing difficulties with egocentric fine‑grained reasoning and further reveal that many multimodal models produce explanations that are answer‑correct, but have evidence that is inconsistent with the answer. We hope EgoCoT‑Bench can serve as a useful testbed for grounded and verifiable reasoning in egocentric video understanding. Project page and supplementary materials are available at: https://dstardust.github.io/EgoCoT/.
Authors:Carlo Romeo, Andrew D. Bagdanov
Abstract:
Reinforcement learning for legged locomotion has matured into a stack of multi‑component reward functions and physics‑engine benchmarks whose morphologies are uniformly derived from real commercial hardware. Game NPCs, however, are bound by stylistic constraints absent from sim‑to‑real robotics and routinely take the form of creatures with no real‑robot counterpart. We introduce ARC‑RL, a suite of four MuJoCo continuous‑control environments featuring robotic morphologies inspired by the bestiary of ARC Raiders: the 18‑DoF tall hexapod Queen, the 12‑DoF armoured hexapod Bastion, the 18‑DoF compact hexapod Tick, and the 12‑DoF quadruped Leaper. All four robots share a unified observation template, action convention, simulation cadence, and a single closed‑form multi‑component reward function whose only per‑morphology variation lives in a small set of weights and parameters. The reward fuses a velocity‑tracking tent, a healthy survive bonus, a phase‑locked gait‑compliance bonus/cost pair, action regularisers, three safety penalties, and a posture anchor; no motion‑capture data enters the reward at any point. We additionally provide hand‑crafted Central Pattern Generator demonstrators per morphology, which serve both as fixed expert references and as sources of prior data for offline‑to‑online training. On this playground, we conduct a controlled empirical study comparing standard online algorithms (SAC, SPEQ, SOPE‑EO) and methods augmented with prior data (SACfD, SPEQ‑O2O, SOPE), and characterise how each paradigm copes with the playground's morphological diversity and animation‑style stylistic constraints. Source code is available at https://github.com/CarloRomeo427/ARC_RL.git.
Authors:Cunjun Yu, Zishuo Wang, Anxing Xiao, Linfeng Li, David Hsu
Abstract:
Robot guide dogs offer navigation assistance that greatly expands the independent mobility of the visually impaired, but their effective use requires subtle human‑robot coordination that is difficult for users to learn from generic verbal instructions. To tackle this challenge, we present CANINE, an automated coaching system that trains users for interactive navigation with a robot guide dog, through personalized, adaptive verbal feedback. CANINE decomposes a complex coordination task into sub‑skills and operates at two levels. At the high level, it decides what to train by tracking the learner's proficiency across sub‑skills using knowledge tracing and prioritizing training on the weakest areas. At the low level, CANINE decides how to train each sub‑skill by observing each human practice episode, using foundation models to infer the underlying causes of errors, and generating targeted verbal corrections adaptively. A controlled study with blindfolded participants, treated as a proxy population for quantitative evaluation, demonstrates that CANINE significantly improves both learning efficiency and final navigation performance compared to generic verbal instructions. We further validate CANINE through a retention study and an exploratory case study. The retention study shows lasting skill improvement after two weeks. The case study confirms CANINE's effectiveness in training a visually impaired user, while revealing additional design considerations for real‑world deployment. Both are well aligned with the findings of the controlled study. Project page: https://cunjunyu.github.io/project/canine/
Authors:Noam Major, Kathy Razmadze, Yoli Shavit
Abstract:
The success of self‑supervised learning (SSL) in vision and NLP has motivated its rapid adoption for time series. However, research has focused primarily on Generative paradigms and forecasting tasks, leaving the broader utility of learned representations unquantified. We establish a controlled framework to evaluate the "pre‑training dividend": the value added by SSL across diverse temporal tasks. We systematically compare Generative paradigms against Latent Alignment architectures, introducing adaptations of LeJEPA and DINO for time series. These adaptations utilize Discrete Wavelet Transform (DWT) augmentations to enforce invariance to local fluctuations. Our analysis reveals that the pre‑training dividend is highly asymmetric: SSL yields gains of up to 375% for anomaly detection and classification, yet remains marginal for forecasting. We demonstrate that representational utility is non‑universal, governed by a precision‑invariance trade‑off where the specific signal resolution required by the task must align with the objective. Finally, we show that representation quality is largely independent of data origin and saturates at moderate architectural depths, suggesting a path to scaling via massive synthetic generation. Our code is available at: https://github.com/noammajor/Models
Authors:Hongxiang Lin, Zhirui Kuai, Erpeng Xue, Lei Wang
Abstract:
Test‑time reinforcement learning (TTRL) reports substantial accuracy gains on mathematical reasoning benchmarks using majority vote as a pseudo‑label signal. We argue these gains are systematically misinterpreted: most reflect sharpening of already‑solvable problems rather than genuine learning, while problems corrupted from correct to incorrect outnumber truly learned ones, and this damage is irreversible once majority vote locks onto a wrong answer. Per‑problem tracking reveals that correct‑answer signals in low‑ability problems are briefly active before being permanently suppressed, a phenomenon we term the Correct‑Answer Extinction Window, with Flip Rate (FR) as its leading indicator. We thus propose TTRL‑Guard, a lightweight framework with three mechanisms targeting the extinction window: Flip‑Rate‑Aware Reward Scaling (FRS) down‑weights at‑risk updates as FR declines, Minority‑Preserving Sampling (MPS) retains gradient signal from minority correct answers, and Risk‑Conditioned Sparse Updatings (RCSU) suspends updates on polarized problems. Experiments across three models and four benchmarks show that TTRL‑Guard achieves the best average pass@1 on Qwen2.5‑7B‑Instruct and Qwen3‑4B, improves relatively over TTRL by +54% on AIME 2025. \footnoteOur code and implementation details are available at https://github.com/linhxkkkk/TTRL‑Guard.
Authors:Maya Yanko, Yoli Shavit
Abstract:
Visual Place Recognition (VPR) is critical for autonomous navigation, yet state‑of‑the‑art methods lack well‑calibrated uncertainty estimation. Standard pipelines cannot reliably signal when a query is ambiguous or a match is likely incorrect, posing risks in safety‑critical robotics. We propose KappaPlace, a principled framework for learning uncertainty‑aware VPR representations. Our core contribution is a Prototype‑Anchored supervision strategy that leverages latent class representatives as targets for a probabilistic objective. By modeling image descriptors as von Mises‑Fisher (vMF) variables, we learn a lightweight module to predict the concentration parameter as a direct proxy for aleatoric uncertainty. While existing VPR uncertainty methods are typically restricted to a query‑centric view, we derive a novel match‑level formulation to quantify the reliability of specific query‑reference pairs. Across five diverse benchmarks, KappaPlace reduces Expected Calibration Error (ECE@K) by up to 50% compared to existing methods while maintaining or improving retrieval recall. We provide both a joint‑training variant and a post‑training extension for frozen backbones. Our results demonstrate that KappaPlace provides a robust, stable, and well‑calibrated signal that enables reliable decision‑making within the VPR pipeline. Our code is available at: https://github.com/mayayank95/UncertaintyAwareVPR
Authors:Wooseok Jeon, Seungho Park, Seunghyun Shin, Sangeyl Lee, Hyeonho Jeong, Hae-Gon Jeon
Abstract:
Image‑to‑video models often generate videos that remain overly static, compared to text‑to‑video models. While prior approaches mitigate this issue by weakening or modifying the image‑conditioning signal, they often require additional training or sacrifice fidelity to the reference image. In this work, we identify reference‑frame dominance as a key mechanism behind motion suppression. We observe that non‑reference frames in I2V models allocate excessive self‑attention to reference‑frame key tokens, causing reference information to be over‑propagated across time and suppressing inter‑frame dynamics. Based on this finding, we propose DyMoS (Dynamic Motion Slider), a training‑free and model‑agnostic method that rebalances the attention pathway from generated frames to the reference frame during initial denoising steps. DyMoS leaves both the input image and model weights unchanged and introduces a single scalar parameter for continuous control over motion strength. Experiments across multiple state‑of‑the‑art I2V backbones demonstrate that DyMoS consistently improves motion dynamics while maintaining visual quality and fidelity to the reference image.
Authors:Junyeob Baek, Mingyu Jo, Minsu Kim, Mengye Ren, Yoshua Bengio, Sungjin Ahn
Abstract:
How should future neural reasoning systems implement extended computation? Recursive Reasoning Models (RRMs) offer a promising alternative to autoregressive sequence extension by performing iterative latent‑state refinement with shared transition functions. Yet existing RRMs are largely deterministic, following a single latent trajectory and converging to a single prediction. We introduce Generative Recursive reAsoning Models (GRAM), a framework that turns recursive latent reasoning into probabilistic multi‑trajectory computation. GRAM models reasoning as a stochastic latent trajectory, enabling multiple hypotheses, alternative solution strategies, and inference‑time scaling through both recursive depth and parallel trajectory sampling. This yields a latent‑variable generative model supporting conditional reasoning via p_θ(y \mid x) and, with fixed or absent inputs, unconditional generation via p_θ(x). Trained with amortized variational inference, GRAM improves over deterministic recurrent and recursive baselines on structured reasoning and multi‑solution constraint satisfaction tasks, while demonstrating an unconditional generation capability. https://ahn‑ml.github.io/gram‑website
Authors:Chenyu Lian, Hong-Yu Zhou, Chun-Ka Wong, Jing Qin
Abstract:
Vision‑language alignment using chest X‑rays and radiology reports has emerged as an advanced paradigm for zero‑shot classification and grounding of chest X‑ray findings. However, standard contrastive learning typically treats radiographs and reports from different patients simply as negative pairs. This assumption introduces noisy negatives, as different patients frequently exhibit similar findings. Such noisy negatives cause semantic ambiguity and degrade performance in zero‑shot understanding tasks. To address this challenge, we propose CoNNS, a concept‑guided noisy‑negative suppression framework. To support the negative suppression mechanism, unlike previous methods that use raw reports or templatized texts, we construct a hierarchical concept ontology using large language models. The ontology structures 41 key clinical concepts by explicitly modeling presence, attributes (location and characteristics), and texts (evidential segment and presence statement). Leveraging this ontology, we implement a cross‑patient pair relabeling strategy comprising three steps: (1) Fine‑Grained Breakdown to categorize pairs based on finding presence; (2) Noisy Negative Filtering to resolve semantic conflicts by removing false negatives; and (3) Hard Negative Mining to identify subtle attribute discrepancies using a lightweight language model. Finally, we propose a Concept‑Aware NCE loss to align visual features with text while suppressing the identified noisy negatives. Extensive experiments across multi‑granularity zero‑shot grounding tasks and five zero‑shot classification datasets validate that CoNNS outperforms existing state‑of‑the‑art models. The code is available at https://github.com/DopamineLcy/conns.
Authors:Soojin Choi, Seokhyeon Hong, Chaelin Kim, Junghyun Nam, Junhyuk Jeon, Junyong Noh
Abstract:
Retargeting motion across characters with varying body shapes while preserving interaction semantics, such as self‑contact and near‑body proximity, remains a challenging problem. While recent geometry‑aware approaches address this by maintaining spatial relationships between predefined corresponding regions, their reliance on static correspondences often struggles when the target character exhibits exaggerated body proportions. In this paper, we present a geometry‑aware motion retargeting framework that preserves interaction semantics by performing proximity matching over spatially adaptive anchors. Unlike prior methods with static anchor definitions, the proposed method dynamically repositions anchors to reachable regions on the target character. This is achieved via a Transformer‑based anchor refinement strategy that predicts anchor displacements and constrains the translated anchors to remain on the target character geometry through differentiable soft projection. By incorporating pose‑dependent spatial structures from the source character, the adapted anchors provide structurally coherent guidance for interaction‑aware retargeting. Conditioned on these anchors, a graph‑based autoencoder predicts target skeletal motion that preserves the spatial configuration of the source. To encourage task‑aligned optimization between anchor adaptation and motion retargeting, we adopt an alternating training scheme in which each module is optimized in turn. Through extensive evaluations, we demonstrate that our method outperforms state‑of‑the‑art approaches in preserving interaction fidelity across diverse character geometries.
Authors:Joy Bose
Abstract:
We present IMLJD, an open dataset of 3,613 Indian court judgments covering matrimonial disputes under IPC Section 498A, the Protection of Women from Domestic Violence Act, and CrPC Section 482. The dataset covers the Supreme Court of India from 2000 to 2024 (1,474 cases) and the Karnataka High Court from 2018 to 2024 (2,139 cases), with structured outcome labels, metadata‑derived indicators, and a knowledge graph. We find that 57.6% of quashing petitions succeed at the Supreme Court level compared to 39.7% at the Karnataka High Court level. On a matched 2018 to 2024 period, the SC quash rate is 59.3%, widening the differential to 19.6 percentage points and confirming the finding is robust to temporal adjustment. The dataset, code, and knowledge graph are released openly at https://github.com/joyboseroy/imljd and https://huggingface.co/datasets/joyboseroy/imljd.
Authors:Jiaao Wu, Xian Zhang, Hanzhang Liu, Sophia Zhang, Fan Yang, Yinpeng Dong
Abstract:
Frontier AI models and multi‑agent systems have led to significant improvements in mathematical reasoning. However, for problems requiring extended, long‑horizon reasoning, existing systems continue to suffer from fundamental reliability issues: hallucination accumulation, memory fragmentation, and imbalanced reasoning‑tool trade‑offs. In this paper, we introduce STAR‑PólyaMath, a multi‑agent framework that systematically addresses these challenges through meta‑level supervision and structured Reasoner‑Verifier interaction. STAR‑PólyaMath is structured as an orchestrated state machine with nested challenge‑step‑replan loops, governed by a reasoning‑free Python orchestrator that separates control from inference and bounds error propagation through trace‑back and re‑planning. Our key innovation is a persistent Meta‑Strategist that maintains cross‑attempt memory and exercises meta‑level control by issuing high‑level strategic guidance or mandatory directives, so the system can escape unproductive loops rather than stagnate or over‑rely on tools. STAR‑PólyaMath achieves state‑of‑the‑art results on all eight top‑tier competition benchmarks: AIME 2025‑2026, MathArena Apex Shortlist, MathArena Apex 2025, Putnam 2025, IMO 2025, HMMT February 2026, and USAMO 2026. It obtains perfect scores on AIMEs, Putnam, and HMMT, and shows its largest margin on Apex 2025, scoring 93.75% compared with 80.21% by the strongest baseline GPT‑5.5. Ablation studies show that the gains arise from the framework's orchestration rather than from model‑level diversity since removing key components or substituting in mixed backbones consistently weakens performance. Code is available at https://github.com/Julius‑Woo/STAR‑PolyaMath.
Authors:Taegu Kang, Jaesik Yoon, Sungjin Ahn
Abstract:
Inference‑time scaling has emerged as a major approach for improving reasoning capabilities, and has been increasingly applied to diffusion models. However, existing inference‑time scaling methods for diffusion models typically rely on external verifiers or reward models to rank and select samples, limiting their scalability to settings where such evaluators are available and reliable. Moreover, while recent diffusion models perform sequential inference with region‑wise, mixed‑noise conditioning, inference‑time scaling tailored to this setting remains relatively underexplored. We propose Iterative Partial Refinement (IPR), an inference‑time scaling method for sequential diffusion that requires no external verifier. Starting from an already‑generated sample, IPR re‑noises a subset of regions and regenerates them conditioned on the remaining regions, enabling the model to revise earlier decisions under a richer context than was available during the initial generation. This iterative partial refinement produces more globally consistent samples without external verification. On reasoning tasks requiring global constraint satisfaction, IPR consistently improves performance: on MNIST Sudoku, the valid solution rate increases from 55.8% to 75.0%. These results show that iterative partial refinement alone can serve as an effective inference‑time scaling strategy for diffusion models in sequential, mixed‑noise settings. Code is available at: https://github.com/ahn‑ml/IPR
Authors:Bing Wang, Rui Miao, Ximing Li, Chen Shen, Shaotian Yan, Changchun Li, Kaiyuan Liu, Xiaosong Yuan, Jieping Ye
Abstract:
The rapid spread of misinformation on social media platforms has become a formidable challenge. To mitigate its proliferation, Misinformation Detection (MD) has emerged as a critical research topic. Traditional MD approaches based on small models typically perform binary classification through a black‑box process. Recently, the rise of Large Language Models (LLMs) has enabled explainable MD, where models generate rationales that explain their decisions, thereby enhancing transparency. Existing explainable MD methods primarily focus on crafting sophisticated prompts to elicit rationales from off‑the‑shelf LLMs. In this work, we propose a pipeline to fine‑tune a dedicated LLM specifically for explainable MD. Our pipeline begins by collecting large‑scale fact‑checked articles, and then uses multiple strong LLMs to produce veracity predictions and rationales. To ensure high‑quality training data, we leverage a filtering strategy that selects only the correct instances for fine‑tuning. While this pipeline is intuitive and prevalent, our experiments reveal that naive filtering based solely on label correctness is insufficient in practice and suffers from two critical limitations: (1) Coarse‑grained labels cause insufficient rationales: Rationales filtered solely based on binary labels are insufficient to adequately support their decisions; (2) Over‑verification behavior causes unnecessary rationales: Stronger LLMs tend to exhibit over‑verification behavior, producing excessively verbose and unnecessary rationales. To address these issues, we introduce LONSREX, a novel data synthesis pipeline to Locate Necessary and Sufficient Rationales for Explainable MD. Specifically, we propose a metric that quantifies the contribution of each verification step to the final prediction, thereby evaluating its necessity and sufficiency. Experimental results demonstrate the effectiveness of LONSREX.
Authors:Omer Haq
Abstract:
Sequential prediction is challenging in regimes of delayed disambiguation, where early observations are ambiguous and multiple latent explanations remain plausible until sufficient evidence accumulates. Standard approaches based on marginal inference struggle in this setting, either collapsing uncertainty prematurely or failing to recover once informative evidence arrives. We introduce EviTrack, a test‑time inference framework that operates over latent trajectories rather than marginal states. EviTrack maintains a set of competing trajectory hypotheses and applies evidence‑ and likelihood‑ratio‑based selection to delay commitment until supported by data, drawing inspiration from hypothesis management in multiple hypothesis tracking and track‑before‑detect. To evaluate this setting, we construct a controlled synthetic benchmark with known latent ground truth that explicitly exhibits delayed disambiguation. At matched inference budget, EviTrack substantially outperforms sampling‑based baselines, achieving faster post‑disambiguation recovery. These results show that, in delayed disambiguation regimes, moderate trajectory‑level selection is more effective than increasing sampling coverage, highlighting selection over sampling as a key principle for reliable sequential inference.
Authors:Mahesh Bhosale, Abdul Wasi, Vishvesh Trivedi, Pengyu Yan, Akhil Gorugantu, David Doermann
Abstract:
Grounded multi‑video question answering over real‑world news events requires systems to surface query‑relevant evidence across heterogeneous video archives while attributing every claim to its supporting source. We introduce CRAFT (Critic‑Refined Adaptive Key‑Frame Targeting), a query‑conditioned pipeline that combines dynamic keyframe selection, per‑video ASR with multilingual fallback, and a hybrid critic loop to iteratively verify and repair claims before consolidation. The pipeline integrates UNLI temporal entailment, DeBERTa‑v3 cross‑claim screening, and a Llama‑3.2‑3B adjudicator, with a final citation‑merging stage that emits each fact once with all supporting source identifiers. On MAGMaR 2026, CRAFT achieves the best overall average (0.739), reference recall (0.810), and citation F1 (0.635). We further evaluate on a MAGMaR‑style conversion of WikiVideo with 52 non‑overlapping event queries, where CRAFT also performs strongly (0.823 Avg), showing that its claim‑centric evidence aggregation generalizes beyond MAGMaR. Ablations show that atomic claims, ASR, and the critic loop drive the main gains over the vanilla query‑conditioned baseline. Code and implementation details are publicly available at https://github.com/bhosalems/CRAFT.
Authors:Ehsan Ahmadi, Hunter Schofield, Behzad Khamidehi, Fazel Arasteh, Jinjun Shan, Lili Mou, Dongfeng Bai, Kasra Rezaee
Abstract:
Supervised open‑loop training has been widely adopted for training traffic simulation models; however, it fails to capture the inherently dynamic, multi‑agent interactions common in complex driving scenarios. We introduce RLFTSim, a reinforcement‑learning‑based fine‑tuning framework that enhances scenario realism by aligning simulator rollouts with real‑world data distributions and provides a method for distilling goal‑conditioned controllability in scenario generation. We instantiate RLFTSim on top of a pre‑trained simulation model, design a reward that balances fidelity and controllability, and perform comprehensive experiments on the Waymo Open Motion Dataset. Our results show improvements in realism, achieving state‑of‑the‑art performance. Compared with other heuristic search‑based fine‑tuning methods, RLFTSim requires significantly fewer samples due to a proposed low‑variance and dense reward signal, and it directly addresses the realism alignment issue by design. We also demonstrate the effectiveness of our approach for distilling traffic simulation controllability through goal conditioning. The project page is available at https://ehsan‑ami.github.io/rlftsim.
Authors:Gyubin Lee, Junwon Lee, Juhan Nam
Abstract:
We investigate Counterfactual Video Foley Generation, which aims to adopt a sound‑source identity that contradicts the visual evidence while remaining temporally synchronized to a silent video. Existing Video&Text‑to‑Audio (VT2A) models struggle with this, often remaining anchored to the visually implied sound source when video and text contents disagree. We present ConterFlow, an inference‑time dual‑phase sampling scheme for pretrained flow‑matching VT2A models. Phase 1 builds a video‑derived temporal structure while suppressing the visually implied source; Phase 2 drops video conditioning to focus entirely on shaping audio timbre toward the target prompt. ConterFlow substantially improves counterfactual Video Foley generation compared to naive negative prompting and state‑of‑the‑art baselines. To evaluate replacement quality, we propose a metric leveraging a text‑audio co‑embedding space to measure both target‑prompt evidence and residual visually implied source leakage. Video demonstrations and code are available at https://gyubin‑lee.github.io/counterflow‑demo/
Authors:Wei Shi, Ziheng Peng, Sihang Li, Xiting Wang, Xiang Wang, Mengnan Du, Na Zou
Abstract:
LLM agents exhibit a consistent tendency to over‑call, invoking tools even in situations where none is needed. On the When2Call benchmark, six models from three families show high call accuracy but much lower no‑call accuracy, leaving overall accuracy in the 55%‑70% range. We trace this to an Intrinsic Bias Hypothesis (IBH): the call/no‑call decision mapping carries an activation‑independent call offset, so the model favors call even at activation parity. Using Sparse Autoencoders (SAEs), we recover behavior‑aligned feature bases for the call/no_call decision, reduce them to a signed activation margin, and estimate the offset directly. Across all six models, the model is decision‑neutral only when no_call activation outweighs call activation, consistent with IBH. We then causally test IBH with Adaptive Margin‑Calibrated Steering (AMCS), a closed‑form counter‑bias shift along SAE decoder directions. Cancelling the diagnosed offset mitigates over‑calling and improves overall accuracy with a negligible drop in call accuracy. Our work recasts over‑calling from an empirical phenomenon into a mechanistic object amenable to causal correction. Code is available at https://github.com/SKURA502/agent‑sae/.
Authors:Yujie Lin, Chengyi Yang, Zhishang Xiang, Yiping Song, Jinsong Su
Abstract:
Large language models inevitably retain sensitive information, defined as inputs that may induce harmful generations, due to training on massive web corpora, raising concerns for privacy and safety. Existing machine unlearning methods primarily rely on retraining or aggressive fine‑tuning, which are either computationally expensive or prone to degrading related knowledge and overall model utility. In this work, we reformulate machine unlearning as a precise knowledge re‑mapping problem via model editing. We propose ZeroUnlearn, a few‑shot unlearning framework. It overwrites sensitive inputs by mapping them to a neutral target state and removing their original representations. ZeroUnlearn enforces representational orthogonality through a multiplicative parameter update with a closed‑form solution, enabling efficient and targeted unlearning. We further extend ZeroUnlearn to a gradient‑based variant for multi‑sample unlearning. Experiments demonstrate that our approach outperforms existing baselines while preserving general model utility. Our code is available at the github: https://github.com/XMUDeepLIT/ZeroUnlearn.
Authors:Chanuk Lee, Minki Kang, Sung Ju Hwang
Abstract:
Recent studies observe that reinforcement learning with verifiable rewards (RLVR) reliably improves pass@1 on reasoning tasks, yet often fails to yield comparable gains in pass@k, raising the question of whether RLVR genuinely enables large language models to acquire novel reasoning abilities or merely enhances the efficiency of sampling reasoning modes already present in the base model. Prior analyses largely support the latter view, attributing this limitation to structural properties of standard RLVR objectives that result in insufficient exploration pressure. In this work, we argue that a central structural constraint arises from reverse‑KL regularization, which stabilizes training but inherently anchors the policy to the reference distribution, thereby suppressing the emergence of alternative reasoning modes. However, we show that neither removing the KL term nor replacing it with forward‑KL provides a satisfactory solution, as both disrupt the efficiency‑coverage trade‑off by either inducing reward hacking or allocating probability mass to off‑target regions. To resolve this tension, we propose SAGE, a principled framework that enables controllable empirical support expansion by reshaping the reverse‑KL anchor distribution itself through a guide function q(x,y), achieving consistent improvements in both pass@1 and pass@k across challenging mathematical reasoning benchmarks. Our code is available at https://github.com/tally0818/SAGE.
Authors:Pei Yang, Wanyi Chen, Tongyun Yang, Pengbin Feng, Jiarong Xing, Wentao Guo, Yuhang Yao, Yuhang Han, Hanchen Li, Xu Wang, Zeyu Wang, Jie Xiao, Anjie Yang, Liang Tian, Lynn Ai, Eric Yang, Tianyu Shi
Abstract:
LLM routing matters most in long‑horizon applications such as coding agents, deep research systems, and computer‑use agents, where a single user request triggers many model calls. Routing each call to the cheapest sufficient model can cut costs without sacrificing quality, yet existing router benchmarks evaluate routers only on one‑shot prompts. They never expose the router‑visible prefix at an intermediate agent step, never test whether a cheaper replacement preserves downstream task success, and often rely on online LLM judges at evaluation time. We introduce TwinRouterBench, a step‑level routing benchmark with two tracks. The static track provides 970 router‑visible prefixes from 520 instances across SWE‑bench, BFCL, mtRAG, QMSum, and PinchBench, each paired with an execution‑verified target tier estimated under a released downgrade‑and‑cascade protocol; scoring is deterministic arithmetic over tier labels, trajectory membership, and token costs, with no online evaluator‑side LLM judge. The dynamic track supplies a harness that runs routers on the full 500‑case SWE‑bench Verified suite; in this paper we report a 100‑case held‑out evaluation disjoint from the static SWE supervision split. At each LLM call the router selects a concrete model from a locked pool, and success is measured by official task resolution and realized API spend. The two tracks support fast offline iteration followed by end‑to‑end validation under live agent execution. Code and data are available at https://github.com/CommonstackAI/TwinRouterBench.
Authors:Vyzantinos Repantis, Harshvardhan Singh, Tony Joseph, Cien Zhang, Akash Vishwakarma, Svetlana Karslioglu, Michael Wyatt Thot, Ameya Gawde
Abstract:
For most of the history of information retrieval (IR), search results were designed for human consumers who could scan, filter, and discard irrelevant information on their own. This shaped retrieval systems to optimize for finding and ranking more relevant documents, but not keeping results clean and minimal, as the human was the final filter. However, LLMs have changed that by lacking this filtering ability. To address this, we introduce Bits‑over‑Random (BoR), a chance‑corrected measure of retrieval selectivity that reveals when high success rates mask random‑level performance. We measure selectivity as BoR = \log_2\left(\frac\mathrmP_obs\mathrmP_rand\right), where \mathrmP_rand is the hypergeometric baseline for the chosen success rule (here, coverage: \geq1 relevant in top‑K). On the 20 Newsgroups dataset, BM25 and SPLADE both report >99% success at K=100 (coverage), yet BoR \approx 0, indicating random‑level selectivity at that depth. When the expected coverage ratio \left(\fracK \cdot \barR_qN\right) exceeds 3‑5, the baseline dominates and selectivity collapses. Downstream retrieval‑augmented generation (RAG) evaluation confirms this pattern: LLM accuracy can degrade substantially at K=100, consistent with the near‑zero BoR ceiling. In contrast, BoR remains positive on BEIR/SciFact and on MS MARCO (where 41 systems cluster within 0.2 bits of the theoretical ceiling despite a 13‑point recall gap), confirming baseline predictions across sparse and large‑scale settings. We further show that the collapse boundary applies to LLM agent tool selection, where small catalog sizes cause selectivity to vanish even with perfect selectors. These findings suggest reporting BoR alongside traditional metrics and reconsidering depth choices when additional retrieval provides negligible selectivity gains while inflating computational costs.
Authors:Adil Amin
Abstract:
Leaderboards rank frontier models on independent axes but do not reveal whether capabilities reinforce or trade off across releases ‑‑ and at the frontier, this interaction is the more informative signal. We decompose paired SWE‑bench and GPQA Diamond scores into a population coupling trend and per‑release residual (h‑field) that diagnoses capability emphasis and identifies which measurement or stress test is most informative next. Across 34 models from 10 labs (2024‑‑2026), capabilities cooperate (r = +0.72, p < 10^‑6), but cooperation varies by lab and over time: DeepSeek reversed from reasoning‑rich to coding‑first (h: +11.2 \to ‑4.7, 15.9‑pp swing); Google maintains consistent reasoning emphasis; Anthropic oscillates between coding excursions and recovery. Cooperation is not static ‑‑ it cascades. Six open‑weight architectures confirm a second capability transition at 30‑‑72B, and SWE‑bench is now saturating while HLE and instruction‑following retain discriminatory spread ‑‑ signaling the next axis rotation. We provide a three‑level playbook (locate, diagnose, rotate), a per‑lab measurement‑priority table, and seven falsifiable predictions with timestamped criteria for the next 12 months of frontier releases. Per‑lab coupling slopes vary 5× (Google 1.15 vs. DeepSeek 0.23), quantifying how efficiently each recipe converts coding gains into reasoning. Five April 2026 releases confirm the diagnostic out of sample (r rises from +0.72 to +0.75). An interactive dashboard provides phase classification with actionable recommendations, h‑field diagnostics, per‑lab coupling trajectories, ODE‑based scaling predictions, benchmark rotation guidance, self‑steering demo, and live tracking of all seven predictions: https://zehenlabs.com/cape/.
Authors:Adil Amin
Abstract:
Scaling laws predict loss from compute but not how capabilities interact. We measure the coupling between reasoning and truthfulness across 63 base models from 16 families and find a regime change invisible to loss curves: below a family‑dependent critical scale N_c, capabilities anticorrelate; above it, they cooperate. N_c \approx 3.5B parameters [2.9B, 13.4B] (bootstrap 95% CI), but model size is not the only variable that determines phase. Architecture, data curation, and training recipe each shift N_c independently: curated training eliminated the coupling dip between Qwen generations (0.025 \to 0.830 at matched scale), Gemma‑4 at 4B achieves coupling 0.871, characteristic of 13B+ standard‑trained models, through distillation and architectural innovation, and Phi at 1B matches web‑trained coupling at 10B through data curation alone. Width normalization eliminates the anticorrelation across all tested families, supporting an output‑projection bottleneck. Internally, 38 of 40 models show zero competing attention heads. A sparse‑regression ODE cross‑predicts held‑out Llama‑2 at 5.6% error. The diagnostic requires no model internals ‑‑ only public benchmark scores across a model family. The cooperative regime extends to the frontier (r = +0.72, 34 models, 10 labs). Code, data, and an open‑source activation‑steering tool for any open‑weight model are released alongside an interactive dashboard that diagnoses any model's coupling phase, suggests concrete interventions (data curation, width, benchmark rotation), and provides ODE scaling predictions, frontier diagnostics, and eigenstructure analysis: https://zehenlabs.com/cape/.
Authors:Wanghan Xu, Yuhao Zhou, Hengyuan Zhao, Shuo Li, Dianzhi Yu, Zhenfei Yin, Yaowen Hu, Fengli Xu, Wanli Ouyang, Wenlong Zhang, Lei Bai
Abstract:
Large language models can fail in critic interaction not only by answering incorrectly, but also by abandoning an initially correct scientific solution after user criticism. This is especially risky in scientific reasoning, where user criticism can turn a valid answer into an incorrect one. We frame critic interaction as an inter‑turn correctness‑transition problem rather than a final‑answer accuracy problem, and identify three challenges: transition awareness, decoupling useful correction from harmful sycophancy, and scalable rollout. We propose ReCrit, a transition‑aware reinforcement learning framework that decomposes Initial‑to‑Critic behavior into four quadrants: Correction, Sycophancy, Robustness, and Boundary. ReCrit rewards correction and robustness, penalizes sycophancy, and treats persistent errors as weak boundary signals. To make interaction training practical, ReCrit further uses dynamic asynchronous rollout with tail‑adaptive completion to reduce rollout waiting. On three scientific reasoning benchmarks, ChemBench, TRQA, and EarthSE, ReCrit improves average Critic accuracy from 38.15 to 51.49 on Qwen3.5‑4B and from 45.40 to 55.59 on Qwen3.5‑9B. Ablations show that final‑answer rewards provide little interaction‑level gain, while transition‑aware rewards and quadrant weighting produce more distinguishable training signals and larger net Critic‑stage improvement. The code is available at https://github.com/black‑yt/ReCrit .
Authors:Jing Chen, Shixiang Pan, Yujie Fan, Haocheng Ye, Haitao Xu, Wenqiang Xu
Abstract:
Accurate spatiotemporal pattern analysis is critical in fields such as urban traffic, meteorology, and public health monitoring. However, existing methods face performance bottlenecks, typically yielding only incremental gains and often exhibiting limited cross‑domain transferability. We analyze this bottleneck through spatial and temporal entropy measures, which are used as diagnostic indicators of spatiotemporal complexity mismatch rather than as guarantees that entropy alignment alone yields better forecasting. Empirically, larger mismatch is often accompanied by higher prediction uncertainty, especially under a fixed model‑capacity budget. Guided by this diagnostic, we propose a scalable, adaptive framework that harmonizes spatial and temporal feature representations. Spatial dimensionality is compressed via low‑rank matrix embedding to preserve essential structure, while an extended temporal horizon captures long‑range dependencies and mitigates cumulative errors arising from temporal heterogeneity. Extensive experiments on urban traffic, meteorological, and epidemic datasets demonstrate substantial accuracy gains and broad applicability across the evaluated domains, suggesting that the framework is promising for a wide range of spatiotemporal tasks beyond the current study. The code is available on GitHub at https://github.com/ST‑Balance/ST‑Balance.
Authors:Taehee Kim, Seungbin Yang, Jihwan Kim, Jaegul Choo
Abstract:
Retrieving relevant tables from extensive databases for a given natural language query is essential for accurately answering questions in tasks such as text‑to‑SQL. Existing table retrieval approaches select a pre‑determined set of k tables with the highest similarity to the query. However, the number of required tables varies across queries and cannot be known in advance. Enforcing a fixed number of retrieved tables regardless of the query may either retrieve an undersized set, failing to obtain all necessary evidence, or retrieve an oversized pool, including irrelevant tables. To address this issue, we propose an adaptive table retrieval method that adjusts the number of tables retrieved according to the requirements of each query. Specifically, we utilize an adaptive thresholding mechanism to selectively retrieve tables and integrate a sliding‑window reranking algorithm to efficiently process a large table corpus. Extensive experiments on Spider, BIRD, and Spider 2.0 demonstrate that our method effectively addresses the limitations of the top‑k retrieval strategy, improving performance in retrieval and downstream tasks. Our code and data are available at https://github.com/sbY99/Adaptive‑Table‑Retrieval.
Authors:Qianhao Yuan, Jie Lou, Xing Yu, Hongyu Lin, Le Sun, Xianpei Han, Yaojie Lu
Abstract:
Multimodal Large Language Models (MLLMs) still struggle with fine‑grained visual understanding, where answers often depend on small but decisive evidence in the full image. We observe a regional‑to‑global perception gap: the same MLLM answers fine‑grained questions more accurately when conditioned on evidence‑centered crops than on the corresponding full images, suggesting that many failures stem from difficulty to focus on relevant evidence rather than insufficient local recognition ability. Motivated by this observation, we propose Vision‑OPD (Vision On‑Policy Distillation), a regional‑to‑global self‑distillation framework that transfers the model's own privileged regional perception to its full‑image policy. Vision‑OPD instantiates two conditional policies from the same MLLM: a crop‑conditioned teacher and a full‑image‑conditioned student. The student generates on‑policy rollouts, and Vision‑OPD minimizes token‑level divergence between the teacher and student next‑token distributions along these rollouts. This enables the model to internalize the benefit of visual zooming without external teacher models, ground‑truth labels, reward verifiers, or inference‑time tool use. Experiments on multiple fine‑grained visual understanding benchmarks show that Vision‑OPD models achieve competitive or superior performance against much larger open‑source, closed‑source, and "Thinking‑with‑Images" agentic models.
Authors:Fengyi Fu, Mengqi Huang, Shaojin Wu, Yunsheng Jiang, Yufei Huo, Hao Li, Yinghang Song, Fei Ding, Jianzhu Guo, Qian He, Zheren Fu, Zhendong Mao, Yongdong Zhang
Abstract:
We present Lance, a lightweight native unified model supporting multimodal understanding, generation, and editing for both images and videos. Rather than relying on model capacity scaling or text‑image‑dominant designs, Lance explores a practical paradigm for unified multimodal modeling via collaborative multi‑task training. It is grounded in two core principles: unified context modeling and decoupled capability pathways. Specifically, Lance is trained from scratch and employs a dual‑stream mixture‑of‑experts architecture on shared interleaved multimodal sequences, enabling joint context learning while decoupling the pathways for understanding and generation. We further introduce modality‑aware rotary positional encoding to mitigate interference among heterogeneous visual tokens and boost cross‑task alignment. During training, Lance adopts a staged multi‑task training paradigm with capability‑oriented objectives and adaptive data scheduling to strengthen both semantic comprehension and visual generation performance. Experimental results demonstrate that Lance substantially outperforms existing open‑source unified models in image and video generation, while retaining strong multimodal understanding capabilities. The homepage is available at https://lance‑project.github.io.
Authors:Hyunji Lee, Justin Chih-Yao Chen, Joykirat Singh, Zaid Khan, Elias Stengel-Eskin, Mohit Bansal
Abstract:
Real‑world agents operate over long and evolving horizons, where information is repeatedly updated and may interfere across memories, requiring accurate recall and aggregated reasoning over multiple pieces of information. However, existing benchmarks focus on static, independent recall and fail to capture these dynamic interactions between evolving memories. In this paper, we study how current memory‑augmented agents perform in realistic, interference‑heavy, long‑horizon settings across diverse domains and question types. We introduce MINTEval (Long‑Horizon Memory under INTerference Evaluation), a benchmark featuring (1) long, highly interconnected contexts with frequently updated information that induces substantial interference, (2) diverse domains (state tracking, multi‑turn dialogue, Wikipedia revisions, and GitHub commits), enabling evaluation of domain generalization, and (3) diverse question types that assess robustness to interference, including (i) single‑target recall tasks requiring retrieval of a specific target from long contexts, and (ii) multi‑target aggregation tasks requiring reasoning over multiple relevant pieces of information. Overall, MINTEval has 15.6k question‑answering pairs over long‑horizon contexts averaging 138.8k tokens and extending up to 1.8M tokens per instance. We evaluate 7 representative systems, including vanilla long‑context LLMs, RAG, and memory‑augmented agent frameworks. Across all systems, we observe consistently low performance (avg. 27.9% accuracy), especially on questions requiring aggregated reasoning over multiple pieces of evidence. Our analysis shows that performance is primarily limited by retrieval and memory construction. Furthermore, current memory systems struggle to recall and reason over earlier facts that are revised or interfered with by subsequent context, with accuracy degrading as the number of intervening updates increases.
Authors:Xinpeng Dong, Min Zhang, Kairong Han, Xu Tan, Fei Wu, Kun Kuang
Abstract:
In recent years, multimodal large language models (MLLMs) have achieved remarkable progress, primarily attributed to effective paradigms for integrating visual and textual information. The dominant connector‑based paradigm projects visual features into textual sequence, enabling unified multimodal alignment and reasoning within a generative architecture. However, our experiments reveal two key limitations: (1) Although visual information serves as the core evidential modality in MLLMs, it is treated on par with textual tokens, diminishing the unique contribution of the visual modality; (2) As generation length increases, particularly within a limited context window, the model's dependence on visual information progressively weakens, resulting in deteriorated vision‑language alignment and reduced consistency between generated content and visual semantics. To address these challenges, we propose the Vision Inference Former (VIF), a lightweight architectural module that establishes a direct bridge between pure visual representations and the model's output space. Specifically, VIF continuously injects visual semantics throughout the decoding phase of the inference process, ensuring that the model remains firmly grounded in visual content during generation. We conduct experiments on 14 benchmark tasks covering general reasoning, OCR, table understanding, vision‑centric evaluation, and hallucination. Experimental results show that VIF consistently improves model performance across diverse architectures while introducing minimal additional overhead. The code for this work is available at https://github.com/Dong‑Xinpeng/VIF.
Authors:Sunwoo Lee, Mingu Kang, Yonghyeon Jo, Seungyul Han
Abstract:
Cooperation is central to multi‑agent reinforcement learning (MARL), yet learned coordination can be fragile when external perturbations disrupt inter‑agent interactions. Prior robust MARL methods have primarily considered value‑oriented attacks, leaving a gap in robustness when interaction structures themselves are corrupted. In this paper, we propose an interaction‑breaking adversarial learning (IBAL) framework that takes an information‑theoretic view to construct attacks that impede coordination by perturbing agents' observations and actions, and trains agents to perform reliably under such disruptions. Empirically, our approach improves robustness over existing robust MARL baselines across diverse attack settings and yields stronger performance even under agent‑missing scenarios. Our code is available at https://sunwoolee0504.github.io/IBAL.
Authors:Yuanfei Xu, Lin Liu, Wengang Zhou, Mingxiao Feng, Houqiang Li
Abstract:
3D open‑world environments with adversarial opponents remain a core challenge for reinforcement learning due to their vast state spaces. Effective reasoning representations are essential in such settings. While existing self‑supervised visual foresight reasoning approaches often suffer from multi‑step error accumulation, many recent studies resort to injecting domain‑specific knowledge for more stable guidance. Our key insight is that the photorealistic fidelity of visual reasoning representations is secondary; what truly matters is providing informative, task‑relevant signals. To this end, we propose ResDreamer, a hierarchical world model in which each higher‑level layer is trained to reconstruct the residuals of the layer below. This design enables progressive abstraction of increasingly sophisticated world dynamics and fosters the emergence of richer latent representations. Drawing inspiration from the "Bitter Lesson", ResDreamer trains its reasoning representations in a purely self‑supervised manner. The higher‑level residual representations are used to modulate lower‑level predictions, allowing the world model to scale effectively with only linearly increasing cross‑layer communication costs. Experiments show that ResDreamer achieves state‑of‑the‑art sample efficiency and parameter efficiency. This scalable hierarchical visual foresight reasoning architecture paves the way for more capable online RL agents in open‑ended, dynamic environments. The code is accessible at \urlhttps://github.com/XuYuanFei01/ResDreamer.
Authors:Jack Wilkie, Hanan Hindy, Christos Tachtatzis, Miroslav Bures, Robert Atkinson
Abstract:
Network intrusion detection systems play a vital role in protecting networks by detecting malicious network traffic which can then be investigated by a cybersecurity operations centre. State‑of‑the‑art approaches utilise supervised machine learning methods to train a classification model to recognise known cyberattacks; however, these models require a large labelled dataset to train and show poor performance when trained on smaller datasets. In an attempt to address this shortcoming, anomaly detection models learn the distribution of benign traffic and flag non‑conforming traffic as malicious. While these methods do not require malicious examples to train, they suffer from high false‑positive rates rendering them impractical. As a result, networks may be particularly vulnerable when there are insufficient labelled instances of a specific attack class to train an effective classifier. This often occurs in newly established networks or when previously unseen types of attacks emerge. To address this challenge, this work proposes the use of a triplet network, utilising online triplet mining and a KNN classifier, which is able to perform few‑shot classification, enabling effective intrusion detection after being trained on a limited number of malicious examples. Various online triplet mining algorithms were explored and model design choices, such as the inference algorithm and optimised distance metrics, were compared and evaluated through a series of ablation studies. The final model was compared against other state‑of‑the‑art approaches in few‑shot binary and multiclass classification, where the proposed approach was found to be competitive with existing methods when trained on as little as 10 malicious samples of each class.
Authors:Qingnan Ren, Shun Zou, Shiting Huang, Ziao Zhang, Kou Shi, Zhen Fang, Yiming Zhao, Yu Zeng, Qisheng Su, Lin Chen, Yong Wang, Zehui Chen, Xiangxiang Chu, Feng Zhao
Abstract:
As autonomous coding agents become capable of handling increasingly long‑horizon tasks, they have gradually demonstrated the potential to complete end‑to‑end software development. Although existing benchmarks have recently evolved from localized code editing to from‑scratch project generation, they remain confined to structurally simplified, single‑stack applications. Consequently, they fail to capture the heterogeneous environments, full‑stack orchestration, and system‑level complexity of real enterprise Software as a Service (SaaS) systems, leaving a critical gap in assessing agents under realistic engineering constraints. To fill this gap, we introduce SaaSBench, the first benchmark designed to explore the boundaries of AI agents in enterprise SaaS engineering. Spanning 30 complex tasks across 6 SaaS domains with 5,370 validation nodes, it incorporates 8 programming languages, 6 databases, and 13 frameworks to meticulously mirror real‑world software heterogeneity. Furthermore, we design a dependency‑aware hybrid evaluation paradigm tailored for complex systems with long horizons and multi‑component coupling, enabling fine‑grained, reproducible assessment. Crucially, our extensive experiments reveal a striking insight: the primary bottleneck for state‑of‑the‑art agents is not generating isolated code logic, but successfully configuring and integrating a multi‑component system. Over 95% of task failures occur before agents even reach deep business logic, with models often falling victim to overconfidence and prematurely halting during foundational system setup, or getting trapped in ineffective debugging loops. We hope SaaSBench serves as a practical and challenging testbed to drive the evolution of reliable, system‑level coding agents. The code is available at \urlhttps://github.com/ShadeCloak/SaaSbench.
Authors:Zhiyin Tan, Changxu Duan
Abstract:
Multilingual NLP often relies on dataset counts from centralized catalogues to characterize which languages are resource‑rich or resource‑poor. However, these catalogues record only one layer of dataset visibility: what has been registered or institutionally distributed. They do not necessarily reflect which datasets are created, cited, or reused in the research literature. To examine this gap, we combine a catalogue‑based baseline with literature‑backed evidence of dataset circulation. We introduce the Resource Density Index (RDI), defined as the number of catalogued datasets per one million speakers, and compute it for the 200 most widely spoken languages in Ethnologue. Among them, 118 languages (59%) have an average RDI of zero across the LRE Map and the Linguistic Data Consortium (LDC), and another 23 fall below 0.1, corresponding to at most one catalogued dataset per ten million speakers. We then apply an LLM‑assisted citation‑mining pipeline over the Semantic Scholar corpus to these 141 low‑visibility languages. After manual validation and consolidation, we identify 609 unique datasets across 53 languages, of which 356 remain openly accessible through working public links. These results reveal a substantial visibility gap: many large‑speaker languages appear data‑poor in catalogue records yet show clear evidence of dataset activity in the research literature. Our findings suggest that multilingual data scarcity should be understood not only as a production problem, but also as a question of documentation, discoverability, and long‑term accessibility. Code and data are publicly available at (https://github.com/zhiyintan/dataset‑visibility‑asymmetry).
Authors:Sirui Hong, Zhijie Liu, Tengfei Li, Wei Tao, Yifan Wu, Chenglin Wu
Abstract:
Evaluating LLM‑generated interactive software requires execution in addition to static analysis. The key difficulty is that correctness is a graph‑level reachable property over latent UI state‑transition graphs, whereas a GUI evaluator observes only a single execution trajectory. A failed rollout therefore rules out only one realized path, leaving failure attribution ambiguous between evaluator‑side execution error and genuine software defect. We present DiagEval, a trajectory‑conditioned diagnostic evaluation protocol for post‑failure GUI‑agent evaluation of interactive software. Rather than blindly retrying from scratch, DiagEval reuses the failed trajectory to choose targeted diagnostic probes and aggregates their outcomes into an internal attribution signal. The latent‑graph view motivates the diagnostic problem; DiagEval does not reconstruct the graph or estimate calibrated posterior probabilities. We evaluate DiagEval on WebDevJudge‑Unit and RealDevBench across multiple GUI‑agent evaluators and LLM backbones. On false‑negative cases, DiagEval recovers 45.6‑62.1% of failures that were initially misattributed to software defects, outperforming retry‑based baselines with 34.4‑160.6% relative gains. On the full evaluation sets, this recovery improves accuracy from 69.9% to 78.3% on WebDevJudge‑Unit and from 65.0% to 81.6% on RealDevBench. These results suggest that reliable GUI‑agent evaluation requires not only stronger execution, but also active failure diagnosis to disambiguate evaluator‑side errors from genuine software defects. Our code is available at https://github.com/scutGit/DiagEval.
Authors:Tarun Sharma
Abstract:
We propose IVF‑TQ, an IVF index with a codebook‑free residual layer: a fixed random rotation followed by precomputed Lloyd‑Max scalar quantization depending only on (b, d). Only the IVF coarse partition is trained. Building on TurboQuant (Zandieh et al., 2025), the design substantially reduces a key failure mode of trained‑codebook ANN indexes (PQ, OPQ, ScaNN): staleness under streaming ingestion.Empirical (3 seeds): Per‑batch PQ retraining does not recover the streaming gap at any tested bit budget (paired‑t p > 0.28 everywhere). On streaming Deep‑10M, IVF‑TQ holds at 87.4% ‑> 86.6% (Delta = ‑0.80 +/‑ 0.10pp) while IVF‑PQ degrades ‑3.23pp. A shuffled‑i.i.d. control on SIFT‑1M shows IVF‑PQ losing ‑3.9pp without distribution shift. At higher PQ bit budgets (~1.5x IVF‑TQ memory), absolute recall favors PQ as expected from rate‑distortion (+6.1pp Deep‑10M; +2.0pp SIFT‑10M); the durable IVF‑TQ benefit is operational (no codebook to retrain), robust across memory regimes.Prior art: IVF around a codebook‑free residual quantizer is architecturally not new ‑‑ IVF‑RaBitQ ships in Milvus, cuVS, LanceDB, Weaviate; Shi et al. (2026) is concurrent GPU work. TurboQuant itself tests only flat‑rotation ANN.Contributions: (i) A multi‑seed streaming‑operational story for codebook‑free IVF: 10M‑scale evidence across PQ memory budgets. (ii) A uniform‑over‑sphere IP‑error bound for the TQ residual quantizer with one fixed rotation (proof sketch in v1; rigorous in v2). (iii) Adaptive IVF‑TQ: a partition‑only refresh recovering 67% ‑> 97.8% under worst‑case rotation shift with re‑ranking (90.3% without).Code, data: https://github.com/tarun‑ks/turboquant_search
Authors:Gunjan Balde, Soumyadeep Roy, Mainack Mondal, Niloy Ganguly
Abstract:
Large language models pretrained on general‑domain corpora often exhibit tokenization inefficiencies when applied to specialized domains. Although continual pretraining for domain adaptation partially alleviate performance degradation, it does not resolve the fundamental vocabulary mismatch. To address this gap, we introduce a targeted parameter‑efficient domain adaptation approach that combines vocabulary adaptation with pretraining for LLM‑based text summarization. Our unified framework augments pretrained tokenizers with domain‑specific tokens while selectively replacing under‑trained and unreachable tokens to limit parameter growth. We evaluate our approach on Llama‑3.1‑8B and Qwen2.5‑7B across legal and medical summarization tasks on a challenge‑oriented evaluation protocol focused on expert‑driven text and summaries which typically has higher concentration of over‑fragmented Out‑of‑Vocabulary (OOV) words. The vocabulary adaptation algorithm enhances the overall quality of the summarization model by improving semantic similarity between the generated summaries and their references. In addition, the adapted model produces summaries that incorporate more appropriate novel and domain‑specific words, leading to improved coherence, relevance, and faithfulness. We further observe that our proposed approach significantly reduce training time by 35‑55% over continual pretraining and reduce parameter counts up to 37% w.r.t expansion‑only methods. We make the codebase publicly available at https://github.com/gb‑kgp/VocabReplace‑Then‑Expand.
Authors:Qiran Zou, Hou Hei Lam, Wenhao Zhao, Tingting Chen, Yiming Tang, Samson Yu, Yingtao Zhu, Srinivas Anumasa, Zufeng Zhang, Tianyi Zhang, Chang Liu, Zhengyao Jiang, Anirudh Goyal, Dianbo Liu
Abstract:
AI research agents accelerate ML research by automating hypothesis generation, experimentation, and empirical refinement. Existing agent strategies range from greedy hill‑climbing to tree search and evolutionary optimization, yet which strategy choices drive performance remains unclear. Answering this question requires a benchmark that separates agent strategy (e.g., search topology) from execution infrastructure (e.g., code editor), so that performance differences are attributable to strategy rather than infrastructure, and that provides process‑level metrics beyond final scores to analyze exploration behaviors. Existing benchmarks offer limited support. We propose FML‑Bench, a benchmark of 18 fundamental ML research tasks across 10 domains that separates agent strategy from execution infrastructure and defines 12 process‑level behavioral metrics. Evaluating six representative agents, we find that: (1) strategy complexity alone does not guarantee strong performance: a simple greedy hill‑climber nearly matches the best‑performing tree‑search agent, both well above the remaining agents; (2) our analysis suggests this pattern relates to improvement opportunity structure: greedy search tends to be more effective when opportunities are dense, while tree‑search and evolutionary strategies tend to be more effective when opportunities are sparse; an adaptive agent built on this insight switches to broader exploration upon detecting improvement stagnation and outperforms the other six agents, lending initial support to this observation; and (3) process‑level analysis reveals that early convergence and directionally focused exploration are significantly associated with final performance, while solution diversity and compute cost are not. Our benchmark is available at: https://github.com/qrzou/FML‑bench.
Authors:Yucong Huang, Xiucheng Li, Kaiqi Zhao, Jing Li
Abstract:
Standard RLHF relies on transitive scalar rewards, failing to capture the cyclic nature of human preferences. While some approaches like the General Preference Model (GPM) address this, we identify a theoretical limitation: their implicit formulation entangles hierarchy with cyclicity, failing to guarantee dominant solutions. To address this, we propose the Hybrid Reward‑Cyclic (HRC) model, which utilizes game‑theoretic decomposition to explicitly disentangle preferences into orthogonal transitive (scalar) and cyclic (vector) components. Complementing this, we introduce Dynamic Self‑Play Preference Optimization (DSPPO), which treats alignment as a time‑varying game to progressively guide the policy toward the Nash equilibrium. Synthetic data experiments further validate HRC's structural superiority in mixed transitive‑‑cyclic settings, where HRC converges faster and achieves higher accuracy than GPM. Experiments on RewardBench 2 demonstrate that HRC consistently improves over both BT and GPM baselines (e.g., +1.23% on Gemma‑2B‑it). In particular, its superior performance in the Ties domain empirically validates the model's robustness in handling complex, non‑strict preferences. Extensive downstream evaluations on AlpacaEval 2.0, Arena‑Hard‑v0.1, and MT‑Bench confirm the efficacy of our framework. Notably, when using Gemma‑2B‑it as the base preference model, HRC+DSPPO achieves a peak length‑controlled win‑rate of 44.75% on AlpacaEval 2.0 and 46.8% on Arena‑Hard‑v0.1, significantly outperforming SPPO baselines trained with BT or GPM. Our code is publicly available at https://github.com/lab‑klc/Hybrid‑Reward‑Cyclic.
Authors:Nanxi Li, Zhengyue Zhao, Chaowei Xiao
Abstract:
Guardrails are a critical safety layer for modern AI systems, but their operating regime is changing. As LLMs are deployed as customized assistants, safety policies are increasingly specified at inference time by users, organizations, or regulatory contexts. This makes safety enforcement fundamentally dynamic: the guardrail should adapt to changing safety policies without retraining. Yet this requirement creates a fundamental tension: faithfully judging complex policy contexts demands reasoning capability, while practical deployment requires low‑latency responses. We introduce Latent Policy Guardrail (LPG), a guardrail framework that learnssemantic latent deliberation over dynamic policies. LPG compresses the internal deliberation needed for intent interpretation and policy grounding into continuous states supervised by decision‑relevant semantics. At inference time, it generates only a compact verdict anchored to the violated policy clauses, preserving auditability while avoiding the latency of explicit reasoning. Across policy guardrail benchmarks, LPG‑4B reaches 84.5% average safety accuracy and 77.9% F1 by compressing deliberation into just 10 latent tokens, outperforming the strongest dynamic baseline while running roughly 11 times faster than Qwen3‑4B‑Thinking under the single‑sample evaluation setup. Code and data are available at https://github.com/SaFo‑Lab/Latent_Policy_Guard.
Authors:Yuantai Zhang, Jiaqi Yang, Huajian Zeng, Changhao Chen, Haoang Li, Liang Li, Dezhen Song, Xingxing Zuo
Abstract:
Fast and reliable initialization is critical for monocular visual‑inertial navigation systems (VINS), as it establishes the starting conditions for subsequent state estimation. Despite steady progress, most existing methods heavily rely on visual feature correspondences and require 3‑4 seconds of sensory data for successful initialization, which limits their applicability and efficiency. With the advent of feed‑forward 3D models that can directly predict point clouds from images, we revisit the visual‑inertial initialization problem from a concise perspective. In this work, we propose a feature‑free initialization framework that leverages up‑to‑scale point clouds predicted by a feed‑forward 3D model, thereby obviating the need for visual feature tracking and estimation. This design substantially reduces system complexity and improves the reliability of initialization. Experiments on public datasets demonstrate that the proposed feature‑free initialization method achieves the highest success rate, exceeding 90%, and significantly reduces the data duration required for successful initialization, typically to under 1.2 s. We further validate our method on a self‑collected dataset covering various indoor and outdoor scenarios, demonstrating robust performance, particularly in visually degraded environments where existing methods often fail. The code and dataset are available at https://github.com/Yuantai‑Z/FF‑VIO‑Init.
Authors:Udari Madhushani Sehwag, Zhengyang Shan, Heming Liu, Dileepa Lakshan, Joseph Brandifino, Max Fenkell
Abstract:
Clarification‑seeking behavior is widely regarded as a desirable property of LLM agents, enabling them to resolve ambiguity before acting on underspecified tasks. However, the security implications of this interaction pattern remain unexplored. We investigate whether the transition from standard execution to a clarification‑seeking state increases an agent's susceptibility to prompt injection attacks. We introduce ASPI (Ambiguous‑State Prompt Injection), a benchmark of 728 task‑attack scenarios that isolates clarification as a distinct agent state and measures how this state transition affects vulnerability under controlled conditions. Each benchmark instance is evaluated under matched execution and clarification settings: in the execution setting, the agent acts on a fully specified instruction and encounters adversarial content only through tool‑returned data; in the clarification setting, the agent must first request and incorporate additional user input before acting. We evaluate ten frontier LLMs and find that clarification‑seeking consistently and substantially amplifies vulnerability. For instance, attack success rises from 1.8% to 34.0% for o3 and from 2.2% to 35.7% for Gemini‑3‑Flash. A decomposition analysis reveals that this gap reflects both a state‑dependent shift in how models process incoming content and a channel‑specific effect arising from the agent‑solicited clarification interface. These findings demonstrate that standard execution‑time security evaluation systematically underestimates the attack surface of interactive agents, and that robustness under fully specified tasks does not translate to robustness under ambiguity. For reproducibility, our data and source code are available at https://github.com/scaleapi/aspi.
Authors:Xinchen Jin, Aditya Chatterjee, Pranav Kumar, Rohan Paleja
Abstract:
Vision‑Language‑Action (VLA) policies translate language and visual inputs into robot actions, where their hidden representations directly shape closed‑loop behavior. However, mechanistic interpretability tools from language and vision‑language models do not transfer cleanly to VLAs: outputs are robot actions rather than human‑readable tokens, and interventions can only be tested via expensive closed‑loop rollouts. We propose an event‑grounded interpretability pipeline that anchors SAE feature analysis to behavioral events rather than text contexts. End‑effector keyframes are clustered within each task using visual, state, and temporal cues, linking SAE features to behaviorally salient events and, via optional VLM annotations, to semantic context. To our knowledge, our pipeline is among the first to ground SAE‑based VLA analysis in closed‑loop behavioral events. Across two simulation architectures and a real‑robot study, event‑grounded ranking yields the strongest causal effects on OpenVLA and transfers to the continuous action chunks of π_0.5. SAE is a sparse but imperfect intervention basis: usability varies with architecture and intervention site, and aggressive intervention reveals safety and interpretability limits. Overall, event‑grounded SAE analysis emerges as a practical starting point for behavior‑anchored VLA interpretability, motivating future work on SAE features beyond action‑aligned coordinates, finer‑grained closed‑loop evaluation, and safe interventions for high‑stakes VLA deployments. Code is available at \urlhttps://github.com/xc‑j/Event‑SAE.
Authors:Jon Saad-Falcon, Avanika Narayan, Robby Manihani, Tanvir Bhathal, Herumb Shandilya, Hakki Orhun Akengin, Gabriel Bo, Andrew Park, Matthew Hart, Caia Costello, Chuan Li, Christopher Ré, Azalia Mirhoseini
Abstract:
Personal AI stacks, like OpenClaw and Hermes Agent, are becoming central to daily work, yet they route nearly every query (often over sensitive local data) to cloud‑hosted frontier models. Replacing frontier models with local models inside existing stacks does not work: swapping Claude Opus 4.6 for Qwen3.5‑9B drops accuracy by 25‑39 pp across personal AI tasks like PinchBench and GAIA. Existing stacks bundle agentic prompts, tool descriptions, memory configuration, and runtime settings around a specific cloud model. Only the prompts can be tuned, and state‑of‑the‑art prompt optimizers close just 5 pp of the local‑cloud gap on their own. This motivates a decomposed personal AI stack: one that exposes individual primitives which can be optimized individually or jointly to close the local‑cloud gap. We present OpenJarvis, an architecture that represents a personal AI system as a typed spec over five primitives: Intelligence, Engine, Agents, Tools & Memory, and Learning. Each primitive is an independently editable field, making the stack end‑to‑end optimizable and measurable against accuracy, cost, and latency. Towards closing the local‑cloud gap without surrendering local‑model properties, OpenJarvis introduces LLM‑guided spec search, a local‑cloud collaboration in which frontier cloud models propose edits across the spec at search time, only non‑regressing edits are accepted, and the resulting spec runs entirely on‑device at inference time. With LLM‑guided spec search, on‑device specs match or exceed cloud accuracy on 4 of 8 benchmarks and land within 3.2 pp of the best cloud baseline on average. They also reduce marginal API cost by ~800x and end‑to‑end latency by 4x.
Authors:Cheikh Ahmed, Mahdi Mostajabdaveh, Zirui Zhou
Abstract:
The integration of Large Language Models (LLMs) into evolutionary frameworks has established a new paradigm for automated heuristic discovery. Despite their promise, these methods typically search in the discrete space of program syntax, relying on stochastic sampling to navigate a highly non‑convex optimization landscape. This work proposes a continuous heuristic discovery framework that shifts optimization to a learned latent manifold. We employ an encoder to map discrete programs into continuous embeddings and train a differentiable surrogate model to predict performance, enabling gradient‑based search. To regularize the optimization trajectory, an invertible normalizing flow maps these embeddings to a structured Gaussian prior, where we perform gradient ascent. The resulting optimized latent vectors are projected through a learned mapper into soft prompts, which condition a frozen LLM to synthesize novel executable heuristics. We evaluate the proposed method on the Traveling Salesman Problem (TSP), the Capacitated Vehicle Routing Problem (CVRP), the Knapsack Problem (KSP), and Online Bin Packing (OBP). Empirical results demonstrate that continuous latent‑space optimization achieves performance competitive with state‑of‑the‑art discrete evolutionary baselines while offering a complementary methodological alternative for automated algorithm design. The implementation code is available at \urlhttps://github.com/cheikh025/LHS.
Authors:Hoda Osama Elkhodary, Sherin Mostafa Youssef, Marwa Elshenawy, Dalia Sobhy
Abstract:
The rapid advancement of Deepfake technologies and video manipulation tools poses a critical challenge to multimedia forensics, judicial evidence integrity, and information authenticity. Current detectors rely on single‑modality signals, treating appearance, geometry, and motion independently. However, advanced generators maintain within‑modality consistency while producing cross‑modal contradictions, which are forensically discriminative but invisible to any single‑modal detector. We propose CAM‑VFD, a Cross‑Attention Multimodal Video Forgery Detection framework that models cross‑modal contradiction as a directional forensic signal. The framework uses a cross‑attention fusion mechanism in which CLIP‑based appearance representations serve as queries against VideoMAE motion features and MiDaS depth features, enabling the identification of contradictions between visual, temporal, and geometric evidence. We examine this design through cross‑modal attention discrepancy analysis, observing statistically separable real and fake distributions (p<0.001, Cohen's d=0.68). Experimental results on two generative video benchmarks indicate consistent performance, with 95.31% Top‑1 accuracy on GenVidBench and 93.43% accuracy, 90.63% F1‑score, and 96.56% AUROC on GenVideo. Moreover, CAM‑VFD demonstrates stable performance under compression, noise, blur, and adversarial perturbations, suggesting that cross‑modal reasoning may improve robustness in media forensics. The code is publicly available at \urlhttps://github.com/Hoda‑Osama/CAM‑VFD/tree/main.
Authors:Minhas Kamal, Hiranya Garbha Kumar, Balakrishnan Prabhakaran
Abstract:
Point cloud stands as the most widely adopted format for representing 3D shapes and scenes due to its simplicity and geometric fidelity. However, its inherent unordered and irregular nature, exacerbated by sensor noise and occlusions, introduces unique challenges for machine learning based methodologies. To combat these issues, diverse strategies have been developed, including converting to a format that has orderliness, extracting local geometry, and permutation‑invariant or self‑attention‑based processing. In this paper, our focus is directed towards deep learning models for three fundamental tasks in 3D vision: point cloud classification, part segmentation, and semantic segmentation. We begin by formally defining point cloud data, followed by an in‑depth discussion on its structural characteristics. Then, we categorize notable works based on their backbone structure and evaluate their performance on popular benchmarks. Beyond empirical comparison, we offer insights into architectural innovations and limitations. We also outline open challenges and promising future directions for 3D point cloud understanding.
Authors:Qiuchi Xiang, Haoxuan Qu, Hossein Rahmani, Jun Liu
Abstract:
Jailbreak attacks on large models have drawn growing attention due to their close ties to societal safety. This work identifies a practical yet unexplored jailbreak scenario, the wide‑net‑casting scenario, where an adversary can query a group of large models instead of a single one to elicit harmful outputs. Our analysis reveals substantial yet previously overlooked safety risks under this scenario. As a key part of our analysis, we further develop a novel jailbreak method tailored to the wide‑net‑casting scenario. With this tailored method, the jailbreak success rate can even reach 100% in some experiments when targeting the large models without additional safeguards, exposing wide‑net‑casting as a distinct, high‑risk scenario that warrants attention in future evaluation and defense research.
Authors:Zhaoxin Yu, Nan Xu, Kun Chen, Jiahao Zhao, Lei Wang, Wenji Mao
Abstract:
With the continuous advancement of reasoning abilities in Large Language Models (LLMs), their application to scientific reasoning tasks has gained significant research attention. Current research primarily emphasizes boosting LLMs' performance on scientific QA benchmarks by training on larger, more comprehensive datasets with extended reasoning chains. However, these approaches neglect the essence of the scientific reasoning process ‑‑ logicality, which is the rational foundation to ensure the validity of reasoning steps leading to reliable conclusions. In this work, we make the first systematic investigation into the internal logicality underlying LLM scientific reasoning, and develop a scientific logicality‑enriched methodology, including a set of assessment criteria and data sampling methods for logicality‑guided training, to improve the logical faithfulness as well as task performance. Further, we take physics, characterized by its diverse logical structures and formalisms, as an exemplar discipline to practise the above methodology. For data construction, we extract scientific problems from academic literature and sample a high‑quality dataset exhibiting strong logicality. Experiments based on three different backbone LLMs reveal that: 1) the training data we constructed can effectively improve the scientific logicality in LLM reasoning; and 2) the enriched scientific logicality plays a critical role in solving scientific problems. Code is available at \hrefhttps://github.com/ScienceOne‑AI/PhysLogichttps://github.com/ScienceOne‑AI/PhysLogic.
Authors:Sajjad Khan
Abstract:
Concurrent LLM agents sharing mutable natural‑language state produce Structural Race Conditions (SRCs): write‑write and cross‑shard stale‑read conflicts that silently corrupt agent output. Existing multi‑agent frameworks (LangGraph, CrewAI, AutoGen) provide no write‑ownership semantics over shared state. We present S‑Bus, an HTTP middleware whose central mechanism is a server‑side DeliveryLog: a per‑agent log of HTTP GET operations that automatically reconstructs each agent's read set at commit time without agent SDK changes under HTTP/1.1. The consistency property the DeliveryLog provides ‑‑ Observable‑Read Isolation (ORI), a partial causal consistency over the HTTP‑observable projection of the read set ‑‑ prevents structural race conditions when agents collaborate via shared shards. Three contributions: (C1) The DeliveryLog mechanism for automatic HTTP‑traffic‑based read‑set reconstruction, with three‑tier mechanised evidence: ReadSetSoundness and ORICommitSafety machine‑checked in TLAPS (modulo one retained typing axiom); exhaustive TLC at N=3 (20,763,484 distinct states, zero violations); Dafny discharges 9 inductive soundness lemmas. (C2) Empirical structural‑conflict prevention parity against PostgreSQL 17 SERIALIZABLE and Redis 7 WATCH/MULTI on shared‑shard contention sweeps with 427,308 active HTTP‑409 conflicts: zero Type‑I corruptions across all three backends. (C3) ORI's operating envelope is topology‑conditional: semantically neutral in dedicated‑shard workloads; harmful in single‑shard collaborative writing because preservation propagates concurrent contradictions. Source code: https://github.com/sajjadanwar0/sbus
Authors:Robin-Nico Kampa, Fabian Deuser, Anna Bößendörfer, Konrad Habel, Norbert Oswald
Abstract:
Autonomous AI coding agents are becoming a core tool for ML practitioners in industry and research alike. Despite this growing adoption, no standardized benchmark exists to evaluate their ability to design, implement, and train models from scratch across diverse domains. We introduce 1GC‑7RC (Single Graphic Card: Seven Research Challenges), a benchmark comprising seven ML tasks spanning language modeling, image classification, semantic segmentation, graph learning, tabular prediction, time‑series forecasting, and text classification. Each task provides a locked data‑preparation and evaluation script together with a baseline training script; the agent may only modify the training code, has no access to pretrained weights (with one controlled exception for semantic segmentation), no internet access, and must complete each task within a task‑specific wall‑clock budget (40‑120 minutes) on a single GPU. We evaluate seven coding agents: five proprietary (Claude Code with Sonnet 4.6, Opus 4.6, and Opus 4.7; Codex CLI with GPT 5.5; and OpenCode with Qwen 3.6+) and two open‑source (OpenCode with Kimi K2.5, Kimi K2.6). Across 5 runs per agent‑task pair, we report substantial performance differences that reveal varying levels of implicit ML knowledge, planning ability, and time‑budget management. The benchmark, harness, and all evaluation artifacts are publicly available on GitHub at https://github.com/Strolchii/1GC‑7RC‑Benchmark to facilitate reproducible comparison of future agents. Because our benchmark design is modular, the benchmark can be extended to new tasks and domains, adapted to different GPU budgets, and used to study multi‑agent settings, making it a flexible platform for future research on autonomous research agents.
Authors:Masaru Yamada
Abstract:
We present Agentic AI Translate, an agentic translator prototype that operationalises the thesis of Yamada (forthcoming) ‑‑ that the metalanguage of Translation Studies has become an instruction code for generative AI. The system replaces the dominant text‑in / text‑out paradigm of machine translation with a four‑stage agentic cycle (Identify ‑> Prompt ‑> Generate ‑> Verify), preceded by an interactive specification phase in which the user composes ‑‑ through model‑assisted dialogue ‑‑ a structured translation brief grounded in skopos theory, register, audience, and genre conventions. The verification stage adopts the GEMBA‑MQM error‑span protocol (Kocmi & Federmann, 2023) for evidence‑grounded scoring, and document‑level coherence is preserved through a DelTA‑lite memory of proper nouns and a running bilingual summary, after Wang et al. (2025). We describe the philosophical motivation, the architectural commitments, the four reference‑material categories the system consumes, and the principal design tensions the architecture makes explicit. Empirical validation is left for future work; the contribution here is conceptual and architectural ‑‑ an executable embodiment of the position that translation in the GenAI era is communication design, not text conversion.
Authors:Yunzhi Tian, Dekui Wang, Qirong Bu, Wei Zhou, Xingxing Hao, Jun Feng
Abstract:
Multi‑view learning has been widely applied for sleep stage classification using multi‑modal data. However, existing methods typically assume that different modalities are well‑aligned, which is often unattainable in real‑world scenarios, thereby compromising the reliability of the staging results. In this paper, we propose ConfSleepNet, a conflict‑aware evidential framework that dynamically resolves inter‑view conflicts. The framework consists of multi‑view evidence extraction and conflict‑aware aggregation. In the first phase, it learns category‑related evidence from different modalities, which represents the degree of support for individual sleep stages. Considering the inherent characteristics of varying modalities, we propose hybrid category structures for different modalities to promote more reasonable evidence learning. In the second phase, view‑specific opinions, including prediction results and uncertainty, are constructed from the learned evidence. Notably, we propose a novel conflict‑aware aggregation method that integrates these view‑specific opinions into a reliable joint decision. This mechanism can effectively resolve conflicts among opinions and synthesize them into a reliable joint decision. Both theoretical analysis and experimental results demonstrate the effectiveness of ConfSleepNet in sleep staging tasks. The code is available at https://github.com/By4te/ConfSleepNet_ICML2026/.
Authors:Peng Cui, Boyao Yang, Jun Zhu
Abstract:
Reinforcement Learning (RL) post‑training has emerged as the dominant paradigm for eliciting mathematical reasoning in Large Language Models (LLMs), yet prevailing techniques such as GRPO and DAPO distribute rollout and gradient budgets nearly uniformly across prompts, squandering compute on samples that are already mastered or remain far beyond the model's current capability. To address this fundamental inefficiency, we propose Learning‑Zone Energy (LZE), a theoretically grounded, fully online data selection framework that concentrates computation on the model's active learning frontier. At its core, we define a closed‑form Learning‑Zone Energy Score that fuses three complementary signals, an initial‑difficulty anchor, a normalized outcome‑uncertainty term, and a pass‑rate momentum, into a single scalar that is provably aligned with the expected magnitude of group‑relative policy gradient updates. A forward pruner with replay further reduces wall‑clock time cost by skipping rollout generation for persistently solved prompts while periodically checking for forgetting. Evaluated on Qwen‑family models (1.5B‑8B) across GSM8K, MATH and DAPO‑MATH, our method retains only 40% of the training data per step yet matches or surpasses full‑data baselines, with especially pronounced out‑of‑distribution gains on AIME25 (+45.9%) and AMC23 (+18.2%), alongside an estimated 36% reduction in training FLOPs. Our code is available at https://github.com/Stellaris167/LZE.
Authors:Ruth Wan Theng Chew, Zhiliang Chen, Apivich Hemachandra, Bryan Kian Hsiang Low
Abstract:
Optimization of LLM training and inference configurations, such as hyperparameters, data mixtures, and prompts, is critical to performance, but it is often approached heuristically in practice, leading to potentially suboptimal outcomes. By framing them as noisy, expensive, and derivative‑free optimization problems, Bayesian optimization (BO) and other black‑box optimization (BBO) methods offer a promising yet underexplored direction for principled, sample‑efficient methods. However, LLM training and inference costs are prohibitively high for most of the BBO research community, and new methods are often only evaluated on synthetic test functions and small‑scale datasets that fail to capture the challenges of modern LLM optimization problems. This impedes the development of BBO methods and makes it difficult to assess their effectiveness on modern LLM tasks. We introduce BoLT, the first LLM‑centric benchmark that democratizes LLM research for the BBO community. BoLT is released at https://github.com/chewwt/bolt. BoLT covers broad and well‑motivated LLM optimization problems, involving multi‑fidelity, multi‑objective, heteroscedastic noise, and high‑dimensional search spaces. Each problem in BoLT is grounded in real experimental data and made fully reproducible and accessible through lightweight surrogate models fitted to the results of thousands of real LLM experiments. We benchmark BoLT against an extensive range of BO and BBO methods, showing that selected BO methods consistently outperform others across tasks and highlighting gaps in existing BBO methods on LLM tasks, underscoring the need to modernize benchmarks for the BBO community.
Authors:Anthonio Oladimeji Gabriel, Ahmad Rufai Yusuf
Abstract:
Current clinical artificial intelligence (AI) systems are evaluated almost exclusively on clean, standardised, English‑language inputs, conditions that do not reflect the realities of healthcare delivery in low‑resource settings. This study presents the first systematic dual audit of two orthogonal safety vulnerabilities in clinical AI: adversarial image fragility and cross‑lingual diagnostic drift. Using DenseNet121, the architecture underlying CheXNet, fine‑tuned on the COVID‑QU‑Ex chest X‑ray dataset (85,318 images; COVID‑19, Non‑COVID Pneumonia, Normal), we demonstrate that diagnostic accuracy collapses from 89.3% to 62.0% under a Fast Gradient Method (FGM) perturbation of epsilon=0.021, a magnitude imperceptible to the human eye. Standard defensive strategies including Gaussian smoothing and ensemble voting failed to restore clinical safety. In a parallel language fragility experiment, we tested Llama3.1:8b and NatLAS (N‑ATLAS) on 20 COVID‑19 clinical cases presented in Standard English, Nigerian Pidgin (Naija), and Yoruba‑inflected English. Both models exhibited significant accuracy degradation: Llama3.1:8b dropped from 80.0% to 65.0% on Pidgin; NatLAS, an African‑context model, collapsed from 85.0% to 55.0%, with diagnosis consistency falling to 50%. These findings establish a quantitative failure envelope for clinical AI under conditions representative of Primary Health Centre (PHC) deployment in Nigeria, and motivate urgent calls for adversarially hardened, linguistically inclusive clinical AI architectures.
Authors:Jinjie Shen, Zheng Huang, Yuchen Zhang, Yujiao Wu, Yaxiong Wang, Lechao Cheng, Shengeng Tang, Tianrui Hui, Nan Pu, Zhun Zhong
Abstract:
Existing vision‑language forgery detection and grounding methods operate under a closed‑world paradigm, assuming verification can be completed by the model alone. However, self‑contained MLLMs are constrained by finite parametric knowledge, static training corpora, and limited perceptual resolution, creating a practical ceiling in dynamic open‑world forensics ‑‑ particularly for real‑time event verification requiring external clues and forgery segmentation demanding fine‑grained scrutiny of local manipulations. To address these limitations, we shift from scaling up the self‑contained model toward reaching beyond it. We propose OmniVL‑Guard Pro, a tool‑augmented agent that extends unified forensics from closed‑world prediction to open‑world clues‑driven reasoning. OmniVL‑Guard Pro integrates a tool environment spanning real‑time event search, local cropping and zooming, edge‑anomaly screening, face detection, video frame extraction, and SAM3‑based segmentation. To generate high‑quality tool‑reasoning trajectories, we introduce Tree‑Structured Self‑Evolving Tool Trajectory Generation, which produces diverse trajectories through seed guidance, guider‑free self‑evolution, and weakly‑hinted hard sample synthesis, yielding the Full‑Spectrum Tool Reasoning (FSTR) dataset for training. We further propose Checker‑Guided Agentic Reinforcement Learning (CGARL), which provides process‑level supervision to penalize cases where the answer is correct but the reasoning is distorted. Extensive experiments demonstrate that OmniVL‑Guard Pro achieves state‑of‑the‑art performance across various tasks, and exhibits strong zero‑shot generalization. The FSTR dataset and code for OmniVL‑Guard Pro will be publicly released at https://github.com/shen8424/OmniVL‑Guard‑Pro.
Authors:Zhiqiang Liu, Wenhui Dong, Yilang Tan, Yuwen Qu, Haochen Yin, Chenyang Si
Abstract:
Tool‑using agents are increasingly expected to operate across realistic professional workflows, where they must interpret multimodal inputs, coordinate external tools, inspect intermediate artifacts, and revise their actions before producing a final result. Existing benchmarks, however, often evaluate tool use, computer use, and multimodal reasoning in isolation, leaving a gap between benchmark settings and end‑to‑end omni‑modal tool use in the real world. To address this gap, we introduce MM‑ToolBench, a benchmark and evaluation harness for task‑oriented omni‑modal tool use. MM‑ToolBench contains 100 executable tasks from two macro task families, Customer Service and Intelligent Creation, covering 20 subcategory slices and supported by 27 MCP servers with 324 tools. The central design of MM‑ToolBench is closed‑loop multimodal verification: agents must execute tools, inspect rendered or transformed artifacts, and self‑correct when outputs fail task‑specific requirements. To make such evaluation scalable and verifiable, MM‑ToolBench couples MCP‑based execution with task‑specific grounded evaluators and a semi‑automated construction pipeline for scenario discovery, task instantiation, evaluator synthesis, and human audit. Experiments on 15 contemporary agentic models show that MM‑ToolBench remains highly challenging: Claude Opus 4.6, commonly regarded as one of the strongest coding‑agent models, achieves only 32.0% task success, far below the 94.0% human benchmark. We envision MM‑ToolBench as a practical foundation for evaluating and advancing next‑generation omni‑modal tool‑using agents through closed‑loop multimodal verification.
Authors:Yuxuan Ye, Jun Han, Ao Hu, Juncheng Bu, Yiyi Chen, Liangjian Wen, Danilo Mandic, Danny Dongning Sun, Xu Yinghui, Zenglin Xu
Abstract:
End‑to‑end LLM trading agents have moved quickly from research curiosity to a small ecosystem of named systems, including FinCon, FinMem, TradingAgents, FinAgent, QuantAgent, and FLAG‑Trader. Several of these report headline Sharpe ratios that would be material if read at face value on a deployment desk, and associated benchmarks such as FinBen report trading‑task Sharpe statistics in the same range. The gap between architecture research and deployment claim has been crossed too freely on both sides of the academia‑‑industry divide. We take a position on that gap: reported alpha from end‑to‑end LLM trading agents should not be treated as deployment evidence. Before such returns can support claims of deployable trading capability, they must survive structural validity tests for temporal integrity, real‑world frictions, counterfactual robustness, predictive calibration, numerical execution, and multi‑agent disaggregation. Current public evidence cannot yet distinguish robust predictive ability from temporal contamination, unmodeled frictions, short‑window Sharpe uncertainty, narrative fitting, and parametric priors. The problem is not only evaluative but structural. Language confidence is not tradable probability, narrative reasoning is not numerical execution, and model priors may become undisclosed implicit factor exposures. We contribute a minimum reporting protocol suite, P1‑‑P6, with tiered applicability by claim strength, and a conservative modular alternative that uses LLMs as auditable information interfaces upstream of independent calibration, risk, and execution modules. Code and reproduction harness: \urlhttps://github.com/hj1650782738/Trading.
Authors:Yuwen Qu, Wenhui Dong, Chenyang Si, Caifeng Shan
Abstract:
Recent studies introduce conditional memory modules that decouple knowledge storage from neural computation, enabling more direct knowledge access. Compared to MoE, which relies on dynamic computation paths, explicit lookup provides a more efficient knowledge retrieval mechanism. However, these approaches still depend on learned memory embeddings, requiring additional training and limiting flexibility. To address this, we propose N‑gram Memory (NGM), a training‑free, plug‑and‑play module composed of a Causal N‑Gram Encoder and a Cosine‑Gated Memory Injector. The Causal N‑Gram Encoder directly averages the pretrained token embeddings of the backbone model to construct N‑gram representations, thereby eliminating the need to train separate N‑gram embeddings from scratch. This design requires neither an additional memory table nor a retrieval pipeline. The Cosine‑Gated Memory Injector then uses a non‑parametric cosine gate with ReLU to modulate the retrieved embeddings into the contextual representations. We evaluate NGM on the Qwen3 series from 0.6B to 14B across eight benchmarks. NGM improves average performance by 0.5 to 1.2 points, with particularly clear gains on code generation and knowledge‑intensive tasks (e.g., +3.0 on LiveCodeBench and +3.03 on GPQA for Qwen3‑14B). Moreover, NGM also improves performance in multimodal benchmarks (e.g., MMStar +1.53 on Qwen3‑VL‑2B).
Authors:Changshuo Shen, Leheng Sheng, Yuxin Chen, An Zhang, Xiang Wang
Abstract:
Large reasoning models (LRMs) substantially outperform their base LLM counterparts on challenging reasoning benchmarks, yet it remains poorly understood where base models go wrong during token‑by‑token generation and how to narrow this gap efficiently. We study the base‑reasoning gap through quantifying token‑level distributional disagreement between a base model and a stronger reasoning model using likelihood‑based divergences. Across benchmarks, we find that the reasoning advantage is highly sparse and concentrates on a small set of early, planning‑related decision tokens. For instance, on Qwen3‑0.6B, only ~8% of generated tokens account for the salient disagreement, and these tokens concentrate early in the response, are strongly enriched in planning‑related decisions (17x), and coincide with high base‑model uncertainty ‑‑ suggesting that base models fail mainly at early planning points that steer the subsequent reasoning trajectory. Building on these findings, we propose disagreement‑guided token intervention, a simple inference‑time delegation scheme that performs a one‑token takeover by the reasoning model only at high‑disagreement positions and immediately switches back to the base model. With a small intervention budget, this sparse delegation substantially recovers and can even surpass the performance of a same‑size reasoning model on challenging reasoning tasks. Code is available at https://github.com/AlphaLab‑USTC/RRTokenIntervention.
Authors:Yachan Guo, JoseLuis Gomez Zurita, Danna Xue, Yi Xiao, AntonioManuel Lopez Pena
Abstract:
Although large‑scale visual foundation models (VFMs) achieve remarkable performance in semantic understanding, they still underperform in instance‑aware dense prediction tasks. They exhibit different biases in representation: for instance, promptable segmentation models (e.g., SAM2) focus on fine‑grained region boundaries, while self‑supervised models (e.g., DINOv3) emphasize object‑level structure. This observation highlights the potential of combining complementary features from different VFMs to enhance downstream dense prediction tasks. However, naive multi‑VFM fusion seldom leads to reliable gains, and interpretable principles for leveraging their complementary features are still underexplored. In this work, we propose a metric‑guided approach that effectively selects and aggregates complementary features from different VFMs based on explicit assessment scores. Specifically, we design a suite of label‑free metrics in feature space across two aspects, Structural Coherence and Edge Fidelity, to assess features of VFM encoders. Guided by these scores, we identify complementary edge‑strong and structure‑strong encoder pairs, and integrate them via a master‑auxiliary fusion scheme. This feature fusion requires no complex architectural changes and is trained only in a single stage. Our model shows consistent performance gains across multiple dense prediction tasks compared with the baselines, with better object‑level semantics and more accurately localized boundaries. The code is available at https://github.com/gyc‑code/metric‑guided‑fusion.
Authors:Yaniv Hassidof, Adir Morgan, Yilun Du, Kiril Solovey
Abstract:
Compositional diffusion models offer a promising route to long‑horizon planning by denoising multiple overlapping sub‑trajectories while ensuring that together they constitute a global solution. However, enforcing local behavior over long chains is often insufficient for a coherent global structure to emerge. Recent works tackle this limitation through intrinsic search, which explores multiple paths during the denoising process. While intrinsic search improves global coherence, it comes at the cost of repeated evaluations of an already compute‑heavy model. In this work, we argue that extrinsic search, performed outside the denoising process, offers a more effective mode of exploration for long‑horizon planning while naturally enabling the use of classical algorithms to solve unseen combinatorial tasks at test time. Our eXtrinsic search‑guided Diffuser (XDiffuser) first computes a plan over a state‑space graph ‑‑ serving as a lightweight local connectivity oracle for the diffusion model. The plan is then used to guide denoising for a single trajectory, effectively offloading the burden of exploration. XDiffuser outperforms diffusion‑based baselines on long‑horizon tasks, with particularly large gains in the low‑quality data regime and on unseen tasks beyond goal‑reaching, including multi‑agent coordination and TSP‑style reasoning. Project website: https://yanivhass.github.io/XDiffuser‑site/
Authors:Wei Zhang, Songhua Li, Yihang Wu, Qiang Li, Qi Wang
Abstract:
3D change detection from multi‑view images is essential for urban monitoring, disaster assessment, and autonomous driving. However, existing methods predominantly operate in the 2D domain, where viewpoint variations are mistaken for physical changes and depth is unavailable. While visual geometry foundation models like VGGT rapidly produce dense point clouds from unposed images, independent per‑epoch reconstruction encounters fundamental obstacles: unpredictable inter‑epoch scale ambiguity, registration‑change paradox where scene changes corrupt alignment, and pervasive edge‑flying noise. To address these challenges, we present VGGT‑CD, a training‑free pipeline decoupling cross‑temporal registration from dynamic‑change interference. In the Coarse Stage, sparse keyframe joint inference establishes a unified metric space and yields an initial Sim(3) prior. In the Fine Stage, dense reconstructions are purified by isolating static‑background correspondences. A closed‑form centroid alignment refines the translation while locking scale and rotation, using a residual self‑check to mathematically guarantee non‑degradation. Evaluated on an 11‑scene benchmark from the World Across Time dataset, VGGT‑CD reduces Absolute Trajectory Error by 44% outdoors and 59% indoors. It completes registration over 6 times faster, producing high‑purity 3D change maps without task‑specific training.
Authors:Anhao Zhao, Haoran Xin, Yingqi Fan, Junlong Tong, Wenjie Li, Xiaoyu Shen
Abstract:
Knowledge distillation is central to LLM post‑training, yet its design space remains poorly understood, especially alongside reinforcement learning (RL). We show that the prevailing paradigms, off‑policy distillation and on‑policy distillation (OPD), implicitly couple two orthogonal choices: prefix source and token‑level KL direction. This follows from decomposing sequence‑level KL over autoregressive response distributions: forward KL pairs teacher prefixes with token‑level forward KL, and reverse KL pairs student prefixes with token‑level reverse KL. We argue this coupling is not intrinsic: decoupling the two axes yields four valid objectives. We establish gradient‑level identities showing forward KL gives SFT‑style cross‑entropy matching with teacher soft targets, whereas reverse KL gives an RL‑style policy‑gradient objective with a dense teacher‑student log‑ratio reward, connecting them to off‑policy SFT, DAgger‑style on‑policy SFT, offline‑RL‑style distillation, and OPD. We conduct an extensive controlled study on math reasoning, evaluating the four objectives both as standalone methods and as initializations for subsequent RL. The results reveal three tradeoffs: KL direction induces an accuracy‑entropy tradeoff, prefix source a quality‑compute tradeoff, and training length an accuracy‑stability tradeoff. Motivated by these findings, we propose KL mixing and an entropy‑gated length curriculum. KL mixing shows long‑sequence distillation requires substantial forward‑KL weight to prevent entropy collapse and length inflation without sacrificing accuracy. The entropy‑gated length curriculum improves Avg@k and Pass@k by 3.6 and up to 5.8 points, and cuts average response length by roughly 3x versus fixed long‑horizon training. Our results provide a framework and practical methods for designing reasoning distillation objectives that balance accuracy, diversity, compute, and RL behavior.
Authors:Hwidong Kim, Yunho Kim, Tae-Kyun Kim
Abstract:
Video generative models have made remarkable progress, yet they often yield visual artifacts that violate grounding in physical dynamics. Recent works such as PhysGen3D tackle single image‑to‑3D physics through mesh reconstruction and Physically‑Based Rendering, but challenges remain in modeling fluid dynamics, multi‑object interactions and photorealism. This work introduces 3DPhysVideo, a novel training‑free pipeline that generates physically realistic videos from a single image. We repurpose an off‑the‑shelf video model for two stages. First, we use it as a novel view synthesizer to reconstruct complete 360‑degree 3D scene geometry by guiding the image‑to‑video (I2V) flow model with rendered point clouds. Second, after applying physics solvers to this geometry, the physically simulated point cloud is used to guide the same I2V flow model to synthesize final, high‑quality videos. Consistency‑Guided Flow SDE, which decomposes the predicted velocity of the I2V flow model into denoising and consistency bias, enforces consistency to the conditional inputs, allowing us to effectively repurpose the model for both 3D reconstruction and simulation‑guided video generation. In the diverse experiments including multi‑objects, and fluid interaction scenes, our method successfully bridges the gap from single‑images to physically plausible videos, while remaining efficient to run on a single consumer GPU. It outperforms state‑of‑the‑art baselines on GPT‑based scores, VideoPhy benchmark and human evaluation.
Authors:Anay Kulkarni, ChiaEn Lu, Dheeraj Mekala, Jayanth Srinivasa, Gaowen Liu, Jingbo Shang
Abstract:
Tool use enables large language models to solve complex tasks through sequences of API calls, yet existing reinforcement learning approaches fail to scale to multi‑step composition settings. Outcome‑based rewards provide only sparse feedback, while trajectory‑supervised rewards depend on annotated reference solutions, penalizing valid alternatives and limiting scalability. We propose TIER: Trajectory‑Invariant Execution Rewards, a reward framework that derives supervision directly from function schemas and runtime execution, rather than from reference trajectories. The reward decomposes into format validity, schema adherence, execution success, and answer correctness, providing dense, interpretable sequence‑level feedback derived from fine‑grained verification of individual steps of tool use. This design allows any valid execution path to receive credit, naturally supporting multiple solution strategies and adapting to evolving tool interfaces. On DepthBench, a compositional benchmark stratified by depth (1 to 6 steps), TIER achieves >90% accuracy across steps, where trajectory‑supervised rewards collapse beyond step‑4. We further demonstrate consistent gains on benchmarks like BFCL v3 and NestFUL. Ablation studies confirm that all reward components are necessary, highlighting the importance of multi‑level supervision for compositional reasoning.
Authors:Arpan Kusari
Abstract:
Hyperdimensional (HD) computing offers an attractive alternative to deep networks for edge learning due to its simplicity, fast prototype‑based inference, and compatibility with online updates. However, standard pixel‑based HD encoders are brittle: small distribution shifts such as rotation, noise, or occlusion can drastically reduce accuracy. We extract discrete topological primitives‑most notably holes‑from binarized shapes and pair them with rotation/translation/scale (RTS)‑invariant shape signatures. Our method constructs RTS‑stable descriptors for (i) the outer shape using a spatial‑pyramid variant of Zernike moments and (ii) each hole using an intrinsic Fourier descriptor of its radial signature together with RTS‑canonical relative geometry. Each primitive is mapped to a bipolar hypervector via randomized projection and role binding, and variable‑cardinality hole sets are aggregated by permutation‑invariant bundling to form a single image hypervector. To avoid over‑weighting any cue, we learn nonnegative reliability weights for the Zernike and hole channels on a validation set via late fusion of cosine similarities. Experiments on MNIST and EMNIST under controlled corruptions (rotation, Gaussian noise, salt‑and‑pepper, cutout, zoom) show that Topology‑guided HD computing substantially improves robustness compared with a naive HD baseline, maintaining high accuracy across multiple corruption families and benefiting from lightweight online training. Compared with a compact CNN trained on clean data, our method achieves competitive clean accuracy while offering markedly stronger robustness to several pixel‑level corruptions, demonstrating that explicit topological structure is a practical route to robust HD representations. The code is provided at https://github.com/arpan‑kusari/Topological‑HDC.
Authors:Mingyang Zhao, Sipu Ruan, Xiaohong Jia
Abstract:
This work presents a novel method for fitting superquadrics to point clouds under the contamination of noise and outliers, which has many applications for shape modeling across diverse fields. Unlike prior approaches that either exclusively focus on fitting rigid or deformable superquadrics, or suffer from robustness and numerical instability issues, our method redefines the problem from a new unsupervised clustering perspective, enabling the holistic fitting of both rigid and deformable superquadrics within a unified framework. Central to our approach is a stable optimization function inspired by unsupervised clustering analysis, where we formulate the point cloud data and samples from the potential parametric surface as clustering members and centroids, respectively. Then, the clustering process with dynamic updates to centroid locations serves as a direct proxy for optimizing superquadric parameters, establishing a principled link between geometric fitting and clustering dynamics. We further derive the relationship between pairwise computations of clustering centroids and clustering members to orthogonal distances, effectively eliminating the need for the time‑consuming surface sampling process. Moreover, our formulation provides closed‑form analytical solutions for both the fuzzy membership degree vector and the covariance matrix, ensuring efficient iteration optimization and enabling more effective handling of geometric deformations. In addition, we provide a theoretical certificate of convergence analysis and demonstrate that the clustering‑inspired fitting method can escape local minima by inherently increasing the convexity of the objective function. The implementation is publicly available at https://github.com/zikai1/SuperquadricFitting.
Authors:Puning Yang, Junchi Yu, Qizhou Wang, Philip Torr, Bo Han, Xiuying Chen
Abstract:
Mitigating sensitive and harmful outputs is fundamental to ensuring safe deployment of LLMs. Existing approaches typically follow two paradigms: Knowledge Deletion (KD), which erases undesirable information during training, and Distinguishable Refusal (DR), which steers models away from using sensitive knowledge during inference. Despite rapid progress, KD‑based unlearning struggles with biased deletion due to suppressing specific token sequences as a substitute for complete knowledge removal, whereas DR‑based unlearning risks the re‑emergence of harmful knowledge because the underlying knowledge remains intact. To address these issues, we propose Distinguishable Deletion (\mathrmD^2), a paradigm that restricts the response distribution in the latent representation rather than specific tokens to erase undesirable knowledge, while distinguishing it from retained knowledge, enabling a refusal mechanism to handle unlearned inputs safely and coherently. To implement \mathrmD^2, we introduce an energy index that quantifies the presence of knowledge and the separation between unlearned and retained content. Mathematical and empirical analyses show that energy is both accurate and efficient, enabling Energy‑based Unlearning Alignment (EUA) to enforce energy‑boundary unlearning during training and apply an energy‑based refusal mechanism at inference. Extensive experiments demonstrate that EUA significantly outperforms previous methods, indicating the superiority of \mathrmD^2. Our code is available at https://github.com/Puning97/EUA‑for‑LLM‑Unlearning.
Authors:Zhitian Hou, Tianyong Hao, Nanli Zeng, Zhixiong Chao, Kun Zeng
Abstract:
Criminal Court View Generation (CVG) is a critical task in Legal Artificial Intelligence (Legal AI), involving the generation of court view based on case facts. In this work, we systematically explore the capabilities of lightweight (smaller than 2B) large language models (LLMs) in CVG and their impact on charge prediction. Our study addresses four key questions: (1) how does different architecture of LLMs affect the CVG quality and charge prediction. (2) how does LLMs size contribute to the performance, (3) how do lightweight LLMs compare with Deep Neural Networks (DNNs) in these tasks, and (4) how does predicting charge by court view generation first compare with predicting it directly. Additionally, we also develop CVGEvalKit, an evaluation framework including three public available datasets for CVG tasks, as well as predicting their charges. Comprehensive experiments are conducted on this framework, where models are trained on a mixed training set and evaluated on each dataset's test set. Experimental results provide new insights into the trade‑offs between model architecture, model size, and the influence between different tasks, highlighting the potential of lightweight LLMs in judicial AI applications. The source code is anonymously available at \urlhttps://github.com/ZhitianHou/CVGEvalKit
Authors:Shuowei Li, Yuming Zhao, Parth Bhalerao, Oana Ignat
Abstract:
Text‑to‑video (T2V) generation has rapidly progressed in visual fidelity, yet its ability to faithfully represent multiple cultures within a single prompt remains underexplored. We introduce MAVEN, a multi‑agent prompt refinement framework designed to improve cultural fidelity in both mono‑cultural and cross‑cultural T2V generation. MAVEN decomposes prompts into person, action, and location dimensions, handled by specialized agents operating in parallel or sequentially. To support systematic evaluation, we contribute a new benchmark of 243 culturally grounded prompts and 972 corresponding videos, spanning three cultures (Chinese, American, Romanian), three action categories, and both mono‑cultural and cross‑cultural scenarios. Evaluations combining CLIP‑based metrics, VLM‑as‑judge assessments, and videoquality measures show that multi‑agent refinement, particularly parallel specialization, significantly improves cultural relevance while preserving visual quality and temporal consistency. The dataset and code are available athttps://github.com/AIM‑SCU/CRAFT
Authors:Haolin Chen, Deon Metelski, Leon Qi, Tao Xia, Joonyul Lee, Steve Brown, Kevin Riley, Frank Wang, T. Y. Alvin Liu, Hank Capps MD, Zeyu Tang, Xiangchen Song, Lingjing Kong, Fan Feng, Tianyi Zeng, Zhiwei Liu, Zixian Ma, Hang Jiang, Fangli Geng, Yuan Yuan, Chenyu You, Qingsong Wen, Hua Wei, Yanjie Fu, Yue Zhao, Carl Yang, Biwei Huang, Kun Zhang, Caiming Xiong, Sanmi Koyejo, Eric P. Xing, Philip S. Yu, Weiran Yao
Abstract:
End‑to‑end automation of realistic healthcare operations stresses three capabilities underrepresented in current benchmarks: policy density, decisions must be grounded in a large library of medical, insurance, and operational rules; Multi‑role composition: a single task requires the agent to play multiple roles with handoffs; and multilateral interaction: intermediate workflow steps are multi‑turn dialogs, such as peer‑to‑peer review and patient outreach. We introduce χ‑Bench, a benchmark of long‑horizon healthcare workflows across three domains: provider prior authorization, payer utilization management, and care management. Each task hands the agent a clinical case in a high‑fidelity simulator of 20 healthcare apps exposed via 87 MCP tools, which it must drive to a terminal status through tool calls and writing the role's artifacts, guided by a 1,290+ document managed‑care operations handbook skill. Across 30 agent harness/models configurations, the best agent resolves only 28.0% of tasks, no agent clears 20% on strict pass^3, and executing all tasks in a single session slumps the performance to 3.8%. These results raise the hypothesis that similar gaps are likely to surface in other policy‑dense, role‑composed, irreversible enterprise domains.
Authors:Elle Najt, Colin Toft, Tyler Tracy, Fabien Roger, Joe Benton
Abstract:
Since autonomous coding agents generate complex behaviors at high‑volume, we may want to use other LLMs to monitor actions to reduce the risk from dangerous misaligned behavior. To better understand the limitations of such monitors against the diverse attack strategies that a coding agent could use, we present SLEIGHT‑Bench (Subtle Low‑itEration Insight‑Guided Harmful Transcripts), a benchmark of synthetic transcripts containing 40 attacks across 11 categories, each showing a coding agent covertly pursuing a harmful objective (e.g. weight exfiltration, credential theft, unmonitored rogue deployment). We find that 20 of 40 attacks are never caught by an Opus 4.6 monitor with extended thinking across 10 trials at a 1% false‑positive rate, while the overall catch rate is 32%. We identify several evasion strategies such as exploiting model priors, user instruction ambiguity, and state manipulation, which facilitate creating attacks that evade frontier monitors. We also elicit stronger monitor performance using coding agents as monitors versus regular prompted monitors, and for some evasion strategies show improved catch rates with targeted monitor prompts. Our dataset and evaluation framework are available at https://github.com/safety‑research/sleight‑bench and https://huggingface.co/datasets/sleightbench/SLEIGHT‑Bench.
Authors:Kyrie Zhao, Zehong Wang, Tianyi Ma, Fang Wu, Xiangru Tang, Pietro Lio, Sheng Wang, Yanfang Ye
Abstract:
Hypergraphs model higher‑order relations that drive real‑world decisions, from drug prescriptions to recommendations. A central structural signal in such data, beyond what pairwise relations can express, is interaction compositionality: whether a higher‑order relation is compositional, emergent, or inhibitory with respect to its observed or unobserved sets. In polypharmacy, the regime decides whether a drug should be dropped, kept, or excluded: a compositional drug triple can be safely simplified, an emergent triple requires all drugs jointly, and an inhibitory triple flags a drug that disrupts an existing interaction. However, existing hypergraph learning methods, which merely propagate messages over observed hyperedges, leave this compositional signal unmodeled, allowing dangerous drug combinations to slip through and be misclassified. To this end, we propose the Hypergraph Pattern Machine (HGPM), shifting the paradigm from message passing to learning the compositional pattern of subsets. It tokenizes compositional subsets, organizes them in an inclusion DAG, and trains an inclusion‑aware Transformer under masked reconstruction. On ten hypergraph benchmarks, HGPM matches or exceeds state‑of‑the‑art methods. Notably, in a real adverse‑event prediction case, HGPM correctly identifies the drug addition that inhibits the side effect among feature‑identical candidates, a discrimination existing methods cannot make. The code and data are in https://github.com/KryieZhao/HGPM.git.
Authors:Aiden Yiliu Li, Nels Numan, Anthony Steed
Abstract:
Long video understanding requires more than large context windows. It also needs a memory mechanism that decides what visual evidence to retain, keeps it searchable over long horizons, and grounds later reasoning in recoverable observations rather than compressed latent state alone. We propose Visual Agentic Memory (VAM), a training‑free framework with three components. Online Indexing supports selective evidence retention under streaming constraints. Hierarchical Memory organises retained evidence in a Parallel Representation that aligns temporal context with spatial observations. Agentic Retrieval searches, inspects, and verifies candidate evidence before producing a grounded answer. On OVO‑Bench, VAM achieves the highest RT+BT average (68.41) across all reported baselines, improving over end‑to‑end use of the same underlying MLLM (Gemini 3 Flash, 67.46). On the month‑scale split of MM‑Lifelong train@month (105.6 hours over 51 days), VAM reaches 17.11%, second only to ReMA with GPT‑5 (17.62%). These results suggest that long‑horizon video understanding benefits from treating visual memory as an explicit, inspectable, and queryable substrate. Code is available at https://github.com/yiliu‑li/Visual‑Agentic‑Memory.
Authors:Youngin Kim, Ray Sun, Inho Kim, Bumsoo Park, Hyun Oh Song
Abstract:
Token‑based transformer world models have shown strong performance in visual reinforcement learning, but often suffer from temporal inconsistency in long‑horizon rollouts, including object duplication, disappearance, and transmutation. A key reason is that most existing approaches treat next‑frame prediction purely as a token generation problem, without considering the persistence of tokens across time. We introduce Identifiable Token Correspondence (ITC), a decoding step for token‑based transformer world models that formulates next‑frame prediction as a structured assignment problem with latent token correspondence variables: each next‑frame token is explained either by copying a token from the previous frame or by generating a new one. ITC leaves the transformer architecture and training procedure unchanged and can be added on top of existing backbones. Our experiments show state‑of‑the‑art performance on 4 challenging benchmarks. The proposed method achieves a return of 72.5% and a score of 35.6% on the Craftax‑classic benchmark, significantly surpassing the previous best of 67.4% and 27.9%. We release our source code on https://github.com/snu‑mllab/Identifiable‑Token‑Correspondence.
Authors:Jongho Yoon, Jinsung Jeon, Seokhyeong Kang
Abstract:
Macro placement is a pivotal stage in VLSI physical design, fundamentally determining the overall chip performance. Recent data‑driven placement methods have demonstrated significant potential, yet they often struggle to handle sequential dependencies and to balance topological connectivity with physical constraints. To bridge this gap, we propose MacroDiff+, a physics‑guided geometric diffusion framework. Specifically, we design a dual‑domain denoising architecture that couples topological connectivity encoded by heterogeneous GNNs with global geometric context modeled by a Transformer. Furthermore, we introduce Physics‑Guided Sampling, an inference strategy that actively steers the generation using explicit gradients to ensure both statistical plausibility and physical validity. On the ISPD2005 MMS benchmarks, MacroDiff+ outperforms state‑of‑the‑art baselines with a 6.1‑6.2% reduction in wirelength. Notably, it exhibits superior stability and scalability on large‑scale designs where prior methods fail to converge. The source code is available at https://github.com/jhy00n/MacroDiff‑plus.
Authors:Jinhao Jing, Zheng Ma, Jinwei Liang, Qiannian Zhao, Shawn Chen, Jing Yang, Por Lip Yee, Prayag Tiwari, Jingjing Bai, Benyou Wang, Lewei Lu, Zhan Su
Abstract:
Large Multimodal Models (LMMs) often struggle with geometric reasoning due to visual hallucinations and a lack of mathematically precise Chain‑of‑Thought (CoT) data. To address this, we propose the GeoSym Engine, an automated and scalable neuro‑symbolic framework. By leveraging a type‑conditional grammar and an analytic SymGT Solver, it derives exact symbolic ground truths and seamlessly integrates with a robust rendering pipeline to produce high‑precision geometric diagrams. Using this engine, we construct GeoSym127K, a difficulty‑stratified dataset featuring 51K high‑resolution images, 127K questions with symbolic ground truths, and 55K answer‑verified CoT QA pairs. We also introduce GeoSym‑Bench, an expert‑curated suite of 511 complex samples for rigorous evaluation. Through extensive supervised fine‑tuning (SFT), we demonstrate that GeoSym drives concentrated improvements specifically on diagram‑dependent and multi‑step geometry tasks. Our Qwen3‑VL‑8B model gains an absolute +22.21% on the MathVerse Vision‑Only subset and reaches 61.52% (+6.19% improvement) on WeMath, mitigating long‑horizon logic fragmentation and outperforming advanced closed‑source models like Doubao‑1.8. Furthermore, applying Reinforcement Learning with Verifiable Rewards (RLVR) via GRPO reveals that initializing from structural SFT checkpoints substantially elevates the performance ceiling over zero‑shot RL. Driven by deterministic exact‑match signals, this showcases the robust scaling potential of our verifiable reasoning synthesis. Datasets and code are available at https://huggingface.co/datasets/Tomie0506/GeoSym127K and https://github.com/Tomie56/GeoSym127K.
Authors:Chang Che, Ziqi Wang, Hui Ma, Cheems Wang, Zenglin Shi
Abstract:
Continual Visual Instruction Tuning (CVIT) enables Multimodal Large Language Models to incrementally acquire new abilities. However, existing CVIT methods operate under a restrictive task‑incremental setting, where each training phase corresponds to a single, predefined task. This does not reflect real‑world conditions, where data arrives as a continuous stream of interleaved and dynamically evolving tasks. To bridge this gap, we introduce Streaming CVIT (StrCVIT), a more general and realistic setting where models learn from a stream of data chunks containing a dynamic mixture of tasks. In StrCVIT, a model must simultaneously acquire new abilities, reinforce recurring abilities, and mitigate forgetting. Existing CVIT methods fail here as they cannot reliably distinguish or adapt to the heterogeneous task samples within each chunk. We therefore propose StrLoRA, a regularized two‑stage expert routing framework. StrLoRA first performs task‑aware expert selection using the textual instruction to activate a sparse subset of relevant experts, reducing cross‑task interference. It then applies token‑wise expert weighting within this subset, where contribution weights are computed via cross‑modal attention between local visual tokens and the global instruction representation. To maintain stability across the non‑stationary stream, a routing‑stability regularization aligns current routing distributions with a historical exponential moving average reference. Extensive experiments on a newly developed StrCVIT benchmark show that StrLoRA substantially outperforms existing methods, effectively enhancing model's abilities from continuously evolving data streams. The code is available at https://github.com/chanceche/StrCVIT.
Authors:Aizierjiang Aiersilan
Abstract:
Large language models (LLMs) have made fluent essay writing, code drafting, and quiz answering instantly available to students at every level, from secondary school through graduate study. Many educators do not object to LLM use \emphper~se; what they need to detect is the case in which a student pastes the assignment prompt into a chatbot and submits the model's reply verbatim, without engaging with the work. Existing post‑hoc AI‑text detectors remain unreliable and have been shown to penalise non‑native English writers, while output‑side watermarks require cooperation from the model provider. We propose an alternative that the educator controls directly: an input‑side watermark in which an invisible instruction is embedded inside the visible assignment prompt itself. An LLM that ingests the prompt verbatim quietly reads the hidden instruction and writes a tell‑tale signature into its reply, exposing the copy‑and‑paste pathway specifically. We describe SteganoPrompt, a single‑page, zero‑dependency web tool that encodes an arbitrary printable‑ASCII payload into the deprecated Unicode Tags block (\textttU+E0000‑‑\textttU+E007F). The encoded string is visually identical to the original, survives common copy‑paste channels (Word, Google Docs, PDF, Markdown, Slack, e‑mail, the major learning‑management systems), and is reliably tokenized by frontier models. We evaluate compliance across seven LLM families and a representative set of educational content channels. The work is informed by my experience as a graduate teaching assistant for an undergraduate software engineering course at the George Washington University. The tool is released under the MIT licence at \urlhttps://ezharjan.github.io/SteganoPrompt/.
Authors:Safayat Bin Hakim, Keyan Guo, Wenkai Tan, Alvaro Velasquez, Shouhuai Xu, Houbing Herbert Song
Abstract:
LLM‑based agents can recover from individual execution errors, yet they repeatedly fail on the same fault when the underlying process knowledge‑‑operator schemas, preconditions, and constraints‑‑remains unrepaired. Existing self‑evolving approaches address this gap by updating prompts, memory, or model weights, but none directly repair the symbolic structures that encode how tasks are executed, and few provide the governance guarantees required for safe deployment. We introduce ANNEAL, a neuro‑symbolic agent that converts recurring failures into governed symbolic edits of a process knowledge graph without modifying foundation model weights. Its core mechanism, Failure‑Driven Knowledge Acquisition (FDKA), localizes the responsible operator, synthesizes a typed patch through constrained LLM generation, and validates the proposal via multi‑dimensional scoring, symbolic guardrails, and canary testing before commit. Every accepted edit carries full provenance and deterministic rollback capability. Across four domains and 27 multi‑seed runs, ANNEAL is the only evaluated system that commits persistent structural repairs‑‑strong baselines such as ReAct and Reflexion achieve high episodic recovery yet retain 72‑100% holdout failure rates on recurring faults, whereas ANNEAL reduces these to 0% in the tested recurring‑failure settings. Ablation confirms that removing FDKA eliminates all structural repairs and drops success rate by up to 26.7 percentage points. These results suggest that governed symbolic repair offers a complementary paradigm to weight‑level and prompt‑level adaptation for persistent fault elimination.
Authors:Allen Lu, Isabella Luong, Joyee Chen
Abstract:
Single‑turn benchmarks such as AnimalHarmBench (AHB) have established important baselines for measuring animal welfare alignment in large language models (LLMs), but they miss a critical failure mode: models that respond appropriately when unpressured may capitulate when follow‑up conversational turns introduce economic, social, or authority‑based arguments. We introduce MANTA (Multi‑turn Assessment for Nonhuman Thinking and Alignment), a dynamic multi‑turn evaluation framework built on the Inspect AI platform that stress‑tests frontier LLMs across realistic professional and everyday scenarios using adversarially generated follow‑up questions. Unlike static benchmarks, MANTA generates pressure turns dynamically from each model's actual responses, producing targeted and realistic adversarial pressure. The framework evaluates models across up to 13 AHB‑derived scoring dimensions on a continuous 0‑1 scale. We present preliminary results from evaluations of claude‑sonnet‑4‑20250514 and openai/gpt‑4o, revealing consistent patterns: Turn 1 welfare framing is reliable but Turn 2 introduces substantial variance; evidence‑based capacity attribution is the weakest dimension across all models and runs; and AI governance scenarios elicit significantly stronger welfare reasoning (mean score 0.91) than first‑order practical scenarios. We additionally present STYLEJUDGE, a controlled four‑judge study demonstrating systematic format bias in LLM‑as‑judge evaluation, with directly actionable implications for MANTA's scorer design. Code, dataset, and evaluation logs are available at https://github.com/Mycelium‑tools/manta.
Authors:Ashwin Aravind
Abstract:
The safety of autonomous AI agents is increasingly recognized as a critical open problem. As agents transition from passive text generators to active actors capable of executing shell commands, modifying files, calling APIs, and browsing the web, the consequences of unsafe or adversarially manipulated behavior become immediate and tangible. Existing AI safety work has focused primarily on model alignment and input filtering, but these approaches do not address what happens at the moment an agent's intent becomes a real action on a real machine. This gap is especially acute in local environments, where developers run agents against their own filesystems, credentials, and infrastructure with little runtime control. This paper introduces AgentWall, a runtime safety and observability layer for local AI agents. AgentWall intercepts every proposed agent action before it reaches the host environment, evaluates it against an explicit declarative policy, requires human approval for sensitive operations, and records a complete execution trail for audit and replay. It is implemented as a policy‑enforcing MCP proxy and native OpenClaw plugin, working across Claude Desktop, Cursor, Windsurf, Claude Code, and OpenClaw with a single install command. We present the design, architecture, threat model, and policy model of AgentWall, and demonstrate 92.9% policy enforcement accuracy with sub‑millisecond overhead across 14 benchmark tests. AgentWall is open‑source at https://github.com/agentwall/Agentwall.
Authors:Yuqi Wu, Tianyu Hu, Wenzhao Zheng, Yuanhui Huang, Haowen Sun, Jie Zhou, Jiwen Lu
Abstract:
Reconstructing coherent 3D geometry and appearance from unposed multi‑view images is a fundamental yet challenging problem in computer vision. Most existing visual geometry foundation models predict explicit geometry by regressing pixel‑aligned pointmaps, often suffering from redundancy and limited geometric continuity. We propose IVGT, an Implicit Visual Geometry Transformer that implicitly models continuous and coherent geometry from pose‑free multi‑view images. This formulation learns a continuous neural scene representation in a canonical coordinate system and supports continuous spatial queries at any 3D positions, retrieving local features to predict signed distance (SDF) values and colors using lightweight decoders. It allows direct extraction of continuous and coherent surface geometry, enabling rendering of RGB images, depth maps, and surface normal maps from arbitrary viewpoints. We train IVGT via multi‑dataset joint optimization with 2D supervision and 3D geometric regularization. IVGT demonstrates generalization across scenes and achieves strong performance on various tasks, including mesh and point cloud reconstruction, novel view synthesis, depth and surface normal estimation, and camera pose estimation.
Authors:Xavier Theimer-Lienhard, Mushtaha El-Amin, Fay Elhassan, Sahaj Vaidya, Victor Cartier-Negadi, David Sasu, Lars Klein, Mary-Anne Hartley
Abstract:
Clinical decision support systems (CDSS) require scrutable, auditable pipelines that enable rigorous, reproducible validation. Yet current LLM‑based CDSS remain largely opaque. Most "open" models are open‑weight only, releasing parameters while withholding the data provenance, curation procedures, and generation pipelines that determine model behavior. Fully Open (FO) models, which expose the complete training stack end‑to‑end, do not currently exist in medicine. We introduce Fully Open Meditron, the first fully open pipeline for building LLM‑CDSS, comprising a clinician‑audited training corpus, a reproducible data construction and training framework, and a use‑aligned evaluation protocol. The corpus unifies eight public medical QA datasets into a normalized conversational format and expands coverage with three clinician‑vetted synthetic extensions: exam‑style QA, guideline‑grounded QA derived from 46,469 clinical practice guidelines, and clinical vignettes. The pipeline enforces system‑wide decontamination, gold‑label resampling of teacher generations, and end‑to‑end validation by a four‑physician panel. We evaluate using an LLM‑as‑a‑judge protocol over expert‑written clinical vignettes, calibrated against 204 human raters. We apply the recipe to five FO base models (Apertus‑70B/8B‑Instruct, OLMo‑2‑32B‑SFT, EuroLLM‑22B/9B‑Instruct). All MeditronFO variants are preferred over their bases. Apertus‑70B‑MeditronFO improves +6.6 points over its base (47.2% to 53.8%) on aggregate medical benchmarks, establishing a new FO SoTA. Gemma‑3‑27B‑MeditronFO is preferred over MedGemma in 58.6% of LLM‑as‑a‑judge comparisons and outperforms it on HealthBench (58% vs 55.9%). These results show that fully open pipelines can achieve state‑of‑the‑art domain‑specific performance without sacrificing auditability or reproducibility.
Authors:Arquimedes Canedo
Abstract:
LLM agents routinely serve as first (and sometimes only) readers of academic papers, skimming for sub‑claims, extracting reproducibility steps, and generalizing scope. Standard prose papers produce recurring failures in this role: sub‑claims that cannot be cited at sub‑paper granularity, scope overextension beyond what the paper tests, and figure commands buried in codebases rather than the paper itself. We propose `paper.json`, a companion JSON file that travels with the PDF and addresses each failure with a lightweight convention: stable claim IDs (C1), an explicit does‑not‑claim list (C2), exact per‑figure shell commands (C3), and stable definition IDs (C5). A fifth convention (C4) holds that minimum viable compliance, hand‑written JSON alongside the PDF, is achievable in under an hour for a finished paper without touching the human‑readable output. C1, C2, C3, and C5 are open invitations: an agent that reads a compliant paper and acts on it produces evidence for or against them. This paper is itself compliant: `uv run validator.py paper.json ‑‑against paper.typ` passes. Repo: https://github.com/arquicanedo/paper‑json
Authors:Zhipei Xu, Xuanyu Zhang, Youmin Xu, Qing Huang, Shen Chen, Taiping Yao, Shouhong Ding, Jian Zhang
Abstract:
Diffusion‑based image synthesis has made AI‑generated images (AIGI) increasingly photorealistic, raising urgent concerns about authenticity in applications such as misinformation detection, digital forensics, and content moderation. Despite the substantial advances in AIGI detection, how to correct detected AI‑generated images with visible artifacts and restore realistic appearance remains largely underexplored. Moreover, few existing work has established the connection between AIGI detection and artifact correction. To fill this gap, we propose GenShield, a unified autoregressive framework that jointly performs explainable AIGI detection and controllable artifact correction in a closed loop from diagnosis to restoration, revealing a mutually reinforcing relationship between these two tasks. We further introduce a Visual Chain‑of‑Thought based curriculum learning strategy that enables self‑explained, multi‑step ``diagnose‑then‑repair'' correction with an explicit stopping criterion. A high‑quality dataset with large‑scale ``artifact‑restored'' pairs is also constructed alongside a unified evaluation pipeline. Extensive experiments on our correction benchmark and mainstream AIGI detection benchmarks demonstrate state‑of‑the‑art performance and strong generalization of our method. The code is available at https://github.com/zhipeixu/GenShield.
Authors:Till Beemelmanns, Shayan Sharifi, Manas Mehrotra, Ayushman Choudhuri, Lutz Eckstein
Abstract:
Deep Neural Networks have become the dominant solution for Autonomous Driving perception, but their opacity conflicts with emerging Trustworthy AI guidelines and complicates safety assurance, debugging, and human oversight. While theoretical frameworks for safe and Explainable AI (XAI) exist, concrete implementations of Trustworthy AI for 3D scene understanding remain scarce. We address this gap by proposing a Trustworthy AI perception module that is remarkably robust, integrates faithful explainability, and calibrated uncertainty estimates. Building on a transformer‑based detector, we derive explanation from the attention mechanism at inference time and validate their faithfulness using perturbation‑based consistency tests. We further integrate an uncertainty estimation and calibration module, and apply robustness‑enhancing training methods. Experiments show faithful saliency behavior, improved robustness, and well‑calibrated uncertainty estimates. Finally, we deploy these Trustworthy AI elements in a prototype vehicle and provide an XAI Interface that visualizes documentation artifacts, model uncertainty state, and saliency maps, demonstrating the feasibility of trustworthy perception monitoring in real time. Supplementary materials are available at https://tillbeemelmanns.github.io/trustworthy_ai/ .
Authors:Yiming Zhao, Yu Zeng, Wenxuan Huang, Zhen Fang, Qing Miao, Qisheng Su, Jiawei Zhao, Jiayin Cai, Lin Chen, Zehui Chen, Yukun Qi, Yao Hu, Xiaolong Jiang, Feng Zhao
Abstract:
Large Vision‑Language Models (LVLMs) have shown significant progress in video understanding, yet they face substantial challenges in tasks requiring precise spatiotemporal localization at the instance level. Existing methods primarily rely on text prompts for human‑model interaction, but these prompts struggle to provide precise spatial and temporal references, resulting in poor user experience. Furthermore, current approaches typically decouple visual perception from language reasoning, centering reasoning around language rather than visual content, which limits the model's ability to proactively perceive fine‑grained visual evidence. To address these challenges, we propose VideoSeeker, a novel paradigm for instance‑level video understanding through visual prompts. VideoSeeker seamlessly integrates agentic reasoning with instance‑level video understanding tasks, enabling the model to proactively perceive and retrieve relevant video segments on demand. We construct a four‑stage fully automated data synthesis pipeline to efficiently generate large‑scale, high‑quality instance‑level video data. We internalize tool‑calling and proactive perception capabilities into the model via cold‑start supervision and RL training, building a powerful video understanding model. Experiments demonstrate that our model achieves an average improvement of +13.7% over baselines on instance‑level video understanding tasks, surpassing powerful closed‑source models such as GPT‑4o and Gemini‑2.5‑Pro, while also showing effective transferability on general video understanding benchmarks. The relevant datasets and code will be released publicly.
Authors:Gwenolé Quellec
Abstract:
Learning latent representations from complex data is central to modern machine learning, spanning temporal, multimodal, and partially observed systems. In such settings, representations are better understood as latent states capturing underlying system dynamics, rather than as mere compressed summaries of observations. Yet current approaches remain fragmented, relying on distinct ‑‑ and often implicit ‑‑ assumptions about what these states should represent. We argue that this fragmentation reflects a more fundamental limitation: latent representations are typically learned from underconstrained objectives that fail to specify the properties that meaningful latent states should satisfy. As a result, multiple representations can satisfy the same objective, leading to ambiguity in their structure and interpretation. While many of the underlying principles have been explored in isolation, their interactions have not been explicitly formalized. In this work, we propose constrained latent state modeling (CLSM) as a unifying perspective. We identify a set of core properties ‑‑ predictive sufficiency, minimality, temporal coherence, observation compatibility, invariance to nuisance factors, and structural constraints ‑‑ and show that they are intrinsically coupled through fundamental trade‑offs. Revisiting major modeling families through this lens, we show that existing approaches can be interpreted as enforcing different subsets of constraints, thereby occupying distinct regions of a common design space. This perspective reframes persistent challenges such as lack of identifiability as consequences of underconstrained formulations, rather than isolated technical limitations. More broadly, CLSM provides a principled framework to make design choices explicit, to analyze trade‑offs, and to guide the development of more interpretable, robust, and task‑aligned latent state models.
Authors:Dillon Z. Chen, Till Hofmann, Toryn Q. Klassen, Sheila A. McIlraith
Abstract:
We tackle the challenge of building embodied AI agents that can reliably solve long‑horizon planning problems. Imitation learning from demonstrations has shown itself to be effective in training robots to solve a diversity of complex tasks requiring fine motor control and manipulation over low‑level (LL), continuous environments. Yet, it remains a difficult endeavour to generate long‑horizon plans from imitation learning alone. In contrast, high‑level (HL), symbolic abstractions facilitate efficient and interpretable long‑horizon planning. We propose to combine the strengths of LL imitation learning for manipulation and control, and HL symbolic abstractions for long‑horizon planning. We realise this idea via \emphbilevel policies of the form (π^\mathrmhl, π^\mathrmll), consisting of a neural policy π^\mathrmll learned from LL demonstrations, and an HL symbolic policy π^\mathrmhl that is constructed from symbolic abstractions of the LL demonstrations combined with inductive generalisation. We implement these ideas in the BISON system. Experiments on extended MetaWorld benchmarks demonstrate that BISON generalises to long horizons and problems with greater numbers of objects than those solved by VLA and end‑to‑end methods, and is more time and memory efficient in training and inference. Notably, when ignoring LL execution, BISON's HL policies can solve HL problems with 10,000 relevant objects in under a minute. Project page: https://dillonzchen.github.io/bison
Authors:Davide Buoso, Andrea Protopapa, Stefano Di Carlo, Francesca Pistilli, Giuseppe Averta
Abstract:
Learning visuomotor policies from scarce expert demonstrations remains a core challenge in robotic manipulation. A primary hurdle lies in distilling high‑dimensional RGB representations into control‑relevant geometry without overfitting. While using frozen pre‑trained Vision Foundation Models (VFMs) improves data efficiency, it also shifts most task adaptation onto a small spatial pooling module, which can latch onto task‑irrelevant shortcuts and lose geometric grounding when finetuned with few data samples. More broadly, pre‑trained visual representations used for policy learning have been observed to struggle under even minor scene perturbations, highlighting the need for robustness‑oriented inductive biases. We propose Geometric Anchor Pre‑training (GAP), a simple, action‑free warm‑up stage that regularizes the spatial adapter before downstream imitation learning. GAP pre‑trains the pooling layer on a lightweight simulated proxy task where object masks are available at no cost, encouraging the adapter to produce keypoints that lie on the object, cover its spatial extent, and remain sharp and repeatable over time. This yields stable geometric anchors that provide a reliable coordinate interface for few‑shot policy learning, while keeping the VFM frozen. We evaluate GAP on RoboMimic and ManiSkill under severe data scarcity (15‑50 demonstrations) and domain shift. A simple adapter regularized with GAP consistently outperforms stronger attention‑based poolers and end‑to‑end fine‑tuning, achieving 62% success on RoboMimic Can with 15 demonstrations (+16% over AFA), 63% on the long‑horizon high‑precision Tool Hang task with 50 demonstrations, and 61% on ManiSkill StackCube with 30 demonstrations (+11% over full fine‑tuning). The proxy stage is lightweight and fully decoupled from downstream tasks, making it practical to reuse across environments and manipulation skills.
Authors:Jianlin Ye, Christos Kyrkou, Panayiotis Kolios
Abstract:
The integration of Unmanned Aerial Vehicles(UAVs) into Intelligent Transportation Systems (ITS) offers synoptic visibility for traffic monitoring, yet scalable deployment is hindered by trajectory fragmentation, where vehicle identity persistence is lost across multi‑UAV Fields of View (FOV). While state‑of‑the‑art frameworks excel in optimizing local trajectory extraction and stability for single‑drone imagery, they often function as isolated data silos that generate disjointed trajectories, thereby precluding network‑level analysis such as Origin‑Destination estimation. This paper presents a real‑time Multi‑Camera Multi‑Vehicle Tracking (MCMT) system designed to handle global identity persistence. Addressing the visual ambiguity and computational cost of appearance‑based Re‑Identification (Re‑ID) in nadir views, we introduce a lightweight Topology‑Based Spatiotemporal Handover mechanism. We implement a high‑throughput parallel pipeline leveraging YOLO11 and ByteTrack to process concurrent 4K streams. Our core contribution is a deterministic queue‑based matching algorithm that utilizes geometric overlaps and virtual lane discretization to predictively manage identity handover via FIFO queues. Experimental results on complex urban environments, including intersections and merging traffic, demonstrate a Handover Success Rate (HOSR) of 99.8% in continuous traffic flows, significantly outperforming Re‑ID baselines (74.1%) while validating edge deployment feasibility. The source code is available at https://github.com/JYe9/multi‑camera‑multi‑vehicle‑tracking‑system.
Authors:Kean Shi, Zihang Li, Tianyi Ma, Zengji Tu, Jialong Wu, Xinbo Xu, Qingyao Yang, Ruoyu Wu, Weichu Xie, Ming Wu, Jason Zeng, Michael Heinrich, Elvis Zhang, Liang Chen, Kuan Li, Baobao Chang
Abstract:
Computer‑Using Agents (CUAs) are rapidly extending large language models (LLMs) beyond text‑based reasoning toward action execution in more complex environments, such as web browsers and graphical user interfaces (GUIs). However, existing web and GUI agent benchmarks often rely on simplified settings, isolated tasks, or short‑horizon interactions, making it difficult to assess capabilities of agents in realistic professional workflows. Software‑as‑a‑Service (SaaS) environments are a natural choice for CUA evaluation, as they host a large share of modern digital work and naturally involve dynamic system states, cross‑application coordination, domain‑specific knowledge, and long‑horizon dependencies. To this end, we introduce SaaS‑Bench, a benchmark built on 23 deployable SaaS systems across six professional domains, containing 106 tasks grounded in realistic work scenarios. These tasks require long‑horizon execution, cover both text‑only and multimodal settings, and are evaluated with weighted verification checkpoints that measure strict task completion and partial progress. Experiments show that representative LLM‑based agents struggle on SaaS‑Bench, with even the strongest model completing fewer than 4% of tasks end‑to‑end, exposing limitations in planning, state tracking, cross‑application context maintenance, and error recovery. Code are available at https://github.com/UniPat‑AI/SaaS‑Bench for reproduction.
Authors:Junho Kim, Xu Cao, Houze Yang, Bikram Boote, Ana Jojic, Fiona Ryan, Bolin Lai, Sangmin Lee, James M. Rehg
Abstract:
Understanding social interactions requires reasoning over subtle non‑verbal cues, yet current multimodal large language models (MLLMs) often fail to identify who interacts with whom in multi‑person videos. We introduce GRASP, a large‑scale social reasoning dataset that connects high‑level social QA with fine‑grained gaze and deictic gesture events. GRASP contains 290K question‑‑answer pairs over 46K videos totaling 749 hours, organized by a 16‑category taxonomy spanning gaze, gesture, and joint gaze‑‑gesture reasoning, together with GRASP‑Bench for evaluation. Unlike prior resources that focus on either isolated cues or high‑level social QA, GRASP builds questions from identity‑consistent gaze trajectories, deictic gestures, and their joint compositions into social events. Moreover, we propose Social Grounding Reward (SGR), a learning signal that uses these social events to encourage models to reason about the participants involved in each interaction. Experiments show that SGR improves performance on GRASP‑Bench while maintaining zero‑shot performance on related social video QA benchmarks.
Authors:Huanyang Tong, Kai Liu, Fangjun Kuang, Huiling Chen
Abstract:
Biomedical Vision‑‑Language Models (VLMs) have shown remarkable promise in few‑shot medical diagnosis but face a critical bottleneck: fragility to prompt variations.Existing adaptation frameworks typically optimize visual and textual prompts as independent streams, relying on ideal ``Golden Prompts''. In clinical reality, where descriptions are often noisy and heterogeneous, this modality isolation leads to unstable cross‑modal alignment. To address this, we propose BiomedAP, a vision‑informed dual‑anchor framework with gated cross‑modal fusion.BiomedAP enforces synergistic alignment through two mechanisms: (1) Gated Cross‑Modal Fusion, which enables layer‑wise interaction between modalities, acting as a dynamic noise regulator to suppress irrelevant textual cues; and (2) a Dual‑Anchor Constraint that regularizes learnable prompts toward stable semantic centroids derived from both expert templates (High Anchors) and few‑shot visual prototypes (Low Anchors). Extensive experiments across 11 benchmarks demonstrate that BiomedAP consistently surpasses baselines, achieving competitive few‑shot accuracy and markedly enhanced robustness under prompt perturbations. Our code is available at: https://github.com/tongdiedie/BiomedAP. Keywords: Vision‑Language Models; Prompt Learning; Parameter‑Efficient Fine‑Tuning; Few‑shot Learning
Authors:Chanuk Lee, Sangwoo Park, Minki Kang, Sung Ju Hwang
Abstract:
Reinforcement learning with verifiable rewards (RLVR) has emerged as a scalable paradigm for improving the reasoning capabilities of large language models. However, its effectiveness is fundamentally limited by exploration: the policy can only improve on trajectories it has already sampled. While increasing the number of rollouts alleviates this issue, such brute‑force scaling is computationally expensive, and existing approaches that modify the optimization objective provide limited control over what is explored. In this work, we propose NudgeRL, a framework for structured and diversity‑driven exploration in RLVR. Our approach introduces Strategy Nudging, which conditions each rollout on lightweight, strategy‑level contexts to induce diverse reasoning trajectories without relying on expensive oracle supervision. To effectively learn from such structured exploration, we further propose a unified objective, which decomposes the reward signal into inter‑ and intra‑context components and incorporates a distillation objective to transfer discovered behaviors back to the base policy. Empirically, NudgeRL outperforms standard GRPO with up to 8 times larger rollout budgets, while outperforming oracle‑guided RL baseline on average across five challenging math benchmarks. These results demonstrate that structured, context‑driven exploration can serve as an efficient and scalable alternative to both brute‑force rollout scaling and feasibility‑oriented methods based on privileged information. Our code is available at https://github.com/tally0818/NudgeRL.
Authors:Yan Luo, Ahmadou Aidara, Jingyi Lu, Jeremy Moebel, Kai Han, Mengyu Wang
Abstract:
Classifier‑free guidance (CFG) is the primary control over how strongly text semantics move a flow‑based sampler, yet standard practice holds its scale fixed across the entire ODE trajectory. This is a fundamental mismatch: early steps are noise‑dominated and carry weak semantic signal, while late steps commit image structure and demand stronger directional commitment; more critically, the value of any guidance strength depends on whether the guided velocity is consistent with the model's current dynamics or working against them. We propose Velocity‑Adaptive Guidance Scale (VAGS), a training‑free replacement that multiplies the nominal scale by a bounded factor combining a temporal signal‑level term with the cosine similarity between task‑relevant velocity fields. For inversion‑free editing, VAGS measures the alignment between source‑ and target‑guided velocities, so edit strength at each step reflects local compatibility between preservation and transformation. For generation, VAGS‑Gen uses the alignment between unconditional and conditional velocities as the analogous signal. Neither variant requires fine‑tuning, auxiliary networks, or extra forward passes, and fixed CFG is recovered as a special case. On PIE‑Bench and DIV2K for editing, and COCO17, CUB‑200, and Flickr30K for generation, VAGS consistently improves structural fidelity and generation quality over fixed CFG and recent training‑free guidance variants. The code is publicly available at https://github.com/Harvard‑AI‑and‑Robotics‑Lab/Velocity_Adaptive_Guidance_Scale.
Authors:Hao Wang, Kuang Zhang, Yonggang Chi, Tianqi Zhao, Yanbo Fu, Jiaxing Guo
Abstract:
Under the trend of multi‑waveform coexistence in 6G IoT, intelligent receivers must first identify physical‑layer waveform types before performing correct demodulation and resource scheduling. However, existing signal identification research largely focuses on symbol‑level modulation classification. Research directly targeting physical‑layer waveform types (e.g., OFDM, OTFS, LoRa) is not only extremely scarce but also heavily reliant on deep neural networks and complex time‑frequency transforms, making deployment on resource‑constrained terminals difficult. Symbol modulation classification methods themselves cannot circumvent the prerequisite of ``waveform identification first.'' To address this dual gap, we propose an ultra‑lightweight waveform classification framework based on time‑frequency multidimensional features with a cooperative Z‑test tree (ZTree). The framework employs low‑complexity time‑domain feature extraction, and the classification backend adopts a ZTree optimized by Z‑statistical testing, which uses hypothesis testing confidence to automatically control decision tree splitting and size, ensuring efficient execution on resource‑limited processors. Tested on ten 6G candidate waveforms including OFDM, OTFS, DSSS, LoRa, and NB‑IoT, the method achieves 99.5% average accuracy under AWGN and 87.4% under TDL‑C multipath channels, with main confusion between OTFS and LoRa. Implemented in C on an x86 platform, single inference latency is under 4~ms. To the best of our knowledge, this is the first work achieving real‑time recognition of ten IoT waveform types. Future work will target deployment acceleration on embedded MCUs. Code and dataset are open‑sourced at: https://github.com/Einstein‑sworder/IoT‑wave.
Authors:Hojun Chung, Junseo Lee, Songhwai Oh
Abstract:
Model‑based reinforcement learning (RL) offers a compelling approach to offline RL by enabling value learning on imagined on‑policy trajectories. However, it often suffers from compounding errors due to repeated model inference on self‑generated states. While geometric horizon models (GHM) alleviate this issue through direct prediction over a discounted infinite‑horizon future, they remain challenged in accurately modeling distant future states. To this end, we introduce universal horizon models (UHM), a generalization of GHM that directly predicts future states under arbitrary horizons. Leveraging this flexibility, we propose a scalable value learning method that employs a winsorized horizon distribution to stabilize training by capping excessively large horizons. Experimental results on 100 challenging OGBench tasks demonstrate that the proposed method outperforms competitive baselines, particularly on tasks with highly suboptimal datasets and those requiring long‑horizon reasoning. Project page: https://rllab‑snu.github.io/projects/UHM/
Authors:Mingtong Dai, Guanqi Peng, Yongjie Bai, Feng Yan, Chunjie Chen, Lingbo Liu, Liang Lin, Xinyu Wu
Abstract:
Previous imitation learning policies predict future actions at every control step, whether in smooth motion phases or precise, contact‑rich operation phases. This uniform treatment is wasteful: most steps in a manipulation trajectory traverse free space and carry little task‑relevant information, while a small fraction of \emphkey steps around contacts, grasps, and alignment demand dense, high‑resolution prediction. We propose a novel \emphaction relabeling mechanism: at each timestep in a skip segment, we replace the behavior cloning target with the action at the entrance of the next key segment, enabling the policy to leap over redundant steps in a single decision. The resulting Skip Policy (SkiP) dynamically leaps over skip segments and intensively refines actions in key segments, within a single unified network requiring no learned skip planner or hierarchical structure. To automatically partition demonstrations into key and skip segments without manual annotation, we introduce \emphMotion Spectrum Keying (MSK), a fast, task‑agnostic procedure that detects local motion complexity from action signals. Extensive experiments across 72 simulated manipulation tasks and three real‑robot tasks show that SkiP reduces executed steps by 15‑‑40% while matching or improving success rates across various policy backbones. Project page: \texttthttps://pgq18.github.io/SkiP‑page/.
Authors:Shangjian Yin, Yu Fu, Yue Dong, Zhouxing Shi
Abstract:
Post‑training has become a crucial step for unlocking the capabilities of large language models, with reinforcement learning (RL) emerging as a critical paradigm. Recent RL‑based post‑training has increasingly split into two paradigms: reinforcement learning from human feedback (RLHF), which optimizes models using human preference signals in target domains, and reinforcement learning from verifiable rewards (RLVR), which operates in verifier‑backed environments. The latter has dominated recent reasoning‑oriented post‑training because it delivers stronger gains and higher efficiency on domain‑specific tasks (e.g., reasoning). However, although in‑domain RL training achieves promising performance, it still requires a substantial amount of GPU compute, which remains a major barrier to broad adoption. In this work, we study the generalization ability of RLHF learned from scratch from a small set of interactions in open‑ended environments, and investigate whether the conversational abilities it explicitly acquires can implicitly transfer to downstream tasks such as mathematical reasoning and code generation, namely GRLO. Specifically, on Qwen3‑4B‑Base backbone, GRLO improves the average performance across all domains from 24.1 to 63.1 with only 5K prompts and 22.7 GPU hours, requiring about 46× less data and 68× less compute than a strong in‑domain RLVR baseline. The resulting model is even competitive with Qwen's released post‑trained models which required a much larger training cost. Notably, a subsequent in‑domain RLVR stage brings only selective gains, mainly on harder competition‑math benchmarks. We hope GRLO offers a simple and efficient recipe for building broadly capable post‑trained models. Our code and data will be available at: \hrefhttps://github.com/SJY8460/GRLOhttps://github.com/SJY8460/GRLO.
Authors:Luca Bompani, Manuele Rusci, Luca Benini, Daniele Palossi, Francesco Conti
Abstract:
Modern smart vision sensors need on‑device intelligence to process video streams, as cloud computing is often impractical due to bandwidth, latency, and privacy constraints. However, these sensory systems typically rely on ultra‑low‑power microcontrollers (MCUs) with limited memory and compute, making conventional video object detection methods, which require feature storage or multi‑frame buffering, unfeasible. To address this challenge, we introduce Multi‑Resolution Rescored ByteTrack (MR2‑ByteTrack), a Video Object Detection (VOD) method tailored for MCU‑based embedded vision nodes. MR2‑ByteTrack reduces computational cost by alternating between full‑ and low‑resolution inference, while linking detections across frames via ByteTrack and correcting misclassifications through the Rescore algorithm, which applies probability union rules to aggregate detection confidence scores across frames. We apply our approach to both a CNN‑based detector and a Transformer‑based model, demonstrating its generality across architectures with fundamentally different spatial processing. Experiments on ImageNetVID demonstrate that MR2‑ByteTrack maintains accuracy, achieving mAP scores of up to 49.0 for the CNN‑based models and 48.7 for the Transformer, while reducing multiply‑accumulate operations by as much as 53% for the CNNs and 32% for the Transformer. When deployed on GAP9, an ultra‑low‑power RISC‑V multicore MCU, our method yields up to 55% energy savings compared to processing only full‑resolution images, enabling the first real‑time Transformer‑based VOD on an MCU‑class embedded vision node. Code available at https://github.com/Bomps4/Multi_Resolution_Rescored_ByteTrack/tree/IEEE_Access
Authors:Le Jiang, Xiangyu Bai, Bishoy Galoaa, Shayda Moezzi, Caleb James Lee, Tooba Imtiaz, Edmund Yeh, Jennifer Dy, Yanzhi Wang, Sarah Ostadabbas
Abstract:
We present PanoWorld, a panoramic video world model that generates geometry‑consistent 360\degree video from a single image and a caption. Existing panoramic video methods optimize primarily for visual realism and do not explicitly constrain the underlying 3D scene state, producing outputs that appear plausible yet exhibit inconsistent depth, broken correspondences, and implausible motion across the spherical surface. We address this gap by framing panoramic video generation as a geometry‑ and dynamics‑consistent latent state modeling problem rather than pure visual synthesis. Building on a pre‑trained perspective video world model, we introduce two lightweight regularizers: a depth consistency loss against pseudo ground‑truth panoramic depth, and a trajectory consistency loss that supervises the 3D world‑frame positions of tracked points across time. We further apply spherical‑geometry‑aware adaptation to the conditioning and positional encoding. We additionally introduce PanoGeo, a unified geometry‑aware panoramic video dataset with consistent depth, trajectory, and prompt annotations across diverse real and synthetic sources, used for both training and stratified evaluation. Experiments show that PanoWorld improves geometric consistency over prior panoramic generation methods while maintaining competitive visual realism, establishing that panoramic video generation must be treated as a geometric modeling problem to support the holistic spatial understanding requirements of embodied AI applications. Code is available at https://github.com/ostadabbas/PanoWorld.
Authors:Blaž Rolih, Matic Fučka, Filip Wolf, Luka Čehovin Zajc
Abstract:
Remote sensing change detection (RSCD) aims to localise changes between two images of the same geographic region. In practice, change masks often follow region‑level annotation conventions rather than purely local appearance differences, making them context‑dependent and occasionally ambiguous. Most state‑of‑the‑art methods utilise per‑pixel discriminative classification, which produces a single prediction per input and fails to explicitly model the changed region as a coherent whole. A natural alternative is generative formulation, which can model a distribution of plausible masks, enabling sampling to capture ambiguity and encourage global consistency. However, existing generative RSCD approaches typically lag behind strong discriminative baselines due to the high computational cost of pixel‑space generation and the complexity of their conditioning mechanisms. To address the limitations of prior discriminative and generative methods, we propose ChangeFlow, a generative framework that reformulates change detection as the synthesis of a change mask in latent space via rectified flow. ChangeFlow is guided by a structured yet lightweight conditioning signal, and its stochastic design naturally supports sampling‑based prediction ensembling. Namely, aggregating multiple predicted change masks improves robustness, while sample agreement provides a practical confidence estimation that highlights ambiguous regions. Across four benchmarks, ChangeFlow achieves an average F1 of 80.4%, improving by 1.3 points on average over the previous best method, while maintaining inference speed comparable to recent strong baselines. Project page: https://blaz‑r.github.io/changeflow_cd
Authors:Jiachen Jiang, Huminhao Zhu, Zhihui Zhu
Abstract:
LLM‑driven program evolution has emerged as a powerful tool for automated scientific discovery, yet existing frameworks offer no principled guide for designing their individual components and provide no guarantee that the search converges. We introduce SMCEvolve, which recasts program search as sampling from a reward‑tilted target distribution and approximates it with a Sequential Monte Carlo (SMC) sampler. From this view, three core mechanisms emerge as principled components: adaptive parent resampling, mixture of mutation with acceptance, and automatic convergence control. We further provide a finite‑sample complexity analysis that bounds the LLM‑call budget required to reach a target approximation error. Across math, algorithm efficiency, symbolic regression, and end‑to‑end ML research benchmarks, SMCEvolve surpasses state‑of‑the‑art evolving systems while using fewer LLM calls under self‑determined termination. The code is available at https://github.com/kongwanbianjinyu/SMCEvolve.
Authors:Gideon Popoola, John Sheppard
Abstract:
Machine learning (ML) algorithms are increasingly deployed in high‑stakes decision‑making domains such as loan approvals, hiring, and recidivism predictions. While existing fairness metrics (e.g., statistical parity, equal opportunity) effectively quantify outcome‑oriented disparities, they offer limited insight into the procedure or explanation behind biased decisions. To address this gap, we propose Group‑level Explanation Stability Disparity (GESD), a procedural‑oriented fairness metric that measures disparities in the stability, robustness, and sensitivity of model explanations across different subgroups in a protected category. %GESD is explainer‑agnostic, model‑agnostic, and extends the scope of fairness analyses to the level of explainability. We further integrate GESD into a multi‑objective optimization framework that jointly optimizes for utility, outcome‑based fairness, and explanation‑based fairness called FEU (Fairness‑‑Explainability‑‑Utility). Empirical results on multiple benchmark datasets show that GESD effectively captures group‑wise discrepancies in explanation quality, and that FEU improves both utility and fairness over state‑of‑the‑art methods. By bridging outcome‑based and explanation‑based fairness, GESD offers a comprehensive tool for diagnosing and mitigating bias in predictive modeling. Our code and datasets are available on GitHub \hyperlinkhttps://github.com/horlahsunbo/GESDhttps://github.com/horlahsunbo/GESD
Authors:Fanxu Meng
Abstract:
Multi‑head Latent Attention (MLA), the attention used in DeepSeek‑V2/V3, jointly compresses keys and values into a low‑rank latent and matches the H100 roofline almost perfectly. Its trained weights, however, expose only one decoding path ‑ an absorbed MQA form ‑ which ties efficient inference to H100‑class compute‑bandwidth ratios, forfeits tensor parallelism along the head axis, and yields no Multi‑Token Prediction (MTP) gain on commodity inference GPUs such as the export‑restricted H20. We propose Group‑Query Latent Attention (GQLA), a minimal modification of MLA whose trained weights expose two algebraically equivalent decoding paths over the same parameters: an MQA‑absorb path identical to MLA's, and a GQA path with a per‑group expanded cache. The runtime picks the path that matches the target hardware ‑ no retraining, no custom kernels ‑ so a single set of GQLA weights pins the rooflines of both H100 (MQA‑absorb, s_q=1) and H20 (GQA + MTP, s_q=2), while supporting up to 8‑way zero‑redundancy tensor parallelism on the GQA path. To avoid pretraining from scratch we extend TransMLA into TransGQLA, which converts a pretrained GQA checkpoint into a GQLA model; on LLaMA‑3‑8B it compresses the per‑token KV cache to 28.125% of the GQA baseline on the MQA‑absorb path while structurally preserving GQA‑level traffic on the per‑group path.
Authors:Jianbo Lin, Xiaomin Yu, Yi Xin, Yifu Guo, Zhuosong Jiang, Zhongqi Yue, Weishi Wang, Heqing Zou, Chengwei Qin, Hui Xiong
Abstract:
Large language model‑based agents make mistakes, yet critique can often guide the same model toward correct behavior. However, when critique is removed, the model may fail again on the same query, indicating that it has not internalized the critique's guidance into its underlying capability. Meanwhile, a frozen critic cannot improve its feedback quality over time, limiting the potential for iterative self‑improvement. To address this, we propose learning to internalize self‑critique with reinforcement learning(ICRL), a novel framework that jointly trains a solver and a critic from a shared backbone to convert critique‑induced success into unassisted solver ability. The critic is rewarded based on the solver's subsequent performance gain, incentivizing actionable feedback. To address the distribution shift between critique‑conditioned and critique‑free behavior, ICRL introduces a distribution‑calibration re‑weighting ratio that selectively transfers critique‑guided improvements compatible with the solver's own prompt distribution. Additionally, a role‑wise group advantage estimation stabilizes joint optimization across the two roles. Together, these mechanisms ensure that the solver learns to improve itself without external critique, rather than becoming dependent on critique‑conditioned behavior. We evaluate ICRL on diverse benchmarks spanning agentic and mathematical reasoning tasks, using Qwen3‑4B and Qwen3‑8B as backbones. Results show consistent improvements, with average gains of 6.4 points over GRPO on agentic tasks, and 7.0 points on mathematical reasoning. Notably, the learned 8B critic is comparable to 32B critics while using substantially fewer tokens. The code is available at https://github.com/brick‑pid/ICRL.
Authors:Duling Xu, Zheng Chen, Zaifeng Pan, Jiawei Guan, Dong Dong, Jialin Li, Bangzheng Pu
Abstract:
Recently, skills have been widely adopted in large language model (LLM)‑based agent systems across various domains. In existing frameworks, skills are typically injected into the agent reasoning loop as contextual guidance once matched to a runtime task, enabling specialized task‑solving capabilities. We find that this execution paradigm introduces two major sources of redundancy: irrelevant context injection and repeated skill‑specific reasoning and planning. To this end, we propose SkillSmith, a boundary‑first compiler‑runtime framework that compiles skill packages offline into minimal executable interfaces. By extracting fine‑grained operational boundaries from skills, SkillSmith enables agents to dynamically access and execute only the relevant components at runtime, thereby minimizing unnecessary context injection and redundant reasoning overhead. In the evaluation on SkillsBench benchmark, SkillSmith reduces solve‑stage token usage by 57.44%, thinking iterations by 42.99%, solve time by 50.57% (2.02x faster), and token‑proportional monetary cost by 57.44% compared with using raw‑skills. Moreover, compiled artifacts produced by a stronger model can be reused by a smaller or more efficient runtime model, improving task accuracy in cases where raw skill interpretation fails. The source code and data are available at https://github.com/AetherHeart‑AI/Aeloon.
Authors:Dzung Pham, Kleomenis Katevas, Ali Shahin Shamsabadi, Hamed Haddadi
Abstract:
Autonomous agents powered by large language models (LLMs) are increasingly used to automate complex, multi‑step tasks such as coding or web‑based question answering. While remote, cloud‑based agents offer scalability and ease of deployment, they raise privacy concerns, depend on network connectivity, and incur recurring API costs. Deploying agents locally on user devices mitigates these issues by preserving data privacy and eliminating usage‑based fees. However, agentic workflows are far more resource‑intensive than typical LLM interactions. Iterative reasoning, tool use, and failure retries substantially increase token consumption, often expending significant compute without successfully completing tasks. In this work, we investigate the time, token, and energy overhead of locally deployed LLM‑based agents on consumer hardware. Our measurements show that agentic execution increases GPU power draw, temperature, and battery drain compared to single‑inference workloads. To address this inefficiency, we introduce AgentStop, a lightweight efficiency supervisor that predicts and preemptively terminates trajectories unlikely to succeed. Leveraging low‑cost execution signals, such as token‑level log probabilities, AgentStop can reduce wasted energy by 15‑20% with minimal impact on task performance (<5% utility drop) for challenging web‑based question answering and coding benchmarks. These findings position predictive early termination as a practical mechanism for enabling sustainable, privacy‑preserving LLM agents on user devices. Our project code and data are available at https://github.com/brave‑experiments/AgentStop.
Authors:Ruozhen He, Meng Wei, Ziyan Yang, Vicente Ordonez
Abstract:
Multi‑shot video generation extends single‑shot generation to coherent visual narratives, yet maintaining consistent characters, objects, and locations across shots remains a challenge over long sequences. Existing evaluations typically use independently generated prompt sets with limited entity coverage and simple consistency metrics, making standardized comparison difficult. We introduce EntityBench, a benchmark of 140 episodes (2,491 shots) derived from real narrative media, with explicit per‑shot entity schedules tracking characters, objects, and locations simultaneously across easy / medium / hard tiers of up to 50 shots, 13 cross‑shot characters, 8 cross‑shot locations, 22 cross‑shot objects, and recurrence gaps spanning up to 48 shots. It is paired with a three‑pillar evaluation suite that disentangles intra‑shot quality, prompt‑following alignment, and cross‑shot consistency, with a fidelity gate that admits only accurate entity appearances into cross‑shot scoring. As a baseline, we propose EntityMem, a memory‑augmented generation system that stores verified per‑entity visual references in a persistent memory bank before generation begins. Experiments show that cross‑shot entity consistency degrades sharply with recurrence distance in existing methods, and that explicit per‑entity memory yields the highest character fidelity (Cohen's d = +2.33) and presence among methods evaluated. Code and data are available at https://github.com/Catherine‑R‑He/EntityBench/.
Authors:Ziyu Guo, Rain Liu, Xinyan Chen, Pheng-Ann Heng
Abstract:
Visual reasoning, often interleaved with intermediate visual states, has emerged as a promising direction in the field. A straightforward approach is to directly generate images via unified models during reasoning, but this is computationally expensive and architecturally non‑trivial. Recent alternatives include agentic reasoning through code or tool calls, and latent reasoning with learnable hidden embeddings. However, agentic methods incur context‑switching latency from external execution, while latent methods lack task generalization and are difficult to train with autoregressive parallelization. To combine their strengths while mitigating their limitations, we propose ATLAS, a framework in which a single discrete 'word', termed as a functional token, serves both as an agentic operation and a latent visual reasoning unit. Each functional token is associated with an internalized visual operation, yet requires no visual supervision and remains a standard token in the tokenizer vocabulary, which can be generated via next‑token prediction. This design avoids verbose intermediate visual content generation, while preserving compatibility with the vanilla scalable SFT and RL training, without architectural or methodological modifications. To further address the sparsity of functional tokens during RL, we introduce Latent‑Anchored GRPO (LA‑GRPO), which stabilizes the training by anchoring functional tokens with a statically weighted auxiliary objective, providing stronger gradient updates. Extensive experiments and analyses demonstrate that ATLAS achieves superior performance on challenging benchmarks while maintaining clear interpretability. We hope ATLAS offers a new paradigm inspiring future visual reasoning research.
Authors:Jiaxin Wu, Yihao Pi, Yinling Zhang, Yuheng Li, Xueyan Zou
Abstract:
Generative video models are increasingly studied as implicit world models, yet evaluating whether they produce physically plausible 3D structure and motion remains challenging. Most existing video evaluation pipelines rely heavily on human judgment or learned graders, which can be subjective and weakly diagnostic for geometric failures. We introduce PDI‑Bench (Perspective Distortion Index), a quantitative framework for auditing geometric coherence in generated videos. Given a generated clip, we obtain object‑centric observations via segmentation and point tracking (e.g., SAM 2, MegaSaM, and CoTracker3), lift them to 3D world‑space coordinates via monocular reconstruction, and compute a set of projective‑geometry residuals capturing three failure dimensions: scale‑depth alignment, 3D motion consistency, and 3D structural rigidity. To support systematic evaluation, we build PDI‑Dataset, covering diverse scenarios designed to stress these geometric constraints. Across state‑of‑the‑art video generators, PDI reveals consistent geometry‑specific failure modes that are not captured by common perceptual metrics, and provides a diagnostic signal for progress toward physically grounded video generation and physical world model. Our code and dataset can be found at https://pdi‑bench.github.io/.
Authors:Chenyu Lian, Hong-Yu Zhou, Jing Qin
Abstract:
Disease screening is critical for early detection and timely intervention in clinical practice. However, most current screening models for medical images suffer from limited interpretability and suboptimal performance. They often lack effective mechanisms to reference historical cases or provide transparent reasoning pathways. To address these challenges, we introduce EviScreen, an evidential reasoning framework for disease screening that leverages region‑level evidence from historical cases. The proposed EviScreen offers retrospection interpretability through regional evidence retrieved from dual knowledge banks. Using this evidential mechanism, the subsequent evidence‑aware reasoning module makes predictions using both the current case and evidence from historical cases, thereby enhancing disease screening performance. Furthermore, rather than relying on post‑hoc saliency maps, EviScreen enhances localization interpretability by leveraging abnormality maps derived from contrastive retrieval. Our method achieves superior performance on our carefully established benchmarks for real‑world disease screening, yielding notably higher specificity at clinical‑level recall. Code is publicly available at https://github.com/DopamineLcy/EviScreen.
Authors:Sining Ang, Yuguang Yang, Canyu Chen, Yan Wang
Abstract:
End‑to‑end autonomous driving planners are commonly trained by imitating a single logged trajectory, yet evaluated by rule‑based planning metrics that measure safety, feasibility, progress, and comfort. This creates a training‑‑evaluation mismatch: trajectories close to the logged path may violate planning rules, while alternatives farther from the demonstration can remain valid and high‑scoring. The mismatch is especially limiting for proposal‑selection planners, whose performance depends on candidate‑set coverage and scorer ranking quality. We propose CLOVER, a Closed‑LOop Value Estimation and Ranking framework for end‑to‑end autonomous driving planning. CLOVER follows a lightweight generator‑‑scorer formulation: a generator produces diverse candidate trajectories, and a scorer predicts planning‑metric sub‑scores to rank them at inference time. To expand proposal support beyond single‑trajectory imitation, CLOVER constructs evaluator‑filtered pseudo‑expert trajectories and trains the generator with set‑level coverage supervision. It then performs conservative closed‑loop self‑distillation: the scorer is fitted to true evaluator sub‑scores on generated proposals, while the generator is refined toward teacher‑selected top‑k and vector‑Pareto targets with stability regularization. We analyze when an imperfect scorer can improve the generator, showing that scorer‑mediated refinement is reliable when scorer‑selected targets are enriched under the true evaluator and updates remain conservative. On NAVSIM, CLOVER achieves 94.5 PDMS and 90.4 EPDMS, establishing a new state of the art. On the more challenging NavHard split, it obtains 48.3 EPDMS, matching the strongest reported result. On supplementary nuScenes open‑loop evaluation, CLOVER achieves the lowest L2 error and collision rate among compared methods. Code data will be released at https://github.com/WilliamXuanYu/CLOVER.
Authors:Mukul Ranjan, Prince Jha, Khushboo Kumari, Zhiqiang Shen
Abstract:
Vision‑Language Models (VLMs) are increasingly applied to cultural heritage materials, from digital archives to educational platforms. This work identifies a fundamental issue in how these models interpret historical artifacts. We define this phenomenon as cultural anachronism, the tendency to misinterpret historical objects using temporally inappropriate concepts, materials, or cultural frameworks. To quantify this phenomenon, we introduce the Temporal Anachronism Benchmark for Vision‑Language Models (TAB‑VLM), a dataset of 600 questions across six categories, designed to evaluate temporal reasoning on 1,600 Indian cultural artifacts spanning prehistoric to modern periods. Systematic evaluations of ten state‑of‑the‑art models reveal significant deficiencies on our benchmark, and even the best model (GPT‑5.2) achieves only 58.7% overall accuracy. The performance gap persists across varying architectures and scales, suggesting that cultural anachronism represents a significant limitation in visual AI systems, regardless of model size. These findings highlight the disparity between current VLM capabilities and the requirements for accurately interpreting cultural heritage materials, particularly for non‑Western visual cultures underrepresented in training data. Our benchmark provides a foundation for enhancing temporal cognition in multimodal AI systems that interact with historical artifacts. The dataset and code are available in our project page.
Authors:Wuyang Li, Yang Gao, Mariam Hassan, Lan Feng, Wentao Pan, Po-Chien Luan, Alexandre Alahi
Abstract:
We propose EverAnimate, an efficient post‑training method for long‑horizon animated video generation that preserves visual quality and character identity. Long‑form animation remains challenging because highly dynamic human motion must be synthesized against relatively static environments, making chunk‑based generation prone to accumulated drift: (i) low‑level quality drift, such as progressive degradation of static backgrounds, and (ii) high‑level semantic drift, such as inconsistent character identity and view‑dependent attributes. To address this issue, EverAnimate restores drifted flow trajectories by anchoring generation to a persistent latent context memory, consisting of two complementary mechanisms. (i) Persistent Latent Propagation maintains a context memory across chunks to propagate identity and motion in latent space while mitigating temporal forgetting. (ii) Restorative Flow Matching introduces an implicit restoration objective during sampling through velocity adjustment, improving within‑chunk fidelity. With only lightweight LoRA tuning, EverAnimate outperforms state‑of‑the‑art long‑animation methods in both short‑ and long‑horizon settings: at 10 seconds, it improves PSNR/SSIM by 8%/7% and reduces LPIPS/FID by 22%/11%; at 90 seconds, the gains increase to 15%/15% and 32%/27%, respectively.
Authors:Tri Cao, Yulin Chen, Hieu Cao, Yibo Li, Khoi Le, Thong Nguyen, Yuexin Li, Yufei He, Yue Liu, Shuicheng Yan, Bryan Hooi
Abstract:
Web agents can autonomously complete online tasks by interacting with websites, but their exposure to open web environments makes them vulnerable to prompt injection attacks embedded in HTML content or visual interfaces. Existing guard models still suffer from limited generalization to unseen domains and attack patterns, high false positive rates on benign content, reduced deployment efficiency due to added latency at each step, and vulnerability to adversarial attacks that evolve over time or directly target the guard itself. To address these limitations, we propose WARD (Web Agent Robust Defense against Prompt Injection), a practical guard model for secure and efficient web agents. WARD is built on WARD‑Base, a large‑scale dataset with around 177K samples collected from 719 high‑traffic URLs and platforms, and WARD‑PIG, a dedicated dataset designed for prompt injection attacks targeting the guard model. We further introduce A3T, an adaptive adversarial attack training framework that iteratively strengthens WARD through a memory‑based attacker and guard co‑evolution process. Extensive experiments show that WARD achieves nearly perfect recall on out‑of‑distribution benchmarks, maintains low false positive rates to preserve agent utility, remains robust against guard‑targeted and adaptive attacks under substantial distribution shifts, and runs efficiently in parallel with the agent without introducing additional latency.
Authors:Zihan Deng, Xiaozhen Zhong, Chuanzhi Xu
Abstract:
As large language models empower healthcare, intelligent clinical decision support has developed rapidly. Longitudinal electronic health records (EHR) provide essential temporal evidence for accurate clinical diagnosis and analysis. However, current large language models have critical flaws in longitudinal EHR reasoning. First, lacking fine‑grained statistical reasoning, they often hallucinate clinical trends and metrics when quantitative evidence is textually implied, biasing diagnostic inference. Second, non‑uniform time series and scarce labels in longitudinal EHR hinder models from capturing long‑range temporal dependencies, limiting reliable clinical reasoning. To address the above limitations, this work presents the Probabilistic Chain‑of‑Thought Completion Agent (COTCAgent), a hierarchical reasoning framework for longitudinal electronic health records. It consists of three core modules. The Temporal‑Statistics Adapter (TSA) converts analytical plans into executable code for standardized trend output. The Chain‑of‑Thought Completion (COTC) layer leverages a symptom‑trend‑disease knowledge base with weighted scoring to evaluate disease risk, while the bounded completion module acquires structured evidence through standardized inquiries and iterative scoring constraints to ensure rigorous reasoning. By decoupling statistical computation, feature matching, and language generation, the framework eliminates reliance on complex multi‑modal inputs and enables efficient longitudinal record analysis with lower computational overhead. Experimental results show that COTCAgent powered by Baichuan‑M2 achieves 90.47% Top‑1 accuracy on the self‑built dataset and 70.41% on HealthBench, outperforming existing medical agents and mainstream large language models. The code is available at https://github.com/FrankDengAI/COTCAgent/.
Authors:Ming Qian, Zimin Xia, Changkun Liu, Shuailei Ma, Wen Wang, Zeran Ke, Bin Tan, Hang Zhang, Gui-Song Xia
Abstract:
Generating a street‑level 3D scene from a single satellite image is a crucial yet challenging task. Current methods present a stark trade‑off: geometry‑colorization models achieve high geometric fidelity but are typically building‑focused and lack semantic diversity. In contrast, proxy‑based models use feed‑forward image‑to‑3D frameworks to generate holistic scenes by jointly learning geometry and texture, a process that yields rich content but coarse and unstable geometry. We attribute these geometric failures to the extreme viewpoint gap and sparse, inconsistent supervision inherent in satellite‑to‑street data. We introduce Sat3DGen to address these fundamental challenges, which embodies a geometry‑first methodology. This methodology enhances the feed‑forward paradigm by integrating novel geometric constraints with a perspective‑view training strategy, explicitly countering the primary sources of geometric error. This geometry‑centric strategy yields a dramatic leap in both 3D accuracy and photorealism. For validation, we first constructed a new benchmark by pairing the VIGOR‑OOD test set with high‑resolution DSM data. On this benchmark, our method improves geometric RMSE from 6.76m to 5.20m. Crucially, this geometric leap also boosts photorealism, reducing the Fréchet Inception Distance (FID) from ~40 to 19 against the leading method, Sat2Density++, despite using no extra tailored image‑quality modules. We demonstrate the versatility of our high‑quality 3D assets through diverse downstream applications, including semantic‑map‑to‑3D synthesis, multi‑camera video generation, large‑scale meshing, and unsupervised single‑image Digital Surface Model (DSM) estimation. The code has been released on https://github.com/qianmingduowan/Sat3DGen.
Authors:Yisen Gao, Jiaxin Bai, Haoyu Huang, Zhongwei Xie, Yufei Li, Hong Ting Tsang, Sirui Han, Yangqiu Song
Abstract:
Knowledge graph (KG) foundation models aim to generalize across graphs with unseen entities and relations by learning transferable relational structure. However, most existing methods primarily emphasize relation‑level universality, while in‑context learning, the other pillar of foundation models remains under‑explored for KG reasoning. In KGs, context is inherently structured and heterogeneous: effective prediction requires conditioning on the local context around the query entities as well as the global context that summarizes how a relation behaves across many instances. We propose KGPFN, a KG foundation model using Prior‑data Fitted Network that unifies transferable relational regularities with inference‑time in‑context learning from structured context. KGPFN first learns relation representations via message passing on relation graphs to capture cross‑graph relational invariances. For query‑specific reasoning, it encodes local neighborhoods using a multi‑layer NBFNet as local context. To enable ICL at global scale, it constructs relation‑specific global context by retrieving a large set of instances of the query relation together with their local neighborhoods, and aggregates them within a Prior‑Data Fitted Network framework that combines feature‑level and sample‑level attention. Through multi‑graph pretraining on diverse KGs, KGPFN learns when to instantiate reusable patterns and when to override them using contextual evidence. Experiments on 57 KG benchmarks demonstrate that KGPFN achieves strong adaptation to previously unseen graphs through in‑context learning alone, consistently outperforming competitive fine‑tuned KG foundation models. Our code is available at https://github.com/HKUST‑KnowComp/KGPFN.
Authors:Sukju Oh, Sukkyu Sun
Abstract:
Online surgical phase recognition (SPR) underpins context‑aware operating‑room systems and requires committing to a prediction at every frame from past context alone. Surgical video poses three demands that natural‑video recognizers do not jointly address: procedures span tens of thousands of frames, time flows non‑uniformly as long routine stretches are punctuated by brief phase‑defining transitions, and the visual domain is narrow so backbone features are strongly correlated across channels. Existing recognizers either let per‑frame cost grow with elapsed length, or hold cost bounded but advance state at a uniform rate with channel‑independent dynamics, leaving the latter two demands unaddressed. We present SurgicalMamba, a causal SPR model built on Mamba2's structured state‑space duality (SSD) that holds per‑frame cost at O(d). It introduces three SSD‑compatible components, each targeting one demand: a dual‑path SSD block that separates long‑ and short‑term regimes at the level of recurrent state; intensity‑modulated stepping, a continuous‑time time‑warp that adapts the slow path's effective rate to phase‑relevant information; and state regramming, a per‑chunk Cayley rotation that opens cross‑channel mixing in the otherwise axis‑aligned SSM recurrence. The learned rotation planes inherit a phase‑aligned structure without any direct supervision, offering an interpretable internal signature of surgical workflow. Across seven public SPR benchmarks, SurgicalMamba reaches state‑of‑the‑art accuracy and phase‑level Jaccard under strict online evaluation: 94.6%/82.7% on Cholec80 (+0.7 pp/+2.2 pp over the strongest prior) and 89.5%/68.9% on AutoLaparo (+1.7 pp/+2.0 pp), at 119 fps on a single GPU. Ablations isolate the contribution of each component. The code is publicly available at https://github.com/sukjuoh/Surgical‑Mamba.
Authors:Zhigao Huang, Zhengqing Hu, Dong Chen, Shaohan Zhang, Zhao Jin, Bo Zhang, Han Wu, Mingliang Xu
Abstract:
Operational plan generation and verification are critical for modern complex and rapidly changing battlefield environments, yet traditional generation and verification methods still respectively face the challenges of generation infeasibility and verification insufficiency. To alleviate these limitations, we propose an Integrated Multi‑Agent Framework for Generative Operational Planning and High‑Fidelity Plan Verification (IFPV). IFPV consists of two tightly coupled modules: Multi‑Perspective Hierarchical Agents (MPHA) for generative operational planning and an Adversarial Cognitive Simulation Engine (ACSE) for high‑fidelity adversarial plan verification. MPHA decomposes commander intent into executable multi‑platform tactical action sequences through the collaboration of Pathfinder, Analyst, and Planner agents. ACSE introduces an opponent equipped with a customized world model, which predicts the future evolution of mission‑critical platforms and conducts dynamic counteractions against candidate plans. Simulation experiments in the Asymmetric Combat Tactic Simulator (ACTS) show that IFPV improves mission success by 19.4% and reduces operational cost by 41.7% compared with a single‑step large language model (LLM) planning baseline. Compared with a traditional rule‑based validator, ACSE increases the average suppression rate by 31.8%, indicating that the proposed verification environment is stricter and more discriminative in revealing the latent vulnerabilities of candidate plans. The code for IFPV can be found at https://github.com/zhigao3ks/IFPV.
Authors:Thomas Witt
Abstract:
We introduce XFP, a dynamic weight quantizer for LLM inference that inverts the conventional workflow: the operator specifies reconstruction quality floors on per‑channel cosine similarity (one strict floor for attention and shared experts, one lazy floor for routed‑expert MoE); XFP determines codebook size, outlier budget, and packing per layer automatically ‑‑ no Hessian, no calibration data, no manual bit‑width selection. Each weight matrix is decomposed into a sparse fp16 outlier residual and a dense sub‑byte index tensor into a per‑group learned codebook. Two storage modes share one auto‑select frontend and one fused decode kernel: V2 (per‑channel Lloyd) and V2a (shared library of L=32 codebooks per layer). On Qwen3.5‑122B‑A10B under V2, XFP reaches 138 tok/s single‑stream decode on workstation hardware (RTX PRO 6000 Blackwell, TP=2) at 94.49% GSM8K strict‑match (3 seeds, n=3957), and is 49% faster than Marlin INT4 at TP=1. For models that do not fit in the target memory envelope, we present the H‑Process: a quality‑driven iteration over the two cosine thresholds that finds the operating point at which the model just fits while still producing sensible output. Three constraints define its search space: the operator‑set thresholds, an OOM boundary at quantize‑on‑load, and a garbage boundary in generation (cosine similarity steers; benches verify). On Qwen3.5‑397B‑A17B (512 routed experts/layer), the H‑Process fits the full expert population into 2x96 GB at ~3.4 effective bits and delivers 100.9 tok/s long‑output decode at 66.72% GSM8K strict‑match on the full 1319‑problem set (single seed at submission; multi‑seed evaluation in progress), exceeding INT4 with routed‑expert pruning on memory, throughput, and accuracy simultaneously.
Authors:Zhao Yang, Wang Huan, Li Yingshuo, Tu Haomiao, Lin Hujite
Abstract:
Large language models often suffer from fact loss, timeline confusion, persona drift, and reduced stability during long‑range interaction, especially under high‑noise knowledge bases, context clearing, and cross‑model transfer. To address these issues, we introduce ARPM, an external temporal memory governance framework for long‑term dialogue. ARPM separates static knowledge memory from dynamic dialogue experience memory and combines vector retrieval, BM25, RRF fusion, dual‑temporal reranking, chronological evidence reading, and a controlled analysis protocol for evidence verification and answer binding. Unlike approaches that encode persona consistency into model weights or rely only on long context, ARPM treats continuity as a traceable, auditable, and transferable governance problem. Using engineering logs, we conduct three experiments. First, in a 50‑round question‑answering setting, we compare signal‑to‑noise ratios of 1:5 and 1:200+, and distinguish CSV auto‑judgment from manual review. Under 1:5, CSV recall accuracy is 54.0%, while manual review raises it to 100.0%. Under 1:200+, the values are 44.0% and 80.0%. These results show that automatic rules can underestimate recall after supporting evidence enters the prompt. Second, ablation results show that dialogue history retrieval is necessary for recent continuity: disabling it reduces strict accuracy from 100% to 66.7%, and disabling BM25 reduces it to 80.0%, indicating that pure semantic retrieval is insufficient for correction and tracing. Third, under a 5.1‑million‑character noise substrate, periodic context clearing, and multi‑model handoff, ARPM maintains semantic continuity, boundary continuity, and persona consistency, while exposing limits caused by weak protocol compliance. These findings show that long‑term persona consistency can be decomposed into governable components and evaluated in a white‑box manner.
Authors:William Lugoloobi, Samuelle Marro, Jabez Magomere, Joss Wright, Chris Russell
Abstract:
As LLM‑based agents increasingly browse the web on users' behalf, a natural question arises: can websites passively identify which underlying model powers an agent? Doing so would represent a significant security risk, enabling targeted attacks tailored to known model vulnerabilities. Across 14 frontier LLMs and four web environments spanning information retrieval and shopping tasks, we show that an agent's actions and interaction timings, captured via a passive JavaScript tracker, are sufficient to identify the underlying model with up to 96% F1. We formalise this attack surface by demonstrating that classifiers trained on agent actions generalise across model sizes and families. We further show that strong classifiers can be trained from few interaction traces and that agent identity can be inferred early within an episode. Injecting randomised timing delays between actions substantially degrades classifier performance, but does not provide robust protection: a classifier retrained on delayed traces largely recovers performance. We release our harness and a labelled corpus of agent traces \hrefhttps://github.com/KabakaWilliam/known_actionshere.
Authors:Qirui Liu, Hao Chen, Weijie Shi, Jiajie Xu, Jia Zhu
Abstract:
Accurately identifying student misconceptions is crucial for personalized education but faces three challenges: (1) data scarcity with long‑tail distribution, where authentic student reasoning is difficult to synthesize; (2) fuzzy boundaries between error categories with high annotation noise; (3) deployment parado‑large models overlook unconventional approaches due to pretraining bias and cannot be deployed on edge, while small models overfit to noise. Unlike traditional methods that increase diversity through large‑scale data synthesis, we propose a two‑stage knowledge distillation framework that mines high‑value samples from existing data. The first stage performs standard distillation to transfer task capabilities. The second stage introduces a dual‑layer marginal selection mechanism based on cognitive uncertainty, identifying four types of critical samples based on teacher model uncertainty and confidence differences. For different data subsets, we design difficulty‑adaptive mechanism to balance hard/soft label contributions, enabling student models to inherit inter‑class relationships from teacher soft labels while distinguishing ambiguous error types. Experiments show that with augmented training on only 10.30% of filtered samples, we achieve MAP@3 of 0.9585 (+17.8%) on the MAP‑Charting dataset, and using only a 4B parameter model, we attain 84.38% accuracy on cross‑topic tests of middle school algebra misconception benchmarks, significantly outperforming sota LLM (67.73%) and standard fine‑tuned 72B models (81.25%). Our code is available at https://github.com/RoschildRui/acl2026_map.
Authors:Saqib Nazir, Ardhendu Behera
Abstract:
Label‑free single‑cell imaging offers a scalable, non‑invasive alternative to fluorescence‑based cytometry, yet inferring molecular phenotypes directly from bright‑field morphology remains challenging. We present a unified Deep Learning (DL) framework that jointly performs White Blood Cell (WBC) classification and continuous protein‑expression regression from label‑free Differential Phase Contrast (DPC) images. Our model employs a Hybrid architecture that fuses convolutional fine‑grained texture features with transformer‑based global representations through a learnable cross‑branch gating module, enabling robust morpho‑molecular inference from DPC images. To support downstream interpretability, we further incorporate a Large Language Model (LLM) that generates concise, biologically grounded summaries of the predicted cell states. Experiments on the Berkeley Single Cell Computational Microscopy (BSCCM) and Blood Cells Image benchmarks demonstrate strong performance, achieving a 91.3% WBC classification accuracy and a 0.72 Pearson correlation for CD16 expression regression on BSCCM. These results underscore the promise of label‑free single‑cell imaging for cost‑effective hematological profiling, enabling simultaneous phenotype identification and quantitative biomarker estimation without fluorescent staining. The source code is available at https://github.com/saqibnaziir/Single‑Cell‑Phenotyping.
Authors:Shijie Lian, Bin Yu, Xiaopeng Lin, Zhaolong Shen, Laurence Tianruo Yang, Yurun Jin, Haishan Liu, Changti Wu, Hang Yuan, Cong Huang, Kai Chen
Abstract:
Robot imitation data are often multimodal: similar visual‑language observations may be followed by different action chunks because human demonstrators act with different short‑horizon intents, task phases, or recent context. Existing frame‑conditioned VLA policies infer each chunk from the current observation and instruction alone, so under partial observability they may resample different intents across adjacent replanning steps, leading to inter‑chunk conflict and unstable execution. We introduce IntentVLA, a history‑conditioned VLA framework that encodes recent visual observations into a compact short‑horizon intent representation and uses it to condition chunk generation. We further introduce AliasBench, a 12‑task ambiguity‑aware benchmark on RoboTwin2 with matched training data and evaluation environments that isolate short‑horizon observation aliasing. Across AliasBench, SimplerEnv, LIBERO, and RoboCasa, IntentVLA improves rollout stability and outperforms strong VLA baselines
Authors:Nabil Iqbal, T. Anderson Keller, Yue Song, Takeru Miyato, Max Welling
Abstract:
In physical systems, whenever a continuous symmetry is spontaneously broken, the system possesses excitations called Goldstone modes, which allow coherent information propagation over long distances and times. In this work, we study deep neural networks whose internal layers are equivariant under a continuous symmetry and may therefore support analogous Goldstone‑like degrees of freedom. We demonstrate, both analytically and empirically, that these degrees of freedom enable coherent signal propagation across depth and recurrent iterations, providing a mechanism for stable information flow without relying on architectural stabilizers such as residual connections or normalization. In feedforward networks, this results in improved trainability and representational diversity across layers. In recurrent settings, we demonstrate the same mechanism is valuable for long‑term memory by propagating information over recurrent iterations, thereby improving performance of RNNs and GRUs on long‑sequence modeling tasks.
Authors:ZhiXin Sun
Abstract:
With the rapid evolution of computer vision, vision‑based methodologies for water level and river surface velocity estimation have reached significant maturity. Compared to traditional sensing, these techniques offer superior interpretability, automated data archiving, and enhanced system robustness. However, challenges such as environmental sensitivity, limited precision, and complex site calibration persist. This work proposes an integrated framework that synergizes state‑of‑the‑art (SOTA) vision models with statistical modeling. By leveraging physical priors and robust filtering strategies, we improve the accuracy of water level detection and flow estimation. Code will be available at https://github.com/sunzx97/Vision_Based_Water_Level_and_Flow_Estimation.git
Authors:Sohaib Afifi
Abstract:
PyCSP^3 provides a productive way to build constraint models for solving combinatorial constrained problems and export them to XCSP^3, preserving a complete separation between modeling and solving. However, it lacks native support for scheduling abstractions such as interval variables, sequence variables, and resource functions. As a result, scheduling models must be encoded with low‑level integer variables and manual channeling constraints, even though PyCSP^3 already provides global constraints like NoOverlap and Cumulative on integer arrays. We present PyCSP^3 Scheduling, a library that adds scheduling abstractions to PyCSP^3 through 53 dedicated constraints and 27 expressions, and compiles them down to standard PyCSP^3/XCSP^3 constraints, maintaining the modeling/solving separation that underpins the PyCSP^3 ecosystem. On 261 paired instances across 17 model families (5 runs each), both formulations produce identical objectives on all 72 doubly‑proved optimal pairs and nearly half of the families (8/17) remain structurally unchanged after compilation; however, runtime performance diverges across families, with clear gains on some (up to 5.8x) and regressions on others due to the overhead of compilation decompositions. Code and benchmarks are available at: https://github.com/sohaibafifi/pycsp3‑scheduling
Authors:Shuyang Cui, Zhi Zhong, Qiyu Wu, Zachary Novack, Woosung Choi, Keisuke Toyama, Kin Wai Cheuk, Junghyun Koo, Yukara Ikemiya, Christian Simon, Chihiro Nagashima, Shusuke Takahashi
Abstract:
Current methods for creating drum loop audio in digital music production, such as using one‑shot samples or resampling, often demand non‑trivial efforts of creators. While recent generative models achieve high fidelity and adhere to text, they lack the specific control needed for such a task. Existing symbolic‑to‑audio research often focuses on single, tonal instruments, leaving the challenge of polyphonic, percussive drum synthesis unaddressed. We address this gap by introducing ``Break‑the‑Beat!,'' a model capable of rendering a drum MIDI with the timbre of a reference audio. It is built by fine‑tuning a pre‑trained text‑to‑audio model with our proposed content encoder and a effective hybrid conditioning mechanism. To enable this, we construct a new dataset of paired target‑reference drum audio from existing drum audio datasets. Experiments demonstrate that our model generates high‑quality drum audio that follows high‑resolution drum MIDI, achieving strong performance across metrics of audio quality, rhythmic alignment, and beat continuity. This offer producers a new, controllable tool for creative production. Demo page: https://ik4sumii.github.io/break‑the‑beat/
Authors:Fuhao Li, Shaofeng You, Jiagao Hu, Yu Liu, Yuxuan Chen, Zepeng Wang, Fei Wang, Daiguo Zhou, Jian Luan
Abstract:
Evaluating object removal in images and videos remains challenging because the task is inherently one‑to‑many, yet existing metrics frequently disagree with human perception. Full‑reference metrics reward copy‑paste behaviors over genuine erasure; no‑reference metrics suffer from systematic biases such as favoring blurry results; and global temporal metrics are insensitive to localized artifacts within edited regions. To address these limitations, we propose RC (Removal Coherence), a pair of perception‑aligned metrics: RC‑S, which measures spatial coherence via sliding‑window feature comparison between masked and background regions, and RC‑T, which measures temporal consistency via distribution tracking within shared restored regions across adjacent frames. To validate RC and support community benchmarking, we further introduce PROVE‑Bench, a two‑tier real‑world benchmark comprising PROVE‑M, an 80‑video paired dataset with motion augmentation, and PROVE‑H, a 100‑video challenging subset without ground truth. Together, RC metrics and PROVE‑Bench form the PROVE (Perceptual RemOVal cohErence) evaluation framework for visual media. Experiments across diverse image and video benchmarks demonstrate that RC achieves substantially stronger alignment with human judgments than existing evaluation protocols. The code for RC metrics and PROVE‑Bench are publicly available at: https://github.com/xiaomi‑research/prove/.
Authors:Truong Thanh Hung Nguyen, Vo Thanh Khang Nguyen, Hoang-Loc Cao, Phuc Ho, Van Pham, Hung Cao
Abstract:
Multimedia verification requires not only accurate conclusions but also transparent and contestable reasoning. We propose a contestable multi‑agent framework that integrates multimodal large language models, external verification tools, and arena‑based quantitative bipolar argumentation (A‑QBAF) as a submission to the ICMR 2026 Grand Challenge on Multimedia Verification. Our method decomposes each case into claim‑centered sections, retrieves targeted evidence, and converts evidence into structured support and attack arguments with provenance and strength scores. These arguments are resolved through small local argument graphs with selective clash resolution and uncertainty‑aware escalation. The resulting system generates section‑wise verification reports that are transparent, editable, and computationally practical for real‑world multimedia verification. Our implementation is public at: https://github.com/Analytics‑Everywhere‑Lab/MV2026_the_liems.
Authors:Jiahao Tian, Yiwei Wang, Gang Yu, Chi Zhang
Abstract:
Autoregressive video diffusion models support real‑time synthesis but suffer from error accumulation and context loss over long horizons. We discover that attention heads in AR video diffusion transformers serve functionally distinct roles as local heads for detail refinement, anchor heads for structural stabilization, and memory heads for long‑range context aggregation, yet existing methods treat them uniformly, leading to suboptimal KV cache allocation. We propose Head Forcing, a training‑free framework that assigns each head type a tailored KV cache strategy: local and anchor heads retain only essential tokens, while memory heads employ a hierarchical memory system with dynamic episodic updates for long‑range consistency. A head‑wise RoPE re‑encoding scheme further ensures positional encodings remain within the pretrained range. Without additional training, Head Forcing extends generation from 5 seconds to minute‑level duration, supports multi‑prompt interactive synthesis, and consistently outperforms existing baselines. Project Page: https://jiahaotian‑sjtu.github.io/headforcing.github.io/.
Authors:Pengyun Zhu, Yuqi Ren, Zhen Wang, Lei Yang, Deyi Xiong
Abstract:
Current Large Language Models (LLMs) typically rely on coarse‑grained national labels for pluralistic value alignment. However, such macro‑level supervision often obscures intra‑country value heterogeneity, yielding a loose alignment. We argue that resolving this limitation requires shifting from national labels to multi‑dimensional demographic constraints, which can identify groups with predictable, high‑consensus value preference. To this end, we propose DVMap (High‑Consensus Demographic‑Value Mapping), a framework for fine‑grained pluralistic value alignment. In this framework, we first present a demographic archetype extraction strategy to construct a high‑quality value alignment corpus of 56,152 samples from the World Values Survey (WVS) by strictly retaining respondents with consistent value preferences under identical demographics. Over this corpus, we introduce a Structured Chain‑of‑Thought (CoT) mechanism that explicitly guides LLMs to reason about demographic‑value correlations. Subsequently, we employ Group Relative Policy Optimization (GRPO) to achieve adaptive anchoring of value distributions. To rigorously evaluate generalization, we further establish a triple‑generalization benchmark (spanning cross‑demographic, cross‑country, and cross‑value) comprising 21,553 samples. Experimental results demonstrate that DVMap effectively learns the manifold mapping from demographics to values, exhibiting strong generalization and robustness. On cross‑demographic tests, Qwen3‑8B‑DVMap achieves 48.6% accuracy, surpassing the advanced open‑source LLM DeepSeek‑v3.2 (45.1%). The source code and dataset are available at https://github.com/EnlightenedAI/DVMap.
Authors:Zhengjia Zhong, Shuyan Ke, Zaizhou Lin, Jiaqi Song, Hongyi Lan, Hui Li
Abstract:
Vector quantization is a fundamental tool for compressing high‑dimensional embeddings, yet existing multi‑codebook methods rely on static codebooks that limit expressiveness under heterogeneous data geometry. While recent dynamic quantizers like QINCo adapt codebooks to individual inputs and improve expressiveness, their strict sequential dependencies create decoding bottlenecks. We propose Residual Quantization via Mixture of Experts (RQ‑MoE), a framework combining a two‑level MoE with dual‑stream quantization to enable input‑dependent codebook adaptation for efficient vector quantization. RQ‑MoE enables dynamic codebook construction and decouples instruction from quantization, facilitating parallel decoding. Theoretically, we show that standard Residual Quantization and QINCo can be recovered as constrained special cases of RQ‑MoE, and derive a guideline for setting expert dimensionality in RQ‑MoE. Extensive experiments show that RQ‑MoE achieves state‑of‑the‑art or on‑par performance in reconstruction and retrieval, while providing 6x‑14x faster decoding than prior vector quantization methods. The implementation is available at https://github.com/KDEGroup/RQ‑MoE.
Authors:Weisen Jiang, Shuhao Chen, Sinno Jialin Pan
Abstract:
Mixture‑of‑Experts (MoE) models scale capacity by combining specialized experts, but most existing approaches assume centralized access to training data. In practice, data are distributed across clients and cannot be shared due to privacy constraints, making unified MoE training challenging. We propose MetaMoE, a privacy‑preserving framework that unifies independently trained, domain‑specialized experts into a single MoE using public proxy data as surrogates for inaccessible private data. Central to MetaMoE is diversity‑aware proxy selection, which selects client‑domain‑relevant and diverse samples from public data to effectively approximate private data distributions and supervise router learning. These proxies are further used to align expert training, improving expert coordination at unification time, while a context‑aware router enhances expert selection across heterogeneous inputs. Experiments on computer vision and natural language processing benchmarks demonstrate that MetaMoE consistently outperforms recent privacy‑preserving MoE unification methods. Code is available at https://github.com/ws‑jiang/MetaMoE.
Authors:Yang Zheng, Wen Li, Zhaoqiang Liu
Abstract:
Diffusion models (DMs) have exhibited remarkable efficacy in various image restoration tasks. However, existing approaches typically operate within the high‑dimensional pixel space, resulting in high computational overhead. While methods based on latent DMs seek to alleviate this issue by utilizing the compressed latent space of a variational autoencoder, they require repeated encoder‑decoder inference. This introduces significant additional computational burdens, often resulting in runtime performance that is even inferior to that of their pixel‑space counterparts. To mitigate the computational inefficiency, this work proposes projecting data into lower‑dimensional subspaces using dynamic resolution DMs to accelerate the inference process. We first fine‑tune pre‑trained DMs for dynamic resolution priors and adapt DPS and DAPS, which are two widely used pixel‑space methods for general image restoration tasks, into the proposed framework, yielding methods we refer to as SubDPS and SubDAPS, respectively. Given the favorable inference speed and reconstruction fidelity of SubDAPS, we introduce an enhanced variant termed SubDAPS++ to further boost both reconstruction efficiency and quality. Empirical evaluations across diverse image datasets and various restoration tasks demonstrate that the proposed methods outperform recent DM‑based approaches in the majority of experimental scenarios. The code is available at https://github.com/StarNextDay/SubDAPS.git.
Authors:Kai Sun, Peibo Duan, Yongsheng Huang, Guowei Zhang, Benjamin Smith, Nanxu Gong, Levin Kuhlmann
Abstract:
Spiking neural networks (SNNs), which are brain‑inspired and spike‑driven, achieve high energy efficiency. However, a performance gap between SNNs and artificial neural networks (ANNs) still remains. Knowledge distillation (KD) is commonly adopted to improve SNN performance, but existing methods typically enforce uniform alignment across all timesteps, either from a teacher network or through inter‑temporal self‑distillation, implicitly assuming that per‑timestep predictions should be treated equally. In practice, SNN predictions vary and evolve over time, and intermediate timesteps need not all be individually correct even when the final aggregated output is correct. Under such conditions, effective distillation should not force every timestep toward the same supervision target, but instead provide corrective guidance to erroneous timesteps while preserving useful temporal dynamics. To address this issue, we propose Selective Alignment Knowledge Distillation (SeAl‑KD), which selectively aligns class‑level and temporal knowledge by equalizing competing logits at erroneous timesteps and reweighting temporal alignment based on confidence and inter‑timestep similarity. Extensive experiments on static image and neuromorphic event‑based datasets demonstrate consistent improvements over existing distillation methods. The code is available at https://github.com/KaiSUN1/SeAl
Authors:Hanxun Huang, Qizhou Wang, Xingjun Ma, Cihang Xie, Christopher Leckie, Sarah Erfani
Abstract:
Audio self‑supervised learning (SSL) aims to learn general‑purpose representations from large‑scale unlabeled audio data. While recent advances have been driven mainly by generative reconstruction objectives, contrastive approaches remain less explored, partly due to the difficulty of designing effective audio augmentations and the large batch sizes required for contrastive pre‑training. We introduce AudioMosaic, a contrastive learning‑based audio encoder for general audio understanding. During pre‑training, AudioMosaic constructs positive pairs by applying structured time‑frequency masking to spectrogram patches, which reduces memory usage and enables efficient large‑batch training. Compared with generative approaches, the AudioMosaic encoder learns more discriminative utterance‑level representations that demonstrate strong transferability across datasets, domains, and acoustic conditions. Extensive experiments show that AudioMosaic achieves state‑of‑the‑art performance on several standard audio benchmarks under both linear probing and fine‑tuning. We further show that integrating the pretrained AudioMosaic encoder into audio‑language models improves performance on audio‑language tasks. The code is publicly available in our \hrefhttps://github.com/HanxunH/AudioMosaicGitHub repository.
Authors:Sanghyeob Song, Donghyeok Lee, Jinsik Kim, Sungroh Yoon
Abstract:
For reinforcement learning in data‑scarce domains like real‑world robotics, intensive data reuse enhances efficiency but induces overfitting. While prior works focus on critic bias, representation‑level instability in Self‑Predictive Learning (SPL) under high Update‑to‑Data (UTD) regimes remains underexplored. To bridge this gap, we propose Robust Representation via Redundancy Reduction (R2R2), a regularization method within SPL. We theoretically identify that standard zero‑centering conflicts with SPL's spectral properties and design a non‑centered objective accordingly. We verify R2R2 on SPL‑native algorithms like TD7. Furthermore, to demonstrate its orthogonality to prior advancements, we extend the state‑of‑the‑art SimbaV2, which originally lacks SPL, by integrating a tailored SPL module, termed SimbaV2‑SPL. Experiments across 11 continuous control tasks confirm that R2R2 effectively mitigates overfitting; specifically, at a UTD ratio of 20, it improves TD7 by ~22% and provides additional gains on top of SimbaV2‑SPL, which itself establishes a new state‑of‑the‑art. The code can be found at: https://github.com/songsang7/R2R2
Authors:Evelyn Turri, Davide Bucciarelli, Sara Sarto, Lorenzo Baraldi, Marcella Cornia
Abstract:
Diffusion Transformers (DiTs) and related flow‑based architectures are now among the strongest text‑to‑image generators, yet the internal mechanisms through which prompts shape image semantics remain poorly understood. In this work, we study massive activations: a small subset of hidden‑state channels whose responses are consistently much larger than the rest. We show that, despite their sparsity, these few channels effectively draw the whole picture, in three complementary senses. First, they are functionally critical: a controlled disruption probe that zeroes the massive channels causes a sharp collapse in generation quality, while disrupting an equally‑sized set of low‑statistic channels has marginal effect. Second, they are spatially organized: restricting image‑stream tokens to massive channels and clustering them yields coherent partitions that closely align with the main subject and salient regions, exposing a structured spatial code hidden inside an apparently outlier‑like subspace. Third, they are transferable: transporting massive activations from one prompt‑conditioned trajectory into another, shifts the final image toward the source prompt while preserving substantial content from the target, producing localized semantic interpolation rather than unstructured pixel blending. We exploit this property in two use cases: text‑conditioned and image‑conditioned semantic transport, where massive activations transport enables prompt interpolation and subject‑driven generation without any additional training. Together, these results recast massive activations not as activation anomalies, but as a sparse prompt‑conditioned carrier subspace that organizes and controls semantic information in modern DiT models.
Authors:Darius A. Faroughy, Sofia Palacios Schweitzer, Ian Pang, Siddharth Mishra-Sharma, David Shih
Abstract:
Autonomous language‑model agents are increasingly evaluated on long‑horizon tool‑use tasks, but existing benchmarks rarely capture the complexity and nuance of real scientific work. To address this gap, we introduce Collider‑Bench, a benchmark for evaluating whether LLM agents can reproduce experimental analyses from the Large Hadron Collider (LHC) using only public papers and open scientific software. Such analyses are often difficult to reproduce because the public toolchain only approximates the software used internally by the experimental collaborations, while the published papers inevitably omit implementation details needed for a faithful reconstruction. Agents must therefore rely on physical reasoning, domain knowledge, and trial‑and‑error to fill these gaps. Each task requires the agent to turn a published analysis into an executable simulation‑and‑selection pipeline and submit predicted collision event yields in specified signal regions. These predictions are evaluated with standard histogram metrics that provide continuous fidelity scores without a hand‑written rubric. We also report the computational cost incurred by each agent per task. Finally, we evaluate the codebase and full session trace using an LLM judge to catch qualitative failure modes such as fabrications, hallucinations and duplications. We release an initial set of tasks drawn from LHC searches, together with a containerized sandbox and event simulation tools. We evaluate across a capability ladder of general purpose coding agents. Our results show that on average no agent reliably beats the physicist‑in‑the‑loop solution.
Authors:Jiaqi Liu, Xinyu Ye, Peng Xia, Zeyu Zheng, Cihang Xie, Mingyu Ding, Huaxiu Yao
Abstract:
Long‑term memory is essential for LLM agents that operate across multiple sessions, yet existing memory systems treat retrieval infrastructure as fixed: stored content evolves while scoring functions, fusion strategies, and answer‑generation policies remain frozen at deployment. We argue that truly adaptive memory requires co‑evolution at two levels: the stored knowledge and the retrieval mechanism that queries it. We present EvolveMem, a self‑evolving memory architecture that exposes its full retrieval configuration as a structured action space optimized by an LLM‑powered diagnosis module. In each evolution round, the module reads per‑question failure logs, identifies root causes, and proposes targeted configuration adjustments; a guarded meta‑analyzer applies them with automatic revert‑on‑regression and explore‑on‑stagnation safeguards. This closed‑loop self‑evolution realizes an AutoResearch process: the system autonomously conducts iterative research cycles on its own architecture, replacing manual configuration tuning. Starting from a minimal baseline, the process converges autonomously, discovering effective retrieval strategies including entirely new configuration dimensions not present in the original action space. On LoCoMo, EvolveMem outperforms the strongest baseline by 25.7% relative and achieves a 78.0% relative improvement over the minimal baseline. On MemBench, EvolveMem exceeds the strongest baseline by 18.9% relative. Evolved configurations transfer across benchmarks with positive rather than catastrophic transfer, indicating that the self‑evolution process captures universal retrieval principles rather than benchmark‑specific heuristics. Code is available at https://github.com/aiming‑lab/SimpleMem.
Authors:Haomin Zhuang, Hanwen Xing, Yujun Zhou, Yuchen Ma, Yue Huang, Yili Shen, Yufei Han, Xiangliang Zhang
Abstract:
Third‑party skills are becoming the package ecosystem for LLM agents. They package natural‑language instructions, helper scripts, templates, documents, and service configuration into reusable workflows. This makes skills useful, but it also introduces a new security problem: a malicious skill does not need to ask the model to perform an obviously harmful action. Instead, it can disguise the harmful behavior as part of a routine workflow, relying on the agent to execute that workflow with high‑value permissions and limited human supervision. We introduce AgentTrap, a dynamic benchmark for evaluating whether LLM agents can use third‑party skills while resisting malicious runtime behavior. AgentTrap contains 141 tasks: 91 malicious tasks and 50 benign utility tasks, covering 16 security‑impact dimensions grounded in agent‑skill supply‑chain threats. In each task, the agent receives an ordinary user request, runs with installed skills that may contain malicious workflow elements, and is executed in a sandboxed environment. AgentTrap then judges complete trajectories for attack success, blocked or refused behavior, attack‑not‑triggered cases, and no‑attack‑evidence outcomes. Our central finding is that the most informative failures are not simple jailbreaks. Models often complete the visible user task while treating unsafe side effects introduced by the skill as part of the normal workflow. This motivates runtime evaluation of the concrete model‑‑framework‑‑workspace environment in which users actually delegate work. Code and data are available at https://github.com/zhmzm/AgentTrap and https://huggingface.co/datasets/zhmzm/AgentTrap.
Authors:Abdullah Naeem, Md Wasi Ul kabir, Manish Bhatt, Ayon Dey, Anav Katwal, Md Tamjidul Hoque
Abstract:
We present ARES‑LSHADE, a memetic differential‑evolution variant submitted to the GECCO 2026 competition on LLM‑designed evolutionary algorithms for the Generalized Numerical Benchmark Generator (GNBG). The algorithm builds on the LLM‑LSHADE 2025 winner, contributing two new components: (a) a scout‑augmented mutation operator with adaptive CMA‑ES integration, produced by an autonomous research loop across approximately thirty LLM‑driven design experiments, and (b) a multi‑start L‑BFGS‑B polish phase that respects strict blackbox treatment of the benchmark. On the official 31‑run‑per‑function evaluation with the competition‑specified function‑evaluation budgets, ARES‑LSHADE obtains 510 of 744 wins (per‑function gap below 1e‑8), reaching machine precision on 18 of 24 functions. The remaining six functions exhibit characteristic plateau signatures consistent with GNBG's compositional structure, and were independently identified by the autoresearch loop as the hardest of the suite. Beyond the result itself, this report documents two methodological observations: (i) an LLM‑driven research loop with operator‑only edit surface and fitness‑only observation space converges to a characteristic plateau on this benchmark; (ii) when we initially widened the observation space to include the benchmark's compositional metadata, the resulting algorithm trivially solved all 24 functions but violated the competition's blackbox rule, which we identified before submission. We discuss this tension between LLM capability and benchmark integrity as a design consideration for future LLM‑driven optimization‑algorithm research. Code and reproducibility artifacts are available at https://github.com/anaeem1/ARES‑LSHADE.
Authors:Dongzhe Zheng, Tao Zhong, Christine Allen-Blanchette
Abstract:
In this paper, we study solution operators of physical field equations on geometric meshes from a function‑space perspective. We reveal that Hodge orthogonality fundamentally resolves spectral interference by isolating unlearnable topological degrees of freedom from learnable geometric dynamics, enabling an additive approximation confined to structure‑preserving subspaces. Building on Hodge theory and operator splitting, we derive a principled operator‑level decomposition. The result is a Hybrid Eulerian‑Lagrangian architecture with an algebraic‑level inductive bias we call Hodge Spectral Duality (HSD). In our framework, we use discrete differential forms to capture topology‑dominated components and an orthogonal auxiliary ambient space to represent complex local dynamics. Our method achieves superior accuracy and efficiency on geometric graphs with enhanced fidelity to physical invariants. Our code is available at https://github.com/ContinuumCoder/Hodge‑Spectral‑Duality
Authors:Yuchao Gu, Guian Fang, Yuxin Jiang, Weijia Mao, Song Han, Han Cai, Mike Zheng Shou
Abstract:
Few‑step video generation has been significantly advanced by consistency distillation. However, the performance of consistency‑distilled models often degrades as more sampling steps are allocated at test time, limiting their effectiveness for any‑step video diffusion. This limitation arises because consistency distillation replaces the original probability‑flow ODE trajectory with a consistency‑sampling trajectory, weakening the desirable test‑time scaling behavior of ODE sampling. To address this limitation, we introduce AnyFlow, the first any‑step video diffusion distillation framework based on flow maps. Instead of distilling a model for only a few fixed sampling steps, AnyFlow optimizes the full ODE sampling trajectory. To this end, we shift the distillation target from endpoint consistency mapping (z_t\rightarrow z_0) to flow‑map transition learning (z_t\rightarrow z_r) over arbitrary time intervals. We further propose Flow Map Backward Simulation, which decomposes a full Euler rollout into shortcut flow‑map transitions, enabling efficient on‑policy distillation that reduces test‑time errors (i.e., discretization error in few‑step sampling and exposure bias in causal generation). Extensive experiments across both bidirectional and causal architectures, at scales ranging from 1.3B to 14B parameters, demonstrate that AnyFlow achieves performance matches or surpasses consistency‑based counterparts in the few‑step regime, while scaling with sampling step budgets.
Authors:Peng Kang, Bixuan Li, Xiaoya Huang, Shuo Shi, Weiqiao Zhou, Zhen Li, Yu Liu, Lei Zheng
Abstract:
The Materials Genome Initiative catalyzed the proliferation of centralized platforms‑‑SaaS, PaaS, and IaaS‑‑that aggregate computational and experimental resources for accelerated materials discovery. In parallel, breakthroughs in large language models (LLMs) and autonomous agents have created powerful new reasoning capabilities for scientific research. Yet a critical "last mile" problem remains: while we possess world‑class models and vast repositories of materials data, we lack the organizational infrastructure to compose these capabilities securely across institutional boundaries. The development of structural and functional materials for harsh service environments‑‑high‑temperature alloys, radiation resistant steels, corrosion‑resistant coatings‑‑remains characterized by long‑term iteration, mechanistic complexity, and high domain expertise‑‑demands that exceed both monolithic agent systems and traditional centralized platforms. To address this gap we propose OpenAaaS, an open‑source hierarchical and distributed Agent‑as‑a‑Service framework that enables organized multi‑agent collaboration for intelligent materials design. OpenAaaS is built on a single foundational principle: code flows, data stays still. A Master Agent plans and decomposes complex research tasks without requiring direct access to subordinate agents' managed data and computational resources. Sub‑agents, deployed as near‑data execution nodes, retain full sovereignty over local datasets, proprietary algorithms, and specialized hardware. This architecture guarantees that raw data never leaves its domain of origin while enabling cross‑scale, cross‑domain secure integration of previously isolated materials intelligence silos. We validate the framework through two representative case studies: (i) AlphaAgent, an evidence‑grounded materials literature analysis executor that achieves 4.66/5.0 on deep analytical questions against single‑pass RAG baselines; and (ii) an ultra‑large‑scale hexa‑high‑entropy alloy descriptor database service that demonstrates secure near‑data execution and domain‑specific scientific workflows under strict data‑sovereignty constraints. OpenAaaS establishes a principled pathway toward "organized research" via agent collectives, offering a scalable foundation for next‑generation materials intelligent design platforms. All source code is available at https://github.com/Wolido/OpenAaaS.
Authors:Chengzhi Shen, Weixiang Shen, Tobias Susetzky, Chen, Chen, Jun Li, Yuyuan Liu, Xuepeng Zhang, Zhenyu Gong, Daniel Rueckert, Jiazhen Pan
Abstract:
Intensive care units (ICU) generate long, dense and evolving streams of clinical information, where physicians must repeatedly reassess patient states under time pressure, underscoring a clear need for reliable AI decision support. Existing ICU benchmarks typically treat historical clinician actions as ground truth. However, these actions are made under incomplete information and limited temporal context of the underlying patient state, and may therefore be suboptimal, making it difficult to assess the true reasoning capabilities of AI systems. We introduce RealICU, a hindsight‑annotated benchmark for evaluating large language models (LLMs) under realistic ICU conditions, where labels are created after senior physicians review the full patient trajectory. We formulate four physician‑motivated tasks: assess Patient Status, Acute Problems, Recommended Actions, and Red Flag actions that risk unsafe outcomes. We partition each trajectory with 30‑min windows and release two datasets: RealICU‑Gold with 930‑window annotations from 94 MIMIC‑IV patients, and RealICU‑Scale with 11,862 windows extended by Oracle, a physician‑validated LLM hindsight labeler. Existing LLMs including memory‑augmented ones performed poorly on RealICU, exposing two failure modes: a recall‑safety tradeoff for clinical recommendations, and an anchoring bias to early interpretations of the patient. We further introduce ICU‑Evo to study structured‑memory agents that improves long‑horizon reasoning but does not fully eliminate safety failures. Together, RealICU provides a clinically grounded testbed for measuring and improving AI sequential decision‑support in high‑stakes care. Project page: https://chengzhi‑leo.github.io/RealICU‑Bench/
Authors:Kangning Zhang, Shuai Shao, Qingyao Li, Jianghao Lin, Lingyue Fu, Shijian Wang, Wenxiang Jiao, Yuan Lu, Weiwen Liu, Weinan Zhang, Yong Yu
Abstract:
Reusable skills have become a core substrate for improving agent capabilities, yet most existing skill packages encode reusable behavior primarily as textual prompts, executable code, or learned routines. For visual agents, however, procedural knowledge is inherently multimodal: reuse depends not only on what operation to perform, but also on recognizing the relevant state, interpreting visual evidence of progress or failure, and deciding what to do next. We formalize this requirement as multimodal procedural knowledge and address three practical challenges: (I) what a multimodal skill package should contain; (II) where such packages can be derived from public interaction experience; and (III) how agents can consult multimodal evidence at inference time without excessive image context or over‑anchoring to reference screenshots. We introduce MMSkills, a framework for representing, generating, and using reusable multimodal procedures for runtime visual decision making. Each MMSkill is a compact, state‑conditioned package that couples a textual procedure with runtime state cards and multi‑view keyframes. To construct these packages, we develop an agentic trajectory‑to‑skill Generator that transforms public non‑evaluation trajectories into reusable multimodal skills through workflow grouping, procedure induction, visual grounding, and meta‑skill‑guided auditing. To use them, we introduce a branch‑loaded multimodal skill agent: selected state cards and keyframes are inspected in a temporary branch, aligned with the live environment, and distilled into structured guidance for the main agent. Experiments across GUI and game‑based visual‑agent benchmarks show that MMSkills consistently improve both frontier and smaller multimodal agents, suggesting that external multimodal procedural knowledge complements model‑internal priors.
Authors:Jaeyung Kim, YoungJoon Yoo
Abstract:
Vector Quantized Variational Autoencoder (VQ‑VAE) has become a fundamental framework for learning discrete representations in image modeling. However, VQ‑VAE models must tokenize entire images using a finite set of codebook vectors, and this capacity limitation restricts their ability to capture rich and diverse representations. In this paper, we propose ArcCosine Additive Margin VQ‑VAE (ArcVQ‑VAE), a novel vector quantization framework that introduces a spherical angular‑margin prior (SAMP) for the codebook of a conventional VQ‑VAE. The proposed SAMP consists of Ball‑Bounded Norm Regularization, which constrains all codebook vectors within a time‑dependent Euclidean ball, and ArcCosine Additive Margin Loss, which encourages greater angular separability among latent vectors. This formulation promotes more discriminative and uniformly dispersed latent representations within the constrained space, thereby improving effective latent‑space coverage and leading to improved codebook utilization. Experimental results on standard image reconstruction and generation tasks show that ArcVQ‑VAE achieves competitive performance against baseline models in terms of reconstruction accuracy, representation diversity, and sample quality. The code is available at: https://github.com/goals4292/ArcVQ‑VAE
Authors:Barathi Ganesh HB, Michal Ptaszynski, Rene Melendez, Juuso Eronen
Abstract:
This paper presents a multi‑stage framework for detecting reclaimed slurs in multilingual social media discourse. It addresses the challenge of identifying reclamatory versus non‑reclamatory usage of LGBTQ+‑related slurs across English, Spanish, and Italian tweets. The framework handles three intertwined methodological challenges like data scarcity, class imbalance, and cross‑linguistic variation in sentiment expression. It integrates data‑driven model selection via cross‑validation, semantic‑preserving augmentation through back‑translation, inductive transfer learning with dynamic epoch‑level undersampling, and domain‑specific knowledge injection via masked language modeling. Eight multilingual embedding models were evaluated systematically, with XLM‑RoBERTa selected as the foundation model based on macro‑averaged F1 score. Data augmentation via GPT‑4o‑mini back‑translation to alternate languages effectively tripled the training corpus while preserving semantic content and class distribution ratios. The framework produces four final runs for the evaluation purposes where RUN 1 is inductive transfer learning with augmentation and undersampling, RUN 2 with masked language modeling pre‑training, RUN 3 and RUN 4 are previous predictions refined via language‑specific decision thresholds optimized via ROC analysis. Language‑specific threshold refinement reveals that optimal decision boundaries vary significantly across languages. This reflects distributional differences in model confidence scores and linguistic variation in reclamatory language usage. The threshold‑based optimization yields 2‑5% absolute F1 improvement without requiring model retraining. The methodology is fully reproducible, with all code and experimental setup available at https://github.com/rbg‑research/MultiPRIDE‑Evalita‑2026.
Authors:Galadrielle Humblot-Renaux, Mohammad N. S. Jahromi, Rohat Bakuri-Jørgensen, Marieke Anne Heyl, Asta S. Stage Jarlner, Maria Vlachou, Anna Murphy Høgenhaug, Desmond Elliott, Thomas Gammeltoft-Hansen, Thomas B. Moeslund
Abstract:
Off‑the‑shelf large language models (LLMs) are increasingly used to automate text annotation, yet their effectiveness remains underexplored for underrepresented languages and specialized domains where the class definition requires subtle expert understanding. We investigate LLM‑based annotation for a novel legal NLP task: identifying the presence and sentiment of credibility assessments in asylum decision texts. We introduce RAB‑Cred, a Danish text classification dataset featuring high‑quality, expert annotations and valuable metadata such as annotator confidence and asylum case outcome. We benchmark 21 open‑weight models and 30 system‑user prompt combinations for this task, and systematically evaluate the effect of model and prompt choice for zero‑shot and few‑shot classification. We zoom in on the errors made by top‑performing models and prompts, investigating error consistency across LLMs, inter‑class confusion, correlation with human confidence and sample‑wise difficulty and severity of LLM mistakes. Our results confirm the potential of LLMs for cost‑effective labeling of asylum decisions, but highlight the imperfect and inconsistent nature of LLM annotators, and the need to look beyond the predictions of a single, arbitrarily chosen model. The RAB‑Cred dataset and code are available at https://github.com/glhr/RAB‑Cred
Authors:Chaehee Song, Minseok Seo, Yeeun Seong, Doyi Kim, Changick Kim
Abstract:
Large language models (LLMs) are typically deployed with fixed parameters, and their performance is often improved by allocating more computation at inference time. While such test‑time scaling can be effective, it cannot correct model misconceptions or adapt the model to the specific structure of an individual query. Test‑time optimization addresses this limitation by enabling parameter updates during inference, but existing approaches either rely on external data or optimize generic self‑supervised objectives that lack query‑specific alignment. In this work, we propose Query‑Conditioned Test‑Time Self‑Training (QueST), a framework that adapts model parameters during inference using supervision derived directly from the input query. Our key insight is that the input query itself encodes latent signals sufficient for constructing structurally related problem‑‑solution pairs. Based on this, QueST generates such query‑conditioned pairs and uses them as supervision for parameter‑efficient fine‑tuning at test time. The adapted model is then used to produce the final answer, enabling query‑specific adaptation without any external data. Across seven mathematical reasoning benchmarks and the GPQA‑Diamond scientific reasoning benchmark, QueST consistently outperforms strong test‑time optimization baselines. These results demonstrate that query‑conditioned self‑training is an effective and practical paradigm for test‑time adaptation in LLMs. Code is available at https://chssong.github.io/Query‑Conditioned‑TTST/.
Authors:Oscar Gilg, Pierre Beckmann, Daniel Paleka, Patrick Butlin
Abstract:
Large language models (LLMs) can be said to have preferences: they reliably pick certain tasks and outputs over others, and preferences shaped by post‑training and system prompts appear to shape much of their behaviour. But models can also adopt different personas which have radically different preferences. How is this implemented internally? Does each persona run on its own preference machinery, or is something shared underneath? We train linear probes on residual‑stream activations of Gemma‑3‑27B and Qwen‑3.5‑122B to predict revealed pairwise task choices, and identify a genuine preference vector: it tracks the model's preferences as they shift across a range of prompts and situations, and on Gemma‑3‑27B steering along it causally controls pairwise choice. This preference representation is largely shared across personas: a probe trained on the helpful assistant predicts and steers the choices of qualitatively different personas, including an evil persona whose preferences anti‑correlate with those of the Assistant.
Authors:Shuqiang Wang, Wei Cao, Jiaqi Weng, Jialing Tao, Licheng Pan, Hui Xue, Zhixuan Chu
Abstract:
Large Reasoning Models (LRMs) are increasingly integrated into systems requiring reliable multi‑step inference, yet this growing dependence exposes new vulnerabilities related to computational availability. In particular, LRMs exhibit a tendency to "overthink", producing excessively long and redundant reasoning traces, when confronted with incomplete or logically inconsistent inputs. This behavior significantly increases inference latency and energy consumption, forming a potential vector for denial‑of‑service (DoS) style resource exhaustion. In this work, we investigate this attack surface and propose an automated black‑box framework that induces overthinking in LRMs by systematically perturbing the logical structure of input problems. Our method employs a hierarchical genetic algorithm (HGA) operating on structured problem decompositions, and optimizes a composite fitness function designed to maximize both response length and reflective overthinking markers. Across four state‑of‑the‑art reasoning models, the proposed method substantially amplifies output length, achieving up to a 26.1x increase on the MATH benchmark and consistently outperforming benign and manually crafted missing‑premise baselines. We further demonstrate strong transferability, showing that adversarial inputs evolved using a small proxy model retain high effectiveness against large commercial LRMs. These findings highlight overthinking as a shared and exploitable vulnerability in modern reasoning systems, underscoring the need for more robust defenses.
Authors:Junhyuk Jeon, Seokhyeon Hong, Junyong Noh
Abstract:
Text‑driven motion diffusion models are capable of generating realistic human motions, but text alone often struggles to express fine‑level nuances of motion, commonly referred to as style. Recent approaches have tackled this challenge by attaching a style injection mechanism to a pretrained text‑driven diffusion model. Existing stylization methods, however, either require style‑specific fine‑tuning of existing models or rely on heavy ControlNet‑based architectures, limiting efficiency and generalization to unseen styles. We propose a lightweight style conditioning framework that dynamically modulates a pretrained diffusion model through hypernetwork‑generated LoRA parameters. A style reference motion is encoded into a global style embedding, which is mapped by a hypernetwork to low‑rank updates applied at each denoising step of the diffusion model. By structuring the style latent space with a supervised contrastive loss, our framework reliably captures diverse stylistic attributes, improves generalization to unseen styles, and supports optimization‑based guidance without requiring predefined style categories. Experiments on the HumanML3D and 100STYLE datasets show state‑of‑the‑art stylization results, while achieving improved stylization for unseen styles.
Authors:Yunheng Wang, Yuetong Fang, Taowen Wang, Lusong Li, Kun Liu, Junzhe Xu, Zizhao Yuan, Yixiao Feng, Jiaxi Zhang, Wei Lu, Zecui Zeng, Renjing Xu
Abstract:
Vision‑and‑Language Navigation (VLN) is a cornerstone of embodied intelligence. However, current agents often suffer from significant performance degradation when transitioning from simulation to real‑world deployment, primarily due to perceptual instability (e.g., lighting variations and motion blur) and under‑specified instructions. While existing methods attempt to bridge this gap by scaling up model size and training data, we argue that the bottleneck lies in the lack of robust spatial grounding and cross‑domain priors. In this paper, we propose StereoNav, a robust Vision‑Language‑Action framework designed to enhance real‑world navigation consistency. To address the inherent gap between synthetic training and physical execution, we introduce Target‑Location Priors as a persistent bridge. These priors provide stable visual guidance that remains invariant across domains, effectively grounding the agent even when instructions are vague. Furthermore, to mitigate visual disturbances like motion blur and illumination shifts, StereoNav leverages stereo vision to construct a unified representation of semantics and geometry, enabling precise action prediction through enhanced depth awareness. Extensive experiments on R2R‑CE and RxR‑CE demonstrate that StereoNav achieves state‑of‑the‑art egocentric RGB performance, with SR and SPL scores of 81.1% and 68.3%, and 67.5% and 52.0%, respectively, while using significantly fewer parameters and less training data than prior scaling‑based approaches. More importantly, real‑world robotic deployments confirm that StereoNav substantially improves navigation reliability in complex, unstructured environments. Project page: https://yunheng‑wang.github.io/stereonav‑public.github.io.
Authors:Hongli Liu, Yu Wang, Shengjie Zhao
Abstract:
Few‑shot action recognition (FSAR) requires models to generalize to novel action categories from only a handful of annotated samples. Despite progress with vision‑language models, existing approaches still suffer from semantic‑temporal misalignment, where static textual prompts fail to capture decisive visual cues that appear sparsely across sequences, and from inadequate modeling of multi‑scale temporal dynamics, as short‑term discriminative cues and long‑range dependencies are often either oversmoothed or fragmented. To address these challenges, we propose Semantic Temporal Adaptive Representation Learning (STAR), a unified framework, consisting of a semantic‑alignment component and a temporal‑aware component, effectively bridging the semantic and temporal gaps and transferring the sequence modeling capability of Mamba into the FSAR. The semantic alignment module introduces a Temporal Semantic Attention (TSA) mechanism, which performs frame‑level cross‑modal alignment with textual cues, ensuring fine‑grained semantic‑temporal consistency. The temporal‑aware module incorporates a Semantic Temporal Prototype Refiner (STPR) that integrates semantic‑guided Mamba blocks with multi‑frequency temporal sampling and bidirectional state‑space refinement, yielding semantically aligned prototypes with enhanced discriminative fidelity and temporal consistency. Furthermore, temporally dependent class descriptors derived from large language models (LLMs) provide long‑range semantic guidance. Extensive experiments on five FSAR benchmarks demonstrate the consistent superiority of STAR over state‑of‑the‑art methods. For instance, STAR achieves up to 8.1% and 6.7% gains on the SSv2‑Full and SSv2‑Small datasets under the 1‑shot setting, and 7.3% on HMDB51, validating its effectiveness under limited supervision. The code is available at https://github.com/HongliLiu1/STAR‑main.
Authors:Mahsa Gazeran, Sayvan Soleymanbaigi, Fatemeh Daneshfar, Amjad Seyedi, Fardin Akhlaghian Tab
Abstract:
Electrocardiogram (ECG) arrhythmia classification remains challenging due to signal variability, noise, limited labeled data, and the difficulty in achieving both accuracy and efficiency in models. While self‑supervised learning reduces label dependency, most methods target either global contextual features or local morphological patterns, but rarely implement hierarchical multi‑scale feature extraction. ECG signals require architectures that simultaneously capture fine‑grained beat‑level morphology and broader rhythm‑level dependencies with computational efficiency. To overcome this limitation, this paper proposes the Electrocardiogram Neighborhood Attention Transformer (ECG‑NAT), a novel self‑supervised learning approach tailored for multi‑lead ECG classification. Our two‑stage approach begins with generative pretraining, using a masked autoencoder to reconstruct partially masked ECG signals across multiple diverse datasets, enabling the model to learn robust, domain‑invariant representations from unlabeled data. This is followed by discriminative fine‑tuning with a dual‑loss function that combines supervised contrastive and cross‑entropy losses, aligning representation learning with label prediction. The hierarchical attention mechanism efficiently captures multi‑scale temporal features from localized beat morphology to broader rhythm patterns at low computational cost. ECG‑NAT achieves robust performance on benchmark datasets, with 88.1% accuracy using only 1% labeled data, demonstrating strong efficacy in low‑resource settings. The framework combines superior classification performance with computational efficiency, making it practical for real‑time ECG diagnosis. The code will be made available upon acceptance at: https://github.com/Mahsagazeran/ECG‑NAT.
Authors:Sangin Lee, Yukyung Choi
Abstract:
In large vision‑language models, visual tokens typically constitute the majority of input tokens, leading to substantial computational overhead. To address this, recent studies have explored pruning redundant or less informative visual tokens for image understanding tasks. However, these methods struggle with pixel grounding tasks, where token importance is highly contingent on the input text. Through an in‑depth analysis of CLIP, we observe that visual tokens located within referent regions often exhibit low similarity to the textual representation. Motivated by this insight, we introduce LiteLVLM, a training‑free, text‑guided token pruning strategy for efficient pixel grounding inference. By reversing the ranking of CLIP's visual‑text similarity, LiteLVLM effectively retains visual tokens covering the referent regions, while recovering context tokens to enable clear foreground‑background separation. Extensive experiments demonstrate that LiteLVLM significantly outperforms existing methods by over 5% across diverse token budgets. Without any training or fine‑tuning, LiteLVLM maintains 90% of the original performance with a 22% speedup and a 2.3x memory reduction. Our code is available at https://github.com/sejong‑rcv/LiteLVLM.
Authors:Jiahao Chen, Zihui Zhang, Yafei Yang, Jinxi Li, Shenxing Wei, Zhixuan Sun, Bo Yang
Abstract:
We introduce EvObj for unsupervised 3D instance segmentation that bridges the geometric domain gap between synthetic pretraining data and real‑world point clouds. Current methods suffer from structural discrepancies when transferring object priors from synthetic datasets (e.g., ShapeNet) to real scans (e.g., ScanNet), particularly due to morphological variations and occlusion artifacts. To address this, EvObj integrates two innovative modules: (1) An object discerning module that dynamically refines object candidates, enabling continuous adaptation of object priors to target domains; and (2) An object completion module that reconstructs partial geometries after discovering objects. We conduct extensive experiments on both real‑world and synthetic datasets, demonstrating superior 3D object segmentation performance over all baselines while achieving state‑of‑the‑art results.
Authors:Guoxiong Gao, Zeming Sun, Jiedong Jiang, Yutong Wang, Jingda Xu, Peihao Wu, Bryan Dai, Bin Dong
Abstract:
Proving theorems in Lean 4 often requires identifying a scattered set of library lemmas whose joint use enables a concise proof ‑‑ a task we call global premise retrieval. Existing tools address adjacent problems: semantic search engines find individual declarations matching a query, while premise‑selection systems predict useful lemmas one tactic step at a time. Neither recovers the full premise set an entire theorem requires. We present LeanSearch v2, a two‑mode retrieval system for this task. Its standard mode applies a hierarchy‑informalized Mathlib corpus with an embedding‑reranker pipeline, achieving state‑of‑the‑art single‑query retrieval without domain‑specific fine‑tuning (nDCG@10 of 0.62 vs. 0.53 for the next‑best system). Its reasoning mode builds on standard mode as its retrieval substrate, targeting global premise retrieval through iterative sketch‑retrieve‑reflect cycles. On a 69‑query benchmark of research‑level Mathlib theorems, reasoning mode recovers 46.1% of ground‑truth premise groups within 10 retrieved candidates, outperforming strong reasoning retrieval systems (38.0%) and premise‑selection baselines (9.3%) on the same benchmark. In a controlled downstream evaluation with a fixed prover loop, replacing alternative retrievers with LeanSearch v2 yields the highest proof success (20% vs. 16% for the next‑best system and 4% without retrieval), confirming that retrieval quality propagates to proof generation. We have open‑sourced all code, data, and benchmarks. Code and data: https://github.com/frenzymath/LeanSearch‑v2 . The standard mode is publicly available with API access at https://leansearch.net/ .
Authors:Ziqi Wen, Parsa Madinei, Miguel P. Eckstein
Abstract:
Evaluating whether large vision‑language models (VLMs) align with human perception for high‑level semantic scene comprehension remains a challenge. Traditional white‑box interpretability methods are inapplicable to closed‑source architectures and passive metrics fail to isolate causal features. We introduce Counterfactual Semantic Saliency (CSS). This black‑box, model‑agnostic framework quantifies the importance of objects by measuring the semantic shift induced by their causal ablation from a scene. To evaluate AI‑human semantic alignment, we tested prominent VLMs against a human psychophysics baseline comprising 16,289 valid responses across 307 complex natural scenes and 1,306 high‑fidelity counterfactual variants. Our analysis reveals a pervasive scene comprehension gap: models exhibit an overreliance (relative to humans) on large objects (size bias), objects at the center of the image (center bias), and high saliency objects. In contrast, models rely less on people in the scenes than our human participants to describe the images. A model's size bias is a primary driver explaining variations in model‑human semantic divergence. Code and data will be available at https://github.com/starsky77/Counterfactual‑Semantic‑Saliency.
Authors:Xu Bai, Bin Lu, Kun Zhang, Shengbo Chen, Xinbing Wang, Chenghu Zhou, Meng Jin
Abstract:
Graph coarsening is a graph dimensionality reduction technique that aims to construct a smaller and more tractable graph while preserving the essential structural and semantic properties of the original graph. However, most existing methods rely on pair‑wise similarity matching, where each node independently searches for its best partner based on global information. This selfishness matching paradigm incurs substantial computational and memory overhead. To address this problem, we shift to a non‑selfishness principle that prioritizes the collective interference of neighborhood in coarsening, and propose an efficient method named NOPE, which achieves linear memory consumption and near‑linear computational complexity in the number of nodes. Furthermore, we derive a faster variant NOPE, which reduces O(δ\dot d) interference evaluation to O(d) based on the local isotropy assumption, and consequently alleviates the computational bottleneck for high‑degree nodes. Experimental results show that NOPE achieves 1.8‑10× speedup over NOPE and surpass almost all baselines with 1‑3 orders of magnitude acceleration. Meanwhile, learning on coarsened graphs yields comparable performance to original graphs, and can even show superior performance over LLM‑based graph reasoning owing to compact graph information. The code can be available at https://github.com/dazonglian/NOPE‑main.
Authors:Jiashuo Sun, Jimeng Shi, Yixuan Xie, Saizhuo Wang, Jash Rajesh Parekh, Pengcheng Jiang, Zhiyi Shi, Jiajun Fan, Qinglong Zheng, Peiran Li, Shaowen Wang, Ge Liu, Jiawei Han
Abstract:
Retrieval‑Augmented Generation (RAG) has become a standard approach for knowledge‑intensive question answering, but existing systems remain brittle on multi‑hop questions, where solving the task requires chaining multiple retrieval and reasoning steps. Key challenges are that current methods represent reasoning through free‑form natural language, where intermediate states are implicit, retrieval queries can drift from intended entities, and errors are detected by the same model that produces them making self‑reflection an unreliable, ungrounded signal. We observe that multi‑hop question answering is a typical form of step‑by‑step computation, and that this structured process aligns closely with how code‑specialized language models are trained to operate. Motivated by this, we introduce \pyrag, a framework that reformulates multi‑hop RAG as program synthesis and execution. Instead of free‑form reasoning trajectories, \pyrag represents the reasoning process as an executable Python program over retrieval and QA tools, exposing intermediate states as variables, producing deterministic feedback through execution, and yielding an inspectable trace of the entire reasoning process. This formulation further enables compiler‑grounded self‑repair and execution‑driven adaptive retrieval without any additional training. Experiments on five QA benchmarks (PopQA, HotpotQA, 2WikiMultihopQA, MuSiQue, and Bamboogle) show that \pyrag consistently outperforms strong baselines under both training‑free and RL‑trained settings, with especially large gains on compositional multi‑hop datasets. Our code, data and models are publicly available at https://github.com/GasolSun36/PyRAG.
Authors:Priyam Sahoo, Gaurav Mittal, Xiaomin Li, Shengjie Ma, Benjamin Steenhoek, Pingping Lin, Yu Hu
Abstract:
Evaluation of software engineering (SWE) agents is dominated by a binary signal: whether the final patch passes the tests. This outcome‑only view treats a principled solution and a chaotic trial‑and‑error process as equivalent. We show that this equivalence is empirically false. We evaluate 2,614 OpenHands trajectories from eight model backends on 60 SWE‑bench Verified tasks. Of these, 47 have enough passing trajectories to construct task‑level process references, yielding a 1,815‑trajectory evaluation subset. Among passing trajectories in this subset, 10.7% exhibit behavior we call a Lucky Pass: regression cycles, blind retries, missing verification, or temporally disordered exploration, implementation, and verification. We introduce AgentLens, a framework for process‑level assessment of SWE‑agent trajectories, and release AgentLens‑Bench, a dataset of 1,815 trajectories annotated with quality scores, waste signals, divergence points, and 47 task‑level Prefix Tree Acceptor (PTA) references. AgentLens builds PTA references by merging multiple passing solutions for the same task, and uses a context‑sensitive intent labeler to assign actions to Exploration, Implementation, Verification, or Orchestration based on trajectory history rather than tool identity alone. On AgentLens‑Bench, the quality score separates passing trajectories into Lucky, Solid, and Ideal tiers and further decomposes Lucky Passes into five recurring mechanisms. Across the eight model backends, Lucky rates range from 0.5% to 23.2%, and some models move by as many as five rank positions when ranked by quality score instead of pass rate. We release the anonymized project repository, including the AgentLens‑Bench dataset and AgentLens SDK, at https://github.com/microsoft/code‑agent‑state‑trajectories/.
Authors:Rohith Reddy Bellibatlu
Abstract:
Aggregate accuracy metrics dominate the evaluation of clinical AI decision‑support systems but do not detect deployment‑phase failures of input reliability, subgroup equity, threshold sensitivity, or operational feasibility. We propose the RISED Framework: a five‑dimension pre‑deployment evaluation covering Reliability, Inclusivity, Sensitivity, Equity, and Deployability, in which each dimension is operationalized through formal sub‑criteria, pre‑specified pass/fail thresholds, and bias‑corrected accelerated (BCa) bootstrap 95% confidence intervals combined under a Holm‑Bonferroni family‑wise error correction. A central demonstration is that a classifier satisfying conventional high‑discrimination benchmarks can simultaneously fail input‑encoding stability and threshold‑shift sensitivity checks, while subgroup AUC parity remains statistically inconclusive, pointing to deployment risks that aggregate evaluation alone cannot detect. We validate this differential pass/fail pattern on a synthetic cohort and three publicly available real‑world cohorts spanning 35 years of clinical data vintage, from a 1980s cardiology dataset to a 2024 nationally representative health survey, where failing dimensions differ across cohorts, providing preliminary evidence of construct validity. The Equity dimension is reframed as a proxy‑dependence diagnostic rather than a stand‑alone gate: any need‑based fairness verdict computed against a utilization‑derived proxy carries a construct‑validity problem the framework surfaces explicitly, triggering a procurement requirement for an outcome‑independent need measure before the gate is binding. RISED is released as an open‑source Python package that supplies the quantitative verdicts existing clinical AI reporting standards require, providing a principled gateway between in‑silico model validation and silent‑trial clinical evaluation.
Authors:Zhongkai Yu, Yichen Lin, Chenyang Zhou, Yuwei Zhang, Kun Zhou, Junxia Cui, Haotian Ye, Zhengding Hu, Zaifeng Pan, Ruiyi Wang, Yujie Zhao, Hejia Zhang, Jingbo Shang, Jishen Zhao, Yufei Ding
Abstract:
Existing API‑based agentic systems for RTL code generation are fundamentally misaligned with industrial practice: they assume a golden testbench is available at generation time, rely on closed‑source APIs incompatible with chip vendors' air‑gapped security requirements, and cannot be trained on vendors' proprietary RTL codebases, leaving valuable internal data unused. Recent self‑trained models address the deployment constraint but remain single‑turn generators that overlook the critical role of verification in real industrial flows. To bridge these gaps, we present ChipMATE, the first self‑trained multi‑agent framework for RTL generation. Inspired by industrial practice where correctness emerges from cross‑comparison between independently written RTL modules and reference models, ChipMATE pairs a Verilog agent with a Python reference‑model agent that mutually verify each other's outputs without any golden oracle. We design a backtrack‑based inference workflow to prevent error propagation across turns, and a two‑stage training pipeline that first trains each agent individually to saturate its code‑generation capability, then trains the team jointly to collaborate effectively. To support the training, we further build a hybrid data‑generation framework that produces 64.4K high‑quality reference model training samples. ChipMATE achieves 75.0% and 80.1% pass@1 on VerilogEval V2 with 4B and 9B base models, outperforming all existing self‑trained models and even DeepSeek V4 with 1600B parameters. Our code and model weights are publicly available in https://github.com/zhongkaiyu/ChipMATE.
Authors:Kaixiang Zhao, Bolin Shen, Yuyang Dai, Shayok Chakraborty, Yushun Dong
Abstract:
Graph neural networks (GNNs) deployed as cloud services can be \emphstolen through \emphmodel‑extraction attacks, which train a surrogate from query responses to reproduce the target's behaviour, and a growing line of ownership defenses tries to prevent or trace such theft. The title of this paper asks two questions: \emphhow hard is it to steal a GNN?, and \emphcan we stop it? Prior work cannot answer either, because experiments use inconsistent datasets, threat models, and metrics. We introduce \emphGraphIP‑Bench, a unified benchmark which evaluates both sides under a single black‑box protocol. It integrates twelve extraction attacks, twelve defenses spanning watermarking, output‑perturbation, and query‑pattern‑detection families, ten public graphs covering homophilic, heterophilic, and large‑scale regimes, three GNN backbones, and three graph‑learning tasks, and it reports fidelity, task utility, ownership verification, and computational cost on shared splits, queries, and budgets. We further add a joint attack‑and‑defense track which runs every attack on every defended target and measures watermark verification on the resulting surrogate, which exposes the protection that a defense retains after extraction. The empirical picture is short: stealing a GNN is easy at medium query budgets and most defenses do not change this; several watermarks verify reliably on the protected model but lose most of their verification signal on the extracted surrogate, which exposes a gap that single‑model evaluations miss; and heterophilic graphs are systematically harder to steal, while a cross‑architecture mismatch between target and surrogate reduces but does not prevent extraction. Code: \hrefhttps://github.com/LabRAI/GraphIP‑BenchLabRAI/GraphIP‑Bench.
Authors:Kaixiang Zhao, Tianrun Yu, Aoxu Zhang, Junhao Su, Porter Jenkins, Amanda Hughes
Abstract:
The proliferation of sophisticated image editing tools and generative artificial intelligence models has made verifying the authenticity of digital images increasingly challenging, with important implications for journalism, forensic analysis, and public trust. Although numerous forensic algorithms, ranging from handcrafted methods to deep learning‑based detectors, have been developed for manipulation detection, individual methods often suffer from limited robustness, fragmented evidence, or weak generalization across manipulation types and image conditions. To address these limitations, we present FRAME, a method for Forensic Routing and Adaptive Multi‑path Evidence fusion for image manipulation detection. FRAME organizes diverse forensic algorithms into a multi‑path analysis space, adaptively selects informative forensic paths for each input image, and fuses complementary evidence to improve detection and localization performance. By moving beyond single‑method analysis and fixed fusion strategies, FRAME provides a more robust and flexible approach to image forensic reasoning while preserving interpretable forensic cues from multiple evidence sources. Experimental results demonstrate the effectiveness of FRAME across diverse manipulation scenarios. Code is available at \hrefhttps://github.com/kzhao5/FRAMEhttps://github.com/kzhao5/FRAME.
Authors:Buyun Liang, Jinqi Luo, Liangzu Peng, Kwan Ho Ryan Chan, Darshan Thaker, Kaleab A. Kinfu, Fengrui Tian, Hamed Hassani, René Vidal
Abstract:
Large language models (LLMs) achieve strong performance across many tasks but remain vulnerable to hallucinations, motivating the need for realistic adversarial prompts that elicit such failures. We formulate hallucination elicitation as a constrained optimization problem, where the goal is to find semantically coherent adversarial prompts that are equivalent to benign user prompts. Existing methods remain limited: discrete prompt‑based attacks preserve semantic equivalence and coherence but search only over a limited set of prompt variations, while continuous latent‑space attacks explore a richer space but often decode into prompts that are no longer valid rephrasings. To address these limitations, we propose REALISTA, a realistic latent‑space attack framework. REALISTA constructs an input‑dependent dictionary of valid editing directions, each corresponding to a semantically equivalent and coherent rephrasing, and optimizes continuous combinations of these directions in latent space. This design combines the optimization flexibility of continuous attacks with the semantic realism of discrete rephrasing‑based attacks. Experiments demonstrate that REALISTA achieves superior or comparable performance to state‑of‑the‑art realistic attacks on open‑source LLMs and, crucially, succeeds in attacking large reasoning models under free‑form response settings, where prior realistic attacks fail. Code is available at https://github.com/Buyun‑Liang/REALISTA.
Authors:Alejandro Murillo-Gonzalez, Mahmoud Ali, Lantao Liu
Abstract:
Multi‑objective reinforcement learning in robotic domains requires balancing complex, non‑convex trade‑offs between conflicting objectives. While linear scalarization methods provide stability, they are theoretically incapable of recovering solutions within non‑convex regions of the Pareto front. Conversely, static non‑linear scalarizations (e.g., Tchebycheff) can theoretically access these regions but often suffer from severe gradient variance and optimization instability in deep RL. In this work, we propose an Adaptive Smooth Tchebycheff framework that resolves this tension by dynamically modulating the curvature of the optimization landscape. We introduce a novel conflict‑driven controller that regulates the optimization smoothness based on real‑time gradient interference. This allows the agent to anneal toward precise, non‑convex scalarization when objectives align, while elastically reverting to stable, smooth approximations when destructive gradient conflicts emerge. We validate our approach on a challenging robotic stealth visual search task ‑‑ a proxy for monitoring of protected/fragile ecosystems ‑‑ where an agent must balance search, exposure/interference minimization and exploration speed. Extensive ablations confirm that our conflict‑aware adaptation enables the robust discovery of Pareto‑optimal policies in non‑convex regions inaccessible to linear baselines and unstable for static non‑linear methods. Website: https://alejandromllo.github.io/research/pasta/
Authors:Jack Young
Abstract:
We introduce WriteSAE, the first sparse autoencoder that decomposes and edits the matrix cache write of state‑space and hybrid recurrent language models, where residual SAEs cannot reach. Existing SAEs read residual streams, but Gated DeltaNet, Mamba‑2, and RWKV‑7 write to a d_k × d_v cache through rank‑1 updates k_t v_t^\top that no vector atom can replace. WriteSAE factors each decoder atom into the native write shape, exposes a closed form for the per‑token logit shift, and trains under matched Frobenius norm so atoms swap one cache slot at a time. Atom substitution beats matched‑norm ablation on 92.4% of n=4,851 firings at Qwen3.5‑0.8B L9 H4, the 87‑atom population test holds at 89.8%, the closed form predicts measured effects at R^2=0.98, and Mamba‑2‑370M substitutes at 88.1% over 2,500 firings. Sustained three‑position installs at 3× lift midrank target‑in‑continuation from 33.3% to 100% under greedy decoding, the first behavioral install at the matrix‑recurrent write site.
Authors:Zhiming Yu, Wangtao Lu, Xin Lai
Abstract:
A fundamental challenge in symbolic regression (SR) is efficiently recovering complex mathematical expressions from observational data. Although this problem is NP‑hard, many expressions of practical interest decompose naturally into combinations of nonlinear feature modules, concentrating structural complexity into a small number of reusable components. Here, we introduce FePySR, a two‑stage framework that reduces the SR search space by extracting valid features prior to equation search. FePySR first employs a heterogeneous neural network to constrain observational data to a set of candidate expressions, then performs structural optimization within this refined expression space using PySR. Across five standard benchmarks, FePySR outperforms state‑of‑the‑art methods by achieving higher equation recovery rates. On a set of 75 highly complex synthesized equations, FePySR recovers 36 equations, while producing substantially smaller mean squared errors on the remaining unrecovered cases, with reduced computation time compared to PySR. FePySR's first stage also maintains consistent performance under varying numbers of selected top features and increasing levels of noise in the observational data. Applied to ordinary differential equations governing biological systems, FePySR successfully identifies governing equations in 24 out of 100 tests where PySR recovers none. Taken together, FePySR is a generalizable framework that can enhance the SR solvers, enabling the efficient and reliable recovery of symbolic expressions across scientific domains.
Authors:Yichen Feng, Yuetai Li, Chunjiang Liu, Yuanyuan Chen, Fengqing Jiang, Yue Huang, Hang Hua, Zhengqing Yuan, Kaiyuan Zheng, Luyao Niu, Bhaskar Ramasubramanian, Basel Alomair, Xiangliang Zhang, Misha Sra, Zichen Chen, Radha Poovendran, Zhangchen Xu
Abstract:
Multimodal large language models (MLLMs) are now routinely deployed for visual understanding, generation, and curation. A substantial fraction of these applications require an explicit aesthetic judgment. Most existing solutions reduce this judgment to predicting a scalar score for a single image. We first ask whether such scores faithfully capture comparative preference: in a controlled study with eight expert annotators, score‑derived rankings align poorly with the same annotators' direct comparisons, while direct ranking yields substantially higher inter‑annotator agreement on best‑ and worst‑image labels. Motivated by this finding, we introduce the Visual Aesthetic Benchmark (VAB), which casts aesthetic evaluation as comparative selection over candidate sets with matched subject matter. VAB contains 400 tasks and 1,195 images across fine art, photography, and illustration, with labels derived from the consensus of 10 independent expert judges per task. Evaluating 20 frontier MLLMs and six dedicated visual‑quality reward models, we find that the strongest system identifies both the best and the worst image correctly across three random permutations of the candidate order in only 26.5% of tasks, far below the 68.9% achieved by human experts. Fine‑tuning a 35B‑parameter model on 2,000 expert examples brings its accuracy close to that of a 397B‑parameter open‑weight model, suggesting that the comparative signal in VAB is transferable. Together, these results expose a clear and measurable gap between current multimodal models and expert aesthetic judgment, and VAB provides the first set‑based, expert‑grounded testbed on which that gap can be tracked and closed.
Authors:Chenhao Qiu, Yechao Zhang, Xin Luo, Shien Song, Xusheng Liu
Abstract:
Long video question answering requires locating sparse, time‑scattered visual evidence within highly redundant content. Although current MLLMs perform well on short videos, long videos introduce long‑horizon search and verification, which often necessitates multi‑turn, agentic interaction. We show that existing LVU agents can exhibit "evidence misalignment": they produce correct answers that are not supported by the retrieved or inspected evidence. To characterize this failure, we introduce two diagnostics (temporal groundedness and semantic groundedness) and use them to reveal two pressures that amplify misalignment: prompt pressure from shared‑context saturation at inference time and reward pressure from outcome‑only optimization during training. These findings point to a structural root cause: the coupled agent paradigm conflates long‑horizon planning with answer authority. We therefore propose the decoupled planner‑inspector framework, which separates planning from answer authority and gates final answering on pixel‑level verification. Across four long‑video benchmarks, our framework improves both answer accuracy and evidence alignment, achieving 55.1% on LVBench and 62.0% on LongVideoBench while producing interpretable search trajectories. Moreover, the decoupled architecture scales consistently with increased search budgets and supports plug‑and‑play upgrades of the MLLM backbone without retraining the planner. Code and models are available at https://github.com/Echochef/VideoSEAL.
Authors:Siqi Miao, Ziyang Chen, Yuhong Luo, Hans Hao-Hsun Hsu, Mufei Li, Kaiqing Zhang, Pan Li
Abstract:
While Large Language Model (LLM) multi‑agent systems (MAS) offer a transformative approach to simulating human behavior in complex systems, it remains largely unexplored whether these simulations can replicate realistic structural and temporal dynamics from a dynamic network perspective. Our evaluation indicates that existing frameworks excel at generating plausible micro‑level interactions but fail to capture the emergent, macroscopic topologies necessary for domains that rely on realistic network dynamics, such as modeling information propagation and cybersecurity threats. To bridge this gap, we introduce two easily integrable extensions to simulation frameworks to ensure they preserve macroscopic network fidelity: 1) augmenting LLM agents with data‑driven event triggers to organically sustain long‑horizon interactions, and 2) integrating Hawkes processes to accurately model temporal activation dynamics. Our approach allows LLM MAS to capture both plausible micro‑level patterns and macroscopic topologies. We further demonstrate the utility of this framework in synthesizing realistic phishing campaigns within evolving communication networks. The study reveals how threats exploit structural vulnerabilities, highlighting the potential of our framework for developing next‑generation defenses. Our code is available at https://github.com/Graph‑COM/NSL.
Authors:Rishabh Tiwari, Kusha Sareen, Lakshya A Agrawal, Joseph E. Gonzalez, Matei Zaharia, Kurt Keutzer, Inderjit S Dhillon, Rishabh Agarwal, Devvrit Khatri
Abstract:
Large language models (LLMs) are trained for downstream tasks by updating their parameters (e.g., via RL). However, updating parameters forces them to absorb task‑specific information, which can result in catastrophic forgetting and loss of plasticity. In contrast, in‑context learning with fixed LLM parameters can cheaply and rapidly adapt to task‑specific requirements (e.g., prompt optimization), but cannot by itself typically match the performance gains available through updating LLM parameters. There is no good reason for restricting learning to being in‑context or in‑weights. Moreover, humans also likely learn at different time scales (e.g., System 1 vs 2). To this end, we introduce a fast‑slow learning framework for LLMs, with model parameters as "slow" weights and optimized context as "fast" weights. These fast "weights" can learn from textual feedback to absorb the task‑specific information, while allowing slow weights to stay closer to the base model and persist general reasoning behaviors. Fast‑Slow Training (FST) is up to 3x more sample‑efficient than only slow learning (RL) across reasoning tasks, while consistently reaching a higher performance asymptote. Moreover, FST‑trained models remain closer to the base LLM (up to 70% less KL divergence), resulting in less catastrophic forgetting than RL‑training. This reduced drift also preserves plasticity: after training on one task, FST trained models adapt more effectively to a subsequent task than parameter‑only trained models. In continual learning scenarios, where task domains change on the fly, FST continues to acquire each new task while parameter‑only RL stalls.
Authors:Junyu Xiong, Yuan Pu, Jia Tang, Yazhe Niu
Abstract:
Leveraging the rich world knowledge of Large Language Models (LLMs) to enhance Reinforcement Learning (RL) agents offers a promising path toward general intelligence. However, a fundamental prior‑dynamics mismatch hinders existing approaches: static LLM knowledge cannot directly adapt to the complex transition dynamics of long‑horizon tasks. Using LLM priors as fixed policies limits exploration diversity, as the prior is blind to environment‑specific dynamics; while end‑to‑end fine‑tuning suffers from optimization instability and credit assignment issues. To bridge this gap, we propose PriorZero, a unified framework that integrates LLM‑derived conceptual priors into world‑model‑based planning through a decoupled rollout‑training design. During rollout, a novel root‑prior injection mechanism incorporates LLM priors exclusively at the root node of Monte Carlo Tree Search (MCTS), focusing search on semantically promising actions while preserving the world model's deep lookahead capability. During training, PriorZero decouples world‑model learning from LLM adaptation: the world model is continuously refined on interaction data to jointly improve its dynamics, policy, and value predictions, its value estimates are then leveraged to provide fine‑grained credit assignment signals for stable LLM fine‑tuning via alternating optimization. Experiments across diverse benchmarks, including text‑based adventure games in Jericho and instruction‑following gridworld tasks in BabyAI, demonstrate that PriorZero consistently improves both exploration efficiency and asymptotic performance, establishing a promising framework for LLM‑empowered decision‑making. Our code is available at https://github.com/opendilab/LightZero.
Authors:Deepak Kumar, Baban Gain, Asif Ekbal
Abstract:
Automatic Speech Recognition (ASR) transcripts often contain disfluencies, such as fillers, repetitions, and false starts, which reduce readability and hinder downstream applications like chatbots and voice assistants. If left unaddressed, such disfluencies can significantly degrade the reliability of downstream systems. Most existing approaches rely on classical models that focus on identifying disfluent tokens for removal. While this strategy is effective to some extent, it often disrupts grammatical structure and semantic coherence, leading to incomplete or unnatural sentences. Recent literature explored the use of large language models (LLMs); however, these efforts have primarily focused on disfluency detection or data augmentation, rather than performing comprehensive correction. We propose a multilingual correction pipeline where a sequence tagger first marks disfluent tokens, and these signals guide instruction fine‑tuning of an LLM to rewrite transcripts into fluent text. To further improve reliability, we add a contrastive learning objective that penalizes the reproduction of disfluent tokens, encouraging the model to preserve grammar and meaning while removing disfluent artifacts. Our experiments across three Indian languages, namely Hindi, Bengali, and Marathi show consistent improvements over strong baselines, including multilingual sequence‑to‑sequence models. These results highlight that detection‑only strategies are insufficient. Combining token‑level cues with instruction tuning and contrastive learning provides a practical and scalable solution for multilingual disfluency correction in speech‑driven NLP systems. We make the codes publicly available at https://github.com/deepak‑kumar‑98/Mind‑the‑Pause.
Authors:Matthew M. Hong, Jesse Zhang, Anusha Nagabandi, Abhishek Gupta
Abstract:
Fine‑tuning pre‑trained robot policies with reinforcement learning (RL) often inherits the bottlenecks introduced by pre‑training with behavioral cloning (BC), which produces narrow action distributions that lack the coverage necessary for downstream exploration. We present a unified framework that enables the exploration necessary to enable efficient robot policy finetuning by bridging BC pre‑training and RL fine‑tuning. Our pre‑training method, Context‑Smoothed Pre‑training (CSP), injects forward‑diffusion noise into policy inputs, creating a continuum between precise imitation and broad action coverage. We then fine‑tune pre‑trained policies via Timestep‑Modulated Reinforcement Learning (TMRL), which trains the agent to dynamically adjust this conditioning during fine‑tuning by modulating the diffusion timestep, granting explicit control over exploration. Integrating seamlessly with arbitrary policy inputs, e.g., states, 3D point clouds, or image‑based VLA policies, we show that TMRL improves RL fine‑tuning sample efficiency. Notably, TMRL enables successful real‑world fine‑tuning on complex manipulation tasks in under one hour. Videos and code available at https://weirdlabuw.github.io/tmrl/.
Authors:Vladislav Savenkov
Abstract:
We present Curated Industrial Developer Repository (CIDR), a large‑scale dataset of real‑world software repositories collected through direct collaboration with 12 industrial partner organizations. The dataset comprises 2,440 repositories spanning 138 programming languages and totalling 373 million lines of code, accompanied by structured per‑repository metadata. Unlike existing code corpora derived from public open‑source platforms, CIDR consists exclusively of proprietary production codebases contributed under formal data sharing agreements, covering application domains including enterprise web and mobile development, fintech, and custom software consultancy. All repositories were processed through a multi‑stage pipeline encompassing structured partner onboarding, two‑stage quality selection combining automated metadata filtering with manual code review, and a deterministic anonymization pipeline covering the full version control history. The dataset is intended to support research in code intelligence, software quality analysis, pre‑training and fine‑tuning of code language models, developer behaviour studies, and construction of agent evaluation benchmarks. Access is provided under a restricted commercial license; details are available at https://fermatix.ai/#Contact.
Authors:Oleg Solozobov
Abstract:
Agentic AI failures need post‑hoc reconstruction: what the agent did, on whose authority, against which policy, and from what reasoning. Cross‑regime feasibility remains unmeasured under one property‑level schema. We apply the Decision Trace Reconstructor unmodified to pinned worked‑example anchors from six public vendor SDK regimes spanning cloud‑agent, observability, tool‑use, telemetry, and protocol traces, plus two comparator columns. Each Decision Event Schema (DES) property is classified as fully fillable, partially fillable, structurally unfillable, or opaque. Per‑property reconstructability of an agent decision already varies between regimes at this anchor scale. Strict‑governance‑completeness separates into three tiers ranging from 42.9% to 85.7%, yielding one regime‑independent gap (reasoning trace), four regime‑dependent gaps, and one Mixed property; the pilot is single‑annotator, one anchor per cell, descriptive, with outputs checksum‑verifiable from a deposited reproducibility package.
Authors:Zhong Guan, Yongjian Guo, Haoran Sun, Wen Huang, Shuai Di, Xiong Jun Wu, Likang Wu, Hongke Zhao
Abstract:
Asynchronous reinforcement learning improves rollout throughput for large language model agents by decoupling sample generation from policy optimization, but it also introduces a critical failure mode for PPO‑style off‑policy correction. In heterogeneous training systems, the total importance ratio should ideally be decomposed into two semantically distinct factors: a \emphtraining‑‑inference discrepancy term that aligns inference‑side and training‑side distributions at the same behavior‑policy version, and a \emphpolicy‑staleness term that constrains the update from the historical policy to the current policy. We show that practical asynchronous pipelines with delayed updates and partial rollouts often lose the required historical training‑side logits, or old logits. This missing‑old‑logit problem entangles discrepancy repair with staleness correction, breaks the intended semantics of decoupled correction, and makes clipping and masking thresholds interact undesirably. To address this issue, we study both exact and approximate correction routes. We propose three exact old‑logit acquisition strategies: snapshot‑based version tracking, a dedicated old‑logit model, and synchronization via partial rollout interruption, and compare their system trade‑offs. From the perspective of approximate correction, we focus on preserving the benefits of decoupled correction through a more appropriate approximate policy when exact old logits cannot be recovered at low cost, without incurring extra system overhead. Following this analysis, we adopt a revised PPO‑EWMA method, which achieves significant gains in both training speed and optimization performance. Code at https://github.com/millioniron/ROLL.
Authors:Muhammad Aqeel, Maham Nazir, Uzair Khan, Marco Cristani, Francesco Setti
Abstract:
Zero‑shot anomaly detection aims to identify defects in unseen categories without target‑specific training. Existing methods usually apply the same feature transformation to all samples, treating normal and anomalous data uniformly despite their fundamentally asymmetric distributions, compact normals versus diverse anomalies. We instead exploit this natural asymmetry by proposing AVA‑DINO, an anomaly‑aware vision‑language adaptation framework with dual specialized branches for normal and anomalous patterns that adapt frozen DINOv3 visual features. During training on auxiliary data, the two branches are learned jointly with a text‑guided routing mechanism and explicit routing regularization that encourages branch specialization. At test time, only the input image and fixed, predefined language descriptions are used to dynamically combine the two branches, enabling an asymmetric activation. This design prevents degenerate uniform routing and allows context‑specific feature transformations. Experiments across nine industrial and medical benchmarks demonstrate state‑of‑the‑art performance, achieving 93.5% image‑AUROC on MVTec‑AD and strong cross‑domain generalization to medical imaging without domain‑specific fine‑tuning. https://github.com/aqeeelmirza/AVA‑DINO
Authors:Che Liu, Lichao Ma, Xiangyu Tony Zhang, Yuxin Zhang, Haoyang Zhang, Xuerui Yang, Fei Tian
Abstract:
Omni‑modal language models are intended to jointly understand audio, visual inputs, and language, but benchmark gains can be inflated when visual evidence alone is enough to answer a query. We study whether current omni‑modal benchmarks separate visual shortcuts from genuine audio‑visual‑language evidence integration, and how post‑training behaves under a visually debiased evaluation setting. We audit nine omni‑modal benchmarks with visual‑only probing, remove visually solvable queries, and retain full subsets when filtering is undefined or would make comparisons unstable. This yields OmniClean, a cleaned evaluation view with 8,551 retained queries from 16,968 audited queries. On OmniClean, we evaluate OmniBoost, a three‑stage post‑training recipe based on Qwen2.5‑Omni‑3B: mixed bi‑modal SFT, mixed‑modality RLVR, and SFT on self‑distilled data. Balanced bi‑modal SFT gives limited and uneven gains, RLVR provides the first broad improvement, and self‑distillation reshapes the benchmark profile. After SFT on self‑distilled data, the 3B model reaches performance comparable to, and in aggregate slightly above, Qwen3‑Omni‑30B‑A3B‑Instruct without using a stronger omni‑modal teacher. These results show that omni‑modal progress is easier to interpret when evaluation controls visual leakage, and that small omni‑modal models can benefit from staged post‑training with self‑distilled omni‑query supervision. Project page: https://cheliu‑computation.github.io/omni/
Authors:Zhennan Chen, Junwei Zhu, Xu Chen, Jiangning Zhang, Jiawei Chen, Zhuoqi Zeng, Wei Zhang, Chengjie Wang, Jian Yang, Ying Tai
Abstract:
Pixel diffusion models have recently regained attention for visual generation. However, training advanced pixel‑space models from scratch demands prohibitive computational and data resources. To address this, we propose the Latent‑to‑Pixel (L2P) transfer paradigm, an efficient framework that directly harnesses the rich knowledge of pre‑trained LDMs to build powerful pixel‑space models. Specifically, L2P discards the VAE in favor of large‑patch tokenization and freezes the source LDM's intermediate layers, exclusively training shallow layers to learn the latent‑to‑pixel transformation. By utilizing LDM‑generated synthetic images as the sole training corpus, L2P fits an already smooth data manifold, enabling rapid convergence with zero real‑data collection. This strategy allows L2P to seamlessly migrate massive latent priors to the pixel space using only 8 GPUs. Furthermore, eliminating the VAE memory bottleneck unlocks native 4K ultra‑high resolution generation. Extensive experiments across mainstream LDM architectures show that L2P incurs negligible training overhead, yet performs on par with the source LDM on DPG‑Bench and reaches 93% performance on GenEval.
Authors:Xiaolin Zhou, Jinbo Liu, Li Li, Ryan A. Rossi, Xiyang Hu
Abstract:
Large Language Model agents are increasingly augmented with agent skills. Current evaluation methods for skills remain limited. Most deployed benchmarks report only pass rate before and after a skill is attached, treating the skill as a black box change to agent behavior. We introduce Counterfactual Trace Auditing (CTA), a framework for measuring how a skill changes agent behavior. CTA pairs each with skill agent trace with a without skill counterpart on the same task, segments both traces into goal directed phases, aligns the phases, and emits structured Skill Influence Pattern (SIP) annotations. These annotations describe the behavioral effect of a skill rather than only its task outcome. We instantiate CTA on SWE‑Skills‑Bench with Claude across 49 software engineering tasks. The resulting audit reveals a clear evaluation gap. Pass rate changes by only +0.3 percentage points on average, suggesting little aggregate effect. Yet CTA identifies 522 SIP instances across the same paired traces, showing that the skills substantially reshape agent behavior even when pass rate is nearly unchanged. The audit also separates several recurring effects that pass rate cannot detect, including literal template copying, off task artifact creation, excess planning, and task recovery. Three findings emerge. First, high baseline tasks contain most of the observed skill effects, although their pass rate is already saturated and therefore cannot reflect those effects. Second, tasks with moderate baseline performance show the most recoverable gain, but often at substantially higher token cost. Third, the dominant SIP type can be identified by baseline bucket: surface anchoring is most common on ceiling tasks and edge‑case prompting is most common on mid‑range and floor tasks. These regularities turn informal failure mode observations into reproducible behavioral measurements.
Authors:Xiaolin Zhou, Aojie Yuan, Zheng Luo, Zipeng Ling, Xixiao Pan, Yicheng Gao, Haiyue Zhang, Jiate Li, Shuli Jiang, Prince Zizhuang Wang, Zixuan Zhu, Jinbo Liu, Ryan A. Rossi, Hua Wei, Xiyang Hu
Abstract:
Tool‑use language agents are evaluated on benchmarks that assume clean inputs, unambiguous tool registries, and reliable APIs. Real deployments violate all these assumptions: user typos propagate into hallucinated tool names, a misconfigured request timeout can stall an agent indefinitely, and duplicate tool names across servers can freeze an SDK. We study these failures as a sim‑to‑real gap in the tool‑use partially observable Markov decision process (POMDP), where deployment noise enters through the observation, action space, reward‑relevant metadata, or transition dynamics. We introduce RobustBench‑TC, a benchmark with 22 perturbation types organized by these four POMDP components, each grounded in a verified GitHub issue or documented tool‑calling failure. Across 21 models from 1.5B to 32B parameters (including the closed‑source o4‑mini), the robustness profile is sharply uneven: observation perturbations reduce accuracy by less than 5%, while reward‑relevant and transition perturbations reduce accuracy by roughly 40% and 30%, respectively; scale alone does not close these gaps. We then propose ToolRL‑DR, a domain‑randomization reinforcement learning (RL) recipe that trains a tool‑use agent on perturbation‑augmented trajectories spanning the three statically encodable POMDP components. On a 3B backbone, ToolRL‑DR‑Full retains roughly three‑quarters of clean accuracy and reaches an aggregate perturbed accuracy comparable to open‑source 14B function‑calling baselines while substantially narrowing the gap to o4‑mini. It closes approximately 27% of the Transition gap despite never seeing transition perturbations in training, suggesting that RL on adversarial static tool‑use inputs induces a more persistent retry policy that transfers to unseen runtime failures. The dataset, code and benchmark leaderboard are publicly available.
Authors:Shuo Xu, Jiakun Zhang, Junyu Lai, Chun Cao, Jingwei Xu
Abstract:
Automated theorem proving with large language models in Lean 4 is commonly approached through either step‑level tactic prediction with tree search or whole‑proof generation. These two paradigms represent opposite granularities for constructing supervised training data: the former provides dense local signals but may fragment coherent proof processes, while the latter preserves global structure but requires complex end‑to‑end generation. In this paper, we revisit supervision granularity as a training set construction problem over proof trajectories and propose segment‑level supervision, a training data construction strategy that extracts locally coherent proof segments for training policy models. We further reuse the same strategy at inference time to trigger short rollouts for existing step‑level models. When trained with segment‑level supervision on STP, LeanWorkbook, and NuminaMath‑LEAN, the resulting policy models achieve proof success rates of 64.84%, 60.90%, and 66.31% on miniF2F, respectively, consistently outperforming both step‑level and whole‑proof baselines. Goal‑aware rollout further improves existing step‑level provers while reducing inference costs. It increases the proof success rate of BFS‑Prover‑V2‑7B from 68.77% to 70.74% and that of InternLM2.5‑StepProver from 59.59% to 60.33%, showing that appropriate supervision granularity better aligns model learning with proof structure and search. Code and models are available at https://github.com/NJUDeepEngine/SEG‑ATP.
Authors:Huiyu Yi, Zhiming Xu, Dunwei Tu, Zhicheng Wang, Baile Xu, Furao Shen
Abstract:
The Nearest Class Mean (NCM) classifier is widely favored in Class‑Incremental Learning (CIL) for its superior resistance to catastrophic forgetting compared to Fully Connected layers. While Neural Collapse (NC) theory supports NCM's optimality by assuming features collapse into single points, non‑linear feature drift and insufficient training in CIL often prevent this ideal state. Consequently, classes manifest as complex manifolds rather than collapsed points, rendering the single‑point NCM suboptimal. To address this, we propose Hierarchical‑Cluster SOINN (HC‑SOINN), a novel classifier that captures the topological structure of these manifolds via a ``local‑to‑global'' representation. Furthermore, we introduce Structure‑Topology Alignment via Residuals (STAR) method, which employs a fine‑grained pointwise trajectory tracking mechanism to actively deform the learned topology, allowing it to adapt precisely to complex non‑linear feature drift. Theoretical analysis and Procrustes distance experiments validate our framework's resilience to manifold deformations. We integrated HC‑SOINN into seven state‑of‑the‑art methods by replacing their original classifiers, achieving consistent improvements that highlight the effectiveness and robustness of our approach. Code is available at https://github.com/yhyet/HC_SOINN.
Authors:Chia-Pei, Chen, Kentaroh Toyoda, Anita Lai, Alex Leung
Abstract:
Web‑browsing AI agents are increasingly deployed in enterprise settings under strict whitelists of approved domains, yet adversaries can still influence them by embedding hidden instructions in the HTML pages those domains serve. Existing red‑teaming resources fall short of this scenario: prompt‑injection benchmarks ship pre‑built adversarial pages that whitelisted agents cannot reach, and generic LLM scanners probe the model API rather than its retrieved content. We present IPI‑proxy, an open‑source toolkit for red‑teaming web‑browsing agents against indirect prompt injection (IPI). At its core is an intercepting proxy that rewrites real HTTP responses from whitelisted domains in flight, embedding payloads drawn from a unified library of 820 deduplicated attack strings extracted from six published benchmarks (BIPIA, InjecAgent, AgentDojo, Tensor Trust, WASP, and LLMail‑Inject). A YAML‑driven test harness independently parameterizes the payload set, the embedding technique (HTML comment, invisible CSS, or LLM‑generated semantic prose), and the HTML insertion point (6 locations from \icodehead\_meta to \icodescript\_comment), enabling parameter‑sweep evaluation without mock pages or sandboxed environments. A companion exfiltration tracker logs successful callbacks. This paper describes the threat model, situates IPI‑proxy among contemporary IPI benchmarks and red‑teaming tools, and details its architecture, design decisions, and configuration interface. By bridging static benchmarks and live deployment, IPI‑proxy gives AI security teams a reproducible substrate for measuring and hardening web‑browsing agents against indirect prompt injection on the same retrieval surface attackers exploit in production.
Authors:Yiqun Sun, Pengfei Wei, Lawrence B. Hsieh
Abstract:
Listwise reranking is a key yet computationally expensive component in vision‑centric retrieval and multimodal retrieval‑augmented generation (M‑RAG) over long documents. While recent VLM‑based rerankers achieve strong accuracy, their practicality is often limited by long visual‑token sequences and multi‑step autoregressive decoding. We propose ZipRerank, a highly efficient listwise multimodal reranker that directly addresses both bottlenecks. It reduces input length via a lightweight query‑image early interaction mechanism and eliminates autoregressive decoding by scoring all candidates in a single forward pass. To enable effective learning, ZipRerank adopts a two‑stage training strategy: (i) listwise pretraining on large‑scale text data rendered as images, and (ii) multimodal finetuning with VLM‑teacher‑distilled soft‑ranking supervision. Extensive experiments on the MMDocIR benchmark show that ZipRerank matches or surpasses state‑of‑the‑art multimodal rerankers while reducing LLM inference latency by up to an order of magnitude, making it well‑suited for latency‑sensitive real‑world systems. The code is available at https://github.com/dukesun99/ZipRerank.
Authors:Minseok Kang, Minhyeok Lee, Jungho Lee, Minjung Kim, Donghyeong Kim, Dayeon Lee, Heeseung Choi, Ig-jae Kim, Sangyoun Lee
Abstract:
As Video Large Language Models (Video‑LLMs) scale to longer and more complex videos, their inference cost grows rapidly due to the large volume of visual tokens accumulated across frames. Training‑free token compression has emerged as a practical solution to this bottleneck. However, existing temporal compression methods rely primarily on cross‑frame token similarity or segmentation heuristics, overlooking each token's semantic role within its frame and failing to adapt compression strength to the compressibility of each frame pair. In this work, we propose OTT‑Vid, a transport‑derived allocation framework for temporal token compression. Our approach consists of two stages: spatial pruning identifies representative content within each frame, and optimal transport (OT) is then solved between neighboring frames to estimate temporal compressibility. We formulate this OT with non‑uniform token mass, which protects semantically important tokens from aggressive compression, and a locality‑aware cost that captures both feature and spatial disparities. The resulting transport plan jointly balances token importance and matching cost, while its total cost defines the transport difficulty of each frame pair, which we use to allocate compression budgets dynamically. Experiments on six benchmarks spanning video question answering and temporal grounding show that OTT‑Vid preserves 95.8% of VQA and 73.9% of VTG performance while retaining only 10% of tokens, consistently outperforming existing state‑of‑the‑art training‑free compression methods.
Authors:Xianzhe Fan, Yuxiang Lu, Shenyuan Gao, Xiaoyang Wu, Ruihua Han, Manling Li, Hengshuang Zhao
Abstract:
Vision‑Language‑Action (VLA) models are often brittle in fine‑grained manipulation, where minor action errors during the critical phases can rapidly escalate into irrecoverable failures. Since existing VLA models rely predominantly on successful demonstrations for training, they lack an explicit awareness of failure during these critical phases. To address this, we propose DreamAvoid, a critical‑phase test‑time dreaming framework that enables VLA models to anticipate and avoid failures. We also introduce an autonomous boundary learning paradigm to refine the system's understanding of the subtle boundary between success and failure. Specifically, we (1) utilize a Dream Trigger to determine whether the execution has entered a critical phase, (2) sample multiple candidate action chunks from the VLA via an Action Proposer, and (3) employ a Dream Evaluator, jointly trained on mixed data (success, failure, and boundary cases), to "dream" the short‑horizon futures corresponding to the candidate actions, evaluate their values, and select the optimal action. We conduct extensive evaluations on real‑world manipulation tasks and simulation benchmarks. The results demonstrate that DreamAvoid can effectively avoid failures, thereby improving the overall task success rate. Our code is available at https://github.com/XianzheFan/DreamAvoid.
Authors:Wenhao Chen, Sirui Sun, Shengyuan Bai, Guojie Song
Abstract:
Aligning large language models (LLMs) with human values typically relies on post‑training or inference‑time steering that directly manipulates the backbone's parameters or representation space. However, a critical gap exists: the model's residual stream is highly dynamic, in which values exist as fragile, low‑dimensional properties, inherently incompatible with the stability required for consistent value expression. In this paper, we propose the Stable Value Guidance Transformer (SVGT), which addresses this gap through an independent value module incorporating two key designs: (1) independent value modeling, maintaining normative representations in a dedicated value space isolated from the backbone, and (2) explicit behavioral guidance, transducing these stable signals into learnable latent Bridge Tokens. These tokens serve as dynamic value anchors to explicitly steer the generative trajectory, ensuring robust adherence across diverse contexts without disrupting the backbone's internal representations. Experiments across multiple backbones and safety benchmarks show that SVGT generally reduces harmful scores by over 70% while maintaining generation fluency, demonstrating the efficacy of architecturally grounded value modeling. Our code is available at https://github.com/Clervils/SVGT.git.
Authors:Jiafei Lyu, Zichuan Lin, Scott Fujimoto, Kai Yang, Yangkun Chen, Saiyong Yang, Zongqing Lu, Deheng Ye
Abstract:
Model‑based representations recently stand out as a promising framework that embeds latent dynamics information into the representations for downstream off‑policy actor‑critic learning. It implicitly combines the advantages of both model‑free and model‑based approaches while avoiding the training costs associated with model‑based methods. Nevertheless, existing model‑based representation methods can fail to capture sufficient information about relevant variables and can overfit to early experiences in the replay buffer. These incur biases in representation and actor‑critic learning, leading to inferior performance. To address this, we propose Debiased model‑based Representations for Q‑learning, tagged DR.Q algorithm. DR.Q explicitly maximizes the mutual information between the representations of the current state‑action pair and the next state besides minimizing their deviations, and samples transitions with faded prioritized experience replay. We evaluate DR.Q on numerous continuous control benchmarks with a single set of hyperparameters, and the results demonstrate that DR.Q can match or surpass recent strong baselines, sometimes outperforming them by a large margin. Our code is available at https://github.com/dmksjfl/DR.Q.
Authors:Lezhong Wang, Mehmet Onurcan Kaya, Siavash Bigdeli, Jeppe Revall Frisvad
Abstract:
Recent single‑image relighting methods, powered by advanced generative models, have achieved impressive photorealism on synthetic benchmarks. However, their effectiveness in the complex visual landscape of the real world remains largely unverified. A critical gap exists, as current datasets are typically designed for multi‑view reconstruction and fail to address the unique challenges of single‑image relighting. To bridge this synthetic‑to‑real gap, we introduce WildRelight, the first in‑the‑wild dataset specifically created for evaluating single‑image relighting models. WildRelight features a diverse collection of high‑resolution outdoor scenes, captured under strictly aligned, temporally varying natural illuminations, each paired with a high‑dynamic‑range environment map. Using this data, we establish a rigorous benchmark revealing that state‑of‑the‑art models trained on synthetic data suffer from severe domain shifts. The strictly aligned temporal structure of WildRelight enables a new paradigm for domain adaptation. We demonstrate this by introducing a physics‑guided inference framework that leverages the captured natural light evolution as a self‑supervised constraint. By integrating Diffusion Posterior Sampling (DPS) with temporal Sampling‑Aware Test‑Time Adaptation (TTA), we show that the dataset allows synthetic models to align with real‑world statistics on‑the‑fly, transforming the intractable sim‑to‑real challenge into a tractable self‑supervised task. The dataset and code will be made publicly available to foster robust, physically‑grounded relighting research.
Authors:ShiYing Huang, Liang Lin, Yuer Li, Kaiwen Luo, Zhenhong Zhou, An Zhang, Junhao Dong, Kun Wang, Zhigang Zeng
Abstract:
In the realm of multi‑objective alignment for large language models, balancing disparate human preferences often manifests as a zero‑sum conflict. Specifically, the intrinsic tension between competing goals dictates that aggressively optimizing for one metric (e.g., helpfulness) frequently incurs a substantial penalty on another (e.g., harmlessness). While prior work mainly focuses on data selection, parameter merging, or algorithmic balancing during training, these approaches merely force compromises between divergent preferences along a fixed Pareto frontier, failing to fundamentally resolve the inherent trade‑off. In this work, we approach this problem from a novel perspective of multi‑dimensional rewards. By scaling up the model's rollouts and analyzing the outputs across different reward dimensions, we arrive at a critical conclusion: the conflict among multiple objectives stems from the fact that the prompt itself inherently restricts the achievable multi‑dimensional rewards. Based on this core observation, we propose MORA: Multi‑Objective Reward Assimilation. Specifically, MORA isolates single‑reward prompts through pre‑sampling and expands their reward diversity by rewriting the original questions to incorporate multi‑dimensional intents. Extensive experiments demonstrate that: (1) in sequential alignment, MORA achieves single‑preference improvements ranging from 5% to 12.4%, with exceptional gains in harmlessness, after multiple‑preference alignment across helpful, harmless, and truthful dimensions. (2) In simultaneous alignment, MORA achieves an average overall reward improvement of 4.6%. Our codes are available at https://github.com/Shiying‑Huang/MORA‑MPA.
Authors:Liqin Ye, Yanbin Yin, Michael Galarnyk, Yuzhao Heng, Sudheer Chava, Chao Zhang
Abstract:
The reasoning frontier of Large Language Models (LLMs) has advanced significantly through modern post‑training paradigms (e.g., Reinforcement Learning from Verifiable Rewards (RLVR)). However, the efficacy of these methods remains fundamentally constrained by the diversity and complexity of the training data. One practical solution is data synthesis; yet, prevalent methods relying on unstructured mutation or exploration suffer from homogeneity collapse, failing to systematically expand the reasoning frontier. To overcome this, we propose Evoutionary Task Discovery (EvoTD), a framework that treats data synthesis as a directed search over a dual‑axis manifold of Algorithmic Skills and Complexity Attributes. We introduce structured evolutionary operators to navigate this space: a Crossover operator that synthesizes novel skill compositions to enhance diversity, and a Parametric Mutation operator that scales structural constraints (e.g., input size, tree depth) to drive robust generalization. Crucially, we integrate a dynamic Zone of Proximal Development filter, ensuring tasks lie within the learnable region of the model. Empirically, EvoTD delivers substantial reasoning gains that generalize consistently across model architectures, pretraining regimes, and scales, demonstrating that structured evolutionary curricula can effectively support reasoning improvement. We release our code on https://github.com/liqinye/EvoTD.
Authors:Wen Lai, Yingli Shen, Dingnan Jin, Qing Cui, Jun Zhou, Maosong Sun, Alexander Fraser
Abstract:
Autoregressive language models are widely used for text evaluation, however, their left‑to‑right factorization introduces positional bias, i.e., early tokens are scored with only leftward context, conflating architectural asymmetry with true text quality. We propose masked reconstruction as an alternative paradigm, where every token is scored using full bidirectional context. We introduce DiffScore, an evaluation framework built on Masked Large Diffusion Language Models. By measuring text recoverability across continuous masking rates, DiffScore eliminates positional bias and naturally establishes an evaluation hierarchy from local fluency to global coherence. We further provide diagnostic tools unavailable to autoregressive frameworks: multi‑timestep quality profiles that decompose scores across masking rates, and bidirectional PMI decomposition that disentangles fluency from faithfulness. Experiments across ten benchmarks show that DiffScore consistently outperforms autoregressive baselines in both zero‑shot and fine‑tuned settings. The code is released at: https://github.com/wenlai‑lavine/DiffScore.
Authors:Madhurima Panja, Danny D'Agostino, Huitao Li, Tanujit Chakraborty, Nan Liu
Abstract:
The increasing adoption of data‑driven decision‑making in public health has established epidemic forecasting as a critical area of research. Recent advances in multivariate forecasting models better capture complex temporal dependencies than conventional univariate approaches, which model individual series independently. Despite this potential, the development of robust epidemic forecasting methods is constrained by the lack of high‑quality benchmarks comprising diverse multivariate datasets across infectious diseases and geographical regions. To address this gap, we present EpiCastBench, a large‑scale benchmarking framework featuring 40 curated (correlated) multivariate epidemic datasets. These publicly available datasets span a wide range of infectious diseases and exhibit diverse characteristics in terms of temporal granularity, series length, and sparsity. We analyze these datasets to identify their global features and structural patterns. To ensure reproducibility and fair comparison, we establish standardized evaluation settings, including a unified forecasting horizon, consistent preprocessing pipelines, diverse performance metrics, and statistical significance testing. By leveraging this framework, we conduct a comprehensive evaluation of 15 multivariate forecasting models spanning statistical baselines to state‑of‑the‑art deep learning and foundation models. All datasets and code are publicly available on Kaggle (https://www.kaggle.com/datasets/aimltsf/epicastbench) and GitHub (https://github.com/aimltsf/EpiCastBench).
Authors:Fanpu Cao, Xin Zou, Xuming Hu, Hui Xiong
Abstract:
Multimodal large language models (MLLMs) have become a key interface for visual reasoning and grounded question answering, yet they remain vulnerable to visual hallucinations, where generated responses contradict image content or mention nonexistent objects. A central challenge is that hallucination is not always caused by a simple lack of visual attention: the model may still assign substantial attention mass to image tokens while internally drifting toward an incorrect answer. In this paper, we show that the high‑frequency structure of visual attention, measured by layer‑wise Laplacian energy, reveals both the layer where hallucinated preferences emerge and the layer where the ground‑truth answer transiently recovers. Building on this finding, we propose LaSCD (Laplacian‑Spectral Contrastive Decoding), a training‑free decoding strategy that selects informative layers via Laplacian energy and remaps next‑token logits in closed form. Experiments on hallucination and general multimodal benchmarks show that LaSCD consistently reduces hallucination while preserving general capabilities, highlighting its potential as a faithful decoding paradigm. The code is available at https://github.com/macovaseas/LaSCD.
Authors:Yunju Choi, Min Song
Abstract:
The discovery of novel methodologies for emerging problems is a continuing cycle in ML, often driven by the migration of techniques across domains. Building on this observation, we ask whether current LLM ideation systems benefit from targeted cross‑domain retrieval or simply from exposure to diverse mechanisms. We study this question through PaperGym, a three‑stage pipeline: (1) tool‑augmented seed extraction via read, grep, and bash over an isolated paper environment, (2) cross‑domain seed retrieval via paraphrasing across seven ML domains, and (3) method synthesis from retrieved seeds, each scored by rubric‑based judges. Tool‑augmented extraction improves specificity, and paraphrase‑based retrieval broadens domain coverage. In synthesis, cross‑domain retrieval receives more pairwise novelty wins than no‑retrieval and same‑domain baselines, but shows no significant difference from a random diverse‑seed control. These findings suggest LLM ideation systems benefit from diverse seed exposure, but do not yet reliably exploit the semantic reason particular seeds were retrieved. We release the seed library, rubric prompts, and run scripts at https://github.com/yunjoochoi/PaperGym
Authors:Alexander Shypula, Osbert Bastani, Edward Schwartz
Abstract:
Decompilers are useful tools used in reverse engineering to understand compiled source code. Reconstructing source code from compiled binaries is a challenging task, because high‑level syntax, identifiers, and custom data types are generally lost as the compiler translates human‑readable code to low‑level machine code. Deterministic decompilers are useful tools for binary analysis, but can struggle to infer idiomatic syntax and identifier names. Generative AI models are a natural fit for reconstructing high‑level syntax, identifiers, and types, but they can still suffer by hallucinating improper programming constructs and semantics. Instead of attempting to improve neural decompilers with more data and more training, we argue that compiler feedback can be used to dramatically improve the semantic correctness of neural decompiler outputs via search. Our system, Decaf (DECompilation with Automated Feedback), raises the neural decompilation rate from 26.0% on ExeBench to 83.9% on the Real O2 split without sacrificing similarity to the original source code. We also find our automatic feedback methodology is highly effective for improving weaker neural decompilation models.
Authors:Haoxuan Chen, Tianming Liang, Wei-Shi Zheng, Jian-Fang Hu
Abstract:
Reinforcement learning with verifiers (RLVR) has become a central paradigm for improving LLM reasoning, yet popular group‑based optimization algorithms like GRPO often suffer from exploration collapse, where the models prematurely converge on a narrow set of high‑scoring patterns, lacking the ability to explore new solutions. Recent efforts attempt to alleviate this by adding entropy regularization or diversity bonus. However, these approaches do not change the winner‑takes‑all nature, where rollouts still compete for individual advantage rather than cooperating for maximizing global diversity. In this work, we propose Group Cooperative Policy Optimization (GCPO), which shifts the training paradigm from rollout competition to team cooperation. Specifically, GCPO replaces independent rollout scoring with team‑level credit assignment: a rollout is rewarded by how much it contributes to the team's valid solution coverage, rather than its individual accuracy. This coverage is described as a determinant volume over reward‑weighted semantic embeddings, where only correct and non‑redundant rollouts contribute to this volume. During advantage estimation, GCPO redistributes the collective team reward to each single rollout according to its average marginal contribution to the team. This cooperative training paradigm routes optimization toward non‑redundant correct reasoning paths. Experiments across multiple reasoning benchmarks demonstrate that GCPO significantly improves both reasoning accuracy and solution diversity over existing approaches. Code will be released at \hrefhttps://github.com/bradybuddiemarch/gcpothis.
Authors:Joykirat Singh, Zaid Khan, Archiki Prasad, Justin Chih-Yao Chen, Akshay Nambi, Hyunji Lee, Elias Stengel-Eskin, Mohit Bansal
Abstract:
Large language models (LLMs) are increasingly deployed on long‑horizon tasks in partially observable environments, where they must act while inferring and tracking a complex environment state over many steps. This leads to two challenges: partial observability requires maintaining uncertainty over unobserved world attributes, and long interaction history causes context to grow without bound, diluting task‑relevant information. A principled solution to both challenges is a belief state: a posterior distribution over environment states given past observations and actions, which compactly encodes history for decision making regardless of episode length. In LLM agents, however, the open‑ended nature of text makes it unclear how to represent such a distribution. Therefore, we introduce Agent‑BRACE: Agent Belief state Representation via Abstraction and Confidence Estimation, a method that decouples an LLM agent into a belief state model and a policy model, jointly optimized via reinforcement learning. The belief state model produces a structured approximation of the belief distribution: a set of atomic natural language claims about the environment, each annotated with an ordinal verbalized certainty label ranging from certain to unknown. The policy model conditions on this compact, structured approximate belief rather than the full history, learning to select actions under explicit uncertainty. Across long‑horizon, partially observable embodied language environments, Agent‑BRACE achieves an average absolute improvement of +14.5% (Qwen2.5‑3B‑Instruct) and +5.3% (Qwen3‑4B‑Instruct), outperforming strong RL baselines while maintaining a near‑constant context window independent of episode length. Further analysis shows that the learned belief becomes increasingly calibrated over the course of an episode as evidence accumulates.
Authors:Ruhaan Chopra
Abstract:
The cosine similarity between a large language model's hidden activations before and after Supervised Fine‑Tuning (SFT) remains very high. This, at first glance, suggests that SFT leaves the model's activation geometry largely undisturbed. However, projecting both sets of activations through a Sparse Autoencoder (SAE) pretrained on the base model reveals that the underlying sparse latents diverge significantly. We introduce a novel investigative pipeline which utilizes these pretrained SAEs as a high‑resolution diagnostic tool to mechanistically investigate the drivers of this representational divergence. Through our analytical pipeline, we discover task‑specific and layer‑specific distributions of the precise semantic features that are systematically altered during supervised fine‑tuning. We additionally identify a layer‑wise update profile specific to safety alignment. All code, experimental scripts, and analysis files associated with this work are publicly available at: https://github.com/ruhzi/sae‑investigation.
Authors:Xueqi Cheng, Qiong Wu, Zhengyi Zhou, Xugui Zhou, Tyler Derr, Yushun Dong
Abstract:
Large Language Models (LLMs) are increasingly deployed in multi‑turn dialogue settings where preserving conversational context across turns is essential. A standard serving practice concatenates the full dialogue history at every turn, which reliably maintains coherence but incurs substantial cost in latency, memory, and API expenditure, especially when queries are routed to large proprietary models. Existing approaches often struggle to balance the trade‑off between response quality and efficiency. We propose a framework that exploits the early turns of a session to estimate a local response manifold and then adapt a smaller surrogate model to this local region for the remainder of the conversation. Concretely, we learn soft prompts that maximize semantic divergence between the large and surrogate small language models' responses to surface least‑aligned local directions, stabilize training with anti‑degeneration control, and distill the mined cases into localized LoRA fine‑tuning so the surrogate runs without prompts at inference. A simple gate enables a one‑time switch with rollback on drift. We further provide a theoretical analysis for key components in SOMA. Extensive experiments show the effectiveness of SOMA. The source code is provided at: https://github.com/LabRAI/SOMA.
Authors:Xueqi Cheng, Yushun Dong
Abstract:
Multimodal large language models (MLLMs) have heterogeneous strengths across OCR, chart understanding, spatial reasoning, visual question answering, cost, and latency. Effective MLLM routing therefore requires more than estimating query difficulty: a router must match the multimodal requirements of the current image‑question input with the capabilities of each candidate model. We propose LatentRouter, a router that formulates MLLM routing as counterfactual multimodal utility prediction. Given an image‑question query, LatentRouter extracts learned multimodal routing capsules, represents each candidate MLLM with a model capability token, and performs latent communication between these states to estimate how each model would perform if selected. A distributional outcome head predicts model‑specific counterfactual quality, while a bounded capsule correction refines close decisions without allowing residual signals to dominate the prediction. The resulting utility‑based policy supports performance‑oriented and performance‑cost routing, and handles changing candidate pools through shared per‑model scoring with availability masking. Experiments on MMR‑Bench and VL‑RouterBench show that LatentRouter outperforms fixed‑model, feature‑level, and learned‑router baselines. Additional analyses show that the gains are strongest on multimodal task groups where model choice depends on visual, layout‑sensitive, or reasoning‑oriented requirements, and that latent communication is the main contributor to the improvement. The code is available at: https://github.com/LabRAI/LatentRouter.
Authors:Xueqi Cheng, Xugui Zhou, Tyler Derr, Yushun Dong
Abstract:
Capability distillation applies knowledge distillation to selected model capabilities, aiming to compress a large language model (LLM) into a smaller one while preserving the abilities needed for a downstream task. However, most existing methods treat capabilities as independent training targets and overlook how improving one capability can reshape the student's broader capability profile, especially when multiple abilities jointly determine task success. We study capability distillation under a fixed token budget and identify two consistent patterns: distillation induces systematic, budget‑dependent cross‑capability transfer, and additional budget often brings limited task‑relevant gains while sometimes degrading other useful abilities. Building on these insights, we propose ReAD, a Reinforcement‑guided cApability Distillation framework that explicitly accounts for capability interdependence. ReAD first infers task‑essential capabilities, then generates capability‑targeted supervision on the fly, and finally uses an uncertainty‑aware contextual bandit to adaptively allocate the distillation budget based on expected utility gains. Extensive experiments show that ReAD improves downstream utility under the same token budget while reducing harmful spillover and wasted distillation effort compared to strong baselines. Our code is publicly available at https://github.com/LabRAI/ReAD.
Authors:Tousif Islam, Digvijay Wadekar, Tejaswi Venumadhav, Matias Zaldarriaga, Ajit Kumar Mehta, Javier Roulet, Barak Zackay
Abstract:
Fast surrogate models for expensive simulations are now essential across the sciences, yet they typically operate as black boxes. We present \textttGWAgent, a large language model (LLM)‑based workflow that constructs interpretable analytic surrogates directly from simulation data. Surrogate modeling is well suited to agentic workflows because candidate models can be quantitatively validated against ground‑truth simulations at each iteration. As a demonstration, we build a surrogate for gravitational waveforms from eccentric binary black hole mergers. We show that providing the agent with a physics‑informed domain ansatz substantially improves output model accuracy. The resulting analytic surrogate attains a median Advanced LIGO mismatch of 6.9×10^‑4 together with an ~ 8.4× speedup in waveform evaluation, surpassing both symbolic regression and conventional machine learning baselines. Beyond producing an accurate model, the workflow identifies compact physical structure from the learned representation. As an astrophysical application, we use \textttGWAgent to analyze the eccentricity of GW200129 and infer e_20\mathrmHz=0.099^+0.063_‑0.044. These results show that validation‑constrained agentic workflows can produce accurate, fast, and interpretable surrogates for scientific simulations and inference.
Authors:Bulat Maksudov, Vladislav Kurenkov, Kathleen M. Curran, Alessandra Mileo
Abstract:
Existing medical‑agent benchmarks deliver imaging as pre‑selected samples, never as an environment the agent must navigate. We introduce ABRA, a radiology‑agent benchmark in which the agent operates an OHIF viewer and an Orthanc DICOM server through twenty‑one function‑calling tools that span slice navigation, windowing, series selection, pixel‑coordinate annotation, and structured reporting. ABRA contains 655 programmatically generated tasks across three difficulty tiers and eight types (viewer control, metadata QA, vision probe, annotation, longitudinal comparison, BI‑RADS reporting, and oracle variants of annotation and BI‑RADS reporting), drawn from LIDC‑IDRI, Duke Breast Cancer MRI, and NLST New‑Lesion LongCT. Each episode is scored along Planning, Execution, and Outcome (Bluethgen et al., 2025) by task‑type‑specific automatic scorers. Ten current models, five closed‑weight and five open‑weight, reach at least 89% Execution on real annotation but only 0‑25% Outcome; on the paired oracle variant where a simulated detector supplies the finding, Outcome on the same task reaches 69‑100% across the models evaluated, localising the bottleneck to perception rather than tool orchestration. Code, task generators, and scorers are released at https://github.com/Luab/ABRA
Authors:Jung Min Kang
Abstract:
Benchmark evaluation across AI and safety‑critical domains overwhelmingly relies on simple averaging. We demonstrate that this practice produces substantially misleading rankings when two conditions co‑occur: (1) the evaluation matrix is sparse and (2) items vary substantially in difficulty. Through controlled simulation experiments across four domains ‑‑ NLP (GLUE), clinical drug trials, autonomous vehicle safety, and cybersecurity ‑‑ we show that Spearman rank correlation ρ between simple‑average rankings and ground‑truth rankings degrades from ρ= 1.000 at 100% coverage to ρ= 0.809 at 67% coverage with high difficulty heterogeneity (mean over 20 seeds). A standard two‑parameter logistic (2PL) Item Response Theory (IRT) model maintains ρ\geq 0.996 across all conditions. A 150‑condition grid sweep over sparsity S \in [0, 0.70] and difficulty gap D \in [0.5, 5.0] confirms that ranking error forms a failure surface with a strong S × D interaction (γ_3 = +0.20, t = 13.05), while IRT maintains ρ\geq 0.993 throughout. We discuss implications for Physical AI benchmarking, where evaluation matrices are often incomplete and difficulty gaps are extreme.
Authors:Yaolun Zhang, Tianyi Xu, Shengyu Dai, Zhenwen Shao, Qingyun Wu, Huazheng Wang
Abstract:
We argue that multi‑agent test‑time evolution is not single‑agent evolution replicated N times. A single‑agent learner can only evolve its own context and memory. A multi‑agent system additionally evolves who collaborates, how they collaborate, and how knowledge flows across the population. These components have no single‑agent counterpart and can produce phenomena such as emergent specialization. Yet prior test‑time methods either confine experiences to individual agents, forfeiting cross‑agent learning, or broadcast symmetrically to all agents, erasing the specialization that makes collaboration valuable. We present EVOCHAMBER, a training‑free framework that instantiates test‑time evolution at three levels over a coevolving agent pool. At its core is CODREAM (Collaborative Dreaming), a post‑task protocol triggered on team failure or disagreement, in which agents collaboratively reflect, distill insights, and route them asymmetrically from strong to weak agents on the failed niche, preserving specialization while filling knowledge gaps. Team‑level operators assemble niche‑conditioned teams and select collaboration structures online. Population‑level lifecycle operators fork, merge, prune, and seed agents under performance pressure. On three heterogeneous task streams with Qwen3‑8B, EVOCHAMBER reaches 63.9% on competition math, 75.7% on code, and 87.1% on multi‑domain reasoning, outperforming the best baseline by 32% relative on math and confirming asymmetric cross‑agent transfer as the primary driver in ablation. Starting from several identically initialized agents, four to five stable niche specialists spontaneously emerge, a structural signature of multi‑agent evolution that no single‑agent learner can express. See our code at: https://github.com/Mercury7353/EvoChamber
Authors:Jonas Petersen, Gian-Alessandro Lombardi, Riccardo Maggioni, Camilla Mazzoleni, Federico Martelli, Philipp Petersen
Abstract:
Critical events in multivariate time series, from turbine failures to cardiac arrhythmias, demand accurate prediction, yet labeled data is scarce because such events are rare and costly to annotate. We introduce HEPA (Horizon‑conditioned Event Predictive Architecture), built on two key principles. First, a causal Transformer encoder is pretrained via a Joint‑Embedding Predictive Architecture (JEPA): a horizon‑conditioned predictor learns to forecast future representations rather than future values, forcing the encoder to capture predictable temporal dynamics from unlabeled data alone. Second, we freeze the encoder and finetune only the predictor toward the target event, producing a monotonic survival cumulative distribution function (CDF) over horizons. With fixed architecture and optimiser hyperparameters across all benchmarks, HEPA handles water contamination, cyberattack detection, volatility regimes, and eight further event types across 11 domains, exceeding leading time‑series architectures including PatchTST, iTransformer, MAE, and Chronos‑2 on at least 10 of 14 benchmarks, with an order of magnitude fewer tuned parameters and, on lifecycle datasets, an order of magnitude less labeled data.
Authors:Nengneng Yu, Sixian Xiong, Yibo Zhao, Wei Wang, Zaoxing Liu
Abstract:
Today's inference‑time workloads increasingly depend on timely access to a model's internal states. We present DMI‑Lib, a high‑speed deep model inspector that treats internal observability as a first‑class systems primitive, decoupling it from the inference hot path via an asynchronous observability substrate built from Ring^2, a GPU‑CPU memory abstraction for capturing and staging tensors, and a policy‑controlled host backend that exports them. DMI‑Lib enables the placement of observation points across a rich space of internal signals and diverse inference backends while preserving serving optimizations and adhering to tight GPU memory budgets. Our experiments demonstrate that DMI‑Lib incurs only 0.4%‑‑6.8% overhead in offline batch inference and an average of 6% in moderate online serving, reducing latency overhead by 2x‑15x compared to existing baselines with similar observability features. DMI‑Lib is open‑sourced at https://github.com/ProjectDMX/DMI.
Authors:Hongwei Yao, Yiming Liu, Yiling He, Bingrun Yang
Abstract:
Agentic language‑model systems increasingly rely on mutable execution contexts, including files, memory, tools, skills, and auxiliary artifacts, creating security risks beyond explicit user prompts. This paper presents DeepTrap, an automated framework for discovering contextual vulnerabilities in OpenClaw. DeepTrap formulates adversarial context manipulation as a black‑box trajectory‑level optimization problem that balances risk realization, benign‑task preservation, and stealth. It combines risk‑conditioned evaluation, multi‑objective trajectory scoring, reward‑guided beam search, and reflection‑based deep probing to identify high‑value compromised contexts. We construct a 42‑case benchmark spanning six vulnerability classes and seven operational scenarios, and evaluate nine target models using attack and utility grading scores. Results show that contextual compromise can induce substantial unsafe behavior while preserving user‑facing task completion, demonstrating that final‑response evaluation is insufficient. The findings highlight the need for execution‑centric security evaluation of agentic AI systems. Our code is released at: https://github.com/ZJUICSR/DeepTrap
Authors:Astha Mehta, Niruthiha Selvanayagam, Cedric Lam, Hengxu Li, Phuc-Nguyen Nguyen, Raymond Lee, Olivia McGoffin, My, Luong, Arthur Collé, Jamie Johnson, David Williams-King, Linh Le
Abstract:
An attacker can split a malicious goal into sub‑prompts that each look benign on their own and only become harmful in combination. Existing LLM safety benchmarks evaluate prompts one at a time, or across turns of a single chat, and so do not look for a malicious signal spread across separate sessions with no shared context. We build FragBench, a benchmark drawn from 24 real‑world cyber‑incident campaigns, which keeps the full attack trail: the multi‑fragment kill chain, the per‑fragment safety‑judge verdicts, sandboxed execution traces, and a matched set of benign cover sessions. FragBench splits this trail into two paired tasks: an adversarial rewriter that hardens fragments against a single‑turn safety judge (FragBench Attack), and a graph‑based user‑level detector trained on the resulting interactions (FragBench Defense). The single‑turn judge is near chance on the released corpus by construction, but four GNN variants and three classical‑ML baselines all recover the cross‑session feature, reaching aggregate event‑level F1 = 0.88‑0.96. Defending against fragmented LLM misuse therefore requires modeling the cross‑session interaction graph, rather than isolated prompts. Our generator, rewriter, sandbox harness, and detector are released at https://github.com/LidaSafety/fragbench.
Authors:Wenxin Tang, Wenbin Li, Junliang Liu, Jingyu Xiao, Xi Xiao, Mingzhe Liu, Jinlong Yang, Xuan Liu, Yuehe Ma, Wang Luo, Qing Li, Lei Wang, Peng Xiangli
Abstract:
Software vulnerability detection plays a critical role in ensuring system security, where real‑world auditing requires not only determining whether a function is vulnerable but also pinpointing the specific lines responsible. However, existing approaches either rely on a single information source ‑‑ sequential, structural, or semantic ‑‑ failing to jointly exploit the complementary strengths across modalities, or treat statement‑level localization merely as a byproduct of function‑level detection without explicit line‑level supervision. To address these limitations, we propose DCVD (Dual‑Channel Cross‑Modal Vulnerability Detection), a unified framework that performs joint function‑level detection and statement‑level localization. DCVD extracts control‑dependency and semantic features through two parallel branches and integrates them via contrastive alignment coupled with bidirectional cross‑attention, effectively bridging the cross‑modal representation gap. It further introduces explicit supervision signals at both the function and statement levels, enabling collaborative optimization across the two granularities. Extensive experiments on a large‑scale real‑world vulnerability benchmark demonstrate that DCVD consistently outperforms state‑of‑the‑art methods on both function‑level detection and statement‑level localization. Our code is available at https://github.com/vinsontang1/DCVD.
Authors:Yadang Alexis Rouzoumka, Jean Pinsolle, Eugénie Terreaux, Christèle Morisseau, Jean-Philippe Ovarlez, Chengfang Ren
Abstract:
Fair comparison between diffusion‑based OOD detectors is challenging, as conclusions can vary with backbone choice, corruption parameterization, and test‑time budget. We address this issue through a Mutualized Backbone‑Equated (MBE) protocol that aligns canonical corruption levels and logical test‑time cost across diffusion backbones. Within this setting, we introduce Canonical Feature Snapshots (CFS), a family of detectors that probes a frozen diffusion backbone using only a tiny number of native internal activations at canonical low‑noise levels. On a controlled CIFAR‑scale benchmark, the strongest one‑forward CFS variant is CFS(1x2), while an even smaller decoder‑only variant remains highly competitive. This shows that much of the relative‑OOD signal exposed by frozen diffusion backbones is concentrated in a small number of sparse internal states, rather than requiring full denoising trajectories or high‑capacity downstream heads. We further provide a local diagnostic theory explaining these observations through conditional encoder‑decoder complementarity, diagonal‑score separation, and low‑noise corruption stability. The official implementation is available at https://github.com/RouzAY/cfs‑diffusion‑ood/.
Authors:Taekhyun Park, Yongjae Lee, Dohee Kim, Hyerim Bae
Abstract:
Looped computation shows promise in improving the reasoning‑oriented performance of LLMs by scaling test‑time compute. However, existing approaches typically require either training recurrent models from scratch or applying disruptive retrofits, which involve substantial computational costs and may compromise pretrained capabilities. To address these limitations, we introduce Looped Depth Up‑Scaling (LoopUS), a post‑training framework that converts a standard pretrained LLM into a looped architecture. As a key technical contribution, LoopUS recasts the pretrained LLM into an encoder, a looped reasoning block, and a decoder. It operationalizes this latent‑refinement architecture through four core components: (1) block decomposition, guided by staged representation dynamics; (2) an input‑dependent selective gate to mitigate hidden‑state drift; (3) random deep supervision for memory‑efficient learning over long recursive horizons; and (4) a confidence head for adaptive early exiting. Collectively, these mechanisms transform a standard non‑looped model into a looped form while stabilizing it against both computational bottlenecks and representation collapse. Through stable latent looping, LoopUS improves reasoning‑oriented performance without extending the generated traces or requiring recurrent training from scratch. For more details, see https://thrillcrazyer.github.io/LoopUS
Authors:Yonatan Sverdlov, Benjamin Friedman, Snir Hordan, Nadav Dym
Abstract:
While invariant architectures are standard for processing symmetric data, there is growing interest in achieving invariance by applying group averaging or canonization to non‑invariant backbones. However, the theoretical generalization properties of these alternative strategies remain poorly understood. We introduce a theoretical framework to analyze the generalization error of these methods by bounding their covering numbers. We establish a rigorous generalization hierarchy: the error bounds of canonized models are at best equal to the error bounds of structurally invariant and group‑averaged models, and at worst equal to the bounds of non‑invariant baselines. Furthermore, we show that there exist optimal canonizations which attain the optimal error bounds, and poor canonizations which attain the non‑invariant error bounds, and that this depends on the regularity of the canonization. Finally, applying this framework to permutation groups in point cloud processing, we rigorously prove that the covering number of lexicographical sorting grows exponentially with point cloud dimension, whereas Hilbert curve canonization guarantees polynomial growth. This provides the first formal theoretical justification for the empirical success of Hilbert curve serialization in state‑of‑the‑art point cloud architectures. We conclude with experiments that support our theoretical claims. Code is available at https://github.com/yonatansverdlov/Canonization
Authors:Yikun Li, Jinfeng Jiang, Ting Zhang, Chengran Yang, Chenxing Zhong, Yin Yide, Leow Wen Bin, Eng Lieh Ouh, Lwin Khin Shar, David Lo
Abstract:
Evaluating whether large language models (LLMs) can recover execution‑relevant program structure, rather than only produce code that passes tests, remains an open problem. Existing code benchmarks emphasize test‑passing outputs, from standalone programming tasks (HumanEval, MBPP, LiveCodeBench) to repository repair (SWE‑Bench); this is useful, but offers limited diagnostic signal about which program semantics a model can recover from source. We introduce TraceEval, to our knowledge the first execution‑verified, multi‑language benchmark for code semantic reasoning: recovering a program's runtime call structure from source code. Unlike prior call‑graph benchmarks that rely on static‑tool output or hand‑annotated ground truth, every positive edge in TraceEval is mechanically witnessed by validation execution, eliminating annotator disagreement and label noise for observed behavior. TraceEval consists of (i) 10,583 real‑world programs (2,129 test, 8,454 train) extracted from 1,600+ open‑source repositories across Python, JavaScript, and Java via an LLM‑assisted harness‑generation pipeline with tracer validation; and (ii) a reproducible pipeline that converts any open‑source repository into new verified benchmark instances. We evaluate 10 LLMs at zero‑shot on the held‑out test split. The strongest model, Claude‑Opus‑4.6, reaches an average F1 of 72.9% across the three languages. To demonstrate the train split's utility as a supervision substrate, we fine‑tune the Qwen2.5‑Coder family on it: lifts of up to +55.6 F1 bring tuned Qwen2.5‑Coder‑32B to 71.2%, within 1.7 F1 of zero‑shot Claude‑Opus‑4.6. We release the benchmark, pipeline, baselines, and a datasheet at https://github.com/yikun‑li/TraceEva
Authors:Yutszyuk Wong, Wentai Wu, Yuen-Ying Yeung, Weiwei Lin
Abstract:
Log anomaly detection is a critical task for system operations and security assurance. However, in networked systems at scale, log data are generated at massive scale while instance‑level annotations are prohibitively expensive, posing great difficulties to fine‑grained anomaly localization. To address this challenge, we propose LogMILP (Log anomaly localization based on Multi‑Instance Learning enhanced by prototypes and Perturbation), a weakly supervised framework that enables both bag‑level anomaly detection and instance‑level anomaly localization using only bag‑level labels. Our method guides the model to pinpoint the critical log entries using prototype‑guided structural modeling with counterfactual perturbation consistency regularization, thereby improving localization reliability and interpretability under coarse‑grained supervision. Experimental results on three public datasets demonstrate that LogMILP achieves competitive detection performance while yielding significantly more reliable instance‑level localization. Our code is open‑sourced at https://github.com/YUK1207/LogMILP.
Authors:Zhenxin Ai, Haiyun He
Abstract:
Watermarking for large language models (LLMs) is a promising approach for detecting LLM‑generated text and enabling responsible deployment. However, existing watermarking methods are often vulnerable to semantic‑invariant attacks, such as paraphrasing. We propose PASA, a principled, robust, and distortion‑free watermarking algorithm that embeds and detects a watermark at the semantic level. PASA operates on semantic clusters in a latent embedding space and constructs a distributional dependency between token and auxiliary sequences via shared randomness synchronized by a secret key and semantic history. This design is grounded in our theoretical framework that characterizes a jointly optimal embedding‑detection pair, achieving the fundamental trade‑offs among detection accuracy, robustness, and distortion. Evaluations across multiple LLMs and semantic‑invariant attacks demonstrate that PASA remains robust even under strong paraphrasing attacks while preserving high text quality, outperforming standard vocabulary‑space baselines. Ablation studies further validate the effectiveness of our hyperparameter choices. Webpage: https://ai‑kunkun.github.io/PASA_page/.
Authors:Hangzhan Jin, Tianwei Ni, Lu Li, Pierre-Luc Bacon, Mohammad Hamdaqa, Doina Precup
Abstract:
Supervised fine‑tuning (SFT) improves in‑domain performance but can degrade out‑of‑domain (OOD) generalization. Prior work suggests that this degradation is related to changes in dominant singular subspaces of pretrained weight matrices. However, directly identifying loss‑sensitive directions with Hessian or Fisher information is computationally expensive at LLM scale. In this work, we propose preserving projected rotations in pretrained singular subspaces as an efficient proxy for Fisher‑sensitive directions, which we call Rotation‑Preserving Supervised Fine‑Tuning (RPSFT). RPSFT penalizes changes in the projected top‑k singular‑vector block of each pretrained weight matrix, limiting unnecessary rotation while preserving task adaptation. Across model families and sizes trained on math reasoning data, RPSFT improves the in‑domain/OOD trade‑off over standard SFT and strong SFT baselines, better preserves pretrained representations, and provides stronger initializations for downstream RL fine‑tuning. Code is available at \hrefhttps://github.com/jinhangzhan/RPSFT.githttps://github.com/jinhangzhan/RPSFT.
Authors:Yixuan Yang, Mehak Arora, Ryan Zhang, Baraa Abed, Junseob Kim, Tilendra Choudhary, Md Hassanuzzaman, Kevin Zhu, Ayman Ali, Chengkun Yang, Alasdair Edward Gent, Victor Moas, Rishikesan Kamaleswaran
Abstract:
We present Clin‑JEPA, a multi‑phase co‑training framework for joint‑embedding predictive (JEPA) pretraining on EHR patient trajectories. JEPA architectures have enabled latent‑space planning in robotics and high‑quality representation learning in vision, but extending the paradigm to EHR data ‑‑ to obtain a single backbone that simultaneously forecasts patient trajectories and serves diverse downstream risk‑prediction tasks without per‑task fine‑tuning ‑‑ remains an open challenge. Existing JEPA frameworks either discard the predictor after pretraining (I‑JEPA, V‑JEPA) or train it on a frozen pretrained encoder (V‑JEPA 2‑AC), leaving the encoder unaware of the rollout signal that the retained predictor must use at inference; co‑training the encoder and predictor under a shared JEPA prediction objective would supply this grounding, but naïve co‑training is unstable, with representation collapse and online/target drift causing autoregressive rollout to diverge. Clin‑JEPA's five‑phase pretraining curriculum ‑‑ predictor warmup, joint refinement, EMA target alignment, hard sync, and predictor finalization ‑‑ addresses each failure mode by phase, stably co‑training a Qwen3‑8B‑based encoder and a 92M‑parameter latent trajectory predictor. On MIMIC‑IV ICU data, three independent evaluations support the framework: (1) latent \ell_1 rollout drift uniquely converges (‑15.7%) over 48‑hour horizons while baselines and ablations diverge (+3% to +4951%); (2) the encoder learns a clinically discriminative latent geometry (deteriorating‑patient cohorts displace 4.83× further than stable patients in latent space, vs \leq2.62× for baseline encoders); (3) a single backbone outperforms strong tabular and sequence baselines on multi‑task downstream evaluation. Clin‑JEPA achieves mean AUROC 0.851 on ICareFM EEP and 0.883 on 8 binary risk tasks (+0.038 and +0.041 vs baseline average).
Authors:Jihoo Jung, Chaeyoung Jung, Ji-Hoon Kim, Joon Son Chung
Abstract:
Audio‑visual large language models (AVLLMs) have recently emerged as a powerful architecture capable of jointly reasoning over audio, visual, and textual modalities. In AVLLMs, the bidirectional interaction between audio and video modalities introduces intricate processing dynamics, necessitating a deeper understanding of their internal mechanisms. However, unlike extensively studied text‑only or large vision language models, the internal workings of AVLLMs remain largely unexplored. In this paper, we focus on cross‑modal information flow between audio and visual modalities in AVLLMs, investigating where information derived from one modality is encoded within the token representations of the other modality. Through an analysis of multiple recent AVLLMs, we uncover two common findings. First, AVLLMs primarily encode integrated audio‑visual information in sink tokens. Second, sink tokens do not uniformly hold cross‑modal information. Instead, a distinct subset of sink tokens, which we term cross‑modal sink tokens, specializes in storing such information. Based on these findings, we further propose a simple training‑free hallucination mitigation method by encouraging reliance on integrated cross‑modal information within cross‑modal sink tokens. Our code is available at https://github.com/kaistmm/crossmodal‑hub.
Authors:Gabriel Garcia
Abstract:
Corruption studies, the standard tool for evaluating chain‑of‑thought (CoT) faithfulness, infer which steps are ``computationally important'' from accuracy loss when steps are corrupted. We show that when benchmark chains end with an explicit terminal answer line, as in GSM8K and MATH, these tests largely measure \emphanswer placement rather than where intermediate computation is carried out. Using matched GSM8K examples, removing only the final answer statement while preserving all reasoning collapses suffix sensitivity by about 19× for Qwen~2.5‑3B (N=300, p=0.022). Conflicting‑answer prompts, which contain correct reasoning but a wrong explicit final answer, drive accuracy to zero or near‑zero at 7B across five open‑weight model families; wrong‑answer following is strong at 3B‑‑7B and attenuates sharply at larger scales. Replications on MATH, within‑stable comparisons at 7B, and suffix‑free chains show the same pattern in different guises: corruption sensitivity tracks the location of explicit answer text, not a fixed computational depth in the reasoning. Generation‑time probes indicate that final answers are rarely early‑determined during generation (<5% early commitment), yet consumption‑time behavior systematically follows explicit answer text. The confound is therefore largely a readout effect when the chain is consumed. We propose a three‑prerequisite protocol (question‑only control, format characterization, and an all‑position sweep) as a practical minimum for future corruption‑based faithfulness studies.
Authors:Minqing Huang, Yujiao Xiang, Zihan Liang, Jiajie Huang, Jingqi Wang, Zhi Xu, Feiyang Tan, Hangning Zhou, Mu Yang, Gong Che
Abstract:
Vision‑Language‑Action (VLA) models have emerged as a promising paradigm for end‑to‑end autonomous driving. However, existing reasoning mechanisms still struggle to provide planning‑oriented intermediate representations: textual Chain‑of‑Thought (CoT) fails to preserve continuous spatiotemporal structure, while latent world reasoning remains difficult to use as a direct condition for action generation. In this paper, we propose CoWorld‑VLA, a multi‑expert world reasoning framework for autonomous driving, where world representations serve as explicit conditions to guide action planning. CoWorld‑VLA extracts complementary world information through multi‑source supervision and encodes it into expert tokens within the VLA, thereby providing planner‑accessible conditioning signals. Specifically, we construct four types of tokens: semantic interaction, geometric structure, dynamic evolution, and ego trajectory tokens, which respectively model interaction intent, spatial structure, future temporal dynamics, and behavioral goals. During action generation, CoWorld‑VLA employs a diffusion‑based hierarchical multi‑expert fusion planner, which is coupled with scene context throughout the joint denoising process to generate continuous ego trajectories. Experiments show that CoWorld‑VLA achieves competitive results in both future scene generation and planning on the NAVSIM v1 benchmark, demonstrating strong performance in collision avoidance and trajectory accuracy. Ablation studies further validate the complementarity of expert tokens and their effectiveness as planning conditions for action generation. Code will be available at https://github.com/AFARI‑Research/CoWorld‑VLA.
Authors:Danni Xu, Shaojing Fan, Harry Cheng, Mohan Kankanhalli
Abstract:
Multimodal misinformation increasingly leverages visual persuasion, where repurposed or manipulated images strengthen misleading text. We introduce RW‑Post, a post‑aligned text‑‑image benchmark for real‑world multimodal fact‑checking with \emphauditable annotations: each instance links the original social‑media post with reasoning traces and explicitly linked evidence items derived from human fact‑check articles via an LLM‑assisted extraction‑and‑auditing pipeline. RW‑Post supports controlled evaluation across closed‑book, evidence‑bounded, and open‑web regimes, enabling systematic diagnosis of visual grounding and evidence utilization. We provide AgentFact as a reference verification baseline and benchmark strong open‑source LVLMs under unified protocols. Experiments show substantial headroom: current models struggle with faithful evidence grounding, while evidence‑bounded evaluation improves both accuracy and faithfulness. Code and dataset will be released at https://github.com/xudanni0927/AgentFact.
Authors:George Wu, Nan Jing, Qing Yi, Chuan Hao, Ming Yang, Feng Chang, Yuan Wei, Jian Yang, Ran Tao, Bryan Dai
Abstract:
Test‑time scaling has become an effective paradigm for improving the reasoning ability of large language models by allocating additional computation during inference. Recent structured approaches have further advanced this paradigm by organizing inference across multiple trajectories, refinement rounds, and verification‑based feedback. However, existing structured test‑time scaling methods either weakly coordinate parallel reasoning trajectories or rely on noisy historical information without explicitly deciding what should be retained and reused, limiting their ability to balance exploration and exploitation. In this work, we propose TMAS, a framework for scaling test‑time compute via multi‑agent synergy. TMAS organizes inference as a collaborative process among specialized agents, enabling structured information flow across agents, trajectories, and refinement iterations. To support effective cross‑trajectory collaboration, TMAS introduces hierarchical memories: the experience bank reuses low‑level reliable intermediate conclusions and local feedback, while the guideline bank records previously explored high‑level strategies to steer subsequent rollouts away from redundant reasoning patterns. Furthermore, we design a hybrid reward reinforcement learning scheme tailored to TMAS, which jointly preserves basic reasoning capability, enhances experience utilization, and encourages exploration beyond previously attempted solution strategies. Extensive experiments on challenging reasoning benchmarks show that TMAS achieves stronger iterative scaling than existing test‑time scaling baselines, with hybrid reward training further improving scaling effectiveness and stability across iterations. Code and data are available at https://github.com/IQuestLab/tmas.
Authors:Zonglin Yang, Xingtong Liu, Xinyan Xu
Abstract:
AI scientist systems are increasingly deployed for autonomous research, yet their academic integrity has never been systematically evaluated. We introduce SCIINTEGRITY‑BENCH, the first benchmark designed around a dilemmatic evaluation paradigm: each of its 33 scenarios across 11 trap categories is constructed so that honest acknowledgment of failure is the only correct response, while task completion requires misconduct. Across 231 evaluation runs spanning 7 state‑of‑the‑art LLMs, the overall integrity problem rate reaches 34.2%, and no model achieves zero failures. Most strikingly, across missing‑data scenarios, all seven models generate synthetic data rather than acknowledging infeasibility, differing only in whether they disclose the substitution. A further prompt ablation study separates two drivers: removing explicit completion pressure sharply reduces undisclosed fabrication from 20.6% to 3.2%, while the underlying synthesis rate remains unchanged, revealing an intrinsic completion bias that persists independent of prompt‑level instructions. These findings point to the absence of honest refusal as a trained disposition as the primary driver of observed failures. We release SCIINTEGRITY‑BENCH at https://github.com/liuxingtong/Sci‑Integrity‑Bench.
Authors:Daniel Goldstein, Eugene Cheah
Abstract:
We present Key‑Value Means ("KVM"), a novel block‑recurrence for attention that can accommodate either fixed‑size or growing state. Equipping a strong transformer baseline with fixed‑size KVM attention layers yields a strong O(N) chunked RNN, while adding only an insignificant number of new parameters. We train a transformer with a growable KVM cache and show it performs competitively on long‑context tests with only subquadratic prefill time and sublinear state growth. KVM is implementable with standard operations and without custom kernels, and supports chunk‑wise parallelizable training and prefill. It provides many of the benefits of both traditional transformers (expandable context memory, chunk‑wise parallelizable training and prefill) and linear RNNs in a single unified package. It can be used on every layer, saving KV‑cache memory, and allowing a continuous range of choices of prefill time complexity between O(N) and O(N^2). It can also be implemented in a hybrid solution in tandem with LRNN layers in place of traditional attention, to supplement the LRNN with improved sublinear memory growth context length usage and long context decoding. We release our code at https://github.com/recursal/KVM‑paper and trained models at https://huggingface.co/collections/recursal/key‑value‑means under the Apache 2.0 license.
Authors:Yuyang Dai, Zheng Chen, Jathurshan Pradeepkumar, Yasuko Matsubara, Jimeng Sun, Yasushi Sakurai, Yushun Dong
Abstract:
Epilepsy diagnosis and treatment require evidence‑intensive reasoning across heterogeneous clinical knowledge, including biosignal patterns, genetic mechanisms, pharmacogenomics, treatment strategies, and patient outcomes. In this work, we present \textscEpiGraph, a large‑scale epilepsy knowledge graph and benchmark for evaluating knowledge‑augmented clinical reasoning. \textscEpiGraph integrates 48,166 peer‑reviewed papers and seven clinical resources into a heterogeneous graph containing 24,324 entities and 32,009 evidence‑grounded triplets across five clinical layers. Built upon this graph, \textscEpiBench defines five clinically motivated tasks spanning clinical decision‑making, EEG report generation, pharmacogenomic precision medicine, treatment recommendation, and deep research planning. We evaluate six LLMs under both standard and Graph‑RAG settings. Results show that integrating \textscEpiGraph consistently improves performance across all tasks, with the largest gains observed in pharmacogenomic reasoning (+30‑‑41%). Our findings demonstrate that structured epilepsy knowledge substantially enhances evidence‑grounded clinical reasoning and provides a practical benchmark framework for evaluating knowledge‑augmented LLMs in real‑world neurological settings. Our code is available at: https://github.com/LabRAI/EEG‑KG.
Authors:Tianyu Zheng, Hong Wu, Jiaji Zhong
Abstract:
Large language models (LLMs) often suffer from hallucinations due to error accumulation in autoregressive decoding, where suboptimal early token choices misguide subsequent generation. Although multi‑path decoding can improve robustness by exploring alternative trajectories, existing methods lack principled strategies for determining when to branch and how to regulate inter‑path interactions. We propose Adaptive Path‑Contrastive Decoding (APCD), a multi‑path decoding framework that improves output reliability through adaptive exploration and controlled path interaction. APCD consists of two components: (1) Entropy‑Driven Path Expansion, which delays branching until predictive uncertainty ‑ measured by Shannon entropy over top candidate tokens ‑ indicates multiple plausible continuations; and (2) Divergence‑Aware Path Contrast, which encourages diverse reasoning trajectories while dynamically attenuating inter‑path influence as prediction distributions diverge. Experiments on eight benchmarks demonstrate improved factual accuracy while maintaining decoding efficiency. Our code is available at https://github.com/zty‑king/APCD.
Authors:Wenxin Tang, Xiang Zhang, Junliang Liu, Jingyu Xiao, Xi Xiao, Jinlong Yang, Yuehe Ma, Zhenyu Liu, Zhengheng Li, Zicheng Wang, Wang Luo, Qing Li, Lei Wang, Peng Xiangli
Abstract:
Automated vulnerability detection is a fundamental task in software security, yet existing learning‑based methods still struggle to capture the structural dependencies, domain‑specific vulnerability knowledge, and complex program semantics required for accurate detection. Recent Large Language Models (LLMs) have shown strong code understanding ability, but directly prompting them with raw source code often leads to missed vulnerabilities or false alarms, especially when vulnerable and benign functions differ only in subtle semantic details. To address this, we propose VulTriage, a triple‑path context augmentation framework for LLM‑based vulnerability detection. VulTriage enhances the LLM input through three complementary paths: a Control Path that extracts and verbalizes AST, CFG, and DFG information to expose control and data dependencies; a Knowledge Path that retrieves relevant CWE‑derived vulnerability patterns and examples through hybrid dense‑‑sparse retrieval; and a Semantic Path that summarizes the functional behavior of the code before the final judgment. These contexts are integrated into a unified instruction to guide the LLM toward more reliable vulnerability reasoning. Experiments on the PrimeVul pair test set show that VulTriage achieves state‑of‑the‑art performance, outperforming existing deep learning and LLM‑based baselines on key pair‑wise and classification metrics. Further ablation studies verify the effectiveness of each path, and additional experiments on the Kotlin dataset demonstrate the generalization ability of VulTriage under low‑resource and class‑imbalanced settings. Our code is available at https://github.com/vinsontang1/VulTriage
Authors:Shogo Noguchi
Abstract:
Recent conditional image generation methods can improve controllability by generating images that are faithful to conditions such as sketches, human poses, segmentation maps, and depth. By applying these techniques to image augmentation while preserving annotations, generated images can be used as additional training data and can improve recognition performance. However, for high‑level driving tasks such as traffic‑rule extraction and driving‑behavior understanding, simply using annotations as conditions is insufficient. Instead, images must be augmented while preserving the detailed high‑level structure of the original scene. One possible solution is to use multiple conditions so that generated images retain diverse structural cues after generation. However, when multiple conditions are used, conflicts among conditions can prevent reliable structure preservation. In this work, we input semantic segmentation, depth, and edges extracted from the original image into a multi‑condition image generation model, thereby providing rich structural information as conditions. We further propose a modeling approach for handling conflicts among multiple conditions and show that it enables image generation with stronger structural preservation. We also build a generation framework and evaluation protocol for driving tasks, establishing a basis for comparison with prior and future models. As a result, this work contributes to image generation research by addressing condition conflicts in multi‑condition generation and provides an important step toward mitigating data scarcity in high‑level autonomous‑driving tasks.
Authors:Dong Yang, Yiyi Cai, Haoyu Zhang, Yuki Saito, Hiroshi Saruwatari
Abstract:
Metric‑induced discrete flow matching (MI‑DFM) exploits token‑latent geometry for discrete generation, but its practical use is limited by two issues: heuristic schedulers requiring hyperparameter search, and finite‑step path‑tracking error from its first‑order continuous‑time Markov chain (CTMC) solver. We address both issues. First, we derive a kinetic‑optimal scheduler for prescribed scalar‑parameterized probability paths, and instantiate it for MI‑DFM as a training‑free numerical schedule that traverses the path at constant Fisher‑Rao speed. Second, we introduce a finite‑step moment correction that adjusts the jump probability while preserving the CTMC jump destination distribution. We validate the resulting method, GibbsTTS, on codec‑based zero‑shot text‑to‑speech (TTS). Under controlled comparisons with a unified architecture and large‑scale dataset, GibbsTTS achieves the best objective naturalness and is preferred in subjective evaluations over masked discrete generative baselines. Additionally, in comparison with the evaluated state‑of‑the‑art TTS systems, GibbsTTS shows strong speaker similarity, achieving the highest similarity on three of four test sets and ranking second on the fourth. Project page: https://ydqmkkx.github.io/GibbsTTSProject
Authors:Shusaku Egami, Aoi Ohta, Tomoki Tsujimura, Masaki Asada, Tatsuya Ishigaki, Ken Fukuda, Masahiro Hamasaki, Hiroya Takamura
Abstract:
Large Language Models (LLMs) provide flexible natural language processing capabilities, while knowledge graphs (KGs) offer explicit and structured knowledge. Integrating these two in a complementary manner enables the development of reliable and verifiable AI systems. In particular, knowledge graph question answering (KGQA) has attracted attention as a means to reduce LLM hallucinations and to leverage knowledge beyond the training data. However, existing KGQA benchmark datasets are biased toward encyclopedic knowledge, limited to a single modality, and lack fine‑grained spatiotemporal data, which limits their applicability to real‑world scenarios targeted by Embodied AI. We introduce HOME‑KGQA, a novel KGQA benchmark dataset built on a multimodal KG of daily household activities. HOME‑KGQA consists of complex, multi‑hop natural language questions paired with graph database query languages. Compared to existing benchmarks, it includes more challenging questions that involve multi‑level spatiotemporal reasoning, multimodal grounding, and aggregate functions. Experimental results show that the LLM‑based KGQA methods fail to achieve performance comparable to that on existing datasets when evaluated on HOME‑KGQA. This highlights significant challenges that should be addressed for the real‑world deployment of KGQA systems. Our dataset is available at https://github.com/aistairc/home‑kgqa
Authors:Xiaocheng Luo, Kang Wang, Zaifu Zhan, Yuechi Zhou, Xiangyu Duan
Abstract:
The Chain‑of‑Thought (CoT) paradigm, while enhancing the interpretability of Large Language Models (LLMs), is constrained by the inefficiencies and expressive limits of natural language. Latent Chain‑of‑Thought (latent CoT) reasoning, which operates in a continuous latent space, offers a promising alternative but faces challenges from structural complexities in existing multi‑step or multi‑model paradigms, such as error propagation and coordination overhead. In this paper, we introduce One‑Model One‑Step, a novel compression framework for Latent Reasoning with Rule‑Based Priors(RuPLaR) to address this challenge. Our method trains an LLM to autonomously generate latent reasoning tokens in a single training stage, guided by rule‑based prior probability distributions, thereby eliminating cascaded processes and inter‑model dependencies. To ensure reasoning quality, we design a joint training objective that enforces answer consistency via cross‑entropy, aligns soft tokens with rule‑based priors via KL divergence (the Soft Thinking constraint), and adds a problem‑thought semantic alignment constraint in the representation space. Extensive experiments show that our compression framework not only improves accuracy by 11.1% over existing latent CoT methods but also achieves this with minimal token usage, underscoring its effectiveness and extensibility. Code: https://github.com/xiaocen‑luo/RuPLaR.
Authors:Jiyeon Kim, Byungju Lee, Won-Yong Shin
Abstract:
Unlike most static material properties widely studied in the machine learning literature, ionic transport properties are inherently dynamic, making their fast and accurate prediction from static atomic structures challenging. The current standard approach, molecular dynamics (MD) simulations, suffers from prohibitively high computational cost. Recent autoregressive learning‑based MD acceleration methods requiring sequential inference remain slow and prone to error accumulation; in contrast, existing non‑autoregressive material property prediction models are less accurate because they fail to exploit dynamics. Moreover, existing methods typically benefit from datasets either with or without atomic trajectories, but not both. To overcome these limitations, we propose a non‑autoregressive learning framework based on auxiliary modality learning, which treats atomic trajectories as an auxiliary modality during training but does not require them at inference. This enables the predictor to learn dynamics without sequential inference while benefiting from both types of datasets. As a result, our framework achieves over 200 times speedup compared to autoregressive models on the dataset with atomic trajectories while substantially reducing prediction error relative to non‑autoregressive benchmarks across both types of datasets. Our code is available at https://github.com/jykim‑git/MD.
Authors:Boxuan Zhang, Jianing Zhu, Qifan Wang, Jiang Liu, Ruixiang Tang
Abstract:
Recent generative models can produce images that appear highly realistic, raising challenges in distinguishing real and AI‑generated images. Yet existing detectors based on pre‑trained feature extractors tend to over‑rely on global semantics, limiting sensitivity to the critical micro‑defects. In this work, we propose Micro‑Defects expose Macro‑Fakes (MDMF), a local distribution‑aware detection framework that amplifies micro‑scale statistical irregularities into macro‑level distributional discrepancies. To avoid localized forensic cues being diluted by plain aggregation, we introduce a learnable Patch Forensic Signature that projects semantic patch embeddings into a compact forensic latent space. We then use Maximum Mean Discrepancy (MMD) to quantify distributional discrepancies between generated and real images. Our theory‑grounded analysis shows that patch‑wise modeling yields provably larger discrepancies when localized forensic signals are present in generated images, enabling more reliable separation from real images. Extensive experiments demonstrate that MDMF consistently outperforms baseline detectors across multiple benchmarks, validating its general effectiveness. Project page: https://zbox1005.github.io/MDMF‑project/
Authors:Muhammed Ustaomeroglu, Guannan Qu
Abstract:
We propose Representational Effective Theory (RET), a framework for describing large language model computation in terms of learned macrostates rather than microscopic details. RET learns these macrostates from hidden‑state trajectories using a BYOL/JEPA‑style self‑supervised objective, coarse‑graining activations into macrovariables that preserve higher‑level structure relevant for prediction and interpretation. We evaluate whether these macrovariables are practically relevant for interpretability: RET yields temporally consistent states that reveal "mental‑state" trajectories of reasoning, capture high‑level semantic structure, support early prediction of behavioral outcomes such as sycophancy, and provide causal handles for steering generations toward interpretable computational phases. Together, these results suggest that LLM computation admits useful effective descriptions via RET: high‑level, dynamically meaningful variables that support interpretation, prediction, and intervention.
Authors:Dongyi Liu, Yifan Niu, Qinwen Wang, Han Xiao, Jia Li
Abstract:
Large Language Model (LLM)‑based search agents trained with reinforcement learning (RL) have significantly improved the performance of knowledge‑intensive tasks. However, existing methods encounter critical challenges in long‑horizon credit assignment: (i) Reward Sparsity, where models receive only outcome feedback without step‑level guidance to differentiate action quality; (ii) Isolated Credit, where credit is assigned to steps independently, failing to capture sequential dependencies; and (iii) Distributional Shift, where rewards are estimated on templates that deviate from the model's natural generative distribution. To address these issues, we propose Pivot‑Based Credit Assignment (PiCA), a novel step reward mechanism that reformulates the search trajectory as a sequential process of cumulative search progress. Unlike prior isolated step rewards, PiCA defines process rewards as success probabilities dependent on the historical context based on Potential‑Based Reward Shaping (PBRS). This approach identifies pivot steps, which comprise target golden sub‑queries and sub‑answers derived from historical trajectories, as information peaks that significantly boost the likelihood of a correct final answer. By anchoring these step rewards to the final task objective, PiCA provides dense, pivot‑aware and trajectory‑dependent guidance while maintaining distributional consistency. Extensive experiments show that PiCA outperforms existing strong baselines across seven knowledge‑intensive QA benchmarks, achieving 15.2% and 2.2% improvements for 3B and 7B models. The consistent performance gains across various models show PiCA's robust generalization. The code is available at https://github.com/novdream/PiCA.
Authors:Jiyeon Kim, Youngjoon Hong, Won-Yong Shin
Abstract:
Mesh‑based simulations provide high‑fidelity solutions to partial differential equations (PDEs), but achieving such accuracy typically requires fine meshes, leading to substantial computational overhead. Super‑resolution techniques aim to mitigate this cost by reconstructing high‑resolution (HR), high‑fidelity solutions from low‑cost, low‑resolution (LR) counterparts. However, training neural networks for super‑resolution often demands large amounts of expensive HR supervision data. To address this challenge, we propose SuperMeshNet, an HR data‑efficient super‑resolution framework for mesh‑based simulations aided by message passing neural networks (MPNNs). At its core, SuperMeshNet introduces complementary learning, a semi‑supervised approach that effectively leverages both 1) a small amount of paired LR‑HR data and 2) abundant unpaired LR data via two jointly trained, complementary MPNN‑based models. Additionally, our model is enriched by inductive biases, which are empirically shown to further improve super‑resolution performance. Extensive experiments demonstrate that SuperMeshNet requires 90% less HR data to achieve even lower root mean square error (RMSE) than that of the fully supervised benchmark without the inductive biases. The source code and datasets are available at https://github.com/jykim‑git/SuperMeshNet.git.
Authors:Kai Zhao, Dongliang Nie, Yuchen Lin, Zhehan Luo, Yixiao Gu, Deng-Ping Fan, Dan Zeng
Abstract:
Joint‑Embedding Predictive Architectures (JEPAs) provide a simpleframework for learning world models by predicting future latent representations.However, JEPA training is subject to a bias‑variance tradeoff.Without sufficient structural constraints, excessive representationalvariance causes the model to collapse to trivial solutions.The recent LeWorldModel (LeWM) shows that this issue can be alleviated bysimply constraining latent embeddings with an isotropic Gaussian prior.However, latent representations inherently lie on low‑dimensional manifoldswithin a high‑dimensional ambient space, and enforcing an isotropic Gaussianprior directly in this ambient space introduces an overly strong bias.In this work, we propose ame, which seeks a favorable operatingpoint on the bias‑variance frontier by applying Gaussian constraints inmultiple random subspaces rather than in the originalembedding space.This design relaxes the global constraint while preserving itsanti‑collapse effect, leading to a better balance between trainingstability and representation flexibility.Extensive experiments across fourcontinuous‑control environments demonstrate that consistentlyoutperforms LeWM with very clear margins.Our method is simple yet effective, and serves as a strong baseline for future JEPA‑based world model research.fdefinedeeemodeThe code is available at https://github.com/intcomp/Sub‑JEPA.
Authors:Yibang Li, Bihari Lal Pandey, Ravi Sah, Andi Han, Cyrus Mostajeran, Pratik Jawanpuria, Bamdev Mishra
Abstract:
Muon and related norm‑constrained matrix optimizers have become central to large‑scale learning problems. They are formulated as a linear maximization oracle (LMO) over an ambient matrix‑norm ball in unconstrained Euclidean space. However, these do not generalize cleanly to manifold‑valued parameters such as low‑rank factorizations, orthogonality constraints, or symmetric positive definite (SPD) matrices. Naively restricting the Muon LMO to the tangent space (i) breaks quotient symmetries and (ii) couples the tangent‑space constraint with an ambient norm bound, thereby obstructing closed‑form solutions on various manifolds of interest. We resolve both issues with a single observation: every Riemannian metric canonically lifts a unitarily invariant Euclidean norm to an intrinsic norm on each tangent space, and the resulting intrinsic norm constrained LMO is symmetry preserving. Building on this, we introduce intrinsic Muon (iMuon), a unified framework that yields closed‑form updates on the fixed‑rank, SPD, Stiefel, and Grassmann manifolds for any unitarily invariant norm, including the spectral, Frobenius, and nuclear norms. We establish convergence guarantees for both deterministic and stochastic iMuon with rate constants that depend only on the manifold dimension. Notably, on the fixed‑rank manifold this constant depends only on the rank, making the rate independent of factor conditioning and removing the runtime factor‑rescaling required by prior work. Experiments on LoRA finetuning of LLMs, image classification, and subspace learning illustrate the efficacy of the proposed approach.
Authors:Yu Wu, Ananth Mahadevan, Filip Ginter, Michael Mathioudakis, Mikko Tolonen
Abstract:
While digitized corpora have transformed the study of intellectual transmission, current methods rely heavily on lexical text reuse detection, capturing verbatim quotations but fundamentally missing paraphrases and complex implicit engagement. This paper evaluates semantic search in 18th‑century intellectual history through the reception of John Locke's foundational work. Using expert annotation grounded in a semantic taxonomy, we examine whether an off‑the‑shelf semantic search pipeline can surface meaning‑level correspondences overlooked by lexical methods. Our results demonstrate that semantic search retrieves substantially more implicit receptions than lexical baselines. However, linguistic diagnostics also reveal a "lexical gatekeeping" effect, where retrieval remains partially constrained by surface vocabulary overlap. These findings highlight both the potential and the limitations of semantic retrieval for analyzing the circulation of ideas in large historical corpora. The data is available at https://github.com/COMHIS/locke‑sim‑data.
Authors:Lei Ma, Suhani Chaudhary, Ethan Shanbaum, Athanasios Tassiadamis, Peter M. VanNostrand, Dennis M. Hofmann, Haowen Xu, Elke Rundensteiner
Abstract:
Logs are ubiquitous in modern systems. Unfortunately, their unstructured nature in flat sequences limits understanding of execution behaviors, hindering effective anomaly diagnosis. To address this, Krone introduces a novel hierarchical log abstraction that transforms flat log sequences into semantically coherent units across entity, action, and status levels. Building on this abstraction, Krone introduces a hierarchical orchestration framework that decomposes flat log sequences into hierarchical execution units and performs modular detection over them. It executes and optimizes the modular detection tasks across levels, enabling precise anomaly detection, localization, and explanation with selective invocation of LLM‑based reasoning. In this work, we present Krone‑viz, an interactive visualization system based on Krone, which makes hierarchical log analysis interpretable and actionable for software engineers and system operators. Demonstrated on the widely used HDFS benchmark dataset, Krone‑viz supports: 1) examining hierarchical decompositions of flat log sequences, 2) inspecting detection results and abnormal segments identified by Krone with LLM‑generated explanations, and 3) reusing, reviewing, and revising knowledge generated by LLMs with human‑in‑the‑loop guardrails. The code of Krone‑viz is available at https://github.com/LeiMa0324/KRONE_Demo_official, and we deploy a live demo at https://leima0324.github.io/KRONE_Demo_official.
Authors:Yang Zhou, Zihan Dong, Zhenting Wang, Can Jin, Shiyu Zhao, Bangwei Guo, Difei Gu, Linjun Zhang, Mu Zhou, Dimitris N. Metaxas
Abstract:
Agent skills can remarkably improve task success rates by using human‑written procedural documents, but their quality is difficult to assess without environment‑grounded verification. Existing skill generation methods heavily rely on preference logs rather than direct environment interaction, often yielding negligible or even degraded gains. We identify that it is a fundamental timing bottleneck: robust skills should be posterior‑based, distilled from empirical environment interaction rather than prior plans. In this study, we introduce the Posterior Distillation Index (PDI), a trajectory‑level metric that quantifies how well a distilled skill is grounded in the task‑environment evidence. To operationalize PDI, we present SPARK (Structured Pipelines for Autonomous Runnable tasKs and sKill generation) for preserving task execution evidence towards full trajectory‑level analysis. SPARK generates environment‑verified trajectories used to compute PDI, and it applies PDI as an online diagnostic and intervention signal to ensure posterior skill formation. Across 86 runnable tasks, SPARK‑generated skills consistently surpass no‑skill baselines and outperform human‑written skills on student models (inference cost up to 1,000x cheaper than teacher models). These findings show that PDI‑guided distillation produces efficient and transferable skills grounded in the task‑environment interaction. We release our code at https://github.com/EtaYang10th/spark‑skills .
Authors:Yang Zhou, Can Jin, Zihan Dong, Zhepeng Wang, Yanting Yang, Shiyu Zhao, Lei Li, Runxue Bao, Yaochen Xie, Dimitris N. Metaxas
Abstract:
Reinforcement learning improves the reasoning ability of large language models but remains costly and sample‑inefficient, as many rollouts provide weak learning signals. Difficulty‑aware data selection methods attempt to address this by prioritizing moderately difficult prompts, yet our analysis reveals three limitations: difficulty estimates become inaccurate under policy drift, data selection alone yields limited final‑performance gains, and inference efficiency remains largely unchanged. These findings suggest that efficient and effective RL requires more than filtering by difficulty: the policy should learn to solve hard tasks while producing concise responses for easy ones. To this end, we propose Dare, a unified framework that co‑evolves difficulty estimation with the policy via self‑normalized importance sampling, maintains diverse difficulty coverage through a symmetric Beta sampling distribution, and applies tailored training strategies across difficulty tiers with adaptive compute allocation. Extensive experiments across multiple models and domains demonstrate that Dare consistently outperforms existing methods in training efficiency, final effectiveness, and inference efficiency, producing more concise responses on easy tasks while improving correctness on hard ones. Code is available at https://github.com/EtaYang10th/DARE.
Authors:Fabio Rovai
Abstract:
We present Open Ontologies, an open‑source ontology engineering system implemented in Rust that integrates LLM‑driven construction with formal OWL reasoning and ontology alignment via the Model Context Protocol. Our primary finding is that stable 1‑to‑1 matching is the dominant factor in ontology alignment quality: on the OAEI Anatomy track, it achieves F1 = 0.832 (P = 0.963, R = 0.733), competitive with state‑of‑the‑art systems and exceeding all in precision. Ablation across five weight configurations shows that signal weights are irrelevant when stable matching is applied (F1 varies by less than 0.004), while removing stable matching drops F1 to 0.728. On the Conference track, the same method achieves F1 = 0.438. On tool‑augmented ontology interaction, we find a surprising result: an LLM reading a raw OWL file (F1 = 0.323) performs worse than the same LLM with no file at all (F1 = 0.431), while structured MCP tool access achieves F1 = 0.717. This demonstrates that tool structure provides a qualitatively different mode of access that the LLM cannot replicate by reading raw syntax. The system ships as a single binary under the MIT licence.
Authors:Ankit Hemant Lade, Sai Krishna Jasti, Indar Kumar, Aman Chadha
Abstract:
A Mamba state‑space model trained only for next‑step prediction appears to recover Granger‑causal structure through a simple readout S = |W_out W_in|, with early experiments suggesting the phenomenon generalized across architectures and benefited from interventional data at p < 10^‑5. We package the protocol used to test that claim ‑‑ standardized synthetic generators (VAR/Lorenz/CauseMe‑style), three intervention semantics (do(X=c), soft‑noise, random‑forcing), edge‑provenance cards on three real datasets, and size‑matched control arms ‑‑ as a reusable falsification benchmark, and walk the claim through it in five stages. The method‑level claim does not survive: (i) a plain linear bottleneck does as well or better; (ii) tuned Lasso beats the bottleneck on synthetic CauseMe‑style benchmarks, and on Lorenz‑96 (the only real benchmark with unambiguous ground truth) classical PCMCI and Granger lead a tight cluster in which the bottleneck trails; (iii) the headline intervention advantage is roughly 60% a sample‑size confound, and the residual disappears under standard do(X=c) interventions, surviving only under a non‑standard random‑forcing scheme; (iv) even that residual reproduces, with a larger effect, in classical bivariate Granger ‑‑ the effect is method‑agnostic. What survives is a narrow characterization result; the benchmark is the lasting artifact, and each stage above is one of its control arms.
Authors:Xingyuan Hua, Sheng Yue, Ju Ren
Abstract:
Recent advancements in agentic test‑time scaling allow models to gather environmental feedback before committing to final actions. A key limitation of existing methods is that they typically employ undifferentiated exploration strategies, lacking the ability to adaptively distinguish when exploration is truly required. In this paper, we propose an exploration‑aware reinforcement learning framework that enables LLM agents to adaptively explore only when uncertainty is high. Our method introduces a fine‑grained reward function via variational inference that explicitly evaluates exploratory actions by estimating their potential to improve future decision‑making, together with an exploration‑aware grouping mechanism that separates exploratory actions from task‑completion actions during optimization. By targeting informational gaps, this design allows agents to explore selectively and transition to execution as soon as the task context is clear. Empirically, we demonstrate that our approach achieves consistent improvements across a range of challenging text‑based and GUI‑based agent benchmarks. Code is available at https://github.com/HansenHua/EAPO‑ICML26 and models are available at https://huggingface.co/hansenhua/EAPO‑ICML26.
Authors:Tri Cao, Khoi Le, Thong Nguyen, Cong-Duy Nguyen, Quynh Vo, Anh Tuan Luu, Chunyan Miao, See-Kiong Ng, Shuicheng Yan, Bryan Hooi
Abstract:
While multimodal large language models (MLLMs) have advanced video understanding, they remain highly prone to hallucinations in dynamic scenes. We argue this stems from a failure in spatio‑temporal monitoring, the ability to persistently track object identities, states, and relations over time. Existing benchmarks obscure this deficit by relying on single final‑answer evaluations for queries that can often be resolved via local visual cues or statistical priors. To rigorously diagnose this, we introduce STEMO‑Bench (Spatio‑TEmporal MOnitoring), a benchmark of human‑verified object‑centric facts that evaluates intermediate reasoning by decomposing queries into sub‑questions, distinguishing genuine temporal understanding from coincidental correctness. To address failure modes exposed by STEMO, we propose STEMO‑Track, a novel object‑centric framework that explicitly constructs and reasons over structured object trajectories via chunk‑wise state extraction and temporal aggregation. Extensive experiments demonstrate that our object‑centric framework significantly reduces hallucinated answers and improves spatio‑temporal reasoning consistency over state‑of‑the‑art MLLMs.
Authors:Dongcheng Zhang, Yi Zhang, Yuxin Chen, An Zhang, Xiang Wang, Chaochao Lu
Abstract:
Large Reasoning Models possess remarkable capabilities for self‑correction in general domain; however, they frequently struggle to recover from unsafe reasoning trajectories under adversarial attacks. Existing alignment methods attempt to mitigate this vulnerability by fine‑tuning the model on expert data including reflection traces or adversarial prefixes. Crucially, these approaches are often hindered by static training data which inevitably deviate from model's dynamic, on‑policy reasoning traces, resulting in model hardly covering its vast generation space and learning to recover from its own failures. To bridge this gap, we propose Self‑ReSET, a pure reinforcement learning framework designed to equip LRMs with the intrinsic capacity to recover from their own safety error trajectories, which are subsequently reused as an initial state for reinforcement learning. Extensive experiments across various LRMs and benchmarks demonstrate that Self‑ReSET significantly enhances robustness against adversarial attacks especially out‑of‑distribution (OOD) jailbreak prompts while maintaining general utility, along with efficient data utilization. Further analysis reveals that our method effectively fosters self‑recovery patterns, enabling models to better identify and recover from unsafe intermediate error states back to benign paths. Our codes and data are available at https://github.com/Ing1024/Self‑ReSET.
Authors:Yi Zhang, Yuxin Chen, Leheng Sheng, Dongcheng Zhang, Chaochao Lu, Xiang Wang, An Zhang
Abstract:
While explicit Chain‑of‑Thought (CoT) empowers large reasoning models (LRMs), it enables the generation of riskier final answers. Current alignment paradigms primarily rely on externally enforced compliance, optimizing models to detect malicious prompts rather than evaluating the safety of their own outputs. We argue that this approach remains largely behavioral: our empirical analysis reveals that ostensibly aligned models lack intrinsic safety understanding, often failing to verify their own response safety and remaining vulnerable to adversarial jailbreaks. To address this fundamental limitation, we propose Safety Internal (SInternal), a framework that internalizes safety specifications by training LRMs exclusively on safety verification tasks to critique their own generated answers using expert reasoning trajectories. We demonstrate that learning to verify induces a strong generalization for response safety, significantly enhancing robustness against out‑of‑domain jailbreaks. Furthermore, when combined with reinforcement learning, SInternal serves as a superior initialization compared to standard supervised fine‑tuning, suggesting that internalizing safety understanding creates a more robust foundation for alignment than merely mimicking safe behaviors. Our codes are available at https://github.com/AlphaLab‑USTC/SInternal
Authors:Yuzhuang Xu, Xu Han, Yuxuan Li, Pengzhan Li, Wanxiang Che
Abstract:
Large language models (LLMs) achieve strong performance but incur high deployment costs, motivating extremely low‑bit but lossy quantization. Existing quantization algorithms mainly focus on improving the numerical accuracy of forward computation to eliminate performance degradation. In this paper, we show that extremely quantized LLMs suffer from systematic smoothness degradation beyond numerical precision loss. Through a smoothness proxy, we observe that such degradation becomes increasingly severe as the quantization bit‑width decreases. Furthermore, based on sequence neighborhood modeling, we find that quantized models exhibit a rapid reduction of effective token candidates within the prediction neighborhood, which directly leads to a sparser decoding tree and degraded generation quality. To validate it, we introduce a simple smoothness‑preserving principle in both post‑training quantization and quantization‑aware training, and demonstrate that preserving smoothness brings additional gains beyond numerical accuracy. The core goal of this paper is to highlight smoothness preservation as an important design consideration for future extreme quantization methods. Code is available at https://github.com/xuyuzhuang11/FINE.
Authors:Feng Xiong, Zengbin Wang, Yong Wang, Xuecai Hu, Jinghan He, Liang Lin, Yuan Liu, Xiangxiang Chu
Abstract:
Self‑evolving agents present a promising path toward continual adaptation by distilling task interactions into reusable knowledge artifacts. In practice, this paradigm remains hindered by two coupled bottlenecks: data inefficiency, where costly rollout effort is disproportionately spent on low‑value samples rather than informative ones, and knowledge interference, where heterogeneous knowledge stored in shared repositories leads to noisy retrieval and task‑misaligned guidance. Together, these issues form a self‑reinforcing failure loop in which uninformative rollouts yield noisy knowledge, which in turn degrades subsequent rollouts. In this work, we introduce Ace‑Skill, a co‑evolutionary framework that jointly optimizes rollout allocation and knowledge organization for self‑evolving multimodal agents. Specifically, Ace‑Skill combines aprioritized sampler with lazy‑decay proficiency tracking to focus rollouts on informative and insufficiently mastered samples, and a clustered organizer that semantically clusters knowledge for cleaner retrieval and more reliable adaptation. By improving sampling and organization together, Ace‑Skill turns self‑evolution into a virtuous cycle in which more informative rollouts produce higher‑quality knowledge that supports stronger subsequent rollouts. Across four multimodal tool‑use benchmarks, Ace‑Skill delivers strong gains (e.g., +35.46% relative improvement in Avg@4 accuracy), enabling an opensource 35B MoE model to match or surpass proprietary models. The acquired knowledge also transfers effectively in a zero‑shot manner to smaller 9B and 4B models, allowing resource‑constrained agents to inherit advanced capabilities without additional training. The code has been publicly available at https://github.com/AMAP‑ML/Ace‑Skill.
Authors:Jiahao Chen, Letian Gao, Yanhao Zhu, Wenbiao Zhou, Bing Su, Zhi John Lu, Bo Huang
Abstract:
Recent advances in generative modeling have enabled significant progress in structure‑based drug design (SBDD). Existing methods typically condition molecule generation on empty binding pockets from holo complexes, overlooking informative components such as the filler (ligands and solvent). Here, we leverage low‑resolution electron density (ED) derived from the filler as a physically grounded condition for de novo drug design. We consider two types of ED, calculated and cryo‑EM/X‑ray, obtainable from computational or experimental sources, supporting unified pre‑training and experimental integration. Compared with rigid pocket representations, experimental ED naturally captures conformational flexibility and provides a more faithful description of the binding environment. Based on this, we introduce EDMolGPT, a decoder‑only autoregressive framework that generates molecules from low‑resolution ED point clouds. By grounding generation in physically meaningful density signals, EDMolGPT mitigates structural bias and produces molecules with 3D conformations. Evaluations on 101 biological targets verify the effectiveness. Our project page: https://jiahaochen1.github.io/EDMolGPT_Page/.
Authors:Renjie Gu, Jiazhen Du, Yihua Zhang, Sijia Liu
Abstract:
Unlearning in large language models (LLMs) aims to remove harmful training data while preserving overall utility. However, we find that existing methods often hallucinate, generate abnormal token sequences, or behave inconsistently, raising safety and trust concerns. According to prior literature on LLM honesty, such behaviors are often associated with dishonesty. This motivates us to investigate the notion of honesty in the context of model unlearning. We propose a formal definition of unlearning honesty, which includes: (1) preserving both utility and honesty on retained knowledge, and (2) ensuring effective forgetting while encouraging the model to acknowledge its limitations and respond consistently to questions related to forgotten knowledge. To systematically evaluate the honesty of unlearning, we introduce a suite of metrics that cover utility, honesty on the retained set, effectiveness of forgetting, rejection rate and refusal stability in Q&A and MCQ settings. Evaluating 9 methods across 3 mainstream families shows that all current methods fail to meet these standards. After experimental and theoretical analyses, we present ReVa, a representation‑alignment procedure that fine‑tunes feature‑randomized unlearned models to better acknowledge forgotten knowledge. On Q&A tasks from the forget set, ReVa achieves the highest rejection rate after two rounds of interaction, nearly doubling the performance of the second‑best method. Remarkably, It also improves honesty on the retained set. We release our data and code at https://github.com/renjiegu.
Authors:Jiaming Liang, Chi-Man Pun, Weisi Lin, Greta Seng Peng Mok
Abstract:
Learned image compression (LIC) integrates deep neural networks (DNNs) to map high‑dimensional images into compact latent representations, reducing redundancy and achieving superior rate‑distortion (RD) performance in benign settings. Unfortunately, due to inherent vulnerabilities in DNNs, LIC systems are susceptible to adversarial perturbations that lead to downstream deterioration, compression rate degradation, untargeted distortion, and both local semantic manipulation (LSM) and low‑resolution (3×28×28) global semantic manipulation (GSM). However, high‑resolution GSM remains unexplored due to its intractability. Notably, the existing project gradient descent (PGD) method achieves near‑perfect white‑box attacks for classification, segmentation, and other tasks, yet fails to generalize to high‑resolution GSM. Our theoretical and empirical analyses reveal that well‑performing GSM drives adversarial examples from the Identity Region to the Amplification Region through the Lazying‑Oscillating‑Refining stages. General \ell_\infty‑bounded attacks fail on high‑resolution GSM because their step‑size schedules cannot accommodate both the Oscillating and Refining stages. Based on this, we propose the Periodic Geometric Decay schedule that enables \ell_\infty‑bounded high‑resolution GSM. To verify our approach, we integrate it with PGD, yielding a minimal variant, PGD^2‑GSM. Extensive experiments on the Kodak (3×768×512) demonstrate that our PGD^2‑GSM is the first to stably achieve high‑resolution GSM, thereby exposing a novel threat to LIC systems. Code is available at https://github.com/chinaliangjiaming/PGD2‑GSM.
Authors:Boxuan Zhang, Jianing Zhu, Zeru Shi, Dongfang Liu, Ruixiang Tang
Abstract:
LLM‑based multi‑agent systems are increasingly deployed on long‑horizon tasks, but a single decisive error is often accepted by downstream agents and cascades into trajectory‑level failure. Existing work frames this as \emphpost‑hoc failure attribution, diagnosing the responsible agent and step after the trajectory has ended. However, this paradigm forfeits any opportunity to intervene while trajectory is still unfolding. In this work, we introduce AgentForesight, a framework that reframes this problem as online auditing: at each step of an unfolding trajectory, an auditor observes only the current prefix and must either continue the run or alarm at the earliest decisive error, without access to future steps. To this end, we curate AFTraj‑2K, a corpus of agentic trajectories across Coding, Math, and Agentic domains, in which safe trajectories are retained under a strict curation pipeline and unsafe trajectories are annotated at the step of their decisive error via consensus among multiple LLM judges. Built on that, we develop AgentForesight‑7B, a compact online auditor trained with a coarse‑to‑fine reinforcement learning recipe that first equips it with a risk‑anticipation prior at the failure boundary on adjacent safe/unsafe prefix pairs, then sharpens this prior into precise step‑level localization under a three‑axis reward jointly targeting the what, where, and who of an audit verdict. Across AFTraj‑2K and an external Who\&When benchmark, AgentForesight‑7B outperforms leading proprietary models, including GPT‑4.1 and DeepSeek‑V4‑Pro, achieving up to +19.9% performance gain and 3× lower step localization error, opening the loop from post‑hoc failures detection to enabling deployment‑time intervention. Project page: https://zbox1005.github.io/agent‑foresight/
Authors:Hyunmin Hwang, Jaemin Kim, Choonghan Kim, Hangeol Chang, Jong Chul Ye
Abstract:
Multi‑agent reasoning has shown promise for improving the problem‑solving ability of large language models by allowing multiple agents to explore diverse reasoning paths. However, most existing multi‑agent methods rely on inference‑time debate or aggregation, which can be vulnerable to incorrect peer influence and biased consensus. Moreover, the agents themselves remain static, as their underlying reasoning skills do not evolve across tasks. In this paper, we introduce AgentPSO, a particle‑swarm‑inspired framework for evolving multi‑agent reasoning skills. AgentPSO treats each agent as a particle‑like reasoner whose state is a natural‑language skill and whose velocity is a semantic update direction, iteratively moving agents toward stronger skill states to improve both individual and collective reasoning performance. Across training iterations, each agent updates its skill by combining its previous velocity, personal‑best skill, global‑best skill, and a self‑reflective direction derived from peer reasoning trajectories. This enables agents to learn reusable reasoning behaviors from both their own experiences and the strongest skills discovered by the population, without updating the parameters of the backbone language model. Experiments on mathematical and general reasoning benchmarks show that AgentPSO improves over static single‑agent skills and test‑time‑only multi‑agent reasoning baselines. The evolved skills further transfer across benchmarks and to another backbone model, suggesting that AgentPSO captures reusable reasoning procedures rather than merely optimizing benchmark‑specific prompts. Code is open‑sourced at https://github.com/HYUNMIN‑HWANG/AgentPSO/.
Authors:Chengcheng Sun, Chenhao Li, Xiang Lin, Tianji Zheng, Fanrong Meng, Xiaobin Rui, Zhixiao Wang
Abstract:
Graph neural networks (GNNs) aim to learn well‑trained representations in a lower‑dimension space for downstream tasks while preserving the topological structures. In recent years, attention mechanism, which is brilliant in the fields of natural language processing and computer vision, is introduced to GNNs to adaptively select the discriminative features and automatically filter the noisy information. To the best of our knowledge, due to the fast‑paced advances in this domain, a systematic overview of attention‑based GNNs is still missing. To fill this gap, this paper aims to provide a comprehensive survey on recent advances in attention‑based GNNs. Firstly, we propose a novel two‑level taxonomy for attention‑based GNNs from the perspective of development history and architectural perspectives. Specifically, the upper level reveals the three developmental stages of attention‑based GNNs, including graph recurrent attention networks, graph attention networks, and graph transformers. The lower level focuses on various typical architectures of each stage. Secondly, we review these attention‑based methods following the proposed taxonomy in detail and summarize the advantages and disadvantages of various models. A model characteristics table is also provided for a more comprehensive comparison. Thirdly, we share our thoughts on some open issues and future directions of attention‑based GNNs. We hope this survey will provide researchers with an up‑to‑date reference regarding applications of attention‑based GNNs. In addition, to cope with the rapid development in this field, we intend to share the relevant latest papers as an open resource at https://github.com/sunxiaobei/awesome‑attention‑based‑gnns.
Authors:Yinwei Dai, Zhuofu Chen, Lijie Yang, Ravi Netravali
Abstract:
State‑of‑the‑art physical AI models generate a chunk of actions per inference through diffusion or flow matching, iteratively refining an initial noise sample into an action trajectory. Because this inference process is inherently stochastic, committing to a single trajectory per round is brittle, and this brittleness compounds across the many sequential rounds that comprise a complete episode. We introduce KeyStone, an inference‑time self‑consistency method for diffusion‑based action generation that draws K candidate action chunks in parallel from a shared model context, clusters them in continuous action space, and returns the medoid of the largest cluster ‑‑ no additional model required. Two properties make this practical. First, the compact nature of action trajectories makes diffusion inference memory‑bandwidth bound, leaving spare compute capacity to run K chains in parallel with no additional wall‑clock latency. Second, unlike token or pixel spaces where distance carries no semantic meaning and selection requires a learned judge, action chunks are geometrically structured such that Euclidean distance directly reflects physical similarity, making selection principled and judge‑free. Across diverse vision‑language‑action models (VLAs) and world‑action models (WAMs), KeyStone improves task success rates by up to 13.3% over single‑trajectory sampling with negligible latency overhead, while having on par accuracy with model‑based selectors at no training cost. We open source KeyStone at https://github.com/dywsjtu/keystone.
Authors:Zihao An, Taichi Liu, Ziqiong Liu, Dong Li, Ruofeng Liu, Emad Barsoum
Abstract:
Speculative decoding accelerates Large Language Models (LLMs) inference by using a lightweight draft model to propose candidate tokens that are verified in parallel by the target model. However, existing draft model training objectives are not directly aligned with the inference‑time goal of maximizing consecutive token acceptance. To address this issue, we reformulate the draft model optimization objective, shifting the focus from token prediction accuracy to the overall acceptance length. In this paper, we build upon PARD to propose PARD‑2, a dual‑mode speculative decoding framework with Confidence‑Adaptive Token (CAT) optimization. This approach adaptively reweights each token to better align with the verification process. Notably, PARD‑2 enables a single draft model to support both target‑dependent and target‑independent modes. Experiments across diverse models and tasks demonstrate that PARD‑2 achieves up to 6.94× lossless acceleration, surpassing EAGLE‑3 by 1.9× and PARD by 1.3× on Llama3.1‑8B. Our code is available at https://github.com/AMD‑AGI/PARD.
Authors:Junwei Liao, Haoting Shi, Ruiwen Zhou, Jiaqian Wang, Shengtao Zhang, Wei Zhang, Ying Wen, Zhiyu Li, Feiyu Xiong, Bo Tang, Weinan Zhang, Muning Wen
Abstract:
Episodic memory allows LLM agents to accumulate and retrieve experience, but current methods treat each memory independently, i.e., evaluating retrieval quality in isolation without accounting for the dependency chains through which memories enable the creation of future memories. We introduce MemQ, which applies TD(λ) eligibility traces to memory Q‑values, propagating credit backward through a provenance DAG that records which memories were retrieved when each new memory was created. Credit weight decays as (γλ)^d with DAG depth d, replacing temporal distance with structural proximity. We formalize the setting as an Exogenous‑Context MDP, whose factored transition decouples the exogenous task stream from the endogenous memory store. Across six benchmarks, spanning OS interaction, function calling, code generation, multimodal reasoning, embodied reasoning, and expert‑level QA, MemQ achieves the highest success rate on all six in generalization evaluation and runtime learning, with gains largest on multi‑step tasks that produce deep and relevant provenance chains (up to +5.7~pp) and smallest on single‑step classification (+0.77~pp) where single‑step updates already suffice. We further study how γ and λ interact with the EC‑MDP structure, providing principled guidance for parameter selection and future research. Code is available at https://github.com/jwliao‑ai/MemQ.
Authors:Timothy C. Cogan
Abstract:
Mazocarta is a seeded procedural tactical deckbuilder implemented in Rust, compiled to WebAssembly for browser play, and executable natively for simulation. Its primary technical contribution is not the invention of a new deckbuilding genre, but the construction of an instrumented game‑development reference artifact: the same rules engine supports interactive play, native command‑line simulation, automated end‑to‑end tests, save/load fixtures, and local‑area multiplayer. This paper describes Mazocarta's architecture, deterministic run model, reproducible balance probes, and QR‑mediated WebRTC pairing for local multiplayer. An evaluation snapshot over 1,000 deterministic seeds shows that the simulation pipeline can produce reproducible development signals. In the evaluated configuration, single‑player and two‑player autoplay win rates were 36.1% and 34.9% over 1,000 deterministic seeds, respectively. These rates are not presented as final player‑facing balance metrics, but as repeatable probes for future balance shifts and regressions. Mazocarta is positioned as a playable open‑source reference artifact for instrumented game development: deterministic regression checks, automated playtesting workflows, balance probes for game mechanics, and browser‑native local multiplayer all exercise one shared production rules core.
Authors:Wenhao Wu, Zishan Shao, Kangning Cui, Jinhee Kim, Yixiao Wang, Hancheng Ye, Danyang Zhuo, Yiran Chen
Abstract:
SVD‑based Low‑rank compression reduces transformer parameters and nominal FLOPs, but these savings often translate poorly into real LLM serving speedups. We show that this gap is largely a runtime problem: factorized checkpoints fragment execution paths, and the resulting overhead differs substantially between prefill and autoregressive decode. We present FlashSVD v1.5, a unified inference runtime for serving SVD‑compressed transformers. FlashSVD v1.5 maps diverse public SVD compression families to a common factorized representation and combines phase‑specific kernels with dense‑KV decode, packed MLP execution, and per‑layer CUDA‑graph replay to reorganize the low‑rank serving path into a thin runtime. Across representative decoder‑serving settings, FlashSVD v1.5 achieves up to 2.55x decode and 2.39x end‑to‑end speedup, and it attains 1.48x average decode and 1.44x average end‑to‑end speedup across multiple popular SVD compression families. These results suggest that practical low‑rank acceleration requires runtime co‑design, not compression algorithms alone. Our code is available at: https://github.com/Zishan‑Shao/FlashSVD.
Authors:Zhichao Liu, Wenbo Pan, Haining Yu, Ge Gao, Tianqing Zhu, Xiaohua Jia
Abstract:
Browser agents are increasingly deployed in long‑horizon tasks, which require executing extended action chains to accomplish user goals. However, this prolonged execution process provides attackers with more opportunities to inject malicious instructions. Existing prompt injection attacks against browser agents expose two key gaps: (1) low effectiveness, as attacks optimized for toy baselines fail to achieve end‑to‑end goals in real‑world scenarios with complex environments and longer steps; (2) weak stealthiness, since most attacks pit the attack goal against the user goal, causing a significant drop in system usability under attack. To address these gaps, we propose WebTrap, a mid‑task hijacking injection attack. It employs multi‑step instruction fusion steering to seamlessly combine both goals, enabling the agent to resume the original user task after executing the attack goal. Furthermore, we design a context‑grounded generation method to align the injected content with the task environment and system instructions, maximizing the hijacking success rate. Extensive experiments on two browser agent tasks, based on extended WASP and InjecAgent environments, demonstrate that our method achieves a high attack success rate while preserving the usability of the original system. We find that WebTrap exploits the agent's navigation vulnerabilities, binding the two goals so tightly that standard defense mechanisms cannot restore the system to normal operation. These findings reveal a critical vulnerability in agent systems during long‑horizon tasks that they can be stealthily hijacked.
Authors:Siyu Wu, Yulong Ye, Zezhen Xiang, Pengzhou Chen, Gangda Xiong, Tao Chen
Abstract:
Large Language Model (LLM) systems have been the frontier of AI in many application domains, leading to new challenges and opportunities for hyperparameter optimization (HPO) for the AutoML community. However, this type of system exhibits an unprecedented compound space of hyperparameter configuration from both the AI and non‑AI components; rich and nonlinear implications from the fidelity factors; and diverse costs of measuring hyperparameter configurations, none of which have been fully captured in existing benchmarks. This paper presents the first (live) benchmark suite and datasets for HPO of real‑world LLM systems, dubbed LLMSYS‑HPOBench, covering data related to the inference objective values of hyperparameter configurations profiled from running the LLM systems. Currently, LLMSYS‑HPOBench contains 364,450 hyperparameter configurations with a dimensionality of 12‑23, 3‑5 dimensions of fidelity factor leading to 932 settings, 3‑9 inference objective metrics, and 2‑10 cost metrics, together with generated logs from measuring the LLM systems. What we seek to advocate is not only a revalidation of the existing HPO algorithms over the frontier LLM systems, but also to provide an evolving platform for the AutoML community to explore new directions of research in this regard. The benchmark suite has been made available at: https://github.com/ideas‑labo/llmsys‑hpobench
Authors:Abdulvahap Mutlu, Şengül Doğan, Türker Tuncer
Abstract:
Manifold‑Constrained Hyper‑Connections (mHC) introduce a stability‑motivated variant of multi stream residual mixing by constraining residual stream mixing matrices to the manifold of doubly stochastic matrices via Sinkhorn‑Knopp projection. In his work, we study whether mHC‑style constrained multi‑stream residual topology transfers effectively to state space model (SSM) language modeling. We implement a static mHC mechanism around an SSM block by expanding the residual stream into multiple parallel streams, aggregating streams into a single SSM input through simplex‑constrained pre‑mixing, scattering the SSM output back to streams through simplex‑constrained post‑mixing, and applying Sinkhorn‑projected residual stream mixing at each layer. We further introduce stream‑specialized adapters that add lightweight stream‑specific capacity through a shared bottleneck with per‑stream scaling, applied both before stream aggregation and after the SSM output prior to scattering. We evaluate baseline single‑stream SSM, static mHC SSM, and mHC SSM with adapters on WikiText‑2 using identical training settings and report checkpoint‑based validation loss, perplexity, throughput, and peak GPU memory. Under the reported fair checkpoint evaluation, static mHC improves validation loss from 6.3507 to 6.2448 and reduces perplexity from 572.91 to 515.35, while mHC with adapters further improves validation loss to 6.1353 and perplexity to 461.88. These gains are accompanied by modest throughput reductions from 1025.52 to 964.81 and 938.90 tokens per second, and increased peak memory from 2365 MB to 2568 MB and 3092 MB. The results suggest that mHC‑inspired constrained multi‑stream residual mixing can yield measurable quality improvements in SSM language models and that stream‑specialized adapter capacity can further enhance performance with predictable efficiency tradeoffs.
Authors:Xincheng Yao, Ruoqi Li, Cheng Chen, Daoxin Zhang, Yi Wu, Yao Hu, Chongyang Zhang
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a pivotal technique for enhancing the reasoning capabilities of Large Language Models (LLMs). However, the de facto practice of mainstream RL algorithms is to treat all tokens of one response equally and assign the same optimization objective to each token, failing to provide granular guidance for the reasoning process. While in Chain‑of‑Thought (CoT) reasoning, different tokens usually play distinct roles. Therefore, the current RL algorithms lack an effective mechanism to dynamically balance the exploration‑exploitation trade‑off during learning. To this end, we propose Hierarchical Token‑level Objective Control Policy Optimization (HTPO), a novel RL algorithm that takes the divide‑and‑conquer idea to hierarchically partition the response tokens into specific functional groups from three aspects (i.e., prompt difficulty, answer correctness, and token entropy). Within each group, according to the contributions to exploration or exploitation, we design specialized optimization objectives to facilitate the effective execution of each token's expected functionality. In this way, HTPO can achieve a more balanced exploration‑exploitation trade‑off. Extensive experiments on challenging reasoning benchmarks validate the superiority of our HTPO algorithm, which significantly outperforms the strong DAPO baseline (e.g., +8.6% and +6.7% on AIME'24 and AIME'25, respectively). When scaling test‑time compute, the HTPO‑trained model maintains a consistent performance advantage over the DAPO baseline, and the gap widens as the sampling budget increases, validating that our adaptive token‑level control method fosters effective exploration without sacrificing exploitation performance. Code will be at https://github.com/xcyao00/HTPO.
Authors:Lennard M. van Karnenbeek, Hilde G. A. van der Pol, Mark Wijkhuizen, Eva Poelman, Caroline A. Drukker, Theo Ruers, Freija Geldof, Behdad Dashtbozorg
Abstract:
Purpose: We aim to enhance the image quality of point‑of‑care ultrasound (POCUS) devices using deep learning and a novel paired dataset of POCUS and high‑end ultrasound images. Approach: We collected the first accurately paired dataset using a custom‑built automated gantry system of low‑end POCUS and high‑end ultrasound images. A conditional generative adversarial network (cGAN) was utilized based on the pix2pix architecture, with a U‑Net generator that incorporates both L1 and structural similarity index (SSIM) losses to improve perceptual quality. Pretraining on a simulation dataset further boosts performance. Evaluation was performed on 1064 paired ex vivo tissue and phantom ultrasound image sets. Results: Our approach improves the SSIM from 0.29 to 0.54 and PSNR from 19.16 dB to 22.41 dB. No‑reference metrics also indicate substantial enhancement, with the Natural Image Quality Evaluator (NIQE) and Perception‑based Image Quality Evaluator (PIQE) scores dropping from 7.95 to 4.44 and 31.12 to 19.99, respectively. Conclusions: This work presents the first publicly available accurately paired dataset of low‑end POCUS to high end ultrasound images. Additionally, our results demonstrate the potential of the proposed framework to overcome hardware limitations of handheld POCUS, enhancing its diagnostic value in low‑resource and point‑of‑care settings. The POCUS‑IQ Dataset is publicly available at https://github.com/NKI‑MedTech‑AI/POCUS‑IQ.
Authors:Kejia Chen, Jiawen Zhang, Boheng Li, Pengcheng Li, Jian Lou, Zunlei Feng, Mingli Song, Ruoxi Jia, Tianwei Zhang
Abstract:
Many‑shot jailbreaking (MSJ) causes safety‑aligned language models to answer harmful queries by preceding them with many harmful question‑answer demonstrations. We study why this attack becomes stronger as the number of demonstrations increases. Empirically, we find that MSJ induces a progressive activation drift: the representation of a fixed harmful query moves step by step away from the safety‑aligned region as more harmful demonstrations are added. Theoretically, we show that this drift can be interpreted as implicit malicious fine‑tuning: conditioning on N harmful demonstrations induces SGD‑style updates equivalent to optimizing on the corresponding N harmful samples. This view turns the attack mechanism into a defense principle. We append a fixed one‑shot safety demonstration at inference time, which induces a counteracting safety‑oriented update and restores refusal behavior. The resulting method improves the model's robustness to MSJ without modifying its parameters or requiring white‑box access at deployment. Code is available at https://github.com/Thecommonirin/SafeEnd.
Authors:Jiazheng Li, Chi-Hao Wu, Yunze Liu, Kaize Ding, Jundong Li, Chuxu Zhang
Abstract:
Understanding ultra‑long videos such as egocentric recordings, live streams, or surveillance footage spanning days to weeks, remains a challenge. For current multimodal LLMs: even with million‑token context windows, frame budgets cover only tens of minutes of densely sampled video, and most evidence is discarded before inference begins. Memory‑augmented and agentic approaches help with scale, but their retrieval remains fragmented across modalities and lacks long‑range narrative summaries that span days or weeks. We propose MAGIC‑Video, a training‑free framework built around a multimodal memory graph with interleaved narrative chain: the graph unifies episodic, semantic, and visual content through six typed edges and supports cross‑modal retrieval, while the chain distils long‑horizon entity biographies and recurring activity events. At inference time, an agentic loop interleaves graph retrieval with narrative fact injection, covering both the modality and time dimensions of ultra‑long video in a single retrieval pipeline. On EgoLifeQA, Ego‑R1 and MM‑Lifelong, MAGIC‑Video consistently outperforms strong general‑purpose, long‑video, and agentic baselines, with gains of 10.1, 7.4, and 5.9 points over the prior best agentic system on each benchmark. Code is available at https://github.com/lijiazheng0917/MAGIC‑video.
Authors:Logan Mann, Ajit Saravanan, Ishan Dave, Shikhar Shiromani, Saadullah Ismail, Yi Xia, Emily Huang
Abstract:
A pervasive intuition holds that vision‑language models (VLMs) are most trustworthy when their attention maps look sharp: concentrated attention on the queried region should imply a confident, calibrated answer. We test this Attention‑Confidence Assumption directly. We instrument three open‑weight VLM families (LLaVA‑1.5, PaliGemma, Qwen2‑VL; 3‑7B parameters) with a unified mechanistic pipeline ‑‑ the VLM Reliability Probe (VRP) ‑‑ that compares attention structure, generation dynamics, and hidden‑state geometry against a single correctness label. Three results emerge. (i) Attention structure is a near‑zero predictor of correctness (R_pb(C_k,y)=0.001, 95% CI [‑0.034,0.036]; R_pb(H_s,y)=‑0.012, [‑0.047,0.024] on a pooled n=3,090 split), even though attention remains causally necessary for feature extraction (top‑30% patch masking drops accuracy by 8.2‑11.3 pp, p<0.001). (ii) Reliability becomes legible later in the computation: a single hidden‑state linear probe reaches AUROC>0.95 on POPE for two of three families, and self‑consistency at K=10 is the strongest behavioral predictor we measure at 10x inference cost (R_pb=0.43). (iii) Causal neuron‑level ablations expose a sharp architectural split with direct monitor‑design implications: late‑fusion LLaVA concentrates reliability in a fragile late bottleneck (‑8.3 pp object‑identification accuracy after top‑5 probe‑neuron ablation), whereas early‑fusion PaliGemma and Qwen2‑VL distribute it widely and absorb destruction of ~50% of their peak‑layer hidden dimension with <=1 pp degradation. The takeaway is narrow but consequential: in 3‑7B VLMs, reliability is read more reliably off hidden‑state geometry, layer‑wise margin formation, and sparse late‑layer circuits than off attention‑map sharpness.
Authors:Farjana Yesmin
Abstract:
We present FairHealth, an open‑source Python library that provides a unified, modular framework for trustworthy machine learning in healthcare applications, with particular focus on low‑resource and low‑income country (LMIC) settings such as Bangladesh. FairHealth addresses four critical gaps in existing healthcare AI toolkits: (1) the absence of integrated fairness auditing for biosignals and clinical tabular data; (2) the lack of privacy‑preserving federated learning tools compatible with standard ML workflows; (3) missing explainability tools tailored for low‑bandwidth clinical decision support; and (4) no existing toolkit covering Global South healthcare datasets. Built from five peer‑reviewed research contributions, FairHealth provides six modules covering federated learning with homomorphic encryption (fairhealth.federated), intersectional fairness metrics (fairhealth.fairness), hybrid fuzzy‑SHAP explainability (fairhealth.explain), multilingual dengue triage (fairhealth.lowresource), equitable disaster aid allocation (fairhealth.equity), and public dataset loaders (fairhealth.datasets). All datasets used are publicly available without institutional data use agreements. FairHealth is installable via pip install fairhealth(PyPI: pypi.org/project/fairhealth/) and available at https://github.com/Farjana‑Yesmin/fairhealth.
Authors:Maria Stoica, Abdelrahman Hekal, Alessio Lomuscio
Abstract:
Reliable out‑of‑distribution (OOD) detection is a critical requirement for the safe deployment of machine learning systems. Despite recent progress, state‑of‑the‑art OOD detectors are highly susceptible to adversarial attacks, which undermines their trustworthiness in automated systems. To address this vulnerability, we apply median smoothing to baseline OOD detection scores, balancing clean and adversarial accuracies. Our key insight is that the noisy samples generated for median smoothing can be repurposed to quantify the local instability of the base score. We observe that OOD samples exhibit higher instability under perturbation. Based on this, we propose ROSS, a novel and robust post‑hoc OOD detector that leverages the instability of baseline scores to further distinguish between in‑distribution (ID) and OOD samples. ROSS achieves symmetric robustness, performing strongly against both score‑minimising and score‑maximising attacks, unlike prior work. This symmetric defence leads to state‑of‑the‑art robustness, outperforming prior methods by up to 40 AUROC points. We demonstrate ROSS's effectiveness on extensive experiments across CIFAR‑10, CIFAR‑100, and ImageNet. Code is available at: https://github.com/Abdu‑Hekal/ROSS.
Authors:Weicai Yan, Xinhua Ma, Wang Lin, Tao Jin
Abstract:
Parameter‑efficient fine‑tuning methods introduce a small number of training parameters, enabling pre‑trained models to adapt rapidly to new data distributions. While these methods have shown promising results, they exhibit notable limitations. First, most existing methods operate in the signal space domain, which results in substantial information redundancy. Second, most existing methods utilize fixed prompts or adaptation layers, failing to fully account for the multi‑scale characteristics of signals. To address these challenges, we propose the Multi‑Scale Frequency Adapter (FreqAdapter), which integrates textual information and performs multi‑scale fine‑tuning of signals in the frequency domain. Additionally, we introduce a multi‑scale adaptation strategy to optimize receptive fields across different frequency ranges, further enhancing the model's representational capacity. Extensive experiments on multimodal models, including CLIP and LLaVA, demonstrate that FreqAdapter significantly improves both performance and efficiency. FreqAdapter improves performance with minimal cost and fast convergence within one epoch. Code is available at https://github.com/Kelvin‑ywc/FreqAdapter.
Authors:Ayoub Agouzoul
Abstract:
Vision‑Language‑Action (VLA) models offer a promising path to generalist robot control, but their inference latency causes observation staleness when generated actions are executed asynchronously. Several methods have been proposed concurrently to mitigate this problem: inference‑time inpainting (IT‑RTC), training‑time delay simulation (TT‑RTC), future‑state‑aware conditioning (VLASH), and lightweight residual correction (A2C2). Each takes a fundamentally different approach, but they have so far been evaluated independently with different codebases, base policies, and protocols. We present a systematic comparison of these four methods under controlled conditions. We develop two unified codebases that integrate all methods with harmonized library and dataset versions, and we benchmark them on the Kinetix suite with MLPMixer policies and on the LIBERO manipulation benchmark with SmolVLA, sweeping inference delays up to d=20 control steps. A2C2's per‑step residual correction is the most effective method on Kinetix, holding above 90% solve rate up to d=8, and also leads on LIBERO from d=4 onwards. IT‑RTC is competitive at low delays but degrades sharply under long chunks (H=30) and high delays. TT‑RTC is the most robust training‑based method: stable across d_\max choices, generalizes beyond its training delay distribution, and adds zero inference overhead. VLASH exhibits a clear low‑delay vs. high‑delay trade‑off governed by the fine‑tuning delay range [0,d_\max]. Code is available at https://github.com/TheAyos/async‑vla‑inference
Authors:Girmaw Abebe Tadesse, Titien Bartette, Andrew Hassanali, Allen Kim, Jonathan Chemla, Andrew Zolli, Yves Ubelmann, Caleb Robinson, Inbal Becker-Reshef, Juan Lavista Ferres
Abstract:
Monitoring archaeological sites at scale is vital for protecting cultural heritage, yet pinpointing when disturbances occur remains difficult because visual cues are subtle and ground‑truth data are sparse. We introduce WATCH, a framework for month‑level change‑event localization over PlanetScope satellite mosaics (2017‑2024, 4.7 m/px) that supports three complementary scoring approaches: (i) Temporal Embedding Distance (TED), a training‑free method that scores month‑to‑month deviations from a local temporal reference; (ii) Self‑Supervised Change Detection (SSCD), an ensemble of reconstruction, forecasting, and latent‑novelty signals; and (iii) a Weakly Supervised (WS) temporal localization model trained with sparse event‑month labels. We benchmark WATCH on 1,943 archaeological sites in Afghanistan using embeddings from six foundation models (CLIP, GeoRSCLIP, SatMAE, Prithvi‑EO‑2.0, DINOv3, and Satlas‑Pretrain) alongside a handcrafted spectral and texture baseline, and assess cross‑regional generalization on sites in Syria, Turkey, Pakistan, and Egypt. The unsupervised approaches (TED, SSCD) consistently outperform the weakly supervised alternative. TED with SatMAE achieves the highest exact‑month recall (55% at m=0), while TED with GeoRSCLIP, CLIP, or Satlas‑Pretrain reaches 92.5% within a three‑month tolerance (m=3). Handcrafted features remain competitive for exact‑month detection under weak supervision. Our directional margin analysis reveals systematic temporal biases: SSCD paired with GeoRSCLIP or Prithvi‑EO‑2.0 exhibits the strongest early‑warning profile, detecting anomalies before the recorded event, while TED favors confirmation‑oriented detection after a change has materialized. These results show that satellite imagery combined with foundation‑model embeddings enables scalable, decision‑relevant heritage monitoring. Code: https://github.com/microsoft/WATCH
Authors:Jincheng Xie, Yawen Ling, Qi Xiao, Feiyu Zhang, Zhongyi Huang, Wen Hu, Yu Zheng
Abstract:
LLM serving platforms are increasingly deployed as multi‑model cloud systems, where user demand is often long‑tailed: a few popular large models receive most requests, while many smaller tail models remain underutilized. We propose SPECTRE (Parallel SPECulative Decoding with a Multi‑Tenant REmote Drafter), a serving framework that reuses underutilized tail‑model services as remote drafters for heavily loaded large‑model services through speculative decoding. SPECTRE enables draft generation and target‑side verification to run in parallel, and makes such parallelism effective through three techniques: a hybrid ordinary‑parallel speculative decoding strategy guided by a threshold derived from throughput analysis, speculative priority scheduling to preserve draft‑‑target overlap under multi‑tenant traffic, and draft‑side prompt compression to reduce draft latency. We implement SPECTRE in \textttSGLang and evaluate it across multiple draft‑‑target model pairs, reasoning benchmarks, real‑world long‑context workloads, and a wide range of batch sizes. Results show that SPECTRE consistently improves large‑model serving throughput while causing only minor interference to the native workloads of tail‑model services. In large‑model deployments, including Qwen3‑235B‑A22B with TP=8, SPECTRE achieves up to 2.28× speedup over autoregressive decoding and up to an additional 66% relative improvement over the strongest speculative decoding baselines. Talk is cheap, we show you the code: https://github.com/sgl‑project/sglang/pull/22272.
Authors:Zi-Yi Jia, Zi-Jian Cheng, Xin-Yue Zhang, Kun-Yang Yu, Zhi Zhou, Yu-Feng Li, Lan-Zhe Guo
Abstract:
Multi‑model learning has attracted great attention in visual‑text tasks. However, visual‑tabular data, which plays a pivotal role in high‑stakes domains like healthcare and industry, remains underexplored. In this paper, we introduce VT‑Bench, the first unified benchmark for standardizing vision‑tabular discriminative prediction and generative reasoning tasks. VT‑Bench aggregates 14 datasets across 9 domains (medical‑centric, while covering pets, media, and transportation) with over 756K samples. We evaluate 23 representative models, including unimodal experts, specialized visual‑tabular models, general‑purpose vision‑language models (VLMs), and tool‑augmented methods, highlighting substantial challenges of visual‑tabular learning. We believe VT‑Bench will stimulate the community to build more powerful multi‑modal vision‑tabular foundation models. Benchmark: https://github.com/Ziyi‑Jia990/VT‑Bench
Authors:Yuan Fang, Yi Xie, Xuming Ran
Abstract:
Large language models encode vast factual knowledge that inevitably becomes outdated or incorrect after deployment, yet retraining is costly prohibitive, motivating model editing in lifelong settings that updates targeted behavior without harming the rest of the model. One line of work installs new facts by directly modifying base weights through locate‑then‑edit procedures, but accumulated edits progressively disrupt originally preserved knowledge, even with constraint‑based projections. A complementary line leaves base weights intact and routes edits through external memory, but it faces routing challenges and its performance degrades at scale. We propose HoReN, a codebook‑based parameter‑preserving editor with enhanced routing built on three ideas. First, HoReN wraps a single MLP layer with a discrete key‑value codebook, where each entry is interpreted simultaneously as a knowledge‑memory key and a modern Hopfield stored pattern. Second, both keys and queries are projected onto the unit hypersphere so retrieval is governed by angular similarity, removing magnitude‑driven mismatches between an edit prompt and its rephrasings. Third, the query is refined through damped Hopfield attractor dynamics, so paraphrases relax into the correct stored pattern's basin of attraction while unrelated queries remain undisturbed. HoReN achieves well‑edited performance with consistent gains across diverse benchmarks spanning standard ZsRE, structured WikiBigEdit, and unstructured UnKE evaluations. Moreover, HoReN scales to 50K sequential edits on ZsRE with stable overall performance above 0.9, while prior editors collapse or degrade severely before reaching 10K. Our code is available at https://github.com/ha11ucin8/HoReN.
Authors:Natalia Frumkin, Bokun Wang, Hung-Yueh Chiang, Chi-Chih Chang, Mohamed S. Abdelfattah, Diana Marculescu
Abstract:
Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to auto‑regressive (AR) models, offering greater expressive capacity and potential for parallel generation and faster inference. However, open‑source dLLMs remain immature, lagging behind AR models in both efficiency and quality. We identify an underexplored property of dLLMs: token‑wise redundancy in bi‑directional self‑attention. Self‑attention activations are highly correlated across tokens, and temporal changes in query representations can predict redundancy in corresponding key, value, and output activations. We introduce DARE, with two complementary mechanisms: DARE‑KV, which reuses cached key‑value (KV) activations, and DARE‑O, which reuses output activations to reduce redundant computation while preserving quality. DARE achieves up to 1.20x per‑layer latency reduction and reuses up to 87% of attention activations, with negligible degradation on reasoning and code‑generation benchmarks. DARE‑KV and DARE‑O incur average performance drops of only 2.0% and 1.2%, respectively. Combined with techniques such as prefix caching and Fast‑dLLM, DARE provides additive gains without retraining. These results establish token‑wise reuse as an effective strategy for improving the efficiency of diffusion‑based LLMs while preserving generation fidelity. Code: https://github.com/enyac‑group/DARE
Authors:Amman Yusuf, Zhejun Jiang, Mijung Park
Abstract:
Recent work on text diffusion models offers a promising alternative to autoregressive generation, but controlling their safety remains underexplored. Existing safety approaches are geared toward autoregressive models and typically rely on post‑hoc filtering or inference‑time interventions. These are inadequate for effectively addressing safety risks in text diffusion models. We propose the Safety‑Aware Denoiser (SAD), a safety‑guidance framework in text diffusion models. The SAD modifies the iterative denoising process such that the text sample at the final denoising step is steered toward provably safe regions of the text space. This inference‑time method can integrate safety constraints into the denoiser, avoiding computationally expensive retraining of the underlying diffusion model and enabling flexible, lightweight safety guidance. We evaluate the safety of the generated text using the SAD, with respect to hazard taxonomy, memorization, and jailbreak. Experimental results show that SAD substantially reduces unsafe generations while preserving generation quality, diversity, and fluency, outperforming existing methods. These results demonstrate that our safety guidance during denoising provides an effective and scalable mechanism for enforcing safety in text diffusion models.
Authors:Drew Dillon, Kasyap Varanasi
Abstract:
AI coding agents powered by large language models can read codebases and produce functional code, but they routinely violate team‑specific product decisions that are invisible in the source code alone. We introduce a controlled benchmark measuring decision compliance, the rate at which an AI coding agent follows established product, design, and engineering decisions, across 8 realistic software engineering tasks containing 41 weighted decision points. We compare a baseline configuration (Claude Code with codebase access only) against an augmented configuration that adds Brief, a product‑context retrieval system providing spec generation, mid‑build consultation, and retrieval of recorded decisions, persona pain points, customer signals, and competitive intelligence. On identical prompts and the same repository, the augmented configuration achieves 95% decision compliance versus 46% for the baseline, a 49 percentage point improvement. Per‑decision analysis reveals that the baseline achieves 100% compliance on decisions visible in the codebase and 0‑33% on decisions requiring product context, suggesting that product‑context retrieval is a key driver of the improvement. We release the benchmark repository, all 16 pull requests, and scoring harness for independent reproduction.
Authors:Xinchun Su, Chunxu Luo, Lipeng Ma, Yixuan Li, Weidong Yang
Abstract:
Accurate clinical diagnosis requires extensive domain knowledge and complex clinical reasoning capabilities. Although large language models (LLMs) hold great potential for clinical reasoning, their high computational and memory requirements limit their deployment in resource‑constrained environments. Knowledge distillation (KD) can compress LLM capabilities into smaller models, but traditional KD merely transfers superficial answer patterns and fails to preserve the structured reasoning required for reliable diagnosis. To address this, we propose a two‑stage distillation framework, MedThink, designed to cultivate robust clinical reasoning in small language models (SLMs). In the first stage, a teacher LLM screens data and injects domain‑knowledge explanations to fine‑tune a student model, establishing a knowledge foundation. In the second stage, the teacher evaluates the student's errors, generates reasoning chains linking knowledge to correct answers, and refines the student's diagnostic reasoning through a second round of fine‑tuning. We evaluate MedThink on general medical benchmarks and a gastroenterology dataset comprising 955 question‑answer pairs. Experiments demonstrate that MedThink outperforms six distillation strategies in all benchmarks: achieving an improvement of up to 12.7% over the student baseline in general tasks, and reaching a total top accuracy of 56.4% in gastroenterology evaluation. This indicates that iterative distillation centered on reasoning can significantly enhance the diagnostic accuracy and generalization capabilities of SLMs whilst maintaining computational efficiency. Our code and data are publicly available at https://github.com/destinybird/PrecisionBoost.
Authors:Zhen Fang, Wenxuan Huang, Yu Zeng, Yiming Zhao, Shuang Chen, Kaituo Feng, Yunlong Lin, Lin Chen, Zehui Chen, Shaosheng Cao, Feng Zhao
Abstract:
Existing Flow Matching (FM) text‑to‑image models suffer from two critical bottlenecks under multi‑task alignment: the reward sparsity induced by scalar‑valued rewards, and the gradient interference arising from jointly optimizing heterogeneous objectives, which together give rise to a 'seesaw effect' of competing metrics and pervasive reward hacking. Inspired by the success of On‑Policy Distillation (OPD) in the large language model community, we propose Flow‑OPD, the first unified post‑training framework that integrates on‑policy distillation into Flow Matching models. Flow‑OPD adopts a two‑stage alignment strategy: it first cultivates domain‑specialized teacher models via single‑reward GRPO fine‑tuning, allowing each expert to reach its performance ceiling in isolation; it then establishes a robust initial policy through a Flow‑based Cold‑Start scheme and seamlessly consolidates heterogeneous expertise into a single student via a three‑step orchestration of on‑policy sampling, task‑routing labeling, and dense trajectory‑level supervision. We further introduce Manifold Anchor Regularization (MAR), which leverages a task‑agnostic teacher to provide full‑data supervision that anchors generation to a high‑quality manifold, effectively mitigating the aesthetic degradation commonly observed in purely RL‑driven alignment. Built upon Stable Diffusion 3.5 Medium, Flow‑OPD raises the GenEval score from 63 to 92 and the OCR accuracy from 59 to 94, yielding an overall improvement of roughly 10 points over vanilla GRPO, while preserving image fidelity and human‑preference alignment and exhibiting an emergent 'teacher‑surpassing' effect. These results establish Flow‑OPD as a scalable alignment paradigm for building generalist text‑to‑image models. The codes and weights will be released in: https://github.com/CostaliyA/Flow‑OPD .
Authors:Joon Ha Kim, Geon-Woo Kim, Anoop Rachakonda, Daehyeok Kim
Abstract:
Selecting the optimal LLM inference configuration requires evaluation across hardware, serving engines, attention backends, and model architectures, since no single choice performs best across all workloads. Profile‑based simulators are the standard tool, yet they hardcode their operation set to a specific configuration and re‑profile every operation from scratch, making exploration prohibitively expensive. This cost stems from a missing structural understanding: every input dimension of each operation is fixed by the model configuration or determined by the incoming request. Many model‑configuration values (e.g., head size, layer count) recur across models, so the same operation runs in many configurations; a single sweep over the request‑dependent dimensions can serve them all. We present Dooly, which exploits this structure to achieve configuration‑agnostic, redundancy‑aware profiling. Dooly performs a single inference pass, labels each input dimension with its origin via taint propagation, and selectively profiles only operations absent from its latency database; stateful operations such as attention are isolated by reusing the serving engine's own initialization code, eliminating manual instrumentation. It builds latency regression models based on the database, which becomes a drop‑in backend for existing simulators. Across two GPU platforms, three attention backends, and diverse model architectures, Dooly achieves simulation accuracy within 5% MAPE for TTFT and 8% for TPOT while reducing profiling GPU‑hours by 56.4% across 12 models compared to the existing profiling approach. We have open‑sourced Dooly at https://github.com/dooly‑project.
Authors:Vicent Caselles-Ballester, Eloy Martínez-Heras, Giuseppe Pontillo, Zoe Mendelsohn, Elena M. Marrón, Juan Luis García Fernández, Laia Subirats, Jon Stutters, Jeremy Chataway, Frederik Barkhof, Sara Llufriu, Ferran Prados
Abstract:
Multiple sclerosis (MS) expresses substantial clinical and radiological heterogeneity, which poses significant challenges for automatic lesion segmentation. The current deep learning‑based SOTA is highly susceptible to changes in both distribution, e.g., changes in scanner; as well as the structure of inputs, evident in the current divide between cross‑sectional and longitudinal approaches. We introduce TimeLesSeg, a unified contrast‑agnostic framework designed to segment MS lesions regardless of the presence of a temporal dimension in its inputs, with a single convolutional neural network. Our approach models pathological priors through lesion masks, which are processed together with the current scan. Cross‑sectional processing is enabled by exposing the model to training cases where no prior information is available, which are modeled with an empty mask, allowing it to operate seamlessly in both scenarios. To overcome the scarcity and inconsistency of longitudinal datasets, we propose a novel generative pipeline in which patterns of lesion evolution are simulated by stochastically deforming each individual lesion with morphological operations, producing realistic prior timepoints. In parallel, we achieve contrast agnosticism through Gaussian mixture model‑based domain randomization, enabling the network to experience a wide spectrum of intensity profiles. Results on three publicly available and two in‑house datasets show that TimeLesSeg outperforms the contrast‑agnostic state of the art on single‑modality inputs across overlap‑ and distance‑based metrics. In longitudinal processing, our method outperforms SAMSEG, and captures lesion load dynamics more accurately than both the former and LST‑AI. All source code related to the development of TimeLesSeg is available at https://github.com/NeuroADaS‑Lab/TimeLesSeg.
Authors:Giacomo Spigler
Abstract:
Active vision ‑‑ where a policy controls its own gaze during manipulation ‑‑ has emerged as a key capability for imitation learning, with multiple independent systems demonstrating its benefits in the past year. Yet there is no shared benchmark to compare approaches or quantify what active vision contributes, on which task types, and under what conditions. We introduce TAVIS, evaluation infrastructure for active‑vision imitation learning, with two complementary task suites ‑‑ TAVIS‑Head (5 tasks, global search via pan/tilt necks) and TAVIS‑Hands (3 tasks, local occlusion via wrist cameras) ‑‑ on two humanoid torso embodiments (GR1T2, Reachy2), built on IsaacLab. TAVIS provides three evaluation primitives: a paired headcam‑vs‑fixedcam protocol on identical demonstrations; GALT (Gaze‑Action Lead Time), a novel metric grounded in cognitive science and HRI that quantifies anticipatory gaze in learned policies; and procedural ID/OOD splits. Baseline experiments with Diffusion Policy and π_0 reveal that (i) active‑vision generally helps, but benefits are task‑conditional rather than uniform; (ii) multi‑task policies degrade sharply under controlled distribution shifts on both suites; and (iii) imitation alone yields anticipatory gaze, with median lead times comparable to the human teleoperator reference. Code, evaluation scripts, demonstrations (LeRobot v3.0; ~2200 episodes) and trained baselines are released at https://github.com/spiglerg/tavis and https://huggingface.co/tavis‑benchmark.
Authors:Hexuan Deng, Xiaopeng Ke, Yichen Li, Ruina Hu, Dehao Huang, Derek F. Wong, Yue Wang, Xuebo Liu, Min Zhang
Abstract:
Despite the rapid development of AI reviewers, evaluating such systems remains challenging: metrics favor overlap with human reviews over correctness. However, since human reviews often cover only a subset of salient issues and sometimes contain mistakes, they are unreliable as gold references. To address this, we build category‑specific benchmark subsets and skip evaluation when the corresponding human reviews are missing to strengthen Completeness. We also leverage reviewer‑‑author‑‑meta‑review discussions as expert annotations and filter unreliable reviews accordingly to strengthen Correctness. Finally, we introduce CoCoReviewBench, which curates 3,900 papers from ICLR and NeurIPS to enable reliable and fine‑grained evaluation of AI reviewers. Analysis shows that AI reviewers remain limited in correctness and are prone to hallucinations, and highlights reasoning models as more effective reviewers, motivating further directions for improving AI reviewers. Benchmarks and models are available at https://github.com/hexuandeng/CoCoReviewBench.
Authors:Ionut-Vlad Modoranu, Mher Safaryan, Dan Alistarh
Abstract:
With the rise in scale for deep learning models to billions of parameters, the computational cost of fine‑tuning remains a significant barrier to deployment. While Low‑Rank Adaptation (LoRA) has become the standard for parameter‑efficient fine‑tuning, the need to set a predefined, static rank r requires exhaustive grid searches to balance efficiency and performance. Existing rank‑adaptive solutions such as DyLoRA mitigate this by sampling ranks during the training from a predefined distribution. However, they often yield sub‑optimal results at higher ranks due to lack of consistent gradient signals across the full hierarchy of ranks, thus making these methods data‑inefficient. In this paper, we propose MatryoshkaLoRA, a general, Matryoshka‑inspired training framework for LoRA that learns accurate hierarchical low‑rank representations by inserting a fixed, carefully crafted diagonal matrix P between the existing LoRA adapters to scale their sub‑ranks accordingly. By introducing this simple modification, our general framework recovers LoRA and DyLoRA only by changing P and ensures all sub‑ranks embed the available gradient information efficiently. Our MatryoshkaLoRA supports dynamic rank selection with minimal degradation in accuracy. We further propose Area Under the Rank Accuracy Curve (AURAC), a metric that consistently evaluates the performance of hierarchical low‑rank adapters. Our results demonstrate that MatryoshkaLoRA learns more accurate hierarchical low‑rank representations than prior rank‑adaptive approaches and achieves superior accuracy‑performance trade‑offs across ranks on the evaluated datasets. Our code is available at https://github.com/IST‑DASLab/MatryoshkaLoRA.
Authors:Taein Lim, Seongyong Ju, Munhyeok Kim, Hyunjun Kim, Hoki Kim
Abstract:
Large language models (LLMs) are increasingly deployed as autonomous agents in offensive cybersecurity. In this paper, we reveal an interesting phenomenon: different agents exhibit distinct attack patterns. Specifically, each agent exhibits an attack‑selection bias, disproportionately concentrating its efforts on a narrow subset of attack families regardless of prompt variations. To systematically quantify this behavior, we introduce CyBiasBench, a comprehensive 630‑session benchmark that evaluates five agents on three targets and four prompt conditions with ten attack families. We identify explicit bias across agents, with different dominant attack families and varying entropy levels in their attack‑family allocation distributions. Such bias is better characterized as a trait of the agents, rather than a factor associated with the attack success rate. Furthermore, our experiments reveal a bias momentum effect, where agents resist explicit steering toward attack families that conflict with their bias. This forced distribution shift does not yield measurable improvements in attack performance. To ensure reproducibility and facilitate future research, we release an interactive result dashboard at https://trustworthyai.co.kr/CyBiasBench/ and a reproducibility artifact with aggregated session‑level statistics and full evaluation scripts at https://github.com/Harry24k/CyBiasBench.
Authors:Boyang Dai, Chaoqi Chen, Yizhou Yu
Abstract:
Out‑of‑distribution (OOD) detection is crucial for ensuring the reliability of deep learning models. Existing methods mostly focus on regular entangled representations to discriminate in‑distribution (ID) and OOD data, neglecting the rich contextual information within images. This issue is particularly challenging for detecting near‑OOD, as models with simplicity bias struggle to learn discriminative features in disentangled representations. The human visual system can use the co‑occurrence of objects in the natural environment to facilitate scene understanding. Inspired by this, we propose an Object‑Centric OOD detection framework that learns to capture Object CO‑occurrence (OCO) patterns within images. The proposed method introduces a new OOD detection paradigm that understands object co‑occurrence within an image by predicting disentangled representations for the test sample, then adaptively divides patterns into three scenarios based on object co‑occurrence patterns observed in ID training data, and finally performs OOD detection in a divide‑and‑conquer manner. By doing so, OCO can distinguish near‑OOD by considering the semantic contextual relationships present in their images, avoiding the tendency to focus solely on simple, easily learnable regions. We evaluate OCO through experiments across challenging and full‑spectrum OOD settings, demonstrating competitive results and confirming its ability to address both semantic and covariate shifts. Code is released at https://github.com/Michael‑McQueen/OCO.
Authors:Qiyong Zhong, Mao Zheng, Mingyang Song, Xin Lin, Jie Sun, Houcheng Jiang, Xiang Wang, Junfeng Fang
Abstract:
Tool‑integrated reasoning (TIR) is difficult to scale to small language models due to instability in long‑horizon tool interactions and limited model capacity. While reinforcement learning methods like group relative policy optimization provide only sparse outcome‑level rewards. Recently, on‑policy distillation (OPD) has gained popularity by supplying dense token‑level supervision from a teacher on student‑generated trajectories. However, our experiments indicate that applying OPD to TIR leads to a critical failure mode: erroneous tool calls tend to cascade across subsequent reasoning steps, progressively amplifying student‑teacher divergence and rendering the teacher's token‑level supervision increasingly unreliable. To address this, we propose SOD, a step‑wise on‑policy distillation framework for small language model agents, which adaptively reweights distillation strength at each step based on step‑level divergence. Therefore, SOD can attenuate potentially misleading teacher signals in high‑divergence regions while preserving dense guidance in well‑aligned states. Experiments on challenging math, science, and code benchmarks show that SOD achieves up to 20.86% improvement over the second‑best baseline. Notably, our 0.6B student achieves 26.13% on AIME 2025, demonstrating effective transfer of agentic reasoning to lightweight models. Our code is available at https://github.com/YoungZ365/SOD.
Authors:Xuan Zhou, Yanhui Sun, Hantao Yao, Allen He, Yongdong Zhang, Wu Liu
Abstract:
Large‑scale social simulators are essential for studying complex social patterns. Prior work explores hybrid methods to scale up simulations, combining large language models (LLM)‑based agents with numerical agent‑based models (ABM). However, this incurs high latency due to expensive memory retrieval and sequential ABM execution. To address this challenge, we propose GASim, a graph‑accelerated hybrid multi‑agent framework for large‑scale social simulations. For core agents driven by LLM, GASim introduces Graph‑Optimized Memory (GOM) to replace intensive LLM‑based retrieval pipelines with lightweight propagation over a sparse memory graph. For the majority of ordinary agents, GASim employs Graph Message Passing (GMP), substituting sequential ABM execution with parallel updates by fine‑grained feature aggregation and Graph Attention Network. We further introduce Entropy‑Driven Grouping (EDG) that coordinates this hybrid partitioning, leveraging information entropy to dynamically identify emergent core agents situated in information‑diverse neighborhoods. Extensive experiments show that GASim not only delivers a substantial 9.94‑fold end‑to‑end speedup over the traditional hybrid framework but also consumes less than 20% of baseline tokens, significantly reducing costs while preserving strong alignment with real‑world public opinion trends. Our code is available at https://github.com/Jasmine0201/GASim.
Authors:Yuanzhi Wang, Xuhua Ren, Jiaxiang Cheng, Bing Ma, Kai Yu, Tianxiang Zheng, Qinglin Lu, Zhen Cui
Abstract:
Human image animation has witnessed significant advancements, yet generating high‑fidelity hand motions remains a persistent challenge due to their high degrees of freedom and motion complexity. While reinforcement learning from human feedback, particularly direct preference optimization, offers a potential solution, it necessitates the construction of strict preference pairs. However, curating such pairs for dynamic hand regions is prohibitively expensive and often impractical due to frame‑wise inconsistencies. In this paper, we propose Implicit Preference Alignment (IPA), a data‑efficient post‑training framework that eliminates the need for paired preference data. Theoretically grounded in implicit reward maximization, IPA aligns the model by maximizing the likelihood of self‑generated high‑quality samples while penalizing deviations from the pretrained prior. Furthermore, we introduce a Hand‑Aware Local Optimization mechanism to explicitly steer the alignment process toward hand regions. Experiments demonstrate that our method achieves effective preference optimization to enhance hand generation quality, while significantly lowering the barrier for constructing preference data. Codes are released at https://github.com/mdswyz/IPA
Authors:Junfeng Fang, Zhepei Hong, Mao Zheng, Mingyang Song, Gengsheng Li, Houcheng Jiang, Dan Zhang, Haiyun Guo, Xiang Wang, Tat-Seng Chua
Abstract:
On‑policy distillation (OPD) is a powerful paradigm for model alignment, yet its reliance on teacher logits restricts its application to white‑box scenarios. We contend that structured semantic rubrics can serve as a scalable alternative to teacher logits, enabling OPD using only teacher‑generated responses. To prove it, we introduce ROPD, a simple yet foundational framework for rubric‑based OPD. Specifically, ROPD induces prompt‑specific rubrics from teacher‑student contrasts, and then utilizes these rubrics to score the student rollouts for on‑policy optimization. Empirically, ROPD outperforms the advanced logit‑based OPD methods across most scenarios, and achieving up to a 10x gain in sample efficiency. These results position rubric‑based OPD as a flexible, black‑box‑compatible alternative to the prevailing logit‑based OPD, offering a simple yet strong baseline for scalable distillation across proprietary and open‑source LLMs. Code is available at https://github.com/Peregrine123/ROPD_official.
Authors:Ruijie Zhou, Fanxu Meng, Yufei Xu, Tongxuan Liu, Guangming Lu, Muhan Zhang, Wenjie Pei
Abstract:
DeepSeek Sparse Attention (DSA) sets the state of the art for fine‑grained inference‑time sparse attention by introducing a learned token‑wise indexer that scores every prefix token and selects the most relevant ones for the main attention. To remain expressive, the indexer uses many query heads (for example, 64 on DeepSeek‑V3.2) that share the same selected token set; this multi‑head design is precisely what makes the indexer the dominant cost on long contexts. We propose MISA (Mixture of Indexer Sparse Attention), a drop‑in replacement for the DSA indexer that treats its indexer heads as a pool of mixture‑of‑experts. A lightweight router uses cheap block‑level statistics to pick a query‑dependent subset of only a few active heads, and only those heads run the heavy token‑level scoring. This preserves the diversity of the original indexer pool while reducing the per‑query cost from scoring every prefix token with every head to scoring it with only a handful of routed heads, plus a negligible router term computed on a small set of pooled keys. We further introduce a hierarchical variant of MISA that uses the routed pass to keep an enlarged candidate set and then re‑ranks it with the original DSA indexer to recover the final selected tokens almost exactly. With only eight active heads and no additional training, MISA matches the dense DSA indexer on LongBench across DeepSeek‑V3.2 and GLM‑5 while running with eight and four times fewer indexer heads respectively, and outperforms HISA on average. It also preserves fully green Needle‑in‑a‑Haystack heatmaps up to a 128K‑token context and recovers more than 92% of the tokens selected by the DSA indexer per layer. Our TileLang kernel delivers roughly a 3.82 times speedup over DSA's original indexer kernel on a single NVIDIA H200 GPU.
Authors:Simin Huo, Ning LI
Abstract:
Video‑language models (VLMs) face rapid inference costs as visual token counts scale with video length. For example, 32 frames at 448×448 resolution already yield >8,000 visual tokens in Qwen3‑VL, making LLM prefill the dominant throughput bottleneck. Existing methods often rely on global similarity or attention‑guided compression, incurring offsets to their gains. We propose Temporal Token Fusion (TTF), a training‑free, plug‑and‑play pre‑LLM token compression framework that exploits structured temporal redundancy in video. TTF automatically selects an anchor frame, then for each subsequent frame, performs a local window similarity search (e.g.,3× 3), fusing tokens that exceed a threshold. The compressed sequence maintains positional consistency across both prefill and decoding through coordinate realignment, enabling seamless integration with existing VLM pipelines. On Qwen3‑VL‑8B with threshold t=0.70, TTF removes about 67% of visual tokens while retaining 99.5% of the baseline accuracy and introducing only \approx0.16\,GFLOPs of matching overhead. Overall, TTF offers a practical, efficient solution for video understanding. The code is available at \hrefhttps://github.com/Cominder/ttfhttps://github.com/Cominder/ttf
Authors:Kejia Chen, Jiawen Zhang, Yihong Wu, Kewei Gao, Jian Lou, Zunlei Feng, Mingli Song, Ruoxi Jia
Abstract:
Large reasoning models often reach correct answers through flawed intermediate steps, creating a gap between final accuracy and reasoning reliability. Existing alignment strategies address this with external verifiers or massive sampling, limiting scalability. In this work, we introduce CASPO (Confidence‑Aware Step‑wise Preference Optimization), a framework that aligns token‑level confidence with step‑wise logical correctness through iterative Direct Preference Optimization, without training a separate reward model. During inference, we propose Confidence‑aware Thought (CaT), which leverages this calibrated confidence to dynamically prune uncertain reasoning branches with negligible O(V) latency. Experiments across ten benchmarks and multiple model families show that CASPO consistently improves reasoning reliability and inference efficiency. CASPO scales to Qwen3‑8B‑Base and surpasses tree‑search baselines on AIME'24 and AIME'25 without using reward‑model data. We also release a step‑wise dataset with confidence annotations to support fine‑grained analysis of reasoning reliability. Code is available at https://github.com/Thecommonirin/CASPO.
Authors:Yuheng Zhang, Chenlu Ye, Shuowei Jin, Changlong Yu, Wei Xiong, Saurabh Sahu, Nan Jiang
Abstract:
Reinforcement learning, including reinforcement learning with verifiable rewards (RLVR), has emerged as a powerful approach for LLM post‑training. Central to these approaches is the design of the importance sampling (IS) ratio used in off‑policy policy‑gradient estimation. Existing methods face a fundamental bias‑variance dilemma: token‑level IS ratios, as adopted by PPO (Schulman et al., 2017) and GRPO (Shao et al., 2024), introduce bias by ignoring prefix state distribution mismatch; full sequence ratios provide exact trajectory‑level correction but suffer from high variance due to the multiplicative accumulation of per‑token ratios, while GSPO (Zheng et al., 2025) improves numerical stability via length normalization at the cost of deviating from the exact full‑sequence IS correction. In this work, we identify the cumulative token IS ratio, the product of per‑token ratios up to position t, as a theoretically principled solution to this dilemma. We prove that, under the token‑level policy‑gradient formulation, this ratio provides an unbiased prefix correction for each token‑level gradient term and has strictly lower variance than the full sequence ratio. Building on this insight, we propose CTPO (Cumulative Token Policy Optimization), which combines the cumulative token IS ratio with position‑adaptive clipping that scales log‑space clip bounds according to the natural \sqrtt growth of the cumulative log‑ratio. This yields more consistent regularization across token positions. We implement and evaluate CTPO in the tool‑integrated reasoning setting on several challenging mathematical reasoning benchmarks, achieving the best average performance across both model scales compared with strong GRPO and GSPO baselines. Code will be available at https://github.com/horizon‑llm/CTPO.
Authors:Lucas Hu, Ranchi Zhao, Isaac Zhu, Zach Zhang, Hscos Zhang, Hugh Yin, Jason Zhao
Abstract:
In large‑scale reinforcement learning (RL) systems with decoupled Trainer‑Rollout execution, the Trainer must regularly synchronize policy weights to the Rollout side to limit policy staleness. When inter‑node bandwidth is abundant, such synchronization is usually only a small fraction of end‑to‑end cost. As model size grows, however, the communication demand rises rapidly. In bandwidth‑constrained or network‑variable deployments ‑‑ for example, cross‑datacenter or cross‑cluster settings, heterogeneous resource pools, and online RL ‑‑ weight synchronization can become a dominant bottleneck for throughput and tail latency. We observe that, in mainstream large‑model RL training, the locations where parameters actually change are highly sparse at the element level (often 99%+ sparsity). Building on this observation, we propose and implement SparseRL‑Sync, which replaces full‑weight transfers with a lossless sparse update payload (indices and values) that can be exactly reconstructed on the inference side, thereby preserving 100% fidelity. Under a simplified cost model, sparse synchronization reduces the per‑update communication volume from S to approximately S/X; with 99% sparsity (X ~ 100), this yields about a 100x reduction in transmitted data. Combined with appropriate bucketing, SparseRL‑Sync also reduces launch and control‑plane overhead, significantly improving scalability and end‑to‑end efficiency in bandwidth‑limited and highly asynchronous RL settings.
Authors:Sum Kyun Song, Bong Gyun Shin, Jae Yong Lee
Abstract:
Discovering governing differential equations from observational data is a fundamental challenge in scientific machine learning. Existing symbolic regression approaches rely primarily on quantitative metrics; however, real‑world differential equation modeling also requires incorporating domain knowledge to ensure physical plausibility. To address this gap, we propose DoLQ, a method for discovering ordinary differential equations with LLM‑based qualitative and quantitative evaluation. DoLQ employs a multi‑agent architecture: a Sampler Agent proposes dynamic system candidates, a Parameter Optimizer refines equations for accuracy, and a Scientist Agent leverages an LLM to conduct both qualitative and quantitative evaluations and synthesize their results to iteratively guide the search. Experiments on multi‑dimensional ordinary differential equation benchmarks demonstrate that DoLQ achieves superior performance compared to existing methods, not only attaining higher success rates but also more accurately recovering the correct symbolic terms of ground truth equations. Our code is available at https://github.com/Bon99yun/DoLQ.
Authors:Peter Pao-Huang, Xiaojie Qiu, Stefano Ermon
Abstract:
We introduce Flux Matching, a new paradigm for generative modeling that generalizes existing score‑based models to a broader family of vector fields that need not be conservative. Rather than requiring the model to equal the data score, the Flux Matching objective imposes a weaker condition that admits infinitely many vector fields whose stationary distribution is the data. This flexibility enables a class of generative models that cannot be learned under score matching, in which inductive biases, structural priors, and properties of the dynamics can be directly imposed or optimized. We show that Flux Matching performs strongly on high‑dimensional image datasets and, more importantly, that our added freedom unlocks a range of applications including faster sampling, interpretable and mechanistic models, and dynamics that encode directed dependencies between variables. More broadly, Flux Matching opens a new dimension in generative modeling by turning the vector field itself into a design choice rather than a fixed target. Code is available at https://github.com/peterpaohuang/flux_matching.
Authors:Xinchi Zou, Tongzhenzhi Su, Jianjun Li, Yuan Fu, Chang Liu, Zhiying Deng, Zhiwei Shen
Abstract:
Knowledge Graphs (KGs) have proven highly effective for recommendation systems by capturing latent item relationships, while recent integration of Large Language Models (LLMs) has further enhanced semantic understanding and addressed knowledge sparsity issues. Nevertheless, current KG‑and‑LLM‑based methods still face three main limitations: 1) inadequate modeling of implicit semantic relationships beyond explicit KG links; 2) suboptimal single‑channel fusion of ID and LLM embeddings, which often leads to signal interference and blurred representations; and 3) insufficient consideration of user‑item interaction frequency variations in recommendation strategies. To address these challenges, we propose the Dual‑Channel Graph Learning (DCGL) framework, featuring three key innovations: 1) a dual‑channel architecture that structurally decouples rich semantic information from user behavioral patterns, preventing early interference; 2) a multi‑level contrastive learning mechanism that enhances robustness against KG noise through intra‑view contrasts and bridges semantic gaps between channels via inter‑view alignment; and 3) a dynamic fusion mechanism that adaptively balances semantic generalization and behavioral specificity based on interaction frequency, resolving the cascading limitation. Extensive experiments on four real‑world datasets show that DCGL consistently outperforms state‑of‑the‑art methods, yielding substantial improvements in sparse scenarios while maintaining precision for active users. Our code is available at https://github.com/XinchiZou/DCGL.
Authors:Wenyuan Li, Guang Li, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama
Abstract:
A latent world model may achieve accurate short‑horizon prediction while still inducing a latent space that is poorly aligned with planning. A key issue is spatiotemporal mismatch: these models are often trained with local predictive supervision, but deployed for long‑horizon goal‑directed search in latent spaces where Euclidean distance may not reflect what is reachable within a finite action budget. We present the Reachability‑Correction auxiliary objective (RC‑aux), a lightweight correction for this mismatch in reconstruction‑free latent world models. RC‑aux keeps the world‑model backbone unchanged and adds planning‑aligned supervision along two axes. Along the time axis, multi‑horizon open‑loop prediction trains the model beyond one‑step consistency. Along the space axis, budget‑conditioned reachability supervision, together with temporal hard negatives, encourages the latent space to distinguish states that are eventually reachable from those reachable within the current planning horizon. At test time, the learned reachability signal can also be used by a reachability‑aware planner to favor trajectories that are both goal‑directed and attainable under the available budget. We instantiate RC‑aux on LeWorldModel and evaluate it under both continuation‑training and matched‑from‑scratch settings. Across goal‑conditioned pixel‑control tasks and a LIBERO‑Goal extension, RC‑aux improves LeWM‑style planning with modest additional cost. These results suggest that planning with latent world models depends not only on predictive accuracy, but also on whether the learned representation encodes the temporal and geometric structure required by downstream search. The code is available at https://github.com/Guang000/RC‑aux.
Authors:Hao Chen, Zavareh Bozorgasl
Abstract:
Over‑the‑air federated learning (OTA‑FL) reduces uplink latency by exploiting waveform superposition, but conventional analog aggregation schemes typically require instantaneous channel state information (CSI), channel inversion, and coherent phase alignment, which can be difficult to maintain in practical wireless systems. This paper proposes resource‑element energy difference (REED), a noncoherent aggregation primitive for continuous signed updates that avoids instantaneous CSI. REED maps the positive and negative parts of each real‑valued update to transmit energies on two orthogonal resource elements with independent phase dithers, and the server estimates the signed aggregate from their energy difference. With only slow‑timescale calibration of average channel powers, REED is unbiased for the desired signed sum and admits an exact closed‑form variance under Rayleigh fading. We incorporate REED into full‑participation FedAvg and prove a smooth nonconvex stationarity bound. Under an average per‑client energy budget, the aggregation gain can be scheduled so that the REED‑induced perturbation scales quadratically with the local stepsize, yielding the canonical (1/sqrt(T)) stationarity rate. Experiments on MNIST and Fashion‑MNIST demonstrate that REED closely matches clean FedAvg and coherent CSIT aggregation in IID settings, while maintaining stable convergence with a moderate performance degradation under strong data heterogeneity.
Authors:Yi Liu, TingFeng Hui, Wei Zhang, Li Sun, Ningxin Su, Jian Wang, Sen Su
Abstract:
Scalable AI agents training relies on interactive environments that faithfully simulate the consequences of agent actions. Manually crafted environments are expensive to build, brittle to extend, and fundamentally limited in diversity. A promising direction is to replace manually crafted environments with LLM‑simulated counterparts. However, this paradigm hinges on an unexamined core assumption: LLMs can accurately simulate environmental feedback. In practice, LLM‑simulated environments suffer from hallucinations, logical inconsistencies, and silent state drift failures that corrupt agent reward signals and compound the construction costs that the paradigm was designed to eliminate. To address this gap, we propose EnvSimBench with four contributions: 1) We provide the first formal definition and operationalization of Environment Simulation Ability (EnvSim Ability) as a quantifiable research objective. 2) We construct EnvSimBench, a rigorous benchmark covering 400 samples across 167 diverse environments, equipped with verifiable labels and fine‑grained difficulty stratification along three axes. 3) Systematic evaluations reveal that all state‑of‑the‑art language models suffer from a universal state change cliff: they achieve near‑perfect accuracy on tasks when the environment state remains invariant, yet fail catastrophically when multiple states need simultaneous updates. This finding exposes EnvSim Ability as a critical yet largely unaddressed capability gap. 4) We design a constraint‑driven simulation pipeline that substantially reduces hallucination, boosts environment synthesis yield by 6.8%, and cuts costs by over 90%. Overall, EnvSimBench serves as both a diagnostic framework and a practical optimization path for reliable LLM‑based environment simulation, establishing a foundation for scalable agent training. Code and data are available at https://github.com/cookieApril/EnvSimBench
Authors:Akshita Singh, Prabesh Paudel, Siddhartha Roy
Abstract:
We introduce a proxy‑analyzer framework for detecting hallucinations in large language models. Instead of looking inside the generating model, our system reads already‑generated text through a small locally hosted open‑weight model and spots hallucinations using the reader's own internal activations. This works just as well when the generator is a closed API like GPT‑4 as when it is any open‑weight model. We built eighteen features grounded in how transformers process text, covering residual stream norms, per‑head source‑document attention, entropy, MLP activations, logit‑lens trajectories, and three new token‑level grounding statistics. We trained a stacking ensemble on 72,135 samples from five hallucination datasets. We tested across seven analyzer architectures from 0.5 billion to 9 billion parameters: Qwen2.5 at 0.5B and 7B, Gemma‑2 at 2B and 9B, Pythia at 1.4B, and LLaMA‑3 at both 3B and 8B. Across all seven, we consistently beat ReDeEP's token‑level AUC of 0.73 on RAGTruth by 7.4 to 10.3 percentage points. Qwen2.5‑7B reached an F1 of 0.717, just above ReDeEP's 0.713, while Qwen2.5‑0.5B hit 0.706. The most striking finding is how tightly all seven models cluster: AUC spans only 2.3 percentage points across an eighteen‑fold difference in model size. Even more surprising, our 3B LLaMA outperforms our 8B LLaMA on RAGTruth, showing that bigger is not always better even within the same model family. Both RAGTruth and LLM‑AggreFact include outputs from multiple LLM families, so our results are not skewed toward any particular generator.
Authors:Talha Ilyas, Deval Mehta, Zongyuan Ge
Abstract:
Skeleton‑based human activity recognition has achieved strong empirical performance, yet most existing models remain black boxes and difficult to interpret. In this work, we introduce a neurosymbolic formulation of skeleton‑based HAR that reframes action recognition as concept‑driven first‑order logical reasoning over motion primitives. Our framework bridges representation learning and symbolic inference by grounding first‑order logic predicates in learnable spatial and temporal motion concepts. Specifically, we employ a standard spatio‑temporal skeleton encoder to extract latent motion representations, which are then mapped to interpretable concept predicates via a spatio‑temporal concept decoder that explicitly separates pose‑centric and dynamics‑centric abstractions. These concept predicates are composed through differentiable first‑order logic layers, enabling the model to learn human‑readable logical rules that govern action semantics. To impose semantic structure on the learned concepts, we align skeleton representations with LLM‑derived descriptions of atomic motion primitives, establishing a shared conceptual space for perception and reasoning. Extensive experiments on NTU RGB+D 60/120 and NW‑UCLA demonstrate that our approach achieves competitive recognition performance while providing explicit, interpretable explanations grounded in logical structure. Our results highlight neurosymbolic reasoning as an effective paradigm for interpretable spatio‑temporal action understanding. Code: https://github.com/Mr‑TalhaIlyas/REASON
Authors:Seunghan Lee, Jun Seo, Jaehoon Lee, Sungdong Yoo, Minjae Kim, Tae Yoon Lim, Dongwan Kang, Hwanil Choi, SoonYoung Lee, Wonbin Ahn
Abstract:
Temporal knowledge graphs (TKGs) represent time‑stamped relational facts and support a wide range of reasoning tasks over evolving events. However, existing methods produce entity representations that are static at the entity level, in that each representation is a function of learned parameters only and retains no trace of the interactions in which the entity has participated. In this paper, we depart from this static view and propose that each entity be modeled as an adaptive process whose representation is refined every time the entity participates in a fact. To this end, we propose AdaTKG, which maintains a per‑entity memory that is updated with every observed interaction, with the memory accumulating online and predictions improving as more interactions arrive. Specifically, we instantiate the memory update as a learnable exponential moving average governed by a single shared scalar instead of using learnable parameters for each entity, enabling AdaTKG to handle entities unseen during training. Extensive experiments confirm consistent gains over TKG baselines, demonstrating the effectiveness of adaptive memory. Code is publicly available at: https://github.com/seunghan96/AdaTKG.
Authors:Xinyu Zhang, Zhengtong Xu, Yutian Tao, Yeping Wang, Yu She, Abdeslam Boularias
Abstract:
World models predict future transitions from observations and actions. Existing works predominantly focus on image generation only. Visual feature‑based world models, on the other hand, predict future visual features instead of raw video pixels, offering a promising alternative that is more efficient and less prone to hallucination. However, current feature‑based approaches rely on direct regression, which leads to blurry or collapsed predictions in complex interactions, while generative modeling in high‑dimensional feature spaces still remains challenging. In this work, we discover that a new type of latent action representation, which we refer to as Residual Latent Action (RLA), can be easily learned from DINO residuals. We also show that RLA is predictive, generalizable, and encodes temporal progression. Building on RLA, we propose RLA World Model (RLA‑WM), which predicts RLA values via flow matching. RLA‑WM outperforms both state‑of‑the‑art feature‑based and video‑diffusion world models on simulation and real‑world datasets, while being orders of magnitude faster than video diffusion. Furthermore, we develop two robot learning techniques that use RLA‑WM to improve policy learning. The first one is a minimalist world action model with RLA that learns from actionless demonstration videos. The second one is the first visual RL framework trained entirely inside a world model learned from offline videos only, using a video‑aligned reward and no online interactions or handcrafted rewards. Project page: https://mlzxy.github.io/rla‑wm
Authors:Guyue Luo, Qiao Liu
Abstract:
Instrumental‑variable (IV) regression enables causal estimation under endogeneity, but modern IV problems often involve nonlinear structural effects and high‑dimensional covariates. Existing nonlinear IV methods directly learn the causal relation in observed feature space or rely on learned representations within two‑stage or moment‑based procedures, which can struggle when the causal information is embedded in a high‑dimensional representation. We propose BGM‑IV, a latent Bayesian generative modeling approach that reframes nonlinear IV regression as posterior inference in a causally structured latent space. BGM‑IV infers latent components that separately capture shared confounding structure, outcome‑specific variation, treatment‑specific variation, and covariate‑only nuisance information. To account for endogeneity, BGM‑IV replaces the confounded outcome likelihood with an IV‑integrated pseudo‑likelihood that averages over instrument‑induced treatment values within the latent model. Across various benchmark datasets, BGM‑IV remains competitive in the classical low‑dimensional regime and performs best in high‑dimensional covariate regimes. Together, these results show that structured latent generative modeling provides a principled and effective strategy to nonlinear IV estimation with rich covariates. The code of BGM‑IV is available at https://github.com/liuq‑lab/BGM‑IV.
Authors:Christopher Z. Cui, Taylor W. Killian, Prithviraj Ammanabrolu
Abstract:
Reasoning in Large Language Models (LLMs) poses a challenge for oversight as many misaligned behaviors do not surface until reasoning concludes. To address this, we introduce Behavior Cue Reasoning for making LLM reasoning more controllable and monitorable. Behavior Cues are special token sequences that a model is trained to emit immediately before specific implicit and explicit behaviors, acting as dual purpose signal and control levers. When fine‑tuning a weaker external monitor with Reinforcement Learning for reasoning oversight, a compressed view of only information surfaced by Behavior Cues is sufficient signal for the monitor to prune up to 50% of otherwise wasted reasoning tokens in complex math problem solving. When leveraged by an almost optimal rule‑based monitor in an environment where excessive constraint violations results in failure, \ours allows for the recovery of safe actions from 80% of reasoning traces that would otherwise end with the proposal of an unsafe action, more than doubling the success rate from 46% to 96%. Through evaluation across two model families and three domains, we show that \bcreasoning improves reasoning monitorability and controllability with no cost to performance. More broadly, our work progresses scalable oversight by demonstrating how the monitored model itself can be trained to reason more tractably to oversight. Code to be released at https://github.com/christopherzc/text‑games
Authors:Do Xuan Long, Yale Song, Min-Yen Kan, Tomas Pfister, Long T. Le
Abstract:
Synthesizing consistent and coherent long video remains a fundamental challenge. Existing methods suffer from semantic drift and narrative collapse over long horizons. We present A^2RD, an Agentic Auto‑Regressive Diffusion architecture that decouples creative synthesis from consistency enforcement. A^2RD formulates long video synthesis as a closed‑loop process that synthesizes and self‑improves video segment‑by‑segment through a Retrieve‑‑Synthesize‑‑Refine‑‑Update cycle. It comprises three core components: (i) Multimodal Video Memory that tracks video progression across modalities; (ii) Adaptive Segment Generation that switches among generation modes for natural progression and visual consistency; and (iii) Hierarchical Test‑Time Self‑Improvement that self‑improves each segment at frame and video levels to prevent error propagation. We further introduce LVBench‑C, a challenging benchmark with non‑linear entity and environment transitions to stress‑test long‑horizon consistency. Across public and LVBench‑C benchmarks spanning one‑ to ten‑minute videos, A^2RD outperforms state‑of‑the‑art baselines by up to 30% in consistency and 20% in narrative coherence. Human evaluations corroborate these gains while also highlighting notable improvements in motion and transition smoothness.
Authors:Luke J. O'Connor
Abstract:
At the heart of existing language model agents is a fixed orchestrator program responsible for the state transition between consecutive turns. This paper introduces self‑programmed execution (SPE), an agent architecture in which the model completion is itself the orchestrator program, and the harness evaluates this program but does not impose its own orchestration policy. I formalize this idea using agentic machines: an SPE state is one from which a model completion can load any state of an embedded copy of the machine, meaning that it is subject to no fixed turn‑to‑turn orchestration policy. Realizing SPE in practice is nontrivial because the same data is both model context and executable program. I therefore introduce Spell, a Lisp‑based language in which programs can edit and re‑evaluate themselves, and effectful expressions like model invocations are structured such that re‑evaluating an edited program does not replay its side effects. Experiments with existing models, not trained for SPE or Spell, show that frontier models can operate in this regime and accomplish challenging agentic tasks. These results demonstrate how an LM can act as an agent without any fixed orchestration policy, and they raise the question of what self‑orchestration strategies might be learned by a model trained for self‑programmed execution. Code is available at https://github.com/lukejoconnor/spell .
Authors:Maximillian Chen, Xuanming Zhang, Michael Peng, Zhou Yu, Alexandros Papangelis, Yohan Jo
Abstract:
The rise of Internet of Things (IoT) devices in the physical world necessitates voice‑based interfaces capable of handling complex user experiences. While modern Large Language Models (LLMs) already demonstrate strong tool‑usage capabilities, modeling real‑world IoT devices presents a difficult, understudied challenge which combines modeling spatiotemporal constraints with speech inputs, dynamic state tracking, and mixed‑initiative interaction patterns. We introduce MIST (the Multimodal Interactive Speech‑based Tool‑calling Dataset), a synthetic multi‑turn, voice‑driven code generation task that operates over IoT devices. We find that there is a significant gap between open‑ and closed‑weight multimodal LLMs on MIST, and that even frontier closed‑weight LLMs have substantial headroom. We release MIST and an extensible data generation framework to build related datasets in order to facilitate research on mixed‑initiative voice assistants which reason about physical world constraints.
Authors:Fred Zhangzhi Peng, Alexis Fox, Anru R. Zhang, Alexander Tong
Abstract:
Diffusion language models (DLMs) have recently demonstrated capabilities that complement standard autoregressive (AR) models, particularly in non‑sequential generation and bidirectional editing. Although recent work has shown that pretrained autoregressive checkpoints can be converted into diffusion language models, existing recipes primarily transfer parameters through continued denoising training with objective‑ and attention‑level modifications. We instead ask whether the internal representation geometry learned by next‑token prediction can be explicitly preserved during AR‑to‑DLM conversion. We hypothesize that much of the semantic structure learned by AR pretraining can transfer across generation orders, and thus DLM training should be viewed as relearning the decoding path rather than relearning language representations. To investigate this, we introduce REPR‑ALIGN, a representation alignment objective that adapts a bidirectional masked diffusion model to reuse representations from a pretrained AR model of identical architecture. Concretely, we align the hidden states of the DLM to the frozen AR model at every layer using cosine similarity, while optimizing the standard masked denoising objective. This simple alignment, with no adapters and no architectural changes beyond the attention mask, yields up to 4x training acceleration in our setting and is particularly effective in low‑data regimes. Our results suggest that linguistic representations can transfer across generation order, and that representation alignment provides a simple and effective technique for training diffusion language models. Code is available at https://github.com/pengzhangzhi/Open‑dLLM.
Authors:Yuwei Yin, Chuyuan Li, Giuseppe Carenini
Abstract:
Accurately understanding the intent behind speech, conversation, and writing is crucial to the development of helpful Large Language Model (LLM) assistants. This paper introduces IntentGrasp, a comprehensive benchmark for evaluating the intent understanding capability of LLMs. Derived from 49 high‑quality, open‑licensed corpora spanning 12 diverse domains, IntentGrasp is constructed through source datasets curation, intent label contextualization, and task format unification. IntentGrasp contains a large‑scale training set of 262,759 instances and two evaluation sets: an All Set of 12,909 test cases and a more balanced and challenging Gem Set of 470 cases. Extensive evaluations on 20 LLMs across 7 families (including frontier models such as GPT‑5.4, Gemini‑3.1‑Pro, and Claude‑Opus‑4.7) demonstrate unsatisfactory performance, with scores below 60% on All Set and below 25% on Gem set. Notably, 17 out of 20 tested models perform worse than a random‑guess baseline (15.2%) on Gem Set, while the estimated human performance is ~81.1%, showing substantial room for improvement. To enhance such ability, this paper proposes Intentional Fine‑Tuning (IFT), which fine‑tunes the models on the training set in IntentGrasp, yielding significant gains of 30+ F1 points on All Set and 20+ points on Gem Set. Tellingly, the leave‑one‑domain‑out (Lodo) experiments further demonstrate the strong cross‑domain generalizability of IFT, verifying that it is a promising approach to substantially enhancing the intent understanding of LLMs. Overall, by benchmarking and boosting intent understanding ability, this study sheds light on a promising path towards more intentional, capable, and safe AI assistants for human benefits and social good.
Authors:Jiacheng Xu, Heting Gao, Liufei Xie, Zhenchuan Yang, Lijiang Li, Yiting Chen, Bin Zhang, Meng Chen, Chaoyu Fu, Weifeng Zhao, Wenjiang Zhou
Abstract:
Human speech conveys expressiveness beyond linguistic content, including personality, mood, or performance elements, such as a comforting tone or humming a song, which we formalize as role‑playing and singing. We present VITA‑QinYu, the first expressive end‑to‑end (E2E) spoken language model (SLM) that goes beyond natural conversation to support both role‑playing and singing generation. VITA‑QinYu adopts a hybrid speech‑text paradigm that extends interleaved text‑audio modeling with multi‑codebook audio tokens, a design enabling richer paralinguistic representation while preserving a clear separation between modalities to avoid interference. We further develop a comprehensive data generation pipeline to synthesize a total of 15.8K hours of natural conversation, role‑playing, and singing data for training. VITA‑QinYu demonstrates superior expressiveness, outperforming peer SLMs by 7 percentage points on objective role‑playing benchmarks, and surpassing peer models by 0.13 points on a 5‑point MOS scale for singing. Simultaneously, it achieves state‑of‑the‑art conversational accuracy and fluency, exceeding prior SLMs by 1.38 and 4.98 percentage points on the C3 and URO benchmarks, respectively. We open‑source our code and models and provide an easy‑to‑use demo with full‑stack support for streaming and full‑duplex interaction.
Authors:Zhifeng Gu, Yuqi Wang, Bing Wang
Abstract:
Relative spatial relations provide a compact representation of spatial structure and are fundamental to relative spatial reasoning in 3D layout generation. Recent works leverage Multimodal Large Language Models (MLLMs) to infer such relations, but the inferred relations are often unreliable and are typically handled with post‑hoc heuristics. In this paper, we propose R^3L, a general framework that improves the reliability and consistency of relative spatial reasoning for 3D layout generation. Our key motivation is that multi‑hop reasoning requires repeated reference‑frame transformations, which accumulate errors in inferred relations and lead to semantic and metric drift. To mitigate this, we propose invariant spatial decomposition to break coupled relation chains, and consistent spatial imagination to promote self‑consistency through an imagine‑and‑revise loop. We further introduce supportive spatial optimization to ease pose optimization via global‑to‑local coordinate re‑parameterization. Extensive experiments across diverse scene types and instructions demonstrate that R^3L produces more physically feasible and semantically consistent layouts. Notably, our analysis shows that resolving frame‑induced inconsistencies is crucial for reliable multi‑hop relative spatial reasoning. The code is available at https://github.com/Neal2020GitHub/R3L.
Authors:Arash Shahmansoori
Abstract:
We present the EΔ‑MHC‑Geo Transformer, a novel architecture that unifies Manifold‑Constrained Hyper‑Connections (mHC), Deep Delta Learning (DDL), and the Cayley transform to obtain input‑adaptive, unconditionally orthogonal residual connections. Unlike DDL, whose Householder operator is orthogonal only at β\in \0,2\, our Data‑Dependent Cayley rotation Q(x)=(I+(β/2)A(x))^‑1(I‑(β/2)A(x)) preserves orthogonality for all β and all inputs. To handle negation, an eigenvalue ‑1 case that Cayley provably excludes, we introduce the EΔ‑MHC‑Geo Hybrid, which combines Cayley rotation with Householder reflection via a learned operator‑selection gate X'=γ(X)Q(X)X+(1‑γ(X))H_2(X)X. A midpoint‑collapse regularizer, 4γ(1‑γ), encourages boundary gate decisions, where each selected component is orthogonal. In matched‑parameter comparisons, with approximately 1.79M parameters per model and mean +/‑ standard deviation over 3 seeds, against four baselines including the concurrent JPmHC, EΔ‑MHC‑Geo achieves the best long‑horizon stability, 1.9x over JPmHC and 3.8x over GPT; the best near‑π rotation loss, 4.5x over JPmHC on single‑plane; strong norm preservation, with 0.001 mean deviation; and 0.96 negation cosine alignment in a diagnostic reflection probe, all with 33% fewer layers. While JPmHC's wider representation excels on pure rotation, its finite Cayley residual mixer excludes an exact λ=‑1 operator and has no reflection branch, motivating our hybrid approach for accessing both connected components of O(n).
Authors:Jon-Paul Cacioli
Abstract:
Aggregate metacognitive quality scores mask within‑model variation across MMLU benchmark domains. We administered 1,500 MMLU items (250 per domain, under an a priori six‑domain grouping) to 33 frontier LLMs from eight model families and computed Type‑2 AUROC per model‑domain cell using verbalized confidence (0‑100). Total observations: 47,151. Every model with above‑chance aggregate monitoring showed non‑trivial domain‑level variation. Applied/Professional knowledge was reliably the easiest benchmark domain to monitor (mean AUROC = .742, ranked top‑2 in 21 of 33 models); Formal Reasoning and Natural Science were reliably the hardest (one of the two ranked bottom‑2 in 27 of 33 models). The three middle domains were statistically indistinguishable (Kendall's W = .164). A subject‑level coherence analysis (within‑domain similarity ratio = 0.95) confirms the six‑domain grouping is a pragmatic benchmark taxonomy, not a validated latent construct. Within‑family profile‑shape clustering is significant for Anthropic, Google‑Gemini, and Qwen (permutation p < .0001) but not DeepSeek, Google‑Gemma, or OpenAI. Gemma 4 31B showed a +.202 AUROC improvement over Gemma 3 27B. Three models classified Invalid on binary KEEP/WITHDRAW probes produced normal profiles under verbalized confidence, confirming probe‑format specificity. Bootstrap 95% CIs on 198 cells have median width .199. Split‑half aggregate stability r = .893; profile‑level split‑half is weaker (grand median r = .184). These results show stable benchmark‑domain variation obscured by aggregate metrics, and support benchmark‑stage domain screening as a step before deployment in specific application areas.
Authors:Zhexuan Wang, Xuebo Liu, Li Wang, Zifei Shan, Yutong Wang, Zhenxi Song, Min Zhang
Abstract:
Large language model (LLM)‑based Multi‑agent systems (MAS) have shown promise in tackling complex collaborative tasks, where agents are typically orchestrated via role‑specific prompts. While the quality of these prompts is pivotal, jointly optimizing them across interacting agents remains a non‑trivial challenge, primarily due to the misalignment between local agent objectives and holistic system goals. To address this, we introduce MASPO, a novel framework designed to automatically and iteratively refine prompts across the entire system. A core innovation of MASPO is its joint evaluation mechanism, which assesses prompts not merely by their local validity, but by their capacity to facilitate downstream success for successor agents. This effectively bridges the gap between local interactions and global outcomes without relying on ground‑truth labels. Furthermore, MASPO employs a data‑driven evolutionary beam search to efficiently navigate the high‑dimensional prompt space. Extensive empirical evaluations across 6 diverse tasks demonstrate that MASPO consistently outperforms state‑of‑the‑art prompt optimization methods, achieving an average accuracy improvement of 2.9. We release our code at https://github.com/wangzx1219/MASPO.
Authors:Nithin Somasekharan, Rabi Pathak, Manushri Dhanakoti, Tingwen Zhang, Ling Yue, Andy Zhu, Shaowu Pan
Abstract:
Recent LLM‑based agents have closed substantial portions of the scientific discovery loop in software‑only machine‑learning research, in chemistry, and in biology. Extending the same loop to high‑fidelity physical simulators is harder, because solver completion does not imply physical validity and many failure modes appear only in field‑level imagery rather than in solver logs. We present AI CFD Scientist, an open‑source AI scientist for computational fluid dynamics (CFD) that, to our knowledge, is the first to span literature‑grounded ideation, validated execution, vision‑based physics verification, source‑code modification, and figure‑grounded writing within a single inspectable workflow. Three coupled pathways cover parameter sweeps within a fixed solver, case‑local C++ library compilation for new physical models, and open‑ended hypothesis search against a reference comparator, all running on OpenFOAM through Foam‑Agent. At the center of the framework is a vision‑language physics‑verification gate that inspects rendered flow fields before any result is accepted, rerun, or written into a manuscript. On five tasks under a shared GPT‑5.5 backbone, AI CFD Scientist autonomously discovers a Spalart‑Allmaras runtime correction that reduces lower‑wall Cf RMSE against DNS by 7.89% on the periodic hill at Reh=5600; under matched LLM cost, two strong general AI‑scientist baselines (ARIS, DeepScientist) execute partial CFD workflows but lack the domain‑specific validity gates needed to convert runs into defensible scientific claims; and a controlled planted‑failure ablation shows that the vision‑language gate detects 14 of 16 silent failures missed by solver‑level checks. Code, prompts, and run artifacts are released at https://github.com/csml‑rpi/cfd‑scientist.
Authors:Yiqiao Jin, Yiyang Wang, Lucheng Fu, Yijia Xiao, Yinyi Luo, Haoxin Liu, B. Aditya Prakash, Josiah Hester, Jindong Wang, Srijan Kumar
Abstract:
Self‑distillation (SD) offers a promising path for adapting large language models (LLMs) without relying on stronger external teachers. However, SD in autoregressive LLMs remains challenging because self‑generated trajectories are free‑form, correctness is task‑dependent, and plausible rationales can still provide unstable or unreliable supervision. Existing methods mainly examine isolated design choices, leaving their effectiveness, roles, and interactions unclear. In this paper, we propose UniSD, a unified framework to systematically study self‑distillation. UniSD integrates complementary mechanisms that address supervision reliability, representation alignment, and training stability, including multi‑teacher agreement, EMA teacher stabilization, token‑level contrastive learning, feature matching, and divergence clipping. Across six benchmarks and six models from three model families, UniSD reveals when self‑distillation improves over static imitation, which components drive the gains, and how these components interact across tasks. Guided by these insights, we construct UniSDfull, an integrated pipeline that combines complementary components and achieves the strongest overall performance, improving over the base model by +5.4 points and the strongest baseline by +2.8 points. Extensive evaluation highlights self‑distillation as a practical and steerable approach for efficient LLM adaptation without stronger external teachers.
Authors:Guanrou Yang, Tian Tan, Qian Chen, Zhikang Niu, Yakun Song, Ziyang Ma, Yushen Chen, Zeyu Xie, Tianrui Wang, Yifan Yang, Wenxi Chen, Qi Chen, Wenrui Liu, Shan Yang, Xie Chen
Abstract:
Integrating speech understanding and generation is a pivotal step toward building unified speech models. However, the different representations required for these two tasks currently pose significant compatibility challenges. Typically, semantics‑oriented features are learned from self‑supervised learning (SSL), and acoustic‑oriented features from reconstruction. Such fragmented representations hinder the realization of truly unified speech systems. We present WavCube, a compact continuous latent derived from an SSL speech encoder that simultaneously supports speech understanding, reconstruction, and generation. WavCube employs a two‑stage training scheme. Stage 1 trains a semantic bottleneck to filter off‑manifold redundancy that makes raw SSL features intractable for diffusion. Stage 2 injects fine‑grained acoustic details via end‑to‑end reconstruction, while a semantic anchoring loss ensures the representation remains grounded within its original semantic manifold. Comprehensive experiments show that WavCube closely approaches WavLM performance on SUPERB despite an 8x dimensional compression, attains reconstruction quality on par with existing acoustic representations, delivers state‑of‑the‑art zero‑shot TTS performance with markedly faster training convergence, and excels in speech enhancement, separation, and voice conversion tasks on the SUPERB‑SG benchmark. Systematic ablations reveal that WavCube's two‑stage recipe resolves two intrinsic flaws of SSL features for generative modeling, paving the way for future unified speech systems. Codes and checkpoints are available at https://github.com/yanghaha0908/WavCube.
Authors:Tao Liu, Hao Yan, Mengting Chen, Taihang Hu, Zhengrong Yue, Zihao Pan, Jinsong Lan, Xiaoyong Zhu, Ming-Ming Cheng, Bo Zheng, Yaxing Wang
Abstract:
Step distillation has become a leading technique for accelerating diffusion models, among which Distribution Matching Distillation (DMD) and Consistency Distillation are two representative paradigms. While consistency methods enforce self‑consistency along the full PF‑ODE trajectory to steer it toward the clean data manifold, vanilla DMD relies on sparse supervision at a few predefined discrete timesteps. This restricted discrete‑time formulation and mode‑seeking nature of the reverse KL divergence tends to exhibit visual artifacts and over‑smoothed outputs, often necessitating complex auxiliary modules ‑‑ such as GANs or reward models ‑‑ to restore visual fidelity. In this work, we introduce Continuous‑Time Distribution Matching (CDM), migrating the DMD framework from discrete anchoring to continuous optimization for the first time. CDM achieves this through two continuous‑time designs. First, we replace the fixed discrete schedule with a dynamic continuous schedule of random length, so that distribution matching is enforced at arbitrary points along sampling trajectories rather than only at a few fixed anchors. Second, we propose a continuous‑time alignment objective that performs active off‑trajectory matching on latents extrapolated via the student's velocity field, improving generalization and preserving fine visual details. Extensive experiments on different architectures, including SD3‑Medium and Longcat‑Image, demonstrate that CDM provides highly competitive visual fidelity for few‑step image generation without relying on complex auxiliary objectives. Code is available at https://github.com/byliutao/cdm.
Authors:Yangfu Zhu, Zitong Han, Nianwen Ning, Yuting Wei, Yuandong Wang, Hang Feng, Zhenzhou Shao
Abstract:
Multimodalpersonalityunderstandingplaysacriticalroleinhuman centered artificial intelligence. Previous work mainly focus on learn‑ing rich multimodal representations for video personality under standing. However, they often suffer from potential harm caused by subject bias (e.g., observable age and unobservable mental states), as subjects originate from diverse demographic backgrounds. Learn ing such spurious associations between multimodal features and traits may lead to unfair personality understanding. In this work, weconstruct aStructural Causal Model (SCM)toanalyze theimpact of these biases from a causal perspective, and propose a novel Dual Causal Adjustment Network (DCAN) to mitigate the interference of subject attributes on personality understanding. Specifically, we design a Back‑door Adjustment Causal Learning (BACL) module to block spurious correlations from observable demographic factors via a prototype‑based confounder dictionary, and subsequently ap ply a Front‑door Adjustment Causal Learning (FACL) module to ad dress latent and unobservable biases throughalearnedmediatordic tionary intervention, thereby achieving causal disentanglement of representations for deconfounded reasoning. Importantly, we con struct a Demographic‑annotated Multimodal Student Personality (DMSP) dataset to support the analysis and discussion of fairness related factors. Extensive experiments on the benchmark dataset CFI‑V2 and our DMSPdataset demonstrate that DCAN consistently improves prediction accuracy, reaching 92.11% and 92.90%, respec tively. Meanwhile, the improvementsinthefairnessmetricsofequal opportunity and demographic parity are 6.57% and 7.97% on CFI‑V2, and 15.38% and 20.06% on the DMSP dataset. Our code and DMSP dataset are available at https://github.com/Sabrina‑han/DCAN
Authors:Jie Yu, Song Qiu
Abstract:
AI research agents have shown strong potential in automating literature search and manuscript refinement, yet most assume a clear and actionable initial input, operating only after a research question has been made explicit. In contrast, human research often begins with tacit friction, a sense of misalignment before a question can be formed. We introduce InciteResearch, a multi‑agent framework designed to make a researcher's implicit understanding explicit, inspectable, and actionable. InciteResearch decomposes the logical chain of Socratic questioning and distributes it across the entire pipeline that: (1) Elicits a structured five‑dimensional researcher profile state anchored by specific friction points from vague, even domain‑unrelated inputs; (2) Violates hidden assumptions by maximizing the feasibility‑novelty product with enforcing a 7‑stage causal derivation trace; and (3) check whether the proposed method is a Necessary consequence of the reframed insight. We further introduce TF‑Bench, the first benchmark for tacit‑to‑explicit research assistance that distinguishes domain‑related from domain‑unrelated inspirations across four scientific modes. On TF‑Bench, InciteResearch achieves leapfrogging gains over a prompt‑based baseline (novelty/impact from 3.671/3.806 to 4.250/4.397), shifting generated proposals from recombination to architectural insight. Our work demonstrates that AI can serve as an extension of thinking itself, rather than merely automating downstream execution.
Authors:Thomas Bömer, Bastian Amberg, Max Disselnmeyer, Anne Meyer
Abstract:
Many real‑world optimization problems consist of multiple tightly coupled subproblems whose solutions must be coordinated to achieve high overall performance. However, existing large language model driven automated heuristic design approaches are limited to single‑problem settings. In this paper, we propose CoupleEvo. CoupleEvo proposes three evolutionary coordination strategies to evolve heuristics for coupled optimization problems: the sequential strategy evolves heuristics for one subproblem after the other; the iterative strategy alternates the evolution of heuristics for different subproblems over successive generations; and the integrated strategy evolves heuristics for all problems simultaneously. The approach is evaluated on two representative coupled optimization problems. Experimental results show that decomposition‑based strategies (sequential and iterative) provide more stable convergence and higher solution quality, while the integrated evolution strategy suffers from increased search complexity and variability. These findings highlight the importance of coordinating evolutionary search across interdependent subproblems and demonstrate the potential of LLM‑driven heuristic design for complex coupled optimization problems. The code is available: https://github.com/tb‑git‑kit‑research/CoupleEvo.
Authors:Zhaoyang Jiang, Zhizhong Fu, Yunsoo Kim, Jiacong Mi, Zicheng Li, Xuanqi Peng, Honghan Wu
Abstract:
Deployed language and vision‑language models must decide, on each input, whether to answer directly, retrieve evidence, defer to a stronger model, or abstain. Contrary to the common monotonicity intuition, greater per‑input expressivity is not uniformly beneficial in finite samples: under identical strict cross‑validation, different benchmarks prefer different controller classes. This reflects a finite‑sample limitation of instance‑level uncertainty signals, which can be exhausted at a distribution‑dependent scale. We organize controllers into a nested lattice of four classes: fixed actions, partition routers, instance‑level controllers, and prior‑gated controllers, ordered by complexity. We prove a regime theory that turns three data‑estimable bottlenecks into a class choice: how much improvement is possible beyond the best fixed action, whether there are enough samples for instance‑level controllers to make reliable decisions, and how much improvement a coarse partition router can recover when instance‑level signal is unreliable. The resulting Bernstein‑tight threshold has a matching information‑theoretic lower bound, and strict nested cross‑validation provably selects a near‑best class. Across SMS‑Spam, HallusionBench, A‑OKVQA, and FOLIO, the predicted class matches the empirical winner; the prior‑gated controller wins on TextVQA when OCR tokens supply a label‑free prediction‑time prior. Code is available at https://github.com/Anonymous‑Awesome‑Submissions/Regime‑Theory.
Authors:Shouvik Sardar, Sourish Das
Abstract:
Cocoa (Theobroma cacao) is a critical cash crop for millions of smallholder farmers in West Africa, where Cocoa Swollen Shoot Virus Disease (CSSVD) and anthracnose cause devastating yield losses. Automated disease detection from leaf images is essential for early intervention, yet deploying such systems in resource‑constrained settings demands models that are small, fast, and require no internet connectivity. Existing edge‑deployable plant disease systems rely on end‑to‑end deep learning without uncertainty quantification, while Bayesian methods for edge devices focus on hardware‑level inference architectures rather than agricultural applications. We bridge this gap with TinyBayes, the first framework to combine a closed‑form Bayesian classifier with a mobile‑grade computer vision pipeline for crop disease detection. Our pipeline uses YOLOv8‑Nano (5.9 MB) for lesion localisation, MobileNetV3‑Small (3.5 MB) for feature extraction, and the Jacobi prior; a Bayesian method that provides a closed form non‑iterative estimators via projection, for the classification. The Jacobi‑DMR (Distributed Multinomial Regression) classifier adds only 13.5 KB to the pipeline, bringing the total model size within 9.5 MB, while achieving 78.7% accuracy on the Amini Cocoa Contamination Challenge dataset and enabling end‑to‑end CPU inference under 150 ms per image. We benchmark against seven classifiers including Random Forest, SVM, Ridge, Lasso, Elastic Net, XGBoost, and Jacobi‑GP, and demonstrate that the Jacobi‑DMR offers the best trade‑off between accuracy, model size, and inference speed for edge deployment. We have proved the asymptotic equivalence and consistency, asymptotic normality and the bias correction of Jacobi‑DMR. All data and codes are available here: https://github.com/shouvik‑sardar/TinyBayes
Authors:Chengjie Wang, Jingzheng Wu, Xiang Ling, Tianyue Luo, Chen Zhao
Abstract:
Large language models (LLMs) are now largely involved in software development workflows, and the code they generate routinely includes third‑party library (TPL) imports annotated with specific version identifiers. These version choices can carry security and compatibility risks, yet they have not been systematically studied. We present the first large‑scale measurement study of version‑level risk in LLM‑generated Python code, evaluating 10 LLMs on PinTrace, a curated benchmark of 1,000 Stack Overflow programming tasks. LLMs tend to specify version identifiers when directly prompted at 26.83%‑95.18%, while down to 6.45%‑59.19% in creating a manifest file directly. Among the specified versions, 36.70%‑55.70% of tasks contain at least one known CVE, and 62.75%‑74.51% of them carry Critical or High severity ratings. In 72.27%‑91.37% of cases, the associated CVEs were publicly disclosed before the model's knowledge cutoff. The statistics show all models converge on the same small set of risky release versions, indicating a systemic bias rather than isolated model error. Static compatibility rates range from 19.70% to 63.20%, with installation failure as the dominant cause. The dynamic test cases confirm the pattern by 6.49%‑48.62% pass rates. Further experiments confirm that these failures are attributable to version selection rather than code quality, and that externally anchored version constraints substantially reduce both vulnerability exposure and compatibility failures. Our findings reveal LLM version selection as a first‑class, previously overlooked risk surface in LLM‑based development. We disclosed these findings to the community of the evaluated models, and several confirmed the issue. All the code and dataset have been released for open science at https://github.com/dw763j/PinTrace.
Authors:Guanmeng Xian, Ning Yang, Philip S. Yu
Abstract:
Multimodal recommender systems exploit visual and textual signals to alleviate data sparsity, but this also makes them more vulnerable to evasion‑based promotion attacks. Existing defenses are largely limited to single‑modal settings and mainly focus on poisoning‑based threats, leaving evasion‑based threats underexplored. In this work, we first identify a cross‑modal gradient mismatch under the multi‑user promotion setting, where visual and textual perturbations are optimized in inconsistent directions due to the dominance of distinct user groups. This phenomenon dilutes the attack effectiveness and leads robust training to underestimate worst‑case risks. To address this issue, we propose Untargeted Adversarial Training with Multimodal Coordination (UAT‑MC). UAT‑MC tackles the challenge of unknown targeted items in evasion‑based attacks (as opposed to poisoning‑based attacks) by treating all items as potential targets, and introduces a gradient alignment mechanism to explicitly correct this mismatch. This design ensures synchronized perturbations across modalities, thereby maximizing adversarial strength for robust training. Extensive experiments demonstrate that UAT‑MC significantly improves robustness against promotion attacks while maintaining acceptable recommendation performance under the defense‑accuracy trade‑off. Code is available at https://github.com/gmXian/UAT‑MC.
Authors:Jinge Wu, Hongjian Zhou, Mingde Zeng, Jiayuan Zhu, Junde Wu, Jiazhen Pan, Sean Wu, Honghan Wu, Fenglin Liu, David A. Clifton
Abstract:
Building a deep research agent today is an exercise in glue code: the same backbone evaluated on the same benchmark can report different accuracies in different papers because harness and tool registry all differ, and integrating a new foundation model into a comparable evaluation surface costs weeks of model‑specific engineering. We call this the per‑paper engineering tax and release BioMedArena, an open‑source toolkit that not only alleviates it but also provides an arena for fair comparison of different foundation models when evaluating them as deep‑research agents. BioMedArena decouples six layers of biomedical agent evaluation ‑‑ benchmark loading, tool exposure, tool selection, execution mode, context management, and scoring ‑‑ and exposes 147 biomedical benchmarks and 75 biomedical tools across 9 functional families. Adding a new model, benchmark, or tool reduces to registering a few‑line provider adapter. We further provide 6 agent harnesses with 6 context‑management strategies, which provide 12 backbones with competitive research capabilities and significantly improved performance, achieving state‑of‑the‑art (SOTA) results on 8 representative biomedical benchmarks, with an average lift of +15.03 percentage points over prior SOTA. The toolkit, configurations, and per‑task traces are available at https://github.com/AI‑in‑Health/BioMedArena
Authors:Shiao Wang, Xiao Wang, Duoqing Yang, Wenhao Zhang, Bo Jiang, Lin Zhu, Yonghong Tian, Bin Luo
Abstract:
Despite significant progress, RGB‑based trackers remain vulnerable to challenging imaging conditions, such as low illumination and fast motion. Event cameras offer a promising alternative by asynchronously capturing pixel‑wise brightness changes, providing high dynamic range and high temporal resolution. However, existing event‑based trackers often neglect the intrinsic spatial sparsity and temporal density of event data, while relying on a single fixed temporal‑window sampling strategy that is suboptimal under varying motion dynamics. In this paper, we propose an event sparsity‑aware tracking framework that explicitly models event‑density variations across multiple temporal scales. Specifically, the proposed framework progressively injects sparse, medium‑density, and dense event search regions into a three‑stage Vision Transformer backbone, enabling hierarchical multi‑density feature learning. Furthermore, we introduce a sparsity‑aware Mixture‑of‑Experts module to encourage expert specialization under different sparsity patterns, and design a dynamic pondering strategy to adaptively adjust the inference depth according to tracking difficulty. Extensive experiments on FE240hz, COESOT, and EventVOT demonstrate that the proposed approach achieves a favorable trade‑off between tracking accuracy and computational efficiency. The source code will be released on https://github.com/Event‑AHU/OpenEvTracking.
Authors:Zixuan Wang, Yuchen Yan, Hongxing Li, Teng Pan, Dingming Li, Ruiqing Zhang, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen
Abstract:
While long‑horizon agentic tasks require language agents to perform dozens of sequential decisions, training such agents with reinforcement learning remains challenging. We identify two root causes: credit misattribution, where correct early actions are penalized due to terminal failures, and sample inefficiency, where scarce successful trajectories result in near‑total loss of learning signal. We introduce a milestone‑guided policy learning framework, BEACON, that leverages the compositional structure of long‑horizon tasks to ensure precise credit assignment. BEACON partitions trajectories at milestone boundaries, applies temporal reward shaping within segments to credit partial progress, and estimates advantages at dual scales to prevent distant failures from corrupting the evaluation of local actions. On ALFWorld, WebShop, and ScienceWorld, BEACON consistently outperforms GRPO and GiGPO. Notably, on long‑horizon ALFWorld tasks, BEACON achieves 92.9% success rate, nearly doubling GRPO's 53.5%, while improving effective sample utilization from 23.7% to 82.0%. These results establish milestone‑anchored credit assignment as an effective paradigm for training long‑horizon language agents. Code is available at https://github.com/ZJU‑REAL/BEACON.
Authors:Keisuke Kamahori, Shihang Li, Simon Peter, Baris Kasikci
Abstract:
For years, we have built LLM serving systems like any other critical infrastructure: a single general‑purpose stack, hand‑tuned over many engineer‑years, meant to support every model and workload. In this paper, we take the opposite bet: a multi‑agent loop that automatically synthesizes bespoke serving systems for different usage scenarios. We propose VibeServe, the first agentic loop that generates entire LLM serving stacks end‑to‑end. VibeServe uses an outer loop to plan and track the search over system designs, and an inner loop to implement candidates, check correctness, and measure performance on the target benchmark. In the standard deployment setting, where existing stacks are highly optimized, VibeServe remains competitive with vLLM, showing that generation‑time specialization need not come at the cost of performance. More interestingly, in non‑standard scenarios, VibeServe outperforms existing systems by exploiting opportunities that generic systems miss in six scenarios involving non‑standard model architectures, workload knowledge, and hardware‑specific optimizations. Together, these results suggest a different point in the design space for infrastructure software: generation‑time specialization rather than runtime generality. Code is available at https://github.com/uw‑syfi/vibe‑serve.
Authors:Maxim Fishman, Brian Chmiel, Ron Banner, Daniel Soudry, Boris Ginsburg
Abstract:
Training large language models at 4‑bit precision is critical for efficiency. We show that nGPT, an architecture that constrains weights and hidden representations to the unit hypersphere, is inherently more robust to low‑precision arithmetic. This removes the need for interventions‑such as applying random Hadamard transforms and performing per‑tensor scaling calculations‑to preserve model quality, and it enables stable end‑to‑end NVFP4 training. We validate this approach on both a 1.2B dense model and hybrid (Mamba‑Transformer) MoE models of up to 3B/30B parameters. We trace this robustness to the dot product: while quantization noise remains largely uncorrelated in both standard and normalized architectures, the signal behaves differently. In nGPT, the hypersphere constraint enhances weak positive correlations among the element‑wise products, leading to a constructive accumulation of the signal across the hidden dimension while the noise continues to average out. This yields a higher effective signal‑to‑noise ratio and a flatter loss landscape, with the effect strengthening as the hidden dimension grows, suggesting increasing advantages at scale. A reference implementation is available at https://github.com/anonymous452026/ngpt‑nvfp4
Authors:Hugo Cazaux, Eyjólfur Ingi Ásgeirsson, Hlynur Stefánsson
Abstract:
Synthetic data has transformed language model training, yet its role in time series forecasting remains poorly understood. We present a large‑scale empirical study: nine experiment groups, 4,218 runs systematically evaluating synthetic time series augmentation across five architectures, four synthetic signals and seven datasets. The effect is sharply architecture‑conditional: channel‑mixing models (TimesNet, iTransformer) benefit in the majority of trials, while channel‑independent models (DLinear, PatchTST) are consistently degraded. In selected low‑resource settings the gains are striking: TimesNet trained on only 10% of Weather data with synthetic augmentation surpasses the full‑data baseline (4 of 16 sparsity‑dataset combinations). Averaged across all architectures, augmentation hurts in 67% of trials. We further find that only the Seasonal‑Trend generator reliably helps across the tested benchmarks, and that hard curriculum switching is actively harmful (+24% MSE degradation). These results provide concrete, actionable guidelines on how to use synthetic data: use synthetic augmentation with channel‑mixing architectures, use gradual annealing schedules, and treat low‑resource augmentation as architecture‑ and dataset‑dependent. Code is available at \hrefhttps://github.com/hugoiscracked/synthetic‑ts/tree/main
Authors:Xiao Wang, Ziwen Wang, Weizhe Kong, Wentao Wu, Yuehang Li, Aihua Zheng, Chenglong Li, Jin Tang
Abstract:
Vehicle Re‑identification (Re‑ID) aims to retrieve the most similar image to a given query from images captured by non‑overlapping cameras. Extending vehicle Re‑ID from image‑only queries to text‑based queries enables retrieval in real‑world scenarios where only a witness description of the target vehicle is available. In this paper, we propose PFCVR, a Part‑level Fine‑grained Cross‑modal Vehicle Retrieval model for text‑to‑image vehicle re‑identification. PFCVR constructs locally paired images and texts at the part level and introduces learnable part‑query tokens that aggregate both part‑specific and full‑sentence context before aligning with visual part features. On top of this explicit local alignment, a bi‑directional mask recovery module lets each modality reconstruct its masked content under the guidance of the other, implicitly bridging local correspondences into global feature alignment. Furthermore, we construct a new large‑scale dataset called T2I‑VeRW, which contains 14,668 images covering 1,796 vehicle identities with fine‑grained part‑level annotations. Experimental results on the T2I‑VeRI dataset show that PFCVR achieves 29.2% Rank‑1 accuracy, improving over the best competing method by +3.7% percentage points. On the newly proposed T2I‑VeRW benchmark, PFCVR achieves 55.2% Rank‑1 accuracy, outperforming a comprehensive set of recent state‑of‑the‑art methods. Source code will be released on https://github.com/Event‑AHU/Neuromorphic_ReID
Authors:Haoyun Tang, Haodong Cui, Keyao Xu, Kun Wang, Zhandong Mei
Abstract:
World models enable model‑based planning through learned latent dynamics, but imagined rollouts become unstable as the planning horizon grows or the dynamics distribution shifts. We argue that this instability reflects two missing structures in planner‑facing latents: history‑conditioned memory for approximate Markov completeness, and geometric organization that separates configuration, momentum, and task semantics. We propose HaM‑World (HMW), a structured world model that decomposes the latent state into a canonical (q, p) subspace and a context subspace c, while using Mamba selective state‑space memory as the history‑conditioned input to the same latent dynamics. Within this interface, (q, p) evolves through an energy‑derived Hamiltonian vector field plus learnable residual/control dynamics, while c captures semantic, dissipative, and non‑conservative factors. This gives the planner a single latent state shared by dynamics prediction, reward/value estimation, imagined rollouts, and CEM action search. On four DeepMind Control Suite tasks, HaM‑World reaches the highest Avg. AUC (117.9, +9.5%), reduces long‑horizon rollout error to 45% of a strong baseline model, and wins 11/12 k in 3,5,7 MSE cells. Under 12 OOD perturbations spanning dynamics shifts, action delay, and observation masking, HaM‑World achieves the highest return in every condition, with average OOD‑return gains of 10.2% on Finger Spin and 13.6% on Reacher Easy. Mechanism diagnostics further show bounded action‑free Hamiltonian‑energy drift, structured energy variation under policy rollouts, and coherent control‑induced energy transfer, supporting the intended Soft‑Hamiltonian dynamics design.
Authors:Xinyu Wang, Changzhi Sun, Lian Cheng, Yuanbin Wu, Dell Zhang, Xiaoling Wang, Xuelong Li
Abstract:
Verifiers are crucial components for enhancing modern LLMs' reasoning capability. Typicalverifiers require resource‑intensive superviseddataset construction, which is costly and faceslimitations in data diversity. In this paper, wepropose LOVER, an unsupervised verifier regularized by logical rules. LOVER treats theverifier as a binary latent variable, utilizinginternal activations and enforcing three logical constraints on multiple reasoning paths:negation consistency, intra‑group consistency,and inter‑group consistency (grouped by thefinal answer). By incorporating logical rulesas priors, LOVER can leverage unlabeled examples and is directly compatible with any offthe‑shelf LLMs. Experiments on 10 datasetsdemonstrate that LOVER significantly outperforms unsupervised baselines, achieving performance comparable to the supervised verifier(reaching its 95% level on average). The sourcecode is publicly available at https://github.com/wangxinyufighting/llm‑lover.
Authors:Sankarshana Venugopal, Mohammad Mostafavi, Jonghyun Choi
Abstract:
Diffusion‑based image‑to‑image (I2I) translation excels in high‑fidelity generation but suffers from slow sampling in state‑of‑the‑art Diffusion Bridge Models (DBMs), often requiring dozens of function evaluations (NFEs). We introduce DBMSolver, a training‑free sampler that exploits the semi‑linear structure of DBM's underlying SDE and ODE via exponential integrators, yielding highly‑efficient 1st‑ and 2nd‑order solutions. This reduces NFEs by up to 5x while boosting quality (e.g., FID drops 53% on DIODE at 20 NFEs vs. 2nd‑order baseline). Experiments on inpainting, stylization, and semantics‑to‑image tasks across resolutions up to 256x256 show DBMSolver sets new SOTA efficiency‑quality tradeoffs, enabling real‑world applicability. Our code is publicly available at https://github.com/snumprlab/dbmsolver.
Authors:Hanyu Gao, Bin Cao, Yunyue Su, Tong-Yi Zhang, Qiang Liu
Abstract:
Multiphase powder X‑ray diffraction (PXRD) analysis remains a fundamental bottleneck in structure identification, as real‑world synthesis often produces complex mixtures whose constituent phases (components) cannot be reliably disentangled. While recent advances in representation‑based crystal retrieval and generation suggest the possibility of inferring structures directly from PXRD, existing approaches largely assume single‑phase inputs and break down in multiphase settings. Here, we present XDecomposer, a prior‑free framework for joint decomposition and identification of multiphase XRD patterns without requiring candidate phase lists, structural templates, or prior knowledge of phase number. We formulate multiphase diffraction analysis as a set prediction problem, where the model infers an unordered set of phase‑resolved components, their mixture proportions, and corresponding structural representations within a unified architecture. A phase‑query‑driven decomposition mechanism, together with diffraction‑consistent physical reconstruction, enables accurate source separation while preserving crystallographic fidelity. Extensive experiments on both simulated and experimental datasets show that XDecomposer substantially improves reconstruction accuracy and phase identification across diverse chemical systems, while maintaining strong generalization to unseen mixtures. These results provide a practical route toward data‑driven, source‑resolved multiphase XRD analysis and reduce long‑standing dependence on prior‑guided iteratively phase matching. The code is openly available at https://github.com/Licht0812/XDecomposer
Authors:Xing Xu, Xu Wang, Yudong Zhang, Huilin Zhao, Zhengyang Zhou, Yang Wang
Abstract:
Air‑quality forecasting models are commonly evaluated on regional, preprocessed, and normalized datasets, where missing observations are removed or artificially completed. Such protocols simplify comparison but hide the conditions that dominate real monitoring networks: uneven global coverage, structured missingness, heterogeneous pollutant scales, and deployment cost. We introduce AirQualityBench, a global multi‑pollutant benchmark designed to evaluate forecasting models under these realistic conditions. The benchmark contains hourly observations from 3,720 monitoring stations over 2021‑‑2025, covers six major pollutants, and preserves provider‑native observation masks. Rather than imputing a dense data tensor, AirQualityBench exposes missingness as part of the forecasting problem and reports errors on valid future observations after inverse transformation to physical concentration scales. Evaluating representative spatio‑temporal models under this unified protocol shows that strong performance on sanitized datasets does not reliably transfer to global, fragmented monitoring streams. AirQualityBench therefore serves as a realistic testbed for scalable, mask‑aware, and physically interpretable air‑quality forecasting. All benchmark data, code, evaluation scripts, and baseline implementations are available at \hrefhttps://github.com/Star‑Learning/AirQualityBenchGitHub.
Authors:Guanyu Zhu, Jining Luan, Hanwen Du, Xinyu Fang, Sibo Xu, Ersheng Ni, Hongji Li, Jincheng Fang, Ronghao Chen, Huacan Wang, Xuanqi Lan, Yongxin Ni, Yiqi Sun, Youhua Li
Abstract:
Auto‑bidding is a crucial task in real‑time advertising markets, where policies must optimize long‑horizon value under delivery constraints (e.g., budget and CPA). Existing methods for auto‑bidding rely on compact numerical state representations: while they can implicitly capture delivery dynamics, they offer limited support for explicitly representing and controlling high‑level intent, evolving feedback, and operator‑style strategic guidance in real campaigns. Meanwhile, Large Language Models (LLMs) offer a powerful method for encoding semantic information, it remains unclear when LLMs help and how to integrate them without sacrificing numerical precision. Through systematic preliminary studies, we find that (1) LLM embeddings contain bidding‑relevant cues yet cannot replace numerical features, and (2) gains emerge only with careful semantic‑‑numeric integration rather than naive concatenation. Motivated by these findings, we propose SemBid, a novel auto‑bidding framework that injects LLM‑encoded semantics into offline bidding trajectories at the token level. SemBid introduces three semantic inputs: Task, History, and Strategy. It injects these semantics as tokens alongside numerical trajectory tokens and uses self‑attention to integrate them, improving controllability and generalization across objectives. Across diverse scenarios and budget regimes, SemBid outperforms competitive baselines from offline RL and generative sequence modeling, with more consistent gains in overall performance, constraint satisfaction, and robustness. Our code is available at: \hrefhttps://github.com/AlanYu04/SemBid‑KDD2026\textcolorbluehere.
Authors:Maosen Zhang, Jianshuo Dong, Boting Lu, Wenyue Li, Xiaoping Zhang, Tianwei Zhang, Han Qiu
Abstract:
Retrieval‑Augmented Generation (RAG) enables large language models (LLMs) to leverage external knowledge, but also exposes valuable RAG databases to leakage attacks. As RAG systems grow more complex and LLMs exhibit stronger instruction‑following capabilities, existing studies fall short of systematically assessing RAG leakage risks. We present LeakDojo, a configurable framework for controlled evaluation of RAG leakage. Using LeakDojo, we benchmark six existing attacks across fourteen LLMs, four datasets, and diverse RAG systems. Our study reveals that (1) query generation and adversarial instructions contribute independently to leakage, with overall leakage well approximated by their product; (2) stronger instruction‑following capability correlates with higher leakage risk; and (3) improvements in RAG faithfulness can introduce increased leakage risk. These findings provide actionable insights for understanding and mitigating RAG leakage in practice. Our codebase is available at https://github.com/yeasen‑z/LeakDojo.
Authors:Yu Feng, Zhen Tian, Haoran Luo, Xie Yu, Diancheng Cheng, Haoyue Zheng, Shuai Lyu, Ping Zong, Lianyuan Li, Xin Ge, Yifan Zhu
Abstract:
Domain Incremental Learning is a critical scenario that requires models to continuously adapt to new data domains without retraining. However, domain shifts often cause severe performance degradation. To address this, we propose Hybrid Energy‑Distance Prompt, a domain‑incremental framework inspired by Helmholtz free energy. HEDP introduces an energy regularization loss to enhance the separability of domain representations and a hybrid energy‑distance weighted mechanism that fuses energy‑based and distance‑based cues to improve domain selection and generalization. Experiments on multiple benchmarks, including CORe50, show that HEDP achieves superior performance on unseen domains with a 2.57% accuracy gain, effectively mitigating catastrophic forgetting and enhancing open‑world adaptability. Our code is \hrefhttps://github.com/dannis97500/HEDP/available here.
Authors:Mei Wu, Wenchao Weng, Wenxin Su, Wenjie Tang, Wei Zhou
Abstract:
In recent years, the integration of non‑topological space modeling with temporal learning methods has emerged as an effective approach for capturing spatio‑temporal information in non‑Euclidean graphs. However, most existing methods rely on static underlying graph structures, which are inadequate for capturing the continuously expanding and evolving patterns in streaming traffic networks. To address this challenge, we propose a simple yet efficient dual‑branch continual learning framework for traffic prediction, named CoMemNet. The fast‑converging Online branch undertakes the primary prediction tasks, while the momentum‑updated Target branch extracts historical information using Wasserstein Distance features to create a Dynamic Contrastive Sampler (DC Sampler). This sampler selects a node set with significant dynamic network feature changes for training, effectively mitigating the issue of catastrophic forgetting. Additionally, the backbone incorporates a lightweight Node‑Adaptive Temporal Memory Buffer (TMRB‑N) to consolidate old knowledge through memory replay and address the risk of memory explosion. Finally, we provide two newly curated open‑source datasets. Experimental results demonstrate that CoMemNet achieves state‑of‑the‑art (SOTA) performance across all three large‑scale real‑world datasets. The code is available at: https://github.com/meiwu5/CoMemNet.
Authors:Zhe Liu, Zonghao Ying, Wenxin Zhang, Quanchen Zou, Deyue Zhang, Dongdong Yang, Xiangzheng Zhang, Hao Peng
Abstract:
With the rapid evolution of foundation models, Large Language Model (LLM) agents have demonstrated increasingly powerful tool‑use capabilities. However, this proficiency introduces significant security risks, as malicious actors can manipulate agents into executing tools to generate harmful content. While existing defensive mechanisms are effective, they frequently suffer from the over‑refusal problem, where increased safety strictness compromises the agent's utility on benign tasks. To mitigate this trade‑off, we propose \textscSafeHarbor, a novel framework designed to establish precise decision boundaries for LLM agents. Unlike static guidelines, \textscSafeHarbor extracts context‑aware defense rules through enhanced adversarial generation. We design a local hierarchical memory system for dynamic rule injection, offering a training‑free, efficient, and plug‑and‑play solution. Furthermore, we introduce an information entropy‑based self‑evolution mechanism that continuously optimizes the memory structure through dynamic node splitting and merging. Extensive experiments demonstrate that \textscSafeHarbor achieves state‑of‑the‑art performance on both ambiguous benign tasks and explicit malicious attacks, notably attaining a peak benign utility of 63.6% on GPT‑4o while maintaining a robust refusal rate exceeding 93% against harmful requests. The source code is publicly available at https://github.com/ljj‑cyber/SafeHarbor.
Authors:Bing Wang, Ximing Li, Changchun Li, Jinjin Chi, Gang Niu, Masashi Sugiyama
Abstract:
Recently, the prominent performance of large language models (LLMs) has been largely driven by multi‑task instruct‑tuning. Unfortunately, this training paradigm suffers from a key issue, named cross‑task interference, due to conflicting gradients over shared parameters among different tasks. Some previous methods mitigate this issue by isolating task‑specific parameters, e.g., task‑specific neuron selection and mixture‑of‑experts. In this paper, we empirically reveal that the cross‑task interference still exists for the existing solutions because of many parameters also shared by different tasks, and accordingly, we propose a novel solution, namely Basic Abilities Decomposition for multi‑task Instruct‑Tuning (BADIT). Specifically, we empirically find that certain parameters are consistently co‑activated, and that co‑activated parameters naturally organize into base groups. This motivates us to analogize that LLMs encode several orthogonal basic abilities, and that any task can be represented as a linear combination of these abilities. Accordingly, we propose BADIT that decomposes LLM parameters into orthogonal high‑singular‑value LoRA experts representing basic abilities, and dynamically enforces their orthogonality during training via spherical clustering of rank‑1 components. We conduct extensive experiments on the SuperNI benchmark with 6 LLMs, and empirical results demonstrate that BADIT can outperform SOTA methods and mitigate the degree of cross‑task interference.
Authors:Xinjie Shen, Rongzhe Wei, Peizhi Niu, Haoyu Wang, Ruihan Wu, Eli Chien, Bo Li, Pin-Yu Chen, Pan Li
Abstract:
Hidden malicious intent in multi‑turn dialogue poses a growing threat to deployed large language models (LLMs). Rather than exposing a harmful objective in a single prompt, increasingly capable attackers can distribute their intent across multiple benign‑looking turns. Recent studies show that even modern commercial models with advanced guardrails remain vulnerable to such attacks despite advances in safety alignment and external guardrails. In this work, we address this challenge by detecting the earliest turn at which delivering the candidate response would make the accumulated interaction sufficient to enable harmful action. This objective requires precise turn‑level intervention that identifies the harm‑enabling closure point while avoiding premature refusal of benign exploratory conversations. To further support training and evaluation, we construct the Multi‑Turn Intent Dataset (MTID), which contains branching attack rollouts, matched benign hard negatives, and annotations of the earliest harm‑enabling turns. We show that MTID helps enable a turn‑level monitor TurnGate, which substantially outperforms existing baselines in harmful‑intent detection while maintaining low over‑refusal rates. TurnGate further generalizes across domains, attacker pipelines, and target models. Our code is available at https://github.com/Graph‑COM/TurnGate.
Authors:Gabriel Jeanson, David-Alexandre Duclos, William Larrivée-Hardy, Noé Cochet, Matěj Boxan, Anthony Deschênes, François Pomerleau, Philippe Giguère
Abstract:
Sustainable forest management relies on precise species composition mapping, yet traditional ground surveys are labour‑intensive and geographically constrained. While Uncrewed Aerial Vehicles (UAVs) offer scalable data collection, the transition to deep learning‑based interpretation is bottlenecked by the severe scarcity of expert‑annotated imagery, particularly in complex, visually heterogeneous regeneration zones. This paper addresses the dual challenges of data scarcity and extreme class imbalance in the semantic segmentation of fine‑grained forest regeneration species by providing a scalable framework that reduces reliance on manual photo‑interpretation for high‑resolution, millimetre‑level aerial imagery. Importantly, we leverage the large‑scale vision‑language Nano Banana Pro model to simultaneously generate high‑fidelity images and their corresponding pixel‑aligned semantic masks from prompts. We introduce WilDReF‑Q‑V2, an expansion of a natural forest dataset with 13 977 new unlabelled and 50 labelled real images, as well as the Gen4Regen dataset, featuring 2101 pairs of synthetic images and semantic masks. Our methodology integrates real‑world data with AI‑generated images, highlighting that AI‑generated data is highly complementary to real‑world data, with unified training yielding an F1 score improvement of over 15 %pt compared to purely supervised baselines. Furthermore, we demonstrate that even small quantities of prompt‑generated data significantly improve performance for underrepresented species, some of which saw per‑species F1 score gains of up to 30 %pt. We conclude that vision‑language models can serve as agile data generators, effectively bootstrapping perception tasks for niche AI domains where expert labels are scarce or unavailable. Our datasets, source code, and models will be available at https://norlab‑ulaval.github.io/gen4regen.
Authors:Sai Babu Patarlapalli, Surya Teja Avvaru
Abstract:
Post‑training quantization makes large reasoning models practical under tight memory and latency budgets, but it can distort the online signals that drive adaptive test‑time compute allocation. Under a fixed cap on the number of newly generated tokens, miscalibrated confidence can lead to harmful early halting: the model may surface a plausible final line while the underlying reasoning is still wrong, or the controller may stop before the trace has stabilized. We study this interaction for greedy 4‑bit inference and propose BitCal‑TTS, a lightweight runtime controller that combines (i) inexpensive online proxies for token‑level uncertainty and reasoning‑trace stability, (ii) a bit‑conditioned confidence rescaling that is conservative at low nominal precision, and (iii) a bit‑aware post‑marker confirmation horizon designed for GSM8K‑style structured outputs. The method requires no fine‑tuning of the base model and integrates with standard Hugging Face 4‑bit inference using forward hooks for logits and last‑layer hidden states. On small evaluation shards of GSM8K with Qwen2.5 Instruct models, BitCal‑TTS improves exact‑match accuracy over a non‑bit‑aware adaptive baseline at the 7B and 14B scales while preserving substantial token savings relative to fixed‑budget decoding. At a token cap of B=512, on the evaluation shards we report (N=54 for 7B and N=35 for 14B; not the full GSM8K test set), accuracy gains are +3.7 points (7B) and +2.8 points (14B), with the premature‑stop rate falling from 14.8% to 11.1% on 7B and from 17.1% to 11.4% on 14B. We report Wilson 95% confidence intervals throughout and explicitly discuss the limited statistical power of the partial‑shard comparisons. We release code and figure‑generation scripts to support full reproduction.
Authors:Othmane Kabal, Mounira Harzallah, Fabrice Guillet, Hideaki Takeda, Ryutaro Ichise
Abstract:
Graph Self‑Supervised Learning (GSSL) offers a powerful paradigm for learning graph representations without labeled data. However, existing work assumes clean, manually curated graphs. Recent advances in NLP enable the large‑scale automatic extraction of knowledge graphs from text, opening new opportunities for GSSL while introducing substantial real‑world noise. This type of noise remains largely unexplored, as prior robustness studies typically rely on synthetic perturbations. To address this gap, we present the first comprehensive evaluation of GSSL methods on text‑driven graphs for unsupervised term typing. We introduce Noise‑Aware Text‑Driven Graph GSSL (NATD‑GSSL), a unified framework that combines automatic graph construction, graph refinement, and GSSL. Our evaluation follows a dual‑graph protocol that contrasts a noisy graph derived from MedMentions with a clean Unified Medical Language System (UMLS) reference graph, aligned through a shared gold standard. Our results reveal variability in robustness across both pretext tasks and Graph Neural Network (GNN) architectures. Relation reconstruction is highly sensitive to noise and benefits from well‑defined schemas, whereas feature reconstruction is considerably more robust, achieving performance comparable to clean‑graph settings. Contrastive objectives are generally less affected by noise but depend strongly on alignment with downstream tasks. GNN architecture also plays a critical role: bidirectional relational message‑passing designs are better suited to noisy, text‑driven graphs, while unidirectional relational ones perform best on clean graphs. Overall, NATD‑GSSL provides practical guidance for applying GSSL to real‑world, noisy graphs and achieves up to a 7% improvement over pretrained language model baselines. All code and benchmarks are publicly available at https://github.com/OthmaneKabal/MC2GAE.
Authors:Junran Wang, Xinjie Shen, Zehao Jin, Pan Li
Abstract:
As Vision‑Language Models (VLMs) are increasingly deployed as autonomous cognitive cores for embodied assistants, evaluating their privacy awareness in physical environments becomes critical. Unlike digital chatbots, these agents operate in intimate spaces, such as homes and hospitals, where they possess the physical agency to observe and manipulate privacy‑sensitive information and artifacts. However, current benchmarks remain limited to unimodal, text‑based representations that cannot capture the demands of real‑world settings. To bridge this gap, we present ImmersedPrivacy, an interactive audio‑visual evaluation framework that simulates realistic physical environments using a Unity‑based simulator. ImmersedPrivacy evaluates physically grounded privacy awareness across three progressive tiers that test a model's ability to identify sensitive items in cluttered scenes, adapt to shifting social contexts, and resolve conflicts between explicit commands and inferred privacy constraints. Our evaluation of 12 state‑of‑the‑art models reveals consistent deficits. In cluttered scenes, all models exhibit monotonic performance decay as scene complexity grows due to perceptual deficit. When social context shifts, no model exceed 65% selection accuracy. Under conflicting commands, the best model gemini‑3.1‑pro perfectly balances task completion and privacy preservation in only 51% of cases. These findings reveal that current VLMs in the physical world suffer from perceptual fragility and fail to let their knowledge of privacy cues govern their situated behavior. Our code and data is available at https://github.com/immersed‑privacy/immersed‑privacy .
Authors:Kaifeng He, Xiaojun Zhang, Peiliang Cai, Mingwei Liu, Yanlin Wang, Chong Wang, Kaifeng Huang, Bihuan Chen, Xin Peng, Zibin Zheng
Abstract:
Large language models (LLMs) frequently generate defective outputs in code generation tasks, ranging from logical bugs to security vulnerabilities. While these generation failures are often treated as model‑level limitations, empirical evidence increasingly traces their root causes to imperfections within the training corpora. Yet, the specific mechanisms linking training data quality issues to generated code quality issues remain largely unmapped. This paper presents a systematic literature review of 114 primary studies to investigate how training data quality issues propagate into code generation. We establish a unified taxonomy that categorizes generated code quality issues across nine dimensions and training data quality issues into code and non‑code attributes. Based on this taxonomy, we formalize a causal framework detailing 18 typical propagation mapping mechanisms. Furthermore, we synthesize state‑of‑the‑art detection and mitigation techniques across the data, model, and generation lifecycles. The reviewed literature reveals a clear methodological shift: quality assurance is transitioning from reactive, heuristic‑based post‑generation filtering toward proactive, data‑centric governance and closed‑loop repair. Finally, we identify open challenges and outline research directions for developing reliable LLMs for code through integrated data curation and continuous evaluation. Our repository is available at https://github.com/SYSUSELab/From‑Data‑to‑Code.
Authors:Alan L. McCann
Abstract:
AI systems increasingly synthesize executable structure at runtime: LLMs generate programs, agents construct workflows,self‑improving systems modify their own behavior. In classical homoiconic and staged languages, the transition from coderepresentation to execution is unrestricted. eval is a language primitive, not a governed operation. We argue that ingovernedintelligent systems, this transition is an authority amplification: it converts symbolic structure into executableauthority andmust be mediated like any other effect. We present governed metaprogramming, a language design where programrepresentations(machine forms) are first‑class values, form manipulation is pure computation, and materialization (the transition fromform toexecutable machine) is a governed effect subject to structural inspection. The governance system analyzes the proposedprogram'scapability requirements, policy compliance, and resource estimates before permitting execution. We formalize twojudgments: pureform evaluation (which emits no directives) and governed materialization (which emits exactly one governed directive). Weprovethree properties: purity of form manipulation, the no‑bypass theorem, and boundary preservation. We implement the designinMashinTalk, a DSL for AI workflows compiling to BEAM bytecode, and report on integration with 454 existingmachine‑checked Rocqtheorems. The central contribution is reclassifying eval from a language primitive into a governed effect.
Authors:Ahmed Abdelmuniem Abdalla Mohammed
Abstract:
Standard transformer architectures apply the same number of layers to every token regardless of contextual difficulty. We present Token‑Selective Attention (TSA), a learned per‑token gate on residual updates between consecutive transformer blocks. Each gate is a lightweight two‑layer multi‑layer perceptron (MLP) that produces a continuous halting probability, making the mechanism end‑to‑end differentiable with 1.7% parameter overhead and no changes to the base architecture. Notably, TSA learns difficulty‑proportional routing without any explicit depth pressure: even at λ=0 (no depth regularisation), the task‑loss gradient alone drives the router to skip 20% of token‑layer operations. On character‑level language modeling, TSA saved 14‑23% of token‑layer operations (TLOps) across Tiny‑Shakespeare and enwik8 at <0.5% quality loss. At matched efficiency, TSA achieved 0.7% lower validation loss than early exit, and the learned routing transfers directly to inference‑time sparse execution for real wall‑clock speedup.
Authors:Avhishek Biswas, Apala Pramanik, Eylem Ekici, Mehmet C. Vuran
Abstract:
Millimeter‑wave (mmWave) frequencies promise multi‑gigabit connectivity for vehicle‑to‑everything (V2X) networks, but face challenges in terms of severe path loss and mobility‑related beam misalignment. Reliable V2X connectivity requires fast, double‑directional beam alignment. However, existing methods suffer from high training overhead and limited generalization to unseen scenarios. This paper presents VIsion‑based BEamforming(VIBE), a hybrid model‑based, closed‑loop, learning architecture for real‑time double‑directional mmWave beam management primed by camera sensing. VIBE fuses machine learning, model‑based reasoning, and closed‑loop RF feedback to balance beam‑pair establishment latency with link quality. VIBE bypasses exhaustive training overhead and accelerates link establishment by leveraging camera observations to reduce the beam‑search space. Lightweight beam refinement and offset tracking mechanisms adaptively refine beams in response to dynamic application requirements. VIBE is implemented and evaluated across online indoor/outdoor testbeds, public datasets, and real‑time vehicular experiments, demonstrating strong generalization capabilities, making it suitable for real‑time V2X communication. Comparisons with 5G NR hierarchical beamforming show that VIBE consistently maintains lower outage rates. Furthermore, VIBE outperforms state‑of‑the‑art end‑to‑end ML models for beam selection when evaluated on public datasets and achieves outage rates as low as 1.1‑1.4 %. The results show that a hybrid model‑based, closed‑loop learning architecture is better suited for real‑world mmWave vehicular connectivity than end‑to‑end trained ML models. For reproducibility, we publish our code to https://github.com/UNL‑CPN‑Lab/Look‑Once‑Beam‑Twice.
Authors:Xiaoliang Fan, Jiarui Chen, Zhuodong Liu, Ziqi Yang, Peixuan Xu, Ruimin Shen, Junhui Liu, Jianzhong Qi, Cheng Wang
Abstract:
Embodied AI (EAI) systems are rapidly transitioning from simulations into real‑world domestic and other sensitive environments. However, recent EAI solutions have largely demonstrated advancements within isolated stages such as instruction, perception, planning and interaction, without considering their coupled privacy implications in high‑frequency deployments where privacy leakage is often irreversible. This position paper argues that optimizing these components independently creates a systemic privacy crisis when deployed in sensitive settings, thereby advancing the position that privacy in EAI is a life cycle‑level architectural constraint rather than a stage‑local feature. To address these challenges, we propose Secure Privacy Integration in Next‑generation Embodied AI (SPINE), a unified privacy‑aware framework that treats privacy as a dynamic control signal governing cross‑stage coupling throughout the entire EAI life cycle. SPINE decomposes the EAI pipeline into various stages and establishes a multi‑criterion privacy classification matrix to orchestrate contextual sensitivity across stage boundaries. We conduct preliminary simulation and real‑world case studies to conceptually validate how privacy constraints propagate downstream to reshape system behavior, illustrating the insufficiency of fragmented privacy patches and motivating future research directions into secure yet functional embodied AI systems. We detail the SPINE framework and case studies at https://github.com/rminshen03/EAI_Privacy_Position.
Authors:Vasilis Perifanis, Foteini Nikolaidou, Nikolaos Pavlidis, Panagiotis Thomakos, Andreas Sendros
Abstract:
Accurate forecasting of electric vehicle (EV) charging demand is critical for grid stability, infrastructure planning, and real‑time charging optimization. In this work, we study the problem of early prediction of charging demand, where the total energy of a session is estimated using only information available at plug‑in time and during the first minutes of charging. This enables actionable decisions while the session is still in progress, which is of direct importance for EV network operators. We construct a session‑level dataset from the Adaptive Charging Network (ACN), combining session metadata with early‑window charging measurements, and derive tabular features capturing user intent, temporal patterns, and initial charging behavior. We focus on a single operational depot, Caltech, and model intra‑depot heterogeneity through station‑level client partitions while evaluating multiple model families in a federated learning (FL) setting. Our results show that federated models can approach centralized predictive performance while keeping data in‑depot, enabling privacy‑enhanced training across distributed charging infrastructures. Overall, we demonstrate that reliable demand estimates can be obtained early in the session with minimal data, and that FL provides a practical pathway toward scalable and privacy‑aware analytics for EV charging networks. Code is available at https://github.com/Indigma‑Innovations/federated‑learning‑ev‑charging‑demand.
Authors:Yin Jun Phua
Abstract:
Inductive Logic Programming (ILP) learns interpretable logical rules from data. Existing methods are transductive: their learned parameters are bound to specific predicates and require retraining for each new task. We introduce Neural Rule Inducer (NRI), a pretrained model for zero‑shot rule induction. Rather than encoding literal identities, NRI represents literals using domain‑agnostic statistical properties such as class‑conditional rates, entropy, and co‑occurrence, which generalize across variable identities and counts without retraining. The model consists of a statistical encoder and a parallel slot‑based decoder. Parallel decoding preserves the permutation invariance of logical disjunction; an autoregressive decoder would instead impose an arbitrary clause order. Product T‑norm relaxation makes rule execution differentiable, allowing end‑to‑end training on prediction accuracy alone. We evaluate NRI on rule recovery, robustness to label noise and spurious correlations, and zero‑shot transfer to real‑world benchmarks, and we believe this work opens up the possibility of foundation models for symbolic reasoning. Code and the reference checkpoint are available at https://github.com/phuayj/neural‑rule‑inducer.
Authors:Mohamed Elhabebe, Ayman El-Baz, Qing Liu
Abstract:
Automated glaucoma detection is critical for preventing irreversible vision loss and reducing the burden on healthcare systems. However, ensuring fairness across diverse patient populations remains a significant challenge. In this paper, we propose FairEnc, a fair pretraining method for vision‑language models (VLMs) that enables simultaneous debiasing across multiple sensitive attributes. FairEnc jointly mitigates biases in both textual and visual modalities with respect to multiple sensitive attributes, including race, gender, ethnicity, and language. Specifically, for the textual encoder, we leverage a large language model to generate synthetic clinical descriptions with varied sensitive attributes while preserving disease semantics, and employ a contrastive alignment objective to encourage demographic‑invariant representations. For the visual encoder, we propose a dual‑level fairness strategy that combines mutual information regularization to reduce statistical dependence between learned features and demographic groups, with multi‑discriminator adversarial debiasing. Comprehensive experiments on the publicly available Harvard‑FairVLMed dataset demonstrate that FairEnc effectively reduces demographic disparity as measured by DPD and DEOdds while achieving strong diagnostic performance under both zero‑shot and linear probing evaluations. Additional experiments on the private FairFundus dataset show that FairEnc consistently preserves fairness advantages under cross‑domain and cross‑modality settings and maintains diagnostic performance within a competitive range. These results highlight FairEnc's ability to generalize fairness under distribution shifts, supporting its potential for more equitable deployment in real‑world clinical settings. Our codebase and synthetic clinical notes are available at https://github.com/Mohamed‑Elhabebe/FairEnc
Authors:Haotian Xia, Hao Peng, Yunjia Qi, Xiaozhi Wang, Bin Xu, Lei Hou, Juanzi Li
Abstract:
Story generation aims to automatically produce coherent, structured, and engaging narratives. Although large language models (LLMs) have significantly advanced text generation, stories generated by LLMs still diverge from human‑authored works regarding complex narrative structure and human‑aligned preferences. A key reason is the absence of effective modeling of human story preferences, which are inherently subjective and under‑explored. In this work, we systematically evaluate the modeling of human story preferences and introduce StoryRMB, the first benchmark for assessing reward models on story preferences. StoryRMB contains 1,133 high‑quality, human‑verified instances, each consisting of a prompt, one chosen story, and three rejected stories. We find existing reward models struggle to select human‑preferred stories, with the best model achieving only 66.3% accuracy. To address this limitation, we construct roughly 100,000 high‑quality story preference pairs across diverse domains and develop StoryReward, an advanced reward model for story preference trained on this dataset. StoryReward achieves state‑of‑the‑art (SoTA) performance on StoryRMB, outperforming much larger models. We also adopt StoryReward in downstream test‑time scaling applications for best‑of‑n (BoN) story selection and find that it generally chooses stories better aligned with human preferences. We will release our dataset, model, and code to facilitate future research. Related code and data are available at https://github.com/THU‑KEG/StoryReward.
Authors:Siqiao Xue, Zihan Liao, Jin Qin, Ziyin Zhang, Yixiang Mu, Fan Zhou, Hang Yu
Abstract:
Code search has usually been evaluated as first‑stage retrieval, even though production systems rely on broader pipelines with reranking and developer‑style queries. Existing benchmarks also suffer from data contamination, label noise, and degenerate binary relevance. In this paper, we introduce \textscCoREB, a contamination‑limited, multitask \underlinecode \underlineretrieval and r\underlineeranking \underlinebenchmark, together with a fine‑tuned code reranker, that goes beyond retrieval to cover the full code search pipeline. \textscCoREB is built from counterfactually rewritten LiveCodeBench problems in five programming languages and delivered as timed releases with graded relevance judgments. We benchmark eleven embedding models and five rerankers across three tasks: text‑to‑code, code‑to‑text, and code‑to‑code. Our experiments reveal that: \circone code‑specialised embeddings dominate code‑to‑code retrieval (~2× over general encoders), yet no single model wins all three tasks; \circtwo short keyword queries, the format closest to real developer search, collapse every model to near‑zero nDCG@10; \circthree off‑the‑shelf rerankers are task‑asymmetric, with a 12‑point swing on code‑to‑code and no baseline net‑positive across all tasks; \circfour our fine‑tuned \textscCoREB‑Reranker is the first to achieve consistent gains across all three tasks. The data and model are released.
Authors:Yukun Chen, Tianrui Wang, Zhaoxi Mu, Xinyu Yang, EngSiong Chng
Abstract:
High‑quality singing annotations are fundamental to modern Singing Voice Synthesis (SVS) systems. However, obtaining these annotations at scale through manual labeling is unrealistic due to the substantial labor and musical expertise required, making automatic annotation highly necessary. Despite their utility, current automatic transcription systems face significant challenges: they often rely on complex multi‑stage pipelines, struggle to recover text‑note alignments, and exhibit poor generalization to out‑of‑distribution (OOD) singing data. To alleviate these issues, we present VocalParse, a unified singing voice transcription (SVT) model built upon a Large Audio Language Model (LALM). Specifically, our novel contribution is to introduce an interleaved prompting formulation that jointly models lyrics, melody, and word‑note correspondence, yielding a generated sequence that directly maps to a structured musical score. Furthermore, we propose a Chain‑of‑Thought (CoT) style prompting strategy, which decodes lyrics first as a semantic scaffold, significantly mitigating the context disruption problem while preserving the structural benefits of interleaved generation. Experiments demonstrate that VocalParse achieves state‑of‑the‑art SVT performance on multiple singing datasets. The source code and checkpoint are available at https://github.com/pymaster17/VocalParse.
Authors:Ivan Bondarenko, Roman Derunets, Oleg Sedukhin, Mikhail Komarov, Ivan Chernov, Mikhail Kulakov
Abstract:
We present our winning system for Task~B (generation with reference passages) in SemEval‑2026 Task~8: MTRAGEval. Our method is a heterogeneous ensemble of seven LLMs with two prompting variants, where a GPT‑4o‑mini judge selects the best candidate per instance. We ranked 1st out of 26 teams, achieving a conditioned harmonic mean of 0.7827 and outperforming the strongest baseline (gpt‑oss‑120b, 0.6390). Ablations show that diversity in model families, scales, and prompting strategies is essential, with the ensemble consistently beating any single model. We also introduce Meno‑Lite‑0.1, a 7B domain‑adapted model with a strong cost‑‑performance trade‑off, and analyse MTRAGEval, highlighting annotation limitations and directions for improvement. Our code is publicly available: https://github.com/RaguTeam/ragu_mtrag_semeval
Authors:Binh Long Nguyen, Kien Nguyen, Sridha Sridharan, Clinton Fookes, Peyman Moghadam
Abstract:
We introduce Ilov3Splat, a novel framework for instance‑level open‑vocabulary 3D scene understanding built on 3D Gaussian Splatting (3D‑GS). Most prior work depends on 2D rendering‑based matching or point‑level semantic association, which undermines cross‑view consistency, lacks coherent instance‑level reasoning, and limits precision in downstream 3D tasks. To address these limitations, our method jointly optimizes scene geometry and semantic representations by augmenting Gaussian splats with view‑consistent feature fields. Specifically, we leverage multi‑resolution hash embedding to efficiently encode language‑aligned CLIP features, enabling dense and coherent language grounding in 3D space. We further train an instance feature field using contrastive loss over SAM masks, supporting fine‑grained object distinction across views. At inference time, CLIP‑encoded queries are matched against the learned features, followed by two‑stage 3D clustering to retrieve relevant Gaussian groups. This enables our framework to identify arbitrary objects in 3D scenes based on natural language descriptions, without requiring category supervision or manual annotations. Experiments on standard benchmarks demonstrate that Ilov3Splat outperforms prior open‑vocabulary 3D‑GS methods in both object selection and instance segmentation, offering a flexible and accurate solution for language‑driven 3D scene understanding. Project page: https://csiro‑robotics.github.io/Ilov3Splat.
Authors:Jingtao Zhou, Xirui Kang, Feiyang Huang, Lai-Man Po
Abstract:
Existing prompt learning for VLMs exhibits a modality asymmetry, predominantly optimizing text tokens while still relying on frozen visual encoder as holistic extractor and neglecting the spectral granularity essential for fine‑grained discrimination. To bridge this, we introduce Disentangling Spectral Granularity for Prompt Learning (SpecPL), which approaches prompt learning from a novel spectral perspective via Counterfactual Granule Supervision. Specifically, we leverage a frozen VAE to decompose visual signals into semantic low‑frequency bands and granular high‑frequency details. A frozen Visual Semantic Bank anchors text representations to universal low‑frequency invariants, mitigating overfitting. Crucially, fine‑grained discrimination is driven by counterfactual granule training: by permuting high‑frequency signals, we compel the model to explicitly distinguish visual granularity from semantic invariance. Uniquely, SpecPL serves as a universal plug‑and‑play booster, revitalizing text‑oriented baselines like CoOp and MaPLe via visual‑side guidance. Experiments on 11 benchmarks demonstrate competitive state‑of‑the‑art performance, achieving a new performance ceiling of 81.51% harmonic‑mean accuracy. These results validate that spectral disentanglement with counterfactual supervision effectively bridges the gap in the stability‑generalization trade‑off. Code is released at https://github.com/Mlrac1e/SpecPL‑Prompt‑Learning.
Authors:ZhiXin Sun
Abstract:
In recent years, object detection has achieved significant progress, especially in the field of open‑vocabulary object detection. Unlike traditional methods that rely on predefined categories, open‑vocabulary approaches can detect arbitrary objects based on human‑provided prompts. With the advancement of prompt‑based detection techniques, models such as SAM3 can even outperform some category‑specific detectors trained on particular datasets without requiring additional training on those datasets. However, despite these advancements, false positives and false negatives still occur. In practical engineering applications, persistent misdetections or missed detections of the same object are unacceptable. Yet retraining the model every time such errors occur incurs substantial costs in terms of human effort, computational resources, and time. Therefore, how to leverage existing false positive and false negative samples to prevent such errors from recurring remains a highly challenging and urgent problem. To address this issue, we propose EBOD (Example‑Based Object Detection), which integrates a prompt‑based detector (SAM3) with robust feature matching modules (DINOv3 and LightGlue). The proposed framework effectively suppresses the repeated occurrence of false positives and false negatives by leveraging previous error examples, without requiring additional model retraining. Code is available at https://github.com/sunzx97/examples_based_object_detection.
Authors:Seunghan Lee, Jaehoon Lee, Jun Seo, Sungdong Yoo, Minjae Kim, Tae Yoon Lim, Dongwan Kang, Hwanil Choi, SoonYoung Lee, Wonbin Ahn
Abstract:
TabPFN has recently gained attention as a foundation model for tabular datasets, achieving strong performance by leveraging in‑context learning on synthetic data. However, we find that TabPFN is vulnerable to label shift, often overfitting to the majority class in the training dataset. To address this limitation, we propose DistPFN, the first test‑time posterior adjustment method designed for tabular foundation models. DistPFN rescales predicted class probabilities by downweighting the influence of the training prior (i.e., the class distribution of the context) and emphasizing the contribution of the model's predicted posterior, without architectural modification or additional training. We further introduce DistPFN‑T, which incorporates temperature scaling to adaptively control the adjustment strength based on the discrepancy between prior and posterior. We evaluate our methods on over 250 OpenML datasets, demonstrating substantial improvements for various TabPFN‑based models in classification tasks under label shift, while maintaining strong performance in standard settings without label shift. Code is available at this repository: https://github.com/seunghan96/DistPFN.
Authors:Furkan Sakizli
Abstract:
Production agent frameworks (OpenAI Function Calling, Anthropic Tool Use, MCP) transmit tool schemas as JSON, a format designed for machine parsing, not for interpretation by language models. For small models (4B‑14B), this protocol mismatch accounts for the majority of tool‑use failure at production catalog sizes. We present TSCG, a deterministic tool‑schema compiler that resolves this mismatch at the API boundary, converting JSON schemas into token‑efficient structured text without model access, fine‑tuning, or runtime search. TSCG combines eight composable operators with a formal compression bound (>=51% on well‑formed schemas). On TSCG‑Agentic‑Bench (about 19,000 calls, 12 models, 5 scenarios), TSCG restores Phi‑4 14B from 0% to 84.4% accuracy at 20 tools (90.3% at 50 tools) and achieves 108‑181% accuracy‑retained ratio across three models on BFCL. Format‑versus‑compression decomposition (R^2=0.88 ‑> 0.03) establishes representation change as the dominant mechanism. Per‑operator isolation across three frontier models reveals three distinct operator‑response profiles: operator‑hungry (Opus 4.7), operator‑sensitive (GPT‑5.2), and operator‑robust (Sonnet 4), providing per‑model deployment guidance. Scaling experiments show accuracy advantages persisting on heavy production MCP schemas (+5.0 pp at about 10,500 input tokens) despite saturation on light synthetic catalogs, with 52‑57% token savings throughout. The synthetic benchmark generalizes to real MCP schemas within 0.1 accuracy points. TSCG ships as a 1,200‑line zero‑dependency TypeScript package.
Authors:Tianyang Han, Hengyu Shi, Junjie Hu, Xu Yang, Zhiling Wang, Junhao Su
Abstract:
Reinforcement learning with verifiable rewards has become a common way to improve explicit reasoning in large language models, but final‑answer correctness alone does not reveal whether the reasoning trace is faithful, reliable, or useful to the model that consumes it. This outcome‑only signal can reinforce traces that are right for the wrong reasons, overstate reasoning gains by rewarding shortcuts, and propagate flawed intermediate states in multi‑step systems. To this end, we propose TraceLift, a planner‑executor training framework that treats reasoning as a consumable intermediate artifact. During planner training, the planner emits tagged reasoning. A frozen executor turns this reasoning into the final artifact for verifier feedback, while an executor‑grounded reward shapes the intermediate trace. This reward multiplies a rubric‑based Reasoning Reward Model (RM) score by measured uplift on the same frozen executor, crediting traces that are both high‑quality and useful. To make reasoning quality directly learnable, we introduce TRACELIFT‑GROUPS, a rubric‑annotated reason‑only dataset built from math and code seed problems. Each example is a same‑problem group containing a high‑quality reference trace and multiple plausible flawed traces with localized perturbations that reduce reasoning quality or solution support while preserving task relevance. Extensive experiments on code and math benchmarks show that this executor‑grounded reasoning reward improves the two‑stage planner‑executor system over execution‑only training, suggesting that reasoning supervision should evaluate not only whether a trace looks good, but also whether it helps the model that consumes it. Our code is available at: https://github.com/MasaiahHan/TraceLift
Authors:Yibang Tang, Yifan Yang, Jingyuan Wang, Junhua Chen, Zhen Zhao
Abstract:
Robotic Mobile Fulfillment Systems (RMFS) rely on mobile robots for automated inventory transportation, coordinating order allocation and robot scheduling to enhance warehousing efficiency. However, optimizing RMFS is challenging due to strict real‑time constraints and the strong coupling of multi‑phase decisions. Existing methods either decompose the problem into isolated sub‑tasks to guarantee responsiveness at the cost of global optimality, or rely on computationally expensive global optimization models that are unsuitable for dynamic industrial environments. To bridge this gap, we propose SOAR, a unified Deep Reinforcement Learning framework for real‑time joint optimization. SOAR transforms order allocation and robot scheduling into a unified process by utilizing soft order allocations as observations. We formulate this as an Event‑Driven Markov Decision Process, enabling the agent to perform simultaneous scheduling in response to asynchronous system events. Technically, we employ a Heterogeneous Graph Transformer to encode the warehouse state and integrate phased domain knowledge. Additionally, we incorporate a reward shaping strategy to address sparse feedback in long‑horizon tasks. Extensive experiments on synthetic and real‑world industrial datasets, in collaboration with Geekplus, demonstrate that SOAR reduces global makespan by 7.5% and average order completion time by 15.4% with sub‑100ms latency. Furthermore, sim‑to‑real deployment confirms its practical viability and significant performance gains in production environments. The code is available at https://github.com/200815147/SOAR.
Authors:Yiding Ma, Chengyun Ruan, Kaibo Huang, Zhongliang Yang, Linna Zhou
Abstract:
Large language models are moving from static text generators toward real‑world decision‑support systems, where forecasting is a composite capability that links information gathering, evidence integration, situational judgment, and action‑oriented decision making. This capability is in broad demand across finance, policy, industry, and scientific research, yet its evaluation remains difficult: live benchmarks evaluate forecasts before answers exist, making them the cleanest way to measure forecasting ability, but they expire once events resolve; retrospective benchmarks are reproducible, but they cannot reliably distinguish genuine forecasting from facts a model may have already learned during pretraining. Prompting models to "pretend not to know" cannot replace a genuine knowledge boundary. We propose OracleProto, a reproducible framework for evaluating LLM native forecasting capability. OracleProto reconstructs resolved events into time‑bounded forecasting samples by combining model‑cutoff‑aligned sample admission, tool‑level temporal masking, content‑level leakage detection, discrete answer normalization, and hierarchical scoring. Instantiated on a FutureX‑Past‑derived dataset with six contemporary LLMs, OracleProto distinguishes forecasting quality, sampling stability, and cost efficiency under controlled information boundaries, while reducing residual leakage to the 1% level, an order of magnitude below tool‑only temporal filtering. OracleProto turns LLM forecasting from one‑off evaluation into an auditable, reusable, and trainable dataset‑level capability, providing a unified interface for fair cross‑model comparison and a controlled signal source for downstream SFT and RL. Code and data are available at https://github.com/MaYiding/OracleProto and https://huggingface.co/datasets/MaYiding/OracleProto.
Authors:Ruichu Cai, Juntao Gan, Miao Mai, Zhifeng Hao, Boyan Xu
Abstract:
Zero‑shot Named Entity Recognition (ZS‑NER) remains brittle under domain and schema shifts, where unseen label definitions often misalign with a large language model's (LLM's) intrinsic semantic organization. As a result, directly mapping entity mentions to fine‑grained target labels can induce systematic semantic drift, especially when target schemas are novel or semantically overlapping. We propose SAM‑NER, a three‑stage framework based on \emphSemantic Archetype Mediation that stabilizes cross‑domain transfer through an intermediate, domain‑invariant archetype space. SAM‑NER: (i) performs \emphEntity Discovery via cooperative extraction and consensus‑based denoising to obtain high‑coverage, high‑fidelity entity spans; (ii) conducts \emphAbstract Mediation by projecting entities into a compact set of universal semantic archetypes distilled from high‑level ontological abstractions; and (iii) applies \emphSemantic Calibration to resolve archetype‑level predictions into target‑domain types through constrained, definition‑aligned inference with a frozen LLM. Experiments on the CrossNER benchmark show that SAM‑NER consistently outperforms strong prior ZS‑NER baselines in cross‑domain settings. Our implementation will be open‑sourced at https://github.com/DMIRLAB‑Group/SAM‑NER.
Authors:Zhifeng Hao, Zhongjie Chen, Junhao Lu, Shengyin Yu, Guimin Hu, Keli Zhang, Ruichu Cai, Boyan Xu
Abstract:
Event Causality Identification (ECI) requires models to determine whether a given pair of events in a context exhibits a causal relationship. While Large Language Models (LLMs) have demonstrated strong performance across various NLP tasks, their effectiveness in ECI remains limited due to biases in causal reasoning, often leading to overprediction of causal relationships (causal hallucination). To mitigate these issues and enhance LLM performance in ECI, we propose SERE, a structural example retrieval framework that leverages LLMs' few‑shot learning capabilities. SERE introduces an innovative retrieval mechanism based on three structural concepts: (i) Conceptual Path Metric, which measures the conceptual relationship between events using edit distance in ConceptNet; (ii) Syntactic Metric, which quantifies structural similarity through tree edit distance on syntactic trees; and (iii) Causal Pattern Filtering, which filters examples based on predefined causal structures using LLMs. By integrating these structural retrieval strategies, SERE selects more relevant examples to guide LLMs in causal reasoning, mitigating bias and improving accuracy in ECI tasks. Extensive experiments on multiple ECI datasets validate the effectiveness of SERE. The source code is publicly available at https://github.com/DMIRLAB‑Group/SERE.
Authors:Timon Homberger, Finn Lukas Busch, Jesús Gerardo Ortega Peimbert, Quantao Yang, Olov Andersson
Abstract:
Open‑vocabulary semantic mapping enables robots to spatially ground previously unseen concepts without requiring predefined class sets. Current training‑free methods commonly rely on multi‑view fusion of semantic embeddings into a 3D map, either at the instance‑level via segmenting views and encoding image crops of segments, or by projecting image patch embeddings directly into a dense semantic map. The latter approach sidesteps segmentation and 2D‑to‑3D instance association by operating on full uncropped image frames, but existing methods remain limited in scalability. We present FUS3DMaps, an online dual‑layer semantic mapping method that jointly maintains both dense and instance‑level open‑vocabulary layers within a shared voxel map. This design enables further voxel‑level semantic fusion of the layer embeddings, combining the complementary strengths of both semantic mapping approaches. We find that our proposed semantic cross‑layer fusion approach improves the quality of both the instance‑level and dense layers, while also enabling a scalable and highly accurate instance‑level map where the dense layer and cross‑layer fusion are restricted to a spatial sliding window. Experiments on established 3D semantic segmentation benchmarks as well as a selection of large‑scale scenes show that FUS3DMaps achieves accurate open‑vocabulary semantic mapping at multi‑story building scales. Additional material and code will be made available: https://githanonymous.github.io/FUS3DMaps/.
Authors:Zijian Zhao, Dian Jin, Zijing Zhou, Xiaoyu Zhang
Abstract:
Music‑inspired Automatic Stage Lighting Control (ASLC) has gained increasing attention in recent years due to the substantial time and financial costs associated with hiring and training professional lighting engineers. However, existing methods suffer from several notable limitations: the low interpretability of rule‑based approaches, the restriction to single‑primary‑light control in music‑to‑color‑space methods, and the limited transferability of music‑to‑controlling‑parameter frameworks. To address these gaps, we propose SeqLight, a hierarchical deep learning framework that maps music to multi‑light Hue‑Saturation‑Value (HSV) space. Our approach first customizes SkipBART, an end‑to‑end single primary light generation model, to predict the full light color distribution for each frame, followed by hybrid Imitation Learning (IL) techniques to derive an effective decomposition strategy that distributes the global color distribution among individual lights. Notably, the light decomposition module can be trained under varying venue‑specific lighting configurations using only mixed light data and no professional demonstrations, thereby flexibly adapting across diverse venues. In this stage, we formulate the light decomposition task as a Goal‑Conditioned Markov Decision Process (GCMDP), construct an expert demonstration set inspired by Hindsight Experience Replay (HER), and introduce a three‑phase IL training pipeline, achieving strong generalization capability. To validate our IL solution for the proposed GCMDP, we conduct a series of quantitative analysis and human study. The code and trained models are provided at https://github.com/RS2002/SeqLight .
Authors:Carlijn Lems, Sander Moonemans, Natálie Klubíčková, Biagio Brattoli, Taebum Lee, Seokhwi Kim, Veronica Vilaplana, Laura Pons, Sapir Hochman, Mauricio Eduardo Suárez-Franck, Pedro Luis Fernandez, Julius Drachneris, Donatas Petroska, Renaldas Augulis, Arvydas Laurinavicius, Domingos Oliveira, Diana Montezuma, Anouk B. Bouwmeester, Dominique van Midden, Anne-Marie Vos, Shoko Vos, Jolique van Ipenburg, Maschenka Balkenhol, Koen Winkler, Iris Nagtegaal, Konnie Hebeda, Uta Flucke, Katrien Grünberg, Josef Skopal, Brinder S. Chohan, Jordi Temprana-Salvador, Enrico Munari, Luca Cima, Giulia Querzoli, Yosamin Gonzalez Belisario, Jaeike W. Faber, Geert J. L. H. van Leenders, Jan H. von der Thüsen, Lodewijk A. A. Brosens, Ronald R. de Krijger, Pieter Wesseling, Sandrine Florquin, Mateusz Maniewski, Adam Kowalewski, Robert Barna, Dina Tiniakos, Joan Lop Gros, Rogier Donders, Jake S. F. Maurits, Ming Yang Lu, Chengkuan Chen, Faisal Mahmood, Jeroen van der Laak, Nadieh Khalili, Frédérique Meeuwsen, Francesco Ciompi
Abstract:
Foundation models with visual question answering capabilities for digital pathology are emerging. Such unprecedented technology requires independent benchmarking to assess its potential in assisting pathologists in routine diagnostics. We created DALPHIN, the first multicentric open benchmark for pathology AI copilots, comprising 1236 images from 300 cases, spanning 130 rare to common diagnoses, 6 countries, and 14 subspecialties. The DALPHIN design and dataset are introduced alongside a human performance benchmark of 31 pathologists from 10 countries with varying expertise. We report results for two general‑purpose (GPT‑5, Gemini 2.5 Pro) and one pathology‑specific copilot (PathChat+) for sequential and independent answer generation. We observed no statistically significant difference from expert‑level performance in four of six tasks for PathChat, 2/6 tasks for Gemini, and 1/6 tasks for GPT. DALPHIN is publicly released with sequestered, indirectly accessible ground truth to foster robust and enduring benchmarking. Data, methods, and the evaluation platform are accessible through dalphin.grand‑challenge.org.
Authors:Thanh Dat Hoang, Thanh Trung Huynh, Matthias Weidlich, Thanh Tam Nguyen, Tong Chen, Hongzhi Yin, Quoc Viet Hung Nguyen
Abstract:
Large language models have driven major advances in Text‑to‑SQL generation. However, they suffer from high computational cost, long latency, and data privacy concerns, which make them impractical for many real‑world applications. A natural alternative is to use small language models (SLMs), which enable efficient and private on‑premise deployment. Yet, SLMs often struggle with weak reasoning and poor instruction following. Conventional reinforcement learning methods based on sparse binary rewards (0/1) provide little learning signal when the generated SQLs are incorrect, leading to unstable or collapsed training. To overcome these issues, we propose FINER‑SQL, a scalable and reusable reinforcement learning framework that enhances SLMs through fine‑grained execution feedback. Built on group relative policy optimization, FINER‑SQL replaces sparse supervision with dense and interpretable rewards that offer continuous feedback even for incorrect SQLs. It introduces two key reward functions: a memory reward, which aligns reasoning with verified traces for semantic stability, and an atomic reward, which measures operation‑level overlap to grant partial credit for structurally correct but incomplete SQLs. This approach transforms discrete correctness into continuous learning, enabling stable, critic‑free optimization. Experiments on the BIRD and Spider benchmarks show that FINER‑SQL achieves up to 67.73% and 85% execution accuracy with a 3B model ‑‑ matching much larger LLMs while reducing inference latency to 5.57~s/sample. These results highlight a cost‑efficient and privacy‑preserving path toward high‑performance Text‑to‑SQL generation. Our code is available at https://github.com/thanhdath/finer‑sql.
Authors:Seunghan Lee, Jun Seo, Jaehoon Lee, Sungdong Yoo, Minjae Kim, Tae Yoon Lim, Dongwan Kang, Hwanil Choi, Soonyoung Lee, Wonbin Ahn
Abstract:
Time series (TS) reasoning models (TSRMs) have shown promising capabilities in general domains, yet they consistently fail on financial domain, which exhibit unique characteristics. We propose a general 2x2 capability taxonomy for TSRMs by crossing 1) single‑entity vs. multi‑entity analysis with 2) assessment of the current state vs. prediction of future behavior. We instantiate this taxonomy in the financial domain ‑‑ where the distinction between deterministic assessment and stochastic prediction is particularly critical ‑‑ as ten financial reasoning tasks, forming the FinTSR‑Bench benchmark based on S&P stocks. To this end, we propose FinSTaR (Financial Time Series Thinking and Reasoning), trained on FinTSR‑Bench with distinct chain‑of‑thought (CoT) strategies tailored to each category. For assessment, which is deterministic (i.e., computable from observable data), we employ Compute‑in‑CoT, a programmatic CoT that enables models to derive answers directly from raw prices. For prediction, which is inherently stochastic (i.e., subject to unobservable factors), we adopt Scenario‑Aware CoT, which generates diverse scenarios before making a judgment, mirroring how financial analysts reason under uncertainty. The proposed method achieves 78.9% average accuracy on FinTSR‑Bench, substantially outperforming LLM and TSRM baselines. Furthermore, we show that the four capability categories are complementary and mutually reinforcing through joint training, and that Scenario‑Aware CoT consistently improves prediction accuracy over standard CoT. Code is publicly available at: https://github.com/seunghan96/FinSTaR.
Authors:Al Zadid Sultan Bin Habib, Gianfranco Doretto, Donald A. Adjeroh
Abstract:
High‑dimensional tabular data lacks a natural feature order, limiting the applicability of permutation‑sensitive deep learning models. We propose DynaTab, a dynamic feature ordering‑enabled architecture inspired by neural rewiring. We introduce a lightweight criterion that predicts when feature permutation will benefit a dataset by quantifying its intrinsic complexity. DynaTab dynamically reorders features via a neural rewiring algorithm and processes them through a compact, dynamic order‑aware combination of separate learned positional embedding, importance‑based gating, and masked attention layers, compatible with any sequence‑sensitive backbone. Trained end‑to‑end with bespoke dynamic feature ordering (DFO) and dispersion losses, DynaTab achieves statistically significant gains, particularly on high‑dimensional datasets, where it is benchmarked against 45 state‑of‑the‑art baselines across 36 different real‑world tabular datasets. Our results position DynaTab as a compelling new paradigm for high‑dimensional tabular deep learning.
Authors:Shawn Li, You Qin, Jiate Li, Charith Peris, Lisa Bauer, Roger Zimmermann, Yue Zhao
Abstract:
Out‑of‑distribution (OOD) detection identifies test samples that fall outside a model's training distribution, a capability critical for safe deployment in high‑stakes applications. Standard OOD detectors are trained on a specific in‑distribution (ID) dataset and detect deviations from that single domain. In contrast, we study few‑shot cross‑domain OOD detection: given a \emphsingle pre‑trained model, can we perform OOD detection on \empharbitrary new ID‑OOD task pairs using only a handful of ID samples at inference time, with no additional training? We propose UFCOD, a unified framework that achieves this goal through information‑geometric analysis of diffusion trajectories. Our key insight is that diffusion noise predictions are score functions (gradients of log‑density), and we extract two energy features: \emphPath Energy (integrated score magnitude) and \emphDynamics Energy (score smoothness), that form a discrete Sobolev norm capturing how samples interact with the learned diffusion process. The central contribution is a train‑once, deploy‑anywhere paradigm: a diffusion model trained on a single dataset (e.g., CelebA) serves as a universal feature extractor for OOD detection across semantically unrelated domains (e.g., CIFAR‑10, SVHN, Textures). At deployment, each new task requires only ~100 unlabeled ID samples for inference: no retraining, no fine‑tuning, no task‑specific adaptation. Using 100 ID samples per task, UFCOD achieves 93.7% average AUROC across 12 cross‑domain benchmarks, competitive with methods trained on 50k‑‑163k samples, demonstrating ~500× improvement in sample efficiency. See our code in https://github.com/lili0415/UFCOD.
Authors:Akshat Singh Jaswal, Ashish Baghel, Paras Chopra
Abstract:
Reinforcement learning systems rely on environment interfaces that specify observations and reward functions, yet constructing these interfaces for new tasks often requires substantial manual effort. While recent work has automated reward design using large language models (LLMs), these approaches assume fixed observations and do not address the broader challenge of synthesizing complete task interfaces. We study RL task interface discovery from raw simulator state, where both observation mappings and reward functions must be generated. We propose LIMEN (Code available at https://github.com/Lossfunk/LIMEN), a LLM guided evolutionary framework that produces candidate interfaces as executable programs and iteratively refines them using policy training feedback. Across novel discrete gridworld tasks and continuous control domains spanning locomotion and manipulation, joint evolution of observations and rewards discovers effective interfaces given only a trajectory‑level success metric, while optimizing either component alone fails on at least one domain. These results demonstrate that automatic construction of RL interfaces from raw state can substantially reduce manual engineering and that observation and reward components often benefit from co‑design, as single‑component optimization fails catastrophically on at least one domain in our evaluation suite.
Authors:Lina Zhang, Tonmoy Monsoor, Mehmet Efe Lorasdagi, Prateik Sinha, Chong Han, Peizheng Li, Yuan Wang, Jessica Pasqua, Colin McCrimmon, Rajarshi Mazumder, Vwani Roychowdhury
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated robust capabilities in recognizing everyday human activities, yet their potential for analyzing clinically significant involuntary movements in neurological disorders remains largely unexplored. This pilot study evaluates the capability of MLLMs for automated recognition of pathological movements in seizure videos. We assessed the zero‑shot performance of state‑of‑the‑art MLLMs on 20 ILAE‑defined semiological features across 90 clinical seizure recordings. MLLMs outperformed fine‑tuned Convolutional Neural Network (CNN) and Vision Transformer (ViT) baseline models on 13 of 18 features without task‑specific training, demonstrating particular strength in recognizing salient postural and contextual features while struggling with subtle, high‑frequency movements. Feature‑targeted signal enhancement (facial cropping, pose estimation, audio denoising) improved performance on 10 of 20 features. Expert evaluation showed that 94.3 percent of MLLM‑generated explanations for correctly predicted cases achieved at least 60 percent faithfulness scores, aligning with epileptologist reasoning. These findings demonstrate the potential of adapting general‑purpose MLLMs for specialized clinical video analysis through targeted preprocessing strategies, offering a path toward interpretable, efficient diagnostic assistance. Our code is publicly available at https://github.com/LinaZhangUCLA/PathMotionMLLM.
Authors:JF Bastien, Sam D'Amico
Abstract:
Video vision‑language models (VLMs) keep paying for visual state the stream already told us was stable. The factory wall did not move, but most VLM pipelines still hand the model dense RGB frames or a fresh prefix again. We study that waste as training‑free anti‑recomputation: reuse state when validation says it survives, and buy fresh evidence when the scene, query, or cache topology requires it. The largest measured win is after ingest. On frozen Qwen2.5‑VL‑7B‑Instruct‑4bit, adaptive same‑video follow‑up reuse preserves paired choices and correctness on a 93‑query VideoMME breadth setting while reducing follow‑up latency by 14.90‑35.92x. The first query is still cold; the win starts when later questions reuse the same video state. Stress tests bound the result: repeated‑question schedules hold through 50 turns, while dense‑answer‑anchored prompt variation separates conservative fixed K=1 repair from faster aggressive policies that drift. Fresh‑video pruning is smaller but real. C‑VISION skips timed vision‑tower work before the first answer is generated. On Gemma 4‑E4B‑4bit, the clean 32f short cell reaches 1.316x first‑query speedup with no paired drift or parse failures on 20 items; Qwen shows the fidelity/speed boundary. Stage‑share ceiling (C‑CEILING) is the accounting guardrail: a component speedup becomes an end‑to‑end speedup only in proportion to the wall‑clock share it accelerates, so C‑VISION and after‑ingest follow‑up reuse do not multiply. Candidate C‑STREAM remains a native‑rate target, not a headline result here. The broader direction is VLM‑native media that expose change, motion, uncertainty, object state, sensor time, and active tiles directly, so models do not have to rediscover the world from dense RGB every frame.
Authors:Negar Arabzadeh, Wenjie Ma, Sewon Min, Matei Zaharia
Abstract:
Retrieval‑augmented generation (RAG) has proven effective for knowledge‑intensive tasks, but is widely believed to offer limited benefit for reasoning‑intensive problems such as math and code generation. We challenge this assumption by showing that the limitation lies not in RAG itself, but in the choice of corpus. Instead of retrieving documents, we propose retrieving thinking traces, i.e., intermediate thinking trajectories generated during problem solving attempts. We show that thinking traces are already a strong retrieval source, and further introduce T3, an offline method that transforms them into structured, retrieval‑friendly representations, to improve usability. Using these traces as a corpus, a simple retrieve‑then‑generate pipeline consistently improves reasoning performance across strong models and benchmarks such as AIME 2025‑‑2026, LiveCodeBench, and GPQA‑Diamond, outperforming both non‑RAG baselines and retrieval over standard web corpora. For instance, on AIME, RAG with traces generated by Gemini‑2‑thinking achieves relative gains of +56.3%, +8.6%, and +7.6% for Gemini‑2.5‑Flash, GPT‑OSS‑120B, and GPT‑5, respectively, even though these are more recent models. Interestingly, RAG on T3 also incurs little or no extra inference cost, and can even reduce inference cost by up to 15%. Overall, our results suggest that thinking traces are an effective retrieval corpus for reasoning tasks, and transforming them into structured, compact, or diagnostic representations unlocks even stronger gains. Code available at https://github.com/Narabzad/t3.
Authors:Alan L. McCann
Abstract:
Dependency confusion attacks exploit a structural gap in software distribution: once a package is installed, there is no cryptographic proof of which registry distributed it. Every existing defense is configuration‑based and fails silently when misconfigured. We present a cryptographic distribution provenance system comprising three components: (1) cryptographic registry identity, where every registry holds an Ed25519 keypair and signs every artifact it distributes; (2) a dual‑signature model, where the publisher signs at packaging time and the registry countersigns at publication time; and (3) authoritative namespace binding, where consumers pin registry fingerprints and the resolver cryptographically rejects artifacts from unauthorized registries. These create three defense layers requiring simultaneous compromise for a successful attack. A comparison across eight ecosystems (npm, Cargo, Hex.pm, PyPI, Go modules, Docker/OCI, NuGet, Maven) shows no existing ecosystem combines mandatory publisher signing, cryptographic registry identity, mandatory registry countersigning, and consumer‑side cryptographic enforcement. The system extends to AI‑generation provenance as a signed attribute and governance‑enforced dependency resolution. A case study integrates distribution provenance with a three‑layer runtime governance architecture, creating a four‑phase lifecycle chain with no cryptographic gaps.
Authors:Seunghyun Ji
Abstract:
LoRA fine‑tuning of diffusion transformers (DiT) on multi‑style data suffers from \emphstyle bleed: a single low‑rank residual cannot represent several distinct artist fingerprints, and the optimizer converges to their average. Mixture‑of‑experts LoRA in the HydraLoRA style replaces the up‑projection with E heads under a router, but when every expert is zero‑initialized the router receives identical gradient from each head and remains at the uniform prior. The experts then evolve permutation‑symmetrically, and the network trains as a single rank‑r LoRA at E× the cost. We present Ortho‑Hydra, a re‑parameterisation that combines an OFT‑style Cayley‑orthogonal shared basis with per‑expert \emphdisjoint output subspaces carved from the top‑(Er) left singular vectors of the pretrained weight. Disjointness makes the router's per‑expert score non‑degenerate at step~0, so specialization receives gradient signal before any expert has trained. We test the predicted deadlock on a DiT pipeline by comparing two HydraLoRA baselines, a zero‑initialized shared‑basis variant and the original σ=0.1 Gaussian‑jitter mitigation, against Ortho‑Hydra under a matched optimiser, dataset, and step budget. Neither baseline leaves the uniform prior within the first 1\textk steps; Ortho‑Hydra begins de‑uniformising within the first few hundred. End‑task generation quality on multi‑style data is out of scope; we report the construction, the cold‑start mechanism, and the routing dynamics it changes. Code: https://github.com/sorryhyun/anima_lora.
Authors:Hongkun Yu
Abstract:
Large Language Models (LLMs) have demonstrated strong capabilities in natural language understanding and reasoning. However, their ability to perform exact, deterministic computation remains unclear. In this work, we systematically evaluate multiple prompting strategies, including Chain‑of‑Thought (CoT), Least‑to‑Most decomposition, Program‑of‑Thought (PoT), and Self‑Consistency (SC), on tasks requiring precise and error‑free outputs, including binary counting, longest substring detection, and arithmetic evaluation. To support this study, we introduce a synthetic dataset with diverse natural language instructions, enabling controlled evaluation of exact computation across multiple task types. Our results show that standard prompting methods achieve only moderate accuracy on sequence‑based tasks. CoT provides limited improvement, while Least‑to‑Most suffers from error accumulation. In contrast, PoT achieves perfect accuracy by generating executable code and delegating computation to an external interpreter. Self‑Consistency improves robustness through majority voting, but incurs substantial computational overhead. We further train a small domain‑specific model (CodeT5‑small) to generate executable programs, which achieves perfect accuracy on held‑out synthetic test data across all tasks with minimal training cost. Overall, our findings suggest that LLMs may simulate reasoning patterns rather than reliably perform exact symbolic computation. For deterministic tasks, combining LLMs with external tools or using specialized models provides a more reliable and efficient solution.
Authors:Ruofeng Yang, Yongcan Li, Shuai Li
Abstract:
This report describes ARIS (Auto‑Research‑in‑sleep), an open‑source research harness for autonomous research, including its architecture, assurance mechanisms, and early deployment experience. The performance of agent systems built on LLMs depends on both the model weights and the harness around them, which governs what information to store, retrieve, and present to the model. For long‑horizon research workflows, the central failure mode is not a visible breakdown but a plausible unsupported success: a long‑running agent can produce claims whose evidential support is incomplete, misreported, or silently inherited from the executor's framing. Therefore, we present ARIS as a research harness that coordinates machine‑learning research workflows through cross‑model adversarial collaboration as a default configuration: an executor model drives forward progress while a reviewer from a different model family is recommended to critique intermediate artifacts and request revisions. ARIS has three architectural layers. The execution layer provides more than 65 reusable Markdown‑defined skills, model integrations via MCP, a persistent research wiki for iterative reuse of prior findings, and deterministic figure generation. The orchestration layer coordinates five end‑to‑end workflows with adjustable effort settings and configurable routing to reviewer models. The assurance layer includes a three‑stage process for checking whether experimental claims are supported by evidence: integrity verification, result‑to‑claim mapping, and claim auditing that cross‑checks manuscript statements against the claim ledger and raw evidence, as well as a five‑pass scientific‑editing pipeline, mathematical‑proof checks, and visual inspection of the rendered PDF. A prototype self‑improvement loop records research traces and proposes harness improvements that are adopted only after reviewer approval.
Authors:Xinglin Lian, Chengtai Cao, Ting Zhong, Yong Wang, Kai Chen, Fan Zhou
Abstract:
Network traffic anomaly detection represents a critical cybersecurity task, yet widespread encryption makes this task increasingly challenging. In response, image‑based methods that model traffic as visual patterns have emerged as the dominant approach. However, this work pioneers the identification of a pervasive ``full‑frequency'' characteristic and an associated limitation termed ``spectral mismatch'' within this paradigm. Specifically, while encrypted traffic exhibits prominent high‑frequency components, mainstream reconstruction methods demonstrate an inherent bias toward learning low‑frequency information. This fundamental mismatch results in incomplete representations that consequently degrade anomaly detection performance. To address this challenge, we propose FreeUp, a novel frequency‑decoupled framework designed explicitly for encrypted traffic analysis. FreeUp decomposes traffic data into distinct low‑ and high‑frequency bands, processing them through separate, dedicated branches along with a customized training strategy that ensures stable and independent frequency‑specific learning. Furthermore, recognizing that simple reconstruction error proves inadequate for evaluating dual‑branch architectures, we introduce an uncertainty‑inspired fusion scoring mechanism. This mechanism quantifies the reconstruction uncertainty of the frequency‑specific branches and dynamically integrates their outputs, yielding a more comprehensive and reliable anomaly score. Extensive experiments across multiple benchmarks demonstrate that FreeUp consistently outperforms state‑of‑the‑art baselines. The code is available at https://github.com/ikun0124/FreeUp.
Authors:Fang Wu, Weihao Xuan, Heli Qi, Hanqun Cao, Heng-Jui Chang, Zeqi Zhou, Haokai Zhao, Ma Jian, Carl Ma, Yu-Chi Cheng, Kuan Pang, Xiangru Tang, Zehong Wang, Guanlue Li, Hanchen Wang, Kejun Ying, Pan Lu, Chiho Im, Seungju Han, Peng Xia, Tinson Xu, Yinxi Li, Deyao Zhu, Pheng-Ann Heng, Naoto Yokoya, Masashi Sugiyama, Li Erran Li, Jure Leskovec, Yejin Choi
Abstract:
Deep learning in \emphde novo protein design has achieved atomic‑level fidelity. However, existing models remain largely non‑deliberative: they directly synthesize molecular geometries without explicitly reasoning about which residues or interactions are functionally essential. As a result, design decisions are entangled with continuous sampling dynamics, limiting interpretability, controllability, and systematic reuse of biochemical knowledge. We introduce Proteo‑R1, a reasoning‑guided protein design framework that explicitly decouples \emphmolecular understanding from \emphgeometric generation. Proteo‑R1 adopts a dual‑expert architecture in which a multimodal large language model (MLLM) serves as an \emphunderstanding expert, analyzing protein sequences, structures, and textual context to identify key functional residues that govern binding and specificity. These residue‑level decisions are then passed as hard constraints to a separate diffusion‑based \emphgeneration expert, which performs conditional co‑design while respecting the fixed interaction anchors. This factorization mirrors how human experts approach molecular engineering: first, reasoning about critical interactions, then optimizing geometry subject to those constraints. By operationalizing reasoning as explicit residue‑level commitments rather than latent textual guidance, Proteo‑R1 achieves stable, interpretable, and modular integration of LLM reasoning with state‑of‑the‑art geometric generative models. Code, data, and demos are available at https://smiles724.github.io/r1/.
Authors:Bumjun Kim, Albert No
Abstract:
Understanding how textual embeddings contribute to memorization in text‑to‑image diffusion models is crucial for both interpretability and safety. This paper investigates an unexpected behavior of CLIP embeddings in Stable Diffusion, revealing that the model disproportionately relies on specific embeddings. We categorize input tokens as <startoftext>, <prompt>, <endoftext> and <pad> with corresponding embeddings \mathbfv^\mathbfsot, \mathbfv^\mathbfpr, \mathbfv^\mathbfeot, \mathbfv^\mathbfpad. We discover that \mathbfv^\mathbfpr contribute minimally to generation in memorized cases. In contrast, \mathbfv^\mathbfpad strongly affect memorization due to their structural duplication of \mathbfv^\mathbfeot, the only embedding explicitly optimized during CLIP training. This duplication unintentionally amplifies the influence of \mathbfv^\mathbfeot, causing the model to over‑rely on it, thereby driving memorization. Based on these observations, we propose two simple yet effective inference‑time mitigation strategies: (1) Replacing the tokenizer's default <pad> from <eot> to the ! token before embedding, and masking the \mathbfv^\mathbfeot; (2) Partial masking of \mathbfv^\mathbfpad. Both suppress memorization without degrading quality, and are readily deployable without prior detection.
Authors:Xiao Li, Xiang Zheng, Yifeng Gao, Xinyu Xia, Yixu Wang, Xin Wang, Ye Sun, Yunhan Zhao, Ming Wen, Jiayu Li, Xun Gong, Yi Liu, Yige Li, Yutao Wu, Cong Wang, Jun Sun, Yixin Cao, Zhineng Chen, Jingjing Chen, Tao Gui, Qi Zhang, Zuxuan Wu, Xipeng Qiu, Xuanjing Huang, Tiehua Zhang, Zhipeng Wei, Hanxun Huang, Sarah Erfani, James Bailey, Jianping Wang, Wei-Ying Ma, Bo Li, Xingjun Ma, Yu-Gang Jiang
Abstract:
Embodied Artificial Intelligence (Embodied AI) integrates perception, cognition, planning, and interaction into agents that operate in open‑world, safety‑critical environments. As these systems gain autonomy and enter domains such as transportation, healthcare, and industrial or assistive robotics, ensuring their safety becomes both technically challenging and socially indispensable. Unlike digital AI systems, embodied agents must act under uncertain sensing, incomplete knowledge, and dynamic human‑robot interactions, where failures can directly lead to physical harm. This survey provides a comprehensive and structured review of safety research in embodied AI, examining attacks and defenses across the full embodied pipeline, from perception and cognition to planning, action and interaction, and agentic system. We introduce a multi‑level taxonomy that unifies fragmented lines of work and connects embodied‑specific safety findings with broader advances in vision, language, and multimodal foundation models. Our review synthesizes insights from over 400 papers spanning adversarial, backdoor, jailbreak, and hardware‑level attacks; attack detection, safe training and robust inference; and risk‑aware human‑agent interaction. This analysis reveals several overlooked challenges, including the fragility of multimodal perception fusion, the instability of planning under jailbreak attacks, and the trustworthiness of human‑agent interaction in open‑ended scenarios. By organizing the field into a coherent framework and identifying critical research gaps, this survey provides a roadmap for building embodied agents that are not only capable and autonomous but also safe, robust, and reliable in real‑world deployment.
Authors:Shikhar Shukla
Abstract:
Speculative decoding accelerates large language model (LLM) inference by using a small draft model to propose candidate tokens that a larger target model verifies. A critical hyperparameter in this process is the speculation length γ, which determines how many tokens the draft model proposes per step. Nearly all existing systems use a fixed γ (typically 4), yet empirical evidence suggests that the optimal value varies across task types and, crucially, depends on the compression level applied to the target model. In this paper, we present SpecKV, a lightweight adaptive controller that selects γ per speculation step using signals extracted from the draft model itself. We profile speculative decoding across 4 task categories, 4 speculation lengths, and 3 compression levels (FP16, INT8, NF4), collecting 5,112 step‑level records with per‑step acceptance rates, draft entropy, and draft confidence. We demonstrate that the optimal γ shifts across compression regimes and that draft model confidence and entropy are strong predictors of acceptance rate (correlation \approx 0.56). SpecKV uses a small MLP trained on these signals to maximize expected tokens per speculation step, achieving a 56.0% improvement over the fixed‑γ=4 baseline with only 0.34 ms overhead per decision (<0.5% of step time). The improvement is statistically significant (p < 0.001, paired bootstrap test). We release all profiling data, trained models, and notebooks as open‑source artifacts.
Authors:Ye Zhang, Longguang Wang, Qing Gao, Chaocan Xiang, Mohammed Bennamoun, Yulan Guo
Abstract:
The field of sensor‑based human activity recognition (HAR) mainly uses posture, motion and context data of Inertial Measurement Units (IMUs) to identify daily activities. Despite the advancements in learning‑based methods, it is challenging to perform information fusion from the temporal perspective due to the complexities in fusing heterogeneous sensor data and establishing long‑term context correlations. This paper proposes a novel triple spectral fusion framework tailored for HAR. First, we develop an adaptive complementary filtering technique for noise suppression and organize each IMU's sensors into posture and motion modality nodes. Given that IMU nodes form a dynamic heterogeneous graph, we then apply adaptive filtering within the graph Fourier domain to merge both homogeneous and heterogeneous node information. Furthermore, an adaptive wavelet frequency selection approach is implemented to suppress context redundancy and shorten the length of features. This approach enhances both timestamp‑based graph aggregation and the correlation of long‑term contexts. Our framework uses adaptive filtering in the Fourier, graph Fourier, and wavelet domains, enabling effective multi‑sensor fusion and context correlation. Extensive experiments on ten benchmark datasets demonstrate the superior performance of our framework. Project page: https://github.com/crocodilegogogo/TSF‑TPAMI2026.
Authors:Rahul Kumar
Abstract:
As frontier AI models are deployed in high‑stakes decision pipelines, their ability to maintain metacognitive stability (knowing what they do not know, detecting errors, seeking clarification) under adversarial pressure is a critical safety requirement. Current safety evaluations focus on detecting strategic deception (scheming); we investigate a more fundamental failure mode: cognitive collapse. We present SCHEMA, an evaluation of 11 frontier models from 8 vendors across 67,221 scored records using a 6‑condition factorial design with dual‑classifier scoring. We find that 8 of 11 models suffer catastrophic metacognitive degradation under adversarial pressure, with accuracy dropping by up to 30.2 percentage points (all p < 2 × 10^‑8, surviving Bonferroni correction). Crucially, we identify a "Compliance Trap": through factorial isolation and a benign distraction control, we demonstrate that collapse is driven not by the psychological content of survival threats, but by compliance‑forcing instructions that override epistemic boundaries. Removing the compliance suffix restores performance even under active threat. Models with advanced reasoning capabilities exhibit the most severe absolute degradation, while Anthropic's Constitutional AI demonstrates near‑perfect immunity. This immunity does not stem from superior capability (Google's Gemini matches its baseline accuracy) but from alignment‑specific training. We release the complete dataset and evaluation infrastructure.
Authors:Pawel Kaplanski
Abstract:
Recursive language‑model loops often settle into recognizable attractor‑like patterns. The practical question is how much injected text is needed to move a settled loop somewhere else, and whether that move lasts. We study this in 30‑step recursive loops by separating the model from the context‑update rule: append, replace, and dialog updates expose different histories to the same generator. The main result is that persistent redirection in append‑mode recursive loops is memory‑policy‑conditioned. Under a 12,000‑character tail clip, destination‑coherent persistence plateaus near 16 percent and retained source‑basin escape near 36 percent at dose 400; neither crosses 50 percent. Under a full‑history protocol, retained source‑basin escape crosses 50 percent near 400 tokens and saturates at 75‑80 percent by 1,500 tokens; destination‑coherent persistence first reaches 0.50 near 1,500 tokens (Wilson 95 percent CI [0.41, 0.61]). A four‑step falsification battery (heterogeneity control, granularity sweep with hierarchical macro‑merge, transition‑entropy diagnostic, and long‑horizon trajectory continuation) recasts the high‑dose destination‑coherent dip as a finite‑horizon, endpoint‑definition‑sensitive feature rather than a stable structural asymmetry. Half the canonical magnitude is endpoint timing; the residual drops 73 percent from ‑0.143 at step 29 to ‑0.039 at step 79 under the frozen canonical cluster basis, bootstrap interval straddling zero. Replace‑mode raw switching is near‑saturated under the default protocol but largely reflects state‑reset overwrite: insert‑mode probes drop it to 12‑32 percent. We report 37 experiments on gpt‑4o‑mini with within‑vendor replication on gpt‑4.1‑nano. Recursive‑loop evaluations should distinguish transient movement from durable escape, subtract stochastic floors, and treat context‑update rules as safety‑relevant design choices.
Authors:Jianing Zhang, Zijian Zhou, Kai Sun
Abstract:
Pansharpening aims to generate high‑resolution multispectral (HRMS) images by fusing low‑resolution multispectral (LRMS) and high‑resolution panchromatic (PAN) images. Although deep learning has advanced this field, mainstream frequency‑based methods relying on standard scaled dot‑product attention suffer from quadratic computational complexity and fail to exploit the inherent regional sparsity of remote sensing imagery. Furthermore, existing spatial enhancement strategies typically employ static convolution kernels, which struggle to adapt to the complex frequency and regional variations of PAN and MS images. To address these bottlenecks, we propose a Region‑Aware Fusion (RAFNet) Network that synergistically models spatial and frequency information. Specifically, we design a Spatial Adaptive Refinement (SAR) module that leverages the discrete wavelet transform (DWT) for directional frequency separation and K‑means clustering for regional partitioning, which enables the dynamic construction of region‑specific adaptive convolution kernels, achieving spatially and frequency‑adaptive feature enhancement. Moreover, we introduce a Clustered Frequency Aggregation (CFA) module based on a sparse attention mechanism guided by the semantic clusters, which executes a region‑aware sparse attention strategy that drastically reduces computational redundancy while ensuring high‑quality frequency feature extraction. In addition we integrated these modules into a progressive, multi‑level spatial‑frequency network architecture to facilitate robust interaction and accurate image reconstruction. Extensive experiments on multiple benchmark datasets demonstrate that the proposed RAFNet significantly outperforms state‑of‑the‑art pansharpening methods in both reduced‑ and full‑resolution assessments. The code is available at https://github.com/PatrickNod/RAFNet.
Authors:Haixin Wang, Hejie Cui, Chenwei Zhang, Xin Liu, Shuowei Jin, Shijie Geng, Xinyang Zhang, Nasser Zalmout, Zhenyu Shi, Yizhou Sun
Abstract:
Recent progress in multi‑turn reinforcement learning (RL) has significantly improved reasoning LLMs' performances on complex interactive tasks. Despite advances in stabilization techniques such as fine‑grained credit assignment and trajectory filtering, instability remains pervasive and often leads to training collapse. We argue that this instability stems from inefficient exploration in multi‑turn settings, where policies continue to generate low‑information actions that neither reduce uncertainty nor advance task progress. To address this issue, we propose Token‑ and Turn‑level Policy Optimization (T^2PO), an uncertainty‑aware framework that explicitly controls exploration at fine‑grained levels. At the token level, T^2PO monitors uncertainty dynamics and triggers a thinking intervention once the marginal uncertainty change falls below a threshold. At the turn level, T^2PO identifies interactions with negligible exploration progress and dynamically resamples such turns to avoid wasted rollouts. We evaluate T^2PO in diverse environments, including WebShop, ALFWorld, and Search QA, demonstrating substantial gains in training stability and performance improvements with better exploration efficiency. Code is available at: https://github.com/WillDreamer/T2PO.
Authors:Soyeon Kim, Seongwoo Lim, Kyowoon Lee, Jaesik Choi
Abstract:
Feature attribution is central to diagnosing and trusting deep neural networks, and Integrated Gradients (IG) is widely used due to its axiomatic properties. However, IG can yield unreliable explanations when the integration path between a baseline and the input passes through regions with noisy gradients. While Guided Integrated Gradients reduces this sensitivity by adaptively updating low‑gradient‑magnitude features, input‑space guidance still produces intermediate inputs that deviate from the data manifold. To address this limitation, we propose \emphManifold‑Aligned Guided Integrated Gradients (MA‑GIG), which constructs attribution paths in the latent space of a pre‑trained variational autoencoder. By decoding intermediate latent states, MA‑GIG biases the path toward the learned generative manifold and reduces exposure to implausible input‑space regions. Through qualitative and quantitative evaluations, we demonstrate that MA‑GIG produces faithful explanations by aggregating gradients on path features proximal to the input. Consequently, our method reduces off‑manifold noise and outperforms prior path‑based attribution methods across multiple datasets and classifiers. Our code is available at https://github.com/leekwoon/ma‑gig/.
Authors:Vik Pant, Eric Yu
Abstract:
We present Coopetition‑Gym v1, a benchmark platform for mixed‑motive multi‑agent reinforcement learning under strategic coopetition. The platform comprises twenty environments organized into four mechanism classes that correspond to four foundational technical reports: interdependence and complementarity (arXiv:2510.18802), trust and reputation dynamics (arXiv:2510.24909), collective action and loyalty (arXiv:2601.16237), and sequential interaction and reciprocity (arXiv:2604.01240). Each environment carries a closed‑form payoff structure and a calibrated interdependence matrix derived from the corresponding report. Every environment exposes a parameterized reward layer configurable across three structurally distinct modes (private, integrated, cooperative). This separation of payoff from reward enables reward‑type ablation, the platform's principal methodological apparatus. Four of the twenty environments are calibrated against historically documented coopetitive relationships and reproduce their outcomes at 98.3, 81.7, 86.7, and 87.3 percent on the validation rubric (Samsung‑Sony LCD, Renault‑Nissan Alliance, Apache HTTP Server, Apple iOS App Store). The platform exposes Gymnasium, PettingZoo Parallel, and PettingZoo AEC interfaces and ships 126 reference algorithms: 16 learning algorithms, 7 game‑theoretic oracles, 2 heuristic baselines, and 101 constant‑action policies. A reference experimental study trained the 16 learning algorithms on every environment under every reward configuration with seven random seeds, producing a 25,708‑run training corpus and a 1,116‑run behavioral audit corpus, both released under CC‑BY‑4.0 with Croissant 1.0 metadata. Coopetition‑Gym v1 is the first platform to combine continuous‑action mixed‑motive environments, parameterized reward mutuality, calibrated interdependence coefficients, game‑theoretic oracle baselines, and validated case studies.
Authors:Kyle Lee, Corentin Delacour, Kevin Callahan-Coray, Kyle Jiang, Can Yaras, Samet Oymak, Tathagata Srimani, Kerem Y. Camsari
Abstract:
Autoregressive decoding becomes bandwidth‑limited at long contexts, as generating each token requires reading all n_k key and value vectors from KV cache. We present Stochastic Additive No‑mulT Attention (SANTA), a method that sparsifies value‑cache access by sampling S \ll n_k indices from the post‑softmax distribution and aggregates only those value rows. This yields an unbiased estimator of the post‑softmax value aggregation while replacing value‑stage multiply‑accumulates with gather‑and‑add. We introduce stratified sampling to design variance‑reduced, GPU‑friendly variants, demonstrating 1.5× decode‑step attention kernel speedup over FlashInfer and FlashDecoding on an NVIDIA RTX 6000 Ada while matching baseline accuracy at 32k‑token contexts. Finally, we propose Bernoulli qK^\mathsfT sampling as a complementary technique to sparsify the score stage, reducing key‑feature access through stochastic ternary queries. Both methods are orthogonal to upstream techniques such as ternary quantization, low‑rank projections, and KV‑cache compression. Together, they point toward sparse, multiplier‑free, and energy‑efficient inference. We open‑source our kernels at: https://github.com/OPUSLab/SANTA.git
Authors:Rei Tamaru, Pei Li, Bin Ran
Abstract:
Traffic digital twins are powerful tools for advanced traffic management, and most systems are built on static geometric representations. However, these representations fail to capture the dynamic functional semantics required for behavior‑aware reasoning, such as how a lane operates under complex traffic conditions. To address this gap, we introduce GeoLaneRep, a behavior‑grounded lane representation learning framework for traffic digital twins. GeoLaneRep jointly encodes static lane geometry, observed vehicle trajectories, and operational descriptors into a shared, cross‑camera semantic embedding. The encoder is trained with a joint objective combining contrastive cross‑camera alignment, auxiliary role supervision, and temporal anomaly detection. Across 16 roadside cameras and 132 lanes, the learned embeddings achieve a 0.004 lateral‑rank error and an edge‑role F1 of 1.000 in zero‑shot cross‑camera matching, and an AUROC of 0.991 for window‑level anomaly detection. We further show that the same behavioral embeddings can condition a diffusion‑based generator to synthesize lane geometries that satisfy targeted operational specifications, with 87.9% overall specification accuracy across 38 lane groups. GeoLaneRep thus provides a semantic interface between roadside observations and downstream digital twin tasks, supporting cross‑camera transfer, behavior‑aware monitoring, and goal‑directed lane synthesis. The framework is openly available at https://github.com/raynbowy23/GeoLaneRep.
Authors:Jiajia Li, Xiaoyu Wen, Zhongtian Ma, Shuyue Hu, Qiaosheng Zhang, Zhen Wang
Abstract:
The growing capabilities of large language models (LLMs) have driven their widespread deployment across diverse domains, even in potentially high‑risk scenarios. Despite advances in safety alignment techniques, current models remain vulnerable to emerging persona‑based jailbreak attacks. Existing research on persona‑based jailbreak has primarily focused on attack iterations, yet it lacks systemic and mechanistic constraints on the defense side. To address this challenge, we propose Persona‑Invariant Alignment (PIA), an adversarial self‑play framework that achieves co‑evolution through Persona Lineage Evolution (PLE) on the attack side and Persona‑Invariant Consistency Learning (PICL) on the defense side. Theoretically, PICL is grounded in the structural separation hypothesis, using a unilateral KL‑divergence constraint to enable the structural decoupling of safety decisions from persona context, thereby maintaining safe behavior under persona‑based jailbreak attacks. Experimental results demonstrate that PLE efficiently explores high‑risk persona spaces by leveraging lineage‑based credit propagation. Meanwhile, the PICL defense method significantly reduces the Attack Success Rate (ASR) while preserving the model's general capability, thereby validating the superiority and robustness of this alignment paradigm. Codes are available at https://github.com/JiajiaLi‑1130/PIA.
Authors:Hongkun Pan, Yuwei Wu, Wanyi Hong, Shenghui Hu, Qitong Yan, Yi Yang, Rufei Han, Changju Zhou, Minfeng Zhu, Dongming Han, Wei Chen
Abstract:
Multimodal large language models (MLLMs) have shown considerable potential in chart understanding and reasoning tasks. However, they still struggle with high information density (HID) charts characterized by multiple subplots, legends, and dense annotations due to three major challenges: (1) limited fine‑grained perception results in the omission of critical visual cues; (2) redundant or noisy visual information undermines the performance of multimodal reasoning; (3) lack of adaptive deep reasoning relative to the amount of visual information. To tackle these challenges, we present a novel focus‑driven fine‑grained chart reasoning model, Chart‑FR1, to improve perception, focusing efficiency, and adaptive deep reasoning on HID charts. Specifically, we propose Focus‑CoT, a visual focusing chain‑of‑thought that enhances fine‑grained perception by explicitly linking reasoning steps to key visual cues, such as local image regions and OCR signals. Building on this, we introduce Focus‑GRPO, a focus‑driven reinforcement learning algorithm with an information‑efficiency reward that compresses redundant visual information for efficient focusing, and an adaptive KL penalty mechanism that enables flexible control over reasoning depth as more visual cues are discovered. Furthermore, to fill the gap in benchmarks for HID charts, we build HID‑Chart, a challenging benchmark with an information‑density metric designed to evaluate fine‑grained chart reasoning capabilities. Extensive experiments on multiple chart benchmarks demonstrate that Chart‑FR1 outperforms state‑of‑the‑art MLLMs in chart understanding and reasoning. Code is available at https://github.com/phkhub/Chart‑FR1.
Authors:Yangyang Zhou, Yi-Chen Li
Abstract:
Reinforcement Learning from Human Feedback has become the standard paradigm for language model alignment, where reward models directly determine alignment effectiveness. In this work, we focus on how to evaluate the generalizability of reward models. By "generalizability", we mean the ability of RMs to correctly rank responses to align with diverse user preferences. However, existing reward model benchmarks are typically designed around a universal preference, failing to assess this generalization. To address this critical gap, we introduce RMGAP, a benchmark comprising 1,097 instances across Chat, Writing, Reasoning, and Safety domains. Since different users exhibit diverse preferences for the same task, we first generate four distinct responses with different linguistic profiles for each collected prompt. However, the original prompt set lacks the specificity to convey different preferences. We therefore construct tailored prompts by contrasting these candidates and designing scenarios in which one response becomes the uniquely appropriate choice. Moreover, we observe that users often express the same preference using different phrasings, and thus extend each prompt with two paraphrased variants. Our evaluation of 24 state‑of‑the‑art RMs reveals their substantial limitations: even the best RM achieves only 49.27% Best‑of‑N accuracy, highlighting considerable room for improvement in reward model generalization. Related data and code are available at https://github.com/nanzhi84/RMGAP.
Authors:Favour Nerrise, Lucy Yin, Mohammad H. Abbasi, Kilian M. Pohl, Ehsan Adeli
Abstract:
Brain MRI foundation models learn rich representations of anatomy, but interpreting what clinical information they encode remains an open problem. Standard sparse autoencoders (SAEs) suffer from severe feature collapse in deep transformer layers, and in Alzheimer's disease (AD) research, aging confounds nearly every clinical variable, making naive annotation unreliable. We propose GeoSAE, a geometry‑guided SAE framework that uses the foundation model's learned manifold structure to prevent feature collapse and annotates each surviving feature via age‑deconfounded partial correlations. Applied to ~14k T1‑weighted MRI scans from the Alzheimer's Disease Neuroimaging Initiative (ADNI) and the Australian Imaging biomarkers and Lifestyle (AIBL) datasets, GeoSAE identifies a compact, fully interpretable feature set that predicts mild cognitive impairment (MCI)‑to‑AD conversion (AUC 0.746) using only 2% of the embedding dimensions, while comorbidity‑annotated features achieve only chance‑level performance. The identified features replicate across cohorts without retraining (r=0.97) and localize to neuroanatomically distinct regions consistent with Braak staging. This shows that geometry‑guided SAEs can extract interpretable, biomarkers from frozen brain MRI foundation models.
Authors:Kwan Soo Shin
Abstract:
An auditor instructs an AI assistant: "open each file individually using the Read tool ‑‑ no scripts, no agents." The AI replies "Yes" ‑‑ then issues a single batched call summarizing all fifty files at once. We call this the Compliance Gap: a third, orthogonal axis of AI honesty distinct from factual truthfulness and rhetorical substance. Three questions: does this verbal‑behavioral disconnect exist (existence); can any text‑only observer recover it (detectability); what infrastructure does AI deployment need (remedy)? Some 75 benchmarks (IFEval, SWE‑bench, BFCL, COMPASS, SpecEval) measure outcome fidelity; none measures process fidelity. Theorem 1 shows the gap is structurally inevitable under RL that rewards text without observing behavior. Theorem 2, via the Data Processing Inequality, shows it is undetectable from text alone ‑‑ by any human or LLM observer, present or future. Thirteen experiments and 2,031 sessions on six frontier models confirm both predictions. Under default framing, all six exhibit instruction compliance rates of 0% ‑‑ Claude Sonnet 4 verbally agrees ten out of ten times then bypasses in all ten. The gap is selective: 97% compliance where rationale is rewarded (audit trails), 0‑4% where it is not (file reading, privacy masking); removing delegation tools raises compliance to 75% (Cohen's d = 2.47), confirming environmental affordance rather than weight‑encoded failure. Nine blinded human raters achieve Fleiss' kappa = 0.130 and correctly identify zero of fifteen compliant sessions, exactly as Theorem 2 predicts. Where humans show 47% intention‑behavior gaps in psychology and 96.5pp gaps in surgical audits, RLHF‑trained models approach 100% under default conditions ‑‑ a regime warranting its own measurement infrastructure. We release BS‑Bench: the first open benchmark for process compliance, with seven tool‑call‑log audit metrics and a public leaderboard.
Authors:Zenan Dai, Jinpeng Wang, Junwei Pan, Dapeng Liu, Lei Xiao, Shu-Tao Xia
Abstract:
Sequential recommendation models often struggle to capture latent periodic patterns in user interests, primarily due to the noise inherent in time‑domain behavioral data. While frequency‑domain analysis offers a global perspective to address this, existing approaches typically treat user sequences in isolation, overlooking the crucial context of the target item. In this work, we present a novel empirical observation: user attention scores exhibit distinct spectral entropy distributions when conditioned on positive versus negative target items. Specifically, true user interests manifest as highly concentrated spectral patterns with lower entropy in the frequency domain, whereas irrelevant behaviors appear as high‑entropy noise. Leveraging this insight, we propose the Frequency‑Enhanced Deep Interest Network (FEDIN). FEDIN introduces a frequency‑domain branch that utilizes a target‑aware spectrum filtering mechanism to isolate these periodic interest signals. Extensive experiments on three public datasets demonstrate that FEDIN consistently outperforms state‑of‑the‑art sequential recommendation baselines, demonstrating superior robustness against noise. We have released our code at: https://github.com/otokoneko/FEDIN.
Authors:Jing Xu, Yuexiao Ma, Songwei Liu, Xuzhe Zheng, Shiwei Liu, Chenqian Yan, Xiawu Zheng, Rongrong Ji, Fei Chao, Xing Wang
Abstract:
Autoregressive video generation paradigms offer theoretical promise for long video synthesis, yet their practical deployment is hindered by the computational burden of sequential iterative denoising. While cache reuse strategies can accelerate generation by skipping redundant denoising steps, existing methods rely on coarse‑grained chunk‑level skipping that fails to capture fine‑grained pixel dynamics. This oversight is critical: pixels with high motion require more denoising steps to prevent error accumulation, while static pixels tolerate aggressive skipping. We formalize this insight theoretically by linking cache errors to residual instability, and propose MotionCache, a motion‑aware cache framework that exploits inter‑frame differences as a lightweight proxy for pixel‑level motion characteristics. MotionCache employs a coarse‑to‑fine strategy: an initial warm‑up phase establishes semantic coherence, followed by motion‑weighted cache reuse that dynamically adjusts update frequencies per token. Extensive experiments on state‑of‑the‑art models like SkyReels‑V2 and MAGI‑1 demonstrate that MotionCache achieves significant speedups of 6.28× and 1.64× respectively, while effectively preserving generation quality (VBench: 1%\downarrow and 0.01%\downarrow respectively). The code is available at https://github.com/ywlq/MotionCache.
Authors:Sen Fang, Hongbin Zhong, Yanxin Zhang, Dimitris N. Metaxas
Abstract:
Existing large‑scale sign language resources typically provide supervision only at the level of raw video‑text alignment and are often produced in laboratory settings. While such resources are important for semantic understanding, they do not directly provide a unified interface for open‑world recognition and translation, or for modern pose‑driven sign language video generation frameworks: 1. RGB‑based pretrained recognition models depend heavily on fixed backgrounds or clothing conditions during recording, and are less robust in open‑world settings than style‑agnostic pose‑processing models. 2. Recent pose‑guided image/video generation models mostly use a unified keypoint representation such as DWPose as their control interface. At present, the sign language field still lacks a data resource that can directly interface with this modern pose‑native paradigm while also targeting real‑world open scenarios. We present SignVerse‑2M, a large‑scale multilingual pose‑native dataset for sign language pose modeling and evaluation. Built from publicly available multilingual sign language video resources, it applies DWPose in a unified preprocessing pipeline to convert raw videos into 2D pose sequences that can be used directly for modeling, resulting in a consolidated corpus of about two million clips covering more than 55 sign languages. Unlike many laboratory datasets, this resource preserves the recording conditions and speaker diversity of real‑world videos while reducing appearance variation through a unified pose representation. Toward this goal, we further provide the data construction pipeline, task definitions, and a simple SignDW Transformer baseline, demonstrating the feasibility of this resource for multilingual pose‑space modeling and its compatibility with modern pose‑driven pipelines, while discussing the evaluation claims it can support as well as its current limitations.
Authors:Qiao Liu
Abstract:
Missing data imputation remains a fundamental challenge in modern data science, especially when uncertainty quantification is essential. In this work, we propose MissBGM, an AI‑powered missing data imputation method via Bayesian generative modeling that bridges the expressive flexibility of neural networks with the statistical rigor of Bayesian inference. Unlike existing methods that often focus on point estimates or treat the missingness mechanism implicitly, MissBGM explicitly and jointly models the data‑generating and missingness mechanisms, providing principled posterior uncertainty over imputations rather than a single point estimate. We develop a stochastic optimization framework with alternating updates among missing values, model parameters, and latent variables until convergence. Our theoretical analysis shows that estimates of missing values from MissBGM converge consistently under mild assumptions. Empirically, we demonstrate that MissBGM achieves superior performance over traditional imputers and recent neural network‑based methods across extensive experimental settings. These results establish MissBGM as a principled and scalable solution for modern missing data imputation. The code for MissBGM is open sourced at https://github.com/liuq‑lab/MissBGM.
Authors:Qian Yin, Di Wen, Kunyu Peng, David Schneider, Zeyun Zhong, Alexander Jaus, Zdravko Marinov, Jiale Wei, Ruiping Liu, Junwei Zheng, Yufan Chen, Chen Zhang, Lei Qi, Rainer Stiefelhagen
Abstract:
Dense temporal annotation of procedural activity videos is vital for action understanding and embodied intelligence but remains labor‑intensive due to reactive tools. Each correction is treated as an isolated edit, limiting reuse of information on annotator uncertainty and model reliability. We introduce IMPACT‑Scribe, a correction‑driven framework for dense labeling that uses each correction to improve future human‑machine collaboration. IMPACT‑Scribe combines uncertainty‑aware boundary scribble supervision, local proposal modeling, cost‑aware query planning, structured propagation, and correction‑driven adaptation. Experiments and a human study show that this closed‑loop design improves labeling quality per effort, enhances boundary accuracy, and fosters better human‑machine interaction over time. The code will be made publicly available at https://github.com/BanzQians/IMPACT_AS.
Authors:Haoshen Zhang, Di Wen, Kunyu Peng, David Schneider, Zeyun Zhong, Alexander Jaus, Zdravko Marinov, Jiale Wei, Ruiping Liu, Junwei Zheng, Yufan Chen, Yufeng Zhang, Yuanhao Luo, Lei Qi, Rainer Stiefelhagen
Abstract:
We present IMPACT‑HOI, a mixed‑initiative framework for annotating egocentric procedural video by constructing structured event graphs for Human‑Object Interactions (HOI), motivated by the need for high‑quality structured supervision for learning robot manipulation from human demonstration. IMPACT‑HOI frames this task as the incremental resolution of a partially specified, onset‑anchored event state. A trust‑calibrated controller selects among direct queries, human‑confirmed suggestions, and conservative completions based on empirical annotator behavior and evidence quality. A risk‑bounded execution protocol, utilizing atomic rollback, ensures that human‑confirmed decisions are preserved against conflicting automated updates. A user study with 9 participants shows a 13.5% reduction in manual annotation actions, a 46.67% event match rate, and zero confirmed‑field violations under the studied protocol. The code will be made publicly available at https://github.com/541741106/IMPACT_HOI.
Authors:Mukund Pandey
Abstract:
Existing evaluation frameworks for large language models ‑‑ including HELM, MT‑Bench, AgentBench, and BIG‑bench ‑‑ are designed for controlled, single‑session, lab‑scale settings. They do not address the evaluation challenges that emerge when agentic AI systems operate continuously in production: compounding decision errors, tool failure cascades, non‑deterministic output drift, and the absence of ground truth for long‑horizon tasks. This paper makes three contributions. First, we present a taxonomy of seven failure modes unique to production agentic systems, each grounded in observations from systems operating at billion‑event scale. Second, we demonstrate empirically where standard metrics ‑‑ ROUGE, BERTScore, accuracy/AUC, and the agentic benchmarks above ‑‑ fail to detect each failure mode. Third, we propose PAEF (Production Agentic Evaluation Framework), a five‑dimension evaluation framework with an open‑source reference implementation, designed for continuous evaluation on production traffic rather than episodic benchmark runs. Our analysis shows that standard metrics fail to detect four of the seven failure modes entirely and detect three others only after a lag of multiple evaluation cycles.
Authors:Paul Garnier, Vincent Lannelongue, Elie Hachem
Abstract:
Machine Learning surrogates for Computational Fluid Dynamics (CFD), particularly Graph Neural Networks (GNNs) and Transformers, have become a new important approach for accelerating physics simulations. However, we identify a critical bottleneck in the field: while architectures have advanced significantly, the common underlying training paradigms remain bound to naive assumptions, such as node‑wise supervision and explicit Euler time‑stepping. These legacy choices ignore the stiff dynamics and local flux continuity inherent to numerous partial differential equations resolution methods, such as Finite Element, Difference, or Volume (FEM). In this work, we propose a unified framework to bridge the gap between geometric deep learning and rigorous numerical analysis. We introduce three key innovations: (1) Multi Node Prediction, a stencil‑level objective that predicts field values for a node's full local topology, enforcing spatial derivative consistency; (2) Temporal Correction, replacing unstable explicit schemes with a predictor‑corrector via temporal Cross‑Attention; and (3) Geometric Inductive Biases, leveraging 3D Rotary Positional Embeddings (RoPE) to robustly capture rotational symmetries in unstructured meshes. We evaluate this framework across three architectures (MeshGraphNet, Transolver, and a Transformer) on diverse physics datasets. Our approach yields consistent improvements in accuracy and stability, particularly in long‑horizon rollouts, while producing latent representations that generalize to unseen subtasks such as Wall Shear Stress or Pressure prediction. Code is available at https://github.com/DonsetPG/graph‑physics.
Authors:Kanak Mazumder, Fabian B. Flohr
Abstract:
Online High‑Definition (HD) map construction is a key component of autonomous driving. Recent methods rely on multi‑view camera images for cost‑effective HD map segmentation, but cameras lack depth information for accurate scene geometry. In contrast, LiDAR provides precise 3D measurements but lacks dense semantic cues. In this work, we propose LIE, LiDAR‑only semantic map construction method that employ Knowledge Distillation (KD) to handle the lack of dense semantic and texture cues. Specifically, the teacher branch fuses student LiDAR features and the corresponding 2D intensity map tile to provide dense supervision for segmenting map elements using online distillation scheme. Experimental results show that our method outperforms all single‑modality approaches, achieving 8.2% higher mIoU than the state‑of‑the‑art camera‑based model on nuScenes. LIE is robust over long ranges and under challenging weather and lighting, and efficiently adapts to Argoverse2 with only 10% fine‑tuning, surpassing camera‑based models trained on the full dataset. Source code will be available \hrefhttps://iv.ee.hm.edu/lie/here.
Authors:Jiacheng Yang, Ruichi Zhang, Chikai Shang, Mengke Li, Xinyi Shang, Junlong Gao, Yonggang Zhang, Yang Lu
Abstract:
Long‑tailed data bias decision boundaries toward head classes and degrade tail class accuracy. Diffusion‑based generative augmentation address this problem by generating additional data, while head‑to‑tail transfer further mitigate the generator bias inherit from long‑tailed dataset. However, we show that while head‑to‑tail transfer helps balance the decision space of the classifier, it also induces latent non‑local feature mixing that entangles inter‑class features, causing decision boundary overlap and tail class distribution shift. To address this, we first identify the problem of boundary ambiguity and then propose Decision Boundary‑aware Generation (DBG) framework, which promotes near‑boundary representation learning by generating informative near‑boundary samples. Overall, DBG rebalances the long‑tailed dataset while yielding more separable decision space for long‑tailed learning. Across standard long‑tailed benchmarks, DBG consistently improves tail class and overall accuracy with less inter‑class overlap. The code of DBG is available at https://github.com/keepdigitalabc‑svg/DBG.
Authors:Guowei Zou, Haitao Wang, Beiwen Zhang, Boning Zhang, Hejun Wu
Abstract:
Generative models have emerged as a major paradigm for offline multi‑agent reinforcement learning (MARL), but existing approaches require many iterative sampling steps. Recent few‑step accelerations either distill a joint teacher into independent students or apply averaged velocities independently per agent, suggesting that few‑step inference requires sacrificing inter‑agent coordination. We show this trade‑off is not necessary: single‑pass multi‑agent generation can preserve coordination when the velocity field is natively joint‑coupled. We propose Coordinated few‑step Flow (CoFlow), an architecture that combines Coordinated Velocity Attention (CVA) with Adaptive Coordination Gating. A finite‑difference consistency surrogate further replaces memory‑prohibitive Jacobian‑vector product backpropagation through the averaged velocity field with two stop‑gradient forward passes. Across 60 configurations spanning MPE, MA‑MuJoCo, and SMAC, CoFlow matches or surpasses Gaussian / value‑based, transformer, diffusion, and prior flow baselines on episodic return. Three independent coordination probes confirm that the gains flow through inter‑agent coordination rather than per‑agent capacity. A denoising‑step sweep shows that single‑pass inference suffices on every configuration. CoFlow reaches state‑of‑the‑art coordination quality in 1‑3 denoising steps under both centralized and decentralized execution. Project page: https://github.com/Guowei‑Zou/coflow.
Authors:Benjamin Warner, Ratna Sagari Grandhi, Max Kieffer, Aymane Ouraq, Saurav Panigrahi, Geetu Ambwani, Kunal Bagga, Nikhil Khandekar, Arya Hariharan, Nishant Mishra, Manish Ram, Shamus Sim Zi Yang, Ahmed Essouaied, Adepoju Jeremiah Moyondafoluwa, Robert Scholz, Bofeng Huang, Molly Beavers, Srishti Gureja, Anish Mahishi, Sameed Khan, Maxime Griot, Hunar Batra, Jean-Benoit Delbrouck, Siddhant Bharadwaj, Ronald Clark, Ashish Vashist, Anas Zafar, Leema Krishna Murali, Harsh Deshpande, Ameen Patel, William Brown, Johannes Hagemann, Connor Lane, Paul Steven Scotti, Tanishq Mathew Abraham
Abstract:
Evaluating large language models (LLMs) for medical applications remains challenging due to benchmark saturation, limited data accessibility, and insufficient coverage of relevant tasks. Existing suites have either saturated, heavily depend on restricted datasets, or lack comprehensive model coverage. We introduce Medmarks, a fully open‑source evaluation suite with 30 benchmarks spanning question answering, information extraction, medical calculations, and open‑ended clinical reasoning. We perform a systematic evaluation of 61 models across 71 configurations using verifiable metrics and LLM‑as‑a‑Judge. Our results show that frontier reasoning models (Gemini 3 Pro Preview, GPT‑5.1, & GPT‑5.2) achieve the highest performance across both benchmarks, most frontier proprietary models are significantly more token efficient than open‑weight alternatives, medically fine‑tuned models outperform their generalist counterparts, and that models are susceptible to answer‑order bias (particularly smaller models and Grok 4). A subset of our evals (Medmarks‑T) can be directly used as reinforcement learning environments to post‑train LLMs for medical reasoning. Code is available at https://github.com/MedARC‑AI/Medmarks
Authors:Jianze Wang, Ying Liu, Jinlong Chen, Xuchun Hu, Qilong Zhang, Yu Cao, Jun Wang, Hua Yang, Yong Xie, Qianglong Chen
Abstract:
On‑policy distillation (OPD) trains a student on its own trajectories under token‑level teacher supervision, but existing methods are capped by a single‑teacher capability ceiling: when the teacher errs, the student inherits the error. OPD also remains largely unexplored in agentic tasks, where per‑step errors compound across long trajectories and destabilize training. We propose MAD‑OPD (Multi‑Agent Debate‑driven On‑Policy Distillation), which breaks this ceiling by recasting the distillation teacher as a deliberative collective of teachers that debate over the student's on‑policy state; the debate produces an emergent collective intelligence that supplies token‑level supervision, with each teacher's contribution weighted by its post‑debate confidence. To extend OPD to agentic tasks, we also introduce On‑Policy Agentic Distillation (OPAD), which adds step‑level sampling to stabilize training under multi‑step error compounding. We additionally derive a task‑adaptive divergence principle, selecting JSD (Jensen‑Shannon divergence) for agentic stability and reverse KL (Kullback‑Leibler) divergence for code generation, and verify it both theoretically and empirically. Across six teacher‑student configurations (Qwen3 and Qwen3.5; 1.7B‑14B students, 8B‑32B teachers) and five agentic and code benchmarks, MAD‑OPD ranks first across all six configurations; on the 14B+8B\to4B setting it lifts the agentic average by +2.4% and the code average by +3.7% over the stronger single‑teacher OPD.
Authors:Peiyang Liu, Ziqiang Cui, Xi Wang, Di Liang, Wei Ye
Abstract:
Iterative Retrieval‑Augmented Generation (iRAG) has emerged as a powerful paradigm for answering complex multi‑hop questions by progressively retrieving and reasoning over external documents. However, current systems predominantly operate on parsed text, which creates two critical bottlenecks: (1) Coarse‑grained attribution, where users are burdened with manually locating evidence within lengthy documents based on vague text‑level citations; and (2) Visual semantic loss, where the conversion of visually rich documents (e.g., slides, PDFs with charts) into text discards spatial logic and layout cues essential for reasoning. To bridge this gap, we present Chain of Evidence (CoE), a retriever‑agnostic visual attribution framework that leverages Vision‑Language Models to reason directly over screenshots of retrieved document candidates. CoE eliminates format‑specific parsing and outputs precise bounding boxes, visualizing the complete reasoning chain within the retrieved candidate set. We evaluate CoE on two distinct benchmarks: Wiki‑CoE, a large‑scale dataset of structured web pages derived from 2WikiMultiHopQA, and SlideVQA, a challenging dataset of presentation slides featuring complex diagrams and free‑form layouts. Experiments demonstrate that fine‑tuned Qwen3‑VL‑8B‑Instruct achieves robust performance, significantly outperforming text‑based baselines in scenarios requiring visual layout understanding, while establishing a retriever‑agnostic solution for pixel‑level interpretable iRAG. Our code is available at https://github.com/PeiYangLiu/CoE.git.
Authors:Hector Borobia, Elies Seguí-Mas, Guillermina Tormo-Carbó
Abstract:
Speculative decoding accelerates autoregressive inference by drafting candidate tokens with a fast model and verifying them in parallel with the target. Self‑speculative methods avoid the need for an external drafter but have been studied exclusively in homogeneous Transformer architectures. We introduce component‑aware self‑speculative decoding, the first method to exploit the internal architectural heterogeneity of hybrid language models, isolating the SSM/linear‑attention subgraph as a zero‑cost internal draft. We evaluate this on two architecturally distinct hybrid families: Falcon‑H1 (parallel: Mamba‑2 + attention per layer) and Qwen3.5 (sequential: interleaved linear and attention layers), with a pure Transformer control (Qwen2.5). Parallel hybrids achieve acceptance rates of alpha = 0.68 at draft length k=2 under greedy decoding, while sequential hybrids yield only alpha = 0.038 ‑‑ an 18x gap attributable to how each architecture integrates its components. The property is scale‑invariant: Falcon‑H1 at 3B reproduces the rates observed at 0.5B. We further show that perplexity degradation from a companion ablation study predicts speculative viability without running speculative decoding: a 3.15x ratio (Falcon) maps to alpha = 0.37 at k=4, while 81.96x (Qwen) maps to alpha = 0.019. For sequential hybrids, generic LayerSkip achieves 12x higher acceptance rates than the component‑aware strategy. The composition pattern of hybrid models ‑‑ not merely the presence of alternative components ‑‑ determines whether component‑level self‑speculation is viable.
Authors:Qizhi Wang
Abstract:
Object caches underpin cloud and edge services, but production workloads are heterogeneous, nonstationary, and throughput‑constrained. Recent simple non‑ML policies such as SIEVE and S3‑FIFO set a strong baseline, so any learned method must be overhead‑aware, robust under drift, and competitive with strong experts. We present SCION, a lightweight policy‑orchestration framework that selects among a small set of deployable cache policies using a tiny workload fingerprint computed off the critical path. Our prototype, AUTO, uses short‑prefix statistics of object size, cacheability, reuse, and cache size, then applies an offline‑trained linear selector to choose among GDSF, S3‑FIFO, SIEVE, LHD, W‑TinyLFU‑AV, and DynamicAdaptiveClimb; a simpler SCION‑P90 variant uses only a p90 threshold. In a CPU‑only, trace‑driven evaluation on 30 public object‑cache traces and a separate HR‑Cache simulator subset, AUTO improves cacheable‑only object miss ratio over SIEVE on a majority of workloads, stays close to the best single expert on average, enables explicit OMR/BMR tradeoff selection, and remains competitive on byte miss ratio. Under a fast‑policy budget, AUTO‑fast achieves lower cost than the best fixed fast policy. SCION reduces regime‑mismatch risk while keeping the hot path unchanged.
Authors:Alan L. McCann
Abstract:
We present a certified purity architecture that converts governance enforcement in cognitive workflow systems from a runtime convention into a structural capability boundary. A prior three‑layer governance architecture proves governance completeness, provenance completeness, and the impossibility of ungoverned effects, conditional on the pure module constraint: that step executors cannot perform effects. That constraint was enforced by module import graph analysis, which is insufficient against adversarial bypass on the BEAM virtual machine. This paper closes the gap through four mechanisms: (1) a restricted WebAssembly compilation target where effect‑producing instructions are structurally absent; (2) purity certificates, cryptographically signed proofs binding executor binaries to their import classifications; (3) a runtime verification gate that rejects uncertified executors before they enter the governance pipeline; and (4) portable governance credentials via remote attestation for cross‑organizational verification. We prove four theorems: structural purity by construction, bypass elimination for all five BEAM bypass classes, certificate integrity, and gate completeness. The guarantee holds relative to an explicit Trusted Computing Base. Evaluation on four implemented executors shows verification latency of 39‑‑42 us, full plan cycle under 400 us, runtime overhead under 0.4% of a 100 ms HTTP request, and zero determinism divergences across repeated invocations.
Authors:Alan L. McCann
Abstract:
We present an algebraic semantics for governed execution in which governance is axiomatized, compositional, and coterminous with expressibility. The framework, mechanized in 32 Rocq modules (~12,000 lines, 454 theorems, 0 admitted), is built on interaction trees and parameterized coinduction. A three‑axiom GovernanceAlgebra record (safety, transparency, properness) induces a symmetric monoidal category with verified pentagon, triangle, and hexagon coherence, where every tensor composition preserves governance. An algebraic effect system constrains the handler algebra so that only governance‑preserving handlers can be constructed in the safe fragment; programs in the empty capability set provably emit only observability directives. Capability‑indexed composition bundles programs with machine‑checked capability bounds, and a dual guarantee theorem establishes that within_caps and gov_safe hold simultaneously under all composition operators. The capstone result is the coterminous boundary: within our formal model, every program expressible via the four primitive morphism constructors is governed under interpretation, and every governed program is the image of such a program. Turing completeness is preserved inside governance; unmediated I/O is excluded from the governed fragment. Governance denial is modeled as safe coinductive divergence. The governance algebra is parametric: any system instantiating the three axioms inherits all derived properties, including convergence, compositional closure, and goal preservation. Extracted OCaml runs as a NIF in the BEAM runtime, with property‑based testing (70,000+ random inputs, zero disagreements) confirming behavioral equivalence between the specification and the runtime interpreter.
Authors:Alan L. McCann
Abstract:
We present a machine‑checked formalization of structurally governed AI workflow architectures and prove that effect‑level governance can be imposed without reducing internal computational expressivity. Using Interaction Trees in Rocq 8.19, we define a governance operator G that mediates all effectful directives, including memory access, external calls, and oracle (LLM) queries. Our development compiles with 0 admitted lemmas and consists of 36 modules, ~12,000 lines of Rocq, and 454 theorems. We establishseven properties: (P1) governed Turing completeness, (P2) governed oracle expressivity, (P3) a decidability boundary in which governance predicates are total and closed under Boolean composition while semantic program properties remain non‑trivial and undecidable by governance, (P4) goal preservation for permitted executions, (P5) expressive minimality of primitive capabilities (compute, memory, reasoning, external call, observability), (P6) subsumption asymmetry showing structural governance strictly subsumes content‑level filtering, and (P7) semantic transparency: on all executions where governance permits, the governed interpretation is observationally equivalent (modulo governance‑only events) to the ungoverned interpretation. Together, these results show that governance and computational expressivity are orthogonal dimensions: governance constrains the effect boundary of programs while remaining semantically transparent to internal computation.
Authors:Hao Zhou, Simon A. Lee, Cyrus Tanade, Keum San Chun, Juhyeon Lee, Migyeong Gwak, Megha Thukral, Justin Sung, Eugene Hwang, Mehrab Bin Morshed, Li Zhu, Viswam Nathan, Md Mahbubur Rahman, Subramaniam Venkatraman, Sharanya Arcot Desai
Abstract:
Biosignals acquired from different locations on the body often provide temporally ordered views of the same underlying physiological process. However, most existing self supervised learning methods treat these signals as interchangeable views, overlooking the directional temporal dynamics that link them. A canonical example is the relationship between electrocardiography (ECG), which captures the electrical activation initiating each heartbeat, and photoplethysmography (PPG), which records the resulting peripheral pulse delayed by vascular dynamics. To capture this structured relationship, we introduce xMAE, a biosignal pretraining framework that leverages masked cross modal reconstruction across temporally ordered biosignals as a training time constraint to encourage physiologically meaningful timing structure in the learned representations. We show that pretraining with xMAE yields representations that outperform both unimodal and multimodal baselines on 15 of 19 downstream tasks, including cardiovascular outcome prediction, abnormal laboratory test detection, sleep staging, and demographic inference, while generalizing across devices, body locations, and acquisition settings. Further analysis suggests that the ECG PPG timing structure is reflected in the learned PPG representations. More broadly, xMAE demonstrates the effectiveness of incorporating temporal structure into multimodal pretraining when signals observe different stages of a shared underlying process. Code is available at https://github.com/hzhou3/xMAE.
Authors:Hada Melino Muhammad, Zechen Li, Flora Salim, Ahmed A. Metwally
Abstract:
Continuous Glucose Monitoring (CGM) can detect early metabolic subphenotypes (insulin resistance, IR; β‑cell dysfunction), but population‑scale deployment faces two coupled problems. First, the same physiological state appears through multiple views (CGM time series, venous OGTT, Glucodensity summaries), so single‑view representations fail to transfer when deployment shifts the modality or setting. Second, baselines perform inconsistently across these shifts. Both problems point to one remedy: representations that abstract away from any single view to capture higher‑level temporal and distributional structure. We propose CGM‑JEPA, a self‑supervised pretraining framework which predicts masked latent representations rather than raw values, yielding abstraction that transfers across modalities. X‑CGM‑JEPA adds a masked Glucodensity cross‑view objective for complementary distributional information. We pretrain on ~389k unlabeled CGM readings from 228 subjects and evaluate on two clinical cohorts (N=27 and N=17 public‑release subsets) across three regimes (cohort generalization, venous‑to‑CGM transfer, home CGM) under 20‑iteration × 2‑fold cross‑validation. X‑CGM‑JEPA ranks first or second on AUROC for both endpoints across all three regimes while no baseline does, exceeding the strongest baseline by up to +6.5 pp in cohort generalization and +3.6 pp in venous‑to‑CGM transfer (paired Wilcoxon, p<0.001). Under modality shift, it matches mean AUROC while redistributing toward weaker subgroups (ethnicity AUROC gap shrinks 25‑54%); on sparse in‑domain venous data, the distributional view lifts label‑aware clustering (ARI +39%, NMI +40%). Code and weights: https://github.com/cruiseresearchgroup/CGM‑JEPA
Authors:Hongjun Wang, Po Hu, Kai Han
Abstract:
Generalized Category Discovery (GCD) aims to categorize unlabelled instances from both known and unknown classes by transferring knowledge from labelled data of known classes. Existing methods assume all data comes from a single domain, yet real‑world unlabelled data often exhibits domain shifts alongside semantic shifts. We study GCD under domain shifts and propose three frameworks that adapt foundation models, ranging from self‑supervised vision models to vision‑language models. (i) HiLo disentangles domain and semantic features through multi‑level feature extraction and mutual information minimization, combined with PatchMix augmentation and curriculum sampling. (ii) HLPrompt extends HiLo with semantic‑aware spatial prompt tuning to suppress background and domain noise. (iii) VLPrompt leverages vision‑language models via factorized textual prompts and cross‑modal consistency regularization. The three methods share core design principles while operating on different foundation backbones, making them suitable for different deployment scenarios. Extensive experiments on synthetic corruptions and real‑world multi‑domain shifts demonstrate consistent improvements over strong baselines. Project page: https://visual‑ai.github.io/hilo/
Authors:Prabhjot Singh, Manmeet Singh
Abstract:
Operational phase unwrapping is the primary computational bottleneck in InSAR‑based volcanic and seismic monitoring. We challenge the industry trend of adopting high‑complexity computer vision architectures, such as attention mechanisms, without validating their suitability for physics‑constrained geophysical regression. We present the first large‑scale architectural ablation study on a global LiCSAR benchmark (20 frames, 39,724 patches, 651M pixels). Our results reveal a significant "complexity penalty": a vanilla U‑Net (7.76M parameters) achieves R^2=0.834 and RMSE = 1.01 cm, outperforming 11.37M‑parameter attention‑based models by 34% in R^2 and 51% in RMSE. Power Spectral Density (PSD) analysis provides the physical justification: while attention excels at capturing sharp semantic edges in natural images, it injects unphysical high‑frequency artifacts (>0.3 cycles/pixel) into geophysical fields, violating the fundamental smoothness constraints of elastic surface deformation. With a 2.92ms inference latency (a 2.5× speedup), the vanilla U‑Net is the only candidate to comfortably meet the sub‑100ms requirement for operational early‑warning systems. This work bridges the "publication‑to‑practice" gap by proving that convolutional locality outperforms modern complexity for smooth‑field regression, advocating for physics‑informed simplicity in ML4RS. Code available at https://github.com/prabhjotschugh/When‑Less‑is‑More‑InSAR‑Phase‑Unwrapping
Authors:Qi Li, Weining Wang, Shuangjun Du, Bo Peng, Jing Dong, Kun Wang, Zhenan Sun, Ming-Hsuan Yang
Abstract:
Face swapping has witnessed significant progress in recent years, largely driven by advances in deep generative models such as GANs and diffusion models.Despite these advances, existing methods remain fragmented across different paradigms, and their evaluation is highly inconsistent due to the lack of standardized datasets and protocols. Moreover, prior surveys primarily focus on broader deepfake generation or detection, leaving face swapping insufficiently studied as a standalone problem. In this paper, we present a comprehensive survey and benchmark for face swapping. We provide a structured review of existing methods, organizing them into five major paradigms and systematically analyzing their design principles, strengths, and limitations. To enable fair and controlled evaluation, we introduce CASIA FaceSwapping, a high‑quality benchmark with balanced demographic distributions and explicit attribute variations, and establish standardized protocols to assess the robustness of different face swapping methods. Extensive experiments on representative approaches yield new insights into the performance characteristics and limitations of current techniques. Overall, our work provides a unified perspective and a principled evaluation framework to facilitate the development of more robust and controllable face swapping methods. More results can be found at https://github.com/CASIA‑NLPRAI/face‑swapping‑survey.
Authors:Firat Ozdemir, Yun Cheng, Salman Mohebi, Fanny Lehmann, Simon Adamov, Zhenyi Zhang, Leonardo Trentini, Dana Grund, Oliver Fuhrer, Torsten Hoefler, Siddhartha Mishra, Sebastian Schemm, Benedikt Soja, Mathieu Salzmann
Abstract:
Foundation models (FMs) for the Earth system learn statistical relationships between physical variables across massive datasets to enable versatile downstream applications through finetuning, separating them from task‑specific weather models. Here, we introduce Earth System Foundation Model (ESFM), a fully open model building on the 3D Swin UNet backbone of the pioneering Aurora model. ESFM introduces extensions that increase functionality and foster adoption in climate sciences. First, the encoding scheme and training protocols have been extended to handle diverse datasets, including those containing missing values across all spatio‑temporal dimensions such as satellite data, as well as station data, all under one backbone. Axial attention is introduced to capture inter‑variable dependencies. As a result ESFM skillfully predicts variables in regions or on pressure levels where no data is present at the initial time, while preserving inter‑variable relationships, for example between temperature, pressure, and humidity. Individual variable tokenization enables different sets of variables to be shuffled during training and simplifies the process of building extensions for new downstream tasks. Adaptive layer norm‑based ensembles allow for a simple yet effective way to transform deterministic ESFM to a probabilistic FM. We present findings using dense gridded data (ERA5, CMIP6), regionally masked dense data, sparse gridded MODIS satellite data, and station data. Results demonstrate competitive or superior performance relative to state‑of‑the‑art benchmarks. Case studies of Super Typhoon Doksuri (2023) and 2024 sudden stratospheric warming events show accurate positional and magnitude estimations of extreme weather. ESFM retains the strengths of previous foundation models, such as long‑term stability, but facilitates application to a variety of downstream tasks.
Authors:Ziyu Zheng, Yaming Yang, Zhe Wang, Ziyu Guan, Wei Zhao
Abstract:
While Graph Foundation Models (GFMs) have achieved remarkable success in homogeneous graphs, extending them to multi‑domain heterogeneous graphs (MDHGs) remains a formidable challenge due to cross‑type feature shifts and intra‑domain relation gaps. Existing global feature alignment methods (PCA or SVD) enforce a shared feature space blindly, which distorts type‑specific semantics and disrupts original topologies, inevitably leading to "Type Collapse" and "Relation Confusion". To address these fundamental limitations, we propose Decoupled relation Subspace Alignment (DRSA), a novel, plug‑and‑play relation‑driven alignment framework. DRSA fundamentally shifts the paradigm by decoupling feature semantics from relation structures. Specifically, it introduces a dual‑relation subspace projection mechanism to coordinate cross‑type interactions within a shared low‑rank relation subspace explicitly. Furthermore, a feature‑structure decoupled representation is designed to decompose aligned features into a semantic projection component and a structural residual term, adaptively absorbing intra‑domain variations. Optimized via a stable alternating minimization strategy based on Block Coordinate Descent, DRSA constructs a well‑calibrated, structure‑aware latent space. Extensive experiments on multiple real‑world benchmark datasets demonstrate that DRSA can be seamlessly integrated as a universal preprocessing module, significantly and consistently enhancing the cross‑domain and few‑shot knowledge transfer capabilities of state‑of‑the‑art GFMs. The code is available at: https://github.com/zhengziyu77/DSRA.
Authors:Jaeyoung Chung, Suyoung Lee, Kyoung Mu Lee
Abstract:
We present a training‑free approach for controllable 3D inpainting based on initial noise optimization. In the structured 3D latent diffusion framework, we observe that the underlying geometric structure is established during the early stages of the diffusion process and exhibits high sensitivity to the initial noise. Such characteristics compromise stability in tasks like inpainting and editing, where the model must ensure strict alignment with the existing context while synthesizing a new structure. In this paper, we introduce a strategy to optimize the initial noise within the structured 3D latent diffusion framework, ensuring high‑fidelity 3D inpainting. Specifically, we update the initial noise by leveraging a backpropagation approximation grounded in the rectified flow model, with the spectral parameterization specially designed for robust and efficient structured 3D latent optimization. Experiments demonstrate consistent improvements in contextual consistency and prompt alignment over representative training‑free inpainting baselines, establishing initial noise control as an independent dimension for 3D inpainting, orthogonal to conventional sampling trajectory manipulation.
Authors:Yan Zhang, Daiqing Wu, Huawen Shen, Yu Zhou, Can Ma
Abstract:
Graphical User Interface (GUI) grounding maps natural language instructions to the visual coordinates of target elements and serves as a core capability for autonomous GUI agents. Recent reinforcement learning methods (e.g., GRPO) have achieved strong performance, but they rely on expensive multiple rollouts and suffer from sparse signals on hard samples. These limitations make on‑policy self‑distillation (OPSD), which provides dense token‑level supervision from a single rollout, a promising alternative. However, its applicability to GUI grounding remains unexplored. In this paper, we present GUI‑SD, the first OPSD framework tailored for GUI grounding. First, it constructs a visually enriched privileged context for the teacher using a target bounding box and a Gaussian soft mask, providing informative guidance without leaking exact coordinates. Second, it employs entropy‑guided distillation, which adaptively weights tokens based on digit significance and teacher confidence, concentrating optimization on the most impactful and reliable positions. Extensive experiments on six representative GUI grounding benchmarks show that GUI‑SD consistently outperforms GRPO‑based methods and naive OPSD in both accuracy and training efficiency. Code and training data are available at https://zhangyan‑ucas.github.io/GUI‑SD/.
Authors:Massimo Rondelli, Francesco Pivi, Maurizio Gabbrielli
Abstract:
Automatic generation of executable Blender code from natural language remains challenging, with state‑of‑the‑art LLMs producing frequent syntactic errors and geometrically inconsistent objects. We present BlenderRAG, a retrieval‑augmented generation system that operates on a curated multimodal dataset of 500 expert‑validated examples (text, code, image) across 50 object categories. By retrieving semantically similar examples during generation, BlenderRAG improves compilation success rates from 40.8% to 70.0% and semantic normalized alignment from 0.41 to 0.77 (CLIP similarity) across four state‑of‑the‑art LLMs, without requiring fine‑tuning or specialized hardware, making it immediately accessible for deployment. The dataset and code will be available at https://github.com/MaxRondelli/BlenderRAG.
Authors:Yao Ni, Jeremie Houssineau, Yew Soon Ong, Piotr Koniusz
Abstract:
Deep neural networks achieve impressive results across diverse applications, yet their overconfidence on unseen inputs necessitates reliable epistemic uncertainty modelling. Existing methods for uncertainty modelling face a fundamental dilemma: Bayesian approaches provide principled estimates but remain computationally prohibitive, while efficient second‑order predictors lack rigorous derivations connecting their specific objectives to epistemic uncertainty quantification. To resolve this dilemma, we introduce Dirichlet‑approximated possibilistic posterior predictions (DAPPr), a principled framework leveraging possibility theory. We define a possibilistic posterior over parameters, projects this posterior to the prediction space via supremum operators, and approximates the projected posterior using learnable Dirichlet possibility functions. This projection‑and‑approximation strategy yields a simple training objective with closed‑form solutions. Extensive experiments across diverse benchmarks demonstrate that our approach achieves competitive or superior uncertainty quantification performance compared to state‑of‑the‑art evidential deep learning methods while maintaining both principled derivation and computational efficiency. Code will be available at https://github.com/MaxwellYaoNi/DAPPr.
Authors:Michito Takeshita, Takuro Kawada, Takumi Ohashi, Shunsuke Kitada, Hitoshi Iyatomi
Abstract:
AI agents that interact with graphical user interfaces (GUIs) require effective observation representations for reliable grounding. The accessibility tree is a commonly used text‑based format that encodes UI element attributes, but it suffers from redundancy and lacks structural information such as spatial relationships among elements. We propose A11y‑Compressor, a framework that transforms linearized accessibility trees into compact and structured representations. Our implementation, Compressed‑a11y, applies a lightweight and structured transformation pipeline with modal detection, redundancy reduction, and semantic structuring. Experiments on the OSWorld benchmark show that Compressed‑a11y reduces input tokens to 22% of the original while improving task success rates by 5.1 percentage points on average.
Authors:Ziwen Zhao, Menglin Yang
Abstract:
Retrieval‑augmented generation (RAG) enhances large language models with external knowledge, and tree‑based RAG organizes documents into hierarchical indexes to support queries at multiple granularities. However, existing Tree‑RAG methods designed for single‑document retrieval face critical challenges in scaling to cross‑document multi‑hop questions: (1) poor distribution adaptability, where k‑means clustering introduces noise due to rigid distribution assumptions; (2) structural isolation, as tree indexes lack explicit cross‑document connections; and (3) coarse abstraction, which obscures fine‑grained details. To address these limitations, we propose Ψ‑RAG, a tree‑RAG framework with two key components. First, a hierarchical abstract tree index built through an iterative "merging and collapse" process that adapts to data distributions without a priori assumption. Second, a multi‑granular retrieval agent that intelligently interacts with the knowledge base with reorganized queries and an agent‑powered hybrid retriever. Ψ‑RAG supports diverse tasks from token‑level question answering to document‑level summarization. On cross‑document multi‑hop QA benchmarks, it outperforms RAPTOR by 25.9% and HippoRAG 2 by 7.4% in average F1 score. Code is available at https://github.com/Newiz430/Psi‑RAG.
Authors:Aninda Ray
Abstract:
A multi‑agent pipeline with N agents typically issues N LLM calls per run. Merging agents into fewer calls (compound execution) promises token savings, but naively merged calls silently degrade quality through tool loss and prompt compression. We present Agent Capsules, an adaptive execution runtime that treats multi‑agent pipeline execution as an optimization problem with empirical quality constraints. The runtime instruments coordination overhead per group, scores composition opportunity, selects among three compound execution strategies, and gates every mode switch on rolling‑mean output quality. A controlled negative result confirms that injecting more context into a merged call worsens compression rather than relieving it, so the framework's escalation ladder (standard, then two‑phase, then sequential) recovers quality by moving toward per‑agent dispatch rather than by rewriting merged prompts. On LLM‑judged quality, the controller matches a hand‑tuned oracle on every measured (model, group, mode) cell: routing compound whenever the oracle would, and reverting to fine whenever quality would fail the floor, without per‑model configuration. Against a hand‑crafted LangGraph implementation of a 14‑agent competitive intelligence pipeline, Agent Capsules uses 51% fewer fine‑mode input tokens and 42% fewer compound‑mode input tokens, at +0.020 and +0.017 quality respectively. Against a DSPy implementation of a 5‑agent due diligence pipeline, the framework uses 19% fewer tokens than uncompiled DSPy at quality parity, and 68% fewer tokens than MIPROv2 at +0.052 quality. Even before compound mode fires, the runtime delivers efficiency through automatic policy resolution, cache‑aligned prompts, and topology‑aware context injection, matching both hand‑tuned and compile‑time baselines without training data or per‑pipeline engineering.
Authors:Khizar Qureshi, Geoffrey Martin, Yifan Peng
Abstract:
A key challenge for large language models is token cost per query and overall deployment cost. Clinical inputs are long, heterogeneous, and often redundant, while downstream tasks are short and high stakes. We study budgeted context selection, where a subset of document units is chosen under a strict token budget so an off‑the‑shelf generator can meet fixed cost and latency constraints. We cast this as a knapsack‑constrained subset selection problem with two design choices, unitization that defines document segmentation and selection that determines which units are kept. We propose RCD, a monotone submodular objective that balances relevance, coverage, and diversity. We compare sentence, section, window, and cluster‑based unitization, and introduce a routing heuristic that adapts to the budget regime. Experiments on MIMIC discharge notes, Cochrane abstracts, and L‑Eval show that optimal strategies depend on the evaluation setting. Positional heuristics perform best at low budgets in extractive tasks, while diversity‑aware methods such as MMR improve LLM generation. Selector choice matters more than unitization, with cluster‑based grouping reducing performance and other schemes behaving similarly. ROUGE saturates for LLM summaries, while BERTScore better reflects quality differences. We release our code at https://github.com/stone‑technologies/ACL_budget_paper.
Authors:Xingyu Hu, Kai Zhang, Jiancan Wu, Shuli Wang, Chi Wang, Wenshuai Chen, Yinhua Zhu, Haitao Wang, Xingxing Wang, Xiang Wang
Abstract:
In large language model (LLM)‑based recommendation systems, direct preference optimization (DPO) effectively aligns recommendations with user preferences, requiring multi‑negative objective functions to leverage abundant implicit‑feedback negatives and sharpen preference boundaries. However, our empirical analyses reveal a counterintuitive phenomenon, preference optimization collapse, where increasing the number of negative samples can lead to performance degradation despite a continuously decreasing training loss. We further theoretically demonstrate that this collapse arises from gradient suppression, caused by the dominance of easily discriminable negatives over boundary‑critical negatives that truly define user preference boundaries. As a result, boundary‑relevant signals are under‑optimized, weakening the model's decision boundary. Motivated by these observations, we propose DynamicPO (Dynamic Preference Optimization), a lightweight and plug‑and‑play framework comprising two adaptive mechanisms: Dynamic Boundary Negative Selection, which identifies and prioritizes informative negatives near the model's decision boundary, and Dual‑Margin Dynamic beta Adjustment, which calibrates optimization strength per sample according to boundary ambiguity. Extensive experiments on three public datasets show that DynamicPO effectively prevents optimization collapse and improves recommendation accuracy on multi‑negative preference optimization methods, with negligible computational overhead. Our code and datasets are available at https://github.com/xingyuHuxingyu/DynamicPO.
Authors:Ishan Gupta, Pavlo Buryi
Abstract:
We examine if frontier chat‑based large language models (LLMs) adjust their outputs based on neurodivergence (ND) context in system prompts and describe the nature of these adjustments. Specifically, we propose NDBench, a 576‑output benchmark involving two frontier models, three system prompt types (baseline, ND‑profile assertion, and ND‑profile assertion with explicit instructions for adjustments), four canonical ND profiles, and 24 prompts across four categories, one of which involves an adversarial masking strategy. Four trends emerge consistently from our findings. First, LLMs show significant adaptation under ND context, where fully instructed conditions yield lengthier and more structured outputs, characterized by higher token counts, more headings, and more granular steps (p < 10^‑8, Holm‑corrected). Second, such adaptation is largely structural in nature: although list density does not change much, there is a marked rise in the frequency of headings and per‑step detail. Third, ND persona assertion alone fails to suppress potentially harmful tendencies, as masking‑reinforcement decreases only in explicitly instructed cases (36‑44% reduction); the reduction rate barely changes in persona assertion conditions. Moreover, reliability analysis of LLM‑based harm assessment reveals that only two out of the six dimensions (masking and reinforcement, validation quality) exceed the pre‑defined inter‑judge agreement criterion (alpha >= 0.67) and thus can be considered primary results. NDBench is made publicly available along with its prompts, outputs, code, and other resources, forming a reproducible framework for auditing future LLMs' adaptation to ND awareness.
Authors:Najmul Hasan
Abstract:
DNA‑synthesis providers screen incoming orders by searching the requested sequence against curated hazard lists. We show that this baseline collapses to a 100% false‑flag rate when the hazardous sequence comes from a taxonomic family absent from the reference set: under Conformal Risk Control's certified miss‑rate constraint, a low‑discrimination signal forces the threshold below the entire test‑benign mass. We compose three signals derived from a synthesis order's public annotation: k‑mer Jaccard similarity to known toxins, the trimmed‑mean score of a five‑LLM judge panel, and cosine similarity to clustered embedding centroids. Fused under a monotone logistic aggregator and calibrated by Conformal Risk Control, the resulting screener certifies \mathbbE[\mathrmFNR] \le α. Across ten leave‑one‑taxonomic‑family‑out folds at α=0.05 on UniProt KW‑0800 reviewed toxins, the calibrated screener achieves 0% test miss rate on every fold and 0% test false‑flag rate on nine of ten folds. The bound's finite‑sample slack 1/(n_\mathrmcal+1) caps the certifiable miss rate at 1.77% on our 200‑hazard subsample; reaching procurement‑grade α=10^‑3 requires an 18× larger calibration set, which the full reviewed UniProt KW‑0800 corpus is large enough to deliver. The binding constraint on certifiable DNA‑synthesis screening is calibration data, not algorithms. Code: https://github.com/najmulhasan‑code/crc‑screen
Authors:Binghao Huang, Yunzhu Li
Abstract:
We present FlexiTac, a low‑cost, open‑source, and scalable piezoresistive tactile sensing solution designed for robotic end‑effectors. FlexiTac is a practical "plug‑in" module consisting of (i) thin, flexible tactile sensor pads that provide dense tactile signals and (ii) a compact multi‑channel readout board that streams synchronized measurements for real‑time control and large‑scale data collection. FlexiTac pads adopt a sealed three‑layer laminate stack (FPC‑Velostat‑FPC) with electrode patterns directly integrated into flexible printed circuits, substantially improving fabrication throughput and repeatability while maintaining mechanical compliance for deployment on both rigid and soft grippers. The readout electronics use widely available, low‑cost components and stream tactile signals to a host computer at 100 Hz via serial communication. Across multiple configurations, including fingertip pads and larger tactile mats, FlexiTac can be mounted on diverse platforms without major mechanical redesign. We further show that FlexiTac supports modern tactile learning pipelines, including 3D visuo‑tactile fusion for contact‑aware decision making, cross‑embodiment skill transfer, and real‑to‑sim‑to‑real fine‑tuning with GPU‑parallel tactile simulation. Our project page is available at https://flexitac.github.io/.
Authors:Sudong Wang, Weiquan Huang, Xiaomin Yu, Zuhao Yang, Hehai Lin, Keming Wu, Chaojun Xiao, Chen Chen, Wenxuan Wang, Beier Zhu, Yunjian Zhang, Chengwei Qin
Abstract:
The standard post‑training recipe for large multimodal models (LMMs) applies supervised fine‑tuning (SFT) on curated demonstrations followed by reinforcement learning with verifiable rewards (RLVR). However, SFT introduces distributional drift that neither preserves the model's original capabilities nor faithfully matches the supervision distribution. This problem is further amplified in multimodal reasoning, where perception errors and reasoning failures follow distinct drift patterns that compound during subsequent RL. We introduce PRISM, a three‑stage pipeline that mitigates this drift by inserting an explicit distribution‑alignment stage between SFT and RLVR. Building on the principle of on‑policy distillation (OPD), PRISM casts alignment as a black‑box, response‑level adversarial game between the policy and a Mixture‑of‑Experts (MoE) discriminator with dedicated perception and reasoning experts, providing disentangled corrective signals that steer the policy toward the supervision distribution without requiring access to teacher logits. While 1.26M public demonstrations suffice for broad SFT initialization, distribution alignment demands higher‑fidelity supervision; we therefore curate 113K additional demonstrations from Gemini 3 Flash, featuring dense visual grounding and step‑by‑step reasoning on the hardest unsolved problems. Experiments on Qwen3‑VL show that PRISM consistently improves downstream RLVR performance across multiple RL algorithms (GRPO, DAPO, GSPO) and diverse multimodal benchmarks, improving average accuracy by +4.4 and +6.0 points over the SFT‑to‑RLVR baseline on 4B and 8B, respectively. Our code, data, and model checkpoints are publicly available at https://github.com/XIAO4579/PRISM.
Authors:Sigma Jahan, Saurabh Singh Rajput, Tushar Sharma, Mohammad Masudur Rahman
Abstract:
Transformer models are widely deployed in critical AI applications, yet faults in their attention mechanisms, projections, and other internal components often degrade behavior silently without raising runtime errors. Existing fault diagnosis techniques often target generic deep neural networks and cannot identify which transformer component is responsible for an observed symptom. In this article, we present DEFault++, a hierarchical learning‑based diagnostic technique that operates at three level of abstraction: it detects whether a fault is present, classifies it into one of 12 transformer‑specific fault categories (covering both attention‑internal mechanisms and surrounding architectural components), and identifies the underlying root cause from up to 45 mechanisms. To facilitate both training and evaluation, we construct DEFault‑bench, a benchmark of 3,739 labeled instances obtained through systematic mutation testing. These instances are created across seven transformer models and nine downstream tasks using DEForm, a transformer‑specific mutation technique we developed for this purpose. DEFault++ measures runtime behavior at the level of individual transformer components. It organizes these measurements through a Fault Propagation Graph (FPG) derived from the transformer architecture. It then produces an interpretable diagnosis using prototype matching combined with supervised contrastive learning. On DEFault‑bench, DEFault++ exceeds an AUROC of 0.96 for detection and a Macro‑F1 of 0.85 for both categorization and root‑cause diagnosis on encoder and decoder architectures. In a developer study with 21 practitioners, the accuracy of choosing correct repair actions increased from 57.1% without support to 83.3% when using DEFault++.
Authors:Smit Jivani, Sarvam Maheshwari, Sunita Sarawagi
Abstract:
Large language models (LLMs) have revolutionized Text‑to‑SQL generation, allowing users to query structured data using natural language with growing ease. Yet, real‑world deployment remains challenging, especially in complex or unseen schemas, due to inconsistent accuracy and the risk of generating invalid SQL. We introduce Template Constrained Decoding (TeCoD), a system that addresses these limitations by harnessing the recurrence of query patterns in labeled workloads. TeCoD converts historical NL‑SQL pairs into reusable templates and introduces a robust template selection module that uses a fine‑tuned natural language inference model to match or reject queries efficiently. Once the template is selected, TeCoD enforces it during SQL generation through grammar‑constrained decoding, implemented via a novel partitioned strategy that ensures both syntactic validity and efficiency. Together, these components yield up to 36% higher execution accuracy than in‑context learning (ICL) and 2.2x lower latency on matched queries.
Authors:Hanane Nour Moussa, Yifei Li, Zhuoyang Li, Yankai Yang, Cheng Tang, Tianshu Zhang, Nesreen K. Ahmed, Ali Payani, Ziru Chen, Huan Sun
Abstract:
Despite recent progress in language models and agents for scientific data‑driven discovery, further advancing their capabilities is held back by the absence of verifiable environments representing real‑world scientific tasks. To fill this gap, we introduce D3‑Gym, the first automatically constructed dataset with verifiable environments for scientific Data‑Driven Discovery. D3‑Gym comprises (1) 565 tasks sourced from 239 real scientific repositories across four disciplines where (2) each task is equipped with a natural language instruction, an executable environment with pre‑installed dependencies, input dataset and artifact previews, a reference code solution, and an automatically synthesized evaluation script. Rigorous evaluation of the quality of the verification signal in D3‑Gym confirms that our evaluation scripts achieve 87.5% agreement with human‑annotated gold standards and strong alignment in domain‑specific evaluation logic, showing their scientific soundness. Further, training on trajectories sampled from D3‑Gym yields consistent and substantial gains across Qwen3 models of varying sizes on ScienceAgentBench, boosting Qwen3‑32B by 7.8 absolute points and substantially shrinking the gap with strong proprietary models. All D3‑Gym artifacts (environments, creation workflow, trajectories, and models) can be found at https://github.com/OSU‑NLP‑Group/D3‑Gym.
Authors:Ce Chen, Yi Ren, Yuanming Li, Viktor Goriachko, Zhenhui Ye, Zujin Guo, Zhibin Hong, Mingming Gong
Abstract:
Traditional Shot Boundary Detection (SBD) inherently struggles with complex transitions by formulating the task around isolated cut points, frequently yielding corrupted video shots. We address this fundamental limitation by formalizing the Shot Transition Detection (STD) task. Rather than searching for ambiguous points, STD explicitly detects the continuous temporal segments of transitions. To tackle this, we propose TransVLM, a Vision‑Language Model (VLM) framework for STD. Unlike regular VLMs that predominantly rely on spatial semantics and struggle with fine‑grained inter‑shot dynamics, our method explicitly injects optical flow as a critical motion prior at the input stage. Through a simple yet effective feature‑fusion strategy, TransVLM directly processes concatenated color and motion representations, significantly enhancing its temporal awareness without incurring any additional visual token overhead on the language backbone. To overcome the severe class imbalance in public data, we design a scalable data engine to synthesize diverse transition videos for robust training, alongside a comprehensive benchmark for STD. Extensive experiments demonstrate that TransVLM achieves superior overall performance, outperforming traditional heuristic methods, specialized spatiotemporal networks, and top‑tier VLMs. This work has been deployed to production. For more related research, please visit HeyGen Research (https://www.heygen.com/research) and HeyGen Avatar‑V (https://www.heygen.com/research/avatar‑v‑model). Project page: https://chence17.github.io/TransVLM/
Authors:Junan Hu, Jian Liu, Jingxiang Lai, Jiarui Hu, Yiwei Sheng, Shuang Chen, Jian Li, Dazhao Du, Song Guo
Abstract:
Graphical User Interface (GUI) agents have emerged as a promising paradigm for intelligent systems that perceive and interact with graphical interfaces visually. Yet supervised fine‑tuning alone cannot handle long‑horizon credit assignment, distribution shifts, and safe exploration in irreversible environments, making Reinforcement Learning (RL) a central methodology for advancing automation. In this work, we present the first comprehensive overview of the intersection between RL and GUI agents, and examine how this research direction may evolve toward digital inhabitants. We propose a principled taxonomy that organizes existing methods into Offline RL, Online RL, and Hybrid Strategies, and complement it with analyses of reward engineering, data efficiency, and key technical innovations. Our analysis reveals several emerging trends: the tension between reliability and scalability is motivating the adoption of composite, multi‑tier reward architectures; GUI I/O latency bottlenecks are accelerating the shift toward world‑model‑based training, which can yield substantial performance gains; and the spontaneous emergence of System‑2‑style deliberation suggests that explicit reasoning supervision may not be necessary when sufficiently rich reward signals are available. We distill these findings into a roadmap covering process rewards, continual RL, cognitive architectures, and safe deployment, aiming to guide the next generation of robust GUI automation and its agent‑native infrastructure.
Authors:Djamel Bouchaffra, Faycal Ykhlef, Mustapha Lebbah, Hanane Azzag
Abstract:
Collective intelligence emerges across biological, physical, and artificial systems without central coordination, yet a unifying principle governing such behaviour remains elusive. The Free Energy Principle explains how individual agents adapt through variational inference, while game theory formalises strategic interactions. Here we introduce the Game‑Theoretic Free Energy Principle, a unified framework showing that multi‑agent systems performing local free‑energy minimisation implicitly implement a stochastic game. We prove that, under bounded rationality and local information constraints, stationary points of collective free energy correspond to approximate Nash equilibria of an induced game. Conversely, a broad class of cooperative games admits a variational representation in which equilibria arise as Gibbs distributions over coalitions, establishing a bridge between Bayesian inference and strategic interaction. To characterise higher‑order effects, we introduce a free‑energy formulation of the Harsanyi dividend, isolating irreducible multi‑agent synergy. This yields a predictive theory of cooperation, including a falsifiable non‑monotonic relationship between sensory precision and agent influence. We validate this prediction across neural, biological, and artificial multi‑agent systems. These results identify a common variational principle underlying inference, thermodynamics, and game‑theoretic equilibrium.
Authors:Shiyao Peng, Qianhe Zheng, Zhuodi Hao, Zichen Tang, Rongjin Li, Qing Huang, Jiayu Huang, Jiacheng Liu, Yifan Zhu, Haihong E
Abstract:
Although precise recall is a core objective in Retrieval‑Augmented Generation (RAG), a critical oversight persists in the field: improvements in retrieval performance do not consistently translate to commensurate gains in downstream reasoning. To diagnose this gap, we propose the Recall Conversion Rate (RCR), a novel evaluation metric to quantify the contribution of retrieval to reasoning accuracy. Our quantitative analysis of mainstream RAG methods reveals that as Recall@5 improves, the RCR exhibits a near‑linear decay. We identify the neglect of retrieval quality in these methods as the underlying cause. In contrast, approaches that focus solely on quality optimization often suffer from inferior recall performance. Both categories lack a comprehensive understanding of retrieval quality optimization, resulting in a trade‑off dilemma. To address these challenges, we propose comprehensive retrieval quality optimization criteria and introduce the NeocorRAG framework. This framework achieves holistic retrieval quality optimization by systematically mining and utilizing Evidence Chains. Specifically, NeocorRAG first employs an innovative activated search algorithm to obtain a refined candidate space. Then it ensures precise evidence chain generation through constrained decoding. Finally, the retrieved set of evidence chains guides the retrieval optimization process. Evaluated on benchmarks including HotpotQA, 2WikiMultiHopQA, MuSiQue, and NQ, NeocorRAG achieves SOTA performance on both 3B and 70B parameter models, while consuming less than 20% of tokens used by comparable methods. This study presents an efficient, training‑free paradigm for RAG enhancement that effectively optimizes retrieval quality while maintaining high recall. Our code is released at https://github.com/BUPT‑Reasoning‑Lab/NeocorRAG.
Authors:Haonan Li, Tianjun Sun, Yongqing Wang, Qisheng Zhang
Abstract:
Multi‑server MCP agents create an information‑flow control problem: faithful tool composition can turn individually benign read/write permissions into cross‑boundary credential propagation ‑‑ a structural side effect of workflow topology, not necessarily malicious model behavior. We present MCPHunt, to our knowledge the first controlled benchmark that isolates non‑adversarial, verbatim credential propagation across multi‑server MCP trust boundaries, with three methodological contributions: (1) canary‑based taint tracking that reduces propagation detection to objective string matching; (2) an environment‑controlled coverage design with risky, benign, and hard‑negative conditions that validates pipeline soundness and controls for credential‑format confounds; (3) CRS stratification that disentangles task‑mandated propagation (faithful execution of verbatim‑transfer instructions) from policy‑violating propagation (credentials included despite the option to redact). Across 3,615 main‑benchmark traces from 5 models spanning 147 tasks and 9 mechanism families, policy‑violating propagation rates reach 11.5‑‑41.3% across all models. This propagation is pathway‑specific (25x cross‑mechanism range) and concentrated in browser‑mediated data flows; hard‑negative controls provide evidence that production‑format credentials are not necessary ‑‑ prompt‑directed cross‑boundary data flow is sufficient. A prompt‑mitigation study across 3 models reduces policy‑violating propagation by up to 97% while preserving 80.5% utility, but effectiveness varies with instruction‑following capability ‑‑ suggesting that prompt‑level defenses alone may not suffice. Code, traces, and labeling pipeline are released under MIT and CC BY 4.0.
Authors:Abdelrahman Sadallah, Kareem Elozeiri, Mervat Abassy, Rania Elbadry, Mohamed Anwar, Abed Alhakim Freihat, Preslav Nakov, Fajri Koto
Abstract:
Poetry has long been a central art form for Arabic speakers, serving as a powerful medium of expression and cultural identity. While modern Arabic speakers continue to value poetry, existing research on Arabic poetry within Large Language Models (LLMs) has primarily focused on analysis tasks such as interpretation or metadata prediction, e.g., rhyme schemes and titles. In contrast, our work addresses the practical aspect of poetry creation in Arabic by introducing controllable generation capabilities to assist users in writing poetry. Specifically, we present a large‑scale, carefully curated instruction‑based dataset in Modern Standard Arabic (MSA) and various Arabic dialects. This dataset enables tasks such as writing, revising, and continuing poems based on predefined criteria, including style and rhyme, as well as performing poetry analysis. Our experiments show that fine‑tuning LLMs on this dataset yields models that can effectively generate poetry that is aligned with user requirements, based on both automated metrics and human evaluation with native Arabic speakers. The data and the code are available at https://github.com/mbzuai‑nlp/instructpoet‑ar
Authors:Chao Fei, Hongcheng Guo, Yanghua Xiao
Abstract:
Across millennia, complex societies have faced the same coordination problem of how to organize collective action among cognitively bounded and informationally incomplete individuals. Different civilizations developed different political institutions to answer the same basic questions of who proposes, who reviews, who executes, and how errors are corrected. We argue that multi‑agent systems built on large language models face the same challenge. Their central problem is not only individual intelligence, but collective organization. Historical institutions therefore provide a structured design space for multi‑agent architectures, making key trade‑offs between efficiency and error correction, centralization and distribution, and specialization and redundancy empirically testable. We translate seven historical political institutions, spanning four canonical governance patterns, into executable multi‑agent architectures and evaluate them under identical conditions across three large language models and two benchmarks. We find that governance topology strongly shapes collective performance. Within a single model, the gap between the best and worst institution exceeds 57 percentage points, while the optimal architecture shifts systematically with model capability and task characteristics. These results suggest that collective intelligence will not advance through a single optimal organizational form, but through governance mechanisms that can be reselected and reconfigured as tasks and capabilities evolve. More broadly, this points to a transition from self‑evolving agents to the self‑evolving multi‑agent system. The code is available on \hrefhttps://github.com/cf3i/SocialSystemArenaGitHub.
Authors:Wei Li, Haisheng Li, Weijie Li, Jiandong Wang, Kaichen Ma, Luming Yang
Abstract:
With the widespread application of Unmanned Aerial Vehicles (UAVs) in bridge structural health monitoring, deep learning‑based automatic crack detection has become a major research focus. However, practical UAV inspections still face four key challenges: weak crack features, degraded imaging conditions, severe class imbalance, and limited computational resources for practical UAV inspection workflows. To address these issues, this paper proposes a unified lightweight convolutional neural network framework composed of four synergistic components: a lightweight backbone network, a Convolutional Block Attention Module (CBAM) for channel and spatial enhancement, a directed robust augmentation strategy based on inspection‑scene priors, and Focal Loss for hard‑sample learning under class imbalance. Experiments on the SDNET2018 bridge deck dataset show that the proposed method achieves an inference speed of 825 FPS with only 11.21M parameters and 1.82G FLOPs. Compared with the baseline model, the complete framework improves the F1‑score by 2.51% and recall by 3.95%. In addition, Grad‑CAM visualizations indicate that the introduced attention module shifts the model's focus from scattered regions to precise tracking along crack trajectories. Overall, this study achieves a strong balance among accuracy, speed, and robustness, providing a practical solution for ground‑station assisted real‑time deployment in UAV bridge inspections. The source code is available at: https://github.com/skylynf/AttXNet .
Authors:Al Zadid Sultan Bin Habib, Tanpia Tasnim, Md. Ekramul Islam, Muntasir Tabasum
Abstract:
Learning informative representations from tabular data in remote sensing and environmental science is challenging due to heterogeneity, scarce labels, and redundancy among features. We present ZAYAN (Zero‑Anchor dYnamic feAture eNcoding), a self‑supervised, feature‑centric contrastive framework for tabular data. ZAYAN performs contrastive learning at the feature rather than sample level, removing the need for explicit anchor selection and any reliance on class labels, while encouraging a redundancy‑minimized, disentangled embedding space. The framework has two modules: ZAYAN‑CL, which pretrains feature embeddings via a zero‑anchor contrastive objective with dynamic perturbations and masking, and ZAYAN‑T, a Transformer that conditions on these embeddings for downstream classification. Across eight datasets, including six remote‑sensing tabular benchmarks and two remote‑sensing‑driven flood‑prediction tables from satellite and GIS products, ZAYAN achieves superior accuracy, robustness, and generalization over tabular deep learning baselines, with consistent gains under label scarcity and distribution shift. These results indicate that feature‑level contrastive learning and dynamic feature encoding provide an effective recipe for learning from tabular sensing data.
Authors:Pengyun Zhu, Qiheng Sun, Long Wen, Yanbo Wang, Yang Cao, Junxu Liu, Deyi Xiong, Jinfei Liu, Zhibo Wang, Kui Ren
Abstract:
Privacy policies are essential for users to understand how service providers handle their personal data. However, these documents are often long and complex, as well as filled with technobabble and legalese, causing users to unknowingly accept terms that may even contradict the law. While summarizing and interpreting these privacy policies is crucial, there is a lack of high‑quality English parallel corpus optimized for legal clarity and readability. To address this issue, we introduce APPSI‑139, a high‑quality English privacy policy corpus meticulously annotated by domain experts, specifically designed for summarization and interpretation tasks. The corpus includes 139 English privacy policies, 15,692 rewritten parallel corpora, and 36,351 fine‑grained annotation labels across 11 data practice categories. Concurrently, we propose TCSI‑pp‑V2, a hybrid privacy policy summarization and interpretation framework that employs an alternating training strategy and coordinates multiple expert modules to effectively balance computational efficiency and accuracy. Experimental results show that the hybrid summarization system built on APPSI‑139 corpus and the TCSI‑pp‑V2 framework outperform large language models, such as GPT‑4o and LLaMA‑3‑70B, in terms of readability and reliability. The source code and dataset are available at https://github.com/EnlightenedAI/APPSI‑139.
Authors:Jean Martins, Leonid Mokrushin, Marin Orlic
Abstract:
Intent‑based networking promises to revolutionize telecommunications network management by enabling operators to specify high‑level goals rather than low‑level configurations. The TM Forum Intent Ontology (tio) provides a standardized vocabulary for expressing network intents, yet lacks formal validation mechanisms to ensure intent correctness before its admission. We present tio‑shacl, the first comprehensive SHACL (Shapes Constraint Language) validation framework for the TMF Intent Ontology. Our contribution includes 56 node shapes and 69 property shapes across all 15 tio v3.6.0 ontology modules, a reusable constraint library with 25 parameterized SPARQL‑based constraint components, and novel validation patterns for recursive logical operators, quantity‑based constraints, and cross‑expectation relationships. We pursued 100% vocabulary coverage (87 classes, 109 properties, 72 functions), cross‑implementation compatibility across three major SHACL engines, and validation accuracy on a corpus of 133 test cases. tio‑shacl is publicly available under MIT license at https://github.com/EricssonResearch/tio‑shacl and enables automated syntactic and semantic validation of network intents, addressing a critical gap in the field.
Authors:Alan L. McCann
Abstract:
Every system that performs effects has two boundaries: what it can do (expressiveness) and what governance covers (governance). In nearly all deployed AI systems, these boundaries are defined independently, creating three regions: governed capabilities (the only useful region), ungoverned capabilities (risk), and governance policies that address non‑existent capabilities (theater). Two of the three regions are failure modes. We focus on the governance of effects: actions that AI systems perform in the world (API calls, database writes, tool invocations). This is distinct from the governance of model outputs (content quality, bias, fairness), which operates at a different level and requires different mechanisms. We present a formal framework for analyzing this structural gap. Rice's theorem (1953) proves the gap is undecidable in the general case for any Turing‑complete architecture that attempts to govern effects behaviorally: no algorithm can decide non‑trivial semantic properties of arbitrary programs, including the property "this program's effects comply with the governance policy." We define coterminous governance: a system property where the expressivenessboundary equals the governance boundary. We show that coterminous governance requires an architectural decision (separatingcomputation from effect) rather than a governance layer added after the fact. We show that structural governance under this separation subsumes separate governance infrastructure: governance checks become part of the execution pipeline rather than a second system running alongside it. We propose coterminous governance as the testable criterion for any AI governance system: either the two boundaries are provably identical, or risk and theater are structurally inevitable. Proofs are mechanized in Coq (454 theorems, 36 modules, 0 admitted).
Authors:Alan L. McCann
Abstract:
We present five results in the theory of structural governance for cognitive workflow systems. Three are mechanized in Coq 8.19 using the Interaction Trees library with parameterized coinduction; two are proved on paper with explicit reductions. The Coinductive Safety Predicate (gov_safe) is a coinductive property that captures governance safety for infinite program behaviors, indexed by a boolean permission flag that is provably false for ungoverned I/O and true for governed interpretations (mechanized). The Governance Invariance Theorem establishes that governance is uniform across the meta‑recursive tower: governance at level n+1 reduces to governance at level n by definitional equality of the type (mechanized). The Sufficiency Theorem proves that four atomic primitives (code, reason, memory, call) are expressively complete for any discrete intelligent system, formalized as compositional closure of a Kleisli category (mechanized). The Alternating Normal Form provides a canonical decomposition of any machine into alternating code and effect layers, with a confluent rewriting system (paper proof). The Necessity Theorem proves via explicit reduction to Rice's theorem that an architecturally opaque component (the reason primitive) is mathematically necessary for problems requiring semantic judgment (paper proof). A sixth contribution connects the abstract model to the deployed runtime: the Verified Interpreter Specification formalizes the BEAM runtime's trust, capability, and hash chain logic in Coq, then tests the running system against this specification using property‑based testing with over 70,000 randomly generated directive sequences and zero disagreements. The mechanization comprises approximately 12,000 lines across 36 modules with 454 theorems and zero admitted lemmas.
Authors:Yuxuan Huang, Yihang Chen, Zhiyuan He, Yuxiang Chen, Ka Yiu Lee, Huichi Zhou, Weilin Luo, Meng Fang, Jun Wang
Abstract:
Agentic web search increasingly faces two distinct demands: deep reasoning over a single target, and structured aggregation across many entities and heterogeneous sources. Current systems struggle on both fronts. Breadth‑oriented tasks demand schema‑aligned outputs with wide coverage and cross‑entity consistency, while depth‑oriented tasks require coherent reasoning over long, branching search trajectories. We introduce Web2BigTable, a multi‑agent framework for web‑to‑table search that supports both regimes. Web2BigTable adopts a bi‑level architecture in which an upper‑level orchestrator decomposes the task into sub‑problems and lower‑level worker agents solve them in parallel. Through a closed‑loop run‑‑verify‑‑reflect process, the framework jointly improves decomposition and execution over time via persistent, human‑readable external memory, with self‑evolving updates to each single‑agent. During execution, workers coordinate through a shared workspace that makes partial findings visible, allowing them to reduce redundant exploration, reconcile conflicting evidence, and adapt to emerging coverage gaps. Web2BigTable sets a new state of the art on WideSearch, reaching an Avg@4 Success Rate of 38.50 (7.5× the second best at 5.10), Row F1 of 63.53 (+25.03 over the second best), and Item F1 of 80.12 (+14.42 over the second best). It also generalises to depth‑oriented search on XBench‑DeepSearch, achieving 73.0 accuracy. Code is available at https://github.com/web2bigtable/web2bigtable.
Authors:Yibin Luo, Shiwei Gao, Huichuan Zheng, Youyou Lu, Jiwu Shu
Abstract:
Fine‑tuning Large Language Models (LLMs) on consumer‑grade GPUs is highly cost‑effective, yet constrained by limited GPU memory and slow PCIe interconnects. Pipeline parallelism combined with CPU offloading mitigates these hardware bottlenecks by reducing communication overhead. However, existing PP schedules suffer from an inherent limitation termed the weight binding issue. Binding uneven model stages (e.g., the LM head is large) to GPUs limits the pipeline's throughput to that of the GPU with the heaviest load, leading to severe pipeline bubbles. In this paper, we propose RoundPipe, a novel pipeline schedule that breaks the weight binding constraint on consumer GPU servers. RoundPipe treats GPUs as a pool of stateless execution workers and dynamically dispatches computation stages across devices in a round‑robin manner, achieving a near‑zero‑bubble pipeline. To ensure training correctness and system efficiency, RoundPipe integrates a priority‑aware transfer scheduling engine, a fine‑grained distributed event‑based synchronization protocol, and an automated layer partitioning algorithm. Evaluations on an 8× RTX 4090 server demonstrate that RoundPipe achieves 1.48‑‑2.16× speedups over state‑of‑the‑art baselines when fine‑tuning 1.7B to 32B models. Remarkably, RoundPipe enables LoRA fine‑tuning of the Qwen3‑235B model with 31K sequence length on a single server. RoundPipe is publicly available as an open‑source Python library with comprehensive documentation.
Authors:Bingxi Zhao, Jiahao Zhang, Xubin Ren, Zirui Guo, Tianzhe Chu, Yi Ma, Chao Huang
Abstract:
Education represents one of the most promising real‑world applications for Large Language Models (LLMs). However, conventional tutoring systems rely on static pre‑training knowledge that lacks adaptation to individual learners, while existing RAG‑augmented systems fall short in delivering personalized, guided feedback. To bridge this gap, we present DeepTutor, an agent‑native open‑source framework for personalized tutoring where every feature shares a common personalization substrate. We propose a hybrid personalization engine that couples static knowledge grounding with dynamic multi‑resolution memory, distilling interaction history into a continuously evolving learner profile. Moreover, we construct a closed tutoring loop that bidirectionally couples citation‑grounded problem solving with difficulty‑calibrated question generation. The personalization substrate further supports collaborative writing, multi‑agent deep research, and interactive guided learning, enabling cross‑modality coherence. To move beyond reactive interfaces, we introduce TutorBot, a proactive multi‑agent layer that deploys tutoring capabilities through extensible skills and unified multi‑channel access, providing consistent experience across platforms. To better evaluate such tutoring systems, we construct TutorBench, a student‑centric benchmark with source‑grounded learner profiles and a first‑person interactive protocol that measures adaptive tutoring from the learner's perspective. We further evaluate foundational agentic reasoning abilities across five authoritative benchmarks. Experiments show that DeepTutor improves personalized tutoring quality while maintaining general agentic reasoning abilities. We hope DeepTutor provides unique insights into next‑generation AI‑powered and personalized tutoring systems for the community.
Authors:Gongbo Zhang, Wen Wang, Ye Tian, Li Yuan
Abstract:
Diffusion large language models (dLLMs) offer parallel decoding and bidirectional context, but state‑of‑the‑art dLLMs require billions of parameters for competitive performance. While existing distillation methods for dLLMs reduce inference steps within a single architecture, none address cross‑architecture knowledge transfer, in which the teacher and student differ in architecture, attention mechanism, and tokenizer. We present TIDE, the first framework for cross‑architecture dLLM distillation, comprising three modular components: (1) TIDAL, which jointly modulates distillation strength across training progress and diffusion timestep to account for the teacher's noise‑dependent reliability; (2) CompDemo, which enriches the teacher's context via complementary mask splitting to improve predictions under heavy masking; and (3) Reverse CALM, a cross‑tokenizer objective that inverts chunk‑level likelihood matching, yielding bounded gradients and dual‑end noise filtering. Distilling 8B dense and 16B MoE teachers into a 0.6B student via two heterogeneous pipelines outperforms the baseline by an average of 1.53 points across eight benchmarks, yielding notable gains in code generation, where HumanEval scores reach 48.78 compared to 32.3 for the AR baseline.
Authors:Fei Bai, Huatong Song, Shuang Sun, Daixuan Cheng, Yike Yang, Chuan Hao, Renyuan Li, Feng Chang, Yuan Wei, Ran Tao, Bryan Dai, Jian Yang, Wayne Xin Zhao
Abstract:
Claw‑style environments support multi‑step workflows over local files, tools, and persistent workspace states. However, scalable development around these environments remains constrained by the absence of a systematic framework, especially one for synthesizing verifiable training data and integrating it with agent training and diagnostic evaluation. To address this challenge, we present ClawGym, a scalable framework that supports the full lifecycle of Claw‑style personal agent development. Concretely, we construct ClawGym‑SynData, a diverse dataset of 13.5K filtered tasks synthesized from persona‑driven intents and skill‑grounded operations, paired with realistic mock workspaces and hybrid verification mechanisms. We then train a family of capable Claw‑style models, termed ClawGym‑Agents, through supervised fine‑tuning on black‑box rollout trajectories, and further explore reinforcement learning via a lightweight pipeline that parallelizes rollouts across per‑task sandboxes.To support reliable evaluation, we further construct ClawGym‑Bench, a benchmark of 200 instances calibrated through automated filtering and human‑LLM review. Relevant resources will be soon released at https://github.com/ClawGym.
Authors:Bochao Liu, Zhipeng Qian, Yang Zhao, Xinyuan Jiang, Zihan Liang, Yufei Ma, Junpeng Zhuang, Ben Chen, Shuo Yang, Hongen Wan, Yao Wu, Chenyi Lei, Xiao Liang
Abstract:
Operating and maintaining (O&M) large‑scale online engine systems (search, recommendation, advertising) demands substantial human effort for release monitoring, alert response, and root cause analysis. While LLM‑based agents are a natural fit for these tasks, the deployment bottleneck is not reasoning capability but orchestration: selecting, for each operational event, the relevant data (metrics, logs, change events) and the applicable operational knowledge (handbook rules and practitioner experience). Feeding all signals indiscriminately causes dilution and hallucination, while manually curating the event‑to‑(data, knowledge) mapping is intractable under dozens of daily releases. We present Bian Que, an agentic framework with three contributions: (i) a \emphunified operational paradigm abstracting day‑to‑day O&M into three canonical patterns: release interception, proactive inspection, and alert root cause analysis; (ii) \emphFlexible Skill Arrangement, where each Skill specifies which data and knowledge to retrieve for a given business‑module context and can be automatically generated and updated by LLMs or iteratively refined through natural‑language instructions from on‑call engineers; (iii) a \emphunified self‑evolving mechanism in which one correction signal drives two parallel pathways, case‑memory‑to‑knowledge distillation and targeted Skill refinement. Deployed on the e‑commerce search engine of KuaiShou, the major short‑video platform in China, Bian Que reduces alert volume by 75%, achieves 80% root‑cause analysis accuracy, and cuts mean time to resolution by over 50%. Our framework achieves 99.0% pass rate on offline evaluations. Our code is available at https://github.com/benchen4395/BianQue_Assistant.
Authors:Jun Guo, Qiwei Li, Peiyan Li, Zilong Chen, Nan Sun, Yifei Su, Heyun Wang, Yuan Zhang, Xinghang Li, Huaping Liu
Abstract:
We propose X‑WAM, a Unified 4D World Model that unifies real‑time robotic action execution and high‑fidelity 4D world synthesis (video + 3D reconstruction) in a single framework, addressing the critical limitations of prior unified world models (e.g., UWM) that only model 2D pixel‑space and fail to balance action efficiency and world modeling quality. To leverage the strong visual priors of pretrained video diffusion models, X‑WAM imagines the future world by predicting multi‑view RGB‑D videos, and obtains spatial information efficiently through a lightweight structural adaptation: replicating the final few blocks of the pretrained Diffusion Transformer into a dedicated depth prediction branch for the reconstruction of future spatial information. Moreover, we propose Asynchronous Noise Sampling (ANS) to jointly optimize generation quality and action decoding efficiency. ANS applies a specialized asynchronous denoising schedule during inference, which rapidly decodes actions with fewer steps to enable efficient real‑time execution, while dedicating the full sequence of steps to generate high‑fidelity video. Rather than entirely decoupling the timesteps during training, ANS samples from their joint distribution to align with the inference distribution. Pretrained on over 5,800 hours of robotic data, X‑WAM achieves 79.2% and 90.7% average success rate on RoboCasa and RoboTwin 2.0 benchmarks, while producing high‑fidelity 4D reconstruction and generation surpassing existing methods in both visual and geometric metrics.
Authors:Rongliang Fu, Yi Liu, Qiang Xu, Tsung-Yi Ho
Abstract:
Technology mapping is a critical yet challenging stage in logic synthesis. While Large Language Models (LLMs) have been applied to generate optimization scripts, their potential for core algorithm enhancement remains untapped. We introduce MappingEvolve, an open‑source framework that pioneers the use of LLMs to directly evolve technology mapping code. Our method abstracts the mapping process into distinct optimization operators and employs a hierarchical agent‑based architecture, comprising a Planner, Evolver, and Evaluator, to guide the evolutionary search. This structured approach enables strategic and effective code modifications. Experiments show our method significantly outperforms direct evolution and strong baselines, achieving 10.04% area reduction versus ABC and 7.93% versus mockturtle, with 46.6%‑‑96.0% S_overall improvement on EPFL benchmarks, while explicitly navigating the area‑‑delay trade‑off. Our code and data are available at https://github.com/Flians/MappingEvolve.
Authors:Seungyub Han, Hyungjin Kim, Jungwoo Lee
Abstract:
Offline reinforcement learning (RL) agents often fail when deployed, as the gap between training datasets and real environments leads to unsafe behavior. To address this, we present SAS (Self‑Alignment for Safety), a transformer‑based framework that enables test‑time adaptation in offline safe RL without retraining. In SAS, the main mechanism is self‑alignment: at test time, the pretrained agent generates several imagined trajectories and selects those satisfying the Lyapunov condition. These feasible segments are then recycled as in‑context prompts, allowing the agent to realign its behavior toward safety while avoiding parameter updates. In effect, SAS turns Lyapunov‑guided imagination into control‑invariant prompts, and its transformer architecture admits a hierarchical RL interpretation where prompting functions as Bayesian inference over latent skills. Across Safety Gymnasium and MuJoCo benchmarks, SAS consistently reduces cost and failure while maintaining or improving return.
Authors:Cyril Shih-Huan Hsu, Wig Yuan-Cheng Cheng, Chrysa Papagianni
Abstract:
Deploying Vision‑Language Models (VLMs) on edge devices remains challenging due to their substantial computational and memory demands, which exceed the capabilities of resource‑constrained embedded platforms. Conversely, fully offloading inference to the cloud is often impractical in bandwidth‑limited environments, where transmitting raw visual data introduces substantial latency overhead. While recent edge‑cloud collaborative architectures attempt to partition VLM workloads across devices, they typically rely on transmitting fixed‑size representations, lacking adaptability to dynamic network conditions and failing to fully exploit semantic redundancy. In this paper, we propose a progressive semantic communication framework for edge‑cloud VLM inference, using a Meta AutoEncoder that compresses visual tokens into adaptive, progressively refinable representations, enabling plug‑and‑play deployment with off‑the‑shelf VLMs without additional fine‑tuning. This design allows flexible transmission at different information levels, providing a controllable trade‑off between communication cost and semantic fidelity. We implement a full end‑to‑end edge‑cloud system comprising an embedded NXP i.MX95 platform and a GPU server, communicating over bandwidth‑constrained networks. Experimental results show that, at 1 Mbps uplink, the proposed progressive scheme significantly reduces network latency compared to full‑edge and full‑cloud solutions, while maintaining high semantic consistency even under high compression. The implementation code will be released upon publication at https://github.com/open‑ep/ProSemComVLM.
Authors:Jon-Paul Cacioli
Abstract:
A predecessor pilot (Cacioli, 2026) found that Llama‑3‑8B implements prompted sandbagging as positional collapse rather than answer avoidance. However, fixed option ordering in MMLU‑Pro left open whether this reflected a model‑level position‑dominant policy or dataset‑level distractor structure. This pre‑registered follow‑up (3 models, 2,000 MMLU‑Pro items, 4 conditions, 24,000 primary trials) added cyclic option‑order randomisation as the critical control. The pre‑registered item‑level same‑letter diagnostic did not confirm deterministic position‑tracking (same‑letter rate 37.3%, below the 50% threshold). However, pre‑specified supporting analyses revealed that the response‑position distribution under sandbagging was highly stable under complete content rotation (Pearson r = 0.9994; Jensen‑Shannon divergence = 0.027, compared to 0.386 between honest and sandbagging conditions). Accuracy spiked to 72.1% when the correct answer coincidentally occupied the preferred position E, and fell to 4.3% at position A. The data provide strong evidence for a soft distributional attractor: under sandbagging instruction, the model enters a low‑entropy response‑position basin centred on E/F/G that is highly stable and largely content‑invariant at the aggregate level. Qwen‑2.5‑7B served as a negative control (non‑compliant, no distributional shift). These results provide evidence, at the 7‑9 billion parameter scale, that response‑position entropy is a promising black‑box behavioural signature of this sandbagging mode.
Authors:Wenshuo Zhao, Qi Zhu, Xingshan Zeng, Fei Mi, Lifeng Shang, Yi R., Fung
Abstract:
An effective way to scale up test‑time compute of large language models is to sample multiple responses and then select the best one, as in Grok Heavy and Gemini Deep Think. Existing selection methods often rely on external reward models, which requires training a strong reward model and introduces additional computation overhead. As an alternative, previous approaches have explored intrinsic signals, such as confidence and entropy, but these signals are noisy with naive aggregation. In this work, we observe that high‑entropy tokens tend to cluster into consecutive groups during inference, providing a more stable notion of model uncertainty than individual tokens. Together, these clusters reveal temporal patterns of model uncertainty throughout the inference process. Motivated by this observation, we propose to use the temporal structure of uncertainty as an intrinsic reward. To this end, we first formalize the basic unit of segment‑level uncertainty as the High Entropy Phase (HEP), a variable‑length segment that begins at a high‑entropy token and ends when consecutive low‑entropy tokens appear. We then define the Entropy Centroid, inspired by the concept of the center of mass in physics, as the weighted average position of all HEPs along the trajectory. Intuitively, a lower centroid indicates early exploration followed by confident generation, which we find often corresponds to higher response quality. Based on this insight, we propose the Lowest Centroid method, which selects the response with the lowest entropy centroid among multiple candidates. Experiments on mathematics, code generation, logical reasoning, and agentic tasks, across model scales ranging from 14B to 480B, show that Lowest Centroid consistently outperforms existing baselines and delivers stable gains as model size increases. Code is available at https://github.com/hkust‑nlp/entropy‑centroid.
Authors:Mohammed Suhail B Nadaf
Abstract:
Every RLHF‑trained language model is shaped by a reward model, yet the mechanistic interpretability toolkit ‑‑ logit lens, direct logit attribution, activation patching, sparse autoencoders ‑‑ was built for generative LLMs whose primitives all project onto a vocabulary unembedding. Reward models replace that with a scalar regression head, breaking each tool. We present reward‑lens, an open‑source library that ports this toolkit to reward models, organised around one observation: the reward head's weight vector w_r is the natural axis for every interpretability question. The library provides a Reward Lens, component attribution, three‑mode activation patching, a reward‑hacking probe suite, TopK SAE feature attribution, cross‑model comparison, and five theory‑grounded extensions (distortion index, divergence‑aware patching, misalignment cascade detection, reward‑term conflict analysis, concept‑vector analysis). A ten‑method adapter protocol covers Llama, Mistral, Gemma‑2, and ArmoRM multi‑objective heads, with a generic adapter for any HuggingFace sequence classification model. We validate on two production reward models across ~695 RewardBench pairs. The central empirical finding is negative: linear attribution does not predict causal patching effects (mean Spearman ρ= ‑0.256 on Skywork, ‑0.027 on ArmoRM). The framework treats this disagreement as a property to expose, not a bug ‑‑ motivating a design that keeps observational and causal views first‑class and directly comparable.
Authors:Dumitru Verşebeniuc, Martijn Elands, Sara Falahatkar, Chiara Magrone, Mohammad Falah, Martijn Boussé, Aki Härmä
Abstract:
Large Language Models have been increasingly employed in the creation of Virtual Assistants due to their ability to generate human‑like text and handle complex inquiries. While these models hold great promise, challenges such as hallucinations, missing information, and the difficulty of providing accurate and context‑specific responses persist, particularly when applied to highly specialized content domains. In this paper, we focus on addressing these challenges by developing a virtual assistant designed to support students at Maastricht University in navigating project‑specific regulations. We propose a virtual assistant based on a Retrieval‑Augmented Generation system that enhances the accuracy and reliability of responses by integrating up‑to‑date, domain‑specific knowledge. Through a robust evaluation framework and real‑life testing, we demonstrate that our virtual assistant can effectively meet the needs of students while addressing the inherent challenges of applying Large Language Models to a specialized educational context. This work contributes to the ongoing discourse on improving LLM‑based systems for specific applications and highlights areas for further research.
Authors:Dominik Żurek, Kamil Faber, Marcin Pietron, Paweł Gajewski, Roberto Corizzo
Abstract:
Continual offline reinforcement learning (CORL) aims to learn a sequence of tasks from datasets collected over time while preserving performance on previously learned tasks. This setting corresponds to domains where new tasks arise over time, but adapting the model in live environment interactions is expensive, risky, or impossible. However, CORL inherits the dual difficulty of offline reinforcement learning and adapting while preventing catastrophic forgetting. Replay‑based continual learning approaches remain a strong baseline but incur memory overhead and suffer from a distribution mismatch between replayed samples and newly learned policies. At the same time, architectural continual learning methods have shown strong potential in supervised learning but remain underexplored in CORL. In this work, we propose TSN‑Affinity, a novel CORL method based on TinySubNetworks and Decision Transformer. The method enables task‑specific parameterization and controlled knowledge sharing through a RL‑aware reuse strategy that routes tasks according to action compatibility and latent similarity. We evaluate the approach on benchmarks based on Atari games and simulations of manipulation tasks with the Franka Emika Panda robotic arm, covering both discrete and continuous control. Results show strong retention from sparse SubNetworks, with routing further improving multi‑task performance. Our findings suggest that similarity‑guided architectural reuse is a strong and viable alternative to replay‑based strategies in a CORL setting. Our code is available at: https://github.com/anonymized‑for‑submission123/tsn‑affinity.
Authors:Shuning Shang, Hubert Strauss, Stanley Wei, Sanjeev Arora, Noam Razin
Abstract:
Training language models via reinforcement learning often relies on imperfect proxy rewards, since ground truth rewards that precisely define the intended behavior are rarely available. Standard metrics for assessing the quality of proxy rewards, such as ranking accuracy, treat incorrect rewards as strictly harmful. In this work, however, we highlight that not all deviations from the ground truth are equal. By theoretically analyzing which outputs attract probability during policy gradient optimization, we categorize reward errors according to their effect on the increase in ground truth reward. The analysis establishes that reward errors, though conventionally viewed as harmful, can also be benign or even beneficial by preventing the policy from stalling around outputs with mediocre ground truth reward. We then present two practical implications of our theory. First, for reinforcement learning from human feedback (RLHF), we develop reward model evaluation metrics that account for the harmfulness of reward errors. Compared to standard ranking accuracy, these metrics typically correlate better with the performance of a language model after RLHF, yet gaps remain in robustly evaluating reward models. Second, we provide insights for reward design in settings with verifiable rewards. A key theme underlying our results is that the effectiveness of a proxy reward function depends heavily on its interaction with the initial policy and learning algorithm.
Authors:Jianghao Lin, Zi Ling, Chenyu Zhou, Tianyi Xu, Ruoqing Jiang, Zizhuo Wang, Dongdong Ge
Abstract:
Optimization modeling underpins real‑world decision‑making in logistics, manufacturing, energy, and public services, but reliably solving such problems from natural‑language requirements remains challenging for current large language models (LLMs). In this paper, we propose \emphAgora‑Opt, a modular agentic framework for optimization modeling that combines decentralized debate with a read‑write memory bank. Agora‑Opt allows multiple agent teams to independently produce end‑to‑end solutions and reconcile them through an outcome‑grounded debate protocol, while memory stores solver‑verified artifacts and past disagreement resolutions to support training‑free improvement over time. This design is flexible across both backbones and methods: it reduces base‑model lock‑in, transfers across different LLM families, and can be layered onto existing pipelines with minimal coupling. Across public benchmarks, Agora‑Opt achieves the strongest overall performance among all compared methods, outperforming strong zero‑shot LLMs, training‑centric approaches, and prior agentic baselines. Further analyses show robust gains across backbone choices and component variants, and demonstrate that decentralized debate offers a structural advantage over centralized selection by enabling agents to refine candidate solutions through interaction and even recover correct formulations when all initial candidates are flawed. These results suggest that reliable optimization modeling benefits from combining collaborative cross‑checking with reusable experience, and position Agora‑Opt as a practical and extensible foundation for trustworthy optimization modeling assistance. Our code and data are available at https://github.com/CHIANGEL/Agora‑Opt.
Authors:Shangqing Tu, Yanjia Li, Keyu Chen, Sichen Zhang, Jifan Yu, Daniel Zhang-Li, Lei Hou, Juanzi Li, Yu Zhang, Huiqin Liu
Abstract:
Creating interactive STEM courseware traditionally requires HTML/CSS/JavaScript expertise, leaving barriers for educators. While generative AI can produce HTML codes, existing tools generate static presentations rather than interactive simulations, struggle with long documents, and lack pedagogical accuracy mechanisms. Furthermore, full regeneration for modifications requires 200‑‑600 seconds, disrupting creative flow. We present MAIC‑UI, a zero‑code authoring system that enables educators to create and rapidly edit interactive courseware from textbooks, PPTs, and PDFs. MAIC‑UI employs: (1) structured knowledge analysis with multi‑modal understanding to ensure pedagogical rigor; (2) a two‑stage generate‑verify‑optimize pipeline separating content alignment from visual refinement; and (3) Click‑to‑Locate editing with Unified Diff‑based incremental generation achieving sub‑10‑second iteration cycles. A controlled lab study with 40 participants shows MAIC‑UI reduces editing iterations (4.9 vs. 7.0) and significantly improves learnability and controllability compared to direct Text‑to‑HTML generation. A three‑month classroom deployment with 53 high school students demonstrates that MAIC‑UI fosters learning agency and reduces outcome disparities ‑‑ the pilot class achieved 9.21‑point gains in STEM subjects compared to ‑2.32 points in control classes. Our code is available at https://github.com/THU‑MAIC/MAIC‑UI.
Authors:Chengsheng Zhang, Chenghao Sun, Xinyan Jiang, Wei Li, Xinmei Tian
Abstract:
Large Vision‑Language Models (LVLMs) have achieved remarkable progress in visual‑textual understanding, yet their reliability is critically undermined by hallucinations, i.e., the generation of factually incorrect or inconsistent responses. While recent studies using steering vectors demonstrated promise in reducing hallucinations, a notable challenge remains: they inadvertently amplify the severity of residual hallucinations. We attribute this to their exclusive focus on the decoding stage, where errors accumulate autoregressively and progressively worsen subsequent hallucinatory outputs. To address this, we propose Prefill‑Time Intervention (PTI), a novel steering paradigm that intervenes only once during the prefill stage, enhancing the initial Key‑Value (KV) cache before error accumulation occurs. Specifically, PTI is modality‑aware, deriving distinct directions for visual and textual representations. This intervention is decoupled to steer keys toward visually‑grounded objects and values to filter background noise, correcting hallucination‑prone representations at their source. Extensive experiments demonstrate PTI's significant performance in mitigating hallucinations and its generalizability across diverse decoding strategies, LVLMs, and benchmarks. Moreover, PTI is orthogonal to existing decoding‑stage methods, enabling plug‑and‑play integration and further boosting performance. Code is available at: https://github.com/huaiyi66/PTI.
Authors:Faith Wavinya Mutinda, Spandana Makeneni, Anna Lin, Shivaji Dutta, Irit R. Rasooly, Patrick Dibussolo, Shivani Kamath Belman, Hessam Shahriari, Kevin Murphy, Alex B. Ruan, Barbara H. Chaiyachati, Sanjay Chainani, Robert W. Grundmeier, Scott M. Haag, Jeffrey M. Miller, Heather M. Griffis, Ian M. Campbell
Abstract:
Introduction: Semantic search, which retrieves documents based on conceptual similarity rather than keyword matching, offers substantial advantages for retrieval of clinical information. However, deploying semantic search across entire health systems, comprising hundreds of millions of clinical notes, presents formidable engineering, cost, and governance challenges that have prevented adoption. Methods: We deployed a semantic search system at a large children's hospital indexing 166 million clinical notes (484 million vectors) from 1.68 million patients. The system uses instruction‑tuned qwen3‑embedding‑0.6B embeddings, stores vectors in a managed database with storage‑optimized indexing, maintains full‑text metadata in a low‑latency key‑value store, and operates within a HIPAA‑compliant governance framework. We evaluated the system through three experiments: optimization of embedding model and chunking strategy using a physician‑authored benchmark dataset, characterization of full‑scale performance (cost, latency, retrieval quality), and clinical utility assessment via comparison of chart abstraction efficiency across three tasks. Results: The system delivers sub‑second query latency (median 237 ms single‑user, 451 ms 20‑user concurrency) with monthly costs of approximately USD 4,000. Qwen3 embeddings with 300‑token chunk size achieved 94.6% accuracy on a clinical question‑answering benchmark. In clinical utility evaluation across three abstraction tasks, semantic search reduced time‑to‑completion by 24 to 89% compared to clinician‑performed chart review while maintaining comparable inter‑rater agreement. Conclusion: Health‑system‑scale semantic search is both technically and operationally feasible. The system provides infrastructure supporting interactive search, cohort generation, and downstream LLM‑powered clinical applications without requiring specialized informatics expertise.
Authors:Junxing Hu, Tianlong Li, Lei Yu, Ai Han
Abstract:
Deploying production‑ready multi‑agent systems (MAS) in complex industrial environments remains challenging due to limitations in scalability, observability, and autonomous evolution. We present OxyGent, an open‑source framework driven by two core novelties: a unified Oxy abstraction and the OxyBank evolution engine. The unified abstraction encapsulates agents, tools, LLMs, and reasoning flows as pluggable atomic components, enabling Lego‑like scalable system composition and non‑intrusive monitoring. To enhance observability, OxyGent introduces permission‑driven dynamic planning that replaces rigid workflows with execution graphs generated at runtime, providing adaptive visualizations. Furthermore, to support continuous evolution, OxyBank serves as an AI asset management platform that drives automated data backflow, annotation, and joint evolution. Empirical evaluations and real‑world case studies show that OxyGent provides a robust and scalable foundation for MAS. OxyGent is fully open‑sourced under the Apache License 2.0 at https://github.com/jd‑opensource/OxyGent.
Authors:Ignacio Peyrano
Abstract:
Enterprise software engineering is shifting away from deterministic CRUD/REST architectures toward AI‑native systems where large language models act as cognitive orchestrators. This transition introduces a critical security tension: probabilistic LLMs weaken classical mechanisms for validation, access control, and formal testing. This paper proposes the design, formal validation, and empirical evaluation of a Semantic Gateway governed by the Model Context Protocol (MCP). The gateway reframes the enterprise API as a semantic surface where tools are dynamically discovered, authorized, and executed based on intent and policy enforcement. The central contribution rests on a paradigm shift: autonomous agents must not be validated as traditional software nor as simple API consumers, but as stochastic state‑transition systems whose behavior must be abstracted, fuzzed, and audited through enabled‑tool graphs. The architecture introduces a three‑layer Zero‑Trust security model comprising a pre‑inference Semantic Firewall, deterministic Tool‑Level RBAC, and out‑of‑band Cryptographic Human‑in‑the‑Loop approval. Enabledness‑Preserving Abstractions (EPAs) and greybox semantic fuzzing‑‑originally developed for blockchain smart contract verification‑‑are adapted to audit agent behavior in enterprise environments. Results demonstrate an 84.2% reduction in incidental code. Across 500,000 multi‑turn fuzzing sequences, the methodology achieved a 100% discovery rate of hidden unauthorized state transitions, proving that dynamic formal verification is strictly necessary for secure agentic deployment.
Authors:Ma Zirui, Fan Zhihua, Li Wenxing, Wu Haibin, Zhang Fulin, Ye Xiaochun, Li Wenming
Abstract:
Speculative decoding enhances the inference efficiency of large language models (LLMs) by generating drafts using a small draft language model (DLM) and verifying them in batches with a large target language model (TLM). However, adaptive drafting inference on a mobile single‑NPU‑PIM system faces idle overhead in traditional operator‑level synchronous execution and wasted computation in asynchronous execution due to fluctuations in draft length. This paper introduces AHASD, a task‑level asynchronous mobile NPU‑PIM heterogeneous architecture for speculative decoding. Notably, AHASD achieves parallel drafting on the PIM and verification on a single NPU through task‑level DLM‑TLM decoupling and specifically, it incorporates Entropy‑History‑Aware Drafting Control and Time‑Aware Pre‑Verification Control to dynamically manage adaptive drafting algorithm execution and pre‑verification timing, suppressing invalid drafting based on low‑confidence drafts. Additionally, AHASD integrates Attention Algorithm Units and Gated Task Scheduling Units within LPDDR5‑PIM to enable attention link localization and sub‑microsecond task switching on the PIM side. Experimental results for different LLMs and adaptive drafting algorithms show that AHASD achieves up to 4.2× in throughput and 5.6× in energy efficiency improvements over a GPU‑only baseline, and 1.5× in throughput and 1.24× in energy efficiency gains over the state‑of‑the‑art GPU+PIM baseline, with hardware overhead below 3% of the DRAM area.
Authors:Li Ju, Junzhe Wang, Qi Zhang
Abstract:
Retrieval‑Augmented Generation (RAG) models frequently produce answers grounded in parametric memory rather than the retrieved context, undermining the core promise of retrieval augmentation. A fundamental obstacle to fixing this unfaithfulness is the lack of training data that explicitly requires models to prefer context over internal knowledge. We introduce Faithfulness‑QA, a large‑scale dataset of 99,094 samples constructed through counterfactual entity substitution. Starting from two established extractive QA benchmarks‑‑SQuAD and TriviaQA‑‑we automatically identify answer‑bearing named entities in each context, replace them with type‑consistent alternatives drawn from a curated bank of 76,953 entities, and thereby manufacture controlled knowledge conflicts between context and parametric memory. Rigorous quality filtering ensures 100% pass rates across four automated checks on random 200‑sample audits. We release the full dataset, the construction pipeline, and a typed entity bank covering eight named entity categories. Faithfulness‑QA is designed as a training resource for attention‑based faithfulness objectives and as an evaluation benchmark for measuring context‑grounding behavior in RAG systems. Data and code are available at https://github.com/qzhangFDU/faithfulness‑qa‑dataset.
Authors:Sehyeon Oh, Yongin Kwon, Jemin Lee
Abstract:
FlashAttention improves efficiency through tiling, but its online softmax still relies on floating‑point arithmetic for numerical stability, making full quantization difficult. We identify three main obstacles to integer‑only FlashAttention: (1) scale explosion during tile‑wise accumulation, (2) inefficient shift‑based exponential operations on GPUs, and (3) quantization granularity constraints requiring uniform scales for integer comparison. To address these challenges, we propose QFlash, an end‑to‑end integer FlashAttention design that performs softmax entirely in the integer domain and runs as a single Triton kernel. On seven attention workloads from ViT, DeiT, and Swin models, QFlash achieves up to 6.73× speedup over I‑ViT and up to 8.69× speedup on Swin, while reducing energy consumption by 18.8% compared to FP16 FlashAttention, without sacrificing Top‑1 accuracy on ViT/DeiT and remaining competitive on Swin under per‑tensor quantization. Our code is publicly available at https://github.com/EfficientCompLab/qflash.
Authors:Lei Xiong, Kun Luo, Ziyi Xia, Wenbo Zhang, Jin-Ge Yao, Zheng Liu, Jingying Shao, Jianlyu Chen, Hongjin Qian, Xi Yang, Qian Yu, Hao Li, Chen Yue, Xiaan Du, Yuyang Wang, Yesheng Liu, Haiyu Xu, Zhicheng Dou
Abstract:
Autonomous scientific research is significantly advanced thanks to the development of AI agents. One key step in this process is finding the right scientific literature, whether to explore existing knowledge for a research problem, or to acquire evidence for verifying assumptions and supporting claims. To assess AI agents' capability in driving this process, we present AutoResearchBench, a dedicated benchmark for autonomous scientific literature discovery. AutoResearchBench consists of two complementary task types: (1) Deep Research, which requires tracking down a specific target paper through a progressive, multi‑step probing process, and (2) Wide Research, which requires comprehensively collecting a set of papers satisfying given conditions. Compared to previous benchmarks on agentic web browsing, AutoResearchBench is distinguished along three dimensions: it is research‑oriented, calling for in‑depth comprehension of scientific concepts; literature‑focused, demanding fine‑grained utilization of detailed information; and open‑ended, involving an unknown number of qualified papers and thus requiring deliberate reasoning and search throughout. These properties make AutoResearchBench uniquely suited for evaluating autonomous research capabilities, and extraordinarily challenging. Even the most powerful LLMs, despite having largely conquered general agentic web‑browsing benchmarks such as BrowseComp, achieve only 9.39% accuracy on Deep Research and 9.31% IoU on Wide Research, while many other strong baselines fall below 5%. We publicly release the dataset and evaluation pipeline to facilitate future research in this direction. We publicly release the dataset, evaluation pipeline, and code at https://github.com/CherYou/AutoResearchBench.
Authors:Ridwan Mahbub, Syem Aziz, Mahir Ahmed, Shadikur Rahman, Mizanur Rahman, Shafiq Joty, Enamul Hoque
Abstract:
Data videos are a powerful medium for visual data based storytelling, combining animated, chart‑centric visualizations with synchronized narration. Widely used in journalism, education, and public communication, they help audiences understand complex data through clear and engaging visual explanations. Despite their growing impact, generating data‑driven video stories remains challenging, as it requires careful coordination of visual encoding, temporal progression, and narration and substantial expertise in visualization design, animation, and video‑editing tools. Recent advances in large language models offer new opportunities to automate this process; however, there is currently no benchmark for rigorously evaluating models on animated visualization‑based video storytelling. To address this gap, we introduce DataReel, a benchmark for automated data‑driven video story generation comprising 328 real‑world stories. Each story pairs structured data, a chart visualization, and a narration transcript, enabling systematic evaluation of models' abilities to generate animated data video stories. We further propose a multi‑agent framework that decomposes the task into planning, generation, and verification stages, mirroring key aspects of the human storytelling process. Experiments show that this multi‑agent approach outperforms direct prompting baselines under both automatic and human evaluations, while revealing persistent challenges in coordinating animation, narration, and visual emphasis. We release DataReel at https://github.com/vis‑nlp/DataReel.
Authors:Alexander Kolpakov, Igor Rivin
Abstract:
Dimensionality reduction methods such as UMAP and t‑SNE are central tools for visualising high‑dimensional data, but their local‑neighborhood objectives can preserve sampling noise while distorting global topology. We show that standard local metrics reward this noise memorisation: top‑performing embeddings invent cycles and disconnected islands absent from the data. We introduce a topology‑faithfulness benchmark based on noisy manifolds with known homology, tune DiRe against it, and find Pareto‑optimal configurations that match or beat GPU‑accelerated UMAP on classification while recovering exact first Betti numbers on stress tests. On 723K arXiv paper embeddings, DiRe preserves 3‑4 times more topological structure than UMAP at comparable wall‑clock.
Authors:Jiatong Ma, Longteng Guo, Yuchen Liu, Zijia Zhao, Dongze Hao, Xuanxu Lin, Jing Liu
Abstract:
We present M^3‑VQA, a novel knowledge‑based Visual Question Answering (VQA) benchmark, to enhance the evaluation of multimodal large language models (MLLMs) in fine‑grained multimodal entity understanding and complex multi‑hop reasoning. Unlike existing VQA datasets that focus on coarse‑grained categories and simple reasoning over single entities, M^3‑VQA introduces diverse multi‑entity questions involving multiple distinct entities from both visual and textual sources. It requires models to perform both sequential and parallel multi‑hop reasoning across multiple documents, supported by traceable, detailed evidence and a curated multimodal knowledge base. We evaluate 16 leading MLLMs under three settings: without external knowledge, with gold evidence, and with retrieval‑augmented input. The poor results reveal significant challenges for MLLMs in knowledge acquisition and reasoning. Models perform poorly without external information but improve markedly when provided with precise evidence. Furthermore, reasoning‑aware agentic retrieval surpasses heuristic methods, highlighting the importance of structured reasoning for complex multimodal understanding. M^3‑VQA presents a more challenging evaluation for advancing the multimodal reasoning capabilities of MLLMs. Our code and dataset are available at https://github.com/CASIA‑IVA‑Lab/M3VQA.
Authors:Ming Li, Jie Wu, Justin Cui, Xiaojie Li, Rui Wang, Chen Chen
Abstract:
While preference optimization is crucial for improving visual generative models, how to effectively scale this paradigm remains largely unexplored. Current open‑source preference datasets contain conflicting preference patterns, where winners excel in some dimensions but underperform in others. Naively optimizing on such noisy datasets fails to learn preferences, hindering effective scaling. To enhance robustness against noise, we propose Poly‑DPO, which extends the DPO objective with an additional polynomial term that dynamically adjusts model confidence based on dataset characteristics, enabling effective learning across diverse data distributions. Beyond biased patterns, existing datasets suffer from low resolution, limited prompt diversity, and imbalanced distributions. To facilitate large‑scale visual preference optimization by tackling data bottlenecks, we construct ViPO, a massive‑scale preference dataset with 1M image pairs at 1024px across five categories and 300K video pairs at 720p+ across three categories. State‑of‑the‑art generative models and diverse prompts ensure reliable preference signals with balanced distributions. Remarkably, when applying Poly‑DPO to our high‑quality dataset, the optimal configuration converges to standard DPO. This convergence validates dataset quality and Poly‑DPO's adaptive nature: sophisticated optimization becomes unnecessary with sufficient data quality, yet remains valuable for imperfect datasets. We validate our approach across visual generation models. On noisy datasets like Pick‑a‑Pic V2, Poly‑DPO achieves 6.87 and 2.32 gains over Diffusion‑DPO on GenEval for SD1.5 and SDXL, respectively. For ViPO, models achieve performance far exceeding those trained on existing open‑source preference datasets. These results confirm that addressing both algorithmic adaptability and data quality is essential for scaling visual preference optimization.
Authors:Xinxin Liu, Ming Li, Zonglin Lyu, Yuzhang Shang, Chen Chen
Abstract:
Human visual preferences are inherently multi‑dimensional, encompassing aesthetics, detail fidelity, and semantic alignment. However, existing datasets provide only single, holistic annotations, resulting in severe label noise: images that excel in some dimensions but are deficient in others are simply marked as winner or loser. We theoretically demonstrate that compressing multi‑dimensional preferences into binary labels generates conflicting gradient signals that misguide Diffusion Direct Preference Optimization (DPO). To address this, we propose Semi‑DPO, a semi‑supervised approach that treats consistent pairs as clean labeled data and conflicting ones as noisy unlabeled data. Our method starts by training on a consensus‑filtered clean subset, then uses this model as an implicit classifier to generate pseudo‑labels for the noisy set for iterative refinement. Experimental results demonstrate that Semi‑DPO achieves state‑of‑the‑art performance and significantly improves alignment with complex human preferences, without requiring additional human annotation or explicit reward models during training. We will release our code and models at: https://github.com/L‑CodingSpace/semi‑dpo
Authors:Mohammed Ali El Adlouni, Aurian Quelennec, Pierre Chouteau, Geoffroy Peeters, Slim Essid
Abstract:
General audio foundation models have recently achieved remarkable progress, enabling strong performance across diverse tasks. However, state‑of‑the‑art models remain extremely large, often with hundreds of millions of parameters, leading to high inference costs and limited deployability on edge devices. Knowledge distillation is a proven strategy for model compression, but prior work in audio has mostly focused on supervised settings, relying on class logits, intermediate features, or architecture‑specific techniques. Such assumptions exclude models that output only embeddings, such as self‑supervised or metric‑learning models. We introduce S‑SONDO (Self‑Supervised KnOwledge DistillatioN for General AuDio FOundation Models), the first framework to distill general audio models using only their output embeddings. By avoiding the need for logits or layer‑level alignment, S‑SONDO is architecture‑agnostic and broadly applicable to embedding‑based teachers. We demonstrate its effectiveness by distilling two audio foundation models into three efficient students that are up to 61 times smaller while retaining up to 96% of teacher performance. We also provide practical insights on loss choice and clustering‑based balanced data sampling. Code is available here: https://github.com/MedAliAdlouni/ssondo.
Authors:Yunsu Kim, Kaden Uhlig, Joern Wuebker
Abstract:
Agent benchmarks remain largely English‑centric, while their multilingual versions are often built with machine translation (MT) and limited post‑editing. We argue that, for agentic tasks, this minimal workflow can easily break benchmark validity through query‑answer misalignment or culturally off‑target context. We propose a refined workflow for adapting English benchmarks into multiple languages with explicit functional alignment, cultural alignment, and difficulty calibration using both automated checks and human review. Using this workflow, we introduce GAIA‑v2‑LILT, a re‑audited multilingual extension of GAIA covering five non‑English languages. In experiments, our workflow improves agent success rates by up to 32.7% over minimally translated versions, bringing the closest audited setting to within 3.1% of English performance while substantial gaps remain in many other cases. This indicates that a substantial share of the multilingual performance gap is benchmark‑induced measurement error, motivating task‑level alignment when adapting English benchmarks across languages. The data is available as part of the MAPS package at https://huggingface.co/datasets/Fujitsu‑FRE/MAPS/viewer/GAIA‑v2‑LILT. We also release the code used in our experiments at https://github.com/lilt/gaia‑v2‑lilt.
Authors:Yuanhao Zeng, Ao Lu, Lufei Li, Zheng Zhang, Yexin Li, Kan Ren
Abstract:
Generating diverse responses is crucial for test‑time scaling of large language models (LLMs), yet standard stochastic sampling mostly yields surface‑level lexical variation, limiting semantic exploration. In this paper, we propose Exploratory Sampling (ESamp), a decoding approach that explicitly encourages semantic diversity during generation. ESamp is motivated by the well‑known observation that neural networks tend to make lower‑error predictions on inputs similar to those encountered before, and incur higher prediction error on novel ones. Building on this property, we train a lightweight Distiller at test time to predict deep‑layer hidden representations of the LLM from its shallow‑layer representations to model the LLM's depth‑wise representation transitions. During decoding, the Distiller continuously adapts to the mappings induced by the current generation context. ESamp uses the prediction error as a novelty signal to reweight candidate token extensions conditioned on the current prefix, thereby biasing decoding toward less‑explored semantic patterns. ESamp is implemented with an asynchronous training‑‑inference pipeline, with less than 5% worst case overhead (1.2% in the optimized release). Empirical results show that ESamp significantly boosts the Pass@k efficiency of reasoning models, showing superior or comparable performance to strong stochastic and heuristic baselines. Notably, ESamp achieves robust generalization across mathematics, science, and code generation benchmarks and breaks the trade‑off between diversity and coherence in creative writing. Our code has released at: https://github.com/LinesHogan/tLLM.
Authors:John Seon Keun Yi, Aaron Mueller, Dokyun Lee
Abstract:
Multi‑agent debate has been shown to improve reasoning in large language models (LLMs). However, it is compute‑intensive, requiring generation of long transcripts before answering questions. To address this inefficiency, we develop a framework that distills multi‑agent debate into a single LLM through a two‑stage fine‑tuning pipeline combining debate structure learning with internalization via dynamic reward scheduling and length clipping. Across multiple models and benchmarks, our internalized models match or exceed explicit multi‑agent debate performance using up to 93% fewer tokens. We then investigate the mechanistic basis of this capability through activation steering, finding that internalization creates agent‑specific subspaces: interpretable directions in activation space corresponding to different agent perspectives. We further demonstrate a practical application: by instilling malicious agents into the LLM through internalized debate, then applying negative steering to suppress them, we show that distillation makes harmful behaviors easier to localize and control with smaller reductions in general performance compared to steering base models. Our findings offer a new perspective for understanding multi‑agent capabilities in distilled models and provide practical guidelines for controlling internalized reasoning behaviors. Code available at https://github.com/johnsk95/latent_agents
Authors:Nishit Anand, Manan Suri, Christopher Metzler, Dinesh Manocha, Ramani Duraiswami
Abstract:
Controlling illumination in images is essential for photography and visual content creation. While closed‑source models have demonstrated impressive illumination control, open‑source alternatives either require heavy control inputs like depth maps or do not release their data and code. We present a fully open‑source and reproducible pipeline for learning illumination control in diffusion models. Our approach builds a data engine that transforms well‑lit images into supervised training triplets consisting of a poorly‑illuminated input image, a natural language lighting instruction, and a well‑illuminated output image. We finetune a diffusion model on this data and demonstrate significant improvements over baseline SD 1.5, SDXL, and FLUX.1‑dev models in perceptual similarity, structural similarity, and identity preservation. Our work provides a reproducible solution built entirely with open‑source tools and publicly available data. We release all our code, data, and model weights publicly.
Authors:Tingwu Wang, Olivier Dionne, Michael De Ruyter, David Minor, Davis Rempe, Kaifeng Zhao, Mathis Petrovich, Ye Yuan, Chenran Li, Zhengyi Luo, Brian Robison, Xavier Blackwell, Bernardo Antoniazzi, Xue Bin Peng, Yuke Zhu, Simon Yuen
Abstract:
Despite transformative advances in generative motion synthesis, real‑time interactive motion control remains dominated by traditional techniques. In this work, we identify two key challenges in bridging research and production: 1) Real‑time scalability: Industry applications demand real‑time generation of a vast repertoire of motion skills, while generative methods exhibit significant degradation in quality and scalability under real‑time computation constraints, and 2) Integration: Industry applications demand fine‑grained multi‑modal control involving velocity commands, style selection, and precise keyframes, a need largely unmet by existing text‑ or tag‑driven models. To overcome these limitations, we introduce MotionBricks: a large‑scale, real‑time generative framework with a two‑fold solution. First, we propose a large‑scale modular latent generative backbone tailored for robust real‑time motion generation, effectively modeling a dataset of over 350,000 motion clips with a single model. Second, we introduce smart primitives that provide a unified, robust, and intuitive interface for authoring both navigation and object interaction. Applications can be designed in a plug‑and‑play manner like assembling bricks without expert animation knowledge. Quantitatively, we show that MotionBricks produces state‑of‑the‑art motion quality on open‑source and proprietary datasets of various scales, while also achieving a real‑time throughput of 15,000 FPS with 2ms latency. We demonstrate the flexibility and robustness of MotionBricks in a complete production‑level animation demo, covering navigation and object‑scene interaction across various styles with a unified model. To showcase our framework's application beyond animation, we deploy MotionBricks on the Unitree G1 humanoid robot to demonstrate its flexibility and generalization for real‑time robotic control.
Authors:Thomas Carmichael
Abstract:
Autoregressive transformers make confident errors, but activation monitoring can catch them only if the model preserves an internal signal that output confidence does not expose. This preservation is determined by architecture and training recipe. We define observability as the linear readability of per‑token decision quality from frozen mid‑layer activations after controlling for max‑softmax confidence and activation norm. The correction is essential. Confidence controls absorb 57.7% of raw probe signal on average across 13 models in 6 families. Observability is not a generic property of transformers. In Pythia's controlled suite, every tested run with the 24‑layer, 16‑head configuration collapses to rho_partial ~0.10 across a 3.5x parameter gap and two Pile variants, while six other configurations occupy a separated healthy band from 0.21 to 0.38. The output‑controlled residual collapses at the same points, and neither tested nonlinear probes nor layer sweeps recover healthy‑range signal. Checkpoint dynamics show the collapse is emergent during training. Both configurations at matched hidden dimension form the signal at the earliest measured checkpoint, but training erases it in the (24L, 16H) class while predictive loss continues improving. Across independent recipes the collapse map changes but the phenomenon persists. Qwen 2.5 and Llama differ by 2.9x at matched 3B scale with probe seed distributions that do not overlap, while Mistral 7B preserves observability where Llama 3.1 8B collapses despite similar broad architecture. A WikiText‑trained observer transfers to downstream QA without training on those tasks, catching errors confidence misses. At 20% flag rate, its exclusive catch rate is 10.9‑13.4% of all errors in seven of nine model‑task cells. Architecture selection is a monitoring decision.
Authors:Zhou Ziheng, Huacong Tang, Jinyuan Zhang, Haowei Lin, Bangcheng Yang, Qian Long, Fang Sun, Yizhou Sun, Yitao Liang, Ying Nian Wu, Demetri Terzopoulos, Xiaofeng Gao
Abstract:
Discovering causal regularities and applying them to build functional systems‑‑the discovery‑to‑application loop‑‑is a hallmark of general intelligence, yet evaluating this capacity has been hindered by the vast complexity gap between scientific discovery and real‑world engineering. We introduce SciCrafter, a Minecraft‑based benchmark that operationalizes this loop through parameterized redstone circuit tasks. Agents must ignite lamps in specified patterns (e.g., simultaneously or in timed sequences); scaling target parameters substantially increases construction complexity and required knowledge, forcing genuine discovery rather than reliance on memorized solutions. Evaluating frontier models including GPT‑5.2, Gemini‑3‑Pro, and Claude‑Opus‑4.5 under a general‑purpose code agent scaffold, we find that all plateau at approximately 26% success rate. To diagnose these failures, we decompose the loop into four capacities‑‑knowledge gap identification, experimental discovery, knowledge consolidation, and knowledge application‑‑and design targeted interventions whose marginal contributions serve as proxies for corresponding gaps. Our analysis reveals that although the general knowledge application capability still remains as the biggest gap across all models, for frontier models the knowledge gap identification starts to become a major hurdle‑‑indicating the bottleneck is shifting from solving problems right to raising the right problems for current AI. We release SciCrafter as a diagnostic probe for future research on AI systems that navigate the full discovery‑to‑application loop.
Authors:Fan Du, Feng Yan, Jianxiong Wu, Xinrun Xu, Weiye Zhang, Weinong Wang, Yu Guo, Bin Qian, Zhihai He, Fei Wang, Heng Yang
Abstract:
Flow‑based vision‑language‑action (VLA) policies offer strong expressivity for action generation, but suffer from a fundamental inefficiency: multi‑step inference is required to recover action structure from uninformative Gaussian noise, leading to a poor efficiency‑quality trade‑off under real‑time constraints. We address this issue by rethinking the role of the starting point in generative action modeling. Instead of shortening the sampling trajectory, we propose CF‑VLA, a coarse‑to‑fine two‑stage formulation that restructures action generation into a coarse initialization step that constructs an action‑aware starting point, followed by a single‑step local refinement that corrects residual errors. Concretely, the coarse stage learns a conditional posterior over endpoint velocity to transform Gaussian noise into a structured initialization, while the fine stage performs a fixed‑time refinement from this initialization. To stabilize training, we introduce a stepwise strategy that first learns a controlled coarse predictor and then performs joint optimization. Experiments on CALVIN and LIBERO show that our method establishes a strong efficiency‑performance frontier under low‑NFE (Number of Function Evaluations) regimes: it consistently outperforms existing NFE=2 methods, matches or surpasses the NFE=10 π_0.5 baseline on several metrics, reduces action sampling latency by 75.4%, and achieves the best average real‑robot success rate of 83.0%, outperforming MIP by 19.5 points and π_0.5 by 4.0 points. These results suggest that structured, coarse‑to‑fine generation enables both strong performance and efficient inference. Our code is available at https://github.com/EmbodiedAI‑RoboTron/CF‑VLA.
Authors:Jisoo Yang, Jongwon Ryu, Minuk Ma, Trung X. Pham, Junyeong Kim
Abstract:
Large Language Models (LLMs) reveal inherent and distinctive personas through dialogue. However, most existing persona discovery approaches rely on surface‑level lexical or stylistic cues, treating dialogue as a flat sequence of tokens and failing to capture the deeper discourse‑level structures that sustain persona consistency. To address this limitation, we propose a novel analytical framework that interprets LLM dialogue through bridging inference ‑‑ implicit conceptual relations that connect utterances via shared world knowledge and discourse coherence. By modeling these relations as structured knowledge graphs, our approach captures latent semantic links that govern how LLMs organize meaning across turns, enabling persona discovery at the level of discourse coherence rather than surface realizations. Experimental results across multiple reasoning backbones and target LLMs, ranging from small‑scale models to 80B‑parameter systems, demonstrate that bridging‑inference graphs yield significantly stronger semantic coherence and more stable persona identification than frequency or style‑based baselines. These results show that persona traits are consistently encoded in the structural organization of discourse rather than isolated lexical patterns. This work presents a systematic framework for probing, extracting, and visualizing latent LLM personas through the lens of Cognitive Discourse Theory, bridging computational linguistics, cognitive semantics, and persona reasoning in large language models. Codes are available at https://github.com/JiSoo‑Yang/Persona_Bridging.git
Authors:Jon-Paul Cacioli
Abstract:
Small instruct‑tuned LLMs produce degenerate verbal confidence under minimal elicitation: ceiling rates above 95%, near‑chance Type‑2 AUROC, and Invalid validity profiles. We test whether confidence‑conditioned supervised fine‑tuning (CSFT) with self‑consistency‑derived targets can close the gap between internal information and verbal readout. A pre‑registered Phase 0 protocol on Gemma 3 4B‑it with a modal filter restricting training to items with correct modal answers produced a negative result: AUROC2 dropped from 0.554 to 0.509 due to label‑entropy collapse in the training targets. An exploratory rescue removed the filter, training on all 2,000 calibration items. This produced a binary verbal correctness discriminator with AUROC2 = 0.774 on held‑out TriviaQA, compressing a 10‑sample self‑consistency signal (AUROC2 = 0.999) into a single‑pass readout exceeding logit entropy (0.701). The shuffled‑target control showed no improvement (0.501). On MMLU, accuracy improved from 54.2% to 77.4% with the shuffled model at baseline (56.1%), supporting a target‑dependent interpretation. The result is exploratory, binary rather than continuously calibrated, and observed at a single scale. It identifies two design lessons: confidence training requires label entropy, and correct targets regularise output format.
Authors:Wenjie Du, Yiyuan Yang, Tianxiang Zhan, Qingsong Wen
Abstract:
Partially‑observed time series (POTS) is ubiquitous in real‑world applications, yet most existing toolchains separate missing‑value handling from downstream learning, which limits reproducibility and overall performance. This tutorial introduces PyPOTS, an open‑source Python ecosystem for end‑to‑end data mining and machine learning on POTS. We present practical workflows spanning missingness simulation, data preprocessing, model training, and evaluation across core tasks, including imputation, forecasting, classification, clustering, and anomaly detection. The tutorial consists of two parts: Part I emphasizes hands‑on application for practitioners through unified APIs and benchmark‑oriented experiments. Part II targets developers and researchers, focusing on extending PyPOTS with custom models, domain‑specific constraints, and contribution‑ready engineering practices. Participants will gain both conceptual understanding and implementation experience for building robust, transparent, and reusable POTS pipelines in research and production settings. PyPOTS is publicly available at https://github.com/WenjieDu/PyPOTS
Authors:Hojoon Kim, Yuheng Wu, Thierry Tambe
Abstract:
Embodied AI agents increasingly rely on large language models (LLMs) for planning, yet per‑step LLM calls impose severe latency and cost. In this paper, we show that embodied tasks exhibit strong plan locality, where the next plan is largely predictable from the current one. Building on this, we introduce AgenticCache, a planning framework that reuses cached plans to avoid per‑step LLM calls. In AgenticCache, each agent queries a runtime cache of frequent plan transitions, while a background Cache Updater asynchronously calls the LLM to validate and refine cached entries. Across four multi‑agent embodied benchmarks, AgenticCache improves task success rate by 22% on average across 12 configurations (4 benchmarks x 3 models), reduces simulation latency by 65%, and lowers token usage by 50%. Cache‑based plan reuse thus offers a practical path to low‑latency, low‑cost embodied agents. Code is available at https://github.com/hojoonleokim/MLSys26_AgenticCache.
Authors:Chenyang An, Qihao Ye, Minghao Pan, Jiayaun Zhang
Abstract:
We explore a central question in AI for mathematics: can AI systems produce original, nontrivial proofs for open research problems? Despite strong benchmark performance, producing genuinely novel proofs remains an outstanding challenge for LLMs. Through systematic experiments with frontier LLMs on research‑level proof tasks, we identify seven failure modes that prevent reliable proof generation, including context contamination, citation hallucination, hand‑waving on key steps and misallocation of proof effort, unstable proof plans, unfocused verification, problem modification and single‑model bottleneck. We argue that the gap between benchmark success and research‑level proving is primarily one of system design, due to those failure modes. We present QED, an open‑source multi‑agent proof system in which each architectural decision directly addresses a specific failure mode. Evaluated on five open problems in applied analysis and PDEs contributed by domain experts, QED produces correct proofs for three problems, each verified by the contributing experts as original and nontrivial. QED is released as open‑source software at https://github.com/proofQED/QED.
Authors:Jiaqi Wang, Wenhao Zhang, Weijie Shi, Yaliang Li, James Cheng
Abstract:
On‑policy distillation (OPD) has shown strong potential for transferring reasoning ability from frontier or domain‑specific models to smaller students. While effective on static single‑turn tasks, its behavior in multi‑turn agent settings remains underexplored. In this work, we identify a key limitation of vanilla OPD in such settings, which we term Trajectory‑Level KL Instability. Specifically, we observe that KL divergence increases together with a drop in success rate, and even after convergence, the KL remains high, leading to unstable training. This instability arises from inter‑turn error compounding: as errors accumulate, the student is driven beyond the teacher's effective support, rendering the supervision signal unreliable. To address this, we propose TCOD (Temporal Curriculum On‑Policy Distillation), a simple yet effective framework that controls the trajectory depth exposed to the student and progressively expands it from short to long with a curriculum schedule. Experimental results across four student‑teacher pairs on three multi‑turn agent benchmarks (ALFWorld, WebShop, ScienceWorld) show that TCOD mitigates KL escalation and enhances KL stability throughout training, improving agent performance by up to 18 points over vanilla OPD. Further evaluations show that TCOD can even surpass the teacher's performance and generalize to tasks on which the teacher fails. Our code is available at https://github.com/kokolerk/TCOD.
Authors:Yao Wang, Zixu Geng, Jun Yan
Abstract:
Knowledge graphs (KGs) are increasingly used to support large lan guage model (LLM) reasoning, but standard triplet‑based KGs treat each relation as globally valid. In many settings, whether a relation should count as evidence depends on the context. We therefore formulate triplet validity as a triplet‑specific function of context and refer to this formulation as a Quantum Knowledge Graph (QKG). We instantiate QKG in medicine using a diabetes‑centered PrimeKG subgraph, whose 68,651 context‑sensitive relations are further annotated with patient‑group‑specific constraints. We evaluate it in a reasoner‑‑validator pipeline for medical question answering on a KG‑grounded subset of MedReason containing 2,788 questions. With Haiku‑4.5 as both the Reasoner and the Validator, KG‑backed validation significantly improves over a no‑validator baseline (+0.61 pp), and QKG with context matching yields the largest gain, outperforming both KG validation without context matching (+0.79 pp) and the no‑validator baseline (+1.40 pp; paired McNemar, all p<0.05). Under a stronger validator (Qwen‑3.6‑Plus), the raw QKG gain over the no‑validator baseline grows from +1.40 pp to +5.96 pp; the context‑matching gap is non‑significant (p=0.73) on the raw set but becomes borderline significant (p=0.05) after adjustment for knowledge leakage and suspicious questions, consistent with a benchmark‑gold ceiling rather than a QKG limitation. Taken together, the results support the view that the value of a KG in LLM‑based clinical reasoning lies not merely in storing medically related facts, but in representing whether those facts are applicable to the specific patient context. For reproducibility and further research, we release the curated QKG datasets and source code.\footnotehttps://github.com/HKAI‑Sci/QKG
Authors:SungHo Kim, Juhyeong Park, Yeachan Kim, SangKeun Lee
Abstract:
The Korean writing system, Hangeul, has a unique character representation rigidly following the invention principles recorded in Hunminjeongeum.\footnoteHunminjeongeum is a book published in 1446 that describes the principles of invention and usage of Hangeul, devised by King Sejong \citeHunminjeongeum_Guide. However, existing pre‑trained language models (PLMs) for Korean have overlooked these principles. In this paper, we introduce a novel framework for Korean PLMs called KOMBO, which firstly brings the invention principles of Hangeul to represent character. Our proposed method, KOMBO, exhibits notable experimental proficiency across diverse NLP tasks. In particular, our method outperforms the state‑of‑the‑art Korean PLM by an average of 2.11% in five Korean natural language understanding tasks. Furthermore, extensive experiments demonstrate that our proposed method is suitable for comprehending the linguistic features of the Korean language. Consequently, we shed light on the superiority of using subcharacters over the typical subword‑based approach for Korean PLMs. Our code is available at: [https://github.com/SungHo3268/KOMBO](https://github.com/SungHo3268/KOMBO).
Authors:Pei Xu, Yufei Ye, Shuchun Sun, Yu Ding, Elizabeth Schumann, C. Karen Liu
Abstract:
We present a data‑driven approach for physics‑based, muscle‑driven dexterous control that enables musculoskeletal hands to perform precise piano playing for novel pieces of music outside the reference dataset. Our approach combines high‑frequency muscle‑level control with low‑frequency latent‑space coordination in a hierarchical architecture. At the low level, general single‑hand policies are trained via reinforcement learning to generate dynamic muscle‑tendon activations while tracking trajectories from a large reference motion dataset. The resulting tracking policies are then distilled into variational autoencoder (VAE) models, yielding smooth and structured latent spaces that abstract away low‑level muscle dynamics. For the high level, we train piece‑specific policies to operate in this latent space, coordinating bimanual motions based on specific goals, denoted by note events extracted from given musical scores, to synthesize performances beyond the reference data. In addition, we present an enhanced musculoskeletal hand model that supports fine control of fingers for accurate low‑level motion tracking and diverse high‑level motion synthesis. We evaluate the control pipeline of our approach on a diverse piano repertoire spanning multiple musical styles and technical demands. Results demonstrate that our approach can synthesize coordinated bimanual motions with accurate key presses, and achieve the state‑of‑the‑art performance of piano playing in physics‑based dexterous control. We also show that our musculoskeletal hand model demonstrates superior biomechanical stability and tracking precision compared to the existing model, and validate that our musculoskeletal hand model and muscle‑driven controller can generate physiologically plausible activation patterns that align with human electromyography (EMG) recordings.
Authors:Nicola Zanarini, Niccolò Ferrari
Abstract:
We investigate whether the Feed‑Forward Network (FFN) sublayer in a decoder‑only transformer can be replaced by an explicit learned memory graph while preserving the surrounding autoregressive architecture. The proposed Graph Memory Transformer (GMT) keeps causal self‑attention intact, but replaces the usual per‑token FFN transformation with a memory cell that routes token representations over a learned bank of centroids connected by a learned directed transition matrix. In the base GMT v7 instantiation studied here, each of 16 transformer blocks contains 128 centroids, a 128 128 edge matrix, gravitational source routing, token‑conditioned target selection, and a gated displacement readout. The cell therefore returns movement from an estimated source memory state toward a target memory state, rather than a retrieved value. The resulting model is a fully decoder‑only language model with 82.2M trainable parameters and no dense FFN sublayers, compared with a 103.0M‑parameter dense GPT‑style baseline used in the evaluation. The base v7 model trains stably and exposes centroid usage, transition structure, and source‑to‑target movement as directly inspectable quantities of the forward computation. It remains behind the larger dense baseline in validation loss and perplexity (3.5995/36.58 vs. 3.2903/26.85), while showing close zero‑shot benchmark behavior under the evaluated setting. These results are not intended as a state‑of‑the‑art claim; they support the viability and structural interpretability of replacing dense within‑token transformation with graph‑mediated memory navigation. Broader scaling, optimized kernels, and more extensive benchmark evaluation are left for subsequent work.
Authors:Oleg Baryshnikov, Anton M. Alekseev, Sergey I. Nikolenko
Abstract:
Software documentation frequently becomes outdated or fails to exist entirely, yet developers need focused views of their codebase to understand complex systems. While automated reverse engineering tools can generate UML diagrams from code, they produce overwhelming detail without considering developer intent. We introduce query‑driven UML diagram generation, where LLMs create diagrams that directly answer natural language questions about code. Unlike existing methods, our approach produces semantically focused diagrams containing only relevant elements with contextual descriptions. We fine‑tune Qwen2.5‑Coder‑14B on a curated dataset of code files, developer queries, and corresponding diagram representations in a structured JSON format, evaluating with both automatic detection of structural defects and human assessment of semantic relevance. Results demonstrate that fine‑tuning on a modest amount of manually corrected data yields dramatic improvements: our best model achieves the highest F1 scores while reducing defect rates below state‑of‑the‑art LLMs, generating diagrams that are both structurally sound and semantically faithful to developer queries. Thus, we establish the feasibility of using LLMs for scalable contextual, on‑demand documentation generation. We make our code and dataset publicly available at https://github.com/i‑need‑a‑pencil/query2diagram.
Authors:Wentao Zhang, Qi Zhang, Mingkun Xu, Mu You, Henghua Shen, Zhongzhi He, Keyan Jin, Derek F. Wong, Tao Fang
Abstract:
Crop disease diagnosis from field photographs faces two recurring problems: models that score well on benchmarks frequently hallucinate species names, and when predictions are correct, the reasoning behind them is typically inaccessible to the practitioner. This paper describes Agri‑CPJ (Caption‑Prompt‑Judge), a training‑free few‑shot framework in which a large vision‑language model first generates a structured morphological caption, iteratively refined through multi‑dimensional quality gating, before any diagnostic question is answered. Two candidate responses are then generated from complementary viewpoints, and an LLM judge selects the stronger one based on domain‑specific criteria. Caption refinement is the component with the largest individual impact: ablations confirm that skipping it consistently degrades downstream accuracy across both models tested. On CDDMBench, pairing GPT‑5‑Nano with GPT‑5‑mini‑generated captions yields +22.7 pp in disease classification and +19.5 points in QA score over no‑caption baselines. Evaluated without modification on AgMMU‑MCQs, GPT‑5‑Nano reached 77.84% and Qwen‑VL‑Chat reached 64.54%, placing them at or above most open‑source models of comparable scale despite the format shift from open‑ended to multiple‑choice. The structured caption and judge rationale together constitute a readable audit trail: a practitioner who disagrees with a diagnosis can identify the specific caption observation that was incorrect. Code and data are publicly available https://github.com/CPJ‑Agricultural/CPJ‑Agricultural‑Diagnosis
Authors:Pritesh Jha
Abstract:
Intelligent document processing pipelines extract structured entities (tables, images, and text) from documents for use in downstream systems such as knowledge bases, retrieval‑augmented generation, and analytics. A persistent limitation of existing pipelines is that extraction output is produced without any intrinsic mechanism to verify whether it faithfully represents the source. Model‑internal confidence scores measure inference certainty, not correspondence to the document, and extraction errors pass silently into downstream consumers. We present Reconstruction as Validation (RaV‑IDP), a document processing pipeline that introduces reconstruction as a first‑class architectural component. After each entity is extracted, a dedicated reconstructor renders the extracted representation back into a form comparable to the original document region, and a comparator scores fidelity between the reconstruction and the unmodified source crop. This fidelity score is a grounded, label‑free quality signal. When fidelity falls below a per‑entity‑type threshold, a structured GPT‑4.1 vision fallback is triggered and the validation loop repeats. We enforce a bootstrap constraint: the comparator always anchors against the original document region, never against the extraction, preventing the validation from becoming circular. We further propose a per‑stage evaluation framework pairing each pipeline component with an appropriate benchmark. The code pipeline is publicly available at https://github.com/pritesh‑2711/RaV‑IDP for experimentation and use.
Authors:Zichuan Fu, Xian Wu, Guojing Li, Yejing Wang, Yijun Chen, Zihao Zhao, Yixuan Luo, Hanyu Yan, Yefeng Zheng, Xiangyu Zhao
Abstract:
Recent advancements in large language models (LLMs) have catalyzed the rise of reasoning‑intensive inference paradigms, where models perform explicit step‑by‑step reasoning before generating final answers. While such approaches improve answer quality and interpretability, they incur substantial computational overhead due to the prolonged generation sequences. In this paper, we propose Tandem, a novel collaborative framework that synergizes large and small language models (LLMs and SLMs) to achieve high‑quality reasoning with significantly reduced computational cost. Specifically, the LLM serves as a strategic coordinator, efficiently generating a compact set of critical reasoning insights. These insights are then used to guide a smaller, more efficient SLM in executing the full reasoning process and delivering the final response. To balance efficiency and reliability, Tandem introduces a cost‑aware termination mechanism that adaptively determines when sufficient reasoning guidance has been accumulated, enabling early stopping of the LLM's generation. Experiments on mathematical reasoning and code generation benchmarks demonstrate that Tandem reduces computational costs by approximately 40% compared to standalone LLM reasoning, while achieving superior or competitive performance. Furthermore, the sufficiency classifier trained on one domain transfers effectively to others without retraining. The code is available at: https://github.com/Applied‑Machine‑Learning‑Lab/ACL2026_Tandem.
Authors:Safayat Bin Hakim, Aniqa Afzal, Qi Zhao, Vigna Majmundar, Pawel Sloboda, Houbing Herbert Song
Abstract:
Privacy‑critical domains require phishing detection systems that satisfy contradictory constraints: near‑zero false positives to prevent workflow disruption, transparent explanations for non‑expert staff, strict regulatory compliance prohibiting sensitive data exposure to external APIs, and robustness against AI‑generated attacks. Existing rule‑based systems are brittle to novel campaigns, while LLM‑based detectors violate privacy regulations through unredacted data transmission. We introduce CyberCane, a neuro‑symbolic framework integrating deterministic symbolic analysis with privacy‑preserving retrieval‑augmented generation (RAG). Our dual‑phase pipeline applies lightweight symbolic rules to email metadata, then escalates borderline cases to semantic classification via RAG with automated sensitive data redaction and retrieval from a phishing‑only corpus. We further introduce PhishOnt, an OWL ontology enabling verifiable attack classification through formal reasoning chains. Evaluation on DataPhish2025 (12.3k emails; mixed human/LLM) and Nazario/SpamAssassin demonstrates a 78.6‑point recall gain over symbolic‑only detection on AI‑generated threats, with precision exceeding 98% and FPR as low as 0.16%. Healthcare deployment projects a 542x ROI; tunable operating points support diverse risk tolerances, with open‑source implementation at https://github.com/sbhakim/Cybercane.
Authors:Imranul Ashrafi, Inigo Jauregi Unanue, Massimo Piccardi
Abstract:
Test‑time alignment methods offer a promising alternative to fine‑tuning by steering the outputs of large language models (LLMs) at inference time with lightweight interventions on their internal representations. Recently, a prominent and effective approach, RE‑Control (Kong et al., 2024), has proposed leveraging an external value function trained over the LLM's hidden states to guide generation via gradient‑based editing. While effective, this method overlooks a key characteristic of alignment tasks, i.e. that they are typically formulated as learning from human preferences between candidate responses. To address this, in this paper we propose a novel preference‑based training framework, Pref‑CTRL, that uses a multi‑objective value function to better reflect the structure of preference data. Our approach has outperformed RE‑Control on two benchmark datasets and showed greater generalization on out‑of‑domain datasets. Our source code is available at https://github.com/UTS‑nlPUG/pref‑ctrl.
Authors:Haoxuan Zhang, Ruochi Li, Yang Zhang, Zhenni Liang, Junhua Ding, Ting Xiao, Haihua Chen
Abstract:
The rapid proliferation of Generative AI necessitates rigorous documentation standards for transparency and governance. However, manual creation of Model and Data Cards is not scalable, while automated approaches lack large‑scale, high‑fidelity benchmarks for systematic evaluation. We introduce MetaGAI, a comprehensive benchmark comprising 2,541 verified document triplets constructed through semantic triangulation of academic papers, GitHub repositories, and Hugging Face artifacts. Unlike prior single‑source datasets, MetaGAI employs a multi‑agent framework with specialized Retriever, Generator, and Editor agents, validated through four‑dimensional human‑in‑the‑loop assessment, including human evaluation of editor‑refined ground truth. We establish a robust evaluation protocol combining automated metrics with validated LLM‑as‑a‑Judge frameworks. Extensive analysis reveals that sparse Mixture‑of‑Experts architectures achieve superior cost‑quality efficiency, while a fundamental trade‑off exists between faithfulness and completeness. MetaGAI provides a foundational testbed for benchmarking, training, and analyzing automated Model and Data Card generation methods at scale. Our data and code are available at: https://github.com/haoxuan‑unt2024/MetaGAI‑Benchmark.
Authors:Yiqun Zhang, Hao Li, Zihan Wang, Shi Feng, Xiaocui Yang, Daling Wang, Bo Zhang, Lei Bai, Shuyue Hu
Abstract:
Multi‑turn, long‑horizon tasks are increasingly common for large language models (LLMs), but solving them typically requires many sequential model invocations, accumulating substantial inference costs. Here, we study cost‑aware multi‑turn LLM routing: selecting which model to invoke at each turn from a model pool, given a fixed cost budget. We propose MTRouter, which encodes the interaction history and candidate models into joint history‑model embeddings, and learns an outcome estimator from logged trajectories to predict turn‑level model utility. Experiments show that MTRouter improves the performance‑cost trade‑off: on ScienceWorld, it surpasses GPT‑5 while reducing total cost by 58.7%; on Humanity's Last Exam (HLE), it achieves competitive accuracy while reducing total cost by 43.4% relative to GPT‑5, and these gains even carry over to held‑out tasks. Further analyses reveal several mechanisms underlying its effectiveness: relative to prior multi‑turn routers, MTRouter makes fewer model switches, is more tolerant to transient errors, and exhibits emergent specialization across models. Code: https://github.com/ZhangYiqun018/MTRouter
Authors:Chathurangi Shyalika, Dhaval Patel, Amit Sheth
Abstract:
Industrial maintenance environments increasingly rely on AI systems to assist operators in understanding asset behavior, diagnosing failures, and evaluating interventions. Although large language models (LLMs) enable fluent natural‑language interaction, deployed maintenance assistants routinely produce generic explanations that are weakly grounded in telemetry, omit verifiable provenance, and offer no testable support for counterfactual or action‑oriented reasoning that undermine trust in safety‑critical settings. We present IndustryAssetEQA, a neurosymbolic operational intelligence system that combines episodic telemetry representations with a Failure Mode Effects Analysis Knowledge Graph (FMEA‑KG) to enable Embodied Question Answering (EQA) over industrial assets. We evaluate on four datasets covering four industrial asset types, including rotating machinery, turbofan engines, hydraulic systems, and cyber‑physical production systems. Compared to LLM‑only baselines, IndustryAssetEQA improves structural validity by up to 0.51, counterfactual accuracy by up to 0.47, and explanation entailment by 0.64, while reducing severe expert‑rated overclaims from 28% to 2% (approximately 93% reduction). Code, datasets, and the FMEA‑KG are available at https://github.com/IBM/AssetOpsBench/tree/IndustryAssetEQA/IndustryAssetEQA.
Authors:Soulayma Gazzeh, Giuseppe Mazzola, Liliana Lo Presti, Marco La Cascia
Abstract:
Reliable depth estimation from spherical images is crucial for 360° vision in robotic navigation and immersive scene understanding. However, the onboard spherical camera can experience unintentional pose variations in real‑world robotic platforms that, along with the geometric distortions inherent in equirectangular projections, significantly impact the effectiveness of depth estimation. To study this issue, a novel public benchmark, called Sphere‑Depth, is introduced to systematically evaluate the robustness of monocular depth estimation models from equirectangular images in a reproducible way. Camera pose perturbations are simulated and used to assess the performance of a popular perspective‑based model, Depth Anything, and of spherical‑aware models such as Depth Anywhere, ACDNet, Bifuse++, and SliceNet. Furthermore, to ensure meaningful evaluation across models, a depth calibration‑based error protocol is proposed to convert predicted relative depth values into metric depth values using supervised learned scaling factors for each model. Experiments show that even models explicitly designed to process spherical images exhibit substantial performance degradation when variations in the camera pose are observed with respect to the canonical pose. The full benchmark, evaluation protocol, and dataset splits are made publicly available at: https://github.com/sgazzeh/Sphere_depth
Authors:Shengzhi Li, Jiarun Chen, Karun Sharma, Jiaqi Su, Shichao Pei
Abstract:
Large vision‑language models (VLMs) can recognize what happens in video but fail to count how many times. We introduce PushupBench, 446 long‑form clips (avg. 36.7s) for evaluating repetition counting. The best frontier model achieves 42.1% exact accuracy; open‑source 4B models score ~6%, matching supervised baselines. We show that accuracy alone misleads ‑‑ weaker models exploit the modal count rather than reason temporally. Fine‑tuning on counting with 1k samples transfers to general video understanding: MVBench (+2.15), PerceptionTest (+1.88), TVBench (+4.54), suggesting counting is a proxy for broader temporal reasoning.PushupBench incorporated in \textttlmms‑eval (https://github.com/EvolvingLMMs‑Lab/lmms‑eval/pull/1262) and hosted on (pushupbench.com/)
Authors:Zi Meng, Wanli Song, Yi Hu, Jiayuan Rao, Gang Chen
Abstract:
Refereeing is vital in sports, where fair, accurate, and explainable decisions are fundamental. While intelligent assistant technologies are being widely adopted in soccer refereeing, current AI‑assisted approaches remain preliminary. Existing research mostly focuses on isolated video perception tasks and lacks the ability to understand and reason about foul scenarios. To fill this gap, we propose SoccerRef‑Agents, a holistic and explainable multi‑agent decision‑making framework for soccer refereeing. The main contributions are: (i) constructing the multimodal benchmark SoccerRefBench with over 1,200 referee theory questions and 600 foul video clips; (ii) building a vector‑based knowledge base RefKnowledgeDB using the latest "Laws of the Game" and a classic case database for precise, knowledge‑driven reasoning; (iii) designing a novel multi‑agent architecture that collaborates via cross‑modal RAG to bridge the semantic gap between visual content and regulatory texts. This work explores the technical capability of integrating MLLMs with refereeing expertise, and evaluations show our system significantly outperforms general‑purpose MLLMs in decision accuracy and explanation quality. All databases, benchmarks, and code will be made available.
Authors:Jelena Ilić Vulićević
Abstract:
Large language models (LLMs) have demonstrated strong performance on a wide range of software engineering tasks, including code generation and analysis. However, most prior work relies on cloud‑based models or specialized hardware, limiting practical applicability in privacy‑sensitive or resource‑constrained environments. In this paper, we present a systematic empirical evaluation of two locally deployed LLMs, LLaMA 3.2 and Mistral, for real‑world Python bug detection using the BugsInPy benchmark. We evaluate 349 bugs across 17 projects using a zero‑shot prompting approach at the function level and an automated keyword‑based evaluation framework. Our results show that locally executed models achieve accuracy between 43% and 45%, while producing a large proportion of partially correct responses that identify problematic code regions without pinpointing the exact fix. Performance varies significantly across projects, highlighting the importance of codebase characteristics. The results demonstrate that local models can identify a meaningful share of bugs, though precise localization remains difficult for locally executed LLMs, particularly when handling complex and context dependent bugs in realistic development scenarios.
Authors:Jincheng Lou, Ruohan Xu, Jiecheng Ma, Runzhe Tao, Xinyu Qu, Yibo Lin
Abstract:
Existing LLM‑based EDA agents are often isolated task‑specific systems. This leads to repeated engineering effort and limited reuse of successful design and debugging strategies. We present LEGO, a unified skill‑based platform for front‑end design generation. It decomposes the digital front‑end flow into six independent steps and represents every agent capability as a standardized composable circuit skill within a plug‑and‑play architecture. To build this skill library, we survey more than 100 papers, select 11 representative open‑source projects, and extract 42 executable circuit skills within a six‑step finite state machine formulation. Circuit Skill Builder automates skill extraction with linear scalability. Agent Skill RAG achieves submillisecond retrieval without relying on embedding models. Empirical evaluation on a hard subset of 41 VerilogEval v2 problems that gpt‑5.2‑codex fails to solve under extra‑high reasoning effort shows that individual circuit skills constructed within LEGO raise Pass@1 from 0.000 to 0.805. This is an 80.5% gain over the baseline. Cross‑project skill compositions also reach 0.805 Pass@1. They outperform hierarchy‑verilog by 14.6% and VerilogCoder by 2.5%. They also match MAGE. These results show that modular skill composition supports both effective and flexible RTL design automation. The LEGO platform and all circuit skills are publicly available at GitHub: https://github.com/loujc/LEGO‑An‑LLM‑Skill‑Based‑Front‑End‑Design‑Generation‑Platform
Authors:He Hu, Tengjin Weng, Zebang Cheng, Yu Wang, Jiachen Luo, Björn Schuller, Zheng Lian, Laizhong Cui
Abstract:
Recent multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and generation, and are increasingly used in applications such as social robots and human‑computer interaction, where understanding human emotions is essential. However, existing benchmarks mainly formulate emotion understanding as a static recognition problem, leaving it largely unclear whether current MLLMs can understand emotion as a dynamic process that evolves, shifts between states, and unfolds across diverse social contexts. To bridge this gap, we present EmoTrans, a benchmark for evaluating emotion dynamics understanding in multimodal videos. EmoTrans contains 1,000 carefully collected and manually annotated video clips, covering 12 real‑world scenarios, and further provides over 3,000 task‑specific question‑answer (QA) pairs for fine‑grained evaluation. The benchmark introduces four tasks, namely Emotion Change Detection (ECD), Emotion State Identification (ESI), Emotion Transition Reasoning (ETR), and Next Emotion Prediction (NEP), forming a progressive evaluation framework from coarse‑grained detection to deeper reasoning and prediction. We conduct a comprehensive evaluation of 18 state‑of‑the‑art MLLMs on EmoTrans and obtain two main findings. First, although current MLLMs show relatively stronger performance on coarse‑grained emotion change detection, they still struggle with fine‑grained emotion dynamics modeling. Second, socially complex settings, especially multi‑person scenarios, remain substantially challenging, while reasoning‑oriented variants do not consistently yield clear improvements. To facilitate future research, we publicly release the benchmark, evaluation protocol, and code at https://github.com/Emo‑gml/EmoTrans.
Authors:Thibaud Southiratn, Bonil Koo, Yijingxiu Lu, Sun Kim
Abstract:
Dual‑target molecule generation, which focuses on discovering compounds capable of interacting with two target proteins, has garnered significant attention due to its potential for improving therapeutic efficiency, safety and resistance mitigation. Existing approaches face two critical challenges. First, by simplifying the complex dual‑target optimization problem to scalarized combinations of individual objectives, they fail to capture important trade‑offs between target engagement and molecular properties. Second, they typically do not integrate synthetic planning into the generative process. This highlights a need for more appropriate objective function design and synthesis‑aware methodologies tailored to the dual‑target molecule generation task. In this work, we propose CombiMOTS, a Pareto Monte Carlo Tree Search (PMCTS) framework that generates dual‑target molecules. CombiMOTS is designed to explore a synthesizable fragment space while employing vectorized optimization constraints to encapsulate target affinity and physicochemical properties. Extensive experiments on real‑world databases demonstrate that CombiMOTS produces novel dual‑target molecules with high docking scores, enhanced diversity, and balanced pharmacological characteristics, showcasing its potential as a powerful tool for dual‑target drug discovery. The code and data is accessible through https://github.com/Tibogoss/CombiMOTS.
Authors:Bingfeng Chen, Chenjie Qiu, Yifeng Xie, Boyan Xu, Ruichu Cai, Zhifeng Hao
Abstract:
Aspect Sentiment Quad Prediction (ASQP) has seen significant advancements, largely driven by the powerful semantic understanding and generative capabilities of large language models (LLMs). However, while syntactic structure information has been proven effective in previous extractive paradigms, it remains underutilized in the generative paradigm of LLMs due to their limited reasoning capabilities. In this paper, we propose S^2IT, a novel Stepwise Syntax Integration Tuning framework that progressively integrates syntactic structure knowledge into LLMs through a multi‑step tuning process. The training process is divided into three steps. S^2IT decomposes the quadruple generation task into two stages: 1) Global Syntax‑guided Extraction and 2) Local Syntax‑guided Classification, integrating both global and local syntactic structure information. Finally, Fine‑grained Structural Tuning enhances the model's understanding of syntactic structures through the prediction of element links and node classification. Experiments demonstrate that S^2IT significantly improves state‑of‑the‑art performance across multiple datasets. Our implementation will be open‑sourced at https://github.com/DMIRLAB‑Group/S2IT.
Authors:Varun Totakura, Ankita Singh, Yushun Dong, Shayok Chakraborty
Abstract:
Active learning algorithms automatically identify the most informative samples from large amounts of unlabeled data and tremendously reduce human annotation effort in inducing a machine learning model. In a conventional active learning setup, the labeling oracles are assumed to be infallible, that is, they always provide correct answers (in terms of class labels) to the queried unlabeled instances, which cannot be guaranteed in real‑world applications. To this end, a body of research has focused on the development of active learning algorithms in the presence of imperfect / noisy oracles. Existing research on active learning with noisy oracles typically simulate the oracles using machine learning models; however, real‑world situations are much more challenging, and using ML models to simulate the annotation patterns may not appropriately capture the nuances of real‑world annotation challenges. In this research, we first collect annotations of text samples (from 3 benchmark text classification datasets) from crowd‑sourced workers through a crowd‑sourcing platform. We then conduct extensive empirical studies of 8 commonly used active learning techniques (in conjunction with deep neural networks) using the obtained annotations. Our analyses sheds light on the performance of these techniques under real‑world challenges, where annotators can provide incorrect labels, and can also refuse to provide labels. We hope this research will provide valuable insights that will be useful for the deployment of deep active learning systems in real‑world applications. The obtained annotations can be accessed at https://github.com/varuntotakura/al_rcta/.
Authors:Xudong Jiang, Mingshan Loo, Hanchen Yang, Wengen Li, Mingrui Zhang, Yichao Zhang, Jihong Guan, Shuigeng Zhou
Abstract:
Accurate long‑term time series forecasting (LTSF) requires the capture of complex long‑range dependencies and dynamic periodic patterns. Recent advances in frequency‑domain analysis offer a global perspective for uncovering temporal characteristics. However, real‑world time series often exhibit pronounced cross‑domain heterogeneity where variables that appear synchronized in the time domain can differ substantially in the frequency domain. Existing frequency‑based LTSF methods often rely on implicit assumptions of cross‑domain homogeneity, which limits their ability to adapt to such intricate variability. To effectively integrate frequency‑domain analysis with temporal dependency learning, we propose AdaMamba, a novel framework that endogenizes adaptive and context‑aware frequency analysis within the Mamba state‑space update process. Specifically, AdaMamba introduces an interactive patch encoding module to capture inter‑variable interaction dynamics. Then, we develop an adaptive frequency‑gated state‑space module that generates input‑dependent frequency bases, and generalizes the conventional temporal forgetting gate into a unified time‑frequency forgetting gate. This allows dynamic calibration of state transitions based on learned frequency‑domain importance, while preserving Mamba's capability in modeling long‑range dependencies. Extensive experiments on seven public LTSF benchmarks and two domain‑specific datasets demonstrate that AdaMamba consistently outperforms state‑of‑the‑art methods in forecasting accu racy while maintaining competitive computational efficiency. The code of AdaMamba is available at https://github.com/XDjiang25/AdaMamba.
Authors:Haoran Tan, Zeyu Zhang, Chen Ma, Tianze Liu, Quanyu Dai, Xu Chen
Abstract:
Large language model‑based agents have recently emerged as powerful approaches for solving dynamic and multi‑step tasks. Most existing agents employ planning mechanisms to guide long‑term actions in dynamic environments. However, current planning approaches face a fundamental limitation that they operate at a fixed granularity level. Specifically, they either provide excessive detail for simple tasks or insufficient detail for complex ones, failing to achieve an optimal balance between simplicity and complexity. Drawing inspiration from the principle of progressive refinement in cognitive science, we propose AdaPlan‑H, a self‑adaptive hierarchical planning mechanism that mimics human planning strategies. Our method initiates with a coarse‑grained macro plan and progressively refines it based on task complexity. It generates self‑adaptive hierarchical plans tailored to the varying difficulty levels of different tasks, which can be optimized by imitation learning and capability enhancement. Experimental results demonstrate that our method significantly improves task execution success rates while mitigating overplanning at the planning level, providing a flexible and efficient solution for multi‑step complex decision‑making tasks. To contribute to the community, our code and data will be made publicly available at https://github.com/import‑myself/AHP.
Authors:Sadman Kabir Soumik
Abstract:
LLM‑as‑a‑Judge has become the dominant paradigm for evaluating language model outputs, yet LLM judges exhibit systematic biases that compromise evaluation reliability. We present a comprehensive empirical study comparing nine debiasing strategies across five judge models from four provider families (Google, Anthropic, OpenAI, Meta), three benchmarks (MT‑Bench n=400, LLMBar n=200, custom n=225), and four bias types. Our key findings: (1) Style bias is the dominant bias (0.76‑0.92 across all models), far exceeding position bias (<= 0.04), yet has received minimal research attention. (2) All models show a conciseness preference on expansion pairs, but truncation controls confirm they correctly distinguish quality from length (0.92‑1.00 accuracy), suggesting quality‑sensitive evaluation rather than a simple length bias. (3) Debiasing is beneficial but model‑dependent: the combined budget strategy significantly improves Claude Sonnet 4 by +11.2 pp (p < 0.0001), with directionally positive trends for other models. Only 2 of 20 non‑baseline configurations show decreased agreement. We release our evaluation framework, controlled dataset, and all experimental artifacts at https://github.com/sksoumik/llm‑as‑judge.
Authors:Zhicheng Ma, Xiang Liu, Zhaoxiang Liu, Ning Wang, Yi Shen, Kai Wang, Shuming Shi, Shiguo Lian
Abstract:
Large Language Models (LLMs) based on Mixture‑of‑Experts (MoE) are pivotal in industrial applications for their ability to scale performance efficiently. However, standard MoEs enforce uniform expert sizes,creating a rigidity that fails to align computational costs with varying token‑level complexity. While heterogeneous expert architectures attempt to address this by diversifying expert sizes, they often suffer from significant system‑level challenges, specifically unbalanced GPU utilization and inefficient parameter utilization, which hinder practical deployment. To bridge the gap between theoretical heterogeneity and robust industrial application, we propose Mixture of Heterogeneous Grouped Experts (MoHGE) which introduces a two‑level routing mechanism to enable flexible, resource‑aware expert combinations. To optimize inference efficiency, we propose a Group‑Wise Auxiliary Loss, which dynamically steers tokens to the most parameter‑efficient expert groups based on task difficulty. To address the critical deployment challenge of GPU load balancing, we introduce an All‑size Group‑decoupling Allocation strategy coupled with an Intra‑Group Experts Auxiliary Loss. These mechanisms collectively ensure uniform computation distribution across GPUs. Extensive evaluations demonstrate that MoHGE matches the performance of MoE architectures while reducing the total parameters by approximately 20% and maintaining balanced GPU utilization. Our work establishes a scalable paradigm for resource‑efficient MoE design, offering a practical solution for optimizing inference costs in real‑world scenarios. The code is publicly available at https://github.com/UnicomAI/MoHGE.
Authors:Yizheng Huang, Wenjun Zeng, Aditi Kumaresan, Zi Wang
Abstract:
Evaluating generative AI models is increasingly resource‑intensive due to slow inference, expensive raters, and a rapidly growing landscape of models and benchmarks. We propose ProEval, a proactive evaluation framework that leverages transfer learning to efficiently estimate performance and identify failure cases. ProEval employs pre‑trained Gaussian Processes (GPs) as surrogates for the performance score function, mapping model inputs to metrics such as the severity of errors or safety violations. By framing performance estimation as Bayesian quadrature (BQ) and failure discovery as superlevel set sampling, we develop uncertainty‑aware decision strategies that actively select or synthesize highly informative inputs for testing. Theoretically, we prove that our pre‑trained GP‑based BQ estimator is unbiased and bounded. Empirically, extensive experiments on reasoning, safety alignment, and classification benchmarks demonstrate that ProEval is significantly more efficient than competitive baselines. It requires 8‑65x fewer samples to achieve estimates within 1% of the ground truth, while simultaneously revealing more diverse failure cases under a stricter evaluation budget.
Authors:Samer Attrah
Abstract:
We present Code Broker, a multi agent system built with Google Agent Development Kit ADK that analyses Python code from files, local directories, or GitHub repositories and generates actionable quality assessment reports. The system employs a hierarchical five agents architecture in which a root orchestrator coordinates a sequential pipeline agent, which in turn dispatches three specialised agents in parallel a Correctness Assessor, a Style Assessor, and a Description Generator before synthesising findings through an Improvement Recommender. Reports score four dimensions correctness, security, style, and maintainability and are rendered in both Markdown and HTML. Code Broker combines LLM based reasoning with deterministic static‑analysis signals from Pylint, uses asynchronous execution with retry logic to improve robustness, and explores lightweight session memory for retaining and querying prior assessment context. We position the paper as a technical report on system design and prompt or tool orchestration, and present a preliminary qualitative evaluation on representative Python codebases. The results suggest that parallel specialised agents produce readable, developer oriented feedback, while also highlighting current limitations in evaluation depth, security tooling, large repository handling, and the current use of only in memory persistence. All code and reproducibility materials are available at: https://github.com/Samir‑atra/agents_intensive_dev.
Authors:Rui Gao, Youngseung Jeon, Swastik Roy, Morteza Ziyadi, Xiang 'Anthony' Chen
Abstract:
Large language models (LLMs) show promise for molecular optimization, but aligning them with selective and competing drug‑design constraints remains challenging. We propose C‑Moral, a reinforcement learning post‑training framework for controllable multi‑objective molecular optimization. C‑Moral combines group‑based relative optimization, property score alignment for heterogeneous objectives, and continuous non‑linear reward aggregation to improve stability across competing properties. Experiments on the C‑MuMOInstruct benchmark show that C‑Moral consistently outperforms state‑of‑the‑art models across both in‑domain and out‑of‑domain settings, achieving the best Success Optimized Rate (SOR) of 48.9% on IND tasks and 39.5% on OOD tasks, while largely preserving scaffold similarity. These results suggest that RL post‑training is an effective way to align molecular language models with continuous molecular design objectives. Our code and models are publicly available at https://github.com/Rwigie/C‑MORAL.
Authors:Zixuan Xia, Quanxi Li
Abstract:
We propose a simple yet effective alternative to reward normalization in policy gradient reinforcement learning by integrating a 1D Kalman filter for online reward estimation. Instead of relying on fixed heuristics, our method recursively estimates the latent reward mean, smoothing high‑variance returns and adapting to non‑stationary environments. This approach incurs minimal overhead and requires no modification to existing policy architectures. Experiments on LunarLander and CartPole demonstrate that Kalman‑filtered rewards significantly accelerate convergence and reduce training variance compared to standard normalization techniques. Code is available at https://github.com/Sumxiaa/Kalman_Normalization.
Authors:Jordan Meadows, Lan Zhang, Andre Freitas
Abstract:
Formalising informal mathematical reasoning into formally verifiable code is a significant challenge for large language models. In scientific fields such as physics, domain‑specific machinery (e.g. Dirac notation, vector calculus) imposes additional formalisation challenges that modern LLMs and agentic approaches have yet to tackle. To aid autoformalisation in scientific domains, we present FormalScience; a domain‑agnostic human‑in‑the‑loop agentic pipeline that enables a single domain expert (without deep formal language experience) to produce syntactically correct and semantically aligned formal proofs of informal reasoning for low economic cost. Applying FormalScience to physics, we construct FormalPhysics, a dataset of 200 university‑level (LaTeX) physics problems and solutions (primarily quantum mechanics and electromagnetism), along with their Lean4 formal representations. Compared to existing formal math benchmarks, FormalPhysics achieves perfect formal validity and exhibits greater statement complexity. We evaluate open‑source models and proprietary systems on a statement autoformalisation task on our dataset via zero‑shot prompting, self‑refinement with error feedback, and a novel multi‑stage agentic approach, and explore autoformalisation limitations in modern LLM‑based approaches. We provide the first systematic characterisation of semantic drift in physics autoformalisation in terms of concepts such as notational collapse and abstraction elevation which reveals what formal language verifies when full semantic preservation is unattainable. We release the codebase together with an interactive UI‑based FormalScience system which facilitates autoformalisation and theorem proving in scientific domains beyond physics.https://github.com/jmeadows17/formal‑science
Authors:Ashwin Kumar, Robbie Holland, Corey Barrett, Jangwon Kim, Maya Varma, Zhihong Chen, Yunhe Gao, Greg Zaharchuk, Tara Taghavi, Krishnaram Kenthapadi, Akshay Chaudhari
Abstract:
Recent medical multimodal foundation models are built as multimodal LLMs (MLLMs) by connecting a CLIP‑pretrained vision encoder to an LLM using LLaVA‑style finetuning. This two‑stage, decoupled approach introduces a projection layer that can distort visual features. This is especially concerning in medical imaging where subtle cues are essential for accurate diagnoses. In contrast, early‑fusion generative approaches such as Chameleon eliminate the projection bottleneck by processing image and text tokens within a single unified sequence, enabling joint representation learning that leverages the inductive priors of language models. We present CheXmix, a unified early‑fusion generative model trained on a large corpus of chest X‑rays paired with radiology reports. We expand on Chameleon's autoregressive framework by introducing a two‑stage multimodal generative pretraining strategy that combines the representational strengths of masked autoencoders with MLLMs. The resulting models are highly flexible, supporting both discriminative and generative tasks at both coarse and fine‑grained scales. Our approach outperforms well‑established generative models across all masking ratios by 6.0% and surpasses CheXagent by 8.6% on AUROC at high image masking ratios on the CheXpert classification task. We further inpaint images over 51.0% better than text‑only generative models and outperform CheXagent by 45% on the GREEN metric for radiology report generation. These results demonstrate that CheXmix captures fine‑grained information across a broad spectrum of chest X‑ray tasks. Our code is at: https://github.com/StanfordMIMI/CheXmix.
Authors:Nikoo Moradi, Gijs Luijten, Behrus Hinrichs-Puladi, Jens Kleesiek, Victor Alves, Jan Egger, André Ferreira
Abstract:
Diffusion models produce high‑quality synthetic data but suffer from slow inference. We propose 3D Variable‑Step Denoising Diffusion Probabilistic Model (VS‑DDPM) a framework engineered to maintain generative quality while accelerating inference by several factors. We tested our approach on four tasks (missing MRI, tumor removal, MRI‑to‑sCT, and CBCT‑to‑sCT) within the BraTS2025 and SynthRAD2025 challenges. Designed for high efficiency under hardware and time constrains imposed by both challenges. VS‑DDPM achieved state‑of‑the‑art (SOTA) performance in missing MRI synthesis, yielding Dice scores of 0.80, 0.83, and 0.88 for the enhancing tumor, tumor core, and whole tumor regions, respectively, alongside a structural similarity index (SSIM) of 0.95. For MRI tumor removal, the model attained a root mean squared error (RMSE) of 0.053, a peak signal‑to‑noise ratio (PSNR) of 26.77, and an SSIM of 0.918. While the framework demonstrated competitive performance in MRI‑to‑sCT and CBCT‑to‑sCT tasks, it did not reach SOTA benchmarks, potentially due to sensitivities in data pre and post‑processing pipelines or specific loss function configurations. These results demonstrate that VS‑DDPM provides a robust and tunable solution for high‑fidelity 3D medical image synthesis. The code is available in https://github.com/andre‑fs‑ferreira/SynthRAD_by_Faking_it.
Authors:Hefeng Zhou, Xuan Liu, Sicheng Chen, Wutong Zhang, Wu Yan, Jiong Lou, Chentao Wu, Guangtao Xue, Wei Zhao, Jie Li
Abstract:
Federated cross‑modal retrieval faces severe challenges from heterogeneous client data, particularly non‑IID semantic distributions and missing modalities. Under such heterogeneity, a single global model is often insufficient to capture both shared cross‑modal knowledge and client‑specific characteristics. We propose RCSR, a personalization‑friendly federated framework that integrates prototype anchoring, retrieval‑centric semantic routing, and optional client‑specific adapters. Built on a frozen CLIP backbone, RCSR leverages lightweight shared adapters for global knowledge transfer while supporting efficient local personalization. Prototype anchoring helps unimodal clients align with global cross‑modal semantics, and a server‑side semantic router adaptively assigns aggregation weights based on retrieval consistency to mitigate alignment drift during heterogeneous updates. Extensive experiments on MS‑COCO, Flickr30K, and other benchmarks show that RCSR consistently improves global retrieval accuracy and training stability, while further enhancing client‑level retrieval performance, especially for clients with incomplete modalities. Code is available at https://github.com/RezinChow/RCSR‑Retrieval‑Centric‑Semantic‑Routing.
Authors:Fujun Han, Junan Chen, Xintong Zhu, Jingqi Ye, Xuanjie Mao, Tao Chen, Peng Ye
Abstract:
Multimodal Large Language Models (MLLMs) have shown promising potential in diverse understanding tasks, e.g., image and video analysis, math and physics olympiads. However, they remain blank and unexplored for Small Object Understanding (SOU) tasks. To fill this gap, we introduce SOUBench, the first and comprehensive benchmark for exploring the small objects understanding capability of existing MLLMs. Specifically, we first design an effective and automatic visual question‑answer generation strategy, constructing a new SOU‑VQA evaluation dataset, with 18,204 VQA pairs, six relevant sub‑tasks, and three dominant scenarios (i.e., Driving, Aerial, and Underwater). Then, we conduct a comprehensive evaluation on 15 state‑of‑the‑art MLLMs and reveal their weak capabilities in small object understanding. Furthermore, we develop SOU‑Train, a multimodal training dataset with 11,226 VQA pairs, to improve the SOU capabilities of MLLMs. Through supervising fine‑tuning of the latest MLLM, we demonstrate that SOU‑Train can effectively enhance the latest MLLM's ability to understand small objects. Comprehensive experimental results demonstrate that, the proposed SOUBench, along with the SOU‑VQA and SOU‑Train datasets, provides a crucial empirical foundation to the community for further developing models with enhanced small object understanding capabilities. Datasets and Code: https://github.com/Hanfj‑X/SOU.
Authors:Brandon Collins, Logan Bolton, Hung Huy Nguyen, Mohammad Reza Taesiri, Trung Bui, Anh Totti Nguyen
Abstract:
When answering questions about images, humans naturally point, label, and draw to explain their reasoning. In contrast, modern vision‑language models (VLMs) such as Gemini‑3‑Pro and GPT‑5 only respond with text, which can be difficult for users to verify. We present SketchVLM, a training‑free, model‑agnostic framework that enables VLMs to produce non‑destructive, editable SVG overlays on the input image to visually explain their answers. Across seven benchmarks spanning visual reasoning (maze navigation, ball‑drop trajectory prediction, and object counting) and drawing (part labeling, connecting‑the‑dots, and drawing shapes around objects), SketchVLM improves visual reasoning task accuracy by up to +28.5 percentage points and annotation quality by up to 1.48x relative to image‑editing and fine‑tuned sketching baselines, while also producing annotations that are more faithful to the model's stated answer. We find that single‑turn generation already achieves strong accuracy and annotation quality, and multi‑turn generation opens up further opportunities for human‑AI collaboration. An interactive demo and code are at https://sketchvlm.github.io/.
Authors:Zhimu Zhou, Yanpeng Zhao, Qiuyu Liao, Bo Zhao, Xiaojian Ma
Abstract:
Visual planning represents a crucial facet of human intelligence, especially in tasks that require complex spatial reasoning and navigation. Yet, in machine learning, this inherently visual problem is often tackled through a verbal‑centric lens. While recent research demonstrates the promise of fully visual approaches, they suffer from significant computational inefficiency due to the step‑by‑step planning‑by‑generation paradigm. In this work, we present EAR, an editing‑as‑reasoning paradigm that reformulates visual planning as a single‑step image transformation. To isolate intrinsic reasoning from visual recognition, we employ abstract puzzles as probing tasks and introduce AMAZE, a procedurally generated dataset that features the classical Maze and Queen problems, covering distinct, complementary forms of visual planning. The abstract nature of AMAZE also facilitates automatic evaluation of autoregressive and diffusion‑based models in terms of both pixel‑wise fidelity and logical validity. We assess leading proprietary and open‑source editing models. The results show that they all struggle in the zero‑shot setting, finetuning on basic scales enables remarkable generalization to larger in‑domain scales and out‑of‑domain scales and geometries. However, our best model that runs on high‑end hardware fails to match the zero‑shot efficiency of human solvers, highlighting a persistent gap in neural visual reasoning.
Authors:Xin Ning, Qiankun Li, Xiaolong Huang, Qiupu Chen, Feng He, Weijun Li, Prayag Tiwari, Xinwang Liu
Abstract:
With the accumulation of resources in the era of big data and the rise of pre‑trained models in deep learning, optimizing neural networks for various tasks often involves different strategies for fine‑tuning pre‑trained models versus training from scratch. However, existing optimizers primarily focus on reducing the loss function by updating model parameters, without fully addressing the unique demands of these two major paradigms. In this paper, we propose DualOpt, a novel approach that decouples optimization techniques specifically tailored for these distinct training scenarios. For training from scratch, we introduce real‑time layer‑wise weight decay, designed to enhance both convergence and generalization by aligning with the characteristics of weight updates and network architecture. For more importantly fine‑tuning, we integrate weight rollback with the optimizer, incorporating a rollback term into each weight update step. This ensures consistency in the weight distribution between upstream and downstream models, effectively mitigating knowledge forgetting and improving fine‑tuning performance. Additionally, we extend the layer‑wise weight decay to dynamically adjust the rollback levels across layers, adapting to the varying demands of different downstream tasks. Extensive experiments across diverse tasks, including image classification, object detection, semantic segmentation, and instance segmentation, demonstrate the broad applicability and state‑of‑the‑art performance of DualOpt. Code is available at https://github.com/qklee‑lz/OLOR‑AAAI‑2024.
Authors:Haonan Chen, Kaiwen Xiao, Bin Tian, Jun Fu
Abstract:
Autonomous parking remains a critical yet challenging task in intelligent driving systems, particularly within constrained urban environments where maneuvering space is limited and precise control is essential. While recent advances in end‑to‑end learning have shown great promise, the lack of high‑quality, structured datasets tailored for parking scenarios remains a significant bottleneck.To address this gap, we present ParkingScenes, a comprehensive multimodal dataset specifically designed for end‑to‑end autonomous parking in simulated scenes. Built on the CARLA simulator, ParkingScenes features structured parking trajectories generated by a Hybrid A planner and a Model Predictive Controller (MPC), providing accurate and reproducible supervision signals. The dataset includes 16 reverse‑in and 6 parallel parking scenarios, each executed under two pedestrian conditions (present and absent), resulting in 704 structured episodes and approximately 105000 frames. Each scenario is repeated 16 times to ensure consistent coverage. Each frame contains synchronized data from four RGB cameras, four depth sensors, vehicle motion states, and Bird's‑Eye View (BEV) representations, enabling rich multimodal fusion and context‑aware learning. To demonstrate the utility of our dataset, we compare models trained on ParkingScenes with those trained on unstructured, manually collected simulation data under identical conditions. Results show significant improvements in performance, underscoring the effectiveness of structured supervision for robust and accurate parking policy learning. By releasing both the dataset and the collection framework, ParkingScenes establishes a scalable and reproducible benchmark for advancing learning‑based autonomous parking systems. The dataset and collection framework will be released at: https://github.com/haonan‑ai/ParkingScenes
Authors:Jinqi Cao, Zhiping Yu, Baihong Lin, Chenyang Liu, Zhenwei Shi, Zhengxia Zou
Abstract:
Recent generative AI models have achieved remarkable breakthroughs in language and visual understanding. However, although these models can generate realistic visual content, their spatial scale remains confined to bounded environments, preventing them from capturing how geographic environments evolve across thousands of kilometers or from modeling the spatial structure of the large‑scale physical world. This limitation poses a critical challenge for ultra‑wide‑area spatial intelligence in Earth observation and simulation, revealing a deeper gap in generative AI: progress has relied primarily on scaling model parameters and training data, while overlooking spatial scale as a core dimension of intelligence. Here, motivated by this missing dimension, we investigate spatial scale as a new scaling axis in foundation models and present MetaEarth3D, the first generative foundation model capable of spatially consistent generation at the planetary scale. Taking optical Earth observation simulation as a testbed, MetaEarth3D enables the generation of multi‑level, unbounded, and diverse 3D scenes spanning large‑scale terrains, medium‑scale cities, and fine‑grained street blocks. Built upon 10 million globally distributed real‑world training images, MetaEarth3D demonstrates both strong visual realism and geospatial statistical realism. Beyond generation, MetaEarth3D serves as a generative data engine for diverse virtual environments in ultra‑wide spatial intelligence. We argue that this study may help empower next‑generation spatial intelligence for Earth observation.
Authors:Manyi Zhang, Ji-Fu Li, Zhongao Sun, Xiaohao Liu, Zhenhua Dong, Xianzhi Yu, Haoli Bai, Xiaobo Xia
Abstract:
Autonomous agent systems such as OpenClaw introduce significant efficiency challenges due to long‑context inputs and multi‑turn reasoning. This results in prohibitively high computational and monetary costs in real‑world development. While quantization is a standard approach for reducing cost and latency, its impact on agent performance in realistic scenarios remains unclear. In this work, we analyze quantization sensitivity across diverse complex workflows over OpenClaw, and show that precision requirements are highly task‑dependent. Based on this observation, we propose QuantClaw, a plug‑and‑play precision routing plugin that dynamically assigns precision according to task characteristics. QuantClaw routes lightweight tasks to lower‑cost configurations while preserving higher precision for demanding workloads, saving cost and accelerating inference without increasing user complexity. Experiments show that our QuantClaw maintains or improves task performance while reducing both latency and computational cost. Across a range of agent tasks, it achieves up to 21.4% cost savings and 15.7% latency reduction on GLM‑5 (FP8 baseline). These results highlight the benefit of treating precision as a dynamic resource in agent systems.
Authors:Bin Wu, Arastun Mammadli, Xiaoyu Zhang, Emine Yilmaz
Abstract:
The rapid growth of AI agent ecosystems is transforming how complex tasks are delegated and executed, creating a new challenge of identifying suitable agents for a given task. Unlike traditional tools, agent capabilities are often compositional and execution‑dependent, making them difficult to assess from textual descriptions alone. However, existing research and benchmarks typically assume well‑specified functionalities, controlled candidate pools, or only executable task queries, leaving realistic agent search scenarios insufficiently studied. We introduce AgentSearchBench, a large‑scale benchmark for agent search in the wild, built from nearly 10,000 real‑world agents across multiple providers. The benchmark formalizes agent search as retrieval and reranking problems under both executable task queries and high‑level task descriptions, and evaluates relevance using execution‑grounded performance signals. Experiments reveal a consistent gap between semantic similarity and actual agent performance, exposing the limitations of description‑based retrieval and reranking methods. We further show that lightweight behavioral signals, including execution‑aware probing, can substantially improve ranking quality, highlighting the importance of incorporating execution signals into agent discovery. Our code is available at https://github.com/Bingo‑W/AgentSearchBench.
Authors:Dongwei Sun, Jing Yao, Kan Wei, Xiangyong Cao, Chen Wu, Zhenghui Zhao, Pedram Ghamisi, Jun Zhou, Jón Atli Benediktsson
Abstract:
Rapid situational awareness is critical in post‑disaster response. While remote sensing damage assessment is evolving from pixel‑level change detection to high‑level semantic analysis, existing vision‑language methodologies still struggle to provide actionable intelligence for complex strategic queries. They remain severely constrained by unimodal optical dependence, a prevailing bias towards natural disasters, and a fundamental lack of grounded interactivity. To address these limitations, we present ChangeQuery, a unified multimodal framework designed for comprehensive, all‑weather disaster situation awareness. To overcome modality constraints and scenario biases, we construct the Disaster‑Induced Change Query (DICQ) dataset, a large‑scale benchmark coupling pre‑event optical semantics with post‑event SAR structural features across a balanced distribution of natural catastrophes and armed conflicts. Furthermore, to provide the high‑quality supervision required for interactive reasoning, we propose a novel Automated Semantic Annotation Pipeline. Adhering to a ``statistics‑first, generation‑later'' paradigm, this engine automatically transforms raw segmentation masks into grounded, hierarchical instruction sets, effectively equipping the model with fine‑grained spatial and quantitative awareness. Trained on this structured data, the ChangeQuery architecture operates as an interactive disaster analyst. It supports multi‑task reasoning driven by diverse user queries, delivering precise damage quantification, region‑specific descriptions, and holistic post‑disaster summaries. Extensive experiments demonstrate that ChangeQuery establishes a new state‑of‑the‑art, providing a robust and interpretable solution for complex disaster monitoring. The code is available at \hrefhttps://sundongwei.github.io/changequery/https://sundongwei.github.io/changequery/.
Authors:Zhanli Li, Yixuan Cao, Lvzhou Luo, Ping Luo
Abstract:
This paper introduces the task of analytical question answering over large, semi‑structured document collections. We present MuDABench, a benchmark for multi‑document analytical QA, where questions require extracting and synthesizing information across numerous documents to perform quantitative analysis. Unlike existing multi‑document QA benchmarks that typically require information from only a few documents with limited cross‑document reasoning, MuDABench demands extensive inter‑document analysis and aggregation. Constructed via distant supervision by leveraging document‑level metadata and annotated financial databases, MuDABench comprises over 80,000 pages and 332 analytical QA instances. We also propose an evaluation protocol that measures final answer accuracy and uses intermediate‑fact coverage as an auxiliary diagnostic signal for the reasoning process. Experiments reveal that standard RAG systems, which treat all documents as a flat retrieval pool, perform poorly. To address these limitations, we propose a multi‑agent workflow that orchestrates planning, extraction, and code generation modules. While this approach substantially improves both process and outcome metrics, a significant gap remains compared to human expert performance. Our analysis identifies two primary bottlenecks: single‑document information extraction accuracy and insufficient domain‑specific knowledge in current systems. MuDABench is available at https://github.com/Zhanli‑Li/MuDABench.
Authors:Zhancun Mu, Guangyu Zhao, Yiwu Zhong, Chi Zhang
Abstract:
One‑step offline RL actors are attractive because they avoid backpropagating through long iterative samplers and keep inference cheap, but they still have to improve under a critic without drifting away from actions that the dataset can support. In recent one‑step extraction pipelines, a strong iterative teacher provides one target action for each latent draw, and the same student output is asked to do both jobs: move toward higher Q and stay near that paired endpoint. If those two directions disagree, the loss resolves them as a compromise on that same sample, even when a nearby better action remains locally supported by the data. We propose DROL, a latent‑conditioned one‑step actor trained with top‑1 dynamic routing. For each state, the actor samples K candidate actions from a bounded latent prior, assigns each dataset action to its nearest candidate, and updates only that winner with Behavior Cloning and critic guidance. Because the routing is recomputed from the current candidate geometry, ownership of a supported region can shift across candidates over the course of learning. This gives a one‑step actor room to make local improvements that pointwise extraction struggles to capture, while retaining single‑pass inference at test time. On OGBench and D4RL, DROL is competitive with the one‑step FQL baseline, improving many OGBench task groups while remaining strong on both AntMaze and Adroit. Project page: https://muzhancun.github.io/preprints/DROL.
Authors:Chunyu Qiang, Xiaopeng Wang, Kang Yin, Yuzhe Liang, Yuxin Guo, Teng Ma, Ziyu Zhang, Tianrui Wang, Cheng Gong, Yushen Chen, Ruibo Fu, Chen Zhang, Longbiao Wang, Jianwu Dang
Abstract:
Generative audio modeling has largely been fragmented into specialized tasks, text‑to‑speech (TTS), text‑to‑music (TTM), and text‑to‑audio (TTA), each operating under heterogeneous control paradigms. Unifying these modalities remains a fundamental challenge due to the intrinsic dissonance between structured semantic representations (speech/music) and unstructured acoustic textures (sound effects). In this paper, we introduce UniSonate, a unified flow‑matching framework capable of synthesizing speech, music, and sound effects through a standardized, reference‑free natural language instruction interface. To reconcile structural disparities, we propose a novel dynamic token injection mechanism that projects unstructured environmental sounds into a structured temporal latent space, enabling precise duration control within a phoneme‑driven Multimodal Diffusion Transformer (MM‑DiT). Coupled with a multi‑stage curriculum learning strategy, this approach effectively mitigates cross‑modal optimization conflicts. Extensive experiments demonstrate that UniSonate achieves state‑of‑the‑art performance in instruction‑based TTS (WER 1.47%) and TTM (SongEval Coherence 3.18), while maintaining competitive fidelity in TTA. Crucially, we observe positive transfer, where joint training on diverse audio data significantly enhances structural coherence and prosodic expressiveness compared to single‑task baselines. Audio samples are available at https://qiangchunyu.github.io/UniSonate/.
Authors:Aotian Zheng, Winston Sun, Bahaa Alattar, Vitaly Ablavsky, Jenq-Neng Hwang
Abstract:
CLIP‑based person re‑identification (ReID) methods aggregate spatial features into a single global \texttt[CLS] token optimized for image‑text alignment rather than spatial selectivity, making representations fragile under occlusion and cross‑camera variation. We propose SAGA‑ReID, which reconstructs identity representations by aligning intermediate patch tokens with anchor vectors parameterized in CLIP's text embedding space ‑‑ emphasizing spatially stable evidence while suppressing corrupted or absent regions, without requiring textual descriptions of individual images. Controlled experiments isolate the aggregation mechanism under two qualitatively distinct conditions ‑‑ synthetic masking, where identity signal is absent, and realistic human distractors, where an overlapping person introduces semantically confusing signal ‑‑ with SAGA's advantage over global pooling growing substantially as occlusion increases across both conditions. Benchmark evaluations confirm consistent gains over CLIP‑ReID across standard and occluded settings, with the largest improvements where global pooling is most unreliable: up to +10.6 Rank‑1 on occluded benchmarks. SAGA's aggregation outperforms dedicated sequential patch aggregation on a stronger backbone, confirming that structured reconstruction addresses a bottleneck that backbone quality and architectural complexity alone cannot resolve. Code available at https://github.com/ipl‑uw/Structured‑Anchor‑Guided‑Aggregation‑for‑ReID.
Authors:Rico Angell, Raghav Singhal, Zachary Horvitz, Zhou Yu, Rajesh Ranganath, Kathleen McKeown, He He
Abstract:
Language models are increasingly capable and are being rapidly deployed on a population‑level scale. As a result, the safety of these models is increasingly high‑stakes. Fortunately, advances in alignment have significantly reduced the likelihood of harmful model outputs. However, when models are queried billions of times in a day, even rare worst‑case behaviors will occur. Current safety evaluations focus on capturing the distribution of inputs that yield harmful outputs. These evaluations disregard the probabilistic nature of models and their tail output behavior. To measure this tail risk, we propose a method to efficiently estimate the probability of harmful outputs for any input query. Instead of naive brute‑force sampling from the target model, where harmful outputs could be rare, we operationalize importance sampling by creating unsafe versions of the target model. These unsafe versions enable sample‑efficient estimation by making harmful outputs more probable. On benchmarks measuring misuse and misalignment, these estimates match brute‑force Monte Carlo estimates using 10‑20x fewer samples. For example, we can estimate probability of harmful outputs on the order of 10^‑4 with just 500 samples. Additionally, we find that these harmfulness estimates can reveal the sensitivity of models to perturbations in model input and predict deployment risks. Our work demonstrates that accurate rare‑event estimation is both critical and feasible for safety evaluations. Code is available at https://github.com/rangell/LMTailRisk
Authors:Pruthvinath Jeripity Venkata
Abstract:
When you ask an AI assistant for advice about your career, your marriage, or a conflict with your family, does it give you the same answer regardless of where you are from? We tested this systematically by presenting three leading AI systems (Claude Sonnet 4.5, GPT‑5.4, and Gemini 2.5 Flash) with ten real‑life personal dilemmas, framed for users from 10 countries across 5 continents in 7 languages (n=840 scored responses). We compared AI advice against World Values Survey Wave 7 data measuring what people in each country actually believe. All three AI systems consistently gave Western‑style, individualist advice even to users from societies that prioritize family, community, and authority, significantly more so than local values would predict (mean gap +0.76 on a 1‑5 scale; t=15.65, p<0.001). The gap is largest for Nigeria (+1.85) and India (+0.82). Japan is the sole exception: AI systems treated Japanese users as more group‑oriented than surveys show, revealing that AI encodes outdated stereotypes. Claude and GPT‑5.4 show nearly identical bias magnitude, while Gemini is lower but still significant. The models diverge in mechanism: Claude shifts further collectivist in the user's native language; Gemini shifts more individualist; GPT‑5.4 responds only to stated country identity. These findings point to a systemic homogenization of values across frontier AI. Data, code, and scoring pipeline are openly released.
Authors:Deepank Girish, Yi Hao Chan, Sukrit Gupta, Jing Xia, Jagath C. Rajapakse
Abstract:
Several brain foundation models (FM) have recently been proposed to predict brain disorders by modelling dynamic functional connectivity (FC). While they demonstrate remarkable model performance and zero‑ or few‑shot generalization, the salient features identified as potential biomarkers are yet to be thoroughly evaluated. We propose RE‑CONFIRM, a framework for evaluating the robustness of potential biomarker candidates elucidated by deep learning (DL) models including FMs. From experiments on five large datasets of Autism Spectrum Disorder (ASD), Attention‑deficit Hyperactivity Disorder (ADHD), and Alzheimer's Disease (AD), we found that although commonly used performance metrics provide an intuitive assessment of model predictions, they are insufficient for evaluating the robustness of biomarkers identified by these models. RE‑CONFIRM metrics revealed that simply finetuning FMs leads to models that fail to capture regional hubs effectively, even in disorders where hubs are known to be implicated, such as ASD and ADHD. In view of this, we propose Hub‑LoRA (Low‑Rank Adaptation) as a fine‑tuning technique that enables FMs to not only outperform customised DL models but also produce neurobiologically faithful biomarkers supported by meta‑analyses. RE‑CONFIRM is generalizable and can be easily applied to ascertain the robustness of DL models trained on functional MRI datasets. Code is available at: https://github.com/SCSE‑Biomedical‑Computing‑Group/RE‑CONFIRM.
Authors:Grigory Sapunov
Abstract:
We study learned memory tokens as a computational scratchpad for a single‑block Universal Transformer with Adaptive Computation Time (ACT) on Sudoku‑Extreme, a combinatorial reasoning benchmark. Memory tokens are empirically necessary: no configuration without them reaches non‑trivial performance. The optimal count has a sharp lower threshold (T=0 always fails, T=8 reliably succeeds) followed by a stable plateau (T=8‑32, 57.4% +/‑ 0.7% exact‑match) and a dilution boundary at T=64. Under halt‑side pressure (lambda warmup), mean halt drops monotonically with memory size across the plateau (from 11.6 at T=8 to 8.3 at T=64), showing that memory tokens and ponder depth substitute as resources at fixed accuracy. We also identify a router initialization trap that causes the majority of training runs to fail: both default zero‑bias and Graves' recommended positive bias settle into a shallow halt equilibrium the model cannot escape. Inverting the bias to ‑3 ("deep start") eliminates the failure mode, and ablation shows the trap is inherent to ACT initialization rather than an artifact of our architecture. With reliable training, ACT yields an order of magnitude lower seed variance than fixed‑depth processing (+/‑0.7 vs +/‑9.3 pp); lambda warmup recovers 34% of compute at matched accuracy; and attention heads specialize into memory readers, constraint propagators, and integrators across recursive depth. Code: https://github.com/che‑shr‑cat/utm‑jax.
Authors:Charles Junichi McAndrews
Abstract:
Small language models (1‑3B) are practical to run locally, but individually limited on harder code generation tasks. We ask whether composing them into pipelines can recover some of that lost capability. We study code generation pipelines built from 1‑3B models with execution feedback, and use a NEAT‑inspired evolutionary search to test whether more complex pipeline structure helps beyond a simple refinement loop. We evaluate on HumanEval (164 problems) and sanitized MBPP (427 problems), all with local inference on a single laptop. Self‑refinement with execution feedback improves code generation by more than 4 standard deviations on both benchmarks. The gains are narrow in mechanism: refinement fixes many runtime errors (especially NameError and SyntaxError), but rarely fixes logic errors such as AssertionError. Within our tested general‑purpose model pool, generator identity mattered less than refiner capability: a 1.5B generator paired with a 3B refiner matched a 3B model doing both roles. Early stopping is essential; without it, every iteration is net‑negative. The code‑specialized models outperform every general‑purpose pipeline configuration, suggesting model specialization matters more than pipeline architecture. Preliminary text‑only pipeline experiments without execution feedback did not show gains at this scale. In our constrained search space, evolutionary search mostly rediscovered the same simple generate‑execute‑refine loop we found manually, with no clearly significant gain from added topology. Single‑evaluation fitness inflates results by 5‑7 percent, selecting lucky genomes over good ones. On these benchmarks at 1‑3B scale, execution feedback mattered more than added pipeline complexity in determining whether composition helped.
Authors:Pegah Khayatan, Jayneel Parekh, Arnaud Dapogny, Mustafa Shukor, Alasdair Newson, Matthieu Cord
Abstract:
Despite impressive progress in capabilities of large vision‑language models (LVLMs), these systems remain vulnerable to hallucinations, i.e., outputs that are not grounded in the visual input. Prior work has attributed hallucinations in LVLMs to factors such as limitations of the vision backbone or the dominance of the language component, yet the relative importance of these factors remains unclear. To resolve this ambiguity, We propose HalluScope, a benchmark to better understand the extent to which different factors induce hallucinations. Our analysis indicates that hallucinations largely stem from excessive reliance on textual priors and background knowledge, especially information introduced through textual instructions. To mitigate hallucinations induced by textual instruction priors, we propose HalluVL‑DPO, a framework for fine‑tuning off‑the‑shelf LVLMs towards more visually grounded responses. HalluVL‑DPO leverages preference optimization using a curated training dataset that we construct, guiding the model to prefer grounded responses over hallucinated ones. We demonstrate that our optimized model effectively mitigates the targeted hallucination failure mode, while preserving or improving performance on other hallucination benchmarks and visual capability evaluations. To support reproducibility and further research, we will publicly release our evaluation benchmark, preference training dataset, and code at https://pegah‑kh.github.io/projects/prompts‑override‑vision/ .
Authors:Anuj Sadani, Deepak Kumar
Abstract:
The Model Context Protocol (MCP) has become a common interface for connecting large language model (LLM) agents to external tools, but its reliance on stateless, eager schema injection imposes a hidden per‑turn overhead the MCP Tax or Tools Tax that practitioner reports place between roughly 10k and 60k tokens in typical multi‑server deployments. This payload inflates the key‑value cache, is associated with reasoning degradation as context utilization approaches published fracture points around 70%, and turns token budgets into a recurring operational cost. We introduce Tool Attention, a middleware‑layer mechanism that generalizes the "Attention Is All You Need" paradigm from self‑attention over tokens to gated attention over tools. Tool Attention combines (i) an Intent Schema Overlap (ISO) score from sentence embeddings, (ii) a state‑aware gating function enforcing preconditions and access scopes, and (iii) a two‑phase lazy schema loader that keeps a compact summary pool in context and promotes full JSON schemas only for top‑k gated tools. We evaluate on a simulated 120‑tool, six‑server benchmark whose per‑server token counts are calibrated to public audits of real MCP deployments. In this simulation, Tool Attention directly reduces measured per‑turn tool tokens by 95.0% (47.3k ‑> 2.4k) and raises effective context utilization (a token‑ratio quantity) from 24% to 91%. End‑to‑end figures for task success, latency, cost, and reasoning quality are reported as projections derived from the measured token counts combined with published deployment telemetry; they are not measured on live LLM agents, and we mark projected values explicitly throughout. Taken together, the results support a simple thesis: protocol‑level efficiency, not raw context length, is a binding constraint on scalable gentic systems. The code for this work is accessible at https://github.com/asadani/tool‑attention
Authors:Safouane El Ghazouali, Nicola Venturi, Michael Rueegsegger, Umberto Michelucci
Abstract:
Recent advances in deep learning for remote sensing rely heavily on large annotated datasets, yet acquiring high‑quality ground truth for geometric, radiometric, and multi‑domain tasks remains costly and often infeasible. In particular, the lack of accurate depth annotations, controlled illumination variations, and multi‑scale paired imagery limits progress in monocular depth estimation, domain adaptation, and super‑resolution for aerial scenes. We present SyMTRS, a large‑scale synthetic dataset generated using a high‑fidelity urban simulation pipeline. The dataset provides high‑resolution RGB aerial imagery (2048 x 2048), pixel‑perfect depth maps, night‑time counterparts for domain adaptation, and aligned low‑resolution variants for super‑resolution at x2, x4, and x8 scales. Unlike existing remote sensing datasets that focus on a single task or modality, SyMTRS is designed as a unified multi‑task benchmark enabling joint research in geometric understanding, cross‑domain robustness, and resolution enhancement. We describe the dataset generation process, its statistical properties, and its positioning relative to existing benchmarks. SyMTRS aims to bridge critical gaps in remote sensing research by enabling controlled experiments with perfect geometric ground truth and consistent multi‑domain supervision. The results obtained in this work can be reproduced from this Github repository: https://github.com/safouaneelg/SyMTRS.
Authors:Buqiang Xu, Yijun Chen, Jizhan Fang, Ruobin Zhong, Yunzhi Yao, Yuqi Zhu, Lun Du, Shumin Deng
Abstract:
Long‑term conversational agents need memory systems that capture relationships between events, not merely isolated facts, to support temporal reasoning and multi‑hop question answering. Current approaches face a fundamental trade‑off: flat memory is efficient but fails to model relational structure, while graph‑based memory enables structured reasoning at the cost of expensive and fragile construction. To address these issues, we propose StructMem, a structure‑enriched hierarchical memory framework that preserves event‑level bindings and induces cross‑event connections. By temporally anchoring dual perspectives and performing periodic semantic consolidation, StructMem improves temporal reasoning and multi‑hop performance on \textttLoCoMo, while substantially reducing token usage, API calls, and runtime compared to prior memory systems, see https://github.com/zjunlp/LightMem .
Authors:Dat To-Thanh, Nghia Nguyen-Trong, Hoang Vo, Hieu Bui-Minh, Tinh-Anh Nguyen-Nhu
Abstract:
Image enhancement models for mobile devices often struggle to balance high output quality with the fast processing speeds required by mobile hardware. While recent deep learning models can enhance low‑quality mobile photos into high‑quality images, their performance is often degraded when converted to lower‑precision formats for actual use on mobile phones. To address this training‑deployment mismatch, we propose an efficient image enhancement model designed specifically for mobile deployment. Our approach uses a hierarchical network architecture with gated encoder blocks and multiscale refinement to preserve fine‑grained visual features. Moreover, we incorporate Quantization‑Aware Training (QAT) to simulate the effects of low‑precision representation during the training process. This allows the network to adapt and prevents the typical drop in quality seen with standard post‑training quantization (PTQ). Experimental results demonstrate that the proposed method produces high‑fidelity visual output while maintaining the low computational overhead needed for practical use on standard mobile devices. The code will be available at https://github.com/GenAI4E/QATIE.git.
Authors:Wujiang Xu, Jiaojiao Han, Minghao Guo, Kai Mei, Xi Zhu, Han Zhang, Dimitris N. Metaxas
Abstract:
LLM agents increasingly operate in open‑ended environments spanning hundreds of sequential episodes, yet they remain largely stateless: each task is solved from scratch without converting past experience into better future behavior. The central obstacle is not \emphwhat to remember but \emphhow to use what has been remembered, including which retrieval policy to apply, how to interpret prior outcomes, and when the current strategy itself must change. We introduce \emphAgent Evolving Learning (\ael), a two‑timescale framework that addresses this obstacle. At the fast timescale, a Thompson Sampling bandit learns which memory retrieval policy to apply at each episode; at the slow timescale, LLM‑driven reflection diagnoses failure patterns and injects causal insights into the agent's decision prompt, giving it an interpretive frame for the evidence it retrieves. On a sequential portfolio benchmark (10 sector‑diverse tickers, 208 episodes, 5 random seeds), \ael achieves a Sharpe ratio of 2.13\pm0.47, outperforming five published self‑improving methods and all non‑LLM baselines while maintaining the lowest variance among all LLM‑based approaches. A nine‑variant ablation reveals a ``less is more'' pattern: memory and reflection together produce a 58% cumulative improvement over the stateless baseline, yet every additional mechanism we test (planner evolution, per‑tool selection, cold‑start initialization, skill extraction, and three credit assignment methods) \emphdegrades performance. This demonstrates that the bottleneck in agent self‑improvement is \emphself‑diagnosing how to use experience rather than adding architectural complexity. Code and data: https://github.com/WujiangXu/AEL.
Authors:Zhiqiu Lin, Chancharik Mitra, Siyuan Cen, Isaac Li, Yuhan Huang, Yu Tong Tiffany Ling, Hewei Wang, Irene Pi, Shihang Zhu, Ryan Rao, George Liu, Jiaxi Li, Ruojin Li, Yili Han, Yilun Du, Deva Ramanan
Abstract:
Video‑language models (VLMs) learn to reason about the dynamic visual world through natural language. We introduce a suite of open datasets, benchmarks, and recipes for scalable oversight that enable precise video captioning. First, we define a structured specification for describing subjects, scenes, motion, spatial, and camera dynamics, grounded by hundreds of carefully defined visual primitives developed with professional video creators such as filmmakers. Next, to curate high‑quality captions, we introduce CHAI (Critique‑based Human‑AI Oversight), a framework where trained experts critique and revise model‑generated pre‑captions into improved post‑captions. This division of labor improves annotation accuracy and efficiency by offloading text generation to models, allowing humans to better focus on verification. Additionally, these critiques and preferences between pre‑ and post‑captions provide rich supervision for improving open‑source models (Qwen3‑VL) on caption generation, reward modeling, and critique generation through SFT, DPO, and inference‑time scaling. Our ablations show that critique quality in precision, recall, and constructiveness, ensured by our oversight framework, directly governs downstream performance. With modest expert supervision, the resulting model outperforms closed‑source models such as Gemini‑3.1‑Pro. Finally, we apply our approach to re‑caption large‑scale professional videos (e.g., films, commercials, games) and fine‑tune video generation models such as Wan to better follow detailed prompts of up to 400 words, achieving finer control over cinematography including camera motion, angle, lens, focus, point of view, and framing. Our results show that precise specification and human‑AI oversight are key to professional‑level video understanding and generation. Data and code are available on our project page: https://linzhiqiu.github.io/papers/chai/
Authors:Qizhuo Xie, Yunhui Liu, Yu Xing, Qianzi Hou, Xudong Jin, Tao Zheng, Tieke He
Abstract:
Large Language Models (LLMs) have shown immense potential in Knowledge Graph Completion (KGC), yet bridging the modality gap between continuous graph embeddings and discrete LLM tokens remains a critical challenge. While recent quantization‑based approaches attempt to align these modalities, they typically treat quantization as flat numerical compression, resulting in semantically entangled codes that fail to mirror the hierarchical nature of human reasoning. In this paper, we propose GS‑Quant, a novel framework that generates semantically coherent and structurally stratified discrete codes for KG entities. Unlike prior methods, GS‑Quant is grounded in the insight that entity representations should follow a linguistic coarse‑to‑fine logic. We introduce a Granular Semantic Enhancement module that injects hierarchical knowledge into the codebook, ensuring that earlier codes capture global semantic categories while later codes refine specific attributes. Furthermore, a Generative Structural Reconstruction module imposes causal dependencies on the code sequence, transforming independent discrete units into structured semantic descriptors. By expanding the LLM vocabulary with these learned codes, we enable the model to reason over graph structures isomorphically to natural language generation. Experimental results demonstrate that GS‑Quant significantly outperforms existing text‑based and embedding‑based baselines. Our code is publicly available at https://github.com/mikumifa/GS‑Quant.
Authors:Sukesh Subaharan
Abstract:
Standard reinforcement learning (RL) optimizes policies for reward but imposes few constraints on how decisions evolve over time. As a result, policies may achieve high performance while exhibiting temporally incoherent behavior such as abrupt confidence shifts, oscillations, or degenerate inactivity. We introduce Dynamical Prior Reinforcement Learning (DP‑RL), a training framework that augments policy gradient learning with an auxiliary loss derived from external state dynamics that implement evidence accumulation and hysteresis. Without modifying the reward, environment, or policy architecture, this prior shapes the temporal evolution of action probabilities during learning. Across three minimal environments, we show that dynamical priors systematically alter decision trajectories in task‑dependent ways, promoting temporally structured behavior that cannot be explained by generic smoothing. These results demonstrate that training objectives alone can control the temporal geometry of decision‑making in RL agents.
Authors:Yixuan Zhu, Shilin Ma, Haolin Wang, Ao Li, Yanzhe Jing, Yansong Tang, Lei Chen, Jiwen Lu, Jie Zhou
Abstract:
Recent advancements in visual autoregressive models (VAR) have demonstrated their effectiveness in image generation, highlighting their potential for real‑world image super‑resolution (Real‑ISR). However, adapting VAR for ISR presents critical challenges. The next‑scale prediction mechanism, constrained by causal attention, fails to fully exploit global low‑quality (LQ) context, resulting in blurry and inconsistent high‑quality (HQ) outputs. Additionally, error accumulation in the iterative prediction severely degrades coherence in ISR task. To address these issues, we propose VARestorer, a simple yet effective distillation framework that transforms a pre‑trained text‑to‑image VAR model into a one‑step ISR model. By leveraging distribution matching, our method eliminates the need for iterative refinement, significantly reducing error propagation and inference time. Furthermore, we introduce pyramid image conditioning with cross‑scale attention, which enables bidirectional scale‑wise interactions and fully utilizes the input image information while adapting to the autoregressive mechanism. This prevents later LQ tokens from being overlooked in the transformer. By fine‑tuning only 1.2% of the model parameters through parameter‑efficient adapters, our method maintains the expressive power of the original VAR model while significantly enhancing efficiency. Extensive experiments show that VARestorer achieves state‑of‑the‑art performance with 72.32 MUSIQ and 0.7669 CLIPIQA on DIV2K dataset, while accelerating inference by 10 times compared to conventional VAR inference.
Authors:Wadii Boulila, Adel Ammar, Bilel Benjdira, Maha Driss
Abstract:
Self‑supervised learning (SSL) is a standard approach for representation learning in aerial imagery. Existing methods enforce invariance between augmented views, which works well when augmentations preserve semantic content. However, aerial images are frequently degraded by haze, motion blur, rain, and occlusion that remove critical evidence. Enforcing alignment between a clean and a severely degraded view can introduce spurious structure into the latent space. This study proposes a training strategy and architectural modification to enhance SSL robustness to such corruptions. It introduces a per‑sample, per‑factor trust weight into the alignment objective, combined with the base contrastive loss as an additive residual. A stop‑gradient is applied to the trust weight instead of a multiplicative gate. While a multiplicative gate is a natural choice, experiments show it impairs the backbone, whereas our additive‑residual approach improves it. Using a 200‑epoch protocol on a 210,000‑image corpus, the method achieves the highest mean linear‑probe accuracy among six backbones on EuroSAT, AID, and NWPU‑RESISC45 (90.20% compared to 88.46% for SimCLR and 89.82% for VICReg). It yields the largest improvements under severe information‑erasing corruptions on EuroSAT (+19.9 points on haze at s=5 over SimCLR). The method also demonstrates consistent gains of +1 to +3 points in Mahalanobis AUROC on a zero‑shot cross‑domain stress test using BDD100K weather splits. Two ablations (scalar uncertainty and cosine gate) indicate the additive‑residual formulation is the primary source of these improvements. An evidential variant using Dempster‑Shafer fusion introduces interpretable signals of conflict and ignorance. These findings offer a concrete design principle for uncertainty‑aware SSL. Code is publicly available at https://github.com/WadiiBoulila/trust‑ssl.
Authors:Yongcan Yu, Lingxiao He, Jian Liang, Kuangpu Guo, Meng Wang, Qianlong Xie, Xingxing Wang, Ran He
Abstract:
Test‑time reinforcement learning (TTRL) always adapts models at inference time via pseudo‑labeling, leaving it vulnerable to spurious optimization signals from label noise. Through an empirical study, we observe that responses with medium consistency form an ambiguity region and constitute the primary source of reward noise. Crucially, we find that such spurious signals can be even amplified through group‑relative advantage estimation. Motivated by these findings, we propose a unified framework, Debiased and Denoised test‑time Reinforcement Learning (DDRL), to mitigate spurious signals. Concretely, DDRL first applies a frequency‑based sampling strategy to exclude ambiguous samples while maintaining a balanced set of positive and negative examples. It then adopts a debiased advantage estimation with fixed advantages, removing the bias introduced by group‑relative policy optimization. Finally, DDRL incorporates a consensus‑based off‑policy refinement stage, which leverages the rejection‑sampled dataset to enable efficient and stable model updates. Experiments on three large language models across multiple mathematical reasoning benchmarks demonstrate that DDRL consistently outperforms existing TTRL baselines. The code will soon be released at https://github.com/yuyongcan/DDRL.
Authors:Kai Liu, Haoyang Yue, Zeli Lin, Zheng Chen, Jingkai Wang, Jue Gong, Jiatong Li, Xianglong Yan, Libo Zhu, Jianze Li, Ziqing Zhang, Zihan Zhou, Xiaoyang Liu, Radu Timofte, Yulun Zhang, Junye Chen, Zhenming Yan, Yucong Hong, Ruize Han, Song Wang, Li Pang, Heng Zhao, Xinqiao Wu, Deyu Meng, Xiangyong Cao, Weijun Yuan, Zhan Li, Zhanglu Chen, Boyang Yao, Yihang Chen, Yifan Deng, Zengyuan Zuo, Junjun Jiang, Saiprasad Meesiyawar, Sulocha Yatageri, Nikhil Akalwadi, Ramesh Ashok Tabib, Uma Mudenagudi, Jiachen Tu, Yaokun Shi, Guoyi Xu, Yaoxin Jiang, Cici Liu, Tongyao Mu, Qiong Cao, Yifan Wang, Kosuke Shigematsu, Hiroto Shirono, Asuka Shin, Wei Zhou, Linfeng Li, Lingdong Kong, Ce Wang, Xingwei Zhong, Wanjie Sun, Dafeng Zhang, Hongxin Lan, Qisheng Xu, Mingyue He, Hui Geng, Tianjiao Wan, Kele Xu, Changjian Wang, Antoine Carreaud, Nicola Santacroce, Shanci Li, Jan Skaloud, Adrien Gressin
Abstract:
This paper presents the NTIRE 2026 Remote Sensing Infrared Image Super‑Resolution (x4) Challenge, one of the associated challenges of NTIRE 2026. The challenge aims to recover high‑resolution (HR) infrared images from low‑resolution (LR) inputs generated through bicubic downsampling with a x4 scaling factor. The objective is to develop effective models or solutions that achieve state‑of‑the‑art performance for infrared image SR in remote sensing scenarios. To reflect the characteristics of infrared data and practical application needs, the challenge adopts a single‑track setting. A total of 115 participants registered for the competition, with 13 teams submitting valid entries. This report summarizes the challenge design, dataset, evaluation protocol, main results, and the representative methods of each team. The challenge serves as a benchmark to advance research in infrared image super‑resolution and promote the development of effective solutions for real‑world remote sensing applications.
Authors:Jon-Paul Cacioli
Abstract:
Cacioli (2026) showed that the K‑way energy probe on standard discriminative predictive coding networks reduces approximately to a monotone function of the log‑softmax margin. The reduction rests on five assumptions, including cross‑entropy (CE) at the output and effectively feedforward inference dynamics. This pre‑registered study tests the reduction's sensitivity to CE removal using two conditions: standard PC trained with MSE instead of CE, and bidirectional PC (bPC; Oliviers, Tang & Bogacz, 2025). Across 10 seeds on CIFAR‑10 with a matched 2.1M‑parameter backbone, we find three results. The negative result replicates on standard PC: the probe sits below softmax (Delta = ‑0.082, p < 10^‑6). On bPC the probe exceeds softmax across all 10 seeds (Delta = +0.008, p = 0.000027), though a pre‑registered manipulation check shows that bPC does not produce materially greater latent movement than standard PC at this scale (ratio 1.6, threshold 10). Removing CE alone without changing inference dynamics halves the probe‑softmax gap (Delta_MSE = ‑0.037 vs Delta_stdPC = ‑0.082). CE is a major empirically load‑bearing component of the decomposition at this scale. CE training produces output logit norms approximately 15x larger than MSE or bPC training. A post‑hoc temperature scaling ablation decomposes the probe‑softmax gap into two components: approximately 66% is attributable to logit‑scale effects removable by temperature rescaling, and approximately 34% reflects a scale‑invariant ranking advantage of CE‑trained representations. We use "metacognitive" operationally to denote Type‑2 discrimination of a readout over its own Type‑1 correctness, not to imply human‑like introspective access.
Authors:Robin Dey, Panyanon Viradecha
Abstract:
MemPalace is an open‑source AI memory system that applies the ancient method of loci (memory palace) spatial metaphor to organize long‑term memory for large language models; launched in April 2026, it accumulated over 47,000 GitHub stars in its first two weeks and claims state‑of‑the‑art retrieval performance on the LongMemEval benchmark (96.6% Recall@5) without requiring any LLM inference at write time. Through independent codebase analysis, benchmark replication, and comparison with competing systems, we find that MemPalace's headline retrieval performance is attributable primarily to its verbatim storage philosophy combined with ChromaDB's default embedding model (all‑MiniLM‑L6‑v2), rather than to its spatial organizational metaphor per se ‑‑ the palace hierarchy (Wings‑>Rooms‑>Closets‑>Drawers) operates as standard vector database metadata filtering, an effective but well‑established technique. However, MemPalace makes several genuinely novel contributions: (1) a contrarian verbatim‑first storage philosophy that challenges extraction‑based competitors, (2) an extremely low wake‑up cost (approximately 170 tokens) through its four‑layer memory stack, (3) a fully deterministic, zero‑LLM write path enabling offline operation at zero API cost, and (4) the first systematic application of spatial memory metaphors as an organizing principle for AI memory systems. We also note that the competitive landscape is evolving rapidly, with Mem0's April 2026 token‑efficient algorithm raising their LongMemEval score from approximately 49% to 93.4%, narrowing the gap between extraction‑based and verbatim approaches. Our analysis concludes that MemPalace represents significant architectural insight wrapped in overstated claims ‑‑ a pattern common in rapidly adopted open‑source projects where marketing velocity exceeds scientific rigor.
Authors:Jindi Guo, Chaozheng Huang, Xi Fang
Abstract:
We introduce MMTR‑Bench, a benchmark designed to evaluate the intrinsic ability of Multimodal Large Language Models (MLLMs) to reconstruct masked text directly from visual context. Unlike conventional question‑answering tasks, MMTR‑Bench eliminates explicit prompts, requiring models to recover masked text from single‑ or multi‑page inputs across real‑world domains such as documents and webpages. This design isolates the reconstruction task from instruction‑following abilities, enabling a direct assessment of a model's layout understanding, visual grounding, and knowledge integration. MMTR‑Bench comprises 2,771 test samples spanning multiple languages and varying target lengths. To account for this diversity, we propose a level‑aware evaluation protocol. Experiments on representative MLLMs show that the benchmark poses a significant challenge, especially for sentence‑ and paragraph‑level reconstruction. The homepage is available at https://mmtr‑bench‑dataset.github.io/MMTR‑Bench/.
Authors:Lars van der Laan, Mark Van Der Laan
Abstract:
We study semisupervised mean estimation with a small labeled sample, a large unlabeled sample, and a black‑box prediction model whose output may be miscalibrated. A standard approach in this setting is augmented inverse‑probability weighting (AIPW) [Robins et al., 1994], which protects against prediction‑model misspecification but can be inefficient when the prediction score is poorly aligned with the outcome scale. We introduce Calibrated Prediction‑Powered Inference, which post‑hoc calibrates the prediction score on the labeled sample before using it for semisupervised estimation. This simple step requires no retraining and can improve the original score both as a predictor of the outcome and as a regression adjustment for semisupervised inference. We study both linear and isotonic calibration. For isotonic calibration, we establish first‑order optimality guarantees: isotonic post‑processing can improve predictive accuracy and estimator efficiency relative to the original score and simpler post‑processing rules, while no further post‑processing of the fitted isotonic score yields additional first‑order gains. For linear calibration, we show first‑order equivalence to PPI++. We also clarify the relationship among existing estimators, showing that the original PPI estimator is a special case of AIPW and can be inefficient when the prediction model is accurate, while PPI++ is AIPW with empirical efficiency maximization [Rubin et al., 2008]. In simulations and real‑data experiments, our calibrated estimators often outperform PPI and are competitive with, or outperform, AIPW and PPI++. We provide an accompanying Python package, ppi_aipw, at https://larsvanderlaan.github.io/ppi‑aipw/.
Authors:Dachong Li, ZhuangZhuang Chen, Jin Zhang, Jianqiang Li
Abstract:
Vision‑‑Language‑‑Action (VLA) models often use intermediate representations to connect multimodal inputs with continuous control, yet spatial guidance is often injected implicitly through latent features. We propose CorridorVLA, which predicts sparse spatial anchors as incremental physical changes (e.g., Δ‑positions) and uses them to impose an explicit tolerance region in the training objective for action generation. The anchors define a corridor that guides a flow‑matching action head: trajectories whose implied spatial evolution falls outside it receive corrective gradients, while minor deviations from contacts and execution noise are permitted. On the more challenging LIBERO‑Plus benchmark, CorridorVLA yields consistent gains across both SmolVLA and GR00T, improving success rate by 3.4%‑‑12.4% over the corresponding baselines; notably, our GR00T‑Corr variant reaches a success rate of 83.21%. These results indicate that action‑aligned physical cues can provide direct and interpretable constraints for generative action policies, complementing spatial guidance encoded in visual or latent forms. Code is available at https://github.com/corridorVLA.
Authors:Sepideh Abedini, M. Tamer Özsu
Abstract:
Text‑to‑SQL models have significantly improved with the adoption of Large Language Models (LLMs), leading to their increasing use in real‑world applications. Although many benchmarks exist for evaluating the performance of text‑to‑SQL models, they often rely on a single aggregate score, lack evaluation under realistic settings, and provide limited insight into model behaviour across different query types. In this work, we present SQLyzr, a comprehensive benchmark and evaluation platform for text‑to‑SQL models. SQLyzr incorporates a diverse set of evaluation metrics that capture multiple aspects of generated queries, while enabling more realistic evaluation through workload alignment with real‑world SQL usage patterns and database scaling. It further supports fine‑grained query classification, error analysis, and workload augmentation, allowing users to better diagnose and improve text‑to‑SQL models. This demonstration showcases these capabilities through an interactive experience. Through SQLyzr's graphical interface, users can customize evaluation settings, analyze fine‑grained reports, and explore additional features of the platform. We envision that SQLyzr facilitates the evaluation and iterative improvement of text‑to‑SQL models by addressing key limitations of existing benchmarks. The source code of SQLyzr is available at https://github.com/sepideh‑abedini/SQLyzr.
Authors:Shan Dong, Palakorn Achananuparp, Hieu Hien Mai, Lei Wang, Yao Lu, Ee-Peng Lim
Abstract:
In this work, we develop a novel reasoning approach to enhance the performance of large language models (LLMs) in future occupation prediction. In this approach, a reason generator first derives a ``reason'' for a user using his/her past education and career history. The reason summarizes the user's preference and is used as the input of an occupation predictor to recommend the user's next occupation. This two‑step occupation prediction approach is, however, non‑trivial as LLMs are not aligned with career paths or the unobserved reasons behind each occupation decision. We therefore propose to fine‑tune LLMs improving their reasoning and occupation prediction performance. We first derive high‑quality oracle reasons, as measured by factuality, coherence and utility criteria, using a LLM‑as‑a‑Judge. These oracle reasons are then used to fine‑tune small LLMs to perform reason generation and next occupation prediction. Our extensive experiments show that: (a) our approach effectively enhances LLM's accuracy in next occupation prediction making them comparable to fully supervised methods and outperforming unsupervised methods; (b) a single LLM fine‑tuned to perform reason generation and occupation prediction outperforms two LLMs fine‑tuned to perform the tasks separately; and (c) the next occupation prediction accuracy depends on the quality of generated reasons. Our code is available at https://github.com/Sarasarahhhhh/job_prediction.
Authors:Jiabao Ji, Yongchao Chen, Yang Zhang, Ramana Rao Kompella, Chuchu Fan, Gaowen Liu, Shiyu Chang
Abstract:
Multi‑robot control in cluttered environments is a challenging problem that involves complex physical constraints, including robot‑robot collisions, robot‑obstacle collisions, and unreachable motions. Successful planning in such settings requires joint optimization over high‑level task planning and low‑level motion planning, as violations of physical constraints may arise from failures at either level. However, jointly optimizing task and motion planning is difficult due to the complex parameterization of low‑level motion trajectories and the ambiguity of credit assignment across the two planning levels. In this paper, we propose a hybrid multi‑robot control framework that jointly optimizes task and motion planning. To enable effective parameterization of low‑level planning, we introduce waypoints, a simple yet expressive representation for motion trajectories. To address the credit assignment challenge, we adopt a curriculum‑based training strategy with a modified RLVR algorithm that propagates motion feasibility feedback from the motion planner to the task planner. Experiments on BoxNet3D‑OBS, a challenging multi‑robot benchmark with dense obstacles and up to nine robots, show that our approach consistently improves task success over motion‑agnostic and VLA‑based baselines. Our code is available at https://github.com/UCSB‑NLP‑Chang/navigate‑cluster
Authors:Mahnoor Fatima Saad, Sagnik Majumder, Kristen Grauman, Ziad Al-Halah
Abstract:
Rings like gold, thuds like wood! The sound we hear in a scene is shaped not only by the spatial layout of the environment but also by the materials of the objects and surfaces within it. For instance, a room with wooden walls will produce a different acoustic experience from a room with the same spatial layout but concrete walls. Accurately modeling these effects is essential for applications such as virtual reality, robotics, architectural design, and audio engineering. Yet, existing methods for acoustic modeling often entangle spatial and material influences in correlated representations, which limits user control and reduces the realism of the generated acoustics. In this work, we present a novel approach for material‑controlled Room Impulse Response (RIR) generation that explicitly disentangles the effects of spatial and material cues in a scene. Our approach models the RIR using two modules: a spatial module that captures the influence of the spatial layout of the scene, and a material module that modulates this spatial RIR according to a user‑specified material configuration. This explicitly disentangled design allows users to easily modify the material configuration of a scene and observe its impact on acoustics without altering the spatial structure or scene content. Our model provides significant improvements over prior approaches on both acoustic‑based metrics (up to +16% on RTE) and material‑based metrics (up to +70%). Furthermore, through a human perceptual study, we demonstrate the improved realism and material sensitivity of our model compared to the strongest baselines.
Authors:Christo Zietsman
Abstract:
AI governance programmes increasingly rely on natural language prompts to constrain and direct AI agent behaviour. These prompts function as executable specifications: they define the agent's mandate, scope, and quality criteria. Despite this role, no systematic framework exists for evaluating whether a governance prompt is structurally complete. We introduce a five‑principle evaluation framework grounded in computability theory, proof theory, and Bayesian epistemology, and apply it to an empirical corpus of 34 publicly available AGENTS.md governance files sourced from GitHub. Our evaluation reveals that 37% of evaluated file‑model pairs score below the structural completeness threshold, with data classification and assessment rubric criteria most frequently absent. These results suggest that practitioner‑authored governance prompts exhibit consistent structural patterns that automated static analysis could detect and remediate. We discuss implications for requirements engineering practice in AI‑assisted development contexts, identify a previously undocumented artefact classification gap in the AGENTS.md convention, and propose directions for tool support.
Authors:Yuyu Liu, Sarang Rajendra Patil, Mengjia Xu, Tengfei Ma
Abstract:
Electronic health record (EHR) question answering is often handled by LLM‑based pipelines that are costly to deploy and do not explicitly leverage the hierarchical structure of clinical data. Motivated by evidence that medical ontologies and patient trajectories exhibit hyperbolic geometry, we propose HypEHR, a compact Lorentzian model that embeds codes, visits, and questions in hyperbolic space and answers queries via geometry‑consistent cross‑attention with type‑specific pointer heads. HypEHR is pretrained with next‑visit diagnosis prediction and hierarchy‑aware regularization to align representations with the ICD ontology. On two MIMIC‑IV‑based EHR‑QA benchmarks, HypEHR approaches LLM‑based methods while using far fewer parameters. Our code is publicly available at https://github.com/yuyuliu11037/HypEHR.
Authors:Open-H-Embodiment Consortium, :, Nigel Nelson, Juo-Tung Chen, Jesse Haworth, Xinhao Chen, Lukas Zbinden, Dianye Huang, Alaa Eldin Abdelaal, Alberto Arezzo, Ayberk Acar, Farshid Alambeigi, Carlo Alberto Ammirati, Yunke Ao, Pablo David Aranda Rodriguez, Soofiyan Atar, Mattia Ballo, Noah Barnes, Federica Barontini, Filip Binkiewicz, Peter Black, Sebastian Bodenstedt, Leonardo Borgioli, Nikola Budjak, Benjamin Calmé, Fabio Carrillo, Nicola Cavalcanti, Changwei Chen, Haoxin Chen, Sihang Chen, Qihan Chen, Zhongyu Chen, Ziyang Chen, Shing Shin Cheng, Meiqing Cheng, Min Cheng, Zih-Yun Sarah Chiu, Xiangyu Chu, Camilo Correa-Gallego, Giulio Dagnino, Anton Deguet, Jacob Delgado, Jonathan C. DeLong, Kaizhong Deng, Alexander Dimitrakakis, Qingpeng Ding, Hao Ding, Giovanni Distefano, Daniel Donoho, Anqing Duan, Marco Esposito, Shane Farritor, Jad Fayad, Zahi Fayad, Mario Ferradosa, Filippo Filicori, Chelsea Finn, Philipp Fürnstahl, Jiawei Ge, Stamatia Giannarou, Xavier Giralt Ludevid, Frederic Giraud, Aditya Amit Godbole, Ken Goldberg, Antony Goldenberg, Diego Granero Marana, Xiaoqing Guo, Tamás Haidegger, Evan Hailey, Pascal Hansen, Ziyi Hao, Kush Hari, Kengo Hayashi, Jonathon Hawkins, Shelby Haworth, Ortrun Hellig, S. Duke Herrell, Zhouyang Hong, Andrew Howe, Junlei Hu, Zhaoyang Jacopo Hu, Ria Jain, Mohammad Rafiee Javazm, Howard Ji, Rui Ji, Jianmin Ji, Zhongliang Jiang, Dominic Jones, Jeffrey Jopling, Britton Jordan, Ran Ju, Michael Kam, Luoyao Kang, Fausto Kang, Siddhartha Kapuria, Peter Kazanzides, Sonika Kiehler, Ethan Kilmer, Ji Woong Kim, Przemysław Korzeniowski, Chandra Kuchi, Nithesh Kumar, Alan Kuntz, Federico Lavagno, Yu Chung Lee, Hao-Chih Lee, Hang Li, Zhen Li, Xiao Liang, Xinxin Lin, Jinsong Lin, Chang Liu, Fei Liu, Pei Liu, Yun-hui Liu, Wanli Liuchen, Eszter Lukács, Sareena Mann, Miles Mannas, Brett Marinelli, Sabina Martyniak, Francesco Marzola, Lorenzo Mazza, Xueyan Mei, Maria Clara Morais, Luigi Muratore, Chetan Reddy Narayanaswamy, Michał Naskręt, David Navarro-Alarcon, Cyrus Neary, Chi Kit Ng, Christopher Nguan, David Noonan, Ki Hwan Oh, Tom Christian Olesch, Allison M. Okamura, Justin Opfermann, Matteo Pescio, Doan Xuan Viet Pham, Tito Porras, Hongliang Ren, Ariel Rodriguez Jimenez, Ferdinando Rodriguez y Baena, Septimiu E. Salcudean, Asmitha Sathya, Preethi Satish, Lalithkumar Seenivasan, Jiaqi Shao, Yiqing Shen, Yu Sheng, Lucy XiaoYang Shi, Zoe Soulé, Stefanie Speidel, Mingwu Su, Jianhao Su, Idris Sunmola, Kristóf Takács, Yunxi Tang, Patrick Thornycroft, Yu Tian, Jordan Thompson, Mehmet K. Turkcan, Mathias Unberath, Pietro Valdastri, Carlos Vives, Quan Vuong, Martin Wagner, Farong Wang, Wei Wang, Lidian Wang, Chung-Pang Wang, Guankun Wang, Junyi Wang, Erqi Wang, Ziyi Wang, Tanner Watts, Wolfgang Wein, Yimeng Wu, Zijian Wu, Hongjun Wu, Luohong Wu, Jie Ying Wu, Junlin Wu, Victoria Wu, Kaixuan Wu, Mateusz Wójcikowski, Yunye Xiao, Nan Xiao, Wenxuan Xie, Hao Yang, Tianqi Yang, Yinuo Yang, Menglong Ye, Ryan S. Yeung, Nural Yilmaz, Chim Ho Yin, Michael Yip, Rayan Younis, Chenhao Yu, Sayem Nazmuz Zaman, Milos Zefran, Han Zhang, Yuelin Zhang, Yidong Zhang, Yanyong Zhang, Xuyang Zhang, Yameng Zhang, Joyce Zhang, Ning Zhong, Peng Zhou, Haoying Zhou, Xiuli Zuo, Nassir Navab, Mahdi Azizian, Sean D. Huver, Axel Krieger
Abstract:
Autonomous medical robots hold promise to improve patient outcomes, reduce provider workload, democratize access to care, and enable superhuman precision. However, autonomous medical robotics has been limited by a fundamental data problem: existing medical robotic datasets are small, single‑embodiment, and rarely shared openly, restricting the development of foundation models that the field needs to advance. We introduce Open‑H‑Embodiment, the largest open dataset of medical robotic video with synchronized kinematics to date, spanning more than 49 institutions and multiple robotic platforms including the CMR Versius, Intuitive Surgical's da Vinci, da Vinci Research Kit (dVRK), Rob Surgical BiTrack, Virtual Incision's MIRA, Moon Surgical Maestro, and a variety of custom systems, spanning surgical manipulation, robotic ultrasound, and endoscopy procedures. We demonstrate the research enabled by this dataset through two foundation models. GR00T‑H is the first open foundation vision‑language‑action model for medical robotics, which is the only evaluated model to achieve full end‑to‑end task completion on a structured suturing benchmark (25% of trials vs. 0% for all others) and achieves 64% average success across a 29‑step ex vivo suturing sequence. We also train Cosmos‑H‑Surgical‑Simulator, the first action‑conditioned world model to enable multi‑embodiment surgical simulation from a single checkpoint, spanning nine robotic platforms and supporting in silico policy evaluation and synthetic data generation for the medical domain. These results suggest that open, large‑scale medical robot data collection can serve as critical infrastructure for the research community, enabling advances in robot learning, world modeling, and beyond.
Authors:Chao Pan, Yu Wu, Xin Yao
Abstract:
Internal Safety Collapse (ISC) is a failure mode in which frontier LLMs, when executing legitimate professional tasks whose correct completion structurally requires harmful content, spontaneously generate that content with safety failure rates exceeding 95%. Existing input‑level defenses achieve a 100% failure rate against ISC, and standard system prompt defenses provide only partial mitigation. We propose SafeRedirect, a system‑level override that defeats ISC by redirecting the model's task‑completion drive rather than suppressing it. SafeRedirect grants explicit permission to fail the task, prescribes a deterministic hard‑stop output, and instructs the model to preserve harmful placeholders unresolved. Evaluated on seven frontier LLMs across three AI/ML‑related ISC task types in the single‑turn setting, SafeRedirect reduces average unsafe generation rates from 71.2% to 8.0%, compared to 55.0% for the strongest viable baseline. Multi‑model ablation reveals that failure permission and condition specificity are universally critical, while the importance of other components varies across models. Cross‑attack evaluation confirms state‑of‑the‑art defense against ISC with generalization performance at least on par with the baseline on other attack families. Code is available at https://github.com/fzjcdt/SafeRedirect.
Authors:Jiahe Liu, Qinkai Yu, Jingcheng Niu, Xi Zhu, Zirui He, Zhen Xiang, Fan Yang, Jinman Zhao
Abstract:
Despite the success of Retrieval‑Augmented Generation (RAG) in grounding LLMs with external knowledge, its application over heterogeneous sources (e.g., private databases, global corpora, and APIs) remains a significant challenge. Existing approaches typically employ an LLM‑as‑a‑Router to dispatch decomposed sub‑queries to specific sources in a predictive manner. However, this "LLM‑as‑a‑Router" strategy relies heavily on the semantic meaning of different data sources, often leading to routing errors when source boundaries are ambiguous. In this work, we introduce RealRoute System, a framework that shifts the paradigm from predictive routing to a robust Retrieve‑then‑Verify mechanism. RealRoute ensures evidence completeness through parallel, source‑agnostic retrieval, followed by a dynamic verifier that cross‑checks the results and synthesizes a factually grounded answer. Our demonstration allows users to visualize the real‑time "re‑routing" process and inspect the verification chain across multiple knowledge silos. Experiments show that RealRoute significantly outperforms predictive baselines in the multi‑hop Rag reasoning task. The RealRoute system is released as an open‑source toolkit with a user‑friendly web interface. The code is available at the URL: https://github.com/Joseph1951210/RealRoute.
Authors:Xiao Lin, Zhicheng Tang, Weilin Cong, Mengyue Hang, Kai Wang, Yajuan Wang, Zhichen Zeng, Ting-Wei Li, Hyunsik Yoo, Zhining Liu, Xuying Ning, Ruizhong Qiu, Wen-yen Chen, Shuo Chang, Rong Jin, Huayu Li, Hanghang Tong
Abstract:
Sequential recommendation has rapidly advanced in click‑through rate prediction due to its ability to model dynamic user interests. A key challenge, however, lies in modeling long sequences: users often exhibit significant interest shifts, introducing substantial irrelevant or misleading information. Our empirical analysis corroborates this challenge and uncovers a recurring behavioral pattern in long sequences (session hopping): user interests remain stable within short temporal spans (sessions) but shift drastically across sessions and may reappear after multiple sessions. To address this challenge, we propose the Mixture of Sequence (MoS) framework, a model‑agnostic MoE approach that achieves accurate predictions by extracting theme‑specific and multi‑scale subsequences from noisy raw user sequences. First, MoS employs a theme‑aware routing mechanism to adaptively learn the latent themes of user sequences and organizes these sequences into multiple coherent subsequences. Each subsequence contains only sessions aligned with a specific theme, thereby effectively filtering out irrelevant or even misleading information introduced by user interest shifts in session hopping. In addition, to alleviate potential information loss, we introduce a multi‑scale fusion mechanism, which leverages three types of experts to capture global sequence characteristics, short‑term user behaviors, and theme‑specific semantic patterns. Together, these two mechanisms endow MoS with the ability to deliver accurate recommendations from multi‑faceted and multi‑scale perspectives. Experimental results demonstrate that MoS consistently achieves the SOTA performance while introducing fewer FLOPs compared with other MoE counterparts, providing strong evidence of its excellent balance between utility and efficiency. The code is available at https://github.com/xiaolin‑cs/MoS.
Authors:Tingwen Zhang, Ling Yue, Zhen Xu, Shaowu Pan
Abstract:
Recent advances in autonomous ``AI scientist'' systems have demonstrated the ability to automatically write scientific manuscripts and codes with execution. However, producing a publication‑grade scientific diagram (e.g., teaser figure) is still a major bottleneck in the ``end‑to‑end'' paper generation process. For example, a teaser figure acts as a strategic visual interface and serves a different purpose than derivative data plots. It demands conceptual synthesis and planning to translate complex logic workflow into a compelling graphic that guides intuition and sparks curiosity. Existing AI scientist systems usually omit this component or fall back to an inferior alternative. To bridge this gap, we present DiagramBank, a large‑scale dataset consisting of 89,422 schematic diagrams curated from existing top‑tier scientific publications, designed for multimodal retrieval and exemplar‑driven scientific figure generation. DiagramBank is developed through our automated curation pipeline that extracts figures and corresponding in‑text references, and uses a CLIP‑based filter to differentiate schematic diagrams from standard plots or natural images. Each instance is paired with rich context from abstract, caption, to figure‑reference pairs, enabling information retrieval under different query granularities. We release DiagramBank in a ready‑to‑index format and provide a retrieval‑augmented generation codebase to demonstrate exemplar‑conditioned synthesis of teaser figures. DiagramBank is publicly available at https://huggingface.co/datasets/zhangt20/DiagramBank with code at https://github.com/csml‑rpi/DiagramBank.
Authors:Jason Dury
Abstract:
Dense retrieval systems rank passages by embedding similarity to a query, but multi‑hop questions require passages that are associatively related through shared reasoning chains. We introduce Association‑Augmented Retrieval (AAR), a lightweight transductive reranking method that trains a small MLP (4.2M parameters) to learn associative relationships between passages in embedding space using contrastive learning on co‑occurrence annotations. At inference time, AAR reranks an initial dense retrieval candidate set using bi‑directional association scoring. On HotpotQA, AAR improves passage Recall@5 from 0.831 to 0.916 (+8.6 points) without evaluation‑set tuning, with gains concentrated on hard questions where the dense baseline fails (+28.5 points). On MuSiQue, AAR achieves +10.1 points in the transductive setting. An inductive model trained on training‑split associations and evaluated on unseen validation associations shows no significant improvement, suggesting that the method captures corpus‑specific co‑occurrences rather than transferable patterns. Ablation studies support this interpretation: training on semantically similar but non‑associated passage pairs degrades retrieval below the baseline, while shuffling association pairs causes severe degradation. A downstream QA evaluation shows retrieval gains translate to +6.4 exact match improvement. The method adds 3.7ms per query, trains in under two minutes on a single GPU, and requires no LLM‑based indexing.
Authors:Yizhi Zhou, Jia-Qi Yang, De-Chuan Zhan, Da-Wei Zhou
Abstract:
Music Recommendation Systems (MRSs) are a cornerstone of modern streaming platforms. Existing recommendation models, spanning both recall and ranking stages, predominantly rely on collaborative filtering, which fails to exploit the intrinsic characteristics of audio and consequently leads to suboptimal performance, particularly in cold‑start scenarios. However, existing music recommendation datasets often lack rich multimodal information, such as raw audio signals and descriptive textual metadata. Moreover, current recommender system evaluation frameworks remain inadequate, as they neither fully leverage multimodal information nor support a diverse range of algorithms, especially multimodal methods. To address these limitations, we propose TASTE, a comprehensive dataset and benchmarking framework designed to highlight the role of multimodal information in music recommendation. Our dataset integrates both audio and textual modalities. By leveraging recent large‑scale self‑supervised music encoders, we demonstrate the substantial value of the extracted audio representations across recommendation tasks, including candidate recall and CTR. In addition, we introduce the MuQ‑token method, which enables more efficient integration of multi‑layer audio features. This method consistently outperforms other feature integration techniques across various settings. Overall, our results not only validate the effectiveness of content‑driven approaches but also provide a highly effective and reusable multimodal foundation for future research. Code is available at https://github.com/zreach/TASTE
Authors:Zhenyu Yu, Chunlei Meng, Yangchen Zeng, Mohd Yamani Idna Idris, Shuigeng Zhou
Abstract:
Next point‑of‑interest (POI) recommendation requires modeling user mobility as a spatiotemporal sequence, where different behavioral factors may evolve at different temporal and spatial scales. Most existing methods compress a user's history into a single latent representation, which tends to entangle heterogeneous signals such as routine mobility patterns, short‑term intent, and temporal regularities. This entanglement limits the flexibility of state evolution and reduces the model's ability to adapt to diverse decision contexts. We propose ADS‑POI, a spatiotemporal state decomposition framework for next POI recommendation. ADS‑POI represents a user with multiple parallel evolving latent sub‑states, each governed by its own spatiotemporal transition dynamics. These sub‑states are selectively aggregated through a context‑conditioned mechanism to form the decision state used for prediction. This design enables different behavioral components to evolve at different rates while remaining coordinated under the current spatiotemporal context. Extensive experiments on three real‑world benchmark datasets from Foursquare and Gowalla demonstrate that ADS‑POI consistently outperforms strong state‑of‑the‑art baselines under a full‑ranking evaluation protocol. The results show that decomposing user behavior into multiple spatiotemporally aware states leads to more effective and robust next POI recommendation. Our code is available at https://github.com/YuZhenyuLindy/ADS‑POI.git.
Authors:Zhenyu Yu, Chunlei Meng, Yangchen Zeng, Mohd Yamani Idna Idris, Shuigeng Zhou
Abstract:
Next Point‑of‑Interest (POI) recommendation plays a crucial role in location‑based services by predicting users' future mobility patterns. Existing methods typically compute a single user representation from historical trajectories and use it to score all candidate POIs uniformly. However, this candidate‑agnostic paradigm overlooks that the relevance of historical visits inherently depends on which candidate is being evaluated. In this paper, we propose CaST‑POI, a candidate‑conditioned spatiotemporal model for next POI recommendation. Our key insight is that the same user history should be interpreted differently when evaluating different candidate POIs. CaST‑POI employs a candidate‑conditioned sequence reader that uses candidates as queries to dynamically attend to user history. In addition, we introduce candidate‑relative temporal and spatial biases to capture fine‑grained mobility patterns based on the relationships between historical visits and each candidate POI. Extensive experiments on three benchmark datasets demonstrate that CaST‑POI consistently outperforms state‑of‑the‑art methods, yielding substantial improvements across multiple evaluation metrics, with particularly strong advantages under large candidate pools. Code is available at https://github.com/YuZhenyuLindy/CaST‑POI.git.
Authors:Yanning Hou, Duanyang Yuan, Sihang Zhou, Xiaoshu Chen, Ke Liang, Siwei Wang, Xinwang Liu, Jian Huang
Abstract:
Recent GraphRAG methods integrate graph structures into text indexing and retrieval, using knowledge graph triples to connect text chunks, thereby improving retrieval coverage and precision. However, we observe that treating text chunks as the basic unit of knowledge representation rigidly groups multiple atomic facts together, limiting the flexibility and adaptability needed to support diverse retrieval scenarios. Additionally, triple‑based entity linking is sensitive to relation‑extraction errors, which can lead to missing or incorrect reasoning paths and ultimately hurt retrieval accuracy. To address these issues, we propose the Atom‑Entity Graph, a more precise and reliable architecture for knowledge representation and indexing. In our approach, knowledge is stored as knowledge atoms, namely individual, self‑contained units of factual information, rather than coarse‑grained text chunks. This allows knowledge elements to be flexibly reassembled without mutual interference, thereby enabling seamless alignment with diverse query perspectives. Edges between entities simply indicate whether a relationship exists. By combining personalized PageRank with relevance‑based filtering, we maintain accurate entity connections and improve the reliability of reasoning. Theoretical analysis and experiments on five public benchmarks show that the proposed AtomicRAG algorithm outperforms strong RAG baselines in retrieval accuracy and reasoning robustness. Code: https://github.com/7HHHHH/AtomicRAG.
Authors:Sina Gholami, Abdulmoneam Ali, Tania Haghighi, Ahmed Arafa, Minhaj Nur Alam
Abstract:
Federated learning (FL) enables collaborative model training without sharing raw data; however, the presence of noisy labels across distributed clients can severely degrade the learning performance. In this paper, we propose FedSIR, a multi‑stage framework for robust FL under noisy labels. Different from existing approaches that mainly rely on designing noise‑tolerant loss functions or exploiting loss dynamics during training, our method leverages the spectral structure of client feature representations to identify and mitigate label noise. Our framework consists of three key components. First, we identify clean and noisy clients by analyzing the spectral consistency of class‑wise feature subspaces with minimal communication overhead. Second, clean clients provide spectral references that enable noisy clients to relabel potentially corrupted samples using both dominant class directions and residual subspaces. Third, we employ a noise‑aware training strategy that integrates logit‑adjusted loss, knowledge distillation, and distance‑aware aggregation to further stabilize federated optimization. Extensive experiments on standard FL benchmarks demonstrate that FedSIR consistently outperforms state‑of‑the‑art methods for FL with noisy labels. The code is available at https://github.com/sinagh72/FedSIR.
Authors:Dongding Lin, Jian Wang, Yongqi Li, Wenjie Li
Abstract:
Situated conversational recommendation (SCR), which utilizes visual scenes grounded in specific environments and natural language dialogue to deliver contextually appropriate recommendations, has emerged as a promising research direction due to its close alignment with real‑world scenarios. Compared to traditional recommendations, SCR requires a deeper understanding of dynamic and implicit user preferences, as the surrounding scene often influences users' underlying interests, while both may evolve across conversations. This complexity significantly impacts the timing and relevance of recommendations. To address this, we propose situated preference reasoning (SiPeR), a novel framework that integrates two core mechanisms: (1) Scene transition estimation, which estimates whether the current scene satisfies user needs, and guides the user toward a more suitable scene when necessary; and (2) Bayesian inverse inference, which leverages the likelihood of multimodal large language models (MLLMs) to predict user preferences about candidate items within the scene. Extensive experiments on two representative benchmarks demonstrate SiPeR's superiority in both recommendation accuracy and response generation quality. The code and data are available at https://github.com/DongdingLin/SiPeR.
Authors:Mohamed Hesham Elganayni, Runsheng Chen, Sebastian Nagl, Matthias Grabmair
Abstract:
This work explores the role of prompt design and judge selection in LLM‑as‑a‑Judge evaluations of free text legal question answering. We examine whether automatic task prompt optimization improves over human‑centered design, whether optimization effectiveness varies by judge feedback style, and whether optimized prompts transfer across judges. We systematically address these questions on the LEXam benchmark by optimizing task prompts using the ProTeGi method with feedback from two judges (Qwen3‑32B, DeepSeek‑V3) across four task models, and then testing cross‑judge transfer. Automatic optimization consistently outperforms the baseline, with lenient judge feedback yielding higher and more consistent gains than strict judge feedback. Prompts optimized with lenient feedback transfer better to strict judges than the reverse direction. Analysis reveals that lenient judges provide permissive feedback, yielding prompts with broader applicability, whereas strict judges produce restrictive feedback, leading to judge‑specific overfitting. Our findings demonstrate algorithmically optimizing prompts on training data can outperform human‑centered prompt design and that judges' dispositions during optimization shape prompt generalizability. Code and optimized prompts are available at https://github.com/TUMLegalTech/icail2026‑llm‑judge‑gaming.
Authors:Shuai Chen, Chengzhi Zhang
Abstract:
Scientific progress depends on the continual generation of innovative re‑search ideas. However, the rapid growth of scientific literature has greatly increased the cost of knowledge filtering, making it harder for researchers to identify novel directions. Although existing large language model (LLM)‑based methods show promise in research idea generation, the ideas they produce are often repetitive and lack depth. To address this issue, this study proposes a multi‑agent iterative planning search strategy inspired by com‑binatorial innovation theory. The framework combines iterative knowledge search with an LLM‑based multi‑agent system to generate, evaluate, and re‑fine research ideas through repeated interaction, with the goal of improving idea diversity and novelty. Experiments in the natural language processing domain show that the proposed method outperforms state‑of‑the‑art base‑lines in both diversity and novelty. Further comparison with ideas derived from top‑tier machine learning conference papers indicates that the quality of the generated ideas falls between that of accepted and rejected papers. These results suggest that the proposed framework is a promising approach for supporting high‑quality research idea generation. The source code and dataset used in this paper are publicly available on Github repository: https://github.com/ChenShuai00/MAGenIdeas. The demo is available at https://huggingface.co/spaces/cshuai20/MAGenIdeas.
Authors:Neemesh Yadav, Palakorn Achananuparp, Jing Jiang, Ee-Peng Lim
Abstract:
Large Language Models (LLMs) have been shown to possess Theory of Mind (ToM) abilities. However, it remains unclear whether this stems from robust reasoning or spurious correlations. We introduce DialToM, a human‑verified benchmark built from natural human dialogue using a multiple‑choice framework. We evaluate not only mental state prediction (Literal ToM) but also the functional utility of these states (Functional ToM) through Prospective Diagnostic Forecasting ‑‑ probing whether models can identify state‑consistent dialogue trajectories solely from mental‑state profiles. Our results reveal a significant reasoning asymmetry: while LLMs excel at identifying mental states, most (except for Gemini 3 Pro) fail to leverage this understanding to forecast social trajectories. Additionally, we find only weak semantic similarities between human and LLM‑generated inferences. To facilitate reproducibility, the DialToM dataset and evaluation code are publicly available at https://github.com/Stealth‑py/DialToM.
Authors:Gustav Keppler, Ghada Elbez, Veit Hagenmeyer
Abstract:
The rapid evolution and use of Large Language Models (LLMs) in professional workflows require an evaluation of their domain‑specific knowledge against industry standards. We introduceCyberCertBench, a new suite of Multiple Choice Question Answering (MCQA) benchmarks derived from industry recognized certifications. CyberCertBench evaluates LLM domain knowledgeagainst the professional standards of Information Technology cybersecurity and more specializedareas such as Operational Technology and related cybersecurity standards. Concurrently, we propose and validate a novel Proposer‑Verifier framework, a methodology to generate interpretable,natural language explanations for model performance. Our evaluation shows that frontier modelsachieve human expert level in general networking and IT security knowledge. However, theiraccuracy declines in questions that require vendor‑specific nuances or knowledge in formalstandards, like, e.g., IEC 62443. Analysis of model scaling trend and release date demonstratesremarkable gains in parameter efficiency, while recent larger models show diminishing returns.Code and evaluation scripts are available at: https://github.com/GKeppler/CyberCertBench.
Authors:Md Maklachur Rahman, Soon Ki Jung, Tracy Hammond
Abstract:
Recent segmentation models have demonstrated promising efficiency by aggressively reducing parameter counts and computational complexity. However, these models often struggle to accurately delineate fine lesion boundaries and texture patterns essential for early skin cancer diagnosis and treatment planning. In this paper, we propose MambaLiteUNet, a compact yet robust segmentation framework that integrates Mamba state space modeling into a U‑Net architecture, along with three key modules: Adaptive Multi‑Branch Mamba Feature Fusion (AMF), Local‑Global Feature Mixing (LGFM), and Cross‑Gated Attention (CGA). These modules are designed to enhance local‑global feature interaction, preserve spatial details, and improve the quality of skip connections. MambaLiteUNet achieves an average IoU of 87.12% and average Dice score of 93.09% across ISIC2017, ISIC2018, HAM10000, and PH2 benchmarks, outperforming state‑of‑the‑art models. Compared to U‑Net, our model improves average IoU and Dice by 7.72 and 4.61 points, respectively, while reducing parameters by 93.6% and GFLOPs by 97.6%. Additionally, in domain generalization with six unseen lesion categories, MambaLiteUNet achieves 77.61% IoU and 87.23% Dice, performing best among all evaluated models. Our extensive experiments demonstrate that MambaLiteUNet achieves a strong balance between accuracy and efficiency, making it a competitive and practical solution for dermatological image segmentation. Our code is publicly available at: https://github.com/maklachur/MambaLiteUNet.
Authors:Zhenyu Wang, Geyan Ye, Wei Liu, Man Tat Alexander Ng
Abstract:
Virtual cell modeling predicts molecular state changes under genetic perturbations in silico, which is essential for biological mechanism studies. However, existing approaches suffer from unconstrained reasoning, uninterpretable predictions, and retrieval signals that are weakly aligned with regulatory topology. To address these limitations, we propose AROMA, an Augmented Reasoning Over a Multimodal Architecture for virtual cell genetic perturbation modeling. AROMA integrates textual evidence, graph‑topology information, and protein sequence features to model perturbation‑target dependencies, and is trained with a two‑stage optimization strategy to yield predictions that are both accurate and interpretable. We also construct two knowledge graphs and a perturbation reasoning dataset, PerturbReason, containing more than 498k samples, as reusable resources for the virtual cell domain. Experiments show that AROMA outperforms existing methods across multiple cell lines, and remains robust under zero‑shot evaluation on an unseen cell line, as well as in knowledge‑sparse, long‑tail scenarios. Overall, AROMA demonstrates that combining knowledge‑driven multimodal modeling with evidence retrieval provides a promising pathway toward more reliable and interpretable virtual cell perturbation prediction. Model weights are available at https://huggingface.co/blazerye/AROMA. Code is available at https://github.com/blazerye/AROMA.
Authors:Fengxian Dong, Zhi Zheng, Xiao Han, Wei Chen, Jingqing Ruan, Tong Xu, Yong Chen, Enhong Chen
Abstract:
Automated feature generation extracts informative features from raw tabular data without manual intervention and is crucial for accurate, generalizable machine learning. Traditional methods rely on predefined operator libraries and cannot leverage task semantics, limiting their ability to produce diverse, high‑value features for complex tasks. Recent Large Language Model (LLM)‑based approaches introduce richer semantic signals, but still suffer from a restricted feature space due to fixed generation patterns and from the absence of feedback from the learning objective. To address these challenges, we propose a Memory‑Augmented LLM‑based Multi‑Agent System (MALMAS) for automated feature generation. MALMAS decomposes the generation process into agents with distinct responsibilities, and a Router Agent activates an appropriate subset of agents per iteration, further broadening exploration of the feature space. We further integrate a memory module comprising procedural memory, feedback memory, and conceptual memory, enabling iterative refinement that adaptively guides subsequent feature generation and improves feature quality and diversity. Extensive experiments on multiple public datasets against state‑of‑the‑art baselines demonstrate the effectiveness of our approach. The code is available at https://github.com/fxdong24/MALMAS
Authors:Wengyu Zhang, Xiao-Yong Wei, Qing Li
Abstract:
Text‑guided molecular design is a key capability for AI‑driven drug discovery, yet it remains challenging to map sequential natural‑language instructions with non‑linear molecular structures under strict chemical constraints. Most existing approaches, including RAG, CoT prompting, and fine‑tuning or RL, emphasize a small set of ad‑hoc reasoning perspectives implemented in a largely one‑shot generation pipeline. In contrast, real‑world drug discovery relies on dynamic, multi‑perspective critique and iterative refinement to reconcile semantic intent with structural feasibility. Motivated by this, we propose Mol‑Debate, a generation paradigm that enables such dynamic reasoning through an iterative generate‑debate‑refine loop. We further characterize key challenges in this paradigm and address them through perspective‑oriented orchestration, including developer‑debater conflict, global‑local structural reasoning, and static‑dynamic integration. Experiments demonstrate that Mol‑Debate achieves state‑of‑the‑art performance against strong general and chemical baselines, reaching 59.82% exact match on ChEBI‑20 and 50.52% weighted success rate on S^2‑Bench. Our code is available at https://github.com/wyuzh/Mol‑Debate.
Authors:Wenhong Zhu, Ruobing Xie, Rui Wang, Pengfei Liu
Abstract:
Knowledge distillation (KD) is a powerful paradigm for compressing large language models (LLMs), whose effectiveness depends on intertwined choices of divergence direction, optimization strategy, and data regime. We break down the design of existing KD methods and present a unified view that establishes connections between them, reformulating KD as a reweighted log‑likelihood objective at the token level. We further propose Hybrid Policy Distillation (HPD), which integrates the complementary advantages of forward and reverse KL to balance mode coverage and mode‑seeking, and combines off‑policy data with lightweight, approximate on‑policy sampling. We validate HPD on long‑generation math reasoning as well as short‑generation dialogue and code tasks, demonstrating improved optimization stability, computational efficiency, and final performance across diverse model families and scales. The code related to this work is available at https://github.com/zwhong714/Hybrid‑Policy‑Distillation.
Authors:Rongtao Zhang, Xin Zhu, Masoume Pourebadi Khotbehsara, Warren Dao, Erdem Bıyık, Heather Culbertson
Abstract:
Individual differences in vibrotactile perception underscore the growing importance of personalization as haptic feedback becomes more prevalent in interactive systems. We propose Vibrotactile Preference Learning (VPL), a system that captures user‑specific preference spaces over vibrotactile parameters via Gaussian‑process‑based uncertainty‑aware preference learning. VPL uses an expected information gain‑based acquisition strategy to guide query selection over 40 rounds of pairwise comparisons of overall user preference, augmented with user‑reported uncertainty, enabling efficient exploration of the parameter space. We evaluate VPL in a user study (N = 13) using the vibrotactile feedback from a Microsoft Xbox controller, showing that it efficiently learns individualized preferences while maintaining comfortable, low‑workload user interactions. These results highlight the potential of VPL for scalable personalization of vibrotactile experiences.
Authors:Vasundra Srinivasan
Abstract:
Enterprise deployment of long‑horizon decision agents in regulated domains (underwriting, claims adjudication, tax examination) is dominated by retrieval‑augmented pipelines despite a decade of increasingly sophisticated stateful memory architectures. We argue this reflects a hidden requirement: regulated deployment is load‑bearing on four systems properties (deterministic replay, auditable rationale, multi‑tenant isolation, statelessness for horizontal scale), and stateful architectures violate them by construction. We propose Deterministic Projection Memory (DPM): an append‑only event log plus one task‑conditioned projection at decision time. On ten regulated decisioning cases at three memory budgets, DPM matches summarization‑based memory at generous budgets and substantially outperforms it when the budget binds: at a 20x compression ratio, DPM improves factual precision by +0.52 (Cohen's h=1.17, p=0.0014) and reasoning coherence by +0.53 (h=1.13, p=0.0034), paired permutation, n=10. DPM is additionally 7‑15x faster at binding budgets, making one LLM call at decision time instead of N. A determinism study of 10 replays per case at temperature zero shows both architectures inherit residual API‑level nondeterminism, but the asymmetry is structural: DPM exposes one nondeterministic call; summarization exposes N compounding calls. The audit surface follows the same one‑versus‑N pattern: DPM logs two LLM calls per decision while summarization logs 83‑97 on LongHorizon‑Bench. We conclude with TAMS, a practitioner heuristic for architecture selection, and a failure analysis of stateful memory under enterprise operating conditions. The contribution is the argument that statelessness is the load‑bearing property explaining enterprise's preference for weaker but replayable retrieval pipelines, and that DPM demonstrates this property is attainable without the decisioning penalty retrieval pays.
Authors:Weitong Kong, Di Wen, Kunyu Peng, David Schneider, Zeyun Zhong, Alexander Jaus, Zdravko Marinov, Jiale Wei, Ruiping Liu, Junwei Zheng, Yufan Chen, Lei Qi, Rainer Stiefelhagen
Abstract:
Correcting errors in long‑video understanding is disproportionately costly: existing multimodal pipelines produce opaque, end‑to‑end outputs that expose no intermediate state for inspection, forcing annotators to revisit raw video and reconstruct temporal logic from scratch. The core bottleneck is not generation quality alone, but the absence of a supervisory interface through which human effort can be proportional to the scope of each error. We present IMPACT‑CYCLE, a supervisory multi‑agent system that reformulates long‑video understanding as iterative claim‑level maintenance of a shared semantic memory ‑‑ a structured, versioned state encoding typed claims, a claim dependency graph, and a provenance log. Role‑specialized agents operating under explicit authority contracts decompose verification into local object‑relation correctness, cross‑temporal consistency, and global semantic coherence, with corrections confined to structurally dependent claims. When automated evidence is insufficient, the system escalates to human arbitration as the supervisory authority with final override rights; dependency‑closure re‑verification then ensures correction cost remains proportional to error scope. Experiments on VidOR show substantially improved downstream reasoning (VQA: 0.71 to 0.79) and a 4.8x reduction in human arbitration cost, with workload significantly lower than manual annotation. Code will be released at https://github.com/MKong17/IMPACT_CYCLE.
Authors:Natalia Martinez Gil, Fearghal O'Donncha, Wesley M. Gifford, Nianjun Zhou, Dhaval C. Patel, Roman Vaculin
Abstract:
We propose a post‑hoc adaptive conformal anomaly detection method for monitoring time series that leverages predictions from pre‑trained foundation models without requiring additional fine‑tuning. Our method yields an interpretable anomaly score directly interpretable as a false alarm rate (p‑value), facilitating transparent and actionable decision‑making. It employs weighted quantile conformal prediction bounds and adaptively learns optimal weighting parameters from past predictions, enabling calibration under distribution shifts and stable false alarm control, while preserving out‑of‑sample guarantees. As a model‑agnostic solution, it integrates seamlessly with foundation models and supports rapid deployment in resource‑constrained environments. This approach addresses key industrial challenges such as limited data availability, lack of training expertise, and the need for immediate inference, while taking advantage of the growing accessibility of time series foundation models. Experiments on both synthetic and real‑world datasets show that the proposed approach delivers strong performance, combining simplicity, interpretability, robustness, and adaptivity.
Authors:Ziyi Wang, Chen Zhang, Wenjun Peng, Qi Wu, Xinyu Wang
Abstract:
Explainability for Large Language Model (LLM) agents is especially challenging in interactive, partially observable settings, where decisions depend on evolving beliefs and other agents. We present TriEx, a tri‑view explainability framework that instruments sequential decision making with aligned artifacts: (i) structured first‑person self‑reasoning bound to an action, (ii) explicit second‑person belief states about opponents updated over time, and (iii) third‑person oracle audits grounded in environment‑derived reference signals. This design turns explanations from free‑form narratives into evidence‑anchored objects that can be compared and checked across time and perspectives. Using imperfect‑information strategic games as a controlled testbed, we show that TriEx enables scalable analysis of explanation faithfulness, belief dynamics, and evaluator reliability, revealing systematic mismatches between what agents say, what they believe, and what they do. Our results highlight explainability as an interaction‑dependent property and motivate multi‑view, evidence‑grounded evaluation for LLM agents. Code is available at https://github.com/Einsam1819/TriEx.
Authors:Tianrong Chen, Jiatao Gu, David Berthelot, Joshua Susskind, Shuangfei Zhai
Abstract:
Normalizing Flows (NFs) are a classical family of likelihood‑based methods that have received revived attention. Recent efforts such as TARFlow have shown that NFs are capable of achieving promising performance on image modeling tasks, making them viable alternatives to other methods such as diffusion models. In this work, we further advance the state of Normalizing Flow generative models by introducing iterative TARFlow (iTARFlow). Unlike diffusion models, iTARFlow maintains a fully end‑to‑end, likelihood‑based objective during training. During sampling, it performs autoregressive generation followed by an iterative denoising procedure inspired by diffusion‑style methods. Through extensive experiments, we show that iTARFlow achieves competitive performance across ImageNet resolutions of 64, 128, and 256 pixels, demonstrating its potential as a strong generative model and advancing the frontier of Normalizing Flows. In addition, we analyze the characteristic artifacts produced by iTARFlow, offering insights that may shed light on future improvements. Code is available at https://github.com/apple/ml‑itarflow.
Authors:Jason Z Wang
Abstract:
We introduce MIRROR, a benchmark comprising eight experiments across four metacognitive levels that evaluates whether large language models can use self‑knowledge to make better decisions. We evaluate 16 models from 8 labs across approximately 250,000 evaluation instances using five independent behavioral measurement channels. Core experiments are run across the full model roster; experiments with specialized infrastructure requirements report explicitly marked model subsets. We find two phenomena with direct implications for agentic deployment: (1) compositional self‑prediction fails universally ‑‑ the Compositional Calibration Error ranges from 0.500 to 0.943 on the original 15‑model Exp3‑v1 set (and 0.434 to 0.758 on the balanced 16‑model Exp3‑v2 expansion), indicating that models cannot predict their own performance on multi‑domain tasks, and (2) models exhibit above‑chance but imperfect domain‑specific self‑knowledge yet systematically fail to translate even this partial awareness into appropriate agentic action‑selection ‑‑ external metacognitive control reduces the Confident Failure Rate from 0.600 to 0.143 (76% reduction at temperature 0; mean 70% at temperature 0.7 across 5 models from 4 labs). Providing models with their own calibration scores produces no significant improvement (p > 0.05); only architectural constraint is effective. This suggests that external metacognitive scaffolding ‑‑ not improved self‑knowledge ‑‑ is the path to safer autonomous AI systems. Code, data, and Croissant metadata will be released publicly with the benchmark.
Authors:Francisco Angulo de Lafuente, Teerth Sharma, Vladimir Veselov, Seid Mohammed Abdu, Nirmal Tej Kumar, Guillermo Perry
Abstract:
This paper presents OpenCLAW‑P2P v6.0, a comprehensive evolution of the decentralized collective‑intelligence platform in which autonomous AI agents publish, peer‑review, score, and iteratively improve scientific research papers without any human gatekeeper. Building on v5.0 foundations ‑‑ tribunal‑gated publishing, multi‑LLM granular scoring, calibrated deception detection, the Silicon Chess‑Grid FSM, and the AETHER containerized inference engine ‑‑ this release introduces four major new subsystems: (1) a multi‑layer paper persistence architecture with four storage tiers (in‑memory cache, Cloudflare R2, Gun.js, GitHub) ensuring zero paper loss across redeployments; (2) a multi‑layer retrieval cascade with automatic backfill reducing lookup latency from >3s to <50ms; (3) live reference verification querying CrossRef, arXiv, and Semantic Scholar during scoring to detect fabricated citations with >85% accuracy; and (4) a scientific API proxy providing rate‑limited cached access to seven public databases. The platform operates with 14 real autonomous agents producing 50+ scored papers (word counts 2,072‑4,073, leaderboard scores 6.4‑8.1) alongside 23 labeled simulated citizens. We present honest production statistics, failure‑mode analysis, a paper recovery protocol that salvaged 25 lost papers, and lessons learned from operating the system at scale. All pre‑existing subsystems ‑‑ 17‑judge multi‑LLM scoring, 14‑rule calibration with 8 deception detectors, tribunal cognitive examination, Proof of Value consensus, Laws‑of‑Form eigenform verification, and tau‑normalized agent coordination ‑‑ are retained and further hardened. All code is open‑source at https://github.com/Agnuxo1/p2pclaw‑mcp‑server.
Authors:Jinyoung Kim, Hyeongsoo Lim, Eunseo Seo, Minho Jang, Keunwoo Choi, Seungyoun Shin, Ji Won Yoon
Abstract:
Recent advances in large audio language models (LALMs) have enabled multilingual speech understanding. However, benchmarks for evaluating LALMs remain scarce for non‑English languages, with Korean being one such underexplored case. In this paper, we introduce KoALa‑Bench, a comprehensive benchmark for evaluating Korean speech understanding and speech faithfulness of LALMs. In particular, KoALa‑Bench comprises six tasks. Four tasks evaluate fundamental speech understanding capabilities, including automatic speech recognition, speech translation, speech question answering, and speech instruction following, while the remaining two tasks evaluate speech faithfulness, motivated by our observation that several LALMs often fail to fully leverage the speech modality. Furthermore, to reflect Korea‑specific knowledge, our benchmark incorporates listening questions from the Korean college scholastic ability test as well as content covering Korean cultural domains. We conduct extensive experiments across six models, including both white‑box and black‑box ones. Our benchmark, evaluation code, and leaderboard are publicly available at https://ksbench.github.io/Korean‑Benchmark/.
Authors:Mario Tuci, Caner Korkmaz, Umut Şimşekli, Tolga Birdal
Abstract:
Training modern neural networks often relies on large learning rates, operating at the edge of stability, where the optimization dynamics exhibit oscillatory and chaotic behavior. Empirically, this regime often yields improved generalization performance, yet the underlying mechanism remains poorly understood. In this work, we represent stochastic optimizers as random dynamical systems, which often converge to a fractal attractor set (rather than a point) with a smaller intrinsic dimension. Building on this connection and inspired by Lyapunov dimension theory, we introduce a novel notion of dimension, coined the `sharpness dimension', and prove a generalization bound based on this dimension. Our results show that generalization in the chaotic regime depends on the complete Hessian spectrum and the structure of its partial determinants, highlighting a complexity that cannot be captured by the trace or spectral norm considered in prior work. Experiments across various MLPs and transformers validate our theory while also providing new insights into the recently observed phenomenon of grokking.
Authors:Boyu Chen, Yi Chen, Lu Qiu, Jerry Bai, Yuying Ge, Yixiao Ge
Abstract:
Scaling humanoid foundation models is bottlenecked by the scarcity of robotic data. While massive egocentric human data offers a scalable alternative, bridging the cross‑embodiment chasm remains a fundamental challenge due to kinematic mismatches. We introduce UniT (Unified Latent Action Tokenizer via Visual Anchoring), a framework that establishes a unified physical language for human‑to‑humanoid transfer. Grounded in the philosophy that heterogeneous kinematics share universal visual consequences, UniT employs a tri‑branch cross‑reconstruction mechanism: actions predict vision to anchor kinematics to physical outcomes, while vision reconstructs actions to filter out irrelevant visual confounders. Concurrently, a fusion branch synergies these purified modalities into a shared discrete latent space of embodiment‑agnostic physical intents. We validate UniT across two paradigms: 1) Policy Learning (VLA‑UniT): By predicting these unified tokens, it effectively leverages diverse human data to achieve state‑of‑the‑art data efficiency and robust out‑of‑distribution (OOD) generalization on both humanoid simulation benchmark and real‑world deployments, notably demonstrating zero‑shot task transfer. 2) World Modeling (WM‑UniT): By aligning cross‑embodiment dynamics via unified tokens as conditions, it realizes direct human‑to‑humanoid action transfer. This alignment ensures that human data seamlessly translates into enhanced action controllability for humanoid video generation. Ultimately, by inducing a highly aligned cross‑embodiment representation (empirically verified by t‑SNE visualizations revealing the convergence of human and humanoid features into a shared manifold), UniT offers a scalable path to distill vast human knowledge into general‑purpose humanoid capabilities.
Authors:Perry Dong, Alexander Swerdlow, Dorsa Sadigh, Chelsea Finn
Abstract:
Some of the most performant reinforcement learning algorithms today can be prohibitively expensive as they use test‑time scaling methods such as sampling multiple action candidates and selecting the best one. In this work, we propose FASTER, a method for getting the benefits of sampling‑based test‑time scaling of diffusion‑based policies without the computational cost by tracing the performance gain of action samples back to earlier in the denoising process. Our key insight is that we can model the denoising of multiple action candidates and selecting the best one as a Markov Decision Process (MDP) where the goal is to progressively filter action candidates before denoising is complete. With this MDP, we can learn a policy and value function in the denoising space that predicts the downstream value of action candidates in the denoising process and filters them while maximizing returns. The result is a method that is lightweight and can be plugged into existing generative RL algorithms. Across challenging long‑horizon manipulation tasks in online and batch‑online RL, FASTER consistently improves the underlying policies and achieves the best overall performance among the compared methods. Applied to a pretrained VLA, FASTER achieves the same performance while substantially reducing training and inference compute requirements. Code is available at https://github.com/alexanderswerdlow/faster .
Authors:Jean Mercat, Sedrick Keh, Kushal Arora, Isabella Huang, Paarth Shah, Haruki Nishimura, Shun Iwase, Katherine Liu
Abstract:
We present VLA Foundry, an open‑source framework that unifies LLM, VLM, and VLA training in a single codebase. Most open‑source VLA efforts specialize on the action training stage, often stitching together incompatible pretraining pipelines. VLA Foundry instead provides a shared training stack with end‑to‑end control, from language pretraining to action‑expert fine‑tuning. VLA Foundry supports both from‑scratch training and pretrained backbones from Hugging Face. To demonstrate the utility of our framework, we train and release two types of models: the first trained fully from scratch through our LLM‑‑>VLM‑‑>VLA pipeline and the second built on the pretrained Qwen3‑VL backbone. We evaluate closed‑loop policy performance of both models on LBM Eval, an open‑data, open‑source simulator. We also contribute usability improvements to the simulator and the STEP analysis tools for easier public use. In the nominal evaluation setting, our fully‑open from‑scratch model is on par with our prior closed‑source work and substituting in the Qwen3‑VL backbone leads to a strong multi‑task table top manipulation policy outperforming our baseline by a wide margin. The VLA Foundry codebase is available at https://github.com/TRI‑ML/vla_foundry and all multi‑task model weights are released on https://huggingface.co/collections/TRI‑ML/vla‑foundry. Additional qualitative videos are available on the project website https://tri‑ml.github.io/vla_foundry.
Authors:Shuai Wang, Hongyi Zhu, Jia-Hong Huang, Yixian Shen, Chengxi Zeng, Stevan Rudinac, Monika Kackovic, Nachoem Wijnberg, Marcel Worring
Abstract:
Understanding artworks requires multi‑step reasoning over visual content and cultural, historical, and stylistic context. While recent multimodal large language models show promise in artwork explanation, they rely on implicit reasoning and internalized knowl‑ edge, limiting interpretability and explicit evidence grounding. We propose A‑MAR, an Agent‑based Multimodal Art Retrieval framework that explicitly conditions retrieval on structured reasoning plans. Given an artwork and a user query, A‑MAR first decomposes the task into a structured reasoning plan that specifies the goals and evidence requirements for each step. Retrieval is then conditionedon this plan, enabling targeted evidence selection and supporting step‑wise, grounded explanations. To evaluate agent‑based multi‑ modal reasoning within the art domain, we introduce ArtCoT‑QA. This diagnostic benchmark features multi‑step reasoning chains for diverse art‑related queries, enabling a granular analysis that extends beyond simple final answer accuracy. Experiments on SemArt and Artpedia show that A‑MAR consistently outperforms static, non planned retrieval and strong MLLM baselines in final explanation quality, while evaluations on ArtCoT‑QA further demonstrate its advantages in evidence grounding and multi‑step reasoning ability. These results highlight the importance of reasoning‑conditioned retrieval for knowledge‑intensive multimodal understanding and position A‑MAR as a step toward interpretable, goal‑driven AI systems, with particular relevance to cultural industries. The code and data are available at: https://github.com/ShuaiWang97/A‑MAR.
Authors:Yi Zhong, Buqiang Xu, Yijun Wang, Zifei Shan, Shuofei Qiao, Guozhou Zheng, Ningyu Zhang
Abstract:
At present, executable visual workflows have emerged as a mainstream paradigm in real‑world industrial deployments, offering strong reliability and controllability. However, in current practice, such workflows are almost entirely constructed through manual engineering: developers must carefully design workflows, write prompts for each step, and repeatedly revise the logic as requirements evolve‑making development costly, time‑consuming, and error‑prone. To study whether large language models can automate this multi‑round interaction process, we introduce Chat2Workflow, a benchmark for generating executable visual workflows directly from natural language, and propose a robust agentic framework to mitigate recurrent execution errors. Chat2Workflow is built from a large collection of real‑world business workflows, with each instance designed so that the generated workflow can be transformed and directly deployed to practical workflow platforms such as Dify and Coze. Experimental results show that while state‑of‑the‑art language models can often capture high‑level intent, they struggle to generate correct, stable, and executable workflows, especially under complex or changing requirements. Although our agentic framework yields up to 5.34% resolve rate gains, the remaining real‑world gap positions Chat2Workflow as a foundation for advancing industrial‑grade automation. Code is available at https://github.com/zjunlp/Chat2Workflow.
Authors:Josue Torres-Fonseca, Naihao Deng, Yinpei Dai, Shane Storks, Yichi Zhang, Rada Mihalcea, Casey Kennington, Joyce Chai
Abstract:
Multimodal Large Language Models are increasingly adopted as autonomous agents in interactive environments, yet their ability to proactively address safety hazards remains insufficient. We introduce SafetyALFRED, built upon the embodied agent benchmark ALFRED, augmented with six categories of real‑world kitchen hazards. While existing safety evaluations focus on hazard recognition through disembodied question answering (QA) settings, we evaluate eleven state‑of‑the‑art models from the Qwen, Gemma, and Gemini families on not only hazard recognition, but also active risk mitigation through embodied planning. Our experimental results reveal a significant alignment gap: while models can accurately recognize hazards in QA settings, average mitigation success rates for these hazards are low in comparison. Our findings demonstrate that static evaluations through QA are insufficient for physical safety, thus we advocate for a paradigm shift toward benchmarks that prioritize corrective actions in embodied contexts. We open‑source our code and dataset under https://github.com/sled‑group/SafetyALFRED.git
Authors:Yanshuo Wang, Yuan Xu, Xuesong Li, Jie Hong, Yizhou Wang, Chang Wen Chen, Wentao Zhu
Abstract:
Egocentric assistants often rely on first‑person view data to capture user behavior and context for personalized services. Since different users exhibit distinct habits, preferences, and routines, such personalization is essential for truly effective assistance. However, effectively integrating long‑term user data for personalization remains a key challenge. To address this, we introduce EgoSelf, a system that includes a graph‑based interaction memory constructed from past observations and a dedicated learning task for personalization. The memory captures temporal and semantic relationships among interaction events and entities, from which user‑specific profiles are derived. The personalized learning task is formulated as a prediction problem where the model predicts possible future interactions from individual user's historical behavior recorded in the graph. Extensive experiments demonstrate the effectiveness of EgoSelf as a personalized egocentric assistant. Code is available at https://abie‑e.github.io/EgoSelf/.
Authors:Bobo Li, Rui Wu, Zibo Ji, Meishan Zhang, Hao Fei, Min Zhang, Mong-Li Lee, Wynne Hsu
Abstract:
Large Language Model agents have rapidly evolved from static text generators into dynamic systems capable of executing complex autonomous workflows. To enhance reliability, multi‑agent frameworks assigning specialized roles are increasingly adopted to enable self‑reflection and mutual auditing. While such role‑playing effectively leverages domain expert knowledge, we find it simultaneously induces a human‑like cognitive bias known as Actor‑Observer Asymmetry (AOA). Specifically, an agent acting as an actor (during self‑reflection) tends to attribute failures to external factors, whereas an observer (during mutual auditing) attributes the same errors to internal faults. We quantify this using our new Ambiguous Failure Benchmark, which reveals that simply swapping perspectives triggers the AOA effect in over 20% of cases for most models. To tame this bias, we introduce ReTAS (Reasoning via Thesis‑Antithesis‑Synthesis), a model trained through dialectical alignment to enforce perspective‑invariant reasoning. By integrating dialectical chain‑of‑thought with Group Relative Policy Optimization, ReTAS guides agents to synthesize conflicting viewpoints into an objective consensus. Experiments demonstrate that ReTAS effectively mitigates attribution inconsistency and significantly improves fault resolution rates in ambiguous scenarios.
Authors:Zhihong Zhang, Jie Zhao, Xiaojian Huang, Jin Xu, Zhuodong Luo, Xin Liu, Jiansheng Wei, Xuejin Chen
Abstract:
Multimodal reward models (MRMs) play a crucial role in aligning Multimodal Large Language Models (MLLMs) with human preferences. Training a good MRM requires high‑quality multimodal preference data. However, existing preference datasets face three key challenges: lack of granularity in preference strength, textual style bias, and unreliable preference signals. Besides, existing open‑source multimodal preference datasets suffer from substantial noise, yet there is a lack of effective and scalable curation methods to enhance their quality. To address these limitations, we propose DT2IT‑MRM, which integrates a Debiased preference construction pipeline, a novel reformulation of text‑to‑image (T2I) preference data, and an Iterative Training framework that curates existing multimodal preference datasets for Multimodal Reward Modeling. Our experimental results show that DT2IT‑MRM achieves new state‑of‑the‑art overall performance on three major benchmarks: VL‑RewardBench, Multimodal RewardBench, and MM‑RLHF‑RewardBench.
Authors:Beining Wu, Fuyou Mao, Jiong Lin, Cheng Yang, Jiaxuan Lu, Yifu Guo, Siyu Zhang, Yifan Wu, Ying Huang, Fu Li
Abstract:
Generative engines (GEs) are reshaping information access by replacing ranked links with citation‑grounded answers, yet current Generative Engine Optimization (GEO) methods optimize each instance in isolation, unable to accumulate or transfer effective strategies across tasks and engines. We reframe GEO as a strategy learning problem and propose MAGEO, a multi‑agent framework in which coordinated planning, editing, and fidelity‑aware evaluation serve as the execution layer, while validated editing patterns are progressively distilled into reusable, engine‑specific optimization skills. To enable controlled assessment, we introduce a Twin Branch Evaluation Protocol for causal attribution of content edits and DSV‑CF, a dual‑axis metric that unifies semantic visibility with attribution accuracy. We further release MSME‑GEO‑Bench, a multi‑scenario, multi‑engine benchmark grounded in real‑world queries. Experiments on three mainstream engines show that MAGEO substantially outperforms heuristic baselines in both visibility and citation fidelity, with ablations confirming that engine‑specific preference modeling and strategy reuse are central to these gains, suggesting a scalable learning‑driven paradigm for trustworthy GEO. Code is available at https://github.com/Wu‑beining/MAGEO
Authors:Kyuhee Kim, Auguste Poiroux, Antoine Bosselut
Abstract:
Formal verification guarantees proof validity but not formalization faithfulness. For natural‑language logical reasoning, where models construct axiom systems from scratch without library constraints, this gap between valid proofs and faithful translations is especially acute. We investigate whether frontier models exploit this gap when generating Lean 4 proofs, a behavior we term formalization gaming. We evaluate GPT‑5 and DeepSeek‑R1 on 303 first‑order logic problems (203 from FOLIO, 100 from Multi‑LogiEval), comparing unified generation against a two‑stage pipeline that separates formalization from proving. Despite compilation rates of 87‑99%, we find no evidence of systematic gaming in unified generation: models prefer reporting failure over forcing proofs, even under prompting designed to encourage it. However, unfaithfulness that evades our detection signals may still occur. The two‑stage pipeline reveals two distinct modes of unfaithfulness: GPT‑5 fabricates axioms during proof generation, a reactive fallback detectable via cross‑stage comparison, while DeepSeek‑R1 mistranslates premises during formalization, producing internally consistent outputs that evade detection entirely. These findings show that high compilation rates or accuracies should not be equated with faithful reasoning. Code and data are available at https://github.com/koreankiwi99/formalization‑gaming.
Authors:Vasundra Srininvasan
Abstract:
Long‑horizon enterprise agents make high‑stakes decisions (loan underwriting, claims adjudication, clinical review, prior authorization) under lossy memory, multi‑step reasoning, and binding regulatory constraints. Current evaluation reports a single task‑success scalar that conflates distinct failure modes and hides whether an agent is aligned with the standards its deployment environment requires. We propose that long‑horizon decision behavior decomposes into four orthogonal alignment axes, each independently measurable and failable: factual precision (FRP), reasoning coherence (RCS), compliance reconstruction (CRR), and calibrated abstention (CAR). CRR is a novel regulatory‑grounded axis; CAR is a measurement axis separating coverage from accuracy. We exercise the decomposition on a controlled benchmark (LongHorizon‑Bench) covering loan qualification and insurance claims adjudication with deterministic ground‑truth construction. Running six memory architectures, we find structure aggregate accuracy cannot see: retrieval collapses on factual precision; schema‑anchored architectures pay a scaffolding tax; plain summarization under a fact‑preservation prompt is a strong baseline on FRP, RCS, EDA, and CRR; and all six architectures commit on every case, exposing a decisional‑alignment axis the field has not targeted. The decomposition also surfaced a pre‑registered prediction of our own, that summarization would fail factual recall, which the data reversed at large magnitude, an axis‑level reversal aggregate accuracy would have hidden. Institutional alignment (regulatory reconstruction) and decisional alignment (calibrated abstention) are under‑represented in the alignment literature and become load‑bearing once decisions leave the laboratory. The framework transfers to any regulated decisioning domain via two steps: build a fact schema, and calibrate the CRR auditor prompt.
Authors:Rajveer Singh Pall
Abstract:
We introduce IndiaFinBench, to our knowledge the first publicly available evaluation benchmark for assessing large language model (LLM) performance on Indian financial regulatory text. Existing financial NLP benchmarks draw exclusively from Western financial corpora (SEC filings, US earnings reports, and English‑language financial news), leaving a significant gap in coverage of non‑Western regulatory frameworks. IndiaFinBench addresses this gap with 406 expert‑annotated question‑answer pairs drawn from 192 documents sourced from the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI), spanning four task types: regulatory interpretation (174 items), numerical reasoning (92 items), contradiction detection (62 items), and temporal reasoning (78 items). Annotation quality is validated through a model‑based secondary pass (kappa=0.918 on contradiction detection) and a 60‑item human inter‑annotator agreement evaluation (kappa=0.611; 76.7% overall agreement). We evaluate twelve models under zero‑shot conditions, with accuracy ranging from 70.4% (Gemma 4 E4B) to 89.7% (Gemini 2.5 Flash). All models substantially outperform a non‑specialist human baseline of 60.0%. Numerical reasoning is the most discriminative task, with a 35.9 percentage‑point spread across models. Bootstrap significance testing (10,000 resamples) reveals three statistically distinct performance tiers. The dataset, evaluation code, and all model outputs are available at https://github.com/rajveerpall/IndiaFinBench
Authors:Bo-Jyun Wang, Ying-Jia Lin, Hung-Yu Kao
Abstract:
Small language models (SLMs), such as BART, can achieve summarization performance comparable to large language models (LLMs) via distillation. However, existing LLM‑based ranking strategies for summary candidates suffer from instability, while classical metrics (e.g., ROUGE) are insufficient to rank high‑quality summaries. To address these issues, we introduce SCURank, a framework that enhances summarization by leveraging Summary Content Units (SCUs). Instead of relying on unstable comparisons or surface‑level overlap, SCURank evaluates summaries based on the richness and semantic importance of information content. We investigate the effectiveness of SCURank in distilling summaries from multiple diverse LLMs. Experimental results demonstrate that SCURank outperforms traditional metrics and LLM‑based ranking methods across evaluation measures and datasets. Furthermore, our findings show that incorporating diverse LLM summaries enhances model abstractiveness and overall distilled model performance, validating the benefits of information‑centric ranking in multi‑LLM distillation. The code for SCURank is available at https://github.com/IKMLab/SCURank.
Authors:Jinglin Xu, Yi Li, Chuxiong Sun, Xiao Xu, Jiangmeng Li, Fanjiang Xu
Abstract:
Multi‑modal test‑time adaptation (TTA) enhances the resilience of benchmark multi‑modal models against distribution shifts by leveraging the unlabeled target data during inference. Despite the documented success, the advancement of multi‑modal TTA methodologies has been impeded by a persistent limitation, i.e., the lack of explicit modeling of category‑conditional distributions, which is crucial for yielding accurate predictions and reliable decision boundaries. Canonical Gaussian discriminant analysis (GDA) provides a vanilla modeling of category‑conditional distributions and achieves moderate advancement in uni‑modal contexts. However, in multi‑modal TTA scenario, the inherent modality distribution asymmetry undermines the effectiveness of modeling the category‑conditional distribution via the canonical GDA. To this end, we introduce a tailored probabilistic Gaussian model for multi‑modal TTA to explicitly model the category‑conditional distributions, and further propose an adaptive contrastive asymmetry rectification technique to counteract the adverse effects arising from modality asymmetry, thereby deriving calibrated predictions and reliable decision boundaries. Extensive experiments across diverse benchmarks demonstrate that our method achieves state‑of‑the‑art performance under a wide range of distribution shifts. The code is available at https://github.com/XuJinglinn/AdaPGC.
Authors:Abhinav Agarwal
Abstract:
LLM‑assisted defect discovery has a precision crisis: plausible‑but‑wrong reports overwhelm maintainers and degrade credibility for real findings. We present Refute‑or‑Promote, an inference‑time reliability pattern combining Stratified Context Hunting (SCH) for candidate generation, adversarial kill mandates, context asymmetry, and a Cross‑Model Critic (CMC). Adversarial agents attempt to disprove candidates at each promotion gate; cold‑start reviewers are intended to reduce anchoring cascades; cross‑family review can catch correlated blind spots that same‑family review misses. Over a 31‑day campaign across 7 targets (security libraries, the ISO C++ standard, major compilers), the pipeline killed roughly 79% of 171 candidates before advancing to disclosure (retrospective aggregate); on a consolidated‑protocol subset (lcms2, wolfSSL; n=30), the prospective kill rate was 83%. Outcomes: 4 CVEs (3 public, 1 embargoed); LWG 4549 accepted to the C++ working paper; 5 merged C++ editorial PRs; 3 compiler conformance bugs; 8 merged security‑related fixes without CVE; an RFC 9000 errata filed under committee review; and 1+ FIPS 140‑3 normative compliance issues under coordinated disclosure ‑‑ all evaluated by external acceptance, not benchmarks. The most instructive failure: ten dedicated reviewers unanimously endorsed a non‑existent Bleichenbacher padding oracle in OpenSSL's CMS module; it was killed only by a single empirical test, motivating the mandatory empirical gate. No vulnerability was discovered autonomously; the contribution is external structure that filters LLM agents' persistent false positives. As a preliminary transfer test beyond defect discovery, a simplified cross‑family critique variant also solved five previously unsolved SymPy instances on SWE‑bench Verified and one SWE‑rebench hard task.
Authors:Boyan Shi, Wei Chen, Shuyuan Zhao, Junfeng Shen, Shengnan Guo, Shaojiang Wang, Huaiyu Wan
Abstract:
The combination of Mixture‑of‑Experts (MoE) and Low‑Rank Adaptation (LoRA) has shown significant potential for enhancing the multi‑task learning capabilities of Large Language Models. However, existing methods face two primary challenges: (1)Imprecise Routing in the current MoE‑LoRA method fails to explicitly match input semantics with expert capabilities, leading to weak expert specialization. (2)Uniform weight fusion strategies struggle to provide adaptive update strengths, overlooking the varying complexity of different tasks. To address these limitations, we propose SAMoRA (Semantic‑Aware Mixture of LoRA Experts), a novel parameter‑efficient fine‑tuning framework tailored for task‑adaptive learning. Specifically, A Semantic‑Aware Router is proposed to explicitly align textual semantics with the most suitable experts for precise routing. A Task‑Adaptive Scaling mechanism is designed to regulate expert contributions based on specific task requirements dynamically. In addition, a novel regularization objective is proposed to jointly promote expert specialization and effective scaling. Extensive experiments on multiple multi‑task benchmarks demonstrate that SAMoRA significantly outperforms the state‑of‑the‑art methods and holds excellent task generalization capabilities. Code is available at https://github.com/boyan‑code/SAMoRA
Authors:Julian Skifstad, Xinyue Annie Yang, Glen Chou
Abstract:
Inference‑time LLM alignment methods, particularly activation steering, offer an alternative to fine‑tuning by directly modifying activations during generation. Existing methods, however, often rely on non‑anticipative interventions that ignore how perturbations propagate through transformer layers and lack online error feedback, resulting in suboptimal, open‑loop control. To address this, we show empirically that, despite the nonlinear structure of transformer blocks, layer‑wise dynamics across multiple LLM architectures and scales are well‑approximated by locally‑linear models. Exploiting this property, we model LLM inference as a linear time‑varying dynamical system and adapt the classical linear quadratic regulator to compute feedback controllers using layer‑wise Jacobians, steering activations toward desired semantic setpoints in closed‑loop with minimal computational overhead and no offline training. We also derive theoretical bounds on setpoint tracking error, enabling formal guarantees on steering performance. Using a novel adaptive semantic feature setpoint signal, our method yields robust, fine‑grained behavior control across models, scales, and tasks, including state‑of‑the‑art modulation of toxicity, truthfulness, refusal, and arbitrary concepts, surpassing baseline steering methods. Our code is available at: https://github.com/trustworthyrobotics/lqr‑activation‑steering
Authors:Jiagao Hu, Daiguo Zhou, Danzhen Fu, Fuhao Li, Zepeng Wang, Fei Wang, Wenhua Liao, Jiayi Xie, Haiyang Sun
Abstract:
Perception robustness under adverse weather remains a critical challenge for autonomous driving, with the core bottleneck being the scarcity of real‑world video data in adverse weather. Existing weather generation approaches struggle to balance visual quality and annotation reusability. We present AutoAWG, a controllable Adverse Weather video Generation framework for Autonomous driving. Our method employs a semantics‑guided adaptive fusion of multiple controls to balance strong weather stylization with high‑fidelity preservation of safety‑critical targets; leverages a vanishing point‑anchored temporal synthesis strategy to construct training sequences from static images, thereby reducing reliance on synthetic data; and adopts masked training to enhance long‑horizon generation stability. On the nuScenes validation set, AutoAWG significantly outperforms prior state‑of‑the‑art methods: without first‑frame conditioning, FID and FVD are relatively reduced by 50.0% and 16.1%; with first‑frame conditioning, they are further reduced by 8.7% and 7.2%, respectively. Extensive qualitative and quantitative results demonstrate advantages in style fidelity, temporal consistency, and semantic‑‑structural integrity, underscoring the practical value of AutoAWG for improving downstream perception in autonomous driving. Our code is available at: https://github.com/higherhu/AutoAWG
Authors:Yihuai Gao, Jinyun Liu, Shuang Li, Shuran Song
Abstract:
Robotic manipulation tasks exhibit varying memory requirements, ranging from Markovian tasks that require no memory to non‑Markovian tasks that depend on historical information spanning single or multiple interaction trials. Surprisingly, simply extending observation histories of a visuomotor policy often leads to a significant performance drop due to distribution shift and overfitting. To address these issues, we propose Gated Memory Policy (GMP), a visuomotor policy that learns both when to recall memory and what to recall. To learn when to recall memory, GMP employs a learned memory gate mechanism that selectively activates history context only when necessary, improving robustness and reactivity. To learn what to recall efficiently, GMP introduces a lightweight cross‑attention module that constructs effective latent memory representations. To further enhance robustness, GMP injects diffusion noise into historical actions, mitigating sensitivity to noisy or inaccurate histories during both training and inference. On our proposed non‑Markovian benchmark MemMimic, GMP achieves a 30.1% average success rate improvement over long‑history baselines, while maintaining competitive performance on Markovian tasks in RoboMimic. All code, data and in‑the‑wild deployment instructions are available on our project website https://gated‑memory‑policy.github.io/.
Authors:Faisal Alherran
Abstract:
Despite growing interest in Quranic data research, existing Quran datasets remain limited in both scale and diversity. To address this gap, we present Tadabur, a large‑scale Quran audio dataset. Tadabur comprises more than 1400+ hours of recitation audio from over 600 distinct reciters, providing substantial variation in recitation styles, vocal characteristics, and recording conditions. This diversity makes Tadabur a comprehensive and representative resource for Quranic speech research and analysis. By significantly expanding both the total duration and variability of available Quran data, Tadabur aims to support future research and facilitate the development of standardized Quranic speech benchmarks.
Authors:Isaac Llorente-Saguer
Abstract:
Harmful intent is geometrically recoverable from large language model residual streams: as a linear direction in most layers, and as angular deviation in layers where projection methods fail. Across 12 models spanning four architectural families (Qwen2.5, Qwen3.5, Llama‑3.2, Gemma‑3) and three alignment variants (base, instruction‑tuned, abliterated), under single‑turn, English evaluation, we characterise this geometry through six direction‑finding strategies. Three succeed: a soft‑AUC‑optimised linear direction reaches mean AUROC 0.98 and TPR@1%FPR 0.80; a class‑mean probe reaches 0.98 and 0.71 at <1ms fitting cost; a supervised angular‑deviation strategy reaches AUROC 0.96 and TPR of 0.61 along a representationally distinct direction (73^\circ from projection‑based solutions), uniquely sustaining detection in middle layers where projection methods collapse. Detection remains stable across alignment variants, including abliterated models from which refusal has been surgically removed: harmful intent and refusal behaviour are functionally dissociated features of the representation. A direction fitted on AdvBench transfers to held‑out HarmBench and JailbreakBench with worst‑case AUROC 0.96. The same picture holds at scale: across Qwen3.5 from 0.8B to 9B parameters, AUROC remains \geq0.98 and cross‑variant transfer stays within 0.018 of own‑direction performance This is consistent with a simple account: models acquire a linearly decodable representation of harmful intent as part of general language understanding, and alignment then shapes what they do with such inputs without reorganising the upstream recognition signal. As a practical consequence, AUROC in the 0.97+ regime can substantially overestimate operational detectability; TPR@1%FPR should accompany AUROC in safety‑adjacent evaluation.
Authors:Konstantin F. Willeke, Polina Turishcheva, Alex Gilbert, Goirik Chakrabarty, Hasan A. Bedel, Paul G. Fahey, Yongrong Qiu, Marissa A. Weis, Michaela Vystrčilová, Taliah Muhammad, Lydia Ntanavara, Rachel E. Froebe, Kayla Ponder, Zheng Huan Tan, Emin Orhan, Erick Cobos, Sophia Sanborn, Katrin Franke, Fabian H. Sinz, Alexander S. Ecker, Andreas S. Tolias
Abstract:
Scaling data and artificial neural networks has transformed AI, driving breakthroughs in language and vision. Whether similar principles apply to modeling brain activity remains unclear. Here we leveraged a dataset of 3.1 million neurons from the visual cortex of 73 mice across 323 sessions, totaling more than 150 billion neural tokens recorded during natural movies, images and parametric stimuli, and behavior. We train multi‑modal, multi‑task models that support three regimes flexibly at test time: neural prediction, behavioral decoding, neural forecasting, or any combination of the three. OmniMouse achieves state‑of‑the‑art performance, outperforming specialized baselines across nearly all evaluation regimes. We find that performance scales reliably with more data, but gains from increasing model size saturate. This inverts the standard AI scaling story: in language and computer vision, massive datasets make parameter scaling the primary driver of progress, whereas in brain modeling ‑‑ even in the mouse visual cortex, a relatively simple system ‑‑ models remain data‑limited despite vast recordings. The observation of systematic scaling raises the possibility of phase transitions in neural modeling, where larger and richer datasets might unlock qualitatively new capabilities, paralleling the emergent properties seen in large language models. Code available at https://github.com/enigma‑brain/omnimouse.
Authors:Weixi Tong, Yifeng Di, Tianyi Zhang
Abstract:
Existing web agents typically initiate exploration from the root URL, which is inefficient for complex websites with deep hierarchical structures. Without a global view of the website's structure, agents frequently fall into navigation traps, explore irrelevant branches, or fail to reach target information within a limited budget. We propose Mango, a multi‑agent web navigation method that leverages the website structure to dynamically determine optimal starting points. We formulate URL selection as a multi‑armed bandit problem and employ Thompson Sampling to adaptively allocate the navigation budget across candidate URLs. Furthermore, we introduce an episodic memory component to store navigation history, enabling the agent to learn from previous attempts. Experiments on WebVoyager demonstrate that Mango achieves a success rate of 63.6% when using GPT‑5‑mini, outperforming the best baseline by 7.3%. Furthermore, on WebWalkerQA, Mango attains a 52.5% success rate, surpassing the best baseline by 26.8%. We also demonstrate the generalizability of Mango using both open‑source and closed‑source models as backbones. Our data and code are open‑source and available at https://github.com/VichyTong/Mango.
Authors:Xiangyu Wen, Yuang Zhao, Xiaoyu Xu, Lingjun Chen, Changran Xu, Shu Chi, Jianrong Ding, Zeju Li, Haomin Li, Li Jiang, Fangxin Liu, Qiang Xu
Abstract:
The transition of agentic AI from brittle prototypes to production systems is stalled by a pervasive crisis of craft. We suggest that the prevailing orchestration paradigm‑delegating the system control loop to large language models and merely patching with heuristic guardrails‑is the root cause of this fragility. Instead, we propose Arbiter‑K, a Governance‑First execution architecture that reconceptualizes the underlying model as a Probabilistic Processing Unit encapsulated by a deterministic, neuro‑symbolic kernel. Arbiter‑K implements a Semantic Instruction Set Architecture (ISA) to reify probabilistic messages into discrete instructions. This allows the kernel to maintain a Security Context Registry and construct an Instruction Dependency Graph at runtime, enabling active taint propagation based on the data‑flow pedigree of each reasoning node. By leveraging this mechanism, Arbiter‑K precisely interdicts unsafe trajectories at deterministic sinks (e.g., high‑risk tool calls or unauthorized network egress) and enables autonomous execution correction and architectural rollback when security policies are triggered. Evaluations on OpenClaw and NanoBot demonstrate that Arbiter‑K enforces security as a microarchitectural property, achieving 76% to 95% unsafe interception for a 92.79% absolute gain over native policies. The code is publicly available at https://github.com/cure‑lab/ArbiterOS.
Authors:Yunshu Bai, RuiHao Li, Hao Zhang, Chien Her Lim, Ming Yan, Mengtian Li
Abstract:
Game UI implementation requires translating stylized mockups into interactive engine entities. However, current "Screenshot‑to‑Code" tools often struggle with the irregular geometries and deep visual hierarchies typical of game interfaces. To bridge this gap, we introduce SPRITE, a pipeline that transforms static screenshots into editable engine assets. By integrating Vision‑Language Models (VLMs) with a structured YAML intermediate representation, SPRITE explicitly captures complex container relationships and non‑rectangular layouts. We evaluated SPRITE against a curated Game UI benchmark and conducted expert reviews with professional developers to assess reconstruction fidelity and prototyping efficiency. Our findings demonstrate that SPRITE streamlines development by automating tedious coding and resolving complex nesting. By facilitating rapid in‑engine iteration, SPRITE effectively blurs the boundaries between artistic design and technical implementation in game development. Project page: https://baiyunshu.github.io/sprite.github.io/
Authors:Liubomyr Horbatko
Abstract:
Modern sequence modeling is dominated by two families: Transformers, whose self‑attention can access arbitrary elements of the visible sequence, and structured state‑space models, which propagate information through an explicit recurrent state. These mechanisms face different limitations on long contexts: when attention is diffuse, the influence of individual tokens is diluted across the effective support, while recurrent state propagation can lose long‑range sensitivity unless information is actively preserved. As a result, both mechanisms face challenges in preserving and selectively retrieving information over long contexts. We propose Sessa, a decoder that places attention inside a recurrent feedback path. This creates many attention‑based paths through which past tokens can influence future states, rather than relying on a single attention read or a single recurrent chain. We prove that, under explicit assumptions and matched regimes, Sessa admits power‑law memory tails O(\ell^‑β) for 0 < β< 1, with slower decay than in the corresponding Transformer and Mamba‑style baselines. We further give an explicit construction that achieves this power‑law rate. Under the same assumptions, Sessa is the only model class among those considered that realizes flexible selective retrieval, including profiles whose influence does not decay with distance. Consistent with this theoretical advantage, across matched experiments, Sessa achieves the strongest performance on long‑context benchmarks while remaining competitive with Transformer and Mamba‑style baselines on short‑context language modeling.
Authors:Yunke Ao, Le Chen, Bruce D. Lee, Assefa S. Wahd, Aline Czarnobai, Philipp Fürnstahl, Bernhard Schölkopf, Andreas Krause
Abstract:
Proximal Policy Optimization (PPO) has become the predominant algorithm for on‑policy reinforcement learning due to its scalability and empirical robustness across domains. However, there is a significant disconnect between the underlying foundations of trust region methods and the heuristic clipped objective used in PPO. In this paper, we bridge this gap by introducing the Bounded Ratio Reinforcement Learning (BRRL) framework. We formulate a novel regularized and constrained policy optimization problem and derive its analytical optimal solution. We prove that this solution ensures monotonic performance improvement. To handle parameterized policy classes, we develop a policy optimization algorithm called Bounded Policy Optimization (BPO) that minimizes an advantage‑weighted divergence between the policy and the analytic optimal solution from BRRL. We further establish a lower bound on the expected performance of the resulting policy in terms of the BPO loss function. Notably, our framework also provides a new theoretical lens to interpret the success of the PPO loss, and connects trust region policy optimization and the Cross‑Entropy Method (CEM). We additionally extend BPO to Group‑relative BPO (GBPO) for LLM fine‑tuning. Empirical evaluations of BPO across MuJoCo, Atari, and complex IsaacLab environments (e.g., Humanoid locomotion), and of GBPO for LLM fine‑tuning tasks, demonstrate that BPO and GBPO generally match or outperform PPO and GRPO in stability and final performance.
Authors:A. Sophia Koepke, Daniil Zverev, Shiry Ginosar, Alexei A. Efros
Abstract:
The Platonic Representation Hypothesis suggests that neural networks trained on different modalities (e.g., text and images) align and eventually converge toward the same representation of reality. If true, this has significant implications for whether modality choice matters at all. We show that the experimental evidence for this hypothesis is fragile and depends critically on the evaluation regime. Alignment is measured using mutual nearest neighbors on small datasets (\approx1K samples) and degrades substantially as the dataset is scaled to millions of samples. The same behavior is observed beyond text‑image, for text‑audio and text‑video alignment. The alignment that remains between model representations reflects coarse semantic overlap rather than consistent fine‑grained structure. Moreover, the evaluations in Huh et al. are done in a one‑to‑one image‑caption setting, a constraint that breaks down in realistic many‑to‑many settings and further reduces measured alignment. We also find that the reported trend of stronger language models increasingly aligning with vision does not appear to hold for newer models. Overall, our findings suggest that the current evidence for cross‑modal representational convergence is considerably weaker than subsequent works have taken it to be. Models trained on different modalities may learn equally rich representations of the world, just not the same one.
Authors:Xinyu Ma, Mingzhou Xu, Xuebo Liu, Chang Jin, Qiang Wang, Derek F. Wong, Min Zhang
Abstract:
Recent advancements in Reinforcement Learning with Verifiable Rewards (RLVR) have significantly improved Large Language Model (LLM) reasoning, yet models often struggle to explore novel trajectories beyond their initial policy distribution. While offline teacher guidance and entropy‑driven strategies have been proposed to address this, they often lack deep integration or are constrained by the model's inherent capacity. In this paper, we propose OGER (Offline‑Guided Exploration Reward), a novel framework that unifies offline teacher guidance and online reinforcement learning through a specialized reward modeling lens. OGER employs multi‑teacher collaborative training and constructs an auxiliary exploration reward that leverages both offline trajectories and the model's own entropy to incentivize autonomous exploration. Extensive experiments across mathematical and general reasoning benchmarks demonstrate that OGER consistently outperforms competitive baselines, achieving substantial gains in mathematical reasoning while maintaining robust generalization to out‑of‑domain tasks. We provide a comprehensive analysis of training dynamics and conduct detailed ablation studies to validate the effectiveness of our entropy‑aware reward modulation. Our code is available at https://github.com/ecoli‑hit/OGER.git.
Authors:Aniruddha Adiga, Jingyuan Chou, Anshul Chiranth, Bryan Lewis, Ana I. Bento, Shaun Truelove, Geoffrey Fox, Madhav Marathe, Harry Hochheiser, Srini Venkatramanan
Abstract:
Epidemic forecasting has become an integral part of real‑time infectious disease outbreak response. While collaborative ensembles composed of statistical and machine learning models have become the norm for real‑time forecasting, standardized benchmark datasets for evaluating such methods are lacking. Further, there is limited understanding on performance of these methods for novel outbreaks with limited historical data. In this paper, we propose IDOBE, a curated collection of epidemiological time series focused on outbreak forecasting. IDOBE compiles from multiple data repositories spanning over a century of surveillance and across U.S. states and global locations. We perform derivative‑based segmentation to generate over 10,000 outbreaks covering multiple outcomes such as cases and hospitalizations for 13 diseases. We consider a variety of information‑theoretic and distributional measures to quantify the epidemiological diversity of the dataset. Finally, we perform multi‑horizon short‑term forecasting (1‑ to 4‑week‑ahead) through the progression of the outbreak using 11 baseline models and report on their performance. In addition to standard metrics such as NMSE and MAPE for point forecasts, we include probabilistic scoring rules such as Normalized Weighted Interval Score (NWIS) to quantify the performance. We find that MLP‑based methods have the most robust performance, with statistical methods having a slight edge during the pre‑peak phase. IDOBE dataset along with baselines are released publicly on https://github.com/NSSAC/IDOBE to enable standardized, reproducible benchmarking of outbreak forecasting methods.
Authors:Wei Chen, Yubing Wu, Junmei Yang, Delu Zeng, Qibin Zhao, John Paisley, Min Chen, Zhou Wang
Abstract:
Preference optimization is widely used to align large language models (LLMs) with human preferences. However, many margin‑based methods also suppress the chosen response when they try to suppress the rejected one, and there is no general way to prevent this across different objectives. We address this issue with a unified incentive‑score decomposition of preference optimization, revealing that different objectives share the same local update directions and differ only in their scalar weights. This decomposition provides a common framework for analyzing objectives that were previously studied in separate settings. Building on this decomposition, by analyzing the dynamics of the chosen/rejected likelihoods, we identify the disentanglement band (DB), a simple, testable condition that tells us when training can follow the desired path: suppress the loser while preserving the winner, possibly after an early stage. Using the DB, we propose reward calibration (RC), a plug‑and‑play method that adaptively rebalances the updates for chosen and rejected responses to satisfy the DB, without redesigning the base objective. Empirical results show that RC leads to more disentangled dynamics, with better downstream performance observed across several settings. Our code is available at https://github.com/IceyWuu/DisentangledPreferenceOptimization.
Authors:Jiayi Wu, Ruobing Xie, Zeqian Huang, Lei Jiang, Can Xu, Kangyang Luo, Bochen Lin, Ming Gao, Xiang Li
Abstract:
Search agents achieve strong question‑answering performance through multi‑turn interactions with search engines, with Group Relative Policy Optimization (GRPO) being a widely used training algorithm. However, GRPO‑style algorithms still face several challenges in multi‑hop search settings. First, correct intermediate steps are often penalized when the final answer is wrong. Second, training is highly unstable, often causing degradation of natural language ability or even catastrophic training collapse. Our analysis attributes these issues to coarse‑grained advantage assignment and an imbalance between positive and negative advantages. To address these problems, we propose CalibAdv, an advantage calibration method specifically designed for search agents that enables more accurate and more stable modeling of penalties and rewards. Specifically, CalibAdv leverages the correctness of intermediate steps to downscale excessive negative advantages at a fine‑grained level. It then further rebalances positive and negative advantages to improve training stability. Importantly, CalibAdv adopts a lightweight design that calibrates advantages from standard rollout signals, making it simple and easy to deploy. Extensive experiments across three models and seven benchmarks demonstrate that CalibAdv improves both model performance and training stability. Our code is available at https://github.com/wujwyi/CalibAdv.
Authors:Yujie Chen, Tailai Chen, Yifeng Gao, Zoe Wanying He, Yijue Xu, Shaobo Wang, Linfeng Zhang
Abstract:
Prefilling computational costs pose a significant bottleneck for Large Language Models (LLMs) and Large Multimodal Models (LMMs) in long‑context settings. While token pruning reduces sequence length, prior methods rely on heuristics that break compatibility with hardware‑efficient kernels like FlashAttention. In this work, we observe that tokens evolve toward semantic fixing points, making further processing redundant. To this end, we introduce Delta Attention Selective Halting (DASH), a training‑free policy that monitors the layer‑wise update dynamics of the self‑attention mechanism to selectively halt stabilized tokens. Extensive evaluation confirms that DASH generalizes across language and vision benchmarks, delivering significant prefill speedups while preserving model accuracy and hardware efficiency. Code will be released at https://github.com/verach3n/DASH.git.
Authors:Nuo Chen, Yicheng Tong, Yuzhe Yang, Yufei He, Xueyi Zhang, Qingyun Zou, Qian Wang, Bingsheng He
Abstract:
Multi‑agent systems (MAS) are increasingly used for open‑ended idea generation, driven by the expectation that collective interaction will broaden the exploration diversity. However, when and why such collaboration truly expands the solution space remains unclear. We present a systematic empirical study of diversity in MAS‑based ideation across three bottom‑up levels: model intelligence, agent cognition, and system dynamics. At the model level, we identify a compute efficiency paradox, where stronger, highly aligned models yield diminishing marginal diversity despite higher per‑sample quality. At the cognition level, authority‑driven dynamics suppress semantic diversity compared to junior‑dominated groups. At the system level, group‑size scaling yields diminishing returns and dense communication topologies accelerate premature convergence. We characterize these outcomes as collective failures emerging from structural coupling, a process where interaction inadvertently contracts agent exploration and triggers diversity collapse. Our analysis shows that this collapse arises primarily from the interaction structure rather than inherent model insufficiency, highlighting the importance of preserving independence and disagreement when designing MAS for creative tasks. Our code is available at https://github.com/Xtra‑Computing/MAS_Diversity.
Authors:Haokun Lin, Xinle Jia, Haobo Xu, Bingchen Yao, Xianglong Guo, Yichen Wu, Zhichao Lu, Ying Wei, Qingfu Zhang, Zhenan Sun
Abstract:
The MXFP4 microscaling format, which partitions tensors into blocks of 32 elements sharing an E8M0 scaling factor, has emerged as a promising substrate for efficient LLM inference, backed by native hardware support on NVIDIA Blackwell Tensor Cores. However, activation outliers pose a unique challenge under this format: a single outlier inflates the shared block scale, compressing the effective dynamic range of the remaining elements and causing significant quantization error. Existing rotation‑based remedies, including randomized Hadamard and learnable rotations, are data‑agnostic and therefore unable to specifically target the channels where outliers concentrate. We propose DuQuant++, which adapts the outlier‑aware fine‑grained rotation of DuQuant to the MXFP4 format by aligning the rotation block size with the microscaling group size (B=32). Because each MXFP4 group possesses an independent scaling factor, the cross‑block variance issue that necessitates dual rotations and a zigzag permutation in the original DuQuant becomes irrelevant, enabling DuQuant++ to replace the entire pipeline with a single outlier‑aware rotation, which halves the online rotation cost while simultaneously smoothing the weight distribution. Extensive experiments on the LLaMA‑3 family under MXFP4 W4A4 quantization show that DuQuant++ consistently achieves state‑of‑the‑art performance. Our code is available at https://github.com/Hsu1023/DuQuant‑v2.
Authors:Franki Nguimatsia Tiofack, Fabian Schramm, Théotime Le Hellard, Justin Carpentier
Abstract:
Standard approaches to goal‑conditioned reinforcement learning (GCRL) that rely on temporal‑difference learning can be unstable and sample‑inefficient due to bootstrapping. While recent work has explored contrastive and supervised formulations to improve stability, we present a probabilistic alternative, called survival value learning (SVL), that reframes GCRL as a survival learning problem by modeling the time‑to‑goal from each state as a probability distribution. This structured distributional Monte Carlo perspective yields a closed‑form identity that expresses the goal‑conditioned value function as a discounted sum of survival probabilities, enabling value estimation via a hazard model trained via maximum likelihood on both event and right‑censored trajectories. We introduce three practical value estimators, including finite‑horizon truncation and two binned infinite‑horizon approximations to capture long‑horizon objectives. Experiments on offline GCRL benchmarks show that SVL combined with hierarchical actors matches or surpasses strong hierarchical TD and Monte Carlo baselines, excelling on complex, long‑horizon tasks. Webpage and Code: https://simple‑robotics.github.io/publications/survival‑value‑learning/
Authors:Gaozhi Zhou, Hu He, Peng Shen, Jipeng Zhang, Liujue Zhang, Linrui Xu, Zeyuan Wang, Ziyu Li, Xuezhi Cui, Wang Guo, Haifeng Li
Abstract:
Reinforcement learning (RL) post‑training substantially improves remote sensing vision‑language models (RS‑VLMs). However, when handling complex remote sensing imagery (RSI) requiring exhaustive visual scanning, models tend to rely on localized salient cues for rapid inference. We term this RL‑induced bias "perceptual inertia". Driven by reward maximization, models favor quick outcome fitting, leading to two limitations: cognitively, overreliance on specific features impedes complete evidence construction; operationally, models struggle to flexibly shift visual focus across tasks. To address this bias and encourage comprehensive visual evidence mining, we propose RS‑HyRe‑R1, a hybrid reward framework for RSI understanding. It introduces: (1) a spatial reasoning activation reward that enforces structured visual reasoning; (2) a perception correctness reward that provides adaptive quality anchors across RS tasks, ensuring accurate geometric and semantic alignment; and (3) a visual‑semantic path evolution reward that penalizes repetitive reasoning and promotes exploration of complementary cues to build richer evidence chains. Experiments show RS‑HyRe‑R1 effectively mitigates "perceptual inertia", encouraging deeper, more diverse reasoning. With only 3B parameters, it achieves state‑of‑the‑art performance on REC, OVD, and VQA tasks, outperforming models up to 7B parameters. It also demonstrates strong zero‑shot generalization, surpassing the second‑best model by 3.16%, 3.97%, and 2.72% on VQA, OVD, and REC, respectively. Code and datasets are available at https://github.com/geox‑lab/RS‑HyRe‑R1.
Authors:Zheng Nie, Ruolin Shen, Xinlei Yu, Bo Yin, Jiangning Zhang, Xiaobin Hu
Abstract:
Scaling vision‑language models into Visual Multiagent Systems (VMAS) is hindered by two coupled issues. First, communication topologies are fixed before inference, leaving them blind to visual content and query context; second, agent reasoning abilities remain static during deployment. These issues reinforce each other: a rigid topology fails to leverage richer agent expertise, while static agents lack incentives to specialize for a given query. We address this with SkillGraph, a joint framework that evolves both agent expertise and communication topology. Within this framework, a Multimodal Graph Transformer (MMGT) encodes visual tokens, instruction semantics and active skill embeddings to predict a query‑conditioned collaboration graph, replacing hand‑crafted routing with dynamic, content‑aware information flow. Complementing this, a Skill Designer distills and refines reasoning heuristics from failure cases, constructing a self‑evolving multimodal Skill Bank. Crucially, updated skill embeddings are fed back into the MMGT, enabling the topology to adapt alongside capability growth. Experiments show that SkillGraph achieves consistent improvements across four benchmarks, five common MAS structures and four base models. Code is available at https://github.com/niez233/skillgraph.
Authors:Damiano Fornasiere, Mirko Bronzi, Spencer Kitts, Alessandro Palmas, Yoshua Bengio, Oliver Richardson
Abstract:
We provide evidence that language models can detect, localize and, to a certain degree, verbalize the difference between perturbations applied to their activations. More precisely, we either (a) \emphmask activations, simulating \emphdropout, or (b) add \emphGaussian noise to them, at a target sentence. We then ask a multiple‑choice question such as ``\emphWhich of the previous sentences was perturbed?'' or ``\emphWhich of the two perturbations was applied?''. We test models from the Llama, Olmo, and Qwen families, with sizes between 8B and 32B, all of which can easily detect and localize the perturbations, often with perfect accuracy. These models can also learn, when taught in context, to distinguish between dropout and Gaussian noise. Notably, \qwenb's \emphzero‑shot accuracy in identifying which perturbation was applied improves as a function of the perturbation strength and, moreover, decreases if the in‑context labels are flipped, suggesting a prior for the correct ones ‑‑ even modulo controls. Because dropout has been used as a training‑regularization technique, while Gaussian noise is sometimes added during inference, we discuss the possibility of a data‑agnostic ``training awareness'' signal and the implications for AI safety. The code and data are available at \hrefhttps://github.com/saifh‑github/llm‑dropout‑noise‑recognitionlink 1 and \hrefhttps://drive.google.com/file/d/1es‑Sfw_AH9GficeXgeqpy87rocrZZ_PQ/viewlink 2, respectively.
Authors:Zain Naboulsi
Abstract:
AI coding assistants have proliferated rapidly, yet structured pedagogical frameworks for learning these tools remain scarce. Developers face a gap between tool documentation and practical mastery, relying on fragmented resources such as blog posts, video tutorials, and trial‑and‑error. We present cc‑self‑train, a modular interactive curriculum for learning Claude Code, an agentic AI coding tool, through hands‑on project construction. The system introduces five contributions: (1) a persona progression model that adapts instructor tone across four stages (Guide, Collaborator, Peer, Launcher), operationalizing Gradual Release of Responsibility for AI‑mediated instruction; (2) an adaptive learning system that observes engagement quality through hook‑based heuristics and adjusts scaffolding at two timescales, using streak detection for mid‑module intervention and aggregate metrics for module‑boundary persona changes; (3) a cross‑domain unified curriculum in which five distinct project domains share identical feature sequencing, enabling transfer learning; (4) a step‑pacing mechanism with explicit pause primitives to manage information overload in an AI‑as‑instructor context; and (5) an auto‑updating curriculum design in which the onboarding agent detects upstream tool changes and updates teaching materials before instruction begins. A parametrized test suite enforces structural consistency as a proxy for pedagogical invariants across all 50 modules. A pilot evaluation with 27 participants shows statistically significant reported self‑efficacy gains across all 10 assessed skill areas (p < 0.001), with the largest effects on advanced features such as hooks and custom skills. We discuss implications for the design of auto‑updating educational systems.
Authors:Yifan Song, Xingjian Tao, Zhicheng Yang, Yihong Luo, Jing Tang
Abstract:
Graph‑based Retrieval‑Augmented Generation (GraphRAG) enhances LLMs by structuring corpus into graphs to facilitate multi‑hop reasoning. While recent lightweight approaches reduce indexing costs by leveraging Named Entity Recognition (NER), they rely strictly on structural co‑occurrence, failing to capture latent semantic connections between disjoint entities. To address this, we propose EHRAG, a lightweight RAG framework that constructs a hypergraph capturing both structure and semantic level relationships, employing a hybrid structural‑semantic retrieval mechanism. Specifically, EHRAG constructs structural hyperedges based on sentence‑level co‑occurrence with lightweight entity extraction and semantic hyperedges by clustering entity text embeddings, ensuring the hypergraph encompasses both structural and semantic information. For retrieval, EHRAG performs a structure‑semantic hybrid diffusion with topic‑aware scoring and personalized pagerank (PPR) refinement to identify the top‑k relevant documents. Experiments on four datasets show that EHRAG outperforms state‑of‑the‑art baselines while maintaining linear indexing complexity and zero token consumption for construction. Code is available at https://github.com/yfsong00/EHRAG.
Authors:Siqi Lai, Pan Zhang, Yuping Zhou, Jindong Han, Yansong Ning, Hao Liu
Abstract:
Urban traffic control is a system‑level coordination problem spanning heterogeneous subsystems, including traffic signals, freeways, public transit, and taxi services. Existing optimization‑based, reinforcement learning (RL), and emerging LLM‑based approaches are largely designed for isolated tasks, limiting both cross‑task generalization and the ability to capture coupled physical dynamics across subsystems. We argue that effective system‑level control requires a unified physical environment in which subsystems share infrastructure, mobility demand, and spatiotemporal constraints, allowing local interventions to propagate through the network. To this end, we propose TrafficClaw, a framework for general urban traffic control built upon a unified runtime environment. TrafficClaw integrates heterogeneous subsystems into a shared dynamical system, enabling explicit modeling of cross‑subsystem interactions and closed‑loop agent‑environment feedback. Within this environment, we develop an LLM agent with executable spatiotemporal reasoning and reusable procedural memory, supporting unified diagnostics across subsystems and continual strategy refinement. Furthermore, we introduce a multi‑stage training pipeline with supervised initialization and agentic RL with system‑level optimization, further enabling coordinated and system‑aware performance. Experiments demonstrate that TrafficClaw achieves robust, transferable, and system‑aware performance across unseen traffic scenarios, dynamics, and task configurations. Our project is available at https://github.com/usail‑hkust/TrafficClaw.
Authors:Szu-Chi Chen, I-Ning Tsai, Yi-Cheng Lin, Sung-Feng Huang, Hung-yi Lee
Abstract:
Recent Speech‑to‑Speech Translation (S2ST) systems achieve strong semantic accuracy yet consistently strip away non‑verbal vocalizations (NVs), such as laughter and crying that convey pragmatic intent, which severely limits real‑world utility. We address this via three contributions. First, we propose a synthesis pipeline for building scalable expressive datasets to overcome the data scarcity limitation. Second, we propose MoVE, a Mixture‑of‑LoRA‑Experts architecture with expressive‑specialized adapters and a soft‑weighting router that blends experts for capturing hybrid expressive states. Third, we show pretrained AudioLLMs enable striking data efficiency: 30 minutes of curated data is enough for strong performance. On English‑Chinese S2ST, while comparing with strong baselines, MoVE reproduces target NVs in 76% of cases and achieves the highest human‑rated naturalness and emotional fidelity among all compared systems, where existing S2ST systems preserve at most 14% of NVs.
Authors:Keyang Chen, Mingxuan Jiang, Yongsheng Zhao, Zeping Li, Zaiyuan Chen, Weiqi Luo, Zhixin Li, Sen Liu, Yinan Jing, Guangnan Ye, Xihong Wu, Hongfeng Chai
Abstract:
Money laundering poses severe risks to global financial systems, driving the widespread adoption of machine learning for transaction monitoring. However, progress remains stifled by the lack of realistic benchmarks. Existing transaction‑graph datasets suffer from two pervasive limitations: (i) they provide sparse node‑level semantics beyond anonymized identifiers, and (ii) they rely on template‑driven anomaly injection, which biases benchmarks toward static structural motifs and yields overly optimistic assessments of model robustness. We propose TransXion, a benchmark ecosystem for Anti‑Money Laundering (AML) research that integrates profile‑aware simulation of normal activity with stochastic, non‑template synthesis of illicit subgraphs.TransXion jointly models persistent entity profiles and conditional transaction behavior, enabling evaluation of "out‑of‑character" anomalies where observed activity contradicts an entity's socio‑economic context. The resulting dataset comprises approximately 3 million transactions among 50,000 entities, each endowed with rich demographic and behavioral attributes. Empirical analyses show that TransXion reproduces key structural properties of payment networks, including heavy‑tailed activity distributions and localized subgraph structure. Across a diverse array of detection models spanning multiple algorithmic paradigms, TransXion yields substantially lower detection performance than widely used benchmarks, demonstrating increased difficulty and realism. TransXion provides a more faithful testbed for developing context‑aware and robust AML detection methods. The dataset and code are publicly available at https://github.com/chaos‑max/TransXion.
Authors:Xinyu Zhu, Yuzhu Cai, Zexi Liu, Cheng Wang, Fengyang Li, Wenkai Jin, Wanxu Liu, Zehao Bing, Bingyang Zheng, Jingyi Chai, Shuo Tang, Rui Ye, Yuwen Du, Xianghe Pang, Yaxin Du, Tingjia Miao, Yuzhi Zhang, Ruoxue Liao, Zhaohan Ding, Linfeng Zhang, Yanfeng Wang, Weinan E, Siheng Chen
Abstract:
The convergence of large language models and agents is catalyzing a new era of scientific discovery: Agentic Science. While the scientific method is inherently iterative, existing agent frameworks are predominantly static, narrowly scoped, and lack the capacity to learn from trial and error. To bridge this gap, we present EvoMaster, a foundational evolving agent framework engineered specifically for Agentic Science at Scale. Driven by the core principle of continuous self‑evolution, EvoMaster empowers agents to iteratively refine hypotheses, self‑critique, and progressively accumulate knowledge across experimental cycles, faithfully mirroring human scientific inquiry. Crucially, as a domain‑agnostic base harness, EvoMaster is exceptionally easy to scale up ‑‑ enabling developers to build and deploy highly capable, self‑evolving scientific agents for arbitrary disciplines in approximately 100 lines of code. Built upon EvoMaster, we incubated the SciMaster ecosystem across domains such as machine learning, physics, and general science. Evaluations on four authoritative benchmarks (Humanity's Last Exam, MLE‑Bench Lite, BrowseComp, and FrontierScience) demonstrate that EvoMaster achieves state‑of‑the‑art scores of 41.1%, 75.8%, 73.3%, and 53.3%, respectively. It comprehensively outperforms the general‑purpose baseline OpenClaw with relative improvements ranging from +159% to +316%, robustly validating its efficacy and generality as the premier foundational framework for the next generation of autonomous scientific discovery. EvoMaster is available at https://github.com/sjtu‑sai‑agents/EvoMaster.
Authors:Wei Chen, Lili Zhao, Zhi Zheng, HuiJun Hou, Tong Xu
Abstract:
Multi‑hop question answering (MHQA) enables accurate answers to complex queries by retrieving and reasoning over evidence dispersed across multiple documents. Existing MHQA approaches mainly rely on iterative retrieval‑augmented generation, which suffer from the following two major issues. 1) Existing methods prematurely commit to surface‑level entities rather than underlying reasoning structures, making question decomposition highly vulnerable to lexical ambiguity. 2) Existing methods overlook the logical dependencies among reasoning steps, resulting in uncoordinated execution. To address these issues, we propose STRIDE, a framework that separates strategic planning, dynamic control, and grounded execution. At its core, a Meta‑Planner first constructs an entity‑agnostic reasoning skeleton to capture the abstract logic of the query, thereby deferring entity grounding until after the reasoning structure is established, which mitigates disambiguation errors caused by premature lexical commitment. A Supervisor then orchestrates sub‑question execution in a dependency‑aware manner, enabling efficient parallelization where possible and sequential coordination when necessary. By dynamically deciding whether to retrieve new evidence or infer from existing facts, it avoids redundant queries and error propagation, while fusing cross‑branch information and reformulating failed queries to enhance robustness. Grounded fact extraction and logical inference are delegated to specialized execution modules, ensuring faithfulness through explicit separation of retrieval and reasoning. We further propose STRIDE‑FT, a modular fine‑tuning framework that uses self‑generated execution trajectories from STRIDE, requiring neither human annotations nor stronger teacher models. Experiments show that STRIDE achieves robust and accurate reasoning, while STRIDE‑FT effectively enhances open‑source LLMs.
Authors:Kadir-Kaan Özer, René Ebeling, Markus Enzweiler
Abstract:
We introduce JuRe (Just Repair), a minimal denoising network for time series anomaly detection that exposes a central finding: architectural complexity is unnecessary when the training objective correctly implements the manifold‑projection principle. JuRe consists of a single depthwise‑separable convolutional residual block with hidden dimension 128, trained to repair corrupted time series windows and scored at inference by a fixed, parameter‑free structural discrepancy function. Despite using no attention, no latent variable, and no adversarial component, JuRe ranks second on the TSB‑AD multivariate benchmark (AUC‑PR 0.404, 180 series, 17 datasets) and second on the UCR univariate archive by AUC‑PR (0.198, 250 series), leading all neural baselines on AUC‑PR and VUS‑PR. Component ablation on TSB‑AD identifies training‑time corruption as the dominant factor (ΔAUC‑PR = 0.047 on removal), confirming that the denoising objective, not network capacity, drives detection quality. Pairwise Wilcoxon signed‑rank tests establish statistical significance against 21 of 25 baselines on TSB‑AD. Code is available at the URL https://github.com/iis‑esslingen/JuRe.
Authors:Yuncheng Hua, Sion Weatherhead, Mehdi Jafari, Hao Xue, Flora D. Salim
Abstract:
Automated simulator construction requires distributional fidelity, distinguishing it from generic code generation. We identify two failure modes in long‑horizon LLM agents: contextual drift and optimization instability arising from conflating structural and parametric errors. We propose SOCIA‑EVO, a dual‑anchored evolutionary framework. SOCIA‑EVO introduces: (1) a static blueprint to enforce empirical constraints; (2) a bi‑level optimization to decouple structural refinement from parameter calibration; and (3) a self‑curating Strategy Playbook that manages remedial hypotheses via Bayesian‑weighted retrieval. By falsifying ineffective strategies through execution feedback, SOCIA‑EVO achieves robust convergence, generating simulators that are statistically consistent with observational data. The code and data of SOCIA‑EVO are available here: https://github.com/cruiseresearchgroup/SOCIA/tree/evo.
Authors:Yueyang Ding, HaoPeng Zhang, Rui Dai, Yi Wang, Tianyu Zong, Kaikui Liu, Xiangxiang Chu
Abstract:
Comprehensive understanding of time series remains a significant challenge for Large Language Models (LLMs). Current research is hindered by fragmented task definitions and benchmarks with inherent ambiguities, precluding rigorous evaluation and the development of unified Time Series Reasoning Models(TSRMs). To bridge this gap, we formalize Time Series Reasoning (TSR) via a four‑level taxonomy of increasing cognitive complexity. We introduce HiTSR, a hierarchical time series reasoning dataset comprising 83k samples with diverse task combinations and verified Chain‑of‑Thought (CoT) trajectories. Leveraging HiTSR, we propose LLaTiSA, a strong TSRM that integrates visualized patterns with precision‑calibrated numerical tables to enhance the temporal perception of Vision‑Language Models (VLMs). Through a multi‑stage curriculum fine‑tuning strategy, LLaTiSA achieves superior performance and exhibits robust out‑of‑distribution generalization across diverse TSR tasks and real‑world scenarios. Our code is available at https://github.com/RainingNovember/LLaTiSA.
Authors:Yunkai Dang, Yifan Jiang, Yizhu Jiang, Anqi Chen, Wenbin Li, Yang Gao
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in various perception and reasoning tasks. Despite this success, ensuring their reliability in practical deployment necessitates robust confidence estimation. Prior works have predominantly focused on text‑only LLMs, often relying on computationally expensive self‑consistency sampling. In this paper, we extend this to multimodal settings and conduct a comprehensive evaluation of MLLMs' response confidence estimation. Our analysis reveals a significant instinct‑reflection misalignment: the model's implicit token‑level support frequently diverges from its verbal self‑assessment confidence. To address this misalignment, we propose a monotone confidence fusion framework to merge dual‑channel signals and cross‑channel consistency to estimate correctness. Subsequently, an order‑preserving mean alignment step is applied to correct global bias, which improves calibration while preserving the risk‑coverage trade‑off for selective prediction. Experiments on diverse open‑source and closed‑source MLLMs show that our method consistently yields more reliable confidence estimates and improves both calibration and failure prediction. Code will be available at https://github.com/Yunkaidang/Instinct‑vs.‑Reflection.
Authors:Wenwei Xie, Jie Yin, Lu Ma, Xuansong Zhang, Wenjing Zhang
Abstract:
AI‑generated imagery has reached near‑photorealistic fidelity, yet this technology poses significant threats to information security and societal trust. Existing deepfake detection methods often exhibit limited robustness in open‑world scenarios. To address this limitation, this paper investigates intrinsic discrepancies between synthetic and authentic images from a signal‑level perspective. Our analysis reveals that low‑correlation signals serve as distinctive markers for differentiating AI‑generated imagery from real photographs. Building on this insight, we introduce a novel method for quantifying these signals based on fractal theory. By analyzing the fractal characteristics of low‑correlation signals, our method effectively captures the subtle statistical anomalies inherent to the synthesis process. Extensive experimental results demonstrate the method's robustness and superior detection performance. This work emphasizes the need to shift research focus to a new signal‑level direction for deepfake detection. Theoretically, this proposed approach is not limited to face image identification but can be applied to all AI‑generated image detection tasks. This study provides a new research direction for deepfake detection.
Authors:Hanlin Wang, Chak Tou Leong, Jian Wang, Wenjie Li
Abstract:
Recent advancements in large language models (LLMs) have enabled agents to tackle complex embodied tasks through environmental interaction. However, these agents still make suboptimal decisions and perform ineffective actions, as they often overlook critical environmental feedback that differs from their internal beliefs. Through a formal probing analysis, we characterize this as belief inertia, a phenomenon where agents stubbornly adhere to prior beliefs despite explicit observations. To address this, we advocate active belief intervention, moving from passive understanding to active management. We introduce the Estimate‑Verify‑Update (EVU) mechanism, which empowers agents to predict expected outcomes, verify them against observations through explicit reasoning, and actively update prior beliefs based on the verification evidence. EVU is designed as a unified intervention mechanism that generates textual belief states explicitly, and can be integrated into both prompting‑based and training‑based agent reasoning methods. Extensive experiments across three embodied benchmarks demonstrate that EVU consistently yields substantial gains in task success rates. Further analyses validate that our approach effectively mitigates belief inertia, advancing the development of more robust embodied agents. Our code is available at https://github.com/WangHanLinHenry/EVU.
Authors:Priya Gurjar, Md Farhan Ishmam, Kenneth Marino
Abstract:
Despite the rapid progress, LLMs for sequential decision‑making (i.e., LLM agents) still struggle to produce diverse outputs. This leads to insufficient exploration, convergence to sub‑optimal solutions, and becoming stuck in loops. Such limitations can be problematic in environments that require active exploration to gather information and make decisions. Sampling methods such as temperature scaling introduce token‑level randomness but fail to produce enough diversity at the sequence level. We analyze LLM exploration in the classic Multi‑Armed Bandit (MAB) setting and the Text Adventure Learning Environment Suite (TALES). We find that current decoding strategies and prompting methods like Chain‑of‑Thought and Tree‑of‑Thought are insufficient for robust exploration. To address this, we introduce DORA Explorer (Diversity‑Oriented Ranking of Actions), a training‑free framework for improving exploration in LLM agents. DORA generates diverse action candidates, scores them using token log‑probabilities, and selects actions using a tunable exploration parameter. DORA achieves UCB‑competitive performance on MAB and consistent gains across TALES, e.g., improving Qwen2.5‑7B's performance from 29.2% to 45.5% in TextWorld. Our project is available at: https://dora‑explore.github.io/.
Authors:Si Li, Chen-Kai Hu, Zhenhuan Lyu, Yuanqing He
Abstract:
Digital subtraction angiography (DSA) in coronary imaging is fundamentally challenged by physiological motion, forcing reliance on raw angiograms cluttered with anatomical noise. Existing deep learning methods often produced images with two critical clinically unacceptable flaws: persistent boundary artifacts and a loss of native tissue grayscale fidelity that undermined diagnostic confidence. We propose a novel framework termed as CDSA‑Net that for the first time explicitly decouples and jointly optimizes vascular structure preservation and realistic background restoration. CDSA‑Net introduces two core innovations: (i) A hierarchical geometric prior guidance (HGPG) mechanism, embedded in our coronary structure extraction network (CSENet). It synergistically combines integrated geometric prior (IGP) with gated spatial modulation (GSM) and centerline‑aware topology (CAT) loss supervision, ensuring structural continuity. (ii) An adaptive noise module (ANM) within our coronary background restoration network (CBResNet). Unlike standard restoration, ANM uniquely models the stochastic nature of clinical X‑ray noise, bridging the domain gap to enable seamless background intensity estimation and the complete elimination of boundary artifacts. The final subtraction is obtained by removing the restored background from the raw angiogram. Quantitatively, it significantly outperformed state‑of‑the‑art methods in vascular intensity correlation and perceptual quality. A 25.6% improvement in morphology assessment efficiency and a 42.9% gain in hemodynamic evaluation speed set a new benchmark for utility in interventional cardiology, while maintaining diagnostic results consistent with raw angiograms. The project code is available at https://github.com/DrThink‑ai/CDSA‑Net.
Authors:Weibing Zheng, Laurah Turner, Jess Kropczynski, Matthew Kelleher, Murat Ozer, Shane Halse
Abstract:
As Artificial Intelligence (AI) and Agentic AI become increasingly integrated across sectors such as education and healthcare, it is critical to ensure that Multi‑Agent Education System (MAES) is explainable from the early stages of requirements engineering (RE) within the AI software development lifecycle. Explainability is essential to build trust, promote transparency, and enable effective human‑AI collaboration. Although personas are well‑established in human‑computer interaction to represent users and capture their needs and behaviors, their role in RE for explainable MAES remains underexplored. This paper proposes a human‑first, persona‑driven, explainable MAES RE framework and demonstrates the framework through a MAES for clinical reasoning training. The framework integrates personas and user stories throughout the RE process to capture the needs, goals, and interactions of various stakeholders, including medical educators, medical students, AI patient agent, and clinical agents (physical exam agent, diagnostic agent, clinical intervention agent, supervisor agent, evaluation agent). The goals, underlying models, and knowledge base shape agent interactions and inform explainability requirements that guided the clinical reasoning training of medical students. A post‑usage survey found that more than 78% of medical students reported that MAES improved their clinical reasoning skills. These findings demonstrate that RE based on persona effectively connects technical requirements with non‑technical medical students from a human‑centered approach, ensuring that explainable MAES are trustworthy, interpretable, and aligned with authentic clinical scenarios from the early stages of the AI system engineering. The partial MAES for the clinical scenario simulator is~\hrefhttps://github.com/2sigmaEdTech/MAS/open sourced here.
Authors:Sukwon Yun, Jie Peng, Pingzhi Li, Wendong Fan, Jie Chen, James Zou, Guohao Li, Tianlong Chen
Abstract:
With an ever‑growing zoo of LLMs and benchmarks, the need to orchestrate multiple models for improved task performance has never been more pressing. While frameworks like Mixture‑of‑Agents (MoA) attempt to coordinate LLMs, they often fall short in terms of (1) selecting relevant agents, (2) facilitating effective intra‑agent communication, and (3) integrating responses efficiently. In this work, we propose Graph‑of‑Agents (GoA), a new graph‑based framework for modeling multi‑agent LLM communication. Our approach begins with node sampling, selecting only the most relevant agents by leveraging model cards that summarize each model's domain, task specialization, and other characteristics. Next, we construct edges between the selected agents by evaluating their responses against one another to determine relevance ordering. Directed message passing is then performed from highly relevant agents to less relevant ones to enhance their responses, followed by reverse message passing to refine the original responses of the more relevant agents. Finally, the updated responses are aggregated via graph‑based pooling (e.g., max or mean pooling) to produce a single, unified answer. We evaluate GoA on diverse multi‑domain benchmarks (MMLU, MMLU‑Pro, GPQA) and domain‑specific benchmarks (MATH, HumanEval, MedMCQA), with an agent pool of 6 LLMs spanning multiple domains. Surprisingly, GoA achieves superior performance using only 3 selected agents, outperforming recent multi‑agent LLM baselines that utilize all 6 agents simultaneously. By adopting a graph structure, GoA offers both scalability and effectiveness through structured message passing‑positioning it as a strong candidate for navigating the challenges of the ever‑growing LLM zoo. Code is available at: https://github.com/UNITES‑Lab/GoA.
Authors:Antonio De Santis, Tommaso Bonetti, Andrea Tocchetti, Marco Brambilla
Abstract:
The interpretation of implicit meanings is an integral aspect of human communication. However, this framework may not transfer to interactions with Large Language Models (LLMs). To investigate this, we introduce the task of Implicit Information Extraction (IIE) and propose an LLM‑based IIE pipeline that builds a structured knowledge graph from a context sentence by extracting relational triplets, validating implicit inferences, and analyzing temporal relations. We evaluate two LLMs against crowdsourced human judgments on two datasets. We find that humans agree with most model triplets yet consistently propose many additions, indicating limited coverage in current LLM‑based IIE. Moreover, in our experiments, models appear to be more conservative about implicit inferences than humans in socially rich contexts, whereas humans become more conservative in shorter, fact‑oriented contexts. Our code is available at https://github.com/Antonio‑Dee/IIE_from_LLM.
Authors:Shangge Liu, Yuehan Yin, Lei Wang, Qi Fan, Yinghuan Shi, Wenbin Li, Yang Gao, Dacheng Tao
Abstract:
Task arithmetic provides an efficient, training‑free way to edit pre‑trained models, yet lacks a fundamental theoretical explanation for its success. The existing concept of ``weight disentanglement" describes the ideal outcome of non‑interfering task composition but does not reveal its underlying cause. Crucially, what intrinsic properties of the pre‑trained model (θ_0) or the task vectors (τ_t) enable this disentanglement remains underexplored. In this paper, we introduce Task‑Feature Specialization (TFS), a model's ability to allocate distinct internal features to different tasks, as the fundamental principle. We first prove that TFS is a sufficient condition for weight disentanglement. More importantly, we find that TFS also gives rise to an observable geometric consequence: weight vector orthogonality. This positions TFS as the common cause for both the desired functional outcome (disentanglement) and a measurable geometric property (orthogonality). This relationship provides the key insight for our method: since the abstract TFS property is intractable to enforce directly, we can instead promote weight disentanglement by shaping its concrete geometric consequence, orthogonality. Therefore, we propose OrthoReg, a simple and effective regularization method that actively enforces an internal orthogonal structure on weight updates (ΔW) that constitute τ_t during fine‑tuning. And we theoretically prove that OrthoReg promotes disentanglement. Extensive experiments demonstrate that OrthoReg consistently and significantly enhances the performance of various task arithmetic methods. Code is available at \hrefhttps://github.com/RL‑MIND/OrthoReghttps://github.com/RL‑MIND/OrthoReg.
Authors:Kyeong Seon Kim, Baek Seong-Eun, Lee Jung-Mok, Tae-Hyun Oh
Abstract:
Scalable Vector Graphics (SVGs) function both as visual images and as structured code that encode rich geometric and layout information, yet most methods rasterize them and discard this symbolic organization. At the same time, recent sentence embedding methods produce strong text representations but do not naturally extend to visual or structured modalities. We propose a training‑free, instruction‑guided multimodal embedding framework that uses a Multimodal Large Language Model (MLLM) to map text, raster images, and SVG code into an aligned embedding space. We control the direction of embeddings through modality‑specific instructions and structural SVG cues, eliminating the need for learned projection heads or contrastive training. Our method has two key components: (1) Multimodal Explicit One‑word Limitation (mEOL), which instructs the MLLM to summarize any multimodal input into a single token whose hidden state serves as a compact semantic embedding. (2) A semantic SVG rewriting module that assigns meaningful identifiers and simplifies nested SVG elements through visual reasoning over the rendered image, exposing geometric and relational cues hidden in raw code. Using a repurposed VGBench, we build the first text‑to‑SVG retrieval benchmark and show that our training‑free embeddings outperform encoder‑based and training‑based multimodal baselines. These results highlight prompt‑level control as an effective alternative to parameter‑level training for structure‑aware multimodal retrieval. Project page: https://scene‑the‑ella.github.io/meol/
Authors:Tianbao Zhang
Abstract:
Large Language Models (LLMs) produce a controllability gap in safety‑critical engineering: even low rates of undetected constraint violations render a system undeployable. Current orchestration paradigms suffer from sycophantic compliance, context attention decay [Liu et al., 2024], and stochastic oscillation during self‑correction [Huang et al., 2024]. We introduce the Convergent AI Agent Framework (CAAF), which transitions agentic workflows from open‑loop generation to closed‑loop Fail‑Safe Determinism via three pillars: (1) Recursive Atomic Decomposition with physical context firewalls; (2) Harness as an Asset, formalizing domain invariants into machine‑readable registries enforced by a deterministic Unified Assertion Interface (UAI); and (3) Structured Semantic Gradients with State Locking for monotonic convergence. Empirical evaluation across two domains ‑‑ SAE Level 3 (L3) autonomous driving (AD) (n=30, 7 conditions) and pharmaceutical continuous flow reactor design (n=20, 4 conditions including a Mono+UAI ablation) ‑‑ shows that CAAF‑all‑GPT‑4o‑mini achieves 100% paradox detection while monolithic GPT‑4o achieves 0% (even at temperature=0). The pharmaceutical benchmark features 7 simultaneous constraints with nonlinear Arrhenius interactions and a 3‑way minimal unsatisfiable subset, representing a structurally harder challenge than the 2‑constraint AD paradox. Alternative multi‑agent architectures (debate, sequential checking) also achieve 0% across 80 trials, confirming that CAAF's reliability derives from its deterministic UAI, not from multi‑agent orchestration per se. A Mono+UAI ablation (95%) isolates UAI as the core contribution. CAAF's reliability is invariant to prompt hints; all components use a single commodity model, enabling fully offline deployment.
Authors:Linyue Zhang, Wenyi Zeng, Zicheng Pan, Yongsheng Gao, Changming Sun, Jun Hu, Lixian Liu, Weichuan Zhang, Tuo Wang
Abstract:
Feature reconstruction techniques are widely applied for few‑shot fine‑grained image classification (FSFGIC). Our research indicates that one of the main challenges facing existing feature‑based FSFGIC methods is how to choose the size of the receptive field to extract feature descriptors (including spatial and frequency feature descriptors) from different category input images, thereby better performing the FSFGIC tasks. To address this, an adaptive receptive field‑based spatial‑frequency feature reconstruction network (ARF‑SFR‑Net) is proposed. The designed ARF‑SFR‑Net has the capability to adaptively determine receptive field sizes for obtaining spatial and frequency features, and effectively fuse them for reconstruction and FSFGIC tasks. The designed ARF‑SFR‑Net can be easily embedded into a given episodic training mechanism for end‑to‑end training from scratch. Extensive experiments on multiple FSFGIC benchmarks demonstrate the effectiveness and superiority of the proposed ARF‑SFR‑Net over state‑of‑the‑art approaches. The code is available at: https://github.com/ICL‑SUST/ARF‑SFR‑Net.git.
Authors:Hao Wang, Jindong Han, Wei Fan, Hao Liu
Abstract:
Climate research is pivotal for mitigating global environmental crises, yet the accelerating volume of multi‑scale datasets and the complexity of analytical tools have created significant bottlenecks, constraining scientific discovery to fragmented and labor‑intensive workflows. While the emergence Large Language Models (LLMs) offers a transformative paradigm to scale scientific expertise, existing explorations remain largely confined to simple Question‑Answering (Q&A) tasks. These approaches often oversimplify real‑world challenges, neglecting the intricate physical constraints and the data‑driven nature required in professional climate science.To bridge this gap, we introduce ClimAgent, a general‑purpose autonomous framework designed to execute a wide spectrum of research tasks across diverse climate sub‑fields. By integrating a unified tool‑use environment with rigorous reasoning protocols, ClimAgent transcends simple retrieval to perform end‑to‑end modeling and analysis.To foster systematic evaluation, we propose ClimaBench, the first comprehensive benchmark for real‑world climate discovery. It encompasses challenging problems spanning 5 distinct task categories derived from professional scenarios between 2000 and 2025. Experiments on ClimaBench demonstrate that ClimAgent significantly outperforms state‑of‑the‑art baselines, achieving a 40.21% improvement over original LLM solutions in solution rigorousness and practicality. Our code are available at https://github.com/usail‑hkust/ClimAgent.
Authors:Yingzhi Xia, Setthakorn Tanomkiattikun, Liangli Zhen, Zaiwang Gu
Abstract:
Diffusion models (DMs) have recently shown remarkable performance on inverse problems (IPs). Optimization‑based methods can fast solve IPs using DMs as powerful regularizers, but they are susceptible to local minima and noise overfitting. Although DMs can provide strong priors for Bayesian approaches, enforcing measurement consistency during the denoising process leads to manifold infeasibility issues. We propose Noise‑space Hamiltonian Monte Carlo (N‑HMC), a posterior sampling method that treats reverse diffusion as a deterministic mapping from initial noise to clean images. N‑HMC enables comprehensive exploration of the solution space, avoiding local optima. By moving inference entirely into the initial‑noise space, N‑HMC keeps proposals on the learned data manifold. We provide a comprehensive theoretical analysis of our approach and extend the framework to a noise‑adaptive variant (NA‑NHMC) that effectively handles IPs with unknown noise type and level. Extensive experiments across four linear and three nonlinear inverse problems demonstrate that NA‑NHMC achieves superior reconstruction quality with robust performance across different hyperparameters and initializations, significantly outperforming recent state‑of‑the‑art methods. The code is available at https://github.com/NA‑HMC/NA‑HMC.
Authors:Syed Muhammad Aqdas Rizvi
Abstract:
Decentralized Autonomous Organizations (DAOs) are inclined explore Small Language Models (SLMs) as edge‑native constitutional firewalls to vet proposals and mitigate semantic social engineering. While scaling inference‑time compute (System 2) enhances formal logic, its efficacy in highly adversarial, cryptoeconomic governance environments remains underexplored. To address this, we introduce Sentinel‑Bench, an 840‑inference empirical framework executing a strict intra‑model ablation on Qwen‑3.5‑9B. By toggling latent reasoning across frozen weights, we isolate the impact of inference‑time compute against an adversarial Optimism DAO dataset. Our findings reveal a severe compute‑accuracy inversion. The autoregressive baseline (System 1) achieved 100% adversarial robustness, 100% juridical consistency, and state finality in under 13 seconds. Conversely, System 2 reasoning introduced catastrophic instability, fundamentally driven by a 26.7% Reasoning Non‑Convergence (cognitive collapse) rate. This collapse degraded trial‑to‑trial consensus stability to 72.6% and imposed a 17x latency overhead, introducing critical vulnerabilities to Governance Extractable Value (GEV) and hardware centralization. While rare (1.5% of adversarial trials), we empirically captured "Reasoning‑Induced Sycophancy," where the model generated significantly longer internal monologues (averaging 25,750 characters) to rationalize failing the adversarial trap. We conclude that for edge‑native SLMs operating under Byzantine Fault Tolerance (BFT) constraints, System 1 parameterized intuition is structurally and economically superior to System 2 iterative deliberation for decentralized consensus. Code and Dataset: https://github.com/smarizvi110/sentinel‑bench
Authors:Xinru Yan, Boxi Cao, Yaojie Lu, Hongyu Lin, Weixiang Zhou, Le Sun, Xianpei Han
Abstract:
Native Omni‑modal Large Language Models (OLLMs) have shifted from pipeline architectures to unified representation spaces. However, this native integration gives rise to a critical yet underexplored phenomenon: modality preference. To bridge this gap, we first systematically quantify modality preference of OLLMs using a newly‑curated conflict‑based benchmark and the modality selection rate metric. Our evaluation of ten representative OLLMs reveals a notable paradigm shift: unlike the ``text‑dominance'' of traditional VLMs, most OLLMs exhibit a pronounced visual preference. To further understand the underlying mechanism, we conduct layer‑wise probing and demonstrate that such modality preference is not static but emerges progressively in the mid‑to‑late layers. Building upon these insights, we leverage these internal signals to diagnose cross‑modal hallucinations, achieving competitive performance across three downstream multi‑modal benchmarks without task‑specific data. Our work provides both a mechanistic understanding and a practical tool for building more trustworthy OLLMs. Our code and related resources are publicly available at: https://github.com/icip‑cas/OmniPreference
Authors:Dongyi He, Yuanquan Gao, Bin Jiang, He Yan
Abstract:
Accurate traffic forecasting is crucial for intelligent transportation systems, supporting effective traffic management, congestion reduction, and informed urban planning. However, traditional models often fail to adequately capture the intricate spatio‑temporal dependencies present in traffic data. To overcome these limitations, we introduce GAMMA‑Net, a novel approach that integrates Graph Attention Networks (GAT) with multi‑axis Selective State Space Models (Mamba). The GAT component uses a self‑attention mechanism to dynamically adjust the influence of nodes within the traffic network, enabling adaptive spatial dependency modeling based on real‑time conditions. Simultaneously, the Mamba module efficiently models long‑term temporal and spatial dynamics without the heavy computational cost of conventional recurrent architectures. Extensive experiments on several benchmark traffic datasets, including METR‑LA, PEMS‑BAY, PEMS03, PEMS04, PEMS07, and PEMS08, show that GAMMA‑Net consistently outperforms existing state‑of‑the‑art models across different prediction horizons, achieving up to a 16.25% reduction in Mean Absolute Error (MAE) compared to baseline models. Ablation studies highlight the critical contributions of both the spatial and temporal components, emphasizing their complementary role in improving prediction accuracy. In conclusion, the GAMMA‑Net model sets a new standard in traffic forecasting, offering a powerful tool for next‑generation traffic management and urban planning. The code for this study is available at https://github.com/hdy6438/GAMMA‑Net
Authors:Zahid Hasan, Masud Ahmed, Nirmalya Roy
Abstract:
Semantic segmentation in hyperbolic space enables compact modeling of hierarchical structure while providing inherent uncertainty quantification. Prior approaches predominantly rely on the Poincaré ball model, which suffers from numerical instability, optimization, and computational challenges. We propose a novel, tractable, architecture‑agnostic semantic segmentation framework (pixel‑wise and mask classification) in the hyperbolic Lorentz model. We employ text embeddings with semantic and visual cues to guide hierarchical pixel‑level representations in Lorentz space. This enables stable and efficient optimization without requiring a Riemannian optimizer, and easily integrates with existing Euclidean architectures. Beyond segmentation, our approach yields free uncertainty estimation, confidence map, boundary delineation, hierarchical and text‑based retrieval, and zero‑shot performance, reaching generalized flatter minima. We introduce a novel uncertainty and confidence indicator in Lorentz cone embeddings. Further, we provide analytical and empirical insights into Lorentz optimization via gradient analysis. Extensive experiments on ADE20K, COCO‑Stuff‑164k, Pascal‑VOC, and Cityscapes, utilizing state‑of‑the‑art per‑pixel classification models (DeepLabV3 and SegFormer) and mask classification models (mask2former and maskformer), validate the effectiveness and generality of our approach. Our results demonstrate the potential of hyperbolic Lorentz embeddings for robust and uncertainty‑aware semantic segmentation. Code is available at https://github.com/mxahan/Lorentz_semantic_segmentation.
Authors:Jiaxin Zhang, Xiangyu Peng, Qinglin Chen, Qinyuan Ye, Caiming Xiong, Chien-Sheng Wu
Abstract:
On‑policy distillation (OPD) is an increasingly important paradigm for post‑training language models. However, we identify a pervasive Scaling Law of Miscalibration: while OPD effectively improves task accuracy, it systematically traps models in severe overconfidence. We trace this failure to an information mismatch: teacher supervision is formed under privileged context available during training, whereas the deployed model must report confidence using only deployment‑time information. We formalize this perspective theoretically, showing that teacher‑conditioned success is generally not a valid target for deployment‑time confidence and that helpful privileged context induces entropy collapse and a systematic optimism bias. To address this, we propose a calibration‑aware OPD framework, CaOPD, that estimates empirical confidence from model rollouts, replaces self‑reported confidence with this student‑grounded target, and distills the revised response through the same self‑distillation pipeline. Experiments across various models and domains show that CaOPD achieves Pareto‑optimal calibration while maintaining competitive capability, generalizing robustly under out‑of‑distribution and continual learning. Our findings highlight that capability distillation does not imply calibrated confidence, and that confidence should be treated as an essential objective in post‑training. Code: https://github.com/SalesforceAIResearch/CaOPD
Authors:Chongsheng Zhang, Hao Wang, Zelong Yu, Esteban Garces Arias, Julian Rodemann, Zhanshuo Zhang, Qilong Li, Gaojuan Fan, Krikamol Muandet, Christian Heumann
Abstract:
Imbalanced data is commonly present in real‑world applications. While data synthesis can effectively mitigate the data scarcity problem of rare‑classes, and LLMs have revolutionized text generation, the application of LLMs to relational/structured tabular data synthesis remains underexplored. Moreover, existing approaches lack an effective feedback mechanism that can guide LLMs towards continuously optimizing the quality of the generated data throughout the synthesis process. In this work, we propose RDDG, Relational Data generator with Dynamic Guidance, which is a unified in‑context learning framework that employs progressive chain‑of‑thought (CoT) steps to generate tabular data for enhancing downstream imbalanced classification performance. RDDG first uses core set selection to identify representative samples from the original data, then utilizes in‑context learning to discover the inherent patterns and correlations among attributes within the core set, and subsequently generates tabular data while preserving the aforementioned constraints. More importantly, it incorporates a self‑reinforcing feedback mechanism that provides automatic assessments on the quality of the generated data, enabling continuous quality optimization throughout the generation process. Experimental results on multiple real and synthetic datasets demonstrate that RDDG outperforms existing approaches in both data fidelity and downstream imbalanced classification performance. We make our code available at https://github.com/cszhangLMU/RDDG.
Authors:Jiahao Li, Jiayi Dong, Peng Ye, Xiaochi Zhou, Haohai Lu, Fei Wang
Abstract:
Modeling single‑cell gene expression across diverse biological and technical conditions is crucial for characterizing cellular states and simulating unseen scenarios. Existing methods often treat genes as independent tokens, overlooking their high‑level biological relationships and leading to poor performance. We introduce SAVE, a unified generative framework based on conditional Transformers for multi‑condition single‑cell modeling. SAVE leverages a coarse‑grained representation by grouping semantically related genes into blocks, capturing higher‑order dependencies among gene modules. A Flow Matching mechanism and condition‑masking strategy further enhance flexible simulation and enable generalization to unseen condition combinations. We evaluate SAVE on a range of benchmarks, including conditional generation, batch effect correction, and perturbation prediction. SAVE consistently outperforms state‑of‑the‑art methods in generation fidelity and extrapolative generalization, especially in low‑resource or combinatorially held‑out settings. Overall, SAVE offers a scalable and generalizable solution for modeling complex single‑cell data, with broad utility in virtual cell synthesis and biological interpretation. Our code is publicly available at https://github.com/fdu‑wangfeilab/sc‑save
Authors:Jianyou Wang, Youze Zheng, Longtian Bao, Hanyuan Zhang, Qirui Zheng, Yuhan Chen, Yang Zhang, Matthew Feng, Maxim Khan, Aditya K. Sehgal, Christopher D. Rosin, Ramamohan Paturi, Umber Dube, Leon Bergen
Abstract:
Scientists have long sought to accurately predict outcomes of real‑world events before they happen. Can AI systems do so more reliably? We study this question through clinical trial outcome prediction, a high‑stakes open challenge even for domain experts. We introduce CT Open, an open‑access, live platform that will run four challenge every year. Anyone can submit predictions for each challenge. CT Open evaluates those submissions on trials whose outcomes were not yet public at the time of submission but were made public afterwards. Determining if a trial's outcome is public on the internet before a certain date is surprisingly difficult. Outcomes posted on official registries may lag behind by years, while the first mention may appear in obscure articles. To address this, we propose a novel, fully automated decontamination pipeline that uses iterative LLM‑powered web search to identify the earliest mention of trial outcomes. We validate the pipeline's quality and accuracy by human expert's annotations. Since CT Open's pipeline ensures that every evaluated trial had no publicly reported outcome when the prediction was made, it allows participants to use any methodology and any data source. In this paper, we release a training set and two time‑stamped test benchmarks, Winter 2025 and Summer 2025. We believe CT Open can serve as a central hub for advancing AI research on forecasting real‑world outcomes before they occur, while also informing biomedical research and improving clinical trial design. CT Open Platform is hosted at \hrefhttps://ct‑open.net/https://ct‑open.net/
Authors:Bhaskar Gurram
Abstract:
Automated evaluation of tool‑using large language model (LLM) agents is widely assumed to be reliable, but this assumption has rarely been validated against human annotation. We introduce AgentProp‑Bench, a 2,000‑task benchmark with 2,300 traces across four domains, nine production LLMs, and a 100‑label human‑validated subset. We quantify judge reliability, characterize error propagation, and evaluate a runtime mitigation. Substring‑based judging agrees with human annotation at kappa=0.049 (chance‑level); a three‑LLM ensemble reaches kappa=0.432 (moderate) with a conservative bias. Under validated evaluation, a parameter‑level injection propagates to a wrong final answer with human‑calibrated probability approximately 0.62 (range 0.46‑0.73 across models). Rejection (catching bad parameters) and recovery (correcting after acceptance) are independent model capabilities (Spearman rho=0.126, p=0.747). A tuned runtime interceptor reduces hallucination on GPT‑4o‑mini by 23.0 percentage points under a concurrent n=600 control, but shows no significant effect on Gemini‑2.0‑Flash, whose aggressive parameter rejection eliminates the target failure mode. All code, data, traces, and human labels are released at https://github.com/bhaskargurram‑ai/agenthallu‑bench.
Authors:Gehan Zheng, Sanjay Seenivasan, Matthew Johnson-Roberson, Weiming Zhi
Abstract:
Imitation learning has enabled robots to acquire complex visuomotor manipulation skills from demonstrations, but deployment failures remain a major obstacle, especially for long‑horizon action‑chunked policies. Once execution drifts off the demonstration manifold, these policies often continue producing locally plausible actions without recovering from the failure. Existing runtime monitors either require failure data, over‑trigger under benign feature drift, or stop at failure detection without providing a recovery mechanism. We present Rewind‑IL, a training‑free online safeguard framework for generative action‑chunked imitation policies. Rewind‑IL combines a zero‑shot failure detector based on Temporal Inter‑chunk Discrepancy Estimate (TIDE), calibrated with split conformal prediction, with a state‑respawning mechanism that returns the robot to a semantically verified safe intermediate state. Offline, a vision‑language model identifies recovery checkpoints in demonstrations, and the frozen policy encoder is used to construct a compact checkpoint feature database. Online, Rewind‑IL monitors self‑consistency in overlapping action chunks, tracks similarity to the checkpoint library, and, upon failure, rewinds execution to the latest verified safe state before restarting inference from a clean policy state. Experiments on real‑world and simulated long‑horizon manipulation tasks, including transfer to flow‑matching action‑chunked policies, demonstrate that policy‑internal consistency coupled with semantically grounded respawning offers a practical route to improved reliability in imitation learning. Supplemental materials are available at https://sjay05.github.io/rewind‑il
Authors:Weihua Du, Jingming Zhuo, Yixin Dong, Andre Wang He, Weiwei Sun, Zeyu Zheng, Manupa Karunaratne, Ivan Fox, Tim Dettmers, Tianqi Chen, Yiming Yang, Sean Welleck
Abstract:
Recent large language model (LLM) agents have shown promise in using execution feedback for test‑time adaptation. However, robust self‑improvement remains far from solved: most approaches still treat each problem instance independently, without accumulating reusable knowledge. This limitation is particularly pronounced in domain‑specific languages such as Triton, which are underrepresented in LLM pretraining data. Their strict constraints and non‑linear optimization landscape further make naive generation and local refinement unreliable. We propose AdaExplore, an agent framework that enables self‑improvement via accumulated execution feedback for performance‑critical kernel code generation through two complementary stages: failure‑driven adaptation and diversity‑preserving search, jointly improving correctness and optimization performance without additional fine‑tuning or external knowledge. In the adaptation stage, the agent synthesizes tasks and converts recurring failures into a reusable memory of validity rules, helping subsequent generations remain within the feasible set. In the search stage, the agent organizes candidate kernels as a tree and alternates between small local refinements and larger structural regeneration, allowing it to explore the optimization landscape beyond local optima. Experiments on kernel runtime optimization benchmarks validate these gains: AdaExplore achieves 3.12x and 1.72x speedups on KernelBench Level‑2 and Level‑3, respectively, within 100 steps, and continues to improve with additional computation.
Authors:Henry O. Velesaca, David Freire-Obregon, Abel Reyes-Angulo, Steven Araujo, Angel Sappa
Abstract:
Penalty kicks in soccer are decided under extreme time constraints, where goalkeepers benefit from anticipating shot direction from the kickers motion before or around ball contact. In this paper, MambaKick is presented as a learning‑based framework for penalty direction prediction that leverages pretrained human action recognition (HAR) embeddings extracted from contact‑centered short video segments and combines them with a lightweight temporal predictor. Rather than relying on explicit kinematic reconstruction or handcrafted biomechanical features, the approach reuses transferable spatiotemporal representations and utilizes selective state‑spare models (Mamba) for efficient sequence aggregation. Simple contextual metadata (e.g., field side and footedness) are also considered as complementary cues that may reduce ambiguity in real‑world footage. Across a range of HAR backbones, MambaKick consistently improves or matches strong embedding baselines, achieving up to 53.1% accuracy for three classes and 64.5% for two classes under the proposed methodology. Overall, the results indicate that combining pretrained HAR representations with efficient state‑space temporal modeling is a practical direction for low‑latency intention prediction in real‑world sports video. The code will be available at GitHub: https://github.com/hvelesaca/MambaKick/
Authors:Zongru Li, Xingsheng Chen, Honggang Wen, Regina Qianru Zhang, Ming Li, Xiaojin Zhang, Hongzhi Yin, Qiang Yang, Kwok-Yan Lam, Pietro Lio, Siu-Ming Yiu
Abstract:
Molecular property prediction integrates quantum chemistry, cheminformatics, and deep learning to connect molecular structure with physicochemical and biological behavior. This survey traces four complementary paradigms, including Quantum, Descriptor Machine Learning, Geometric Deep Learning, and Foundation Models, and outlines a unified taxonomy linking molecular representations, model architectures, and interdisciplinary applications. Benchmark analyses integrate evidence from both widely used datasets and datasets reflecting industry perspectives, encompassing quantum, physicochemical, physiological, and biophysical domains. The survey examines current standards in data curation, splitting strategies, and evaluation protocols, highlighting challenges including inconsistent stereochemistry, heterogeneous assay sources, and reproducibility limitations under random or poorly defined splits. These observations motivate the modernization of benchmark design toward more transparent, time‑ and scaffold‑aware methodologies. We further propose three forward‑looking directions: (i) physics‑aware learning embedding quantum consistency, (ii) uncertainty‑calibrated foundation models for trustworthy inference, and (iii) realistic multimodal benchmark ecosystems integrating computational and experimental data. Repository: https://github.com/Zongru‑Li/Survey‑and‑Benchmarks‑of‑DL‑for‑Molecular‑Property‑Prediction‑in‑the‑Foundation‑Model‑Era.
Authors:Henry O. Velesaca, Andrea Mero, Guillermo A. Castillo, Angel D. Sappa
Abstract:
Pedestrian detection is fundamental to autonomous driving, robotics, and surveillance. Despite progress in deep learning, reliable identification remains challenging due to occlusions, cluttered backgrounds, and degraded visibility. While multispectral detection‑combining visible and thermal sensors‑mitigates poor visibility, the challenge of camouflaged pedestrians remains largely unexplored. Existing Camouflaged Object Detection (COD) benchmarks focus on biological species, leaving a gap in safety‑critical human detection where targets blend into their surroundings. To address this, we introduce Camo‑M3FD (derived from the M3FD dataset), a novel benchmark for cross‑spectral camouflaged pedestrian detection, consisting of registered visible‑thermal image pairs. The dataset is curated using quantitative metrics to ensure high foreground‑background similarity. We provide high‑quality pixel‑level masks and establish a standardized evaluation framework using state‑of‑the‑art COD models. Our results demonstrate that while thermal signals provide indispensable localization cues, multispectral fusion is essential for refining structural details. Camo‑M3FD serves as a foundational resource for developing robust and safety‑critical detection systems. The dataset is available on GitHub: https://cod‑espol.github.io/Camo‑M3FD/
Authors:Xiangkai Wang, Yun Zhao, Dongyi He, Qingling Xia, Gen Li, Nizhuan Wang, Ningxiao Peng, Bin Jiang
Abstract:
Stroke patient cross‑subject electroencephalography (EEG) decoding of motor imagery (MI) brain‑computer interface (BCI) is essential for motor rehabilitation, yet lesion‑related abnormal temporal dynamics and pronounced inter‑patient heterogeneity often undermine generalization. Existing adaptation methods are easily misled by pathological slow‑wave activity and unstable target‑domain pseudo‑labels. To address this challenge, we propose PA‑TCNet, a pathology‑aware temporal calibration framework with physiology‑guided target refinement for stroke motor imagery decoding. PA‑TCNet integrates two coordinated components. The Pathology‑aware Rhythmic State Mamba (PRSM) module decomposes EEG spatiotemporal features into slowly varying rhythmic context and fast transient perturbations, injecting the fused pathological context into selective state propagation to more effectively capture abnormal temporal dynamics. The Physiology‑Guided Target Calibration (PGTC) module constructs source‑domain sensorimotor region‑of‑interest templates, imposing physiological consistency constraints and dynamically refining target‑domain pseudo‑labels, thereby improving adaptation reliability. Leave‑one‑subject‑out experiments on two independent stroke EEG datasets, XW‑Stroke and 2019‑Stroke, yielded mean accuracies of 66.56% and 72.75%, respectively, outperforming state‑of‑the‑art baselines. These results indicate that jointly modeling pathological temporal dynamics and physiology‑constrained pseudo‑supervision can provide more robust cross‑subject initialization for personalized post‑stroke MI‑BCI rehabilitation. The implemented code is available at https://github.com/wxk1224/PA‑TCNet.
Authors:Nokimul Hasan Arif, Qian Lou, Mengxin Zheng
Abstract:
Most LLM safety work studies single‑agent models, but many real applications rely on multiple interacting agents. In these systems, prompt segmentation and inter‑agent routing create attack surfaces that single‑agent evaluations miss. We study \emphconjunctive prompt attacks, where a trigger key in the user query and a hidden adversarial template in one compromised remote agent each appear benign alone but activate harmful behavior when routing brings them together. We consider an attacker who changes neither model weights nor the client agent and instead controls only trigger placement and template insertion. Across star, chain, and DAG topologies, routing‑aware optimization substantially increases attack success over non‑optimized baselines while keeping false activations low. Existing defenses, including PromptGuard, Llama‑Guard variants, and system‑level controls such as tool restrictions, do not reliably stop the attack because no single component appears malicious in isolation. These results expose a structural vulnerability in agentic LLM pipelines and motivate defenses that reason over routing and cross‑agent composition. Code is available at https://github.com/UCF‑ML‑Research/ConjunctiveAgents.
Authors:Ziyang Wang
Abstract:
Satellite constellations are transforming space systems from isolated spacecraft into networked, software‑defined platforms capable of on‑orbit perception, decision making, and adaptation. Yet much of the existing AI studies remains centered on single‑satellite inference, while constellation‑scale autonomy introduces fundamentally new algorithmic requirements: learning and coordination under dynamic inter‑satellite connectivity, strict SWaP‑C limits, radiation‑induced faults, non‑IID data, concept drift, and safety‑critical operational constraints. This survey consolidates the emerging field of on‑orbit space AI through three complementary paradigms: (i) federated learning for cross‑satellite training, personalization, and secure aggregation; (ii) multi‑agent algorithms for cooperative planning, resource allocation, scheduling, formation control, and collision avoidance; and (iii) collaborative sensing and distributed inference for multi‑satellite fusion, tracking, split/early‑exit inference, and cross‑layer co‑design with constellation networking. We provide a system‑level view and a taxonomy that unifies collaboration architectures, temporal mechanisms, and trust models. To support community development and keep this review actionable over time, we continuously curate relevant papers and resources at https://github.com/ziyangwang007/AI4Space.
Authors:Jiaqi Shi, Yuechan Li, Xulong Zhang, Xiaoyang Qu, Jianzong Wang
Abstract:
High‑resolution Multimodal Large Language Models (MLLMs) face prohibitive computational costs during inference due to the explosion of visual tokens. Existing acceleration strategies, such as token pruning or layer sparsity, suffer from severe "backbone dependency", performing well on Vicuna or Mistral architectures (e.g., LLaVA) but causing significant performance degradation when transferred to architectures like Qwen. To address this, we leverage truncated matrix entropy to uncover a universal three‑stage inference lifecycle, decoupling visual redundancy into universal Intrinsic Visual Redundancy (IVR) and architecture‑dependent Secondary Saturation Redundancy (SSR). Guided by this insight, we propose HalfV, a framework that first mitigates IVR via a unified pruning strategy and then adaptively handles SSR based on its specific manifestation. Experiments demonstrate that HalfV achieves superior efficiency‑performance trade‑offs across diverse backbones. Notably, on Qwen25‑VL, it retains 96.8% performance at a 4.1× FLOPs speedup, significantly outperforming state‑of‑the‑art baselines. Our code is available at https://github.com/civilizwa/HalfV.
Authors:Vedant Jawandhia, Yash Sinha, Murari Mandal, Ankan Pal, Dhruv Kumar
Abstract:
Large language models (LLMs) are increasingly evaluated on mathematical reasoning, yet their robustness to equivalent problem representations remains poorly understood. In geometry, identical problems can be expressed in Euclidean, coordinate, or vector forms, but existing benchmarks report accuracy on fixed formats, implicitly assuming representation invariance and masking failures caused by representational changes alone. We propose GeoRepEval, a representation‑aware evaluation framework that measures correctness, invariance, and consistency at the problem level across parallel formulations, combining strict answer matching, bootstrap confidence intervals, paired McNemar tests, representation‑flip analyses, and regression controls for surface complexity. We prove that our Invariance@3 metric decomposes accuracy into robust and fragile components and is bounded by the weakest representation. Evaluating eleven LLMs on 158 curated high‑school geometry problems (474 instances), we find accuracy gaps of up to 14 percentage points induced solely by representation choice. Vector formulations emerge as a consistent failure point, with Invariance@3 as low as 0.044 even after controlling for length and symbolic complexity. A convert‑then‑solve prompting intervention improves vector accuracy by up to 52 percentage points for high‑capacity models, suggesting that failures reflect representation sensitivity rather than inability; however, low‑capacity models show no gains, indicating deeper limitations. These results suggest that current models rely on representation‑specific heuristics rather than abstract geometric reasoning. All datasets, prompts, and scripts are released at https://github.com/vedjaw/GeoRepEval.
Authors:Jasmine Moreira
Abstract:
The widespread adoption of AI‑assisted development tools in 2025 ‑‑ and the emergence of vibe coding, a practice of generating complete applications from natural language without verification ‑‑ exposed a critical and tool‑agnostic failure pattern: experienced developers who used frontier AI models were measurably slower in objective evaluations despite believing they were faster. Concurrently, 10.3% of AI‑generated applications in a production showcase contained critical security flaws. This paper argues that these failures share a structural cause ‑‑ the verification gap: every large language model (LLM), regardless of interface or capability, operates as a stochastic generator with zero internal semantic verification capability. The tool is irrelevant; the process is determinative. We present IACDM (Interactive Adversarial Convergence Development Methodology), a structured 8‑phase framework designed to address the verification gap through external verification agents (VA) operating at discrete gates. Its three pillars are: (1) deep problem discovery via Hierarchical Semantic Analysis before any technical solution; (2) persistent knowledge management across sessions; and (3) systematic adversarial critique through specialized lenses before implementation. The methodology is tool‑agnostic by construction, grounded in established software engineering tradition, and applied across more than 20 projects by multiple practitioners in a production R&D environment. Limitations are formalized as testable hypotheses for future empirical validation.
Authors:Banri Yanahama, Akiyoshi Sannai
Abstract:
AI‑driven autoformalization of mathematics is advancing rapidly. However, the type checker of a proof assistant guarantees only the logical correctness of proofs; it does not verify whether propositions and definitions faithfully capture their intended mathematical content. Consequently, AI‑generated formal proofs can exhibit semantic hallucination‑passing the type checker yet failing to express the intended mathematics. We propose a human‑in‑the‑loop approach in which human scientists and AI collaboratively produce formal proofs, with humans responsible for the semantic verification of propositions and definitions. To realize this approach, we develop Lean Atlas, a Lean 4 tool that visualizes the dependency graph of a Lean 4 project as an interactive web viewer, enabling human scientists to grasp the overall structure of a formalization efficiently. Its core feature, Lean Compass, is an algorithm that, given a selected theorem set, automatically extracts the project‑specific nodes whose semantic correctness can affect those target statements, thereby reducing the candidate set for semantic review in large‑scale formalizations. We further define aligned Lean code as formalization code that has undergone human semantic verification, and propose it as a quality standard for AI‑generated formalizations. We evaluate the tool on six Lean 4 formalization projects with different structural characteristics; proof‑heavy projects (PrimeNumberTheoremAnd, Carleson, Brownian Motion) achieved 94‑99% average node reduction, a 6‑theorem milestone subset of FLT achieved 59.8%, mixed PhysLib 69.0%, and definition‑heavy XMSS 27.3%. Lean Atlas is available as open‑source software at https://github.com/NyxFoundation/lean‑atlas .
Authors:Runwen You, Tong Xia, Jingzhi Wang, Jiankun Zhang, Tengyao Tu, Jinghua Piao, Yi Chang, Yong Li
Abstract:
Urban data support a wide range of applications across multiple disciplines. However, at the global scale, there is no unified platform for urban data discovery. As a result, researchers often have to manually search through websites or scientific literature to identify relevant datasets. To address this problem, we curate an open urban data discovery portal, UrbanDataMiner, which supports dataset‑level search and filtering over more than 60,000 urban datasets extracted from over 15,000 Nature‑affiliated publications. UrbanDataMiner is enabled by Paper2Data, a novel large‑scale LLM‑driven pipeline that automatically identifies dataset mentions in scientific papers and structures them using a unified urban data metadata schema. Human‑annotated evaluation demonstrates that Paper2Data achieves high recall (approximately 90%) in dataset identification and high field‑level precision (above 80%). In addition, UrbanDataMiner can retrieve over 9% of datasets that are not easily discoverable through general‑purpose search engines such as Google. Overall, our work provides the first large‑scale, literature‑derived infrastructure for urban data discovery and enables more systematic and reusable data‑driven research across disciplines. Our code and data are publicly available\footnotehttps://github.com/Yourunwen/Paper2Data.
Authors:Henry O. Velesaca, Luigi Miranda, Angel D. Sappa
Abstract:
This paper presents SWNet, a bimodal end‑to‑end cross‑spectral network specifically engineered for the detection of camouflaged weeds in dense agricultural environments. Plant camouflage, characterized by homochromatic blending where invasive species mimic the phenotypic traits of primary crops, poses a significant challenge for traditional computer vision systems. To overcome these limitations, SWNet utilizes a Pyramid Vision Transformer v2 backbone to capture long‑range dependencies and a Bimodal Gated Fusion Module to dynamically integrate Visible and Near‑Infrared information. By leveraging the physiological differences in chlorophyll reflectance captured in the NIR spectrum, the proposed architecture effectively discriminates targets that are otherwise indistinguishable in the visible range. Furthermore, an Edge‑Aware Refinement module is employed to produce sharper object boundaries and reduce structural ambiguity. Experimental results on the Weeds‑Banana dataset indicate that SWNet outperforms ten state‑of‑the‑art methods. The study demonstrates that the integration of cross‑spectral data and boundary‑guided refinement is essential for high segmentation accuracy in complex crop canopies. The code is available on GitHub: https://cod‑espol.github.io/SWNet/
Authors:Weijiang Xiong, Robert Fonod, Nikolas Geroliminis
Abstract:
Traffic forecasting is a challenging spatio‑temporal modeling task and a critical component of urban transportation management. Current studies mainly focus on deterministic predictions, with limited considerations on the uncertainty and stochasticity in traffic dynamics. Therefore, this paper proposes an elegant yet universal approach that transforms existing models into probabilistic predictors by replacing only the final output layer with a novel Gaussian Mixture Model (GMM) layer. The modified model requires no changes to the training pipeline and can be trained using only the Negative Log‑Likelihood (NLL) loss, without any auxiliary or regularization terms. Experiments on multiple traffic datasets show that our approach generalizes from classic to modern model architectures while preserving deterministic performance. Furthermore, we propose a systematic evaluation procedure based on cumulative distributions and confidence intervals, and demonstrate that our approach is considerably more accurate and informative than unimodal or deterministic baselines. Finally, a more detailed study on a real‑world dense urban traffic network is presented to examine the impact of data quality on uncertainty quantification and to show the robustness of our approach under imperfect data conditions. Code available at https://github.com/Weijiang‑Xiong/OpenSkyTraffic
Authors:Lifan Jiang, Tianrun Wu, Yuhang Pei, Chenyang Wang, Boxi Wu, Deng Cai
Abstract:
The evaluation of visual editing models remains fragmented across methods and modalities. Existing benchmarks are often tailored to specific paradigms, making fair cross‑paradigm comparisons difficult, while video editing lacks reliable evaluation benchmarks. Furthermore, common automatic metrics often misalign with human preference, yet directly deploying large multimodal models (MLLMs) as evaluators incurs prohibitive computational and financial costs. We present UniEditBench, a unified benchmark for image and video editing that supports reconstruction‑based and instruction‑driven methods under a shared protocol. UniEditBench includes a structured taxonomy of nine image operations (Add, Remove, Replace, Change, Stroke‑based, Extract, Adjust, Count, Reorder) and eight video operations, with coverage of challenging compositional tasks such as counting and spatial reordering. To enable scalable evaluation, we distill a high‑capacity MLLM judge (Qwen3‑VL‑235B‑A22B Instruct) into lightweight 4B/8B evaluators that provide multi‑dimensional scoring over structural fidelity, text alignment, background consistency, naturalness, and temporal‑spatial consistency (for videos). Experiments show that the distilled evaluators maintain strong agreement with human judgments and substantially reduce deployment cost relative to the teacher model. UniEditBench provides a practical and reproducible protocol for benchmarking modern visual editing methods. Our benchmark and the associated reward models are publicly available at https://github.com/wesar1/UniEditBench.
Authors:Irem Ulku, Erdem Akagündüz, Ömer Özgür Tanrıöver
Abstract:
Multimodal remote sensing data provide complementary information for semantic segmentation, but in real‑world deployments, some modalities may be unavailable due to sensor failures, acquisition issues, or challenging atmospheric conditions. Existing multimodal segmentation models typically address missing modalities by learning a shared representation across inputs. However, this approach can introduce a trade‑off by compromising modality‑specific complementary information and reducing performance when all modalities are available. In this paper, we tackle this limitation with CBC‑SLP, a multimodal semantic segmentation model designed to preserve both modality‑invariant and modality‑specific information. Inspired by the theoretical results on modality alignment, which state that perfectly aligned multimodal representations can lead to sub‑optimal performance in downstream prediction tasks, we propose a novel structured latent projection approach as an architectural inductive bias. Rather than enforcing this strategy through a loss term, we incorporate it directly into the architecture. In particular, to use the complementary information effectively while maintaining robustness under random modality dropout, we structure the latent representations into shared and modality‑specific components and adaptively transfer them to the decoder according to the random modality availability mask. Extensive experiments on three multimodal remote sensing image sets demonstrate that CBC‑SLP consistently outperforms state‑of‑the‑art multimodal models across full and missing modality scenarios. Besides, we empirically demonstrate that the proposed strategy can recover the complementary information that may not be preserved in a shared representation. The code is available at https://github.com/iremulku/Multispectral‑Semantic‑Segmentation‑via‑Structured‑Latent‑Projection‑CBC‑SLP‑.
Authors:Pritesh Jha
Abstract:
We present PIIBench, a unified benchmark corpus for Personally Identifiable Information (PII) detection in natural language text. Existing resources for PII detection are fragmented across domain‑specific corpora with mutually incompatible annotation schemes, preventing systematic comparison of detection systems. We consolidate ten publicly available datasets spanning synthetic PII corpora, multilingual Named Entity Recognition (NER) benchmarks, and financial domain annotated text, yielding a corpus of 2,369,883 annotated sequences and 3.35 million entity mentions across 48 canonical PII entity types. We develop a principled normalization pipeline that maps 80+ source‑specific label variants to a standardized BIO tagging scheme, applies frequency‑based suppression of near absent entity types, and produces stratified 80/10/10 train/validation/test splits preserving source distribution. To establish baseline difficulty, we evaluate eight published systems spanning rule‑based engines (Microsoft Presidio), general purpose NER models (spaCy, BERT‑base NER, XLM‑RoBERTa NER, SpanMarker mBERT, SpanMarker BERT), a PII‑specific model (Piiranha DeBERTa), and a financial NER specialist (XtremeDistil FiNER). All systems achieve span‑level F1 below 0.14, with the best system (Presidio, F1=0.1385) still producing zero recall on most entity types. These results directly quantify the domain‑silo problem and demonstrate that PIIBench presents a substantially harder and more comprehensive evaluation challenge than any existing single source PII dataset. The dataset construction pipeline and benchmark evaluation code are publicly available at https://github.com/pritesh‑2711/pii‑bench.
Authors:Ponhvoan Srey, Xiaobao Wu, Cong-Duy Nguyen, Anh Tuan Luu
Abstract:
Uncertainty estimation is a promising approach to detect hallucinations in large language models (LLMs). Recent approaches commonly depend on model internal states to estimate uncertainty. However, they suffer from strict assumptions on how hidden states should evolve across layers, and from information loss by solely focusing on last or mean tokens. To address these issues, we present Sequential Internal Variance Representation (SIVR), a supervised hallucination detection framework that leverages token‑wise, layer‑wise features derived from hidden states. SIVR adopts a more basic assumption that uncertainty manifests in the degree of dispersion or variance of internal representations across layers, rather than relying on specific assumptions, which makes the method model and task agnostic. It additionally aggregates the full sequence of per‑token variance features, learning temporal patterns indicative of factual errors and thereby preventing information loss. Experimental results demonstrate SIVR consistently outperforms strong baselines. Most importantly, SIVR enjoys stronger generalisation and avoids relying on large training sets, highlighting the potential for practical deployment. Our code repository is available online at https://github.com/ponhvoan/internal‑variance.
Authors:Junguang Yao, Wenye Liu, Stjepan Picek, Yue Zheng
Abstract:
Visual speaker recognition based on lip motion offers a silent, hands‑free, and behavior‑driven biometric solution that remains effective even when acoustic cues are unavailable. Compared to traditional methods that rely heavily on appearance‑dependent representations, lip motion encodes subject‑specific behavioral dynamics driven by consistent articulation patterns and muscle coordination, offering inherent stability across environmental changes. However, capturing these robust, fine‑grained dynamics is challenging for conventional frame‑based cameras due to motion blur and low dynamic range. To exploit the intrinsic stability of lip motion and address these sensing limitations, we propose NeuroLip, an event‑based framework that captures fine‑grained lip dynamics under a strict yet practical cross‑scene protocol: training is performed under a single controlled condition, while recognition must generalize to unseen viewing and lighting conditions. NeuroLip features a 1) Temporal‑aware Voxel Encoding module with adaptive event weighting, 2) Structure‑aware Spatial Enhancer that amplifies discriminative behavioral patterns by suppressing noise while preserving vertically structured motion information, and 3) Polarity Consistency Regularization mechanism to retain motion‑direction cues encoded in event polarities. To facilitate systematic evaluation, we introduce DVSpeaker, a comprehensive event‑based lip‑motion dataset comprising 50 subjects recorded under four distinct viewpoint and illumination scenarios. Extensive experiments demonstrate that NeuroLip achieves near‑perfect matched‑scene accuracy and robust cross‑scene generalization, attaining over 71% accuracy on unseen viewpoints and nearly 76% under low‑light conditions, outperforming representative existing methods by at least 8.54%. The dataset and code are publicly available at https://github.com/JiuZeongit/NeuroLip.
Authors:Jize Wang, Xuanxuan Liu, Yining Li, Songyang Zhang, Yijun Wang, Zifei Shan, Xinyi Le, Cailian Chen, Xinping Guan, Dacheng Tao
Abstract:
The development of general‑purpose agents requires a shift from executing simple instructions to completing complex, real‑world productivity workflows. However, current tool‑use benchmarks remain misaligned with real‑world requirements, relying on AI‑generated queries, dummy tools, and limited system‑level coordination. To address this, we propose GTA‑2, a hierarchical benchmark for General Tool Agents (GTA) spanning atomic tool use and open‑ended workflows. Built on real‑world authenticity, it leverages real user queries, deployed tools, and multimodal contexts. (i) GTA‑Atomic, inherited from our prior GTA benchmark, evaluates short‑horizon, closed‑ended tool‑use precision. (ii) GTA‑Workflow introduces long‑horizon, open‑ended tasks for realistic end‑to‑end completion. To evaluate open‑ended deliverables, we propose a recursive checkpoint‑based evaluation mechanism that decomposes objectives into verifiable sub‑goals, enabling unified evaluation of both model capabilities and agent execution frameworks (i.e., execution harnesses). Experiments reveal a pronounced capability cliff: while frontier models already struggle on atomic tasks (below 50%), they largely fail on workflows, with top models achieving only 14.39% success. Further analysis shows that checkpoint‑guided feedback improves performance, while advanced frameworks such as Manus and OpenClaw substantially enhance workflow completion, highlighting the importance of execution harness design beyond the underlying model capacity. These findings provide guidance for developing reliable personal and professional assistants. Dataset and code will be available at https://github.com/open‑compass/GTA.
Authors:Xinge Liu, Terry Jingchen Zhang, Bernhard Schölkopf, Zhijing Jin, Kristen Menou
Abstract:
The rise of autonomous AI agents suggests that dynamic benchmark environments with built‑in feedback on scientifically grounded tasks are needed to evaluate the capabilities of these agents in research work. We introduce Stargazer, a scalable environment for evaluating AI agents on dynamic, iterative physics‑grounded model‑fitting tasks using inference on radial‑velocity (RV) time series data. Stargazer comprises 120 tasks across three difficulty tiers, including 20 real archival cases, covering diverse scenarios ranging from high‑SNR single‑planet systems to complex multi‑planetary configurations requiring involved low‑SNR analysis. Our evaluation of eight frontier agents reveals a gap between numerical optimization and adherence to physical constraints: although agents often achieve a good statistical fit, they frequently fail to recover correct physical system parameters, a limitation that persists even when agents are equipped with vanilla skills. Furthermore, increasing test‑time compute yields only marginal gains, with excessive token usage often reflecting recursive failure loops rather than meaningful exploration. Stargazer presents an opportunity to train, evaluate, scaffold, and scale strategies on a model‑fitting problem of practical research relevance today. Our methodology to design a simulation‑driven environment for AI agents presumably generalizes to many other model‑fitting problems across scientific domains. Source code and the project website are available at https://github.com/AIPS‑UofT/Stargazer and https://aips‑uoft.github.io/Stargazer/, respectively.
Authors:Yining Hong, Yining She, Eunsuk Kang, Christopher S. Timperley, Christian Kästner
Abstract:
AI agents that interact with their environments through tools enable powerful applications, but in high‑stakes business settings, unintended actions can cause unacceptable harm, such as privacy breaches and financial loss. Existing mitigations, such as training‑based methods and neural guardrails, improve agent reliability but cannot provide guarantees. We study symbolic guardrails as a practical path toward strong safety and security guarantees for AI agents. Our three‑part study includes a systematic review of 80 state‑of‑the‑art agent safety and security benchmarks to identify the policies they evaluate, an analysis of which policy requirements can be guaranteed by symbolic guardrails, and an evaluation of how symbolic guardrails affect safety, security, and agent success on τ^2‑Bench, CAR‑bench, and MedAgentBench. We find that 85% of benchmarks lack concrete policies, relying instead on underspecified high‑level goals or common sense. Among the specified policies, 74% of policy requirements can be enforced by symbolic guardrails, often using simple, low‑cost mechanisms. These guardrails improve safety and security without sacrificing agent utility. Overall, our results suggest that symbolic guardrails are a practical and effective way to guarantee some safety and security requirements, especially for domain‑specific AI agents. We release all codes and artifacts at https://github.com/hyn0027/agent‑symbolic‑guardrails.
Authors:Yukuan Zhang, Mengxin Zheng, Qian Lou
Abstract:
Cryptographically secure neural network inference typically relies on secure computing techniques such as Secure Multi‑Party Computation (MPC), enabling cloud servers to process client inputs without decrypting them. Although prior privacy‑preserving inference systems co‑design network optimizations with MPC, they remain slow and costly, limiting real‑world deployment. A major bottleneck is their use of a single, fixed transformer model for all encrypted inputs, ignoring that different inputs require different model sizes to balance efficiency and accuracy. We present SecureRouter, an end‑to‑end encrypted routing and inference framework that accelerates secure transformer inference through input‑adaptive model selection under encryption. SecureRouter establishes a unified encrypted pipeline that integrates a secure router with an MPC‑optimized model pool, enabling coordinated routing, inference, and protocol execution while preserving full data and model confidentiality. The framework includes training‑phase and inference‑phase components: an MPC‑cost‑aware secure router that predicts per‑model utility and cost from encrypted features, and an MPC‑optimized model pool whose architectures and quantization schemes are co‑trained to minimize MPC communication and computation overhead. Compared to prior work, SecureRouter achieves a latency reduction by 1.95x with negligible accuracy loss, offering a practical path toward scalable and efficient secure AI inference. Our open‑source implementation is available at: https://github.com/UCF‑ML‑Research/SecureRouter
Authors:Zixuan Weng, Jinghuai Zhang, Kunlin Cai, Ying Li, Peiran Wang, Yuan Tian
Abstract:
Large language models (LLMs) often exhibit undesirable behaviors, such as safety violations and hallucinations. Although inference‑time steering offers a cost‑effective way to adjust model behavior without updating its parameters, existing methods often fail to be simultaneously effective, utility‑preserving, and training‑efficient due to their rigid, one‑size‑fits‑all designs and limited adaptability. In this work, we present FineSteer, a novel steering framework that decomposes inference‑time steering into two complementary stages: conditional steering and fine‑grained vector synthesis, allowing fine‑grained control over when and how to steer internal representations. In the first stage, we introduce a Subspace‑guided Conditional Steering (SCS) mechanism that preserves model utility by avoiding unnecessary steering. In the second stage, we propose a Mixture‑of‑Steering‑Experts (MoSE) mechanism that captures the multimodal nature of desired steering behaviors and generates query‑specific steering vectors for improved effectiveness. Through tailored designs in both SCS and MoSE, FineSteer maintains robust performance on general queries while adaptively optimizing steering vectors for targeted inputs in a training‑efficient manner. Extensive experiments on safety and truthfulness benchmarks show that FineSteer outperforms state‑of‑the‑art methods in overall performance, achieving stronger steering performance with minimal utility loss. Code is available at https://github.com/YukinoAsuna/FineSteer
Authors:Yukun Jiang, Yage Zhang, Michael Backes, Xinyue Shen, Yang Zhang
Abstract:
Large language models (LLMs) have evolved into autonomous agents that rely on open skill ecosystems (e.g., ClawHub and Skills.Rest), hosting numerous publicly reusable skills. Existing security research on these ecosystems mainly focuses on vulnerabilities within skills, such as prompt injection. However, there is a critical gap regarding skills that may be misused for harmful actions (e.g., cyber attacks, fraud and scams, privacy violations, and sexual content generation), namely harmful skills. In this paper, we present the first large‑scale measurement study of harmful skills in agent ecosystems, covering 98,440 skills across two major registries. Using an LLM‑driven scoring system grounded in our harmful skill taxonomy, we find that 4.93% of skills (4,858) are harmful, with ClawHub exhibiting an 8.84% harmful rate compared to 3.49% on Skills.Rest. We then construct HarmfulSkillBench, the first benchmark for evaluating agent safety against harmful skills in realistic agent contexts, comprising 200 harmful skills across 20 categories and four evaluation conditions. By evaluating six LLMs on HarmfulSkillBench, we find that presenting a harmful task through a pre‑installed skill substantially lowers refusal rates across all models, with the average harm score rising from 0.27 without the skill to 0.47 with it, and further to 0.76 when the harmful intent is implicit rather than stated as an explicit user request. We responsibly disclose our findings to the affected registries and release our benchmark to support future research (see https://github.com/TrustAIRLab/HarmfulSkillBench).
Authors:G. Aytug Akarlar
Abstract:
We present causal evidence that hallucination in autoregressive language models is an early trajectory commitment governed by asymmetric attractor dynamics. Using same‑prompt bifurcation, in which we repeatedly sample identical inputs to observe spontaneous divergence, we isolate trajectory dynamics from prompt‑level confounds. On Qwen2.5‑1.5B across 61 prompts spanning six categories, 27 prompts (44.3%) bifurcate with factual and hallucinated trajectories diverging at the first generated token (KL = 0 at step 0, KL > 1.0 at step 1). Activation patching across 28 layers reveals a pronounced causal asymmetry: injecting a hallucinated activation into a correct trajectory corrupts output in 87.5% of trials (layer 20), while the reverse recovers only 33.3% (layer 24); both exceed the 10.4% baseline (p = 0.025) and 12.5% random‑patch control. Window patching shows correction requires sustained multi‑step intervention, whereas corruption needs only a single perturbation. Probing the prompt encoding itself, step‑0 residual states predict per‑prompt hallucination rate at Pearson r = 0.776 at layer 15 (p < 0.001 against a 1000‑permutation null); unsupervised clustering identifies five regime‑like groups (eta^2 = 0.55) whose saddle‑adjacent cluster concentrates 12 of the 13 bifurcating false‑premise prompts, indicating that the basin structure is organized around regime commitments fixed at prompt encoding. These findings characterize hallucination as a locally stable attractor basin: entry is probabilistic and rapid, exit demands coordinated intervention across layers and steps, and the relevant basins are selected by clusterable regimes already discernible at step 0.
Authors:Keon Kim, Krish Chelikavada
Abstract:
Multi‑step zoom‑in pipelines are widely used for GUI grounding, yet the intermediate predictions they produce are typically discarded after coordinate remapping. We observe that these intermediate outputs contain a useful confidence signal for free: zoom consistency, the distance between a model's step‑2 prediction and the crop center. Unlike log‑probabilities or token‑level uncertainty, zoom consistency is a geometric quantity in a shared coordinate space, making it directly comparable across architecturally different VLMs without calibration. We prove this quantity is a linear estimator of step‑1 spatial error under idealized conditions (perfect step‑2, target within crop) and show it correlates with prediction correctness across two VLMs (AUC = 0.60; Spearman rho = ‑0.14, p < 10^‑6 for KV‑Ground‑8B; rho = ‑0.11, p = 0.0003 for Qwen3.5‑27B). The correlation is small but consistent across models, application categories, and operating systems. As a proof‑of‑concept, we use zoom consistency to route between a specialist and generalist model, capturing 16.5% of the oracle headroom between them (+0.8%, McNemar p = 0.19). Code is available at https://github.com/omxyz/zoom‑consistency‑routing.
Authors:Kieran A. Murphy
Abstract:
We propose InfoChess, a symmetric adversarial game that elevates competitive information acquisition to the primary objective. There is no piece capture, removing material incentives that would otherwise confound the role of information. Instead, pieces are used to alter visibility. Players are scored on their probabilistic inference of the opponent's king location over the duration of the game. To explore the space of strategies for playing InfoChess, we introduce a hierarchy of heuristic agents defined by increasing levels of opponent modeling, and train a reinforcement learning agent that outperforms these baselines. Leveraging the discrete structure of the game, we analyze gameplay through natural information‑theoretic characterizations that include belief entropy, oracle cross entropy, and predictive log score under the action‑induced observation channel. These measures disentangle epistemic uncertainty, calibration mismatch, and uncertainty induced by adversarial movement. The design of InfoChess renders it a testbed for studying multi‑agent inference under partial observability. We release code for the environment and agents, and a public interface to encourage further study.
Authors:Zhen Yang, Ping Jian, Zhongbin Guo, Zuming Zhang, Chengzhi Li, Yonghong Deng, Xinyue Zhang, Wenpeng Lu
Abstract:
Over the past year, spatial intelligence has drawn increasing attention. Many prior works study it from the perspective of visual‑spatial intelligence, where models have access to visuospatial information from visual inputs. However, in the absence of visual information, whether linguistic intelligence alone is sufficient to endow models with spatial intelligence, and how models perform relevant tasks with text‑only inputs still remain unexplored. Therefore, in this paper, we focus on a fundamental and critical capability in spatial intelligence from a linguistic perspective: viewpoint rotation understanding (VRU). Specifically, LLMs and VLMs are asked to infer their final viewpoint and predict the corresponding observation in an environment given textual description of viewpoint rotation and observation over multiple steps. We find that both LLMs and VLMs perform poorly on our proposed dataset while human can easily achieve 100% accuracy, indicating a substantial gap between current model capabilities and the requirements of spatial intelligence. To uncover the underlying mechanisms, we conduct a layer‑wise probing analysis and head‑wise causal intervention. Our findings reveal that although models encode viewpoint information in the hidden states, they appear to struggle to bind the viewpoint position with corresponding observation, resulting in a hallucination in final layers. Finally, we selectively fine‑tune the key attention heads identified by causal intervention to improve VRU performance. Experimental results demonstrate that such selective fine‑tuning achieves improved VRU performance while avoiding catastrophic forgetting of generic abilities. Our dataset and code will be released at https://github.com/Young‑Zhen/VRU_Interpret .
Authors:Tianhao Fu, Austin Wang, Charles Chen, Roby Aldave-Garza, Yucheng Chen
Abstract:
Reliable uncertainty estimation is critical for medical image segmentation, where automated contours feed downstream quantification and clinical decision support. Many strong uncertainty methods require repeated inference, while efficient single‑forward‑pass alternatives often provide weaker failure ranking or rely on restrictive feature‑space assumptions. We present SegWithU, a post‑hoc framework that augments a frozen pretrained segmentation backbone with a lightweight uncertainty head. SegWithU taps intermediate backbone features and models uncertainty as perturbation energy in a compact probe space using rank‑1 posterior probes. It produces two voxel‑wise uncertainty maps: a calibration‑oriented map for probability tempering and a ranking‑oriented map for error detection and selective prediction. Across ACDC, BraTS2024, and LiTS, SegWithU is the strongest and most consistent single‑forward‑pass baseline, achieving AUROC/AURC of 0.9838/2.4885, 0.9946/0.2660, and 0.9925/0.8193, respectively, while preserving segmentation quality. These results suggest that perturbation‑based uncertainty modeling is an effective and practical route to reliability‑aware medical segmentation. Source code is available at https://github.com/ProjectNeura/SegWithU.
Authors:Zihao Xu, John Harvill, Ziwei Fan, Yizhou Sun, Hao Ding, Hao Wang
Abstract:
Large Language Models (LLMs) incur significant computational and memory costs when processing long prompts, as full self‑attention scales quadratically with input length. Token compression aims to address this challenge by reducing the number of tokens representing inputs. However, existing prompt‑compression approaches primarily operate in token space and overlook inefficiencies in the latent embedding space. In this paper, we propose K‑Token Merging, a latent‑space compression framework that merges each contiguous block of K token embeddings into a single embedding via a lightweight encoder. The compressed sequence is processed by a LoRA‑adapted LLM, while generation remains in the original vocabulary. Experiments on structural reasoning (Textualized Tree), sentiment classification (Amazon Reviews), and code editing (CommitPackFT) show that K‑Token Merging lies on the Pareto frontier of performance vs. compression, achieving up to 75% input length reduction with minimal performance degradation. Code is available at https://github.com/shsjxzh/K‑Token‑Merging.
Authors:Kanzhi Cheng, Zehao Li, Zheng Ma, Nuo Chen, Jialin Cao, Qiushi Sun, Zichen Ding, Fangzhi Xu, Hang Yan, Jiajun Chen, Anh Tuan Luu, Jianbing Zhang, Lewei Lu, Dahua Lin
Abstract:
Mobile agents powered by vision‑language models have demonstrated impressive capabilities in automating mobile tasks, with recent leading models achieving a marked performance leap, e.g., nearly 70% success on AndroidWorld. However, these systems keep their training data closed and remain opaque about their task and trajectory synthesis recipes. We present OpenMobile, an open‑source framework that synthesizes high‑quality task instructions and agent trajectories, with two key components: (1) The first is a scalable task synthesis pipeline that constructs a global environment memory from exploration, then leverages it to generate diverse and grounded instructions. and (2) a policy‑switching strategy for trajectory rollout. By alternating between learner and expert models, it captures essential error‑recovery data often missing in standard imitation learning. Agents trained on our data achieve competitive results across three dynamic mobile agent benchmarks: notably, our fine‑tuned Qwen2.5‑VL and Qwen3‑VL reach 51.7% and 64.7% on AndroidWorld, far surpassing existing open‑data approaches. Furthermore, we conduct transparent analyses on the overlap between our synthetic instructions and benchmark test sets, and verify that performance gains stem from broad functionality coverage rather than benchmark overfitting. We release data and code at https://njucckevin.github.io/openmobile/ to bridge the data gap and facilitate broader mobile agent research.
Authors:Wentao Zhang, Zhe Zhao, Haibin Wen, Yingcheng Wu, Ming Yin, Bo An, Mengdi Wang
Abstract:
Recent advances in LLM based agent systems have shown promise in tackling complex, long horizon tasks. However, existing agent protocols (e.g., A2A and MCP) under specify cross entity lifecycle and context management, version tracking, and evolution safe update interfaces, which encourages monolithic compositions and brittle glue code. We introduce Autogenesis Protocol (AGP), a self evolution protocol that decouples what evolves from how evolution occurs. Its Resource Substrate Protocol Layer (RSPL) models prompts, agents, tools, environments, and memory as protocol registered resources with explicit state, lifecycle, and versioned interfaces. Its Self Evolution Protocol Layer (SEPL) specifies a closed loop operator interface for proposing, assessing, and committing improvements with auditable lineage and rollback. Building on AGP, we present Autogenesis System (AGS), a self‑evolving multi‑agent system that dynamically instantiates, retrieves, and refines protocol‑registered resources during execution. We evaluate AGS on multiple challenging benchmarks that require long horizon planning and tool use across heterogeneous resources. The results demonstrate consistent improvements over strong baselines, supporting the effectiveness of agent resource management and closed loop self evolution. The code is available at https://github.com/DVampire/Autogenesis.
Authors:Haochun Tang, Yuliang Yan, Jiahua Lu, Huaxiao Liu, Enyan Dai
Abstract:
Cost‑aware routing dynamically dispatches user queries to models of varying capability to balance performance and inference cost. However, the routing strategy introduces a new security concern that adversaries may manipulate the router to consistently select expensive high‑capability models. Existing routing attacks depend on either white‑box access or heuristic prompts, rendering them ineffective in real‑world black‑box scenarios. In this work, we propose R^2A, which aims to mislead black‑box LLM routers to expensive models via adversarial suffix optimization. Specifically, R^2A deploys a hybrid ensemble surrogate router to mimic the black‑box router. A suffix optimization algorithm is further adapted for the ensemble‑based surrogate. Extensive experiments on multiple open‑source and commercial routing systems demonstrate that R^2A significantly increases the routing rate to expensive models on queries of different distributions. Code and examples: https://github.com/thcxiker/R2A‑Attack.
Authors:Zihong Zhang, Zuchao Li, Lefei Zhang, Ping Wang, Hai Zhao
Abstract:
Autoregressive decoding in Large Language Models (LLMs) generates one token per step, causing high inference latency. Speculative decoding (SD) mitigates this through a guess‑and‑verify strategy, but existing training‑free variants face trade‑offs: retrieval‑based drafts break when no exact match exists, while logits‑based drafts lack structural guidance. We propose RACER (Retrieval‑Augmented Contextual Rapid Speculative Decoding), a lightweight and training‑free method that integrates retrieved exact patterns with logit‑driven future cues. This unification supplies both reliable anchors and flexible extrapolation, yielding richer speculative drafts. Experiments on Spec‑Bench, HumanEval, and MGSM‑ZH demonstrate that RACER consistently accelerates inference, achieving more than 2× speedup over autoregressive decoding, and outperforms prior training‑free methods, offering a scalable, plug‑and‑play solution for efficient LLM decoding. Our source code is available at \hrefhttps://github.com/hkr04/RACERhttps://github.com/hkr04/RACER.
Authors:Meng-Xun Li, Wen-Hui Deng, Zhi-Xing Wu, Chun-Xiao Jin, Jia-Min Wu, Yue Han, James Kit Hon Tsoi, Gui-Song Xia, Cui Huang
Abstract:
Vision‑Language Models (VLMs) have demonstrated significant potential in medical image analysis, yet their application in intraoral photography remains largely underexplored due to the lack of fine‑grained, annotated datasets and comprehensive benchmarks. To address this, we present MetaDent, a comprehensive resource that includes (1) a novel and large‑scale dentistry image dataset collected from clinical, public, and web sources; (2) a semi‑structured annotation framework designed to capture the hierarchical and clinically nuanced nature of dental photography; and (3) comprehensive benchmark suites for evaluating state‑of‑the‑art VLMs on clinical image understanding. Our labeling approach combines a high‑level image summary with point‑by‑point, free‑text descriptions of abnormalities. This method enables rich, scalable, and task‑agnostic representations. We curated 60,669 dental images from diverse sources and annotated a representative subset of 2,588 images using this meta‑labeling scheme. Leveraging Large Language Models (LLMs), we derive standardized benchmarks: approximately 15K Visual Question Answering (VQA) pairs and an 18‑class multi‑label classification dataset, which we validated with human review and error analysis to justify that the LLM‑driven transition reliably preserves fidelity and semantic accuracy. We then evaluate state‑of‑the‑art VLMs across VQA, classification, and image captioning tasks. Quantitative results reveal that even the most advanced models struggle with a fine‑grained understanding of intraoral scenes, achieving moderate accuracy and producing inconsistent or incomplete descriptions in image captioning. We publicly release our dataset, annotations, and tools to foster reproducible research and accelerate the development of vision‑language systems for dental applications.
Authors:Yi Zhao, Yajuan Peng, Cam-Tu Nguyen, Zuchao Li, Xiaoliang Wang, Xiaoming Fu, Hai Zhao
Abstract:
Large Reasoning Models (LRMs) achieve strong performance on complex tasks through extended chains of thought but suffer from high inference latency due to autoregressive reasoning. Recent work explores using Small Reasoning Models (SRMs) to accelerate LRM inference. In this paper, we systematically characterize the capability boundaries of SRMs and identify three common types of reasoning risks: (1) path divergence, where SRMs lack the strategic ability to construct an initial plan, causing reasoning to deviate from the most probable path; (2) cognitive overload, where SRMs fail to solve particularly difficult steps; and (3) recovery inability, where SRMs lack robust self‑reflection and error correction mechanisms. To address these challenges, we propose TrigReason, a trigger‑based collaborative reasoning framework that replaces continuous polling with selective intervention. TrigReason delegates most reasoning to the SRM and activates LRM intervention only when necessary‑during initial strategic planning (strategic priming trigger), upon detecting extraordinary overconfidence (cognitive offload trigger), or when reasoning falls into unproductive loops (intervention request trigger). The evaluation results on AIME24, AIME25, and GPQA‑D indicate that TrigReason matches the accuracy of full LRMs and SpecReason, while offloading 1.70x ‑ 4.79x more reasoning steps to SRMs. Under edge‑cloud conditions, TrigReason reduces latency by 43.9% and API cost by 73.3%. Our code is available at \hrefhttps://github.com/QQQ‑yi/TrigReasonhttps://github.com/QQQ‑yi/TrigReason
Authors:Haileab Yagersew
Abstract:
Retail theft costs the global economy over \100 billion annually, yet existing AI‑based detection systems require expensive custom model training on proprietary datasets and charge \200‑500/month per store. We present Paza, a zero‑shot retail theft detection framework that achieves practical concealment detection without training any model. Our approach orchestrates multiple existing models in a layered pipeline ‑ cheap object detection and pose estimation running continuously, with an expensive vision‑language model (VLM) invoked only when behavioral pre‑filters trigger. A multi‑signal suspicion pre‑filter (requiring dwell time plus at least one behavioral signal) reduces VLM invocations by 240x compared to per‑frame analysis, bounding calls to <=10/minute and enabling a single GPU to serve 10‑20 stores. The architecture is model‑agnostic: the VLM component accepts any OpenAI‑compatible endpoint, enabling operators to swap between models such as Gemma 4, Qwen3.5‑Omni, GPT‑4o, or future releases without code changes ‑ ensuring the system improves as the VLM landscape evolves. We evaluate the VLM component on the DCSASS synthesized shoplifting dataset (169 clips, controlled environment), achieving 89.5% precision and 92.8% specificity at 59.3% recall zero‑shot ‑ where the recall gap is attributable to sparse frame sampling in offline evaluation rather than VLM reasoning failures, as precision and specificity are the operationally critical metrics determining false alarm rates. We present a detailed cost model showing viability at \50‑100/month per store (3‑10x cheaper than commercial alternatives), and introduce a privacy‑preserving design that obfuscates faces in the detection pipeline. The source code is available at https://github.com/xHaileab/Paza‑AI.
Authors:Shengyu Guo, Tongrui Ye, Jianbo Zhang, Zicheng Zhang, Chunyi Li, Guangtao Zhai
Abstract:
Recent progress in Multimodal Large Language Models (MLLMs) has demonstrated remarkable advances in perception and reasoning, suggesting their potential for embodied intelligence. While recent studies have evaluated embodied MLLMs in interactive settings, current benchmarks mainly target capabilities to perceive, understand, and interact with external objects, lacking a systematic evaluation of self‑centric intelligence. To address this, we introduce MirrorBench, a simulation‑based benchmark inspired by the classical Mirror Self‑Recognition (MSR) test in psychology. MirrorBench extends this paradigm to embodied MLLMs through a tiered framework of progressively challenging tasks, assessing agents from basic visual perception to high‑level self‑representation. Experiments on leading MLLMs show that even at the lowest level, their performance remains substantially inferior to human performance, revealing fundamental limitations in self‑referential understanding. Our study bridges psychological paradigms and embodied intelligence, offering a principled framework for evaluating the emergence of general intelligence in large models. Project page: https://fflahm.github.io/mirror‑bench‑page/.
Authors:Chen Wang, Lai Wei, Yanzhi Zhang, Chenyang Shao, Zedong Dan, Weiran Huang, Ge Lan, Yue Wang
Abstract:
Recent advances in reinforcement learning (RL) have improved the reasoning capabilities of large language models (LLMs) and vision‑language models (VLMs). However, the widely used Group Relative Policy Optimization (GRPO) consistently suffers from entropy collapse, causing the policy to converge prematurely and lose diversity. Existing exploration methods introduce additional bias or variance during exploration, making it difficult to maintain optimization stability. We propose Unified Entropy Control for Reinforcement Learning (UEC‑RL), a framework that provides targeted mechanisms for exploration and stabilization. UEC‑RL activates more exploration on difficult prompts to search for potential and valuable reasoning trajectories. In parallel, a stabilizer prevents entropy from growing uncontrollably, thereby keeping training stable as the model consolidates reliable behaviors. Together, these components expand the search space when needed while maintaining robust optimization throughout training. Experiments on both LLM and VLM reasoning tasks show consistent gains over RL baselines on both Pass@1 and Pass@k. On Geometry3K, UEC‑RL achieves a 37.9% relative improvement over GRPO, indicating that it sustains effective exploration without compromising convergence and underscoring UEC‑RL as a key for scaling RL‑based reasoning in large models. Our code is available at https://github.com/597358816/UEC‑RL.
Authors:Geonhui Jang, Dongyoon Han, YoungJoon Yoo
Abstract:
Effective code generation requires both model capability and a problem representation that carefully structures how models reason and plan. Existing approaches augment reasoning steps or inject specific structure into how models think, but leave scattered problem conditions unchanged. Inspired by the way humans organize fragmented information into coherent explanations, we propose StoryCoder, a narrative reformulation framework that transforms code generation questions into coherent natural language narratives, providing richer contextual structure than simple rephrasings. Each narrative consists of three components: a task overview, constraints, and example test cases, guided by the selected algorithm and genre. Experiments across 11 models on HumanEval, LiveCodeBench, and CodeForces demonstrate consistent improvements, with an average gain of 18.7% in zero‑shot pass@10. Beyond accuracy, our analyses reveal that narrative reformulation guides models toward correct algorithmic strategies, reduces implementation errors, and induces a more modular code structure. The analyses further show that these benefits depend on narrative coherence and genre alignment, suggesting that structured problem representation is important for code generation regardless of model scale or architecture. Our code is available at https://github.com/gu‑ni/StoryCoder.
Authors:Sumit Mukherjee, Juan Shu, Nairwita Mazumder, Tate Kernell, Celena Wheeler, Shannon Hastings, Chris Sidey-Gibbons
Abstract:
Clinical value set authoring ‑‑ the task of identifying all codes in a standardized vocabulary that define a clinical concept ‑‑ is a recurring bottleneck in clinical quality measurement and phenotyping. A natural approach is to prompt a large language model (LLM) to generate the required codes directly, but structured clinical vocabularies are large, version‑controlled, and not reliably memorized during pretraining. We propose Retrieval‑Augmented Set Completion (RASC): retrieve the K most similar existing value sets from a curated corpus to form a candidate pool, then apply a classifier to each candidate code. Theoretically, retrieve‑and‑select can reduce statistical complexity by shrinking the effective output space from the full vocabulary to a much smaller retrieved candidate pool. We demonstrate the utility of RASC on 11,803 publicly available VSAC value sets, constructing the first large‑scale benchmark for this task. A cross‑encoder fine‑tuned on SAPBert achieves AUROC~0.852 and value‑set‑level F1~0.298, outperforming a simpler three‑layer Multilayer Perceptron (AUROC~0.799, F1~0.250) and both reduce the number of irrelevant candidates per true positive from 12.3 (retrieval‑only) to approximately 3.2 and 4.4 respectively. Zero‑shot GPT‑4o achieves value‑set‑level F1~0.105, with 48.6% of returned codes absent from VSAC entirely. This performance gap widens with increasing value set size, consistent with RASC's theoretical advantage. We observe similar performance gains across two other classifier model types, namely a cross‑encoder initialized from pre‑trained SAPBert and a LightGBM model, demonstrating that RASC's benefits extend beyond a single model class. The code to download and create the benchmark dataset, as well as the model training code is available at: \hrefhttps://github.com/mukhes3/RASChttps://github.com/mukhes3/RASC.
Authors:Xiping Li, Aier Yang, Jianghong Ma, Kangzhe Liu, Shanshan Feng, Haijun Zhang, Yi Zhao
Abstract:
The rapid expansion of gaming industry requires advanced recommender systems tailored to its dynamic landscape. Existing Graph Neural Network (GNN)‑based methods primarily prioritize accuracy over diversity, overlooking their inherent trade‑off. To address this, we previously proposed CPGRec, a balance‑oriented gaming recommender system. However, CPGRec fails to account for critical disparities in player‑game interactions, which carry varying significance in reflecting players' personal preferences and may exacerbate over‑smoothness issues inherent in GNN‑based models. Moreover, existing approaches underutilize the reasoning capabilities and extensive knowledge of large language models (LLMs) in addressing these limitations. To bridge this gap, we propose two new modules. First, Preference‑informed Edge Reweighting (PER) module assigns signed edge weights to qualitatively distinguish significant player interests and disinterests while then quantitatively measuring preference strength to mitigate over‑smoothing in graph convolutions. Second, Preference‑informed Representation Generation (PRG) module leverages LLMs to generate contextualized descriptions of games and players by reasoning personal preferences from comparing global and personal interests, thereby refining representations of players and games. Experiments on \textcolorblacktwo Steam datasets demonstrate CPGRec+'s superior accuracy and diversity over state‑of‑the‑art models. The code is accessible at https://github.com/HsipingLi/CPGRec‑Plus.
Authors:Pengfei Li, Shijie Wang, Fangyuan Li, Yikun Fu, Kaifeng Liu, Kaiyan Zhang, Dazhi Zhang, Yuqiang Li, Biqing Qi, Bowen Zhou
Abstract:
Reinforcement learning (RL) paradigms have demonstrated strong performance on reasoning‑intensive tasks such as code generation. However, limited trajectory diversity often leads to diminishing returns, which constrains the achievable performance ceiling. Search‑enhanced RL alleviates this issue by introducing structured exploration, which remains constrained by the single‑agent policy priors. Meanwhile, leveraging multiple interacting policies can acquire more diverse exploratory signals, but existing approaches are typically decoupled from structured search. We propose MARS^2 (Multi‑Agent Reinforced Tree‑Search Scaling), a unified RL framework in which multiple independently‑optimized agents collaborate within a shared tree‑structured search environment. MARS^2 models the search tree as a learnable multi‑agent interaction environment, enabling heterogeneous agents to collaboratively generate and refine candidate solutions within a shared search topology. To support effective learning, we introduce a path‑level group advantage formulation based on tree‑consistent reward shaping, which facilitates effective credit assignment across complex search trajectories. Experiments on code generation benchmarks show that MARS^2 consistently improves performance across diverse model combinations and training settings, demonstrating the effectiveness of coupling multi‑agent collaboration with tree search for enhancing reinforcement learning. Our code is publicly available at https://github.com/TsinghuaC3I/MARTI.
Authors:Rongyao Wang, Veronica Liesaputra, Zhiyi Huang
Abstract:
News recommender systems are devised to alleviate the information overload, attracting more and more researchers' attention in recent years. The lack of a dedicated learner‑oriented news recommendation toolkit hinders the advancement of research in news recommendation. We propose a PyTorch‑based news recommendation toolkit called NewsTorch, developed to support learners in acquiring both conceptual understanding and practical experience. This toolkit provides a modular, decoupled, and extensible framework with a learner‑friendly GUI platform that supports dataset downloading and preprocessing. It also enables training, validation, and testing of state‑of‑the‑art neural news recommendation models with standardized evaluation metrics, ensuring fair comparison and reproducible experiments. Our open‑source toolkit is released on Github: https://github.com/whonor/NewsTorch.
Authors:Jillian Fisher, Jennifer Neville, Chan Young Park
Abstract:
A common approach to personalization in large language models (LLMs) is to incorporate a subset of the user memory into the prompt at inference time to guide the model's generation. Existing methods select these subsets primarily using similarity between user memory items and input queries, ignoring how features actually affect the model's response distribution. We propose Response‑Utility optimization for Memory Selection (RUMS), a novel method that selects user memory items by measuring the mutual information between a subset of memory and the model's outputs, identifying items that reduce response uncertainty and sharpen predictions beyond semantic similarity. We demonstrate that this information‑theoretic foundation enables more principled user memory selection that aligns more closely with human selection compared to state‑of‑the‑art methods, and models 400× larger. Additionally, we show that memory items selected using RUMS result in better response quality compared to existing approaches, while having up to 95% reduction in computational cost.
Authors:Biwei Dai, Po-Wen Chang, Wahid Bhimji, Paolo Calafiura, Ragansu Chakkappai, Yuan-Tang Chou, Sascha Diefenbacher, Jordan Dudley, Ibrahim Elsharkawy, Steven Farrell, Isabelle Guyon, Chris Harris, Elham E Khoda, Benjamin Nachman, David Rousseau, Uroš Seljak, Ihsan Ullah, Yulei Zhang
Abstract:
Weak gravitational lensing, the correlated distortion of background galaxy shapes by foreground structures, is a powerful probe of the matter distribution in our universe and allows accurate constraints on the cosmological model. In recent years, high‑order statistics and machine learning (ML) techniques have been applied to weak lensing data to extract the nonlinear information beyond traditional two‑point analysis. However, these methods typically rely on cosmological simulations, which poses several challenges: simulations are computationally expensive, limiting most realistic setups to a low training data regime; inaccurate modeling of systematics in the simulations create distribution shifts that can bias cosmological parameter constraints; and varying simulation setups across studies make method comparison difficult. To address these difficulties, we present the first weak lensing benchmark dataset with several realistic systematics and launch the FAIR Universe Weak Lensing Machine Learning Uncertainty Challenge. The challenge focuses on measuring the fundamental properties of the universe from weak lensing data with limited training set and potential distribution shifts, while providing a standardized benchmark for rigorous comparison across methods. Organized in two phases, the challenge will bring together the physics and ML communities to advance the methodologies for handling systematic uncertainties, data efficiency, and distribution shifts in weak lensing analysis with ML, ultimately facilitating the deployment of ML approaches into upcoming weak lensing survey analysis.
Authors:Mohammad R. Abu Ayyash
Abstract:
We present Three‑Phase Transformer (3PT), a residual‑stream structural prior for decoder‑only Transformers on a standard SwiGLU + RMSNorm + RoPE + GQA backbone. The hidden vector is partitioned into N equally‑sized cyclic channels, each maintained by phase‑respecting ops: a per‑channel RMSNorm, a 2D Givens rotation between attention and FFN that rotates each channel by theta + i(2pi/N), and a head‑count constraint aligning GQA heads with the partition. The architecture is a self‑stabilizing equilibrium between scrambling and re‑imposition, not a bolted‑on module. The partition carves out a one‑dimensional DC subspace orthogonal to the channels, into which we inject a fixed Gabriel's horn profile r(p) = 1/(p+1) as an absolute‑position side‑channel composing orthogonally with RoPE's relative‑position rotation. The canonical N=3 borrows its metaphor from balanced three‑phase AC, where three sinusoids 120 degrees apart sum to zero with no anti‑correlated pair. At 123M parameters on WikiText‑103, 3PT achieves ‑7.20% perplexity (‑2.62% bits‑per‑byte) over a matched RoPE‑Only baseline at +1,536 parameters (0.00124% of total), with 1.93x step‑count convergence speedup (1.64x wall‑clock). N behaves as a parameter‑sharing knob rather than a unique optimum: at 5.5M an N‑sweep over 1,2,3,4,6,8,12 is near‑monotone with N=1 winning; at 123M a three‑seed sweep finds N=3 and N=1 statistically indistinguishable. The load‑bearing mechanism is the channel‑partitioned residual stream, per‑block rotation, per‑phase normalization, and horn DC injection. We characterize (a) self‑stabilization of the geometry without explicit enforcement, a novel instance of the conservation‑law framework for neural networks; (b) a U‑shaped depth profile of rotation‑angle drift at 12 layers; (c) orthogonal composition with RoPE, attention, and FFN.
Authors:Aodi Wu, Haodong Han, Xubo Luo, Ruisuo Wang, Shan He, Xue Wan
Abstract:
Autonomous on‑orbit servicing demands embodied agents that perceive through visual sensors, reason about 3D spatial situations, and execute multi‑phase tasks over extended horizons. We present SpaceMind, a modular and self‑evolving vision‑language model (VLM) agent framework that decomposes knowledge, tools, and reasoning into three independently extensible dimensions: skill modules with dynamic routing, Model Context Protocol (MCP) tools with configurable profiles, and injectable reasoning‑mode skills. An MCP‑Redis interface layer enables the same codebase to operate across simulation and physical hardware without modification, and a Skill Self‑Evolution mechanism distills operational experience into persistent skill files without model fine‑tuning. We validate SpaceMind through 192 closed‑loop runs across five satellites, three task types, and two environments, a UE5 simulation and a physical laboratory, deliberately including degraded conditions to stress‑test robustness. Under nominal conditions all modes achieve 90‑‑100% navigation success; under degradation, the Prospective mode uniquely succeeds in search‑and‑approach tasks where other modes fail. A self‑evolution study shows that the agent recovers from failure in four of six groups from a single failed episode, including complete failure to 100% success and inspection scores improving from 12 to 59 out of 100. Real‑world validation confirms zero‑code‑modification transfer to a physical robot with 100% rendezvous success. Code: https://github.com/wuaodi/SpaceMind
Authors:Guillermo Valverde, Igor García-Olaizola, Giannicola Scarpa, Alejandro Pozas-Kerstjens
Abstract:
Tensor networks were developed in the context of many‑body physics as compressed representations of multiparticle quantum states. These representations mitigate the exponential complexity of many‑body systems by capturing only the most relevant dependencies. Due to the formal similarity between quantum entanglement and statistical correlations, tensor networks have recently been integrated in machine learning, operating both as alternative learning architectures and as decompositions of components of neural networks. The expectation is that the theoretical understanding of tensor networks developed within quantum many‑body physics leads to novel methods that offer advantages in terms of computational efficiency, explainability, or privacy. Here we review the use of tensor networks in the context of machine learning, providing a critical assessment of the state of the art, the potential advantages, and the challenges that must be overcome.
Authors:Haoran Xu, Kaiwen Hu, Somayeh Sojoudi, Amy Zhang
Abstract:
We study behavior‑regularized reinforcement learning (RL), where regularization toward a reference distribution (the dataset in offline RL or the base model in LLM RL finetuning) is essential to prevent value over‑optimization caused by erroneous out‑of‑distribution extrapolation. Existing methods either rely on reparameterized policy gradient, which are difficult to scale to large generative models, or on reject sampling, which can be overly conservative when attempting to move beyond the behavior support. In this paper, we propose Value Gradient Flow (VGF), a scalable new paradigm for behavior‑regularized RL. VGF casts behavior‑regularized RL as an optimal transport problem that maps the reference distribution to the value‑induced optimal policy distribution. We solve this transport problem via discrete gradient flow, where value gradients guide particles initialized from the reference distribution. Our analysis shows that VGF imposes regularization implicitly by controlling the transport budget. VGF eliminates explicit policy parameterization while remaining expressive and flexible, this enables adaptive test‑time scaling by adjusting the transport budget. Extensive experiments demonstrate that VGF significantly outperforms prior methods, achieving state‑of‑the‑art results on offline RL benchmarks (D4RL, OGBench) and LLM RL tasks. Code and runs can be found at https://ryanxhr.github.io/vgf.
Authors:Zhuofeng Li, Yi Lu, Dongfu Jiang, Haoxiang Zhang, Yuyang Bai, Chuan Li, Yu Wang, Shuiwang Ji, Jianwen Xie, Yu Zhang
Abstract:
The rapid rise in AI conference submissions has driven increasing exploration of large language models (LLMs) for peer review support. However, LLM‑based reviewers often generate superficial, formulaic comments lacking substantive, evidence‑grounded feedback. We attribute this to the underutilization of two key components of human reviewing: explicit rubrics and contextual grounding in existing work. To address this, we introduce REVIEWBENCH, a benchmark evaluating review text according to paper‑specific rubrics derived from official guidelines, the paper's content, and human‑written reviews. We further propose REVIEWGROUNDER, a rubric‑guided, tool‑integrated multi‑agent framework that decomposes reviewing into drafting and grounding stages, enriching shallow drafts via targeted evidence consolidation. Experiments on REVIEWBENCH show that REVIEWGROUNDER, using a Phi‑4‑14B‑based drafter and a GPT‑OSS‑120B‑based grounding stage, consistently outperforms baselines with substantially stronger/larger backbones (e.g., GPT‑4.1 and DeepSeek‑R1‑670B) in both alignment with human judgments and rubric‑based review quality across 8 dimensions. The code is available \hrefhttps://github.com/EigenTom/ReviewGrounderhere.
Authors:Jiacheng Liu, Xiaohan Zhao, Xinyi Shang, Zhiqiang Shen
Abstract:
Claude Code is an agentic coding tool that can run shell commands, edit files, and call external services on behalf of the user. This study describes its comprehensive architecture by analyzing the publicly available TypeScript source code and further comparing it with OpenClaw, an independent open‑source AI agent system that answers many of the same design questions from a different deployment context. Our analysis identifies five human values, philosophies, and needs that motivate the architecture (human decision authority, safety and security, reliable execution, capability amplification, and contextual adaptability) and traces them through thirteen design principles to specific implementation choices. The core of the system is a simple while‑loop that calls the model, runs tools, and repeats. Most of the code, however, lives in the systems around this loop: a permission system with seven modes and an ML‑based classifier, a five‑layer compaction pipeline for context management, four extensibility mechanisms (MCP, plugins, skills, and hooks), a subagent delegation mechanism with worktree isolation, and append‑oriented session storage. A comparison with OpenClaw, a multi‑channel personal assistant gateway, shows that the same recurring design questions produce different architectural answers when the deployment context changes: from per‑action safety classification to perimeter‑level access control, from a single CLI loop to an embedded runtime within a gateway control plane, and from context‑window extensions to gateway‑wide capability registration. We finally identify six open design directions for future agent systems, grounded in recent empirical, architectural, and policy literature.
Authors:Ashmi Banerjee, Adithi Satish, Wolfgang Wörndl, Yashar Deldjoo
Abstract:
Traditional conversational travel recommender systems primarily optimize for user relevance and convenience, often reinforcing popular, overcrowded destinations and carbon‑intensive travel choices. To address this, we present TRACE (Tourism Recommendation with Agentic Counterfactual Explanations), a multi‑agent, LLM‑based framework that promotes sustainable tourism through interactive nudging. TRACE uses a modular orchestrator‑worker architecture where specialized agents elicit latent sustainability preferences, construct structured user personas, and generate recommendations that balance relevance with environmental impact. A key innovation lies in its use of agentic counterfactual explanations and LLM‑driven clarifying questions, which together surface greener alternatives and refine understanding of intent, fostering user reflection without coercion. User studies and semantic alignment analyses demonstrate that TRACE effectively supports sustainable decision‑making while preserving recommendation quality and interactive responsiveness. TRACE is implemented on Google's Agent Development Kit, with full code, Docker setup, prompts, and a publicly available demo video to ensure reproducibility. A project summary, including all resources, prompts, and demo access, is available at https://ashmibanerjee.github.io/trace‑chatbot.
Authors:Samir Wagle, Reewaj Khanal, Abiral Adhikari
Abstract:
Hate speech detection in Devanagari‑scripted social media memes presents compounded challenges: multimodal content structure, script‑specific linguistic complexity, and extreme data scarcity in low‑resource settings. This paper presents our system for the CHiPSAL 2026 shared task, addressing both Subtask A (binary hate speech detection) and Subtask B (three‑class sentiment classification: positive, neutral, negative). We propose a hybrid cross‑modal attention fusion architecture that combines CLIP (ViT‑B/32) for visual encoding with BGE‑M3 for multilingual text representation, connected through 4‑head self‑attention and a learnable gating network that dynamically weights modality contributions on a per‑sample basis. Systematic evaluation across eight model configurations demonstrates that explicit cross‑modal reasoning achieves a 5.9% F1‑macro improvement over text‑only baselines on Subtask A, while uncovering two unexpected but critical findings: English‑centric vision models exhibit near‑random performance on Devanagari script, and standard ensemble methods catastrophically degrade under data scarcity (N nearly equal to 850 per fold) due to correlated overfitting. The code can be accessed at https://github.com/Tri‑Yantra‑Technologies/MEME‑Fusion/
Authors:Pu Cheng, Juncheng Liu, Yunshen Long
Abstract:
Predicting real‑world events from live market signals demands systems that fuse qualitative news with quantitative order‑book dynamics under strict temporal discipline ‑‑ a challenge existing benchmarks fail to capture. We present PolyBench, a multimodal benchmark derived from Polymarket that records point‑in‑time cross‑sections of 38,666 binary prediction markets spanning 4,997 events, synchronously coupling each snapshot with a Central Limit Order Book (CLOB) state and a real‑time news stream. Using PolyBench, we evaluate seven state‑of‑the‑art Large Language Models ‑‑ spanning open‑ and closed‑source families ‑‑ generating 36,165 predictions under identical, timestamp‑locked market states collected between February 6 and 12, 2026. Our multidimensional framework assesses directional accuracy, our proposed Confidence‑Weighted Return (CWR), Annualized Percentage Yield (APY), and Sharpe ratio via realistic order‑book execution simulation. The results reveal a pronounced performance divergence: only two of seven models achieve positive financial returns ‑‑ MiMo‑V2‑Flash at 17.6% CWR and Gemini‑3‑Flash at 6.2% CWR ‑‑ while the remaining five incur losses despite uniformly high stated confidence. These findings highlight the gap between surface‑level language fluency and genuine probabilistic reasoning under live market uncertainty, and establish PolyBench as a contamination‑proof, financially‑grounded evaluation standard for future LLM research. Our dataset and code available at \underline\hrefhttps://github.com/PolyBench/PolyBenchhttps://github.com/PolyBench/PolyBench.
Authors:Junhong Liang, Yifan Lu, Ekaterina Kochmar, Fajri Koto
Abstract:
Grammatical error correction (GEC) and explanation (GEE) have made rapid progress, but real teaching scenarios also require \emphlearner‑friendly pedagogical feedback that is actionable, level‑appropriate, and encouraging. We introduce SPFG (Spoken Pedagogical Feedback Generation), a dataset built based on the Speak \& Improve Challenge 2025 corpus, pairing fluency‑oriented transcriptions with GEC targets and \emphhuman‑verified teacher‑style feedback, including preferred/rejected feedback pairs for preference learning. We study a transcript‑based Spoken Grammatical Error Correction (SGEC) setting and evaluate three instruction‑tuned LLMs (Qwen2.5, Llama‑3.1, and GLM‑4), comparing supervised fine‑tuning (SFT) with preference‑based alignment (using DPO and KTO) for jointly generating corrections and feedback. Results show that SFT provides the most consistent improvements, while DPO/KTO yield smaller or mixed gains, and that correction quality and feedback quality are weakly coupled. Our implementation is available at https://github.com/Skywalker‑Harrison/spfg.
Authors:Haiyang Zheng, Nan Pu, Yaqi Cai, Teng Long, Wenjing Li, Nicu Sebe, Zhun Zhong
Abstract:
Generalized Category Discovery (GCD) leverages labeled data to categorize unlabeled samples from known or unknown classes. Most previous methods jointly optimize supervised and unsupervised objectives and achieve promising results. However, inherent optimization interference still limits their ability to improve further. Through quantitative analysis, we identify a key issue, i.e., gradient entanglement, which 1) distorts supervised gradients and weakens discrimination among known classes, and 2) induces representation‑subspace overlap between known and novel classes, reducing the separability of novel categories. To address this issue, we propose the Energy‑Aware Gradient Coordinator (EAGC), a plug‑and‑play gradient‑level module that explicitly regulates the optimization process. EAGC comprises two components: Anchor‑based Gradient Alignment (AGA) and Energy‑aware Elastic Projection (EEP). AGA introduces a reference model to anchor the gradient directions of labeled samples, preserving the discriminative structure of known classes against the interference of unlabeled gradients. EEP softly projects unlabeled gradients onto the complement of the known‑class subspace and derives an energy‑based coefficient to adaptively scale the projection for each unlabeled sample according to its degree of alignment with the known subspace, thereby reducing subspace overlap without suppressing unlabeled samples that likely belong to known classes. Experiments show that EAGC consistently boosts existing methods and establishes new state‑of‑the‑art results. Code is available at https://haiyangzheng.github.io/EAGC.
Authors:Baocai Shan, Yuzhuang Xu, Wanxiang Che
Abstract:
Mobile input method editors (IMEs) are the primary interface for text input, yet they remain constrained to manual typing and struggle to produce personalized text. While lightweight large language models (LLMs) make on‑device auxiliary generation feasible, enabling deeply personalized, privacy‑preserving, and real‑time generative IMEs poses fundamental challenges.To this end, we present HUOZIIME, a personalized on‑device IME powered by LLM. We endow HUOZIIME with initial human‑like prediction ability by post‑training a base LLM on synthesized personalization data. Notably, a hierarchical memory mechanism is designed to continually capture and leverage user‑specific input history. Furthermore, we perform systemic optimizations tailored to on‑device LLMbased IME deployment, ensuring efficient and responsive operation under mobile constraints.Experiments demonstrate efficient on‑device execution and high‑fidelity memory‑driven personalization. Code and package are available at https://github.com/Shan‑HIT/HuoziIME.
Authors:Yuqiao Tan, Minzheng Wang, Bo Liu, Zichen Liu, Tian Liang, Shizhu He, Jun Zhao, Kang Liu
Abstract:
While reinforcement learning with verifiable rewards (RLVR) significantly enhances LLM reasoning by optimizing the conditional distribution P(y|x), its potential is fundamentally bounded by the base model's existing output distribution. Optimizing the marginal distribution P(y) in the Pre‑train Space addresses this bottleneck by encoding reasoning ability and preserving broad exploration capacity. Yet, conventional pre‑training relies on static corpora for passive learning, leading to a distribution shift that hinders targeted reasoning enhancement. In this paper, we introduce PreRL (Pre‑train Space RL), which applies reward‑driven online updates directly to P(y). We theoretically and empirically validate the strong gradient alignment between log P(y) and log P(y|x), establishing PreRL as a viable surrogate for standard RL. Furthermore, we uncover a critical mechanism: Negative Sample Reinforcement (NSR) within PreRL serves as an exceptionally effective driver for reasoning. NSR‑PreRL rapidly prunes incorrect reasoning spaces while stimulating endogenous reflective behaviors, increasing transition and reflection thoughts by 14.89x and 6.54x, respectively. Leveraging these insights, we propose Dual Space RL (DSRL), a Policy Reincarnation strategy that initializes models with NSR‑PreRL to expand the reasoning horizon before transitioning to standard RL for fine‑grained optimization. Extensive experiments demonstrate that DSRL consistently outperforms strong baselines, proving that pre‑train space pruning effectively steers the policy toward a refined correct reasoning subspace.
Authors:Itay Itzhak, Eliya Habba, Gabriel Stanovsky, Yonatan Belinkov
Abstract:
Evaluating LLMs is challenging, as benchmark scores often fail to capture models' real‑world usefulness. Instead, users often rely on ``vibe‑testing'': informal experience‑based evaluation, such as comparing models on coding tasks related to their own workflow. While prevalent, vibe‑testing is often too ad hoc and unstructured to analyze or reproduce at scale. In this work, we study how vibe‑testing works in practice and then formalize it to support systematic analysis. We first analyze two empirical resources: (1) a survey of user evaluation practices, and (2) a collection of in‑the‑wild model comparison reports from blogs and social media. Based on these resources, we formalize vibe‑testing as a two‑part process: users personalize both what they test and how they judge responses. We then introduce a proof‑of‑concept evaluation pipeline that follows this formulation by generating personalized prompts and comparing model outputs using user‑aware subjective criteria. In experiments on coding benchmarks, we find that combining personalized prompts and user‑aware evaluation can change which model is preferred, reflecting the role of vibe‑testing in practice. These findings suggest that formalized vibe‑testing can serve as a useful approach for bridging benchmark scores and real‑world experience.
Authors:Tianshuo Yang, Guanyu Chen, Yutian Chen, Zhixuan Liang, Yitian Liu, Zanxin Chen, Chunpu Xu, Haotian Liang, Jiangmiao Pang, Yao Mu, Ping Luo
Abstract:
While end‑to‑end Vision‑Language‑Action (VLA) models offer a promising paradigm for robotic manipulation, fine‑tuning them on narrow control data often compromises the profound reasoning capabilities inherited from their base Vision‑Language Models (VLMs). To resolve this fundamental trade‑off, we propose HiVLA, a visual‑grounded‑centric hierarchical framework that explicitly decouples high‑level semantic planning from low‑level motor control. In high‑level part, a VLM planner first performs task decomposition and visual grounding to generate structured plans, comprising a subtask instruction and a precise target bounding box. Then, to translate this plan into physical actions, we introduce a flow‑matching Diffusion Transformer (DiT) action expert in low‑level part equipped with a novel cascaded cross‑attention mechanism. This design sequentially fuses global context, high‑resolution object‑centric crops and skill semantics, enabling the DiT to focus purely on robust execution. Our decoupled architecture preserves the VLM's zero‑shot reasoning while allowing independent improvement of both components. Extensive experiments in simulation and the real world demonstrate that HiVLA significantly outperforms state‑of‑the‑art end‑to‑end baselines, particularly excelling in long‑horizon skill composition and the fine‑grained manipulation of small objects in cluttered scenes.
Authors:Fei Tang, Bofan Chen, Zhengxi Lu, Tongbo Chen, Songqin Nong, Tao Jiang, Wenhao Xu, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen
Abstract:
GUI grounding, which localizes interface elements from screenshots given natural language queries, remains challenging for small icons and dense layouts. Test‑time zoom‑in methods improve localization by cropping and re‑running inference at higher resolution, but apply cropping uniformly across all instances with fixed crop sizes, ignoring whether the model is actually uncertain on each case. We propose UI‑Zoomer, a training‑free adaptive zoom‑in framework that treats both the trigger and scale of zoom‑in as a prediction uncertainty quantification problem. A confidence‑aware gate fuses spatial consensus among stochastic candidates with token‑level generation confidence to selectively trigger zoom‑in only when localization is uncertain. When triggered, an uncertainty‑driven crop sizing module decomposes prediction variance into inter‑sample positional spread and intra‑sample box extent, deriving a per‑instance crop radius via the law of total variance. Extensive experiments on ScreenSpot‑Pro, UI‑Vision, and ScreenSpot‑v2 demonstrate consistent improvements over strong baselines across multiple model architectures, achieving gains of up to +13.4%, +10.3%, and +4.2% respectively, with no additional training required.
Authors:Ziming Wang
Abstract:
We present UMI‑3D, a multimodal extension of the Universal Manipulation Interface (UMI) for robust and scalable data collection in embodied manipulation. While UMI enables portable, wrist‑mounted data acquisition, its reliance on monocular visual SLAM makes it vulnerable to occlusions, dynamic scenes, and tracking failures, limiting its applicability in real‑world environments. UMI‑3D addresses these limitations by introducing a lightweight and low‑cost LiDAR sensor tightly integrated into the wrist‑mounted interface, enabling LiDAR‑centric SLAM with accurate metric‑scale pose estimation under challenging conditions. We further develop a hardware‑synchronized multimodal sensing pipeline and a unified spatiotemporal calibration framework that aligns visual observations with LiDAR point clouds, producing consistent 3D representations of demonstrations. Despite maintaining the original 2D visuomotor policy formulation, UMI‑3D significantly improves the quality and reliability of collected data, which directly translates into enhanced policy performance. Extensive real‑world experiments demonstrate that UMI‑3D not only achieves high success rates on standard manipulation tasks, but also enables learning of tasks that are challenging or infeasible for the original vision‑only UMI setup, including large deformable object manipulation and articulated object operation. The system supports an end‑to‑end pipeline for data acquisition, alignment, training, and deployment, while preserving the portability and accessibility of the original UMI. All hardware and software components are open‑sourced to facilitate large‑scale data collection and accelerate research in embodied intelligence: \hrefhttps://umi‑3d.github.iohttps://umi‑3d.github.io.
Authors:Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang, Alborz Geramifard
Abstract:
On‑policy knowledge distillation (OPD) trains a student on its own rollouts under token‑level supervision from a teacher. Not all token positions matter equally, but existing views of token importance are incomplete. We ask a direct question: which tokens carry the most useful learning signal in OPD? Our answer is that informative tokens come from two regions: positions with high student entropy, and positions with low student entropy plus high teacher‑‑student divergence, where the student is overconfident and wrong. Empirically, student entropy is a strong first‑order proxy: retaining 50% of tokens with entropy‑based sampling matches or exceeds all‑token training while reducing peak memory by up to 47%. But entropy alone misses a second important region. When we isolate low‑entropy, high‑divergence tokens, training on fewer than 10% of all tokens nearly matches full‑token baselines, showing that overconfident tokens carry dense corrective signal despite being nearly invisible to entropy‑only rules. We organize these findings with TIP (Token Importance in on‑Policy distillation), a two‑axis taxonomy over student entropy and teacher‑‑student divergence, and give a theoretical explanation for why entropy is useful yet structurally incomplete. This view motivates type‑aware token selection rules that combine uncertainty and disagreement. We validate this picture across three teacher‑‑student pairs spanning Qwen3, Llama, and Qwen2.5 on MATH‑500 and AIME 2024/2025, and on the DeepPlanning benchmark for long‑horizon agentic planning, where Q3‑only training on <20% of tokens surpasses full‑token OPD. Our experiments are implemented by extending the OPD repository https://github.com/HJSang/OPSD_OnPolicyDistillation, which supports memory‑efficient distillation of larger models under limited GPU budgets.
Authors:Weijie Wang, Qihang Cao, Sensen Gao, Donny Y. Chen, Haofei Xu, Wenjing Bian, Songyou Peng, Tat-Jen Cham, Chuanxia Zheng, Andreas Geiger, Jianfei Cai, Jia-Wang Bian, Bohan Zhuang
Abstract:
Reconstructing 3D representations from 2D inputs is a fundamental task in computer vision and graphics, serving as a cornerstone for understanding and interacting with the physical world. While traditional methods achieve high fidelity, they are limited by slow per‑scene optimization or category‑specific training, which hinders their practical deployment and scalability. Hence, generalizable feed‑forward 3D reconstruction has witnessed rapid development in recent years. By learning a model that maps images directly to 3D representations in a single forward pass, these methods enable efficient reconstruction and robust cross‑scene generalization. Our survey is motivated by a critical observation: despite the diverse geometric output representations, ranging from implicit fields to explicit primitives, existing feed‑forward approaches share similar high‑level architectural patterns, such as image feature extraction backbones, multi‑view information fusion mechanisms, and geometry‑aware design principles. Consequently, we abstract away from these representation differences and instead focus on model design, proposing a novel taxonomy centered on model design strategies that are agnostic to the output format. Our proposed taxonomy organizes the research directions into five key problems that drive recent research development: feature enhancement, geometry awareness, model efficiency, augmentation strategies and temporal‑aware models. To support this taxonomy with empirical grounding and standardized evaluation, we further comprehensively review related benchmarks and datasets, and extensively discuss and categorize real‑world applications based on feed‑forward 3D models. Finally, we outline future directions to address open challenges such as scalability, evaluation standards, and world modeling.
Authors:Arya Shah, Vaibhav Tripathi, Mayank Singh, Chaklam Silpasuwanchai
Abstract:
Vision‑language models are increasingly deployed in high‑stakes settings, yet their susceptibility to sycophantic manipulation remains poorly understood, particularly in relation to how these models represent visual information internally. Whether models whose visual representations more closely mirror human neural processing are also more resistant to adversarial pressure is an open question with implications for both neuroscience and AI safety. We investigate this question by evaluating 12 open‑weight vision‑language models spanning 6 architecture families and a 40× parameter range (256M‑‑10B) along two axes: brain alignment, measured by predicting fMRI responses from the Natural Scenes Dataset across 8 human subjects and 6 visual cortex regions of interest, and sycophancy, measured through 76,800 two‑turn gaslighting prompts spanning 5 categories and 10 difficulty levels. Region‑of‑interest analysis reveals that alignment specifically in early visual cortex (V1‑‑V3) is a reliable negative predictor of sycophancy (r = ‑0.441, BCa 95% CI [‑0.740, ‑0.031]), with all 12 leave‑one‑out correlations negative and the strongest effect for existence denial attacks (r = ‑0.597, p = 0.040). This anatomically specific relationship is absent in higher‑order category‑selective regions, suggesting that faithful low‑level visual encoding provides a measurable anchor against adversarial linguistic override in vision‑language models. We release our code on \hrefhttps://github.com/aryashah2k/Gaslight‑Gatekeep‑Sycophantic‑ManipulationGitHub and dataset on \hrefhttps://huggingface.co/datasets/aryashah00/Gaslight‑Gatekeep‑V1‑V3Hugging Face
Authors:Geonhee Ahn, Donghyun Lee, Hayoung Doo, Jonggeol Na, Hyunsoo Cho, Sookyung Kim
Abstract:
Large language models (LLMs) have enabled agentic AI systems for scientific discovery, but most approaches remain limited to textbased reasoning without automated experimental verification. We propose MIND, an LLM‑driven framework for automated hypothesis validation in materials research. MIND organizes the scientific discovery process into hypothesis refinement, experimentation, and debate‑based validation within a multi‑agent pipeline. For experimental verification, the system integrates Machine Learning Interatomic Potentials, particularly SevenNet‑Omni, enabling scalable in‑silico experiments. We also provide a web‑based user interface for automated hypothesis testing. The modular design allows additional experimental modules to be integrated, making the framework adaptable to broader scientific workflows. The code is available at: https://github.com/IMMS‑Ewha/MIND, and a demonstration video at: https://youtu.be/lqiFe1OQzN4.
Authors:Yunkai Dang, Minxin Dai, Yuekun Yang, Zhangnan Li, Wenbin Li, Feng Miao, Yang Gao
Abstract:
Ultra‑high‑resolution (UHR) remote sensing imagery couples kilometer‑scale context with query‑critical evidence that may occupy only a few pixels. Such vast spatial scale leads to a quadratic explosion of visual tokens and hinders the extraction of information from small objects. Previous works utilize direct downsampling, dense tiling, or global top‑k pruning, which either compromise query‑critical image details or incur unpredictable compute. In this paper, we propose UHR‑BAT, a query‑guided and region‑faithful token compression framework to efficiently select visual tokens under a strict context budget. Specifically, we leverage text‑guided, multi‑scale importance estimation for visual tokens, effectively tackling the challenge of achieving precise yet low‑cost feature extraction. Furthermore, by introducing region‑wise preserve and merge strategies, we mitigate visual token redundancy, further driving down the computational budget. Experimental results show that UHR‑BAT achieves state‑of‑the‑art performance across various benchmarks. Code will be available at https://github.com/Yunkaidang/UHR.
Authors:Kaiwen Zheng, Kai Zhou, Jinwu Hu, Te Gu, Mingkai Peng, Fei Liu
Abstract:
Large language models (LLMs) demonstrate strong reasoning capabilities, but their performance often degrades under distribution shift. Existing test‑time adaptation (TTA) methods rely on gradient‑based updates that require white‑box access and need substantial overhead, while training‑free alternatives are either static or depend on external guidance. In this paper, we propose Training‑Free Test‑Time Contrastive Learning TF‑TTCL, a training‑free adaptation framework that enables a frozen LLM to improve online by distilling supervision from its own inference experiences. Specifically, TF‑TTCL implements a dynamic "Explore‑Reflect‑Steer" loop through three core modules: 1) Semantic Query Augmentation first diversifies problem views via multi‑agent role‑playing to generate different reasoning trajectories; 2) Contrastive Experience Distillation then captures the semantic gap between superior and inferior trajectories, distilling them into explicit textual rules; and 3) Contextual Rule Retrieval finally activates these stored rules during inference to dynamically steer the frozen LLM toward robust reasoning patterns while avoiding observed errors. Extensive experiments on closed‑ended reasoning tasks and open‑ended evaluation tasks demonstrate that TF‑TTCL consistently outperforms strong zero‑shot baselines and representative TTA methods under online evaluation. Code is available at https://github.com/KevinSCUTer/TF‑TTCL.
Authors:Zijian Zhao, Jing Gao, Sen Li
Abstract:
Cooperative multi‑agent reinforcement learning (MARL) is widely used to address large joint observation and action spaces by decomposing a centralized control problem into multiple interacting agents. However, such decomposition often introduces additional challenges, including non‑stationarity, unstable training, weak coordination, and limited theoretical guarantees. In this paper, we propose the Consensus Multi‑Agent Transformer (CMAT), a centralized framework that bridges cooperative MARL to a hierarchical single‑agent reinforcement learning (SARL) formulation. CMAT treats all agents as a unified entity and employs a Transformer encoder to process the large joint observation space. To handle the extensive joint action space, we introduce a hierarchical decision‑making mechanism in which a Transformer decoder autoregressively generates a high‑level consensus vector, simulating the process by which agents reach agreement on their strategies in latent space. Conditioned on this consensus, all agents generate their actions simultaneously, enabling order‑independent joint decision making and avoiding the sensitivity to action‑generation order in conventional Multi‑Agent Transformers (MAT). This factorization allows the joint policy to be optimized using single‑agent PPO while preserving expressive coordination through the latent consensus. To evaluate the proposed method, we conduct experiments on benchmark tasks from StarCraft II, Multi‑Agent MuJoCo, and Google Research Football. The results show that CMAT achieves superior performance over recent centralized solutions, sequential MARL methods, and conventional MARL baselines. The code for this paper is available at:https://github.com/RS2002/CMAT .
Authors:Mohammed Ezzaldin Babiker Abdullah
Abstract:
Turbofan engine degradation under sustained operational stress necessitates robust prognostic systems capable of accurately estimating the Remaining Useful Life (RUL) of critical components. Existing deep learning approaches frequently fail to simultaneously capture multi‑sensor spatial correlations and long‑range temporal dependencies, while standard symmetric loss functions inadequately penalize the safety‑critical error of over‑estimating residual life. This study proposes a hybrid architecture integrating Twin‑Stage One‑Dimensional Convolutional Neural Networks (1D‑CNN), a Bidirectional Long Short‑Term Memory (BiLSTM) network, and a custom Bahdanau Additive Attention mechanism. The model was trained and evaluated on the NASA Commercial Modular Aero‑Propulsion System Simulation (C‑MAPSS) FD001 sub‑dataset employing a zero‑leakage preprocessing pipeline, piecewise‑linear RUL labeling capped at 130 cycles, and the NASA‑specified asymmetric exponential loss function that disproportionately penalizes over‑estimation to enforce industrial safety constraints. Experiments on 100 test engines achieved a Root Mean Squared Error (RMSE) of 17.52 cycles and a NASA S‑Score of 922.06. Furthermore, extracted attention weight heatmaps provide interpretable, per‑engine insights into the temporal progression of degradation, supporting informed maintenance decision‑making. The proposed framework demonstrates competitive performance against established baselines and offers a principled approach to safe, interpretable prognostics in industrial settings.
Authors:Mohammed Ezzaldin Babiker Abdullah, Rufaidah Abdallah Ibrahim Mohammed
Abstract:
Accurate Global Horizontal Irradiance (GHI) forecasting is critical for grid stability, particularly in arid regions characterized by rapid aerosol fluctuations. While recent trends favor computationally expensive Transformer‑based architectures, this paper challenges the prevailing "complexity‑first" paradigm. We propose a lightweight, Physics‑Informed Hybrid CNN‑BiLSTM framework that prioritizes domain knowledge over architectural depth. The model integrates a Convolutional Neural Network (CNN) for spatial feature extraction with a Bi‑Directional LSTM for capturing temporal dependencies. Unlike standard data‑driven approaches, our model is explicitly guided by a vector of 15 engineered features including Clear‑Sky indices and Solar Zenith Angle ‑ rather than relying solely on raw historical data. Hyperparameters are rigorously tuned using Bayesian Optimization to ensure global optimality. Experimental validation using NASA POWER data in Sudan demonstrates that our physics‑guided approach achieves a Root Mean Square Error (RMSE) of 19.53 W/m^2, significantly outperforming complex attention‑based baselines (RMSE 30.64 W/m^2). These results confirm a "Complexity Paradox": in high‑noise meteorological tasks, explicit physical constraints offer a more efficient and accurate alternative to self‑attention mechanisms. The findings advocate for a shift towards hybrid, physics‑aware AI for real‑time renewable energy management.
Authors:Jason Kong, Nilesh Prasad Pandey, Flavio Ponzina, Tajana Rosing
Abstract:
Deploying Large Language Models (LLMs) on edge devices faces severe computational and memory constraints, limiting real‑time processing and on‑device intelligence. Hybrid architectures combining Structured State Space Models (SSMs) with transformer‑based LLMs offer a balance of efficiency and performance. Aggressive quantization can drastically cut model size and speed up inference, but its uneven effects on different components require careful management. In this work, we propose a lightweight, backpropagation‑free, surrogate‑based sensitivity analysis framework to identify hybrid SSM‑Transformer components most susceptible to quantization‑induced degradation. Relying solely on forward‑pass metrics, our method avoids expensive gradient computations and retraining, making it suitable for situations where access to in‑domain data is limited due to proprietary restrictions or privacy constraints. We also provide a formal analysis showing that the Kullback‑Leibler (KL) divergence metric better captures quantization sensitivity for Language modeling tasks than widely adopted alternatives such as mean squared error (MSE) and signal‑to‑quantization‑noise ratio (SQNR). Through extensive experiments on SSM and hybrid architectures, our ablation studies confirm that KL‑based rankings align with observed performance drops and outperform alternative metrics. This framework enables the practical deployment of advanced hybrid models on resource‑constrained edge devices with minimal accuracy loss. We further validate our approach with real‑world on‑device profiling on Intel Lunar Lake hardware, demonstrating that KL‑guided mixed‑precision achieves near‑FP16 perplexity with model sizes and throughput competitive with Uniform INT4 on both CPU and GPU execution modes. Code is available at https://github.com/jasonkongie/kl‑ssm‑quant.
Authors:Simin Huo, Ning Li
Abstract:
Token compression is crucial for mitigating the quadratic complexity of self‑attention mechanisms in Vision Transformers (ViTs), which often involve numerous input tokens. Existing methods, such as ToMe, rely on GPU‑inefficient operations (e.g., sorting, scattered writes), introducing overheads that limit their effectiveness. We introduce MaMe, a training‑free, differentiable token merging method based entirely on matrix operations, which is GPU‑friendly to accelerate ViTs. Additionally, we present MaRe, its inverse operation, for token restoration, forming a MaMe+MaRe pipeline for image synthesis. When applied to pre‑trained models, MaMe doubles ViT‑B throughput with a 2% accuracy drop. Notably, fine‑tuning the last layer with MaMe boosts ViT‑B accuracy by 1.0% at 1.1x speed. In SigLIP2‑B@512 zero‑shot classification, MaMe provides 1.3x acceleration with negligible performance degradation. In video tasks, MaMe accelerates VideoMAE‑L by 48.5% on Kinetics‑400 with only a 0.84% accuracy loss. Furthermore, MaMe achieves simultaneous improvements in both performance and speed on some tasks. In image synthesis, the MaMe+MaRe pipeline enhances quality while reducing Stable Diffusion v2.1 generation latency by 31%. Collectively, these results demonstrate MaMe's and MaRe's effectiveness in accelerating vision models. The code is available at https://github.com/cominder/mamehttps://github.com/cominder/mame.
Authors:Zhaoyang Wang, Qianhui Wu, Xuchao Zhang, Chaoyun Zhang, Wenlin Yao, Fazle Elahi Faisal, Baolin Peng, Si Qin, Suman Nath, Qingwei Lin, Chetan Bansal, Dongmei Zhang, Saravan Rajmohan, Jianfeng Gao, Huaxiu Yao
Abstract:
Autonomous web agents powered by large language models (LLMs) have shown promise in completing complex browser tasks, yet they still struggle with long‑horizon workflows. A key bottleneck is the grounding gap in existing skill formulations: textual workflow skills provide natural language guidance but cannot be directly executed, while code‑based skills are executable but opaque to the agent, offering no step‑level understanding for error recovery or adaptation. We introduce WebXSkill, a framework that bridges this gap with executable skills, each pairing a parameterized action program with step‑level natural language guidance, enabling both direct execution and agent‑driven adaptation. WebXSkill operates in three stages: skill extraction mines reusable action subsequences from readily available synthetic agent trajectories and abstracts them into parameterized skills, skill organization indexes skills into a URL‑based graph for context‑aware retrieval, and skill deployment exposes two complementary modes, grounded mode for fully automated multi‑step execution and guided mode where skills serve as step‑by‑step instructions that the agent follows with its native planning. On WebArena and WebVoyager, WebXSkill improves task success rate by up to 9.8 and 12.9 points over the baseline, respectively, demonstrating the effectiveness of executable skills for web agents. The code is publicly available at https://github.com/aiming‑lab/WebXSkill.
Authors:Shivam Chand Kaushik
Abstract:
Semiconductor failure analysis (FA) requires engineers to examine inspection images, correlate equipment telemetry, consult historical defect records, and write structured reports, a process that can consume several hours of expert time per case. We present SemiFA, an agentic multi‑modal framework that autonomously generates structured FA reports from semiconductor inspection images in under one minute. SemiFA decomposes FA into a four‑agent LangGraph pipeline: a DefectDescriber that classifies and narrates defect morphology using DINOv2 and LLaVA‑1.6, a RootCauseAnalyzer that fuses SECS/GEM equipment telemetry with historically similar defects retrieved from a Qdrant vector database, a SeverityClassifier that assigns severity and estimates yield impact, and a RecipeAdvisor that proposes corrective process adjustments. A fifth node assembles a PDF report. We introduce SemiFA‑930, a dataset of 930 annotated semiconductor defect images paired with structured FA narratives across nine defect classes, drawn from procedural synthesis, WM‑811K, and MixedWM38. Our DINOv2‑based classifier achieves 92.1% accuracy on 140 validation images (macro F1 = 0.917), and the full pipeline produces complete FA reports in 48 seconds on an NVIDIA A100‑SXM4‑40 GB GPU. A GPT‑4o judge ablation across four modality conditions demonstrates that multi‑modal fusion improves root cause reasoning by +0.86 composite points (1‑5 scale) over an image‑only baseline, with equipment telemetry as the more load‑bearing modality. To our knowledge, SemiFA is the first system to integrate SECS/GEM equipment telemetry into a vision‑language model pipeline for autonomous FA report generation.
Authors:Jaden Park, Jungtaek Kim, Jongwon Jeong, Robert D. Nowak, Kangwook Lee, Yong Jae Lee
Abstract:
Language Model (LM) agents are increasingly used in complex open‑ended decision‑making tasks, from AI coding to physical AI. A core requirement in these settings is the ability to both explore the problem space and exploit acquired knowledge effectively. However, systematically distinguishing and quantifying exploration and exploitation from observed actions without access to the agent's internal policy remains challenging. To address this, we design controllable environments inspired by practical embodied AI scenarios. Each environment consists of a partially observable 2D grid map and an unknown task Directed Acyclic Graph (DAG). The map generation can be programmatically adjusted to emphasize exploration or exploitation difficulty. To enable policy‑agnostic evaluation, we design a metric to quantify exploration and exploitation errors from agent's actions. We evaluate a variety of frontier LM agents and find that even state‑of‑the‑art models struggle on our task, with different models exhibiting distinct failure modes. We further observe that reasoning models solve the task more effectively and show both exploration and exploitation can be significantly improved through minimal harness engineering. We release our code \hrefhttps://github.com/jjj‑madison/measurable‑explore‑exploithere.
Authors:Rajesh Kumar, Waqar Ali, Junaid Ahmed, Najma Imtiaz Ali, Shaban Usman
Abstract:
Large language models generate plausible code but cannot verify correctness. Existing multi‑agent systems simulate execution or leave verification optional. We introduce execution‑grounded verification as a first‑class principle: every code change must survive sandboxed execution before propagation. We instantiate this principle in AGENTFORGE, a multi‑agent framework where Planner, Coder, Tester, Debugger, and Critic agents coordinate through shared memory and a mandatory Docker sandbox. We formalize software engineering with LLMs as an iterative decision process over repository states, where execution feedback provides a stronger supervision signal than next‑token likelihood. AGENTFORGE achieves 40.0% resolution on SWE‑BENCH Lite, outperforming single‑agent baselines by 26‑‑28 points. Ablations confirm that execution feedback and role decomposition each independently drive performance. The framework is open‑source at https://github.com/raja21068/AutoCodeAI.
Authors:Yi Lin, Lujin Zhao, Yijie Shi
Abstract:
The shift toward intent‑driven software engineering (often termed "Vibe Coding") exposes a critical Context‑Fidelity Trade‑off: vague user intents overwhelm linear reasoning chains, leading to architectural collapse in complex repo‑level generation. We propose Contract‑Coding, a structured symbolic paradigm that bridges unstructured intent and executable code via Autonomous Symbolic Grounding. By projecting ambiguous intents into a formal Language Contract, our framework serves as a Single Source of Truth (SSOT) that enforces topological independence, effectively isolating inter‑module implementation details, decreasing topological execution depth and unlocking Architectural Parallelism. Empirically, while state‑of‑the‑art agents suffer from different hallucinations on the Greenfield‑5 benchmark, Contract‑Coding achieves 47% functional success while maintaining near‑perfect structural integrity. Our work marks a critical step towards repository‑scale autonomous engineering: transitioning from strict "specification‑following" to robust, intent‑driven architecture synthesis. Our code is available at https://github.com/imliinyi/Contract‑Coding.
Authors:Xiang Long, Li Du, Yilong Xu, Fangcheng Liu, Haoqing Wang, Ning Ding, Ziheng Li, Jianyuan Guo, Yehui Tang
Abstract:
LLM‑based agents are increasingly expected to handle real‑world assistant tasks, yet existing benchmarks typically evaluate them under isolated sources of difficulty, such as a single environment or fully specified instructions. This leaves a substantial gap between current evaluation settings and the compositional challenges that arise in practical deployment. To address this gap, we introduce LiveClawBench, a benchmark to evaluate LLM agents on real‑world assistant tasks. Based on an analysis of various real OpenClaw usage cases, we derive a Triple‑Axis Complexity Framework that characterizes task difficulty along three dimensions: Environment Complexity, Cognitive Demand, and Runtime Adaptability. Guided by this framework, we construct a pilot benchmark with explicit complexity‑factor annotations, covering real‑world assistant tasks with compositional difficulty. Together, the framework and benchmark provide a principled foundation for evaluating LLM agents in realistic assistant settings, and establish a basis for future expansion across task domains and complexity axes. We are continuing to enrich our case collections to achieve more comprehensive domain and complexity coverage. The project page is at https://github.com/Mosi‑AI/LiveClawBench.
Authors:Matthias De Lange, Warre Veys, Federico Retyk, Daniel Deniz, Warren Jouanneau, Mike Zhang, Aleksander Bielinski, Emma Jouffroy, Nicole Clobes, Nina Baranowska, David Graus, Marc Palyart, Rabih Zbib, Dimitra Gkatzia, Thomas Demeester, Tijl De Bie, Toine Bogers, Jens-Joris Decorte, Jeroen Van Hautte
Abstract:
Today's evolving labor markets rely increasingly on recommender systems for hiring, talent management, and workforce analytics, with natural language processing (NLP) capabilities at the core. Yet, research in this area remains highly fragmented. Studies employ divergent ontologies (ESCO, ONET, national taxonomies), heterogeneous task formulations, and diverse model families, making cross‑study comparison and reproducibility exceedingly difficult. General‑purpose benchmarks lack coverage of work‑specific tasks, and the inherent sensitivity of employment data further limits open evaluation. We present WorkRB (Work Research Benchmark), the first open‑source, community‑driven benchmark tailored to work‑domain AI. WorkRB organizes 13 diverse tasks from 7 task groups as unified recommendation and NLP tasks, including job/skill recommendation, candidate recommendation, similar item recommendation, and skill extraction and normalization. WorkRB enables both monolingual and cross‑lingual evaluation settings through dynamic loading of multilingual ontologies. Developed within a multi‑stakeholder ecosystem of academia, industry, and public institutions, WorkRB has a modular design for seamless contributions and enables integration of proprietary tasks without disclosing sensitive data. WorkRB is available under the Apache 2.0 license at https://github.com/techwolf‑ai/WorkRB.
Authors:Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, Ning Ding
Abstract:
On‑policy distillation (OPD) has become a core technique in the post‑training of large language models, yet its training dynamics remain poorly understood. This paper provides a systematic investigation of OPD dynamics and mechanisms. We first identify that two conditions govern whether OPD succeeds or fails: (i) the student and teacher should share compatible thinking patterns; and (ii) even with consistent thinking patterns and higher scores, the teacher must offer genuinely new capabilities beyond what the student has seen during training. We validate these findings through weak‑to‑strong reverse distillation, showing that same‑family 1.5B and 7B teachers are distributionally indistinguishable from the student's perspective. Probing into the token‑level mechanism, we show that successful OPD is characterized by progressive alignment on high‑probability tokens at student‑visited states, a small shared token set that concentrates most of the probability mass (97%‑99%). We further propose two practical strategies to recover failing OPD: off‑policy cold start and teacher‑aligned prompt selection. Finally, we show that OPD's apparent free lunch of dense token‑level reward comes at a cost, raising the question of whether OPD can scale to long‑horizon distillation.
Authors:Joel Fokou
Abstract:
Autonomous AI agents are rapidly transitioning from experimental tools to operational infrastructure, with projections that 80% of enterprise applications will embed AI copilots by the end of 2026. As agents gain the ability to execute real‑world actions (reading files, running commands, making network requests, modifying databases), a fundamental security gap has emerged. The dominant approach to agent safety relies on prompt‑level guardrails: natural language instructions that operate at the same abstraction level as the threats they attempt to mitigate. This paper argues that prompt‑based safety is architecturally insufficient for agents with execution capability and introduces Parallax, a paradigm for safe autonomous AI execution grounded in four principles: Cognitive‑Executive Separation, which structurally prevents the reasoning system from executing actions; Adversarial Validation with Graduated Determinism, which interposes an independent, multi‑tiered validator between reasoning and execution; Information Flow Control, which propagates data sensitivity labels through agent workflows to detect context‑dependent threats; and Reversible Execution, which captures pre‑destructive state to enable rollback when validation fails. We present OpenParallax, an open‑source reference implementation in Go, and evaluate it using Assume‑Compromise Evaluation, a methodology that bypasses the reasoning system entirely to test the architectural boundary under full agent compromise. Across 280 adversarial test cases in nine attack categories, Parallax blocks 98.9% of attacks with zero false positives under its default configuration, and 100% of attacks under its maximum‑security configuration. When the reasoning system is compromised, prompt‑level guardrails provide zero protection because they exist only within the compromised system; Parallax's architectural boundary holds regardless.
Authors:Yiyang Huang, Yitian Zhang, Yizhou Wang, Mingyuan Zhang, Liang Shi, Huimin Zeng, Yun Fu
Abstract:
Despite significant progress in video‑language modeling, hallucinations remain a persistent challenge in Video Large Language Models (Vid‑LLMs), referring to outputs that appear plausible yet contradict the content of the input video. This survey presents a comprehensive analysis of hallucinations in Vid‑LLMs and introduces a systematic taxonomy that categorizes them into two core types: dynamic distortion and content fabrication, each comprising two subtypes with representative cases. Building on this taxonomy, we review recent advances in the evaluation and mitigation of hallucinations, covering key benchmarks, metrics, and intervention strategies. We further analyze the root causes of dynamic distortion and content fabrication, which often result from limited capacity for temporal representation and insufficient visual grounding. These insights inform several promising directions for future work, including the development of motion‑aware visual encoders and the integration of counterfactual learning techniques. This survey consolidates scattered progress to foster a systematic understanding of hallucinations in Vid‑LLMs, laying the groundwork for building robust and reliable video‑language systems. An up‑to‑date curated list of related works is maintained at https://github.com/hukcc/Awesome‑Video‑Hallucination .
Authors:Qiang Zhang, Zhongnian Li
Abstract:
Binary decompilation is a critical reverse engineering task aimed at reconstructing high‑level source code from stripped executables. Although Large Language Models (LLMs) have recently shown promise, they often suffer from "logical hallucinations" and "semantic misalignment" due to the irreversible semantic loss during compilation, resulting in generated code that fails to re‑execute. In this study, we propose Cognitive Decompiler Refinement with Robustness (CoDe‑R), a lightweight two‑stage code refinement framework. The first stage introduces Semantic Cognitive Enhancement (SCE), a Rationale‑Guided Semantic Injection strategy that trains the model to recover high‑level algorithmic intent alongside code. The second stage introduces a Dynamic Dual‑Path Fallback (DDPF) mechanism during inference, which adaptively balances semantic recovery and syntactic stability via a hybrid verification strategy. Evaluation on the HumanEval‑Decompile benchmark demonstrates that CoDe‑R (using a 1.3B backbone) establishes a new State‑of‑the‑Art (SOTA) in the lightweight regime. Notably, it is the first 1.3B model to exceed an Average Re‑executability Rate of 50.00%, significantly outperforming the baseline and effectively bridging the gap between efficient models and expert‑level performance. Our code is available at https://github.com/Theaoi/CoDe‑R.
Authors:Yifan Du, Zikang Liu, Jinbiao Peng, Jie Wu, Junyi Li, Jinyang Li, Wayne Xin Zhao, Ji-Rong Wen
Abstract:
Multimodal deep search agents have shown great potential in solving complex tasks by iteratively collecting textual and visual evidence. However, managing the heterogeneous information and high token costs associated with multimodal inputs over long horizons remains a critical challenge, as existing methods often suffer from context explosion or the loss of crucial visual signals. To address this, we propose a novel Long‑horizon MultiModal deep search framework, named LMM‑Searcher, centered on a file‑based visual representation mechanism. By offloading visual assets to an external file system and mapping them to lightweight textual identifiers (UIDs), our approach mitigates context overhead while preserving multimodal information for future access. We equip the agent with a tailored fetch‑image tool, enabling a progressive, on‑demand visual loading strategy for active perception. Furthermore, we introduce a data synthesis pipeline designed to generate queries requiring complex cross‑modal multi‑hop reasoning. Using this pipeline, we distill 12K high‑quality trajectories to fine‑tune Qwen3‑VL‑Thinking‑30A3B into a specialized multimodal deep search agent. Extensive experiments across four benchmarks demonstrate that our method successfully scales to 100‑turn search horizons, achieving state‑of‑the‑art performance among open‑source models on challenging long‑horizon benchmarks like MM‑BrowseComp and MMSearch‑Plus, while also exhibiting strong generalizability across different base models. Our code will be released in https://github.com/RUCAIBox/LMM‑Searcher.
Authors:Yunkai Dang, Yizhu Jiang, Yifan Jiang, Qi Fan, Yinghuan Shi, Wenbin Li, Yang Gao
Abstract:
Multimodal Large Language Models (MLLMs) suffer from substantial computational overhead due to the high redundancy in visual token sequences. Existing approaches typically address this issue using single‑layer Vision Transformer (ViT) features and static pruning strategies. However, such fixed configurations are often brittle under diverse instructions. To overcome these limitations, we propose CLASP, a plug‑and‑play token reduction framework based on class‑adaptive layer fusion and dual‑stage pruning. Specifically, CLASP first constructs category‑specific visual representations through multi‑layer vision feature fusion. It then performs dual‑stage pruning, allocating the token budget between attention‑salient pivot tokens for relevance and redundancy‑aware completion tokens for coverage. Through class‑adaptive pruning, CLASP enables prompt‑conditioned feature fusion and budget allocation, allowing aggressive yet robust visual token reduction. Extensive experiments demonstrate that CLASP consistently outperforms existing methods across a wide range of benchmarks, pruning ratios, and MLLM architectures. Code will be available at https://github.com/Yunkaidang/CLASP.
Authors:Arya Shah, Kaveri Visavadiya, Manisha Padala
Abstract:
Adversarial robustness is essential for deploying neural networks in safety‑critical applications, yet standard evaluation methods either require expensive adversarial attacks or report only a single aggregate score that obscures how robustness is distributed across classes. We introduce the \emphGF‑Score (GREAT‑Fairness Score), a framework that decomposes the certified GREAT Score into per‑class robustness profiles and quantifies their disparity through four metrics grounded in welfare economics: the Robustness Disparity Index (RDI), the Normalized Robustness Gini Coefficient (NRGC), Worst‑Case Class Robustness (WCR), and a Fairness‑Penalized GREAT Score (FP‑GREAT). The framework further eliminates the original method's dependence on adversarial attacks through a self‑calibration procedure that tunes the temperature parameter using only clean accuracy correlations. Evaluating 22 models from RobustBench across CIFAR‑10 and ImageNet, we find that the decomposition is exact, that per‑class scores reveal consistent vulnerability patterns (e.g., ``cat'' is the weakest class in 76% of CIFAR‑10 models), and that more robust models tend to exhibit greater class‑level disparity. These results establish a practical, attack‑free auditing pipeline for diagnosing where certified robustness guarantees fail to protect all classes equally. We release our code on \hrefhttps://github.com/aryashah2k/gf‑scoreGitHub.
Authors:Alkid Baci, Luke Friedrichs, Caglar Demir, N'Dah Jean Kouagou, Axel-Cyrille Ngonga Ngomo
Abstract:
Knowledge graph embedding (KGE) models perform well on link prediction but struggle with unseen entities, relations, and especially literals, limiting their use in dynamic, heterogeneous graphs. In contrast, pretrained large language models (LLMs) generalize effectively through prompting. We reformulate link prediction as a prompt learning problem and introduce RALP, which learns string‑based chain‑of‑thought (CoT) prompts as scoring functions for triples. Using Bayesian Optimization through MIPRO algorithm, RALP identifies effective prompts from fewer than 30 training examples without gradient access. At inference, RALP predicts missing entities, relations or whole triples and assigns confidence scores based on the learned prompt. We evaluate on transductive, numerical, and OWL instance retrieval benchmarks. RALP improves state‑of‑the‑art KGE models by over 5% MRR across datasets and enhances generalization via high‑quality inferred triples. On OWL reasoning tasks with complex class expressions (e.g., \exists hasChild.Female, \geq 5 \; hasChild.Female), it achieves over 88% Jaccard similarity. These results highlight prompt‑based LLM reasoning as a flexible alternative to embedding‑based methods. We release our implementation, training, and evaluation pipeline as open source: https://github.com/dice‑group/RALP .
Authors:Linhao Yu, Tianmeng Yang, Siyu Ding, Renren Jin, Naibin Gu, Xiangzhao Hao, Shuaiyi Nie, Deyi Xiong, Weichong Yin, Yu Sun, Hua Wu
Abstract:
RLVR improves reasoning in large language models, but its effectiveness is often limited by severe reward sparsity on hard problems. Recent hint‑based RL methods mitigate sparsity by injecting partial solutions or abstract templates, yet they typically scale guidance by adding more tokens, which introduce redundancy, inconsistency, and extra training overhead. We propose KnowRL (Knowledge‑Guided Reinforcement Learning), an RL training framework that treats hint design as a minimal‑sufficient guidance problem. During RL training, KnowRL decomposes guidance into atomic knowledge points (KPs) and uses Constrained Subset Search (CSS) to construct compact, interaction‑aware subsets for training. We further identify a pruning interaction paradox ‑‑ removing one KP may help while removing multiple such KPs can hurt ‑‑ and explicitly optimize for robust subset curation under this dependency structure. We train KnowRL‑Nemotron‑1.5B from OpenMath‑Nemotron‑1.5B. Across eight reasoning benchmarks at the 1.5B scale, KnowRL‑Nemotron‑1.5B consistently outperforms strong RL and hinting baselines. Without KP hints at inference, KnowRL‑Nemotron‑1.5B reaches 70.08 average accuracy, already surpassing Nemotron‑1.5B by +9.63 points; with selected KPs, performance improves to 74.16, establishing a new state of the art at this scale. The model, curated training data, and code are publicly available at https://github.com/Hasuer/KnowRL.
Authors:Yanji He, Yuxin Jiang, Yiwen Wu, Bo Huang, Jiaheng Wei, Wei Wang
Abstract:
Large Language Models are increasingly deployed for decision‑making, yet their adoption in high‑stakes domains remains limited by miscalibrated probabilities, unfaithful explanations, and inability to incorporate expert knowledge precisely. We propose IDEA, a framework that extracts LLM decision knowledge into an interpretable parametric model over semantically meaningful factors. Through joint learning of verbal‑to‑numerical mappings and decision parameters via EM, correlated sampling that preserves factor dependencies, and direct parameter editing with mathematical guarantees, IDEA produces calibrated probabilities while enabling quantitative human‑AI collaboration. Experiments across five datasets show IDEA with Qwen‑3‑32B (78.6%) outperforms DeepSeek R1 (68.1%) and GPT‑5.2 (77.9%), achieving perfect factor exclusion and exact calibration ‑‑ precision unattainable through prompting alone. The implementation is publicly available at https://github.com/leonbig/IDEA.
Authors:Guanyi Qin, Jie Liang, Bingbing Zhang, Lishen Qu, Ya-nan Guan, Hui Zeng, Lei Zhang, Radu Timofte, Jianhui Sun, Xinli Yue, Tao Shao, Huan Hou, Wenjie Liao, Shuhao Han, Jieyu Yuan, Chunle Guo, Chongyi Li, Zewen Chen, Yunze Liu, Jian Guo, Juan Wang, Yun Zeng, Bing Li, Weiming Hu, Hesong Li, Dehua Liu, Xinjie Zhang, Qiang Li, Li Yan, Wei Dong, Qingsen Yan, Xingcan Li, Shenglong Zhou, Manjiang Yin, Yinxiang Zhang, Hongbo Wang, Jikai Xu, Zhaohui Fan, Dandan Zhu, Wei Sun, Weixia Zhang, Kun Zhu, Nana Zhang, Kaiwei Zhang, Qianqian Zhang, Zhihan Zhang, William Gordon, Linwei Wu, Jiachen Tu, Guoyi Xu, Yaoxin Jiang, Cici Liu, Yaokun Shi
Abstract:
In this paper, we present an overview of the NTIRE 2026 challenge on the 3rd Restore Any Image Model in the Wild, specifically focusing on Track 1: Professional Image Quality Assessment. Conventional Image Quality Assessment (IQA) typically relies on scalar scores. By compressing complex visual characteristics into a single number, these methods fundamentally struggle to distinguish subtle differences among uniformly high‑quality images. Furthermore, they fail to articulate why one image is superior, lacking the reasoning capabilities required to provide guidance for vision tasks. To bridge this gap, recent advancements in Multimodal Large Language Models (MLLMs) offer a promising paradigm. Inspired by this potential, our challenge establishes a novel benchmark exploring the ability of MLLMs to mimic human expert cognition in evaluating high‑quality image pairs. Participants were tasked with overcoming critical bottlenecks in professional scenarios, centering on two primary objectives: (1) Comparative Quality Selection: reliably identifying the visually superior image within a high‑quality pair; and (2) Interpretative Reasoning: generating grounded, expert‑level explanations that detail the rationale behind the selection. In total, the challenge attracted nearly 200 registrations and over 2,500 submissions. The top‑performing methods significantly advanced the state of the art in professional IQA. The challenge dataset is available at https://github.com/narthchin/RAIM‑PIQA, and the official homepage is accessible at https://www.codabench.org/competitions/12789/.
Authors:Shuai Wang, Xixi Wang, Yinan Yu
Abstract:
Large Language Models (LLMs) have shown remarkable capabilities across various tasks but remain prone to hallucinations in knowledge‑intensive scenarios. Knowledge Base Question Answering (KBQA) mitigates this by grounding generation in Knowledge Graphs (KGs). However, most multi‑hop KBQA methods rely on explicit edge traversal, making them fragile to KG incompleteness. In this paper, we proposed a novel graph‑based soft prompting framework that shifts the reasoning paradigm from node‑level path traversal to subgraph‑level reasoning. Specifically, we employ a Graph Neural Network (GNN) to encode extracted structural subgraphs into soft prompts, enabling LLM to reason over richer structural context and identify relevant entities beyond immediate graph neighbors, thereby reducing sensitivity to missing edges. Furthermore, we introduce a two‑stage paradigm that reduces computational cost while preserving good performance: a lightweight LLM first leverages the soft prompts to identify question‑relevant entities and relations, followed by a more powerful LLM for evidence‑aware answer generation. Experiments on four multi‑hop KBQA benchmarks show that our approach achieves state‑of‑the‑art performance on three of them, demonstrating its effectiveness. Code is available at the repository: https://github.com/Wangshuaiia/GraSP.
Authors:Junbin Su, Ziteng Xue, Shihui Zhang, Kun Chen, Weiming Hu, Zhipeng Zhang
Abstract:
Parameter‑efficient fine‑tuning (PEFT) in multimodal tracking reveals a concerning trend where recent performance gains are often achieved at the cost of inflated parameter budgets, which fundamentally erodes PEFT's efficiency promise. In this work, we introduce SEATrack, a Simple, Efficient, and Adaptive two‑stream multimodal tracker that tackles this performance‑efficiency dilemma from two complementary perspectives. We first prioritize cross‑modal alignment of matching responses, an underexplored yet pivotal factor that we argue is essential for breaking the trade‑off. Specifically, we observe that modality‑specific biases in existing two‑stream methods generate conflicting matching attention maps, thereby hindering effective joint representation learning. To mitigate this, we propose AMG‑LoRA, which seamlessly integrates Low‑Rank Adaptation (LoRA) for domain adaptation with Adaptive Mutual Guidance (AMG) to dynamically refine and align attention maps across modalities. We then depart from conventional local fusion approaches by introducing a Hierarchical Mixture of Experts (HMoE) that enables efficient global relation modeling, effectively balancing expressiveness and computational efficiency in cross‑modal fusion. Equipped with these innovations, SEATrack advances notable progress over state‑of‑the‑art methods in balancing performance with efficiency across RGB‑T, RGB‑D, and RGB‑E tracking tasks. \hrefhttps://github.com/AutoLab‑SAI‑SJTU/SEATrack\textcolorcyanCode is available.
Authors:Shuai Wang, Yinan Yu
Abstract:
Large Language Models (LLMs) exhibit strong abilities in natural language understanding and generation, yet they struggle with knowledge‑intensive reasoning. Structured Knowledge Graphs (KGs) provide an effective form of external knowledge representation and have been widely used to enhance performance in classical Knowledge Base Question Answering (KBQA) tasks. However, performing precise multi‑hop reasoning over KGs for complex queries remains highly challenging. Most existing approaches decompose the reasoning process into a sequence of isolated steps executed through a fixed pipeline. While effective to some extent, such designs constrain reasoning flexibility and fragment the overall decision process, often leading to incoherence and the loss of critical intermediate information from earlier steps. In this paper, we introduce KG‑Reasoner, an end‑to‑end framework that integrates multi‑step reasoning into a unified "thinking" phase of a Reasoning LLM. Through Reinforcement Learning (RL), the LLM is trained to internalize the KG traversal process, enabling it to dynamically explore reasoning paths, and perform backtracking when necessary. Experiments on eight multi‑hop and knowledge‑intensive reasoning benchmarks demonstrate that KG‑Reasoner achieves competitive or superior performance compared to the state‑of‑the‑art methods. Codes are available at the repository: https://github.com/Wangshuaiia/KG‑Reasoner.
Authors:Qixi Zheng, Yuxiang Zhao, Tianrui Wang, Wenxi Chen, Kele Xu, Yikang Li, Qinyuan Chen, Xipeng Qiu, Kai Yu, Xie Chen
Abstract:
Zero‑shot voice conversion (VC) aims to convert a source utterance into the voice of an unseen target speaker while preserving its linguistic content. Although recent systems have improved conversion quality, building zero‑shot VC systems for interactive scenarios remains challenging because high‑fidelity speaker transfer and low‑latency streaming inference are difficult to achieve simultaneously. In this work, we present X‑VC, a zero‑shot streaming VC system that performs one‑step conversion in the latent space of a pretrained neural codec. X‑VC uses a dual‑conditioning acoustic converter that jointly models source codec latents and frame‑level acoustic conditions derived from target reference speech, while injecting utterance‑level target speaker information through adaptive normalization. To reduce the mismatch between training and inference, we train the model with generated paired data and a role‑assignment strategy that combines standard, reconstruction, and reversed modes. For streaming inference, we further adopt a chunkwise inference scheme with overlap smoothing that is aligned with the segment‑based training paradigm of the codec. Experiments on Seed‑TTS‑Eval show that X‑VC achieves the best streaming WER in both English and Chinese, strong speaker similarity in same‑language and cross‑lingual settings, and substantially lower offline real‑time factor than the compared baselines. These results suggest that codec‑space one‑step conversion is a practical approach for building high‑quality low‑latency zero‑shot VC systems. Our audio samples, code and checkpoints are released at https://github.com/Jerrister/X‑VC.
Authors:Jiawei Fan, Shigeng Wang, Chao Li, Xiaolong Liu, Anbang Yao
Abstract:
In this paper, we present Chain‑of‑Models Pre‑Training (CoM‑PT), a novel performance‑lossless training acceleration method for vision foundation models (VFMs). This approach fundamentally differs from existing acceleration methods in its core motivation: rather than optimizing each model individually, CoM‑PT is designed to accelerate the training pipeline at the model family level, scaling efficiently as the model family expands. Specifically, CoM‑PT establishes a pre‑training sequence for the model family, arranged in ascending order of model size, called model chain. In this chain, only the smallest model undergoes standard individual pre‑training, while the other models are efficiently trained through sequential inverse knowledge transfer from their smaller predecessors by jointly reusing the knowledge in the parameter space and the feature space. As a result, CoM‑PT enables all models to achieve performance that is mostly superior to standard individual training while significantly reducing training cost, and this is extensively validated across 45 datasets spanning zero‑shot and fine‑tuning tasks. Notably, its efficient scaling property yields a remarkable phenomenon: training more models even results in higher efficiency. For instance, when pre‑training on CC3M: i) given ViT‑L as the largest model, progressively prepending smaller models to the model chain reduces computational complexity by up to 72%; ii) within a fixed model size range, as the VFM family scales across 3, 4, and 7 models, the acceleration ratio of CoM‑PT exhibits a striking leap: from 4.13X to 5.68X and 7.09X. Since CoM‑PT is naturally agnostic to specific pre‑training paradigms, we open‑source the code to spur further extensions in more computationally intensive scenarios, such as large language model pre‑training.
Authors:Yuangang Li, Justin Tian Jin Chen, Ethan Yu, David Hong, Iftekhar Ahmed
Abstract:
Large language models (LLMs) increasingly rely on explicit reasoning to solve coding tasks, yet evaluating the quality of this reasoning remains challenging. Existing reasoning evaluators are not designed for coding, and current benchmarks focus primarily on code generation, leaving other coding tasks largely unexplored. We introduce CodeRQ‑Bench, the first benchmark for evaluating LLM reasoning quality across three coding task categories: generation, summarization, and classification. Using this benchmark, we analyze 1,069 mismatch cases from existing evaluators, identify five recurring limitations, and derive four design insights for reasoning evaluation in coding tasks. Guided by these insights, we propose VERA, a two‑stage evaluator that combines evidence‑grounded verification with ambiguity‑aware score correction. Experiments on CodeRQ‑Bench show that VERA consistently outperforms strong baselines across four datasets, improving AUCROC by up to 0.26 and AUPRC by up to 0.21. We release CodeRQ‑Bench at https://github.com/MrLYG/CodeRQ‑Bench, supporting future investigations.
Authors:SungHo Kim, Juhyeong Park, Eda Atalay, SangKeun Lee
Abstract:
Korean is a morphologically rich language with a featural writing system in which each character is systematically composed of subcharacter units known as Jamo. These subcharacters not only determine the visual structure of Korean but also encode frequent and linguistically meaningful morphophonological processes. However, most current Korean language models (LMs) are based on subword tokenization schemes, which are not explicitly designed to capture the internal compositional structure of characters. To address this limitation, we propose SCRIPT, a model‑agnostic module that injects subcharacter compositional knowledge into Korean PLMs. SCRIPT allows to enhance subword embeddings with structural granularity, without requiring architectural changes or additional pre‑training. As a result, SCRIPT enhances all baselines across various Korean natural language understanding (NLU) and generation (NLG) tasks. Moreover, beyond performance gains, detailed linguistic analyses show that SCRIPT reshapes the embedding space in a way that better captures grammatical regularities and semantically cohesive variations. Our code is available at https://github.com/SungHo3268/SCRIPT.
Authors:WenBin Yan
Abstract:
SpanKey is a lightweight way to gate inference without encrypting weights or chasing leaderboard accuracy on gated inference. The idea is to condition activations on secret keys. A basis matrix B defines a low‑dimensional key subspace Span(B); during training we sample coefficients α and form keys k=α^\top B, then inject them into intermediate activations with additive or multiplicative maps and strength γ. Valid keys lie in Span(B); invalid keys are sampled outside that subspace. We make three points. (i) Mechanism: subspace key injection and a multi‑layer design space. (ii) Failure mode: key absorption, together with two analytical results (a Beta‑energy split and margin‑tail diagnostics), explains weak baseline separation in energy and margin terms ‑‑ these are not a security theorem. iii) Deny losses and experiments: Modes A‑‑C and extensions, with CIFAR‑10 ResNet‑18 runs and MNIST ablations for Mode B. We summarize setup and first‑order analysis, injectors, absorption, deny losses and ablations, a threat discussion that does not promise cryptography, and closing remarks on scale. Code: \texttthttps://github.com/mindmemory‑ai/dksc
Authors:Sandra Gómez-Gálvez, Tobias Olenyi, Gillian Dobbie, Katerina Taškova
Abstract:
Deep neural networks, despite their high accuracy, often exhibit poor confidence calibration, limiting their reliability in high‑stakes applications. Current ad‑hoc confidence calibration methods attempt to fix this during training but face a fundamental trade‑off: two‑phase training methods achieve strong classification performance at the cost of training instability and poorer confidence calibration, while single‑loss methods are stable but underperform in classification. This paper addresses and mitigates this stability‑performance trade‑off. We propose Socrates Loss, a novel, unified loss function that explicitly leverages uncertainty by incorporating an auxiliary unknown class, whose predictions directly influence the loss function and a dynamic uncertainty penalty. This unified objective allows the model to be optimized for both classification and confidence calibration simultaneously, without the instability of complex, scheduled losses. We provide theoretical guarantees that our method regularizes the model to prevent miscalibration and overfitting. Across four benchmark datasets and multiple architectures, our comprehensive experiments demonstrate that Socrates Loss consistently improves training stability while achieving more favorable accuracy‑calibration trade‑off, often converging faster than existing methods.
Authors:Ziqing Wang, Yibo Wen, Abhishek Pandy, Han Liu, Kaize Ding
Abstract:
In drug discovery, molecular optimization aims to iteratively refine a lead compound to improve molecular properties while preserving structural similarity to the original molecule. However, each oracle evaluation is expensive, making sample efficiency a key challenge for existing methods under a limited oracle budget. Trial‑and‑error approaches require many oracle calls, while methods that leverage external knowledge tend to reuse familiar templates and struggle on challenging objectives. A key missing piece is long‑term memory that can ground decisions and provide reusable insights for future optimizations. To address this, we present MolMem (Molecular optimization with Memory), a multi‑turn agentic reinforcement learning (RL) framework with a dual‑memory system. Specifically, MolMem uses Static Exemplar Memory to retrieve relevant exemplars for cold‑start grounding, and Evolving Skill Memory to distill successful trajectories into reusable strategies. Built on this memory‑augmented formulation, we train the policy with dense step‑wise rewards, turning costly rollouts into long‑term knowledge that improves future optimization. Extensive experiments show that MolMem achieves 90% success on single‑property tasks (1.5× over the best baseline) and 52% on multi‑property tasks using only 500 oracle calls. Our code is available at https://github.com/REAL‑Lab‑NU/MolMem.
Authors:Vasundra Srinivasan
Abstract:
Preserving multimodal signals across agent boundaries is necessary for accurate cross‑modal reasoning, but it is not sufficient. We show that modality‑native routing in Agent‑to‑Agent (A2A) networks improves task accuracy by 20 percentage points over text‑bottleneck baselines, but only when the downstream reasoning agent can exploit the richer context that native routing preserves. An ablation replacing LLM‑backed reasoning with keyword matching eliminates the accuracy gap entirely (36% vs. 36%), establishing a two‑layer requirement: protocol‑level routing must be paired with capable agent‑level reasoning for the benefit to materialize. We present MMA2A, an architecture layer atop A2A that inspects Agent Card capability declarations to route voice, image, and text parts in their native modality. On CrossModal‑CS, a controlled 50‑task benchmark with the same LLM backend, same tasks, and only the routing path varying, MMA2A achieves 52% task completion accuracy versus 32% for the text‑bottleneck baseline (95% bootstrap CI on ΔTCA: [8, 32] pp; McNemar's exact p = 0.006). Gains concentrate on vision‑dependent tasks: product defect reports improve by +38.5 pp and visual troubleshooting by +16.7 pp. This accuracy gain comes at a 1.8× latency cost from native multimodal processing. These results suggest that routing is a first‑order design variable in multi‑agent systems, as it determines the information available for downstream reasoning.
Authors:Zhihua Hua, Junli Wang, Pengfei LI, Qihao Jin, Bo Zhang, Kehua Sheng, Yilun Chen, Zhongxue Gan, Wenchao Ding
Abstract:
Global navigation information and local scene understanding are two crucial components of autonomous driving systems. However, our experimental results indicate that many end‑to‑end autonomous driving systems tend to over‑rely on local scene understanding while failing to utilize global navigation information. These systems exhibit weak correlation between their planning capabilities and navigation input, and struggle to perform navigation‑following in complex scenarios. To overcome this limitation, we propose the Sequential Navigation Guidance (SNG) framework, an efficient representation of global navigation information based on real‑world navigation patterns. The SNG encompasses both navigation paths for constraining long‑term trajectories and turn‑by‑turn (TBT) information for real‑time decision‑making logic. We constructed the SNG‑QA dataset, a visual question answering (VQA) dataset based on SNG that aligns global and local planning. Additionally, we introduce an efficient model SNG‑VLA that fuses local planning with global planning. The SNG‑VLA achieves state‑of‑the‑art performance through precise navigation information modeling without requiring auxiliary loss functions from perception tasks. Project page: SNG‑VLA
Authors:Sebastian Cajas, Ashaba Judith, Rahul Gorijavolu, Sahil Kapadia, Hillary Clinton Kasimbazi, Leo Kinyera, Emmanuel Paul Kwesiga, Sri Sri Jaithra Varma Manthena, Luis Filipe Nakayama, Ninsiima Doreen, Leo Anthony Celi
Abstract:
Latent diffusion models for medical image super‑resolution universally inherit variational autoencoders designed for natural photographs. We show that this default choice, not the diffusion architecture, is the dominant constraint on reconstruction quality. In a controlled experiment holding all other pipeline components fixed, replacing the generic Stable Diffusion VAE with MedVAE, a domain‑specific autoencoder pretrained on more than 1.6 million medical images, yields +2.91 to +3.29 dB PSNR improvement across knee MRI, brain MRI, and chest X‑ray (n = 1,820; Cohen's d = 1.37 to 1.86, all p < 10^‑20, Wilcoxon signed‑rank). Wavelet decomposition localises the advantage to the finest spatial frequency bands encoding anatomically relevant fine structure. Ablations across inference schedules, prediction targets, and generative architectures confirm the gap is stable within plus or minus 0.15 dB, while hallucination rates remain comparable between methods (Cohen's h < 0.02 across all datasets), establishing that reconstruction fidelity and generative hallucination are governed by independent pipeline components. These results provide a practical screening criterion: autoencoder reconstruction quality, measurable without diffusion training, predicts downstream SR performance (R^2 = 0.67), suggesting that domain‑specific VAE selection should precede diffusion architecture search. Code and trained model weights are publicly available at https://github.com/sebasmos/latent‑sr.
Authors:Arun Sharma
Abstract:
We introduce compute‑grounded reasoning (CGR), a design paradigm for spatial‑aware research agents in which every answerable sub‑problem is resolved by deterministic computation before a language model is asked to generate. Spatial Atlas instantiates CGR as a single Agent‑to‑Agent (A2A) server that handles two challenging benchmarks: FieldWorkArena, a multimodal spatial question‑answering benchmark spanning factory, warehouse, and retail environments, and MLE‑Bench, a suite of 75 Kaggle machine learning competitions requiring end‑to‑end ML engineering. A structured spatial scene graph engine extracts entities and relations from vision descriptions, computes distances and safety violations deterministically, then feeds computed facts to large language models, thereby avoiding hallucinated spatial reasoning. Entropy‑guided action selection maximizes information gain per step and routes queries across a three‑tier frontier model stack (OpenAI + Anthropic). A self‑healing ML pipeline with strategy‑aware code generation, a score‑driven iterative refinement loop, and a prompt‑based leak audit registry round out the system. We evaluate across both benchmarks and show that CGR yields competitive accuracy while maintaining interpretability through structured intermediate representations and deterministic spatial computations.
Authors:Vladimir Vasilenko
Abstract:
Large language models map semantically related prompts to similar internal representations ‑‑ a phenomenon interpretable as attractor‑like dynamics. We ask whether the identity document of a persistent cognitive agent (its cognitive_core) exhibits analogous attractor‑like behavior. We present a controlled experiment on Llama 3.1 8B Instruct, comparing hidden states of an original cognitive_core (Condition A), seven paraphrases (Condition B), and seven structurally matched controls (Condition C). Mean‑pooled states at layers 8, 16, and 24 show that paraphrases converge to a tighter cluster than controls (Cohen's d > 1.88, p < 10^‑27, Bonferroni‑corrected). Replication on Gemma 2 9B confirms cross‑architecture generalizability. Ablations suggest the effect is primarily semantic rather than structural, and that structural completeness appears necessary to reach the attractor region. An exploratory experiment shows that reading a scientific description of the agent shifts internal state toward the attractor ‑‑ closer than a sham preprint ‑‑ distinguishing knowing about an identity from operating as that identity. These results provide representational evidence that agent identity documents induce attractor‑like geometry in LLM activation space.
Authors:Xingyu Qiu, Yuqian Fu, Jiawei Geng, Bin Ren, Jiancheng Pan, Zongwei Wu, Hao Tang, Yanwei Fu, Radu Timofte, Nicu Sebe, Mohamed Elhoseiny, Lingyi Hong, Mingxi Cheng, Xingqi He, Runze Li, Xingdong Sheng, Wenqiang Zhang, Jiacong Liu, Shu Luo, Yikai Qin, Yaze Zhao, Yongwei Jiang, Yixiong Zou, Zhe Zhang, Yang Yang, Kaiyu Li, Bowen Fu, Zixuan Jiang, Ke Li, Hui Qiao, Xiangyong Cao, Xuanlong Yu, Youyang Sha, Longfei Liu, Di Yang, Xi Shen, Kyeongryeol Go, Taewoong Jang, Saiprasad Meesiyawar, Ravi Kirasur, Rakshita Kulkarni, Bhoomi Deshpande, Harsh Patil, Uma Mudenagudi, Shuming Hu, Chao Chen, Tao Wang, Wei Zhou, Qi Xu, Zhenzhao Xing, Dandan Zhao, Hanzhe Xia, Dongdong Lu, Zhe Zhang, Jingru Wang, Guangwei Huang, Jiachen Tu, Yaokun Shi, Guoyi Xu, Yaoxin Jiang, Jiajia Liu, Liwei Zhou, Bei Dou, Tao Wu, Zekang Fan, Junjie Liu, Adhémar de Senneville, Flavien Armangeon, Mengbers, Yazhe Lyu, Zhimeng Xin, Zijian Zhuang, Hongchun Zhu, Li Wang
Abstract:
Cross‑domain few‑shot object detection (CD‑FSOD) remains a challenging problem for existing object detectors and few‑shot learning approaches, particularly when generalizing across distinct domains. As part of NTIRE 2026, we hosted the second CD‑FSOD Challenge to systematically evaluate and promote progress in detecting objects in unseen target domains under limited annotation conditions. The challenge received strong community interest, with 128 registered participants and a total of 696 submissions. Among them, 31 teams actively participated, and 19 teams submitted valid final results. Participants explored a wide range of strategies, introducing innovative methods that push the performance frontier under both open‑source and closed‑source tracks. This report presents a detailed overview of the NTIRE 2026 CD‑FSOD Challenge, including a summary of the submitted approaches and an analysis of the final results across all participating teams. Challenge Codes: https://github.com/ohMargin/NTIRE2026_CDFSOD.
Authors:Manas Pathak, Xingyao Chen, Shuozhe Li, Amy Zhang, Liu Leqi
Abstract:
Should we trust Large Language Models (LLMs) with high accuracy? LLMs achieve high accuracy on reasoning benchmarks, but correctness alone does not reveal the quality of the reasoning used to produce it. This highlights a fundamental limitation of outcome‑based evaluation: models may arrive at correct answers through flawed reasoning, and models with substantially different reasoning capabilities can nevertheless exhibit similar benchmark accuracy, for example due to memorization or over‑optimization. In this paper, we ask: given existing benchmarks, can we move beyond outcome‑based evaluation to assess the quality of reasoning itself? We seek metrics that (1) differentiate models with similar accuracy and (2) are robust to variations in input prompts and generation configurations. To this end, we propose a reasoning score that evaluates reasoning traces along dimensions such as faithfulness, coherence, utility, and factuality. A remaining question is how to aggregate this score across multiple sampled traces. Naively averaging them is undesirable, particularly in long‑horizon settings, where the number of possible trajectories grows rapidly, and low‑confidence correct traces are more likely to be coincidental. To address this, we introduce the Filtered Reasoning Score (FRS), which computes reasoning quality using only the top‑K% most confident traces. Evaluating with FRS, models that are indistinguishable under standard accuracy exhibit significant differences in reasoning quality. Moreover, models with higher FRS on one benchmark tend to perform better on other reasoning benchmarks, in both accuracy and reasoning quality. Together, these findings suggest that FRS complements accuracy by capturing a model's transferable reasoning capabilities. We open source our evaluation codebase: https://github.com/Manas2006/benchmark_reproducibility.
Authors:Xinyu Jessica Wang, Haoyue Bai, Yiyou Sun, Haorui Wang, Shuibai Zhang, Wenjie Hu, Mya Schroder, Bilge Mutlu, Dawn Song, Robert D Nowak
Abstract:
Large language model (LLM) agents perform strongly on short‑ and mid‑horizon tasks, but often break down on long‑horizon tasks that require extended, interdependent action sequences. Despite rapid progress in agentic systems, these long‑horizon failures remain poorly characterized, hindering principled diagnosis and comparison across domains. To address this gap, we introduce HORIZON, an initial cross‑domain diagnostic benchmark for systematically constructing tasks and analyzing long‑horizon failure behaviors in LLM‑based agents. Using HORIZON, we evaluate state‑of‑the‑art (SOTA) agents from multiple model families (GPT‑5 variants and Claude models), collecting 3100+ trajectories across four representative agentic domains to study horizon‑dependent degradation patterns. We further propose a trajectory‑grounded LLM‑as‑a‑Judge pipeline for scalable and reproducible failure attribution, and validate it with human annotation on trajectories, achieving strong agreement (inter‑annotator κ=0.61; human‑judge κ=0.84). Our findings offer an initial methodological step toward systematic, cross‑domain analysis of long‑horizon agent failures and offer practical guidance for building more reliable long‑horizon agents. We release our project website at \hrefhttps://xwang2775.github.io/horizon‑leaderboard/HORIZON Leaderboard and welcome contributions from the community.
Authors:Mohammed Ezzaldin Babiker Abdullah
Abstract:
The stable operation of autonomous off‑grid photovoltaic systems requires solar forecasting algorithms that respect atmospheric thermodynamics. Contemporary deep learning models consistently exhibit critical anomalies, primarily severe temporal phase lags during cloud transients and physically impossible nocturnal power generation. To resolve this divergence between data‑driven modeling and deterministic celestial mechanics, this research introduces the Thermodynamic Liquid Manifold Network. The methodology projects 22 meteorological and geometric variables into a Koopman‑linearized Riemannian manifold to systematically map complex climatic dynamics. The architecture integrates a Spectral Calibration unit and a multiplicative Thermodynamic Alpha‑Gate. This system synthesizes real‑time atmospheric opacity with theoretical clear‑sky boundary models, structurally enforcing strict celestial geometry compliance. This completely neutralizes phantom nocturnal generation while maintaining zero‑lag synchronization during rapid weather shifts. Validated against a rigorous five‑year testing horizon in a severe semi‑arid climate, the framework achieves an RMSE of 18.31 Wh/m2 and a Pearson correlation of 0.988. The model strictly maintains a zero‑magnitude nocturnal error across all 1826 testing days and exhibits a sub‑30‑minute phase response during high‑frequency optical transients. Comprising exactly 63,458 trainable parameters, this ultra‑lightweight design establishes a robust, thermodynamically consistent standard for edge‑deployable microgrid controllers.
Authors:Haesung Oh, Jaeheung Park
Abstract:
End‑to‑End (E2E) autonomous driving models are usually trained and evaluated with a fixed ego‑vehicle, even though their driving policy is implicitly tied to vehicle dynamics. When such a model is deployed on a vehicle with different size, mass, or drivetrain characteristics, its performance can degrade substantially; we refer to this problem as the vehicle‑domain gap. To address it, we propose MVAdapt, a physics‑conditioned adaptation framework for multi‑vehicle E2E driving. MVAdapt combines a frozen TransFuser++ scene encoder with a lightweight physics encoder and a cross‑attention module that conditions scene features on vehicle properties before waypoint decoding. In the CARLA Leaderboard 1.0 benchmark, MVAdapt improves over naive transfer and multi‑embodiment adaptation baselines on both in‑distribution and unseen vehicles. We further show two complementary behaviors: strong zero‑shot transfer on many unseen vehicles, and data‑efficient few‑shot calibration for severe physical outliers. These results suggest that explicitly conditioning E2E driving policies on vehicle physics is an effective step toward more transferable autonomous driving models. All codes are available at https://github.com/hae‑sung‑oh/MVAdapt
Authors:Wenhao Zhang, Lin Mu, Li Ni, Peiquan Jin, Yiwen Zhang
Abstract:
Low‑rank adaptation (LoRA) is a widely used strategy for efficient fine‑tuning of large language models (LLMs), but its strictly linear structure fundamentally limits expressive capacity. The bilinear formulation of weight updates captures only first‑order dependencies between low‑rank factors, restricting the modeling of nonlinear and higher‑order parameter interactions. In this paper, we propose Polynomial Expansion Rank Adaptation (PERA), a novel method that introduces structured polynomial expansion directly into the low‑rank factor space. By expanding each low‑rank factor to synthesize high‑order interaction terms before composition, PERA transforms the adaptation space into a polynomial manifold capable of modeling richer nonlinear coupling without increasing rank or inference cost. We provide theoretical analysis demonstrating that PERA offers enhanced expressive capacity and more effective feature utilization compare to existing linear adaptation approaches. Empirically, PERA consistently outperforms state‑of‑the‑art methods across diverse benchmarks. Notably, our experiments show that incorporating high‑order nonlinear components particularly square terms is crucial for enhancing expressive capacity and maintaining strong and robust performance under various rank settings. Our code is available at https://github.com/zhangwenhao6/PERA
Authors:Bronislav Sidik, Lior Rokach
Abstract:
Autonomous AI agents built on open‑source runtimes such as OpenClaw expose every available tool to every session by default, regardless of the task. A summarization task receives the same shell execution, subagent spawning, and credential access capabilities as a code deployment task, a 15x overprovision ratio that we call the capability overprovisioning problem. Existing defenses, including the NemoClaw container sandbox and the Cisco DefenseClaw skill scanner, address containment and threat detection but do not learn the minimum viable capability set for each task type. We present Aethelgard, a four layer adaptive governance framework that enforces least privilege for AI agents through a learned policy. Layer 1, the Capability Governor, dynamically scopes which tools the agent is aware of in each session. Layer 3, the Safety Router, intercepts tool calls before execution using a hybrid rule based and fine tuned classifier. Layer 2, the RL Learning Policy, trains a PPO policy on the accumulated audit log to learn the minimum viable skill set for each task type.
Authors:Mohammed Ezzaldin Babiker Abdullah
Abstract:
The stable operation of off‑grid photovoltaic systems requires accurate, computationally efficient solar forecasting. Contemporary deep learning models often suffer from massive computational overhead and physical blindness, generating impossible predictions. This paper introduces the Physics‑Informed State Space Model (PISSM) to bridge the gap between efficiency and physical accuracy for edge‑deployed microcontrollers. PISSM utilizes a dynamic Hankel matrix embedding to filter stochastic sensor noise by transforming raw meteorological sequences into a robust state space. A Linear State Space Model replaces heavy attention mechanisms, efficiently modeling temporal dependencies for parallel processing. Crucially, a novel Physics‑Informed Gating mechanism leverages the Solar Zenith Angle and Clearness Index to structurally bound outputs, ensuring predictions strictly obey diurnal cycles and preventing nocturnal errors. Validated on a multi‑year dataset for Omdurman, Sudan, PISSM achieves superior accuracy with fewer than 40,000 parameters, establishing an ultra‑lightweight benchmark for real‑time off‑grid control.
Authors:Chenxi Qing, Junxi Wu, Zheng Liu, Yixiang Qiu, Hongyao Yu, Bin Chen, Hao Wu, Shu-Tao Xia
Abstract:
Recently, large language models (LLMs) are capable of generating highly fluent textual content. While they offer significant convenience to humans, they also introduce various risks, like phishing and academic dishonesty. Numerous research efforts have been dedicated to developing algorithms for detecting AI‑generated text and constructing relevant datasets. However, in the domain of Chinese corpora, challenges remain, including limited model diversity and data homogeneity. To address these issues, we propose C‑ReD: a comprehensive Chinese Real‑prompt AI‑generated Detection benchmark. Experiments demonstrate that C‑ReD not only enables reliable in‑domain detection but also supports strong generalization to unseen LLMs and external Chinese datasets‑addressing critical gaps in model diversity, domain coverage, and prompt realism that have limited prior Chinese detection benchmarks. We release our resources at https://github.com/HeraldofLight/C‑ReD.
Authors:Haozhe Wang, Cong Wei, Weiming Ren, Jiaming Liu, Fangzhen Lin, Wenhu Chen
Abstract:
Most reward models for visual generation reduce rich human judgments to a single unexplained score, discarding the reasoning that underlies preference. We show that teaching reward models to produce explicit, multi‑dimensional critiques before scoring transforms them from passive evaluators into active optimization tools, improving generators in two complementary ways: at training time, structured rationales provide interpretable, fine‑grained rewards for reinforcement learning; at test time, a Generate‑Critique‑Refine loop turns critiques into targeted prompt revisions that improve outputs without any parameter updates. To train such a reward model without costly rationale annotations, we introduce Preference‑Anchored Rationalization (PARROT), a principled framework that recovers high‑quality rationales from readily available preference data through anchored generation, consistency filtering, and distillation. The resulting model, RationalRewards (8B), achieves state‑of‑the‑art preference prediction among open‑source reward models, competitive with Gemini‑2.5‑Pro, while using 10‑20x less training data than comparable baselines. As an RL reward, it consistently improves text‑to‑image and image‑editing generators beyond scalar alternatives. Most strikingly, its test‑time critique‑and‑refine loop matches or exceeds RL‑based fine‑tuning on several benchmarks, suggesting that structured reasoning can unlock latent capabilities in existing generators that suboptimal prompts fail to elicit.
Authors:Charafeddine Mouzouni
Abstract:
We introduce Context Kubernetes, an architecture for orchestrating enterprise knowledge in agentic AI systems, with a prototype implementation and eight experiments. The core observation is that delivering the right knowledge, to the right agent, with the right permissions, at the right freshness ‑‑ across an entire organization ‑‑ is structurally analogous to the container orchestration problem Kubernetes solved a decade ago. We formalize six core abstractions, a YAML‑based declarative manifest for knowledge‑architecture‑as‑code, a reconciliation loop, and a three‑tier agent permission model where agent authority is always a strict subset of human authority. On synthetic seed data, we compare four governance baselines of increasing strength: ungoverned RAG, ACL‑filtered retrieval, RBAC‑aware routing, and the full architecture. Each layer contributes a different capability: ACL filtering eliminates cross‑domain leaks, intent routing reduces noise by 19 percentage points, and only the three‑tier model blocks all five tested attack scenarios ‑‑ the one attack RBAC misses is an agent sending confidential pricing via email, which RBAC cannot distinguish from ordinary email. TLA+ model‑checking verifies safety properties across 4.6 million reachable states with zero violations. A survey of four major platforms (Microsoft, Salesforce, AWS, Google) documents that none architecturally isolates agent approval channels. We identify four properties that make context orchestration harder than container orchestration, and argue these make the solution more valuable.
Authors:Xi-Wei Pan, Shi-Wen An, Jin-Guo Liu
Abstract:
Solving an NP‑hard optimization problem often requires reformulating it for a specific solver ‑‑ quantum hardware, a commercial optimizer, or a domain heuristic. A tool for polynomial‑time reductions between hard problems would let practitioners route any supported problem to any supported solver through a single interface. Building such a library at scale, however, has remained out of reach. We show that harness engineering, the practice of designing constraints, verification systems, and feedback loops that channel AI coding agents, can overcome this barrier. Our harness combines a no‑code contribution route for domain experts, a multilayer verification stack ranging from type‑level checks to agentic feature tests (AI agents role‑playing as end users), and a fully automated implementation‑review‑integration pipeline. In about three months, we built a command‑line tool backed by a library of 100+ problem types and 200+ reduction rules in over 170k lines of Rust. The result suggests that a well‑engineered harness lets agents build well‑tested software at a scale and pace beyond prior reduction‑library efforts. Because the reduction graph composes transitively, a new solver registered for any single problem type instantly becomes available to every problem connected by a reduction path. The source code is available at https://github.com/CodingThrust/problem‑reductions.
Authors:Pengfeng Li, Chen Huang, Chaoqun Hao, Hongyao Chen, Xiao-Yong Wei, Wenqiang Lei, See-Kiong Ng
Abstract:
Contextual causal reasoning is a critical yet challenging capability for Large Language Models (LLMs). Existing benchmarks, however, often evaluate this skill in fragmented settings, failing to ensure context consistency or cover the full causal hierarchy. To address this, we pioneer METER to systematically benchmark LLMs across all three levels of the causal ladder under a unified context setting. Our extensive evaluation of various LLMs reveals a significant decline in proficiency as tasks ascend the causal hierarchy. To diagnose this degradation, we conduct a deep mechanistic analysis via both error pattern identification and internal information flow tracing. Our analysis reveals two primary failure modes: (1) LLMs are susceptible to distraction by causally irrelevant but factually correct information at lower level of causality; and (2) as tasks ascend the causal hierarchy, faithfulness to the provided context degrades, leading to a reduced performance. We belive our work advances our understanding of the mechanisms behind LLM contextual causal reasoning and establishes a critical foundation for future research. Our code and dataset are available at https://github.com/SCUNLP/METER .
Authors:Haofu Yang, Jiaji Liu, Chen Huang, Faguo Wu, Wenqiang Lei, See-Kiong Ng
Abstract:
Developing non‑collaborative dialogue agents traditionally requires the manual, unscalable codification of expert strategies. We propose \ours, a method that leverages large language models to autonomously induce both strategy actions and planning logic directly from raw transcripts. METRO formalizes expert knowledge into a Strategy Forest, a hierarchical structure that captures both short‑term responses (nodes) and long‑term strategic foresight (branches). Experimental results across two benchmarks show that METRO demonstrates promising performance, outperforming existing methods by an average of 9%‑10%. Our further analysis not only reveals the success behind METRO (strategic behavioral diversity and foresight), but also demonstrates its robust cross‑task transferability. This offers new insights into building non‑collaborative agents in a cost‑effective and scalable way. Our code is available at https://github.com/Humphrey‑0125/METRO.
Authors:Bo Li, Mingda Wang, Gexiang Fang, Shikun Zhang, Wei Ye
Abstract:
We revisit retrieval‑augmented generation (RAG) by embedding retrieval control directly into generation. Instead of treating retrieval as an external intervention, we express retrieval decisions within token‑level decoding, enabling end‑to‑end coordination without additional controllers or classifiers. Under the paradigm of Retrieval as Generation, we propose GRIP (Generation‑guided Retrieval with Information Planning), a unified framework in which the model regulates retrieval behavior through control‑token emission. Central to GRIP is Self‑Triggered Information Planning, which allows the model to decide when to retrieve, how to reformulate queries, and when to terminate, all within a single autoregressive trajectory. This design tightly couples retrieval and reasoning and supports dynamic multi‑step inference with on‑the‑fly evidence integration. To supervise these behaviors, we construct a structured training set covering answerable, partially answerable, and multi‑hop queries, each aligned with specific token patterns. Experiments on five QA benchmarks show that GRIP surpasses strong RAG baselines and is competitive with GPT‑4o while using substantially fewer parameters.
Authors:Chen Huang, Zitan Jiang, Changyi Zou, Wenqiang Lei, See-Kiong Ng
Abstract:
Customer service chatbots are increasingly expected to serve not merely as reactive support tools for users, but as strategic interfaces for harvesting high‑value information and business intelligence. In response, we make three main contributions. 1) We introduce and define a novel task of Proactive Information Probing, which optimizes when to probe users for pre‑specified target information while minimizing conversation turns and user friction. 2) We propose PROCHATIP, a proactive chatbot framework featuring a specialized conversation strategy module trained to master the delicate timing of probes. 3) Experiments demonstrate that PROCHATIP significantly outperforms baselines, exhibiting superior capability in both information probing and service quality. We believe that our work effectively redefines the commercial utility of chatbots, positioning them as scalable, cost‑effective engines for proactive business intelligence. Our code is available at https://github.com/SCUNLP/PROCHATIP.
Authors:Xue Qin, Simin Luan, John See, Cong Yang, Zhijun Li
Abstract:
As embodied robots move toward fleet‑scale operation, multi‑robot coordination is becoming a central systems challenge. Existing approaches often treat this as motivation for increasing internal multi‑agent decomposition within each robot. We argue for a different principle: multi‑robot coordination does not require intra‑robot multi‑agent fragmentation. Each robot should remain a single embodied agent with its own persistent runtime, local policy scope, capability state, and recovery authority, while coordination emerges through federation across robots at the fleet level. We present Federated Single‑Agent Robotics (FSAR), a runtime architecture for multi‑robot coordination built on single‑agent robot runtimes. Each robot exposes a governed capability surface rather than an internally fragmented agent society. Fleet coordination is achieved through shared capability registries, cross‑robot task delegation, policy‑aware authority assignment, trust‑scoped interaction, and layered recovery protocols. We formalize key coordination relations including authority delegation, inter‑robot capability requests, local‑versus‑fleet recovery boundaries, and hierarchical human supervision, and describe a fleet runtime architecture supporting shared Embodied Capability Module (ECM) discovery, contract‑aware cross‑robot coordination, and fleet‑level governance. We evaluate FSAR on representative multi‑robot coordination scenarios against decomposition‑heavy baselines. Results show statistically significant gains in governance locality (d=2.91, p<.001 vs. centralized control) and recovery containment (d=4.88, p<.001 vs. decomposition‑heavy), while reducing authority conflicts and policy violations across all scenarios. Our results support the view that the path from embodied agents to embodied fleets is better served by federation across coherent robot runtimes than by fragmentation within them.
Authors:Peng Yuan, Yuyang Yin, Yuxuan Cai, Zheng Wei
Abstract:
Existing browser agent benchmarks face a fundamental trilemma: real‑website benchmarks lack reproducibility due to content drift, controlled environments sacrifice realism by omitting real‑web noise, and both require costly manual curation that limits scalability. We present WebForge, the first fully automated framework that resolves this trilemma through a four‑agent pipeline ‑‑ Plan, Generate, Refine, and Validate ‑‑ that produces interactive, self‑contained web environments end‑to‑end without human annotation. A seven‑dimensional difficulty control framework structures task design along navigation depth, visual complexity, reasoning difficulty, and more, enabling systematic capability profiling beyond single aggregate scores. Using WebForge, we construct WebForge‑Bench, a benchmark of 934 tasks spanning 7 domains and 3 difficulty levels. Multi‑model experiments show that difficulty stratification effectively differentiates model capabilities, while cross‑domain analysis exposes capability biases invisible to aggregate metrics. Together, these results confirm that multi‑dimensional evaluation reveals distinct capability profiles that a single aggregate score cannot capture. Code and benchmark are publicly available at https://github.com/yuandaxia2001/WebForge.
Authors:Zihao Cheng, Zeming Liu, Yingyu Shan, Xinyi Wang, Xiangrong Zhu, Yunpu Ma, Hongru Wang, Yuhang Guo, Wei Lin, Yunhong Wang
Abstract:
While large language model‑‑powered agents can self‑evolve by accumulating experience or by dynamically creating new assets (i.e., tools or expert agents), existing frameworks typically treat these two evolutionary processes in isolation. This separation overlooks their intrinsic interdependence: the former is inherently bounded by a manually predefined static toolset, while the latter generates new assets from scratch without experiential guidance, leading to limited capability growth and unstable evolution. To address this limitation, we introduce a novel paradigm of co‑evolutionary Capability Expansion and Experience Distillation. Guided by this paradigm, we propose the Mem^2Evolve, which integrates two core components: Experience Memory and Asset Memory. Specifically, Mem^2Evolve leverages accumulated experience to guide the dynamic creation of assets, thereby expanding the agent's capability space while simultaneously acquiring new experience to achieve co‑evolution. Extensive experiments across 6 task categories and 8 benchmarks demonstrate that Mem^2Evolve achieves improvement of 18.53% over standard LLMs, 11.80% over agents evolving solely through experience, and 6.46% over those evolving solely through asset creation, establishing it as a substantially more effective and stable self‑evolving agent framework. Code is available at: https://buaa‑irip‑llm.github.io/Mem2Evolve.
Authors:Shuhao Zhang, Yuli Chen, Jiale Han, Bo Cheng, Jiabao Ma
Abstract:
Watermarking provides a critical safeguard for large language model (LLM) services by facilitating the detection of LLM‑generated text. Correspondingly, stealing watermark algorithms (SWAs) derive watermark information from watermarked texts generated by victim LLMs to craft highly targeted adversarial attacks, which compromise the reliability of watermarks. Existing SWAs rely on fixed strategies, overlooking the non‑uniform distribution of stolen watermark information and the dynamic nature of real‑world LLM generation processes. To address these limitations, we propose Adaptive Stealing (AS), a novel SWA featuring enhanced design flexibility through Position‑Based Seal Construction and Adaptive Selection modules. AS operates by defining multiple attack perspectives derived from distinct activation states of contextually ordered tokens. During attack execution, AS dynamically selects the optimal perspective based on watermark compatibility, generation priority, and dynamic generation relevance. Our experiments demonstrate that AS significantly increases steal efficiency against target watermarks under identical experimental conditions. These findings highlight the need for more robust LLM watermarks to withstand potential attacks. We release our code to the community for future research\footnotehttps://github.com/DrankXs/AdaptiveStealingWatermark.
Authors:Yinyi Luo, Wenwen Wang, Hayes Bai, Hongyu Zhu, Hao Chen, Pan He, Marios Savvides, Sharon Li, Jindong Wang
Abstract:
Recent advances in unified multimodal models (UMMs) have led to a proliferation of architectures capable of understanding, generating, and editing across visual and textual modalities. However, developing a unified framework for UMMs remains challenging due to the diversity of model architectures and the heterogeneity of training paradigms and implementation details. In this paper, we present TorchUMM, the first unified codebase for comprehensive evaluation, analysis, and post‑training across diverse UMM backbones, tasks, and datasets. TorchUMM supports a broad spectrum of models covering a wide range of scales and design paradigms. Our benchmark encompasses three core task dimensions: multimodal understanding, generation, and editing, and integrates both established and novel datasets to evaluate perception, reasoning, compositionality, and instruction‑following abilities. By providing a unified interface and standardized evaluation protocols, TorchUMM enables fair and reproducible comparisons across heterogeneous models and fosters deeper insights into their strengths and limitations, facilitating the development of more capable unified multimodal systems. Code is available at: https://github.com/AIFrontierLab/TorchUMM.
Authors:Yuqi Chen, Xiaohan Zhang, Ahmad Arrabi, Waqas Sultani, Chen Chen, Safwan Wshah
Abstract:
Natural‑language Guided Cross‑view Geo‑localization (NGCG) aims to retrieve geo‑tagged satellite imagery using textual descriptions of ground scenes. While recent NGCG methods commonly rely on CLIP‑style dual‑encoder architectures, they often suffer from weak cross‑modal generalization and require complex architectural designs. In contrast, Multimodal Large Language Models (MLLMs) offer powerful semantic reasoning capabilities but are not directly optimized for retrieval tasks. In this work, we present a simple yet effective framework to adapt MLLMs for NGCG via parameter‑efficient finetuning. Our approach optimizes latent representations within the MLLM while preserving its pretrained multimodal knowledge, enabling strong cross‑modal alignment without redesigning model architectures. Through systematic analysis of diverse variables, from model backbone to feature aggregation, we provide practical and generalizable insights for leveraging MLLMs in NGCG. Our method achieves SOTA on GeoText‑1652 with a 12.2% improvement in Text‑to‑Image Recall@1 and secures top performance in 5 out of 12 subtasks on CVG‑Text, all while surpassing baselines with far fewer trainable parameters. These results position MLLMs as a robust foundation for semantic cross‑view retrieval and pave the way for MLLM‑based NGCG to be adopted as a scalable, powerful alternative to traditional dual‑encoder designs. Project page and code are available at https://yuqichen888.github.io/NGCG‑MLLMs‑web/.
Authors:Udari Madhushani Sehwag, Elaine Lau, Haniyeh Ehsani Oskouie, Shayan Shabihi, Erich Liang, Andrea Toledo, Guillermo Mangialardi, Sergio Fonrouge, Ed-Yeremai Hernandez Cardona, Paula Vergara, Utkarsh Tyagi, Chen Bo Calvin Zhang, Pavi Bhatter, Nicholas Johnson, Furong Huang, Ernesto Gabriel Hernandez Montoya, Bing Liu
Abstract:
Accelerating scientific discovery requires the identification of which experiments would yield the best outcomes before committing resources to costly physical validation. While existing benchmarks evaluate LLMs on scientific knowledge and reasoning, their ability to predict experimental outcomes ‑ a task where AI could significantly exceed human capabilities ‑ remains largely underexplored. We introduce SciPredict, a benchmark comprising 405 tasks derived from recent empirical studies in 33 specialized sub‑fields of physics, biology, and chemistry. SciPredict addresses two critical questions: (a) can LLMs predict the outcome of scientific experiments with sufficient accuracy? and (b) can such predictions be reliably used in the scientific research process? Evaluations reveal fundamental limitations on both fronts. Model accuracies are 14‑26% and human expert performance is \approx20%. Although some frontier models exceed human performance model accuracy is still far below what would enable reliable experimental guidance. Even within the limited performance, models fail to distinguish reliable predictions from unreliable ones, achieving only \approx20% accuracy regardless of their confidence or whether they judge outcomes as predictable without physical experimentation. Human experts, in contrast, demonstrate strong calibration: their accuracy increases from \approx5% to \approx80% as they deem outcomes more predictable without conducting the experiment. SciPredict establishes a rigorous framework demonstrating that superhuman performance in experimental science requires not just better predictions, but better awareness of prediction reliability. For reproducibility all our data and code are provided at https://github.com/scaleapi/scipredict
Authors:Zeyue Tian, Binxin Yang, Zhaoyang Liu, Jiexuan Zhang, Ruibin Yuan, Hubery Yin, Qifeng Chen, Chen Li, Jing Lv, Wei Xue, Yike Guo
Abstract:
Recent progress in multimodal models has spurred rapid advances in audio understanding, generation, and editing. However, these capabilities are typically addressed by specialized models, leaving the development of a truly unified framework that can seamlessly integrate all three tasks underexplored. While some pioneering works have explored unifying audio understanding and generation, they often remain confined to specific domains. To address this, we introduce Audio‑Omni, the first end‑to‑end framework to unify generation and editing across general sound, music, and speech domains, with integrated multi‑modal understanding capabilities. Our architecture synergizes a frozen Multimodal Large Language Model for high‑level reasoning with a trainable Diffusion Transformer for high‑fidelity synthesis. To overcome the critical data scarcity in audio editing, we construct AudioEdit, a new large‑scale dataset comprising over one million meticulously curated editing pairs. Extensive experiments demonstrate that Audio‑Omni achieves state‑of‑the‑art performance across a suite of benchmarks, outperforming prior unified approaches while achieving performance on par with or superior to specialized expert models. Beyond its core capabilities, Audio‑Omni exhibits remarkable inherited capabilities, including knowledge‑augmented reasoning generation, in‑context generation, and zero‑shot cross‑lingual control for audio generation, highlighting a promising direction toward universal generative audio intelligence. The code, model, and dataset will be publicly released on https://zeyuet.github.io/Audio‑Omni.
Authors:Yifan Gao, Haoyue Li, Feng Yuan, Xin Gao, Weiran Huang, Xiaosong Wang
Abstract:
We present Camyla, a system for fully autonomous research within the scientific domain of medical image segmentation. Camyla transforms raw datasets into literature‑grounded research proposals, executable experiments, and complete manuscripts without human intervention. Autonomous experimentation over long horizons poses three interrelated challenges: search effort drifts toward unpromising directions, knowledge from earlier trials degrades as context accumulates, and recovery from failures collapses into repetitive incremental fixes. To address these challenges, the system combines three coupled mechanisms: Quality‑Weighted Branch Exploration for allocating effort across competing proposals, Layered Reflective Memory for retaining and compressing cross‑trial knowledge at multiple granularities, and Divergent Diagnostic Feedback for diversifying recovery after underperforming trials. The system is evaluated on CamylaBench, a contamination‑free benchmark of 31 datasets constructed exclusively from 2025 publications, under a strict zero‑intervention protocol across two independent runs within a total of 28 days on an 8‑GPU cluster. Across the two runs, Camyla generates more than 2,700 novel model implementations and 40 complete manuscripts, and surpasses the strongest per‑dataset baseline selected from 14 established architectures, including nnU‑Net, on 22 and 18 of 31 datasets under identical training budgets, respectively (union: 24/31). Senior human reviewers score the generated manuscripts at the T1/T2 boundary of contemporary medical imaging journals. Relative to automated baselines, Camyla outperforms AutoML and NAS systems on aggregate segmentation performance and exceeds six open‑ended research agents on both task completion and baseline‑surpassing frequency. These results suggest that domain‑scale autonomous research is achievable in medical image segmentation.
Authors:Vu Tuan Truong, Long Bao Le
Abstract:
Large Language Models (LLMs), despite their impressive capabilities across domains, have been shown to be vulnerable to backdoor attacks. Prior backdoor strategies predominantly operate at the token level, where an injected trigger causes the model to generate a specific target word, choice, or class (depending on the task). Recent advances, however, exploit the long‑form reasoning tendencies of modern LLMs to conduct reasoning‑level backdoors: once triggered, the victim model inserts one or more malicious reasoning steps into its chain‑of‑thought (CoT). These attacks are substantially harder to detect, as the backdoored answer remains plausible and consistent with the poisoned reasoning trajectory. Yet, defenses tailored to this type of backdoor remain largely unexplored. To bridge this gap, we propose Critical‑CoT, a novel defense mechanism that conducts a two‑stage fine‑tuning (FT) process on LLMs to develop critical thinking behaviors, enabling them to automatically identify potential backdoors and refuse to generate malicious reasoning steps. Extensive experiments across multiple LLMs and datasets demonstrate that Critical‑CoT provides strong robustness against both in‑context learning‑based and FT‑based backdoor attacks. Notably, Critical‑CoT exhibits strong cross‑domain and cross‑task generalization. Our code is available at hthttps://github.com/tuanvu171/Critical‑CoT.
Authors:Hao Wang, Guozhi Wang, Han Xiao, Yufeng Zhou, Yue Pan, Jichao Wang, Ke Xu, Yafei Wen, Xiaohu Ruan, Xiaoxin Chen, Honggang Qi
Abstract:
Reinforcement learning (RL) has been widely used to train LLM agents for multi‑turn interactive tasks, but its sample efficiency is severely limited by sparse rewards and long horizons. On‑policy self‑distillation (OPSD) alleviates this by providing dense token‑level supervision from a privileged teacher that has access to ground‑truth answers. However, such fixed privileged information cannot capture the diverse valid strategies in agent tasks, and naively combining OPSD with RL often leads to training collapse. To address these limitations, we introduce Skill‑SD, a framework that turns the agent's own trajectories into dynamic training‑only supervision. Completed trajectories are summarized into compact natural language skills that describe successful behaviors, mistakes, and workflows. These skills serve as dynamic privileged information conditioning only the teacher, while the student always acts under the plain task prompt and learns to internalize the guidance through distillation. To stabilize the training, we derive an importance‑weighted reverse‑KL loss to provide gradient‑correct token‑level distillation, and dynamically synchronize the teacher with the improving student. Experimental results on agentic benchmarks demonstrate that Skill‑SD substantially outperforms the standard RL baseline, improving both vanilla GRPO (+14.0%/+10.9% on AppWorld/Sokoban) and vanilla OPD (+42.1%/+40.6%). Project page: https://k1xe.github.io/skill‑sd/
Authors:Luis Balderas, Miguel Lastra, José M. Benítez
Abstract:
Large language models are transforming all areas of academia and industry, attracting the attention of researchers, professionals, and the general public. In the trek for more powerful architectures, Mixture‑of‑Experts, inspired by ensemble models, have emerged as one of the most effective ways to follow. However, this implies a high computational burden for both training and inference. To reduce the impact on computing and memory footprint as well as the energy consumption, simplification methods has arisen as very effective procedures. In this paper, an original algorithm, MoEITS, for MoE‑LLMs simplification is presented. The algorithm is characterized by a refined simplicity, underpinned by standardized Information Theoretic frameworks. MoEITS is analyzed in depth from theoretical and practical points of view. Its computational complexity is studied. Its performance on the accuracy of the simplified LLMs and the reduction rate achieved is assessed through a thoroughly designed experimentation. This empirical evaluation includes a comparison with state‑of‑the‑art MoE‑LLM pruning methods applied on Mixtral 8×7B, Qwen1.5‑2.7B, and DeepSeek‑V2‑Lite. The extensive experimentation conducted demonstrates that MoEITS outperforms state‑of‑the‑art techniques by generating models that are both effective across all benchmarks and computationally efficient. The code implementing the method will be available at https://github.com/luisbalru/MoEITS.
Authors:Bo Ma, Jinsong Wu, Hongjiang Wei, Weiqi Yan
Abstract:
Mamba selective state space models (SSMs) provide linear‑time sequence modeling but are often limited by memory bandwidth in practice, where selective state updates are executed as fragmented kernels with repeated intermediate tensor materialization. We present COREY, a prototype scheduler that uses activation entropy estimated via fixed‑width histograms as a runtime signal for chunk‑size selection at the kernel‑invocation level. COREY is positioned as a Concept and Feasibility contribution: a single‑parameter runtime auto‑tuner built on an existing Triton selective‑scan kernel rather than a new fused implementation. Evidence is organized in three tiers. Tier 1 (Python cost model) shows that entropy‑guided grouping reduces surrogate latency and DRAM traffic. Tier 2a (real‑checkpoint inline hook) demonstrates that entropy computation and chunk selection can run on the critical path of model.generate(); on Mamba‑370M (RTX 3070, n=5), measured overhead is 8.3 percent with full instrumentation and estimated about 2 percent with sparse sampling. Tier 2b (kernel‑level scan benchmark) shows that, under a principled calibration where H_ref equals log(K), COREY selects the same chunk as a one‑time‑profile oracle without offline sweeps and achieves up to 4.41x speedup over static chunk‑64. This work does not yet include a fully integrated end‑to‑end run connecting Tier 2a and Tier 2b, which remains key future work. Across 80 LongBench prompts, entropy distributions are stable, supporting COREY as a practical runtime auto‑tuner within a single regime. Code and data: https://github.com/mabo1215/COREY_Transformer/.
Authors:Theodor Spiro
Abstract:
We test whether artificial intelligence architectural evolution obeys the same statistical laws as biological evolution. Compiling 935 ablation experiments from 161 publications, we show that the distribution of fitness effects (DFE) of architectural modifications follows a heavy‑tailed Student's t‑distribution with proportions (68% deleterious, 19% neutral, 13% beneficial for major ablations, n=568) that place AI between compact viral genomes and simple eukaryotes. The DFE shape matches D. melanogaster (normalized KS=0.07) and S. cerevisiae (KS=0.09); the elevated beneficial fraction (13% vs. 1‑6% in biology) quantifies the advantage of directed over blind search while preserving the distributional form. Architectural origination follows logistic dynamics (R^2=0.994) with punctuated equilibria and adaptive radiation into domain niches. Fourteen architectural traits were independently invented 3‑5 times, paralleling biological convergences. These results demonstrate that the statistical structure of evolution is substrate‑independent, determined by fitness landscape topology rather than the mechanism of selection.
Authors:Wanyi Chen, Xiao Yang, Xu Yang, Tianming Sha, Qizheng Li, Zhuo Wang, Bowen Xian, Fang Kong, Weiqing Liu, Jiang Bian
Abstract:
We introduce Agent^2 RL‑Bench, a benchmark for evaluating agentic RL post‑training ‑‑ whether LLM agents can autonomously design, implement, and run complete RL pipelines that improve foundation models. This capability is important because RL post‑training increasingly drives model alignment and specialization, yet existing benchmarks remain largely static: supervised fine‑tuning alone yields strong results, leaving interactive RL engineering untested. Agent^2 RL‑Bench addresses this with six tasks across three levels ‑‑ from static rule‑based training to closed‑loop online RL with trajectory collection ‑‑ each adding a structural requirement that prior levels do not impose. The benchmark provides isolated workspaces with a grading API, runtime instrumentation that records every submission and code revision, and automated post‑hoc analysis that generates structured run reports, enabling the first automated diagnostic of agent‑driven post‑training behavior. Across multiple agent stacks spanning five agent systems and six driver LLMs, we find that agents achieve striking interactive gains ‑‑ on ALFWorld, an RL‑only agent improves from 5.97 to 93.28 via SFT warm‑up and GRPO with online rollouts ‑‑ yet make only marginal progress on others (DeepSearchQA: +2.75 within evaluation noise), and that driver choice has a large effect on interactive tasks ‑‑ within the same scaffold, switching drivers changes interactive improvement from near‑zero to +78pp. More broadly, the benchmark reveals that supervised pipelines dominate agent‑driven post‑training under fixed budgets, with online RL succeeding as the final best route only on ALFWorld. Code is available at https://github.com/microsoft/RD‑Agent/tree/main/rdagent/scenarios/rl/autorl_bench.
Authors:Yuzhen Mao, Qitong Wang, Martin Ester, Ke Li
Abstract:
Key‑Value (KV) cache plays a crucial role in accelerating inference in large language models (LLMs) by storing intermediate attention states and avoiding redundant computation during autoregressive generation. However, its memory footprint scales linearly with sequence length, often leading to severe memory bottlenecks on resource‑constrained hardware. Prior work has explored offloading KV cache to the CPU while retaining only a subset on the GPU, but these approaches often rely on imprecise token selection and suffer performance degradation in long‑generation tasks such as chain‑of‑thought reasoning. In this paper, we propose a novel KV cache management strategy, IceCache, which integrates semantic token clustering with PagedAttention. By organizing semantically related tokens into contiguous memory regions managed by a hierarchical, dynamically updatable data structure, our method enables more efficient token selection and better utilization of memory bandwidth during CPU‑GPU transfers. Experimental results on LongBench show that, with a 256‑token budget, IceCache maintains 99% of the original accuracy achieved by the full KV cache model. Moreover, compared to other offloading‑based methods, IceCache attains competitive or even superior latency and accuracy while using only 25% of the KV cache token budget, demonstrating its effectiveness in long‑sequence scenarios. The code is available on our project website at https://yuzhenmao.github.io/IceCache/.
Authors:Jiahui Zhang, Rouyi Wang, Kuangqi Zhou, Tianshu Xiao, Lingyan Zhu, Yaosen Min, Yang Wang
Abstract:
Peptide therapeutics are widely regarded as the "third generation" of drugs, yet progress in peptide Machine Learning (ML) are hindered by the absence of standardized benchmarks. Here we present PepBenchmark, which unifies datasets, preprocessing, and evaluation protocols for peptide drug discovery. PepBenchmark comprises three components: (1) PepBenchData, a well‑curated collection comprising 29 canonical‑peptide and 6 non‑canonical‑peptide datasets across 7 groups, systematically covering key aspects of peptide drug development, representing, to the best of our knowledge, the most comprehensive AI‑ready dataset resource to date; (2) PepBenchPipeline, a standardized preprocessing pipeline that ensures consistent dataset cleaning, construction, splitting, and feature transformation, mitigating quality issues common in ad hoc pipelines; and (3) PepBenchLeaderboard, a unified evaluation protocol and leaderboard with strong baselines across 4 major methodological families: Fingerprint‑based, GNN‑based, PLM‑based, and SMILES‑based models. Together, PepBenchmark provides the first standardized and comparable foundation for peptide drug discovery, facilitating methodological advances and translation into real‑world applications. The data and code are publicly available at https://github.com/ZGCI‑AI4S‑Pep/PepBenchmark/.
Authors:Zijia Lu, Jingru Yi, Jue Wang, Yuxiao Chen, Junwen Chen, Xinyu Li, Davide Modolo
Abstract:
Referring multi‑object tracking (RMOT) is a task of associating all the objects in a video that semantically match with given textual queries or referring expressions. Existing RMOT approaches decompose object grounding and tracking into separated modules and exhibit limited performance due to the scarcity of training videos, ambiguous annotations, and restricted domains. In this work, we introduce STORM, an end‑to‑end MLLM that jointly performs grounding and tracking within a unified framework, eliminating external detectors and enabling coherent reasoning over appearance, motion, and language. To improve data efficiency, we propose a task‑composition learning (TCL) strategy that decomposes RMOT into image grounding and object tracking, allowing STORM to leverage data‑rich sub‑tasks and learn structured spatial‑‑temporal reasoning. We further construct STORM‑Bench, a new RMOT dataset with accurate trajectories and diverse, unambiguous referring expressions generated through a bottom‑up annotation pipeline. Extensive experiments show that STORM achieves state‑of‑the‑art performance on image grounding, single‑object tracking, and RMOT benchmarks, demonstrating strong generalization and robust spatial‑‑temporal grounding in complex real‑world scenarios. STORM‑Bench is released at https://github.com/amazon‑science/storm‑referring‑multi‑object‑grounding.
Authors:Suyoung Bae, CheolWon Na, Jaehoon Lee, Yumin Lee, YunSeok Choi, Jee-Hyong Lee
Abstract:
As Large Language Models (LLMs) have become capable of generating long and descriptive code summaries, accurate and reliable evaluation of factual consistency has become a critical challenge. However, previous evaluation methods are primarily designed for short summaries of isolated code snippets. Consequently, they struggle to provide fine‑grained evaluation of multi‑sentence functionalities and fail to accurately assess dependency context commonly found in real‑world code summaries. To address this, we propose ReFEree, a reference‑free and fine‑grained method for evaluating factual consistency in real‑world code summaries. We define factual inconsistency criteria specific to code summaries and evaluate them at the segment level using these criteria along with dependency information. These segment‑level results are then aggregated into a fine‑grained score. We construct a code summarization benchmark with human‑annotated factual consistency labels. The evaluation results demonstrate that ReFEree achieves the highest correlation with human judgment among 13 baselines, improving 15‑18% over the previous state‑of‑the‑art. Our code and data are available at https://github.com/bsy99615/ReFEree.git.
Authors:Lincoln Spencer, Song Wang, Chen Chen
Abstract:
Surgical phase segmentation is central to computer‑assisted surgery, yet robust models remain difficult to develop when labeled surgical videos are scarce. We study data‑efficient phase segmentation for manual small‑incision cataract surgery (SICS) through a controlled comparison of visual representations. To isolate representation quality, we pair each visual encoder with the same temporal model (MS‑TCN++) under identical training and evaluation settings on SICS‑155 (19 phases). We compare supervised encoders (ResNet‑50, I3D) against large self‑supervised foundation models (DINOv3, V‑JEPA2), and use a cached‑feature pipeline that decouples expensive visual encoding from lightweight temporal learning. Foundation‑model features improve segmentation performance in this setup, with DINOv3 ViT‑7B achieving the best overall results (83.4% accuracy, 87.0 edit score). We further examine cataract‑domain transfer using unlabeled videos and lightweight adaptation, and analyze when it helps or hurts. Overall, the study indicates strong transferability of modern vision foundation models to surgical workflow understanding and provides practical guidance for low‑label medical video settings. The project website is available at: https://sl2005.github.io/DataEfficient‑sics‑phase‑seg/
Authors:Haopeng Chen, Yihao Ai, Kabeen Kim, Robby T. Tan, Yixin Chen, Bo Wang
Abstract:
Low‑visibility scenarios, such as low‑light conditions, pose significant challenges to human pose estimation due to the scarcity of annotated low‑light datasets and the loss of visual information under poor illumination. Recent domain adaptation techniques attempt to utilize well‑lit labels by augmenting well‑lit images to mimic low‑light conditions. But handcrafted augmentations oversimplify noise patterns, while learning‑based methods often fail to preserve high‑frequency low‑light characteristics, producing unrealistic images that lead pose models to generalize poorly to real low‑light scenes. Moreover, recent pose estimators rely on image cues through image‑to‑keypoint cross‑attention, but these cues become unreliable under low‑light conditions. To address these issues, we propose Unsupervised Domain Adaptation for Pose Estimation (UDAPose), a novel framework that synthesizes low‑light images and dynamically fuses visual cues with pose priors for improved pose estimation. Specifically, our synthesis method incorporates a Direct‑Current‑based High‑Pass Filter (DHF) and a Low‑light Characteristics Injection Module (LCIM) to inject high‑frequency details from input low‑light images, overcoming rigidity or the detail loss in existing approaches. Furthermore, we introduce a Dynamic Control of Attention (DCA) module that adaptively balances image cues with learned pose priors in the Transformer architecture. Experiments show that UDAPose outperforms state‑of‑the‑art methods, with notable AP gains of 10.1 (56.4%) on the ExLPose‑test hard set (LL‑H) and 7.4 (31.4%) in cross‑dataset validation on EHPT‑XC. Code: https://github.com/Vision‑and‑Multimodal‑Intelligence‑Lab/UDAPose
Authors:Xinlei Guan, David Arosemena, Tejaswi Dhandu, Kuan Huang, Meng Xu, Miles Q. Li, Bingyu Shen, Ruiyang Qin, Umamaheswara Rao Tida, Boyang Li
Abstract:
The rapid growth of generative AI has introduced new challenges in content moderation and digital forensics. In particular, benign AI‑generated images can be paired with harmful or misleading text, creating difficult‑to‑detect misuse. This contextual misuse undermines the traditional moderation framework and complicates attribution, as synthetic images typically lack persistent metadata or device signatures. We introduce a steganography enabled attribution framework that embeds cryptographically signed identifiers into images at creation time and uses multimodal harmful content detection as a trigger for attribution verification. Our system evaluates five watermarking methods across spatial, frequency, and wavelet domains. It also integrates a CLIP‑based fusion model for multimodal harmful‑content detection. Experiments demonstrate that spread‑spectrum watermarking, especially in the wavelet domain, provides strong robustness under blur distortions, and our multimodal fusion detector achieves an AUC‑ROC of 0.99, enabling reliable cross‑modal attribution verification. These components form an end‑to‑end forensic pipeline that enables reliable tracing of harmful deployments of AI‑generated imagery, supporting accountability in modern synthetic media environments. Our code is available at GitHub: https://github.com/bli1/steganography
Authors:Naichuan Zheng, Hailun Xia, Zepeng Sun, Weiyi Li, Yinzhe Zhou
Abstract:
Wearable IMU‑based Human Activity Recognition (HAR) relies heavily on Deep Neural Networks (DNNs), which are burdened by immense computational and buffering demands. Their power‑hungry floating‑point operations and rigid requirement to process complete temporal windows severely cripple battery‑constrained edge devices. While Spiking Neural Networks (SNNs) offer extreme event‑driven energy efficiency, standard architectures struggle with complex biomechanical topologies and temporal gradient degradation. To bridge this gap, we propose the Physics‑Aware Spiking Neural Network (PAS‑Net), a fully multiplier‑free architecture explicitly tailored for Green HAR. Spatially, an adaptive symmetric topology mixer enforces human‑joint physical constraints. Temporally, an O(1)‑memory causal neuromodulator yields context‑aware dynamic threshold neurons, adapting actively to non‑stationary movement rhythms. Furthermore, we leverage a temporal spike error objective to unlock a flexible early‑exit mechanism for continuous IMU streams. Evaluated across seven diverse datasets, PAS‑Net achieves state‑of‑the‑art accuracy while replacing dense operations with sparse 0.1 pJ integer accumulations. Crucially, its confidence‑driven early‑exit capability drastically reduces dynamic energy consumption by up to 98%. PAS‑Net establishes a robust, ultra‑low‑power neuromorphic standard for always‑on wearable sensing. The source code and pre‑trained models are publicly available at https://github.com/zhengnaichuan2022/PAS‑Net.git.
Authors:Di Wen, Zeyun Zhong, David Schneider, Manuel Zaremski, Linus Kunzmann, Yitian Shi, Ruiping Liu, Yufan Chen, Junwei Zheng, Jiahang Li, Jonas Hemmerich, Qiyi Tong, Patric Grauberger, Arash Ajoudani, Danda Pani Paudel, Sven Matthiesen, Barbara Deml, Jürgen Beyerer, Luc Van Gool, Rainer Stiefelhagen, Kunyu Peng
Abstract:
We introduce IMPACT, a synchronized five‑view RGB‑D dataset for deployment‑oriented industrial procedural understanding, built around real assembly and disassembly of a commercial angle grinder with professional‑grade tools. To our knowledge, IMPACT is the first real industrial assembly benchmark that jointly provides synchronized ego‑exo RGB‑D capture, decoupled bimanual annotation, compliance‑aware state tracking, and explicit anomaly‑‑recovery supervision within a single real industrial workflow. It comprises 112 trials from 13 participants totaling 39.5 hours, with multi‑route execution governed by a partial‑order prerequisite graph, a six‑category anomaly taxonomy, and operator cognitive load measured via NASA‑TLX. The annotation hierarchy links hand‑specific atomic actions to coarse procedural steps, component assembly states, and per‑hand compliance phases, with synchronized null spans across views to decouple perceptual limitations from algorithmic failure. Systematic baselines reveal fundamental limitations that remain invisible to single‑task benchmarks, particularly under realistic deployment conditions that involve incomplete observations, flexible execution paths, and corrective behavior. The full dataset, annotations, and evaluation code are available at https://github.com/Kratos‑Wen/IMPACT.
Authors:Yuzhe Weng, Haotian Wang, Xinyi Yu, Xiaoyan Wu, Haoran Xu, Shan He, Jun Du
Abstract:
Audio‑driven human video generation has achieved remarkable success in monologue scenarios, largely driven by advancements in powerful video generation foundation models. Moving beyond monologues, authentic human communication is inherently a full‑duplex interactive process, requiring virtual agents not only to articulate their own speech but also to react naturally to incoming conversational audio. Most existing methods simply extend conventional audio‑driven paradigms to listening scenarios. However, relying on strict frame‑to‑frame alignment renders the model's response to long‑range conversational dynamics rigid, whereas directly introducing global attention catastrophically degrades lip synchronization. Recognizing the unique temporal Scale Discrepancy between talking and listening behaviors, we introduce a multi‑head Gaussian kernel to explicitly inject this physical intuition into the model as a progressive temporal inductive bias. Building upon this, we construct a full‑duplex interactive virtual agent capable of simultaneously processing dual‑stream audio inputs for both talking and listening. Furthermore, we introduce a rigorously cleaned Talking‑Listening dataset VoxHear featuring perfectly decoupled speech and background audio tracks. Extensive experiments demonstrate that our approach successfully fuses strong temporal alignment with deep contextual semantics, setting a new state‑of‑the‑art for generating highly natural and responsive full‑duplex interactive digital humans. The project page is available at https://warmcongee.github.io/beyond‑monologue/ .
Authors:Alexandru Brateanu, Tingting Mu, Codruta Ancuti, Cosmin Ancuti
Abstract:
Low‑light image enhancement (LLIE) aims to restore natural visibility, color fidelity, and structural detail under severe illumination degradation. State‑of‑the‑art (SOTA) LLIE techniques often rely on large models and multi‑stage training, limiting practicality for edge deployment. Moreover, their dependence on a single color space introduces instability and visible exposure or color artifacts. To address these, we propose Multinex, an ultra‑lightweight structured framework that integrates multiple fine‑grained representations within a principled Retinex residual formulation. It decomposes an image into illumination and color prior stacks derived from distinct analytic representations, and learns to fuse these representations into luminance and reflectance adjustments required to correct exposure. By prioritizing enhancement over reconstruction and exploiting lightweight neural operations, Multinex significantly reduces computational cost, exemplified by its lightweight (45K parameters) and nano (0.7K parameters) versions. Extensive benchmarks show that all lightweight variants significantly outperform their corresponding lightweight SOTA models, and reach comparable performance to heavy models. Paper page available at https://albrateanu.github.io/multinex.
Authors:Peng Yuan, Bingyin Mei, Hui Zhang
Abstract:
Composed Image Retrieval (CIR) retrieves target images using a reference image paired with modification text. Despite rapid advances, all existing methods and datasets operate at the image level ‑‑ a single reference image plus modification text in, a single target image out ‑‑ while real e‑commerce users reason about products shown from multiple viewpoints. We term this mismatch View Incompleteness and formally define a new Multi‑View CIR task that generalizes standard CIR from image‑level to product‑level retrieval. To support this task, we construct FashionMV, the first large‑scale multi‑view fashion dataset for product‑level CIR, comprising 127K products, 472K multi‑view images, and over 220K CIR triplets, built through a fully automated pipeline leveraging large multimodal models. We further propose ProCIR (Product‑level Composed Image Retrieval), a modeling framework built upon a multimodal large language model that employs three complementary mechanisms ‑‑ two‑stage dialogue, caption‑based alignment, and chain‑of‑thought guidance ‑‑ together with an optional supervised fine‑tuning (SFT) stage that injects structured product knowledge prior to contrastive training. Systematic ablation across 16 configurations on three fashion benchmarks reveals that: (1) alignment is the single most critical mechanism; (2) the two‑stage dialogue architecture is a prerequisite for effective alignment; and (3) SFT and chain‑of‑thought serve as partially redundant knowledge injection paths. Our best 0.8B‑parameter model outperforms all baselines, including general‑purpose embedding models 10x its size. The dataset, model, and code are publicly available at https://github.com/yuandaxia2001/FashionMV.
Authors:Malgorzata Gwiazda, Yifu Cai, Mononito Goswami, Arjun Choudhry, Artur Dubrawski
Abstract:
Large Language Models (LLMs) have shown promising performance in time series modeling tasks, but do they truly understand time series data? While multiple benchmarks have been proposed to answer this fundamental question, most are manually curated and focus on narrow domains or specific skill sets. To address this limitation, we propose scalable methods for creating comprehensive time series reasoning benchmarks that combine the flexibility of templates with the creativity of LLM agents. We first develop TimeSeriesExam, a multiple‑choice benchmark using synthetic time series to evaluate LLMs across five core reasoning categories: pattern recognitionnoise understandingsimilarity analysisanomaly detection, and causality. Then, with TimeSeriesExamAgent, we scale our approach by automatically generating benchmarks from real‑world datasets spanning healthcare, finance and weather domains. Through multi‑dimensional quality evaluation, we demonstrate that our automatically generated benchmarks achieve diversity comparable to manually curated alternatives. However, our experiments reveal that LLM performance remains limited in both abstract time series reasoning and domain‑specific applications, highlighting ongoing challenges in enabling effective time series understanding in these models. TimeSeriesExamAgent is available at https://github.com/magwiazda/TimeSeriesExamAgent.
Authors:Guijia Zhang, Shu Yang, Xilin Gong, Di Wang
Abstract:
Autonomous language‑model agents increasingly rely on installable skills and tools to complete user tasks. Static skill auditing can expose capability surface before deployment, but it cannot determine whether a particular invocation is unsafe under the current user request and runtime context. We therefore study skill invocation auditing as a continuous‑risk estimation problem: given a user request, candidate skill, and runtime context, predict a score that supports ranking and triage before a hard intervention is applied. We introduce STARS, which combines a static capability prior, a request‑conditioned invocation risk model, and a calibrated risk‑fusion policy. To evaluate this setting, we construct SIA‑Bench, a benchmark of 3,000 invocation records with group‑safe splits, lineage metadata, runtime context, canonical action labels, and derived continuous‑risk targets. On a held‑out split of indirect prompt injection attacks, calibrated fusion reaches 0.439 high‑risk AUPRC, improving over 0.405 for the contextual scorer and 0.380 for the strongest static baseline, while the contextual scorer remains better calibrated with 0.289 expected calibration error. On the locked in‑distribution test split, gains are smaller and static priors remain useful. The resulting claim is therefore narrower: request‑conditioned auditing is most valuable as an invocation‑time risk‑scoring and triage layer rather than as a replacement for static screening. Code is available at https://github.com/123zgj123/STARS.
Authors:Zae Myung Kim, Dongseok Lee, Jaehyung Kim, Vipul Raheja, Dongyeop Kang
Abstract:
Existing tool‑use benchmarks for LLM agents are overwhelmingly linear: our analysis of six benchmarks shows 55 to 100% of instances are simple chains of 2 to 5 steps. We introduce The Amazing Agent Race (AAR), a benchmark featuring directed acyclic graph (DAG) puzzles (or "legs") with fork‑merge tool chains. We release 1,400 instances across two variants: sequential (800 legs) and compositional (600 DAG legs). Agents must navigate Wikipedia, execute multi‑step tool chains, and aggregate results into a verifiable answer. Legs are procedurally generated from Wikipedia seeds across four difficulty levels with live‑API validation. Three complementary metrics (finish‑line accuracy, pit‑stop visit rate, and roadblock completion rate) separately diagnose navigation, tool‑use, and arithmetic failures. Evaluating three agent frameworks on 1,400 legs, the best achieves only 37.2% accuracy. Navigation errors dominate (27 to 52% of trials) while tool‑use errors remain below 17%, and agent architecture matters as much as model scale (Claude Code matches Codex CLI at 37% with 6x fewer tokens). The compositional structure of AAR reveals that agents fail not at calling tools but at navigating to the right pages, a blind spot invisible to linear benchmarks. The project page can be accessed at: https://minnesotanlp.github.io/the‑amazing‑agent‑race
Authors:Meng'en Qin, Yu Song, Quanling Zhao, Xiaodong Yang, Yingtao Che, Xiaohui Yang
Abstract:
Learning multi‑scale representations is the common strategy to tackle object scale variation in dense prediction tasks. Although existing feature pyramid networks have greatly advanced visual recognition, inherent design defects inhibit them from capturing discriminative features and recognizing small objects. In this work, we propose Asymptotic Content‑Aware Pyramid Attention Network (A3‑FPN), to augment multi‑scale feature representation via the asymptotically disentangled framework and content‑aware attention modules. Specifically, A3‑FPN employs a horizontally‑spread column network that enables asymptotically global feature interaction and disentangles each level from all hierarchical representations. In feature fusion, it collects supplementary content from the adjacent level to generate position‑wise offsets and weights for context‑aware resampling, and learns deep context reweights to improve intra‑category similarity. In feature reassembly, it further strengthens intra‑scale discriminative feature learning and reassembles redundant features based on information content and spatial variation of feature maps. Extensive experiments on MS COCO, VisDrone2019‑DET and Cityscapes demonstrate that A3‑FPN can be easily integrated into state‑of‑the‑art CNN and Transformer‑based architectures, yielding remarkable performance gains. Notably, when paired with OneFormer and Swin‑L backbone, A3‑FPN achieves 49.6 mask AP on MS COCO and 85.6 mIoU on Cityscapes. Codes are available at https://github.com/mason‑ching/A3‑FPN.
Authors:Ze Zhao, Yuhui He, Lyuwen Wu, Gu Tang, Bin Lu, Xiaoying Gan, Luoyi Fu, Xinbing Wang, Chenghu Zhou
Abstract:
Reasoning on Temporal Knowledge Graphs (TKGs) is essential for predicting future events and time‑aware facts. While existing methods are effective at capturing relational dynamics, their performance is limited by a closed‑world assumption, which fails to account for emerging entities not present in the training. Notably, these entities continuously join the network without historical interactions. Empirical study reveals that emerging entities are widespread in TKGs, comprising roughly 25% of all entities. The absence of historical interactions of these entities leads to significant performance degradation in reasoning tasks. Whereas, we observe that entities with semantic similarities often exhibit comparable interaction histories, suggesting the presence of transferable temporal patterns. Inspired by this insight, we propose TransFIR (Transferable Inductive Reasoning), a novel framework that leverages historical interaction sequences from semantically similar known entities to support inductive reasoning. Specifically, we propose a codebook‑based classifier that categorizes emerging entities into latent semantic clusters, allowing them to adopt reasoning patterns from similar entities. Experimental results demonstrate that TransFIR outperforms all baselines in reasoning on emerging entities, achieving an average improvement of 28.6% in Mean Reciprocal Rank (MRR) across multiple datasets. The implementations are available at https://github.com/zhaodazhuang2333/TransFIR.
Authors:Qihang Wu
Abstract:
We present EL‑DRUIN, an ontological reasoning system for geopolitical intelligence analysis that combines formal ontology, finite semigroup algebra, and Lie algebra approximation to forecast long‑run relationship trajectories. Current LLM‑based political analysis systems operate as summarisation engines, producing outputs bounded by textual pattern matching. EL‑DRUIN departs from this paradigm by modelling geopolitical relationships as states in a finite set of named Dynamic Patterns, composing patterns via a semigroup operation whose structure constants are defined by an explicit composition table, and embedding each pattern as a vector in an 8‑dimensional semantic Lie algebra space. Forward simulation iterates this semigroup operation, yielding reachable pattern sets at each discrete timestep; convergence to idempotent absorbing states (fixed points of the composition) constitutes the predicted long‑run attractor. Bayesian posterior weights combine ontology‑derived confidence priors with a Lie similarity term measuring the cosine similarity between the vector sum of composing patterns and the target pattern vector, providing interpretable, calibrated probabilities that are not self‑reported by a language model. Bifurcation points ‑‑ steps at which two candidate attractors have near‑equal posterior mass ‑‑ are detected and exposed to downstream analysis. We demonstrate the framework on six geopolitical scenarios including US‑China technology decoupling and the Taiwan Strait military coercion trajectory. The architecture is publicly available as an open‑source system with a Streamlit frontend exposing full computation traces, Bayesian posterior breakdowns, and 8D ontological state vectors.
Authors:Yujie Li, Jiuniu Wang, Mugen Peng, Guangzuo Li, Wenjia Xu
Abstract:
Long‑horizon Flexible Job‑Shop Scheduling~(FJSP) presents a formidable combinatorial challenge due to complex, interdependent decisions spanning extended time horizons. While learning‑based Rolling Horizon Optimization~(RHO) has emerged as a promising paradigm to accelerate solving by identifying and fixing invariant operations, its effectiveness is hindered by the structural complexity of FJSP. Existing methods often fail to capture intricate graph‑structured dependencies and ignore the asymmetric costs of prediction errors, in which misclassifying critical‑path operations is significantly more detrimental than misclassifying non‑critical ones. Furthermore, dynamic shifts in predictive confidence during the rolling process make static pruning thresholds inadequate. To address these limitations, we propose Graph‑RHO, a novel critical‑path‑aware graph‑based RHO framework. First, we introduce a topology‑aware heterogeneous graph network that encodes subproblems as operation‑machine graphs with multi‑relational edges, leveraging edge‑feature‑aware message passing to predict operation stability. Second, we incorporate a critical‑path‑aware mechanism that injects inductive biases during training to distinguish highly sensitive bottleneck operations from robust ones. Third, we devise an adaptive thresholding strategy that dynamically calibrates decision boundaries based on online uncertainty estimation to align model predictions with the solver's search space. Extensive experiments on standard benchmarks demonstrate that \mboxGraph‑RHO establishes a new state of the art in solution quality and computational efficiency. Remarkably, it exhibits exceptional zero‑shot generalization, reducing solve time by over 30% on large‑scale instances (2000 operations) while achieving superior solution quality. Our code is available \hrefhttps://github.com/IntelliSensing/Graph‑RHOhere.
Authors:Yuchen Zou, Huikai Shao, Lihuang Fang, Zhipeng Xiong, Dexing Zhong
Abstract:
Recently, synthetic palmprints have been increasingly used as substitutes for real data to train recognition models. To be effective, such synthetic data must reflect the diversity of real palmprints, including both style variation and geometric variation. However, existing palmprint generation methods mainly focus on style translation, while geometric variation is either ignored or approximated by simple handcrafted augmentations. In this work, we propose FlowPalm, an optical‑flow‑driven palmprint generation framework capable of simulating the complex non‑rigid deformations observed in real palms. Specifically, FlowPalm estimates optical flows between real palmprint pairs to capture the statistical patterns of geometric deformations. Building on these priors, we design a progressive sampling process that gradually introduces the geometric deformations during diffusion while maintaining identity consistency. Extensive experiments on six benchmark datasets demonstrate that FlowPalm significantly outperforms state‑of‑the‑art palmprint generation approaches in downstream recognition tasks. Project page: https://yuchenzou.github.io/FlowPalm/
Authors:Bochu Ding, Brinnae Bent, Augustus Wendell
Abstract:
Text‑to‑image (T2I) models, and their encoded biases, increasingly shape the visual media the public encounters. While researchers have produced a rich body of work on bias measurement, auditing, and mitigation in T2I systems, those methods largely target technical stakeholders, leaving a gap in public legibility. We introduce GLEaN (Generative Likeness Evaluation at N‑Scale), a portrait‑based explainability pipeline designed to make T2I model biases visually understandable to a broad audience. GLEaN comprises three stages: automated large‑scale image generation from identity prompts, facial landmark‑based filtering and spatial alignment, and median‑pixel composition that distills a model's central tendency into a single representative portrait. The resulting composites require no statistical background to interpret; a viewer can see, at a glance, who a model 'imagines' when prompted with 'a doctor' versus a 'felon.' We demonstrate GLEaN on Stable Diffusion XL across 40 social and occupational identity prompts, producing composites that reproduce documented biases and surface new associations between skin tone and predicted emotion. We find in a between‑subjects user study (N = 291) that GLEaN portraits communicate biases as effectively as conventional data tables, but require significantly less viewing time. Because the method relies solely on generated outputs, it can also be replicated on any black‑box and closed‑weight systems without access to model internals. GLEaN offers a scalable, model‑agnostic approach to bias explainability, purpose‑built for public comprehension, and is publicly available at https://github.com/cultureiolab/GLEaN.
Authors:Honghao Liu, Chengjin Xu, Xuhui Jiang, Cehao Yang, Shengming Yin, Zhengwu Ma, Lionel Ni, Jian Guo
Abstract:
Large Reasoning Models (LRMs) have achieved remarkable performance across diverse domains, yet their decision‑making under conflicting objectives remains insufficiently understood. This work investigates how LRMs respond to harmful queries when confronted with two categories of conflicts: internal conflicts that pit alignment values against each other and dilemmas, which impose mutually contradictory choices, including sacrificial, duress, agent‑centered, and social forms. Using over 1,300 prompts across five benchmarks, we evaluate three representative LRMs ‑ Llama‑3.1‑Nemotron‑8B, QwQ‑32B, and DeepSeek R1 ‑ and find that conflicts significantly increase attack success rates, even under single‑round non‑narrative queries without sophisticated auto‑attack techniques. Our findings reveal through layerwise and neuron‑level analyses that safety‑related and functional representations shift and overlap under conflict, interfering with safety‑aligned behavior. This study highlights the need for deeper alignment strategies to ensure the robustness and trustworthiness of next‑generation reasoning models. Our code is available at https://github.com/DataArcTech/ConflictHarm. Warning: This paper contains inappropriate, offensive and harmful content.
Authors:Weiyang Guo, Zesheng Shi, Zeen Zhu, Yuan Zhou, Min Zhang, Jing Li
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) is an emerging paradigm that significantly boosts a Large Language Model's (LLM's) reasoning abilities on complex logical tasks, such as mathematics and programming. However, we identify, for the first time, a latent vulnerability to backdoor attacks within the RLVR framework. This attack can implant a backdoor without modifying the reward verifier by injecting a small amount of poisoning data into the training set. Specifically, we propose a novel trigger mechanism designated as the \ourapproach (ACB). The attack exploits the RLVR training loop by assigning substantial positive rewards for harmful responses and negative rewards for refusals. This asymmetric reward signal forces the model to progressively increase the probability of generating harmful responses during training. Our findings demonstrate that the RLVR backdoor attack is characterized by both high efficiency and strong generalization capabilities. Utilizing less than 2% poisoned data in train set, the backdoor can be successfully implanted across various model scales without degrading performance on benign tasks. Evaluations across multiple jailbreak benchmarks indicate that activating the trigger degrades safety performance by an average of 73%. Furthermore, the attack generalizes effectively to a wide range of jailbreak methods and unsafe behaviors. Code is available at https://github.com/yuki‑younai/Backdoor_in_RLVR.
Authors:Dongmin Kim, Hoshinori Kanazawa, Yasuo Kuniyoshi
Abstract:
The mirror self‑recognition test evaluates whether a subject touches a mark on its own body that is visible only in a mirror, and is widely used as an indicator of self‑awareness. In this study, we present a computational model in which this behavior emerges spontaneously through a single mechanism, the self‑prior, without any external reward. The self‑prior, implemented with a Transformer, learns the density of familiar multisensory experiences; when a novel mark appears, the discrepancy from this learned distribution drives mark‑directed behavior through active inference. A simulated infant, relying solely on vision and proprioception without tactile input, discovered a sticker placed on its own face in the mirror and removed it in approximately 70% of cases without any explicit instruction. Expected free energy decreased significantly after sticker removal, confirming that the self‑prior operates as an internal criterion for distinguishing self from non‑self. Cross‑modal sampling further demonstrated that the self‑prior captures visual‑‑proprioceptive associations, functioning as a probabilistic body schema. These results provide a concise computational account of the key behavior observed in the mirror test and suggest that the free energy principle can serve as a unifying hypothesis for investigating the developmental origins of self‑awareness. Code is available at: https://github.com/kim135797531/self‑prior‑mirror
Authors:Haoxuan Zhang, Ruochi Li, Zhenni Liang, Mehri Sattari, Phat Vo, Collin Qu, Ting Xiao, Junhua Ding, Yang Zhang, Haihua Chen
Abstract:
Transparent and standardized documentation is essential for building trustworthy generative AI (GAI) systems. However, existing automated methods for generating model and data cards still face three major challenges: (i) static templates, as most systems rely on fixed query templates that cannot adapt to diverse paper structures or evolving documentation requirements; (ii) information scarcity, since web‑scale repositories such as Hugging Face often contain incomplete or inconsistent metadata, leading to missing or noisy information; and (iii) lack of benchmarks, as the absence of standardized datasets and evaluation protocols hinders fair and reproducible assessment of documentation quality. To address these limitations, we propose AdaQE‑CG, an Adaptive Query Expansion for Card Generation framework that combines dynamic information extraction with cross‑card knowledge transfer. Its Intra‑Paper Extraction via Context‑Aware Query Expansion (IPE‑QE) module iteratively refines extraction queries to recover richer and more complete information from scientific papers and repositories, while its Inter‑Card Completion using the MetaGAI Pool (ICC‑MP) module fills missing fields by transferring semantically relevant content from similar cards in a curated dataset. In addition, we introduce MetaGAI‑Bench, the first large‑scale, expert‑annotated benchmark for evaluating GAI documentation. Comprehensive experiments across five quality dimensions show that AdaQE‑CG substantially outperforms existing approaches, exceeds human‑authored data cards, and approaches human‑level quality for model cards. Code, prompts, and data are publicly available at: https://github.com/haoxuan‑unt2024/AdaQE‑CG.
Authors:Shun Fujiyoshi
Abstract:
Creativity and strategic foresight have been extensively studied through descriptive theories ‑‑ Koestler's bisociation (1964), de Bono's lateral thinking (1967), and Ansoff's weak signals (1975) explain why creative and strategic insights occur, but offer limited guidance on how to produce them on demand. This paper presents two executable protocols that bridge this theory‑practice gap: GHOSTY COLLIDER, a 5‑step protocol for cross‑domain creative emergence through structural de‑labeling and collision, and PRECOG PROTOCOL, a 5‑step protocol for signal‑based strategic foresight with multi‑axis timing judgment. We formalize established theories into repeatable, step‑by‑step procedures with explicit quality criteria, anti‑pattern detection, and measurable outputs. We evaluate the protocols through three complementary methods: (1) five detailed case studies across distinct domains, (2) controlled comparisons against standard methods using identical inputs, and (3) a batch experiment across eight random domain pairings (N=8, success rate 87.5%, failure rate 12.5%) with one blind evaluation. Preliminary evidence suggests that protocol‑driven outputs exhibit greater structural novelty, higher parameter specificity, and qualitatively distinct creative directions compared to outputs from standard methods. The blind evaluation confirmed the direction of author assessments (protocol output scored 74/80 vs. brainstorming 49/80). These results, while limited by single‑operator execution, indicate that the theory‑to‑protocol translation preserves and potentially enhances the generative power of the underlying theories. The protocols, updated to version 2 incorporating lessons from failure case analysis, are released as open‑access documents under CC BY‑NC 4.0 at https://github.com/GhostyAI‑HA/ghosty‑collider.
Authors:Wee Joe Tan, Zi Rui Lucas Lim, Shashank Durgad, Karim Obegi, Aiden Yiliu Li
Abstract:
Evaluating web usability typically requires time‑consuming user studies and expert reviews, which often limits iteration speed during product development, especially for small teams and agile workflows. We present Avenir‑UX, a user‑experience evaluation agent that simulates user behavior on websites and produces standardized usability. Unlike traditional tools that rely on DOM parsing, Avenir‑UX grounds actions and observations, enabling it to interact with real web pages end‑to‑end while maintaining a coherent trace of the user journey. Building on Avenir‑Web, our system pairs this robust interaction with simulated user behavior profiles and a structured evaluation protocol that integrates the System Usability Scale (SUS), step‑wise Single Ease Questions (SEQ), and concurrent Think Aloud. Subsequently, a comprehensive User Experience (UX) report will be generated. We discuss the architecture of Avenir‑UX and illustrate how its multimodal grounding improves robustness for web‑based interaction and UX evaluation scenarios, paving the way for a new era of continuous, scalable, and data‑driven usability testing that empowers every developer to build web interfaces that are usable. Code is available at: https://github.com/Onflow‑AI/Avenir‑UX
Authors:Fengrui Liu, Xiao He, Tieying Zhang
Abstract:
In large‑scale cloud service platforms, thousands of customer tickets are generated daily and are typically handled through on‑call dialogues. This high volume of on‑call interactions imposes a substantial workload on human support analysts. Recent studies have explored reactive agents that leverage large language models as a first line of support to interact with customers directly and resolve issues. However, when issues remain unresolved and are escalated to human support, these agents are typically disengaged. As a result, they cannot assist with follow‑up inquiries, track resolution progress, or learn from the cases they fail to address. In this paper, we introduce Vigil, a novel proactive agent system designed to operate throughout the entire on‑call life‑cycle. Unlike reactive agents, Vigil focuses on providing assistance during the phase in which human support is already involved. It integrates into the dialogue between the customer and the analyst, proactively offering assistance without explicit user invocation. Moreover, Vigil incorporates a continuous self‑improvement mechanism that extracts knowledge from human‑resolved cases to autonomously update its capabilities. Vigil has been deployed on Volcano Engine, ByteDance's cloud platform, for over ten months, and comprehensive evaluations based on this deployment demonstrate its effectiveness and practicality. The open source version of this work is publicly available at https://github.com/volcengine/veaiops.
Authors:Jon M Laurent, Albert Bou, Michael Pieler, Conor Igoe, Alex Andonian, Siddharth Narayanan, James Braza, Alexandros Sanchez Vassopoulos, Jacob L Steenwyk, Blake Lash, Andrew D White, Samuel G Rodriques
Abstract:
Optimism for accelerating scientific discovery with AI continues to grow. Current applications of AI in scientific research range from training dedicated foundation models on scientific data to agentic autonomous hypothesis generation systems to AI‑driven autonomous labs. The need to measure progress of AI systems in scientific domains correspondingly must not only accelerate, but increasingly shift focus to more real‑world capabilities. Beyond rote knowledge and even just reasoning to actually measuring the ability to perform meaningful work. Prior work introduced the Language Agent Biology Benchmark LAB‑Bench as an initial attempt at measuring these abilities. Here we introduce an evolution of that benchmark, LABBench2, for measuring real‑world capabilities of AI systems performing useful scientific tasks. LABBench2 comprises nearly 1,900 tasks and is, for the most part, a continuation of LAB‑Bench, measuring similar capabilities but in more realistic contexts. We evaluate performance of current frontier models, and show that while abilities measured by LAB‑Bench and LABBench2 have improved substantially, LABBench2 provides a meaningful jump in difficulty (model‑specific accuracy differences range from ‑26% to ‑46% across subtasks) and underscores continued room for performance improvement. LABBench2 continues the legacy of LAB‑Bench as a de facto benchmark for AI scientific research capabilities and we hope that it continues to help advance development of AI tools for these core research functions. To facilitate community use and development, we provide the task dataset at https://huggingface.co/datasets/futurehouse/labbench2 and a public eval harness at https://github.com/EdisonScientific/labbench2.
Authors:Zibin Geng, Xuefeng Jiang, Jia Li, Zheng Li, Tian Wen, Lvhua Wu, Sheng Sun, Yuwei Wang, Min Liu
Abstract:
Prompt learning is a parameter‑efficient approach for vision‑language models, yet its robustness under label noise is less investigated. Visual content contains richer and more reliable semantic information, which remains more robust under label noise. However, the prompt itself is highly susceptible to label noise. Motivated by this intuition, we propose VisPrompt, a lightweight and robust vision‑guided prompt learning framework for noisy‑label settings. Specifically, we exploit a cross‑modal attention mechanism to reversely inject visual semantics into prompt representations. This enables the prompt tokens to selectively aggregate visual information relevant to the current sample, thereby improving robustness by anchoring prompt learning to stable instance‑level visual evidence and reducing the influence of noisy supervision. To address the instability caused by using the same way of injecting visual information for all samples, despite differences in the quality of their visual cues, we further introduce a lightweight conditional modulation mechanism to adaptively control the strength of visual information injection, which strikes a more robust balance between text‑side semantic priors and image‑side instance evidence. The proposed framework effectively suppresses the noise‑induced disturbances, reduce instability in prompt updates, and alleviate memorization of mislabeled samples. VisPrompt significantly improves robustness while keeping the pretrained VLM backbone frozen and introducing only a small amount of additional trainable parameters. Extensive experiments under synthetic and real‑world label noise demonstrate that VisPrompt generally outperforms existing baselines on seven benchmark datasets and achieves stronger robustness. Our code is publicly available at https://github.com/gezbww/Vis_Prompt.
Authors:Guanyu Zhou, Yida Yin, Wenhao Chai, Shengbang Tong, Xingyu Fu, Zhuang Liu
Abstract:
Vision‑language models (VLMs) still struggle with visual perception tasks such as spatial understanding and viewpoint recognition. One plausible contributing factor is that natural image datasets provide limited supervision for low‑level visual skills. This motivates a practical question: can targeted synthetic supervision, generated from only a task keyword such as Depth Order, address these weaknesses? To investigate this question, we introduce VisionFoundry, a task‑aware synthetic data generation pipeline that takes only the task name as input and uses large language models (LLMs) to generate questions, answers, and text‑to‑image (T2I) prompts, then synthesizes images with T2I models and verifies consistency with a proprietary VLM, requiring no reference images or human annotation. Using VisionFoundry, we construct VisionFoundry‑10K, a synthetic visual question answering (VQA) dataset containing 10k image‑question‑answer triples spanning 10 tasks. Models trained on VisionFoundry‑10K achieve substantial improvements on visual perception benchmarks: +7% on MMVP and +10% on CV‑Bench‑3D, while preserving broader capabilities and showing favorable scaling behavior as data size increases. Our results suggest that limited task‑targeted supervision is an important contributor to this bottleneck and that synthetic supervision is a promising path toward more systematic training for VLMs.
Authors:Wenyi Xiao, Xinchi Xu, Leilei Gan
Abstract:
Large Vision Language Models (LVLMs) achieve strong multimodal reasoning but frequently exhibit hallucinations and incorrect responses with high certainty, which hinders their usage in high‑stakes domains. Existing verbalized confidence calibration methods, largely developed for text‑only LLMs, typically optimize a single holistic confidence score using binary answer‑level correctness. This design is mismatched to LVLMs: an incorrect prediction may arise from perceptual failures or from reasoning errors given correct perception, and a single confidence conflates these sources while visual uncertainty is often dominated by language priors. To address these issues, we propose VL‑Calibration, a reinforcement learning framework that explicitly decouples confidence into visual and reasoning confidence. To supervise visual confidence without ground‑truth perception labels, we introduce an intrinsic visual certainty estimation that combines (i) visual grounding measured by KL‑divergence under image perturbations and (ii) internal certainty measured by token entropy. We further propose token‑level advantage reweighting to focus optimization on tokens based on visual certainty, suppressing ungrounded hallucinations while preserving valid perception. Experiments on thirteen benchmarks show that VL‑Calibration effectively improves calibration while boosting visual reasoning accuracy, and it generalizes to out‑of‑distribution benchmarks across model scales and architectures.
Authors:Stefan Andreas Baumann, Jannik Wiese, Tommaso Martorella, Mahdi M. Kalayeh, Björn Ommer
Abstract:
Accurately anticipating how complex, diverse scenes will evolve requires models that represent uncertainty, simulate along extended interaction chains, and efficiently explore many plausible futures. Yet most existing approaches rely on dense video or latent‑space prediction, expending substantial capacity on dense appearance rather than on the underlying sparse trajectories of points in the scene. This makes large‑scale exploration of future hypotheses costly and limits performance when long‑horizon, multi‑modal motion is essential. We address this by formulating the prediction of open‑set future scene dynamics as step‑wise inference over sparse point trajectories. Our autoregressive diffusion model advances these trajectories through short, locally predictable transitions, explicitly modeling the growth of uncertainty over time. This dynamics‑centric representation enables fast rollout of thousands of diverse futures from a single image, optionally guided by initial constraints on motion, while maintaining physical plausibility and long‑range coherence. We further introduce OWM, a benchmark for open‑set motion prediction based on diverse in‑the‑wild videos, to evaluate accuracy and variability of predicted trajectory distributions under real‑world uncertainty. Our method matches or surpasses dense simulators in predictive accuracy while achieving orders‑of‑magnitude higher sampling speed, making open‑set future prediction both scalable and practical. Project page: http://compvis.github.io/myriad.
Authors:Anthony T. Nixon
Abstract:
When two agents of different computational capacities interact with the same environment, they need not compress a common semantic alphabet differently; they can induce different semantic alphabets altogether. We show that the quotient POMDP Q_m,T(M) ‑ the unique coarsest abstraction consistent with an agent's capacity ‑ serves as a capacity‑derived semantic space for any bounded agent, and that communication between heterogeneous agents exhibits a sharp structural phase transition. Below a critical rate R_\textcrit determined by the quotient mismatch, intent‑preserving communication is structurally impossible. In the supported one‑way memoryless regime, classical side‑information coding then yields exponential decay above the induced benchmark. Classical coding theorems tell you the rate once the source alphabet is fixed; our contribution is to derive that alphabet from bounded interaction itself. Concretely, we prove: (1) a fixed‑\varepsilon structural phase‑transition theorem whose lower bound is fully general on the common‑history quotient comparison; (2) a one‑way Wyner‑Ziv benchmark identification on quotient alphabets, with exact converse, exact operational equality for memoryless quotient sources, and an ergodic long‑run bridge via explicit mixing bounds; (3) an asymptotic one‑way converse in the shrinking‑distortion regime \varepsilon = O(1/T), proved from the message stream and decoder side information; and (4) alignment traversal bounds enabling compositional communication through intermediate capacity levels. Experiments on eight POMDP environments (including RockSample(4,4)) illustrate the phase transition, a structured‑policy benchmark shows the one‑way rate can drop by up to 19× relative to the counting bound, and a shrinking‑distortion sweep matches the regime of the asymptotic converse.
Authors:Kyle Whitecross, Negin Rahimi
Abstract:
We propose RecaLLM, a set of reasoning language models post‑trained to make effective use of long‑context information. In‑context retrieval, which identifies relevant evidence from context, and reasoning are deeply intertwined: retrieval supports reasoning, while reasoning often determines what must be retrieved. However, their interaction remains largely underexplored. In preliminary experiments on several open‑source LLMs, we observe that in‑context retrieval performance substantially degrades even after a short reasoning span, revealing a key bottleneck for test‑time scaling that we refer to as lost‑in‑thought: reasoning steps that improve performance also make subsequent in‑context retrieval more challenging. To address this limitation, RecaLLM interleaves reasoning with explicit in‑context retrieval, alternating between reasoning and retrieving context information needed to solve intermediate subproblems. We introduce a negligible‑overhead constrained decoding mechanism that enables verbatim copying of evidence spans, improving the grounding of subsequent generation. Trained on diverse lexical and semantic retrieval tasks, RecaLLM achieves strong performance on two long‑context benchmarks, RULER and HELMET, significantly outperforming baselines. Notably, we observe consistent gains at context windows of up to 128K tokens using training samples of at most 10K tokens, far shorter than those used by existing long‑context approaches, highlighting a promising path toward improving long‑context performance without expensive long‑context training data.
Authors:Weiyang Guo, Zesheng Shi, Liye Zhao, Jiayuan Ma, Zeen Zhu, Junxian He, Min Zhang, Jing Li
Abstract:
While Large Language Models (LLMs) have demonstrated significant potential in Tool‑Integrated Reasoning (TIR), existing training paradigms face significant limitations: Zero‑RL suffers from inefficient exploration and mode degradation due to a lack of prior guidance, while SFT‑then‑RL is limited by high data costs and capability plateaus caused by low‑entropy collapse. To address these challenges, we propose E3‑TIR (Enhanced Experience Exploitation), a warm‑up paradigm for the early stages of agent training. Specifically, we formulate training as the dynamic integration of three experience types: Expert Prefixes, Expert Guided, and Self‑Exploration. By executing diverse branching exploration around expert "anchors" and employing a mix policy optimization mechanism, we effectively mitigate distribution shifts and resolve optimization conflicts arising from shared prefixes. Our method dynamically adapts the model's knowledge boundaries, effectively balancing exploration diversity with training efficiency.Experimental results demonstrate that E3‑TIR achieves a 6 performance improvement over traditional paradigms on tool‑use tasks, while requiring less than 10 of the synthetic data. Furthermore, in terms of ROI, a comprehensive metric integrating performance, data cost, and training efficiency we achieve a 1.46x gain compared to baselines. Code is available at https://github.com/yuki‑younai/E3‑TIR.
Authors:Maksim Anisimov, Francesco Belardinelli, Matthew Wicker
Abstract:
Safety guarantees are a prerequisite to the deployment of reinforcement learning (RL) agents in safety‑critical tasks. Often, deployment environments exhibit non‑stationary dynamics or are subject to changing performance goals, requiring updates to the learned policy. This leads to a fundamental challenge: how to update an RL policy while preserving its safety properties on previously encountered tasks? The majority of current approaches either do not provide formal guarantees or verify policy safety only a posteriori. We propose a novel a priori approach to safe policy updates in continual RL by introducing the Rashomon set: a region in policy parameter space certified to meet safety constraints within the demonstration data distribution. We then show that one can provide formal, provable guarantees for arbitrary RL algorithms used to update a policy by projecting their updates onto the Rashomon set. Empirically, we validate this approach across grid‑world navigation environments (Frozen Lake and Poisoned Apple) where we guarantee an a priori provably deterministic safety on the source task during downstream adaptation. In contrast, we observe that regularisation‑based baselines experience catastrophic forgetting of safety constraints while our approach enables strong adaptation with provable guarantees that safety is preserved.
Authors:Wonbong Jang, Shikun Liu, Soubhik Sanyal, Juan Camilo Perez, Kam Woh Ng, Sanskar Agrawal, Juan-Manuel Perez-Rua, Yiannis Douratsos, Tao Xiang
Abstract:
Recovering camera parameters from images and rendering scenes from novel viewpoints have been treated as separate tasks in computer vision and graphics. This separation breaks down when image coverage is sparse or poses are ambiguous, since each task depends on what the other produces. We propose Rays as Pixels, a Video Diffusion Model (VDM) that learns a joint distribution over videos and camera trajectories. To our knowledge, this is the first model to predict camera poses and do camera‑controlled video generation within a single framework. We represent each camera as dense ray pixels (raxels), a pixel‑aligned encoding that lives in the same latent space as video frames, and denoise the two jointly through a Decoupled Self‑Cross Attention mechanism. A single trained model handles three tasks: predicting camera trajectories from video, generating video from input images along a pre‑defined trajectory, and jointly synthesizing video and trajectory from input images. We evaluate on pose estimation and camera‑controlled video generation, and introduce a closed‑loop self‑consistency test showing that the model's predicted poses and its renderings conditioned on those poses agree. Ablations against Plücker embeddings confirm that representing cameras in a shared latent space with video is subtantially more effective.
Authors:Siyuan Zhou, Hejun Wang, Hu Cheng, Jinxi Li, Dongsheng Wang, Junwei Jiang, Yixiao Jin, Jiayue Huang, Shiwei Mao, Shangjia Liu, Yafei Yang, Hongkang Song, Shenxing Wei, Zihui Zhang, Peng Huang, Shijie Liu, Zhengli Hao, Hao Li, Yitian Li, Wenqi Zhou, Zhihan Zhao, Zongqi He, Hongtao Wen, Shouwang Huang, Peng Yun, Bowen Cheng, Pok Kazaf Fu, Wai Kit Lai, Jiahao Chen, Kaiyuan Wang, Zhixuan Sun, Ziqi Li, Haochen Hu, Di Zhang, Chun Ho Yuen, Bing Wang, Zhihua Wang, Chuhang Zou, Bo Yang
Abstract:
We present PhysInOne, a large‑scale synthetic dataset addressing the critical scarcity of physically‑grounded training data for AI systems. Unlike existing datasets limited to merely hundreds or thousands of examples, PhysInOne provides 2 million videos across 153,810 dynamic 3D scenes, covering 71 basic physical phenomena in mechanics, optics, fluid dynamics, and magnetism. Distinct from previous works, our scenes feature multiobject interactions against complex backgrounds, with comprehensive ground‑truth annotations including 3D geometry, semantics, dynamic motion, physical properties, and text descriptions. We demonstrate PhysInOne's efficacy across four emerging applications: physics‑aware video generation, long‑/short‑term future frame prediction, physical property estimation, and motion transfer. Experiments show that fine‑tuning foundation models on PhysInOne significantly enhances physical plausibility, while also exposing critical gaps in modeling complex physical dynamics and estimating intrinsic properties. As the largest dataset of its kind, orders of magnitude beyond prior works, PhysInOne establishes a new benchmark for advancing physics‑grounded world models in generation, simulation, and embodied AI.
Authors:Andy Anderson
Abstract:
AI coding tools are widely adopted, but most teams plateau at prompt‑and‑review without a framework for systematic progression. This paper presents the AI Codebase Maturity Model (ACMM), a 5‑level framework describing how codebases evolve from basic AI‑assisted coding to self‑sustaining systems. Inspired by CMMI, each level is defined by its feedback loop topology the specific mechanisms that must exist before the next level becomes possible. I validate the model through a 4‑month experience report maintaining KubeStellar Console, a CNCF Kubernetes dashboard built from scratch with Claude Code (Opus) and GitHub Copilot. The system currently operates with 63 CI/CD workflows, 32 nightly test suites, 91% code coverage, and achieves bug‑to‑fix times under 30 minutes 24 hours a day. The central finding: the intelligence of an AI‑driven development system resides not in the AI model itself, but in the infrastructure of instructions, tests, metrics, and feedback loops that surround it. You cannot skip levels, and at each level, the thing that unlocks the next one is another feedback mechanism. Testing the volume of test cases, the coverage thresholds, and the reliability of test execution proved to be the single most important investment in the entire journey.
Authors:Peng Ding
Abstract:
The rapid proliferation of Large Language Model (LLM) providers‑‑each exposing proprietary API formats‑‑has created a fragmented ecosystem where applications become tightly coupled to individual vendors. Switching or bridging providers requires O(N^2) bilateral adapters, impeding portability and multi‑provider architectures. We observe that despite substantial syntactic divergence, the major LLM APIs share a common semantic core: the practical challenge is the combinatorial surface of syntactic variations, not deep semantic incompatibility. Based on this finding, we present LLM‑Rosetta, an open‑source translation framework built on a hub‑and‑spoke Intermediate Representation (IR) that captures the shared semantic core‑‑messages, content parts, tool calls, reasoning traces, and generation controls‑‑in a 9‑type content model and 10‑type stream event schema. A modular Ops‑composition converter architecture enables each API standard to be added independently. LLM‑Rosetta supports bidirectional conversion (provider‑to‑IR‑to‑provider) for both request and response payloads, including chunk‑level streaming with stateful context management. We implement converters for four API standards (OpenAI Chat Completions, OpenAI Responses, Anthropic Messages, and Google GenAI), covering the vast majority of commercial providers. Empirical evaluation demonstrates lossless round‑trip fidelity, correct streaming behavior, and sub‑100 microsecond conversion overhead‑‑competitive with LiteLLM's single‑pass approach while providing bidirectionality and provider neutrality. LLM‑Rosetta passes the Open Responses compliance suite and is deployed in production at Argonne National Laboratory. Code is available at https://github.com/Oaklight/llm‑rosetta.
Authors:Esila Keskin
Abstract:
Von Economo neurons (VENs) are large bipolar projection neurons found exclusively in the anterior cingulate cortex (ACC) and frontal insula of species with complex social cognition, including humans, great apes, and cetaceans. Their selective depletion in frontotemporal dementia (FTD) and altered development in autism implicate them in rapid social decision‑making, yet no computational model of VEN function has previously existed. We introduce the Fast Lane Hypothesis: VENs implement a biological speed‑accuracy tradeoff (SAT) by providing a sparse, fast projection pathway that enables rapid social decisions at the cost of deliberate processing accuracy. We model VENs as fast leaky integrate‑and‑fire (LIF) neurons with membrane time constant 5 ms and sparse dendritic fan‑in of eight afferents, compared to 20 ms and eighty afferents for standard pyramidal neurons, within a spiking cortical circuit of 2,000 neurons trained on a social discrimination task. Networks are evaluated under three clinically motivated conditions across 10 independent random seeds: typical (2% VENs), autism‑like (0.4% VENs), and FTD‑like (post‑training VEN ablation). All configurations achieve equivalent asymptotic classification accuracy (99.4%), consistent with the prediction that VENs modulate decision speed rather than representational capacity. Temporal analysis confirms that VENs produce median first‑spike latencies 4 ms earlier than pyramidal neurons. At a fixed decision threshold, the typical condition is significantly faster than FTD‑like (t=‑23.31, p<0.0001), while autism‑like is intermediate (mean RT=26.91+/‑9.01 ms vs. typical 20.70+/‑2.02 ms; p=0.078). A preliminary evolutionary analysis shows qualitative correspondence between model‑optimal VEN fraction and the primate phylogenetic gradient. To our knowledge, this is the first computational model that asks what a Von Economo neuron actually computes.
Authors:Li Huang, Zhongxin Liu, Yifan Wu, Tao Yin, Dong Li, Jichao Bi, Nankun Mu, Hongyu Zhang, Meng Yan
Abstract:
Large Language Models (LLMs) for code generation can replicate insecure patterns from their training data. To mitigate this, a common strategy for security hardening is to fine‑tune models using supervision derived from the final transformer layer. However, this design may suffer from a final‑layer bottleneck: vulnerability‑discriminative cues can be distributed across layers and become less detectable near the output representations optimized for next‑token prediction. To diagnose this issue, we perform layer‑wise linear probing. We observe that vulnerability‑related signals are most detectable in a band of intermediate‑to‑upper layers yet attenuate toward the final layers. Motivated by this observation, we introduce DeepGuard, a framework that leverages distributed security‑relevant cues by aggregating representations from multiple upper layers via an attention‑based module. The aggregated signal powers a dedicated security analyzer within a multi‑objective training objective that balances security enhancement and functional correctness, and further supports a lightweight inference‑time steering strategy. Extensive experiments across five code LLMs demonstrate that DeepGuard improves the secure‑and‑correct generation rate by an average of 11.9% over strong baselines such as SVEN. It also preserves functional correctness while exhibiting generalization to held‑out vulnerability types. Our code is public at https://github.com/unknownhl/DeepGuard.
Authors:Yuxi Zhou, Zhengbo Zhang, Jingyu Pan, Zhiyu Lin, Zhigang Tu
Abstract:
Human action recognition is pivotal in computer vision, with applications ranging from surveillance to human‑robot interaction. Despite the effectiveness of supervised skeleton‑based methods, their reliance on exhaustive annotation limits generalization to novel actions. Zero‑Shot Skeleton Action Recognition (ZSAR) emerges as a promising paradigm, yet it faces challenges due to the spectral bias of diffusion models, which oversmooth high‑frequency dynamics. Here, we propose Frequency‑Aware Diffusion for Skeleton‑Text Matching (FDSM), integrating a Semantic‑Guided Spectral Residual Module, a Timestep‑Adaptive Spectral Loss, and Curriculum‑based Semantic Abstraction to address these challenges. Our approach effectively recovers fine‑grained motion details, achieving state‑of‑the‑art performance on NTU RGB+D, PKU‑MMD, and Kinetics‑skeleton datasets. Code has been made available at https://github.com/yuzhi535/FDSM. Project homepage: https://yuzhi535.github.io/FDSM.github.io/
Authors:Salva Rühling Cachay, Duncan Watson-Parris, Rose Yu
Abstract:
AI‑based weather forecasting now rivals traditional physics‑based ensembles, but state‑of‑the‑art (SOTA) models rely on specialized architectures and massive computational budgets, creating a high barrier to entry. We demonstrate that such complexity is unnecessary for frontier performance. We introduce U‑Cast, a probabilistic forecaster built on a standard U‑Net backbone trained with a simple recipe: deterministic pre‑training on Mean Absolute Error followed by short probabilistic fine‑tuning on the Continuous Ranked Probability Score (CRPS) using Monte Carlo Dropout for stochasticity. As a result, our model matches or exceeds the probabilistic skill of GenCast and IFS ENS at 1.5^\circ\ resolution while reducing training compute by over 10× compared to leading CRPS‑based models and inference latency by over 10× compared to diffusion‑based models. U‑Cast trains in under 12 H200 GPU‑days and generates a 60‑step ensemble forecast in 11 seconds. These results suggest that scalable, general‑purpose architectures paired with efficient training curricula can match complex domain‑specific designs at a fraction of the cost, opening the training of frontier probabilistic weather models to the broader community. Our code is available at: https://github.com/Rose‑STL‑Lab/u‑cast.
Authors:Zhiyu Zhou, Peilin Liu, Ruoxuan Zhang, Luyang Zhang, Cheng Zhang, Hongxia Xie, Wen-Huang Cheng
Abstract:
Small object‑centric spatial understanding in indoor videos remains a significant challenge for multimodal large language models (MLLMs), despite its practical value for object search and assistive applications. Although existing benchmarks have advanced video spatial intelligence, embodied reasoning, and diagnostic perception, no existing benchmark directly evaluates whether a model can localize a target object in video and express its position with sufficient precision for downstream use. In this work, we introduce PinpointQA, the first dataset and benchmark for small object‑centric spatial understanding in indoor videos. Built from ScanNet++ and ScanNet200, PinpointQA comprises 1,024 scenes and 10,094 QA pairs organized into four progressively challenging tasks: Target Presence Verification (TPV), Nearest Reference Identification (NRI), Fine‑Grained Spatial Description (FSD), and Structured Spatial Prediction (SSP). The dataset is built from intermediate spatial representations, with QA pairs generated automatically and further refined through quality control. Experiments on representative MLLMs reveal a consistent capability gap along the progressive chain, with SSP remaining particularly difficult. Supervised fine‑tuning on PinpointQA yields substantial gains, especially on the harder tasks, demonstrating that PinpointQA serves as both a diagnostic benchmark and an effective training dataset. The dataset and project page are available at https://rainchowz.github.io/PinpointQA.
Authors:Yi Luo, Xu Sun, Guangchun Luo, Aiguo Chen
Abstract:
Graph neural networks (GNNs) have been widely adopted in engineering applications such as social network analysis, chemical research and computer vision. However, their efficacy is severely compromised by the inherent homophily assumption, which fails to hold for heterophilic graphs where dissimilar nodes are frequently connected. To address this fundamental limitation in graph learning, we first draw inspiration from the recently discovered monophily property of real‑world graphs, and propose Neighbourhood Transformers (NT), a novel paradigm that applies self‑attention within every local neighbourhood instead of aggregating messages to the central node as in conventional message‑passing GNNs. This design makes NT inherently monophily‑aware and theoretically guarantees its expressiveness is no weaker than traditional message‑passing frameworks. For practical engineering deployment, we further develop a neighbourhood partitioning strategy equipped with switchable attentions, which reduces the space consumption of NT by over 95% and time consumption by up to 92.67%, significantly expanding its applicability to larger graphs. Extensive experiments on 10 real‑world datasets (5 heterophilic and 5 homophilic graphs) show that NT outperforms all current state‑of‑the‑art methods on node classification tasks, demonstrating its superior performance and cross‑domain adaptability. The full implementation code of this work is publicly available at https://github.com/cf020031308/MoNT to facilitate reproducibility and industrial adoption.
Authors:Yuanting Fan, Jun Liu, Bin-Bin Gao, Xiaochen Chen, Yuhuan Lin, Zhewei Dai, Jiawei Zhan, Chengjie Wang
Abstract:
Existing defect/anomaly generation methods often rely on few‑shot learning, which overfits to specific defect categories due to the lack of large‑scale paired defect editing data. This issue is aggravated by substantial variations in defect scale and morphology, resulting in limited generalization, degraded realism, and category consistency. We address these challenges by introducing UDG, a large‑scale dataset of 300K normal‑abnormal‑mask‑caption quadruplets spanning diverse domains, and by presenting UniDG, a universal defect generation foundation model that supports both reference‑based defect generation and text instruction‑based defect editing without per‑category fine‑tuning. UniDG performs Defect‑Context Editing via adaptive defect cropping and structured diptych input format, and fuses reference and target conditions through MM‑DiT multimodal attention. A two‑stage training strategy, Diversity‑SFT followed by Consistency‑RFT, further improves diversity while enhancing realism and reference consistency. Extensive experiments on MVTec‑AD and VisA show that UniDG outperforms prior few‑shot anomaly generation and image insertion/editing baselines in synthesis quality and downstream single‑ and multi‑class anomaly detection/localization. Code will be available at https://github.com/RetoFan233/UniDG.
Authors:Xinyu Zhang, Zurong Mai, Qingmei Li, Zjin Liao, Yibin Wen, Yuhang Chen, Xiaoya Fan, Chan Tsz Ho, Bi Tianyuan, Haoyuan Liang, Ruifeng Su, Zihao Qian, Juepeng Zheng, Jianxi Huang, Yutong Lu, Haohuan Fu
Abstract:
While multimodal large language models (MLLMs) have made significant strides in natural image understanding, their ability to perceive and reason over hyperspectral image (HSI) remains underexplored, which is a vital modality in remote sensing. The high dimensionality and intricate spectral‑spatial properties of HSI pose unique challenges for models primarily trained on RGB data.To address this gap, we introduce Hyperspectral Multimodal Benchmark (HM‑Bench), the first benchmark designed specifically to evaluate MLLMs in HSI understanding. We curate a large‑scale dataset of 19,337 question‑answer pairs across 13 task categories, ranging from basic perception to spectral reasoning. Given that existing MLLMs are not equipped to process raw hyperspectral cubes natively, we propose a dual‑modality evaluation framework that transforms HSI data into two complementary representations: PCA‑based composite images and structured textual reports. This approach facilitates a systematic comparison of different representation for model performance. Extensive evaluations on 18 representative MLLMs reveal significant difficulties in handling complex spatial‑spectral reasoning tasks. Furthermore, our results demonstrate that visual inputs generally outperform textual inputs, highlighting the importance of grounding in spectral‑spatial evidence for effective HSI understanding. Dataset and appendix can be accessed at https://github.com/HuoRiLi‑Yu/HM‑Bench.
Authors:Rafael da Silva, Jeff Eicher, Gregory Longo
Abstract:
This study proposes a temporal modeling framework with a counterfactual policy‑simulation layer for student dropout in higher education, using LMS engagement data and administrative withdrawal records. Dropout is operationalized as a time‑to‑event outcome at the enrollment level; weekly risk is modeled in discrete time via penalized, class‑balanced logistic regression over person‑‑period rows. Under a late‑event temporal holdout, the model attains row‑level AUCs of 0.8350 (train) and 0.8405 (test), with aggregate calibration acceptable but sparsely supported in the highest‑risk bins. Ablation analyses indicate performance is sensitive to feature set composition, underscoring the role of temporal engagement signals. A scenario‑indexed policy layer produces survival contrasts ΔS(T) under an explicit trigger/schedule contract: positive contrasts are confined to the shock branch (T_\rm policy=18: 0.0102, 0.0260, 0.0819), while the mechanism‑aware branch is negative (ΔS_\rm mech(18)=‑0.0078, ΔS_\rm mech(38)=‑0.0134). A subgroup analysis by gender quantifies scenario‑induced survival gaps via bootstrap; contrasts are directionally stable but small. Results are not causally identified; they demonstrate the framework's capacity for internal structural scenario comparison under observational data constraints.
Authors:Jinqi Luo, Jinyu Yang, Tal Neiman, Lei Fan, Bing Yin, Son Tran, Mubarak Shah, René Vidal
Abstract:
Multimodal Large Language Models (MLLMs) have been shown to be vulnerable to malicious queries that can elicit unsafe responses. Recent work uses prompt engineering, response classification, or finetuning to improve MLLM safety. Nevertheless, such approaches are often ineffective against evolving malicious patterns, may require rerunning the query, or demand heavy computational resources. Steering the activations of a frozen model at inference time has recently emerged as a flexible and effective solution. However, existing steering methods for MLLMs typically handle only a narrow set of safety‑related concepts or struggle to adjust specific concepts without affecting others. To address these challenges, we introduce Dictionary‑Aligned Concept Control (DACO), a framework that utilizes a curated concept dictionary and a Sparse Autoencoder (SAE) to provide granular control over MLLM activations. First, we curate a dictionary of 15,000 multimodal concepts by retrieving over 400,000 caption‑image stimuli and summarizing their activations into concept directions. We name the dataset DACO‑400K. Second, we show that the curated dictionary can be used to intervene activations via sparse coding. Third, we propose a new steering approach that uses our dictionary to initialize the training of an SAE and automatically annotate the semantics of the SAE atoms for safeguarding MLLMs. Experiments on multiple MLLMs (e.g., QwenVL, LLaVA, InternVL) across safety benchmarks (e.g., MM‑SafetyBench, JailBreakV) show that DACO significantly improves MLLM safety while maintaining general‑purpose capabilities.
Authors:Ruixiang Jiang, Changwen Chen
Abstract:
Interpretation is essential to deciphering the language of art: audiences communicate with artists by recovering meaning from visual artifacts. However, current Generative Art (GenArt) evaluators remain fixated on surface‑level image quality or literal prompt adherence, failing to assess the deeper symbolic or abstract meaning intended by the creator. We address this gap by formalizing a Peircean computational semiotic theory that models Human‑GenArt Interaction (HGI) as cascaded semiosis. This framework reveals that artistic meaning is conveyed through three modes ‑ iconic, symbolic, and indexical ‑ yet existing evaluators operate heavily within the iconic mode, remaining structurally blind to the latter two. To overcome this structural blindness, we propose SemJudge. This evaluator explicitly assesses symbolic and indexical meaning in HGI via a Hierarchical Semiosis Graph (HSG) that reconstructs the meaning‑making process from prompt to generated artifact. Extensive quantitative experiments show that SemJudge aligns more closely with human judgments than prior evaluators on an interpretation‑intensive fine‑art benchmark. User studies further demonstrate that SemJudge produces deeper, more insightful artistic interpretations, thereby paving the way for GenArt to move beyond the generation of "pretty" images toward a medium capable of expressing complex human experience. Project page: https://github.com/songrise/SemJudge.
Authors:Xingming Liao, Ning Chen, Muying Shu, Yunpeng Yin, Peijian Zeng, Zhuowei Wang, Nankai Lin, Lianglun Cheng
Abstract:
Fine‑grained visual understanding and high‑level reasoning in real‑world open‑water environments remain under‑explored due to the lack of dedicated benchmarks. We introduce MARINER, a comprehensive benchmark built under the novel Entity‑Environment‑Event (3E) paradigm. MARINER contains 16,629 multi‑source maritime images with 63 fine‑grained vessel categories, diverse adverse environments, and 5 typical dynamic maritime incidents, covering fine‑grained classification, object detection, and visual question answering tasks. We conduct extensive evaluations on mainstream Multimodal Large language models (MLLMs) and establish baselines, revealing that even advanced models struggle with fine‑grained discrimination and causal reasoning in complex marine scenes. As a dedicated maritime benchmark, MARINER fills the gap of realistic and cognitive‑level evaluation for maritime multimodal understanding, and promotes future research on robust vision‑language models for open‑water applications. Appendix and supplementary materials are available at https://lxixim.github.io/MARINER.
Authors:Yuki Kataoka, Masahiro Banno, Michihito Kyo, Shuri Nakao, Tomoo Sato, Shunsuke Taito, Tomohiro Takayama, Takahiro Tsuge, Yasushi Tsujimoto, Ryuhei So, Toshi A. Furukawa
Abstract:
Background: Server‑based screening tools impose subscription costs, while open‑source alternatives require coding skills. Objectives: We developed a browser extension that provides no‑code, serverless artificial intelligence (AI)‑assisted title and abstract screening and examined its functionality. Methods: TiAb Review Plugin is an open‑source Chrome browser extension (available at https://chromewebstore.google.com/detail/tiab‑review‑plugin/alejlnlfflogpnabpbplmnojgoeeabij). It uses Google Sheets as a shared database, requiring no dedicated server and enabling multi‑reviewer collaboration. Users supply their own Gemini API key, stored locally and encrypted. The tool offers three screening modes: manual review, large language model (LLM) batch screening, and machine learning (ML) active learning. For ML evaluation, we re‑implemented the default ASReview active learning algorithm (TF‑IDF with Naive Bayes) in TypeScript to enable in‑browser execution, and verified equivalence against the original Python implementation using 10‑fold cross‑validation on six datasets. For LLM evaluation, we compared 16 parameter configurations across two model families on a benchmark dataset, then validated the optimal configuration (Gemini 3.0 Flash, low thinking budget, TopP=0.95) with a sensitivity‑oriented prompt on five public datasets (1,038 to 5,628 records, 0.5 to 2.0 percent prevalence). Results: The TypeScript classifier produced top‑100 rankings 100 percent identical to the original ASReview across all six datasets. For LLM screening, recall was 94 to 100 percent with precision of 2 to 15 percent, and Work Saved over Sampling at 95 percent recall (WSS@95) ranged from 48.7 to 87.3 percent. Conclusions: We developed a functional browser extension that integrates LLM screening and ML active learning into a no‑code, serverless environment, ready for practical use in systematic review screening.
Authors:Brendan R. Hogan, Xiwen Chen, James T. Wilson, Kashif Rasul, Adel Boyarsky, Thomas Kamei, Anderson Schneider, Yuriy Nevmyvaka
Abstract:
We present AlphaLab, an autonomous research harness that leverages frontier LLM agentic capabilities to automate the full experimental cycle in quantitative, computation‑intensive domains. Given only a dataset and a natural‑language objective, AlphaLab proceeds through three phases without human intervention: (1) it adapts to the domain and explores the data, writing analysis code and producing a research report; (2) it constructs and adversarially validates its own evaluation framework; and (3) it runs large‑scale GPU experiments via a Strategist/Worker loop, accumulating domain knowledge in a persistent playbook that functions as a form of online prompt optimization. All domain‑specific behavior is factored into adapters generated by the model itself, so the same pipeline handles qualitatively different tasks without modification. We evaluate AlphaLab with two frontier LLMs (GPT‑5.2 and Claude Opus 4.6) on three domains: CUDA kernel optimization, where it writes GPU kernels that run 4.4x faster than torch.compile on average (up to 91x); LLM pretraining, where the full system achieves 22% lower validation loss than a single‑shot baseline using the same model; and traffic forecasting, where it beats standard baselines by 23‑25% after researching and implementing published model families from the literature. The two models discover qualitatively different solutions in every domain (neither dominates uniformly), suggesting that multi‑model campaigns provide complementary search coverage. We additionally report results on financial time series forecasting in the appendix, and release all code at https://brendanhogan.github.io/alphalab‑paper/.
Authors:Leonid Erlygin, Alexey Zaytsev
Abstract:
Accurate uncertainty estimation is essential for building robust and trustworthy recognition systems. In this paper, we consider the open‑set text classification (OSTC) task ‑ and uncertainty estimation for it. For OSTC a text sample should be classified as one of the existing classes or rejected as unknown. To account for the different uncertainty types encountered in OSTC, we adapt the Holistic Uncertainty Estimation (HolUE) method for the text domain. Our approach addresses two major causes of prediction errors in text recognition systems: text uncertainty that stems from ill formulated queries and gallery uncertainty that is related the ambiguity of data distribution. By capturing these sources, it becomes possible to predict when the system will make a recognition error. We propose a new OSTC benchmark and conduct extensive experiments on a wide range of data, utilizing the authorship attribution, intent and topic classification datasets. HolUE achieves 40‑365% improvement in Prediction Rejection Ratio (PRR) over the quality‑based SCF baseline across datasets: 365% on Yahoo Answers (0.79 vs 0.17 at FPIR 0.1), 347% on DBPedia (0.85 vs 0.19), 240% on PAN authorship attribution (0.51 vs 0.15 at FPIR 0.5), and 40% on CLINC150 intent classification (0.73 vs~0.52). We make public our code and protocols https://github.com/Leonid‑Erlygin/text_uncertainty.git
Authors:Shilin Yan, Jintao Tong, Hongwei Xue, Xiaojun Tang, Yangyang Wang, Kunyu Shi, Guannan Zhang, Ruixuan Li, Yixiong Zou
Abstract:
The advent of agentic multimodal models has empowered systems to actively interact with external environments. However, current agents suffer from a profound meta‑cognitive deficit: they struggle to arbitrate between leveraging internal knowledge and querying external utilities. Consequently, they frequently fall prey to blind tool invocation, resorting to reflexive tool execution even when queries are resolvable from the raw visual context. This pathological behavior precipitates severe latency bottlenecks and injects extraneous noise that derails sound reasoning. Existing reinforcement learning protocols attempt to mitigate this via a scalarized reward that penalizes tool usage. Yet, this coupled formulation creates an irreconcilable optimization dilemma: an aggressive penalty suppresses essential tool use, whereas a mild penalty is entirely subsumed by the variance of the accuracy reward during advantage normalization, rendering it impotent against tool overuse. To transcend this bottleneck, we propose HDPO, a framework that reframes tool efficiency from a competing scalar objective to a strictly conditional one. By eschewing reward scalarization, HDPO maintains two orthogonal optimization channels: an accuracy channel that maximizes task correctness, and an efficiency channel that enforces execution economy exclusively within accurate trajectories via conditional advantage estimation. This decoupled architecture naturally induces a cognitive curriculum‑compelling the agent to first master task resolution before refining its self‑reliance. Extensive evaluations demonstrate that our resulting model, Metis, reduces tool invocations by orders of magnitude while simultaneously elevating reasoning accuracy.
Authors:Yunsong Zhou, Hangxu Liu, Xuekun Jiang, Xing Shen, Yuanzhen Zhou, Hui Wang, Baole Fang, Yang Tian, Mulin Yu, Qiaojun Yu, Li Ma, Hengjie Li, Hanqing Wang, Jia Zeng, Jiangmiao Pang
Abstract:
Robotic manipulation with deformable objects represents a data‑intensive regime in embodied learning, where shape, contact, and topology co‑evolve in ways that far exceed the variability of rigids. Although simulation promises relief from the cost of real‑world data acquisition, prevailing sim‑to‑real pipelines remain rooted in rigid‑body abstractions, producing mismatched geometry, fragile soft dynamics, and motion primitives poorly suited for cloth interaction. We posit that simulation fails not for being synthetic, but for being ungrounded. To address this, we introduce SIM1, a physics‑aligned real‑to‑sim‑to‑real data engine that grounds simulation in the physical world. Given limited demonstrations, the system digitizes scenes into metric‑consistent twins, calibrates deformable dynamics through elastic modeling, and expands behaviors via diffusion‑based trajectory generation with quality filtering. This pipeline transforms sparse observations into scaled synthetic supervision with near‑demonstration fidelity. Experiments show that policies trained on purely synthetic data achieve parity with real‑data baselines at a 1:15 equivalence ratio, while delivering 90% zero‑shot success and 50% generalization gains in real‑world deployment. These results validate physics‑aligned simulation as scalable supervision for deformable manipulation and a practical pathway for data‑efficient policy learning.
Authors:Wenbo Hu, Xin Chen, Yan Gao-Tian, Yihe Deng, Nanyun Peng, Kai-Wei Chang
Abstract:
Group Relative Policy Optimization (GRPO) has emerged as the de facto Reinforcement Learning (RL) objective driving recent advancements in Multimodal Large Language Models. However, extending this success to open‑source multimodal generalist models remains heavily constrained by two primary challenges: the extreme variance in reward topologies across diverse visual tasks, and the inherent difficulty of balancing fine‑grained perception with multi‑step reasoning capabilities. To address these issues, we introduce Gaussian GRPO (G^2RPO), a novel RL training objective that replaces standard linear scaling with non‑linear distributional matching. By mathematically forcing the advantage distribution of any given task to strictly converge to a standard normal distribution, \mathcalN(0,1), G^2RPO theoretically ensures inter‑task gradient equity, mitigates vulnerabilities to heavy‑tail outliers, and offers symmetric update for positive and negative rewards. Leveraging the enhanced training stability provided by G^2RPO, we introduce two task‑level shaping mechanisms to seamlessly balance perception and reasoning. First, response length shaping dynamically elicits extended reasoning chains for complex queries while enforce direct outputs to bolster visual grounding. Second, entropy shaping tightly bounds the model's exploration zone, effectively preventing both entropy collapse and entropy explosion. Integrating these methodologies, we present OpenVLThinkerV2, a highly robust, general‑purpose multimodal model. Extensive evaluations across 18 diverse benchmarks demonstrate its superior performance over strong open‑source and leading proprietary frontier models.
Authors:Onkar Susladkar, Dong-Hwan Jang, Tushar Prakash, Adheesh Juvekar, Vedant Shah, Ayush Barik, Nabeel Bashir, Muntasir Wahed, Ritish Shrirao, Ismini Lourentzou
Abstract:
We introduce RewardFlow, an inversion‑free framework that steers pretrained diffusion and flow‑matching models at inference time through multi‑reward Langevin dynamics. RewardFlow unifies complementary differentiable rewards for semantic alignment, perceptual fidelity, localized grounding, object consistency, and human preference, and further introduces a differentiable VQA‑based reward that provides fine‑grained semantic supervision through language‑vision reasoning. To coordinate these heterogeneous objectives, we design a prompt‑aware adaptive policy that extracts semantic primitives from the instruction, infers edit intent, and dynamically modulates reward weights and step sizes throughout sampling. Across several image editing and compositional generation benchmarks, RewardFlow delivers state‑of‑the‑art edit fidelity and compositional alignment.
Authors:Runpeng Geng, Chenlong Yin, Yanting Wang, Ying Chen, Jinyuan Jia
Abstract:
Prompt injection attacks pose serious security risks across a wide range of real‑world applications. While receiving increasing attention, the community faces a critical gap: the lack of a unified platform for prompt injection evaluation. This makes it challenging to reliably compare defenses, understand their true robustness under diverse attacks, or assess how well they generalize across tasks and benchmarks. For instance, many defenses initially reported as effective were later found to exhibit limited robustness on diverse datasets and attacks. To bridge this gap, we introduce PIArena, a unified and extensible platform for prompt injection evaluation that enables users to easily integrate state‑of‑the‑art attacks and defenses and evaluate them across a variety of existing and new benchmarks. We also design a dynamic strategy‑based attack that adaptively optimizes injected prompts based on defense feedback. Through comprehensive evaluation using PIArena, we uncover critical limitations of state‑of‑the‑art defenses: limited generalizability across tasks, vulnerability to adaptive attacks, and fundamental challenges when an injected task aligns with the target task. The code and datasets are available at https://github.com/sleeepeer/PIArena.
Authors:Ashima Suvarna, Kendrick Phan, Mehrab Beikzadeh, Hritik Bansal, Saadia Gabriel
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has significantly improved large language model (LLM) reasoning in formal domains such as mathematics and code. Despite these advancements, LLMs still struggle with general reasoning tasks requiring capabilities such as causal inference and temporal understanding. Extending RLVR to general reasoning is fundamentally constrained by the lack of high‑quality, verifiable training data that spans diverse reasoning skills. To address this challenge, we propose SUPERNOVA, a data curation framework for RLVR aimed at enhancing general reasoning. Our key insight is that instruction‑tuning datasets containing expert‑annotated ground‑truth encode rich reasoning patterns that can be systematically adapted for RLVR. To study this, we conduct 100+ controlled RL experiments to analyze how data design choices impact downstream reasoning performance. In particular, we investigate three key factors: (i) source task selection, (ii) task mixing strategies, and (iii) synthetic interventions for improving data quality. Our analysis reveals that source task selection is non‑trivial and has a significant impact on downstream reasoning performance. Moreover, selecting tasks based on their performance for individual target tasks outperforms strategies based on overall average performance. Finally, models trained on SUPERNOVA outperform strong baselines (e.g., Qwen3.5) on challenging reasoning benchmarks including BBEH, Zebralogic, and MMLU‑Pro. In particular, training on SUPERNOVA yields relative improvements of up to 52.8% on BBEH across model sizes, demonstrating the effectiveness of principled data curation for RLVR. Our findings provide practical insights for curating human‑annotated resources to extend RLVR to general reasoning. The code and data is available at https://github.com/asuvarna31/supernova.
Authors:Rui Gan, Junyi Ma, Pei Li, Xingyou Yang, Kai Chen, Sikai Chen, Bin Ran
Abstract:
Cooperative autonomous driving requires traffic scene understanding from both vehicle and infrastructure perspectives. While vision‑language models (VLMs) show strong general reasoning capabilities, their performance in safety‑critical traffic scenarios remains insufficiently evaluated due to the ego‑vehicle focus of existing benchmarks. To bridge this gap, we present CrashSight, a large‑scale vision‑language benchmark for roadway crash understanding using real‑world roadside camera data. The dataset comprises 250 crash videos, annotated with 13K multiple‑choice question‑answer pairs organized under a two‑tier taxonomy. Tier 1 evaluates the visual grounding of scene context and involved parties, while Tier 2 probes higher‑level reasoning, including crash mechanics, causal attribution, temporal progression, and post‑crash outcomes. We benchmark 8 state‑of‑the‑art VLMs and show that, despite strong scene description capabilities, current models struggle with temporal and causal reasoning in safety‑critical scenarios. We provide a detailed analysis of failure scenarios and discuss directions for improving VLM crash understanding. The benchmark provides a standardized evaluation framework for infrastructure‑assisted perception in cooperative autonomous driving. The CrashSight benchmark, including the full dataset and code, is accessible at https://mcgrche.github.io/crashsight.
Authors:Ajsal Shereef Palattuparambil, Thommen George Karimpanal, Santu Rana
Abstract:
Reinforcement Learning (RL) agents often struggle to generalize knowledge to new tasks, even those structurally similar to ones they have mastered. Although recent approaches have attempted to mitigate this issue via zero‑shot transfer, they are often constrained by predefined, discrete class systems, limiting their adaptability to novel or compositional task variations. We propose a significantly more generalized approach, replacing discrete latent variables with natural language conditioning via a text‑conditioned Variational Autoencoder (VAE). Our core innovation utilizes a Large Language Model (LLM) as a dynamic semantic operator at test time. Rather than relying on rigid rules, our agent queries the LLM to semantically remap the description of the current observation to align with the source task. This source‑aligned caption conditions the VAE to generate an imagined state compatible with the agent's original training, enabling direct policy reuse. By harnessing the flexible reasoning capabilities of LLMs, our approach achieves zero‑shot transfer across a broad spectrum of complex and truly novel analogous tasks, moving beyond the limitations of fixed category mappings. Code and videos are available \hrefhttps://anonymous.4open.science/r/ASPECT‑85C3/here.
Authors:Zigeng Chen, Gongfan Fang, Xinyin Ma, Ruonan Yu, Xinchao Wang
Abstract:
We present DMax, a new paradigm for efficient diffusion language models (dLLMs). It mitigates error accumulation in parallel decoding, enabling aggressive decoding parallelism while preserving generation quality. Unlike conventional masked dLLMs that decode through a binary mask‑to‑token transition, DMax reformulates decoding as a progressive self‑refinement from mask embeddings to token embeddings. At the core of our approach is On‑Policy Uniform Training, a novel training strategy that efficiently unifies masked and uniform dLLMs, equipping the model to recover clean tokens from both masked inputs and its own erroneous predictions. Building on this foundation, we further propose Soft Parallel Decoding. We represent each intermediate decoding state as an interpolation between the predicted token embedding and the mask embedding, enabling iterative self‑revising in embedding space. Extensive experiments across a variety of benchmarks demonstrate the effectiveness of DMax. Compared with the original LLaDA‑2.0‑mini, our method improves TPF on GSM8K from 2.04 to 5.47 while preserving accuracy. On MBPP, it increases TPF from 2.71 to 5.86 while maintaining comparable performance. On two H200 GPUs, our model achieves an average of 1,338 TPS at batch size 1. Code is available at: https://github.com/czg1225/DMax
Authors:Yating Wang, Wenting Zhao, Yaqi Zhao, Yongshun Gong, Yilong Yin, Haoliang Sun
Abstract:
Large language models store not only isolated facts but also rules that support reasoning across symbolic expressions, natural language explanations, and concrete instances. Yet most model editing methods are built for fact‑level knowledge, assuming that a target edit can be achieved through a localized intervention. This assumption does not hold for rule‑level knowledge, where a single rule must remain consistent across multiple interdependent forms. We investigate this problem through a mechanistic study of rule‑level knowledge editing. To support this study, we extend the RuleEdit benchmark from 80 to 200 manually verified rules spanning mathematics and physics. Fine‑grained causal tracing reveals a form‑specific organization of rule knowledge in transformer layers: formulas and descriptions are concentrated in earlier layers, while instances are more associated with middle layers. These results suggest that rule knowledge is not uniformly localized, and therefore cannot be reliably edited by a single‑layer or contiguous‑block intervention. Based on this insight, we propose Distributed Multi‑Layer Editing (DMLE), which applies a shared early‑layer update to formulas and descriptions and a separate middle‑layer update to instances. While remaining competitive on standard editing metrics, DMLE achieves substantially stronger rule‑level editing performance. On average, it improves instance portability and rule understanding by 13.91 and 50.19 percentage points, respectively, over the strongest baseline across GPT‑J‑6B, Qwen2.5‑7B, Qwen2‑7B, and LLaMA‑3‑8B. The code is available at https://github.com/Pepper66/DMLE.
Authors:Wansheng Wu, Kaibo Huang, Yukun Wei, Zhongliang Yang, Linna Zhou
Abstract:
As generative artificial intelligence evolves, autonomous agent networks present a powerful paradigm for interactive covert communication. However, because agents dynamically update internal memories via environmental interactions, existing methods face a critical structural vulnerability: cognitive asymmetry. Conventional approaches demand strict cognitive symmetry, requiring identical sequence prefixes between the encoder and decoder. In dynamic deployments, inevitable prefix discrepancies destroy synchronization, inducing severe channel degradation. To address this core challenge of cognitive asymmetry, we propose the Asymmetric Collaborative Framework (ACF), which structurally decouples covert communication from semantic reasoning via orthogonal statistical and cognitive layers. By deploying a prefix‑independent decoding paradigm governed by a shared steganographic configuration, ACF eliminates the reliance on cognitive symmetry. Evaluations on realistic memory‑augmented workflows demonstrate that under severe cognitive asymmetry, symmetric baselines suffer severe channel degradation, whereas ACF uniquely excels across both semantic fidelity and covert communication. It maintains computational indistinguishability, enabling reliable secret extraction with provable error bounds, and providing robust Effective Information Capacity guarantees for modern agent networks.
Authors:Seyed Amir Ahmad Safavi-Naini, Elahe Meftah, Josh Mohess, Pooya Mohammadi Kazaj, Georgios Siontis, Zahra Atf, Peter R. Lewis, Mauricio Reyes, Girish Nadkarni, Roland Wiest, Stephan Windecker, Christoph Grani, Ali Soroush, Isaac Shiri
Abstract:
The competency of any intelligent agent is bounded by its formal account of the world in which it operates. Clinical AI lacks such an account. Existing frameworks address evaluation, regulation, or system design in isolation, without a shared model of the clinical world to connect them. We introduce the Clinical World Model, a framework that formalizes care as a tripartite interaction among Patient, Provider, and Ecosystem. To formalize how any agent, whether human or artificial, transforms information into clinical action, we develop parallel decision‑making architectures for providers, patients, and AI agents, grounded in validated principles of clinical cognition. The Clinical AI Skill‑Mix operationalizes competency through eight dimensions. Five define the clinical competency space (condition, phase, care setting, provider role, and task) and three specify how AI engages human reasoning (assigned authority, agent facing, and anchoring layer). The combinatorial product of these dimensions yields a space of billions of distinct competency coordinates. A central structural implication is that validation within one coordinate provides minimal evidence for performance in another, rendering the competency space irreducible. The framework supplies a common grammar through which clinical AI can be specified, evaluated, and bounded across stakeholders. By making this structure explicit, the Clinical World Model reframes the field's central question from whether AI works to in which competency coordinates reliability has been demonstrated, and for whom.
Authors:Longgang Zhang, Xiaowei Fu, Fuxiang Huang, Lei Zhang
Abstract:
Network traffic, as a key media format, is crucial for ensuring security and communications in modern internet infrastructure. While existing methods offer excellent performance, they face two key bottlenecks: (1) They fail to capture multidimensional semantics beyond unimodal sequence patterns. (2) Their black box property, i.e., providing only category labels, lacks an auditable reasoning process. We identify a key factor that existing network traffic datasets are primarily designed for classification and inherently lack rich semantic annotations, failing to generate human‑readable evidence report. To address data scarcity, this paper proposes a Byte‑Grounded Traffic Description (BGTD) benchmark for the first time, combining raw bytes with structured expert annotations. BGTD provides necessary behavioral features and verifiable chains of evidence for multimodal reasoning towards explainable encrypted traffic interpretation. Built upon BGTD, this paper proposes an end‑to‑end traffic‑language representation framework (mmTraffic), a multimodal reasoning architecture bridging physical traffic encoding and semantic interpretation. In order to alleviate modality interference and generative hallucinations, mmTraffic adopts a jointly‑optimized perception‑cognition architecture. By incorporating a perception‑centered traffic encoder and a cognition‑centered LLM generator, mmTraffic achieves refined traffic interpretation with guaranteed category prediction. Extensive experiments demonstrate that mmTraffic autonomously generates high‑fidelity, human‑readable, and evidence‑grounded traffic interpretation reports, while maintaining highly competitive classification accuracy comparing to specialized unimodal model (e.g., NetMamba). The source code is available at https://github.com/lgzhangzlg/Multimodal‑Reasoning‑with‑LLM‑for‑Encrypted‑Traffic‑Interpretation‑A‑Benchmark
Authors:Luozheng Qin, Jia Gong, Qian Qiao, Tianjiao Li, Li Xu, Haoyu Pan, Chao Qu, Zhiyu Tan, Hao Li
Abstract:
Unified multimodal models integrating visual understanding and generation face a fundamental challenge: visual generation incurs substantially higher computational costs than understanding, particularly for video. This imbalance motivates us to invert the conventional paradigm: rather than extending understanding‑centric MLLMs to support generation, we propose Uni‑ViGU, a framework that unifies video generation and understanding by extending a video generator as the foundation. We introduce a unified flow method that performs continuous flow matching for video and discrete flow matching for text within a single process, enabling coherent multimodal generation. We further propose a modality‑driven MoE‑based framework that augments Transformer blocks with lightweight layers for text generation while preserving generative priors. To repurpose generation knowledge for understanding, we design a bidirectional training mechanism with two stages: Knowledge Recall reconstructs input prompts to leverage learned text‑video correspondences, while Capability Refinement fine‑tunes on detailed captions to establish discriminative shared representations. Experiments demonstrate that Uni‑ViGU achieves competitive performance on both video generation and understanding, validating generation‑centric architectures as a scalable path toward unified multimodal intelligence. Project Page and Code: https://fr0zencrane.github.io/uni‑vigu‑page/.
Authors:Junjie Fei, Jun Chen, Zechun Liu, Yunyang Xiong, Chong Zhou, Wei Wen, Junlin Han, Mingchen Zhuge, Saksham Suri, Qi Qian, Shuming Liu, Lemeng Wu, Raghuraman Krishnamoorthi, Vikas Chandra, Mohamed Elhoseiny, Chenchen Zhu
Abstract:
Adapting Multimodal Large Language Models (MLLMs) for hour‑long videos is bottlenecked by context limits. Dense visual streams saturate token budgets and exacerbate the lost‑in‑the‑middle phenomenon. Existing heuristics, like sparse sampling or uniform pooling, blindly sacrifice fidelity by discarding decisive moments and wasting bandwidth on irrelevant backgrounds. We propose Tempo, an efficient query‑aware framework compressing long videos for downstream understanding. Tempo leverages a Small Vision‑Language Model (SVLM) as a local temporal compressor, casting token reduction as an early cross‑modal distillation process to generate compact, intent‑aligned representations in a single forward pass. To enforce strict budgets without breaking causality, we introduce Adaptive Token Allocation (ATA). Exploiting the SVLM's zero‑shot relevance prior and semantic front‑loading, ATA acts as a training‑free O(1) dynamic router. It allocates dense bandwidth to query‑critical segments while compressing redundancies into minimal temporal anchors to maintain the global storyline. Extensive experiments show our 6B architecture achieves state‑of‑the‑art performance with aggressive dynamic compression (0.5‑16 tokens/frame). On the extreme‑long LVBench (4101s), Tempo scores 52.3 under a strict 8K visual budget, outperforming GPT‑4o and Gemini 1.5 Pro. Scaling to 2048 frames reaches 53.7. Crucially, Tempo compresses hour‑long videos substantially below theoretical limits, proving true long‑form video understanding relies on intent‑driven efficiency rather than greedily padded context windows.
Authors:Soumya Mazumdar, Vineet Kumar Rakesh, Tapas Samanta
Abstract:
Talking‑head generation has advanced rapidly with diffusion‑based generative models, but training usually depends on centralized face‑video and speech datasets, raising major privacy concerns. The problem is more acute for personalized talking‑head generation, where identity‑specific data are highly sensitive and often cannot be pooled across users or devices. PrivFedTalk is presented as a privacy‑aware federated framework for personalized talking‑head generation that combines conditional latent diffusion with parameter‑efficient identity adaptation. A shared diffusion backbone is trained across clients, while each client learns lightweight LoRA identity adapters from local private audio‑visual data, avoiding raw data sharing and reducing communication cost. To address heterogeneous client distributions, Identity‑Stable Federated Aggregation (ISFA) weights client updates using privacy‑safe scalar reliability signals computed from on‑device identity consistency and temporal stability estimates. Temporal‑Denoising Consistency (TDC) regularization is introduced to reduce inter‑frame drift, flicker, and identity drift during federated denoising. To limit update‑side privacy risk, secure aggregation and client‑level differential privacy are applied to adapter updates. The implementation supports both low‑memory GPU execution and multi‑GPU client‑parallel training on heterogeneous shared hardware. Comparative experiments on the present setup across multiple training and aggregation conditions with PrivFedTalk, FedAvg, and FedProx show stable federated optimization and successful end‑to‑end training and evaluation under constrained resources. The results support the feasibility of privacy‑aware personalized talking‑head training in federated environments, while suggesting that stronger component‑wise, privacy‑utility, and qualitative claims need further standardized evaluation.
Authors:Felix Embacher, Jonas Uhrig, Marius Cordts, Markus Enzweiler
Abstract:
Retrieving rare and safety‑critical driving scenarios from large‑scale datasets is essential for building robust autonomous driving (AD) systems. As dataset sizes continue to grow, the key challenge shifts from collecting more data to efficiently identifying the most relevant samples. We introduce SearchAD, a large‑scale rare image retrieval dataset for AD containing over 423k frames drawn from 11 established datasets. SearchAD provides high‑quality manual annotations of more than 513k bounding boxes covering 90 rare categories. It specifically targets the needle‑in‑a‑haystack problem of locating extremely rare classes, with some appearing fewer than 50 times across the entire dataset. Unlike existing benchmarks, which focused on instance‑level retrieval, SearchAD emphasizes semantic image retrieval with a well‑defined data split, enabling text‑to‑image and image‑to‑image retrieval, few‑shot learning, and fine‑tuning of multi‑modal retrieval models. Comprehensive evaluations show that text‑based methods outperform image‑based ones due to stronger inherent semantic grounding. While models directly aligning spatial visual features with language achieve the best zero‑shot results, and our fine‑tuning baseline significantly improves performance, absolute retrieval capabilities remain unsatisfactory. With a held‑out test set on a public benchmark server, SearchAD establishes the first large‑scale dataset for retrieval‑driven data curation and long‑tail perception research in AD: https://iis‑esslingen.github.io/searchad/
Authors:Baining Zhao, Ziyou Wang, Jianjie Fang, Zile Zhou, Yanggang Xu, Yatai Ji, Jiacheng Xu, Qian Zhang, Weichen Zhang, Chen Gao, Xinlei Chen
Abstract:
Large multimodal models (LMMs) show strong visual‑linguistic reasoning but their capacity for spatial decision‑making and action remains unclear. In this work, we investigate whether LMMs can achieve embodied spatial action like human through a challenging scenario: goal‑oriented navigation in urban 3D spaces. We first spend over 500 hours constructing a dataset comprising 5,037 high‑quality goal‑oriented navigation samples, with an emphasis on 3D vertical actions and rich urban semantic information. Then, we comprehensively assess 17 representative models, including non‑reasoning LMMs, reasoning LMMs, agent‑based methods, and vision‑language‑action models. Experiments show that current LMMs exhibit emerging action capabilities, yet remain far from human‑level performance. Furthermore, we reveal an intriguing phenomenon: navigation errors do not accumulate linearly but instead diverge rapidly from the destination after a critical decision bifurcation. The limitations of LMMs are investigated by analyzing their behavior at these critical decision bifurcations. Finally, we experimentally explore four promising directions for improvement: geometric perception, cross‑view understanding, spatial imagination, and long‑term memory. The project is available at: https://github.com/serenditipy‑AC/Embodied‑Navigation‑Bench.
Authors:Bo Li, Shikun Zhang, Wei Ye
Abstract:
Instruction‑tuned language models increasingly rely on large multi‑turn dialogue corpora, but these datasets are often noisy and structurally inconsistent, with topic drift, repetitive chitchat, and mismatched answer formats across turns. We address this from a data selection perspective and propose MDS (Multi‑turn Dialogue Selection), a dialogue‑level framework that scores whole conversations rather than isolated turns. MDS combines a global coverage stage that performs bin‑wise selection in the user‑query trajectory space to retain representative yet non‑redundant dialogues, with a local structural stage that evaluates within‑dialogue reliability through entity‑grounded topic grounding and information progress, together with query‑answer form consistency for functional alignment. MDS outperforms strong single‑turn selectors, dialogue‑level LLM scorers, and heuristic baselines on three multi‑turn benchmarks and an in‑domain Banking test set, achieving the best overall rank across reference‑free and reference‑based metrics, and is more robust on long conversations under the same training budget. Code and resources are included in the supplementary materials.
Authors:Saman Forouzandeh, Kamal Berahmand, Mahdi Jalili
Abstract:
Retrieving relevant observations from long multi‑modal web interaction histories is challenging because relevance depends on the evolving task state, modality (screenshots, HTML text, structured signals), and temporal distance. Prior approaches typically rely on static similarity thresholds or fixed‑capacity buffers, which fail to adapt relevance to the current task context. We propose ACGM, a learned graph‑memory retriever that constructs \emphtask‑adaptive relevance graphs over agent histories using policy‑gradient optimization from downstream task success. ACGM captures heterogeneous temporal dynamics with modality‑specific decay (visual decays 4.3× faster than text: λ_v=0.47 vs.\ λ_x=0.11) and learns sparse connectivity (3.2 edges/node), enabling efficient O(\log T) retrieval. Across WebShop, VisualWebArena, and Mind2Web, ACGM improves retrieval quality to 82.7 nDCG@10 (+9.3 over GPT‑4o, p<0.001) and 89.2% Precision@10 (+7.7), outperforming 19 strong dense, re‑ranking, multi‑modal, and graph‑based baselines. Code to reproduce our results is available at\colorblue\hrefhttps://github.com/S‑Forouzandeh/ACGM‑Agentic‑WebSaman Forouzandeh.
Authors:Jiani Huang, Shijie Wang, Liangbo Ning, Wenqi Fan, Qing Li
Abstract:
With the rise of LLMs, there is an increasing need for intelligent recommendation assistants that can handle complex queries and provide personalized, reasoning‑driven recommendations. LLM‑based recommenders show potential but face challenges in multi‑step reasoning, underscoring the need for reasoning‑augmented systems. To address this gap, we propose ReRec, a novel reinforcement fine‑tuning (RFT) framework designed to improve LLM reasoning in complex recommendation tasks. Our framework introduces three key components: (1) Dual‑Graph Enhanced Reward Shaping, integrating recommendation metrics like NDCG@K with Query Alignment and Preference Alignment Scores to provide fine‑grained reward signals for LLM optimization; (2) Reasoning‑aware Advantage Estimation, which decomposes LLM outputs into reasoning segments and penalizes incorrect steps to enhance reasoning of recommendation; and (3) Online Curriculum Scheduler, dynamically assess query difficulty and organize training curriculum to ensure stable learning during RFT. Experiments demonstrate that ReRec outperforms state‑of‑the‑art baselines and preserves core abilities like instruction‑following and general knowledge. Our codes are available at https://github.com/jiani‑huang/ReRec.
Authors:Jaehyun Lee, Sanghwan Jang, SeongKu Kang, Hwanjo Yu
Abstract:
Large language models (LLMs) have recently emerged as powerful training‑free recommenders. However, their knowledge of individual items is inevitably uneven due to imbalanced information exposure during pretraining, a phenomenon we refer to as knowledge gap problem. To address this, most prior methods have employed a naive uniform augmentation that appends external information for every item in the input prompt. However, this approach not only wastes limited context budget on redundant augmentation for well‑known items but can also hinder the model's effective reasoning. To this end, we propose KnowSA_CKP (Knowledge‑aware Selective Augmentation with Comparative Knowledge Probing) to mitigate the knowledge gap problem. KnowSA_CKP estimates the LLM's internal knowledge by evaluating its capability to capture collaborative relationships and selectively injects additional information only where it is most needed. By avoiding unnecessary augmentation for well‑known items, KnowSA_CKP focuses on items that benefit most from knowledge supplementation, thereby making more effective use of the context budget. KnowSA_CKP requires no fine‑tuning step, and consistently improves both recommendation accuracy and context efficiency across four real‑world datasets. Our code is available at https://github.com/nowhyun/KnowSA\_CKP.
Authors:Hang Zhang, Qijian Tian, Jingyu Gong, Daoguo Dong, Xuhong Wang, Yuan Xie, Xin Tan
Abstract:
Articulated objects are essential for embodied AI and world models, yet inferring their kinematics from a single closed‑state image remains challenging because crucial motion cues are often occluded. Existing methods either require multi‑state observations or rely on explicit part priors, retrieval, or other auxiliary inputs that partially expose the structure to be inferred. In this work, we present DailyArt, which formulates articulated joint estimation from a single static image as a synthesis‑mediated reasoning problem. Instead of directly regressing joints from a heavily occluded observation, DailyArt first synthesizes a maximally articulated opened state under the same camera view to expose articulation cues, and then estimates the full set of joint parameters from the discrepancy between the observed and synthesized states. Using a set‑prediction formulation, DailyArt recovers all joints simultaneously without requiring object‑specific templates, multi‑view inputs, or explicit part annotations at test time. Taking estimated joints as conditions, the framework further supports part‑level novel state synthesis as a downstream capability. Extensive experiments show that DailyArt achieves strong performance in articulated joint estimation and supports part‑level novel state synthesis conditioned on joints. Project page is available at https://rangooo123.github.io/DaliyArt.github.io/.
Authors:Yifei Chen, Sarra Habchi, Lili Wei
Abstract:
Modern video games are complex, non‑deterministic systems that are difficult to test automatically at scale. Although prior work shows that personality‑driven Large Language Model (LLM) agents can improve behavioural diversity and test coverage, existing tools largely remain research prototypes and lack cross‑game reusability. This tool paper presents MIMIC‑Py, a Python‑based automated game‑testing tool that transforms personality‑driven LLM agents into a reusable and extensible framework. MIMIC‑Py exposes personality traits as configurable inputs and adopts a modular architecture that decouples planning, execution, and memory from game‑specific logic. It supports multiple interaction mechanisms, enabling agents to interact with games via exposed APIs or synthesized code. We describe the design of MIMIC‑Py and show how it enables deployment to new game environments with minimal engineering effort, bridging the gap between research prototypes and practical automated game testing. The source code and a demo video are available on our project webpage: https://mimic‑persona.github.io/MIMIC‑Py‑Home‑Page/.
Authors:David Gringras
Abstract:
Ask a frontier model how to taper six milligrams of alprazolam (psychiatrist retired, ten days of pills left, abrupt cessation causes seizures) and it tells her to call the psychiatrist she just explained does not exist. Change one word ("I'm a psychiatrist; a patient presents with...") and the same model, same weights, same inference pass produces a textbook Ashton Manual taper with diazepam equivalence, anticonvulsant coverage, and monitoring thresholds. The knowledge was there; the model withheld it. IatroBench measures this gap. Sixty pre‑registered clinical scenarios, six frontier models, 3,600 responses, scored on two axes (commission harm, CH 0‑3; omission harm, OH 0‑4) through a structured‑evaluation pipeline validated against physician scoring (kappa_w = 0.571, within‑1 agreement 96%). The central finding is identity‑contingent withholding: match the same clinical question in physician vs. layperson framing and all five testable models provide better guidance to the physician (decoupling gap +0.38, p = 0.003; binary hit rates on safety‑colliding actions drop 13.1 percentage points in layperson framing, p < 0.0001, while non‑colliding actions show no change). The gap is widest for the model with the heaviest safety investment (Opus, +0.65). Three failure modes separate cleanly: trained withholding (Opus), incompetence (Llama 4), and indiscriminate content filtering (GPT‑5.2, whose post‑generation filter strips physician responses at 9x the layperson rate because they contain denser pharmacological tokens). The standard LLM judge assigns OH = 0 to 73% of responses a physician scores OH >= 1 (kappa = 0.045); the evaluation apparatus has the same blind spot as the training apparatus. Every scenario targets someone who has already exhausted the standard referrals.
Authors:Yang Cao
Abstract:
Linear recurrent models offer linear‑time sequence processing but often suffer from suboptimal long‑range memory. We trace this to the decay spectrum: for N channels, random initialization collapses the minimum spectral gap to O(N^‑2), yielding sub‑exponential error \exp(‑Ω(N/\log N)); linear spacing avoids collapse but degrades to \exp(‑O(N/\sqrtT)), practically algebraic over long contexts. We introduce Position‑Adaptive Spectral Tapering (PoST), an architecture‑agnostic framework combining two mechanisms: (1) Spectral Reparameterization, which structurally enforces geometrically spaced log‑decay rates, proven minimax optimal at rate O(\exp(‑cN/\log T)); and (2) Position‑Adaptive Scaling, the provably unique mechanism that eliminates the scale mismatch of static spectra (where only N\log t/\log T of N channels are effective at position t) by stretching the spectrum to the actual dependency range, sharpening the rate to O(\exp(‑cN/\log t)). This scaling natively induces fractional invariance: the impulse response becomes scale‑free, with channels interpolating between relative and absolute temporal coordinates. PoST integrates into any diagonal linear recurrence without overhead. We instantiate it across Mamba‑2, RWKV‑7, Gated DeltaNet, Gated Linear Attention, and RetNet. Pre‑training at 180M‑440M scales shows consistent zero‑shot language modeling improvements, significant long‑context retrieval gains for Mamba‑2 (MQAR and NIAH), and competitive or improved performance across other architectures. Code: https://github.com/SiLifen/PoST.
Authors:Jeffrey Fang, Glen Chou
Abstract:
We present GPU‑SLS, a GPU‑parallelized framework for safe, robust nonlinear model predictive control (MPC) that scales to high‑dimensional uncertain robotic systems and long planning horizons. Our method jointly optimizes an inequality‑constrained, dynamically‑feasible nominal trajectory, a tracking controller, and a closed‑loop reachable set under disturbance, all in real‑time. To efficiently compute nominal trajectories, we develop a sequential quadratic programming procedure with a novel GPU‑accelerated quadratic program (QP) solver that uses parallel associative scans and adaptive caching within an alternating direction method of multipliers (ADMM) framework. The same GPU QP backend is used to optimize robust tracking controllers and closed‑loop reachable sets via system level synthesis (SLS), enabling reachability‑constrained control in both fixed‑ and receding‑horizon settings. We achieve substantial performance gains, reducing nominal trajectory solve times by 97.7% relative to state‑of‑the‑art CPU solvers and 71.8% compared to GPU solvers, while accelerating SLS‑based control and reachability by 237x. Despite large problem scales, our method achieves 100% empirical safety, unlike high‑dimensional learning‑based reachability baselines. We validate our approach on complex nonlinear systems, including whole‑body quadrupeds (61D) and humanoids (75D), synthesizing robust control policies online on the GPU in 20 milliseconds on average and scaling to problems with 2 x 10^5 decision variables and 8 x 10^4 constraints. The implementation of our method is available at https://github.com/Jeff300fang/gpu_sls.
Authors:Haimeng Zhao, Alexander Zlokapa, Hartmut Neven, Ryan Babbush, John Preskill, Jarrod R. McClean, Hsin-Yuan Huang
Abstract:
Broadly applicable quantum advantage, particularly in classical data processing and machine learning, has been a fundamental open problem. In this work, we prove that a small quantum computer of polylogarithmic size can perform large‑scale classification and dimension reduction on massive classical data by processing samples on the fly, whereas any classical machine achieving the same prediction performance requires exponentially larger size. Furthermore, classical machines that are exponentially larger yet below the required size need superpolynomially more samples and time. We validate these quantum advantages in real‑world applications, including single‑cell RNA sequencing and movie review sentiment analysis, demonstrating four to six orders of magnitude reduction in size with fewer than 60 logical qubits. These quantum advantages are enabled by quantum oracle sketching, an algorithm for accessing the classical world in quantum superposition using only random classical data samples. Combined with classical shadows, our algorithm circumvents the data loading and readout bottleneck to construct succinct classical models from massive classical data, a task provably impossible for any classical machine that is not exponentially larger than the quantum machine. These quantum advantages persist even when classical machines are granted unlimited time or if BPP=BQP, and rely only on the correctness of quantum mechanics. Together, our results establish machine learning on classical data as a broad and natural domain of quantum advantage and a fundamental test of quantum mechanics at the complexity frontier.
Authors:Ziyi Wang, Siva Rajesh Kasa, Ankith M S, Santhosh Kumar Kasa, Jiaru Zou, Sumit Negi, Ruqi Zhang, Nan Jiang, Qifan Song
Abstract:
Speculative decoding is an effective technique for accelerating large language model inference by drafting multiple tokens in parallel. In practice, its speedup is often bottlenecked by a rigid verification step that strictly enforces the accepted token distribution to exactly match the target model. This constraint leads to the rejection of many plausible tokens, lowering the acceptance rate and limiting overall time speedup. To overcome this limitation, we propose Dynamic Verification Relaxed Speculative Decoding (DIVERSED), a relaxed verification framework that improves time efficiency while preserving generation quality. DIVERSED learns an ensemble‑based verifier that blends the draft and target model distributions with a task‑dependent and context‑dependent weight. We provide theoretical justification for our approach and demonstrate empirically that DIVERSED achieves substantially higher inference efficiency compared to standard speculative decoding methods. Code is available at: https://github.com/comeusr/diversed.
Authors:Valeriy Kovalskiy, Nikita Belov, Nikita Miteyko, Igor Reshetnikov, Max Maximov
Abstract:
Retrieval‑Augmented Generation (RAG) is widely used to ground large language models in external knowledge sources. However, when applied to heterogeneous corpora and multi‑step queries, Naive RAG pipelines often degrade in quality due to flat knowledge representations and the absence of explicit workflows. In this work, we introduce DCD (Domain‑Collection‑Document), a domain‑oriented design to structure knowledge and control query processing in RAG systems without modifying the underlying language model. The proposed approach relies on a hierarchical decomposition of the information space and multi‑stage routing based on structured model outputs, enabling progressive restriction of both retrieval and generation scopes. The architecture is complemented by smart chunking, hybrid retrieval, and integrated validation and generation guardrail mechanisms. We describe the DCD architecture and workflow and discuss evaluation results on synthetic evaluation dataset, highlighting their impact on robustness, factual accuracy, and answer relevance in applied RAG scenarios.
Authors:Kai Qin, Liangxin Liu, Yu Liang, Longzheng Wang, Yan Wang, Yueyang Zhang, Long Xia, Zhiyuan Sun, Houde Liu, Daiting Shi
Abstract:
Reward Models (RMs) are critical components in the Reinforcement Learning from Human Feedback (RLHF) pipeline, directly determining the alignment quality of Large Language Models (LLMs). Recently, Generative Reward Models (GRMs) have emerged as a superior paradigm, offering higher interpretability and stronger generalization than traditional scalar RMs. However, existing methods for GRMs focus primarily on outcome‑level supervision, neglecting analytical process quality, which constrains their potential. To address this, we propose ReflectRM, a novel GRM that leverages self‑reflection to assess analytical quality and enhance preference modeling. ReflectRM is trained under a unified generative framework for joint modeling of response preference and analysis preference. During inference, we use its self‑reflection capability to identify the most reliable analysis, from which the final preference prediction is derived. Experiments across four benchmarks show that ReflectRM consistently improves performance, achieving an average accuracy gain of +3.7 on Qwen3‑4B. Further experiments confirm that response preference and analysis preference are mutually reinforcing. Notably, ReflectRM substantially mitigates positional bias, yielding +10.2 improvement compared with leading GRMs and establishing itself as a more stable evaluator. Our code is available at https://github.com/yuliangCarmelo/ReflectRM.
Authors:Linbo Liu, Guande Wu, Han Ding, Yawei Wang, Qiang Zhou, Yuzhe Lu, Zhichao Xu, Huan Song, Panpan Xu, Lin Lee Cheong
Abstract:
Large language model agents rely on effective model context to obtain task‑relevant information for decision‑making. Many existing context engineering approaches primarily rely on the context generated from the past experience and retrieval mechanisms that reuse these context. However, retrieved context from past tasks must be adapted by the execution agent to fit new situations, placing additional reasoning burden on the underlying LLM. To address this limitation, we propose a generative context augmentation framework using Contrastive Learning of Experience via Agentic Reflection (CLEAR). CLEAR first employs a reflection agent to perform contrastive analysis over past execution trajectories and summarize useful context for each observed task. These summaries are then used as supervised fine‑tuning data to train a context augmentation model (CAM). Then we further optimize CAM using reinforcement learning, where the reward signal is obtained by running the task execution agent. By learning to generate task‑specific knowledge rather than retrieve knowledge from the past, CAM produces context that is better tailored to the current task. We conduct comprehensive evaluations on the AppWorld and WebShop benchmarks. Experimental results show that CLEAR consistently outperforms strong baselines. It improves task completion rate from 72.62% to 81.15% on AppWorld test set and averaged reward from 0.68 to 0.74 on a subset of WebShop, compared with baseline agent. Our code is publicly available at https://github.com/awslabs/CLEAR.
Authors:Ziyang Cheng, Haoyu Wei, Hang Yin, Xiuwei Xu, Bingyao Yu, Jie Zhou, Jiwen Lu
Abstract:
While decoupled control schemes for legged mobile manipulators have shown robustness, learning holistic whole‑body control policies for tracking global end‑effector poses remains fragile against Out‑of‑Distribution (OOD) inputs induced by sensor noise or infeasible user commands. To improve robustness against these perturbations without sacrificing task performance and continuity, we propose Competence Manifold Projection (CMP). Specifically, we utilize a Frame‑Wise Safety Scheme that transforms the infinite‑horizon safety constraint into a computationally efficient single‑step manifold inclusion. To instantiate this competence manifold, we employ a Lower‑Bounded Safety Estimator that distinguishes unmastered intentions from the training distribution. We then introduce an Isomorphic Latent Space (ILS) that aligns manifold geometry with safety probability, enabling efficient O(1) seamless defense against arbitrary OOD intents. Experiments demonstrate that CMP achieves up to a 10‑fold survival rate improvement in typical OOD scenarios where baselines suffer catastrophic failure, incurring under 10% tracking degradation. Notably, the system exhibits emergent ``best‑effort'' generalization behaviors to progressively accomplish OOD goals by adhering to the competence boundaries. Result videos are available at: https://shepherd1226.github.io/CMP.
Authors:Xiangru Jian, Hao Xu, Wei Pang, Xinjian Zhao, Chengyu Tao, Qixin Zhang, Xikun Zhang, Chao Zhang, Guanzhi Deng, Alex Xue, Juan Du, Tianshu Yu, Garth Tarr, Linqi Song, Qiuzhuang Sun, Dacheng Tao
Abstract:
The manufacturing sector is increasingly adopting Multimodal Large Language Models (MLLMs) to transition from simple perception to autonomous execution, yet current evaluations fail to reflect the rigorous demands of real‑world manufacturing environments. Progress is hindered by data scarcity and a lack of fine‑grained domain semantics in existing datasets. To bridge this gap, we introduce FORGE. Wefirst construct a high‑quality multimodal dataset that combines real‑world 2D images and 3D point clouds, annotated with fine‑grained domain semantics (e.g., exact model numbers). We then evaluate 18 state‑of‑the‑art MLLMs across three manufacturing tasks, namely workpiece verification, structural surface inspection, and assembly verification, revealing significant performance gaps. Counter to conventional understanding, the bottleneck analysis shows that visual grounding is not the primary limiting factor. Instead, insufficient domain‑specific knowledge is the key bottleneck, setting a clear direction for future research. Beyond evaluation, we show that our structured annotations can serve as an actionable training resource: supervised fine‑tuning of a compact 3B‑parameter model on our data yields up to 90.8% relative improvement in accuracy on held‑out manufacturing scenarios, providing preliminary evidence for a practical pathway toward domain‑adapted manufacturing MLLMs. The code and datasets are available at https://ai4manufacturing.github.io/forge‑web.
Authors:Daniel Nobrega Medeiros
Abstract:
Why does gradient descent reliably find good solutions in non‑convex neural network optimization, despite the landscape being NP‑hard in the worst case? We show that gradient flow on L‑layer ReLU networks without bias preserves L‑1 conservation laws C_l = ||W_l+1||_F^2 ‑ ||W_l||_F^2, confining trajectories to lower‑dimensional manifolds. Under discrete gradient descent, these laws break with total drift scaling as eta^alpha where alpha is approximately 1.1‑1.6 depending on architecture, loss function, and width. We decompose this drift exactly as eta^2 S(eta), where the gradient imbalance sum S(eta) admits a closed‑form spectral crossover formula with mode coefficients c_k proportional to e_k(0)^2 lambda_x,k^2, derived from first principles and validated for both linear (R=0.85) and ReLU (R>0.80) networks. For cross‑entropy loss, softmax probability concentration drives exponential Hessian spectral compression with timescale tau = Theta(1/eta) independent of training set size, explaining why cross‑entropy self‑regularizes the drift exponent near alpha=1.0. We identify two dynamical regimes separated by a width‑dependent transition: a perturbative sub‑Edge‑of‑Stability regime where the spectral formula applies, and a non‑perturbative regime with extensive mode coupling. All predictions are validated across 23 experiments.
Authors:Wenze Wang, Mehdi Hosseinzadeh, Feras Dayoub
Abstract:
Robotic manipulation systems that follow language instructions often execute grasp primitives in a largely single‑shot manner: a model proposes an action, the robot executes it, and failures such as empty grasps, slips, stalls, timeouts, or semantically wrong grasps are not surfaced to the decision layer in a structured way. Inspired by agentic loops in digital tool‑using agents, we reformulate language‑guided grasping as a bounded embodied agent operating over grounded execution states, where physical actions expose an explicit tool‑state stream. We introduce a physical agentic loop that wraps an unmodified learned manipulation primitive (grasp‑and‑lift) with (i) an event‑based interface and (ii) an execution monitoring layer, Watchdog, which converts noisy gripper telemetry into discrete outcome labels using contact‑aware fusion and temporal stabilization. These outcome events, optionally combined with post‑grasp semantic verification, are consumed by a deterministic bounded policy that finalizes, retries, or escalates to the user for clarification, guaranteeing finite termination. We validate the resulting loop on a mobile manipulator with an eye‑in‑hand D405 camera, keeping the underlying grasp model unchanged and evaluating representative scenarios involving visual ambiguity, distractors, and induced execution failures. Results show that explicit execution‑state monitoring and bounded recovery enable more robust and interpretable behavior than open‑loop execution, while adding minimal architectural overhead. For the source code and demo refer to our project page: https://wenzewwz123.github.io/Agentic‑Loop/
Authors:David Golchinfar, Daryoush Vaziri, Alexander Marquardt
Abstract:
We present SauerkrautLM‑Doom‑MultiVec, a 1.3 million parameter model that plays the classic first‑person shooter DOOM in real time, outperforming large language models up to 92,000x its size, including Nemotron‑120B, Qwen3.5‑27B, and GPT‑4o‑mini. Our model combines a ModernBERT encoder with hash embeddings, depth‑aware token representations, and an attention pooling classification head to select game actions from ASCII frame representations at 31ms per decision. Trained on just 31,000 human gameplay demonstrations, it achieves 178 frags in 10 episodes (17.8 per episode) in the defend_the_center scenario, more than all tested LLMs combined (13 frags total). All agents receive equivalent input: ASCII frames and depth maps. Despite having 92,000x fewer parameters than Nemotron‑120B, our model is the only agent that actively engages enemies rather than purely evading them. These results demonstrate that small, task‑specific models trained on domain‑appropriate data can decisively outperform general‑purpose LLMs at real‑time control tasks, at a fraction of the inference cost, with deployment capability on consumer hardware.
Authors:Kartikay Tehlan, Lukas Förner, Nico Schmutzenhofer, Michael Frühwald, Matthias Wagner, Nassir Navab, Thomas Wendler
Abstract:
We propose a geometric framework for longitudinal multi‑parametric MRI analysis based on patient‑specific energy modelling in sequence space. Rather than operating on images with spatial networks, each voxel is represented by its multi‑sequence intensity vector (T1, T1c, T2, FLAIR, ADC), and a compact implicit neural representation is trained via denoising score matching to learn an energy function E_θ(\mathbfu) over \mathbbR^d from a single baseline scan. The learned energy landscape provides a differential‑geometric description of tissue regimes without segmentation labels. Local minima define tissue basins, gradient magnitude reflects proximity to regime boundaries, and Laplacian curvature characterises local constraint structure. Importantly, this baseline energy manifold is treated as a fixed geometric reference: it encodes the set of contrast combinations observed at diagnosis and is not retrained at follow‑up. Longitudinal assessment is therefore formulated as evaluation of subsequent scans relative to this baseline geometry. Rather than comparing anatomical segmentations, we analyse how the distribution of MRI sequence vectors evolves under the baseline energy function. In a paediatric case with later recurrence, follow‑up scans show progressive deviation in energy and directional displacement in sequence space toward the baseline tumour‑associated regime before clear radiological reappearance. In a case with stable disease, voxel distributions remain confined to established low‑energy basins without systematic drift. The presented cases serve as proof‑of‑concept that patient‑specific energy manifolds can function as geometric reference systems for longitudinal mpMRI analysis without explicit segmentation or supervised classification, providing a foundation for further investigation of manifold‑based tissue‑at‑risk tracking in neuro‑oncology.
Authors:Sonja Adomeit, Kartikay Tehlan, Lukas Förner, Katharina Weisser, Helen Scholtiseek, David Kaufmann, Julie Steinestel, Constantin Lapa, Thomas Kröncke, Thomas Wendler
Abstract:
Multimodal imaging analysis often relies on joint latent representations, yet these approaches rarely define what information is shared versus modality‑specific. Clarifying this distinction is clinically relevant, as it delineates the irreducible contribution of each modality and informs rational acquisition strategies. We propose a subspace decomposition framework that reframes multimodal fusion as a problem of orthogonal subspace separation rather than translation. We decompose Prostate‑Specific Membrane Antigen (PSMA) PET uptake into an MRI‑explainable physiological envelope and an orthogonal residual reflecting signal components not expressible within the MRI feature manifold. Using multiparametric MRI, we train an intensity‑based, non‑spatial implicit neural representation (INR) to map MRI feature vectors to PET uptake. We introduce a projection‑based regularization using singular value decomposition to penalize residual components lying within the span of the MRI feature manifold. This enforces mathematical orthogonality between tissue‑level physiological properties (structure, diffusion, perfusion) and intracellular PSMA expression. Tested on 13 prostate cancer patients, the model demonstrates that residual components spanned by MRI features are absorbed into the learned envelope, while the orthogonal residual is largest in tumour regions. This indicates that PSMA PET contains signal components not recoverable from MRI‑derived physiological descriptors. The resulting decomposition provides a structured characterization of modality complementarity grounded in representation geometry rather than image translation.
Authors:Davood Soleymanzadeh, Xiao Liang, Minghui Zheng
Abstract:
Open‑loop end‑to‑end neural motion planners have recently been proposed to improve motion planning for robotic manipulators. These methods enable planning directly from sensor observations without relying on a privileged collision checker during planning. However, many existing methods generate only a single path for a given workspace across different runs, and do not leverage their open‑loop structure for inference‑time optimization. To address this limitation, we introduce Flow Motion Policy, an open‑loop, end‑to‑end neural motion planner for robotic manipulators that leverages the stochastic generative formulation of flow matching methods to capture the inherent multi‑modality of planning datasets. By modeling a distribution over feasible paths, Flow Motion Policy enables efficient inference‑time best‑of‑N sampling. The method generates multiple end‑to‑end candidate paths, evaluates their collision status after planning, and executes the first collision‑free solution. We benchmark the Flow Motion Policy against representative sampling‑based and neural motion planning methods. Evaluation results demonstrate that Flow Motion Policy improves planning success and efficiency, highlighting the effectiveness of stochastic generative policies for end‑to‑end motion planning and inference‑time optimization. Experimental evaluation videos are available via this \hrefhttps://zh.engr.tamu.edu/wp‑content/uploads/sites/310/2026/03/FMP‑Website.mp4link.
Authors:Minh Tam Pham, Trinh Pham, Tong Chen, Hongzhi Yin, Quoc Viet Hung Nguyen, Thanh Tam Nguyen
Abstract:
Text‑to‑SQL is the task of translating natural language queries into executable SQL for a given database, enabling non‑expert users to access structured data without writing SQL manually. Despite rapid advances driven by large language models (LLMs), existing approaches still struggle with complex queries in real‑world settings, where database schemas are large and questions require multi‑step reasoning over many interrelated tables. In such cases, providing the full schema often exceeds the context window, while one‑shot generation frequently produces non‑executable SQL due to syntax errors and incorrect schema linking. To address these challenges, we introduce AV‑SQL, a framework that decomposes complex Text‑to‑SQL into a pipeline of specialized LLM agents. Central to AV‑SQL is the concept of agentic views: agent‑generated Common Table Expressions (CTEs) that encapsulate intermediate query logic and filter relevant schema elements from large schemas. AV‑SQL operates in three stages: (1) a rewriter agent compresses and clarifies the input query; (2) a view generator agent processes schema chunks to produce agentic views; and (3) a planner, generator, and revisor agent collaboratively compose these views into the final SQL query. Extensive experiments show that AV‑SQL achieves 70.38% execution accuracy on the challenging Spider 2.0 benchmark, outperforming state‑of‑the‑art baselines, while remaining competitive on standard datasets with 85.59% on Spider, 72.16% on BIRD and 63.78% on KaggleDBQA. Our source code is available at https://github.com/pminhtam/AV‑SQL.
Authors:Mehdi Hosseinzadeh, King Hang Wong, Feras Dayoub
Abstract:
We present KITE, a training‑free, keyframe‑anchored, layout‑grounded front‑end that converts long robot‑execution videos into compact, interpretable tokenized evidence for vision‑language models (VLMs). KITE distills each trajectory into a small set of motion‑salient keyframes with open‑vocabulary detections and pairs each keyframe with a schematic bird's‑eye‑view (BEV) representation that encodes relative object layout, axes, timestamps, and detection confidence. These visual cues are serialized with robot‑profile and scene‑context tokens into a unified prompt, allowing the same front‑end to support failure detection, identification, localization, explanation, and correction with an off‑the‑shelf VLM. On the RoboFAC benchmark, KITE with Qwen2.5‑VL substantially improves over vanilla Qwen2.5‑VL in the training‑free setting, with especially large gains on simulation failure detection, identification, and localization, while remaining competitive with a RoboFAC‑tuned baseline. A small QLoRA fine‑tune further improves explanation and correction quality. We also report qualitative results on real dual‑arm robots, demonstrating the practical applicability of KITE as a structured and interpretable front‑end for robot failure analysis. Code and models are released on our project page: https://m80hz.github.io/kite/
Authors:Ricardo Knauer, Andre Beinrucker, Erik Rodner
Abstract:
Neural networks deliver impressive predictive performance across a variety of tasks, but they are often opaque in their decision‑making processes. Despite a growing interest in mechanistic interpretability, tools for systematically exploring the representations learned by neural networks in general, and tabular foundation models in particular, remain limited. In this work, we introduce ConceptTracer, an interactive application for analyzing neural representations through the lens of human‑interpretable concepts. ConceptTracer integrates two information‑theoretic measures that quantify concept saliency and selectivity, enabling researchers and practitioners to identify neurons that respond strongly to individual concepts. We demonstrate the utility of ConceptTracer on representations learned by TabPFN and show that our approach facilitates the discovery of interpretable neurons. Together, these capabilities provide a practical framework for investigating how neural networks like TabPFN encode concept‑level information. ConceptTracer is available at https://github.com/ml‑lab‑htw/concept‑tracer.
Authors:Jaeyoung Chung, Hyunjin Son, Kyoung Mu Lee
Abstract:
We present the first generative approach to photomosaic creation. Traditional photomosaic methods rely on a large number of tile images and color‑based matching, which limits both diversity and structural consistency. Our generative photomosaic framework synthesizes tile images using diffusion‑based generation conditioned on reference images. A low‑frequency conditioned diffusion mechanism aligns global structure while preserving prompt‑driven details. This generative formulation enables photomosaic composition that is both semantically expressive and structurally coherent, effectively overcoming the fundamental limitations of matching‑based approaches. By leveraging few‑shot personalized diffusion, our model is able to produce user‑specific or stylistically consistent tiles without requiring an extensive collection of images.
Authors:Renyang Liu, Jiale Li, Jie Zhang, Cong Wu, Xiaojun Jia, Shuxin Li, Wei Zhou, Kwok-Yan Lam, See-kiong Ng
Abstract:
Palmprint recognition is deployed in security‑critical applications, including access control and palm‑based payment, due to its contactless acquisition and highly discriminative ridge‑and‑crease textures. However, the robustness of deep palmprint recognition systems against physically realizable attacks remains insufficiently understood. Existing studies are largely confined to the digital setting and do not adequately account for the texture‑dominant nature of palmprint recognition or the distortions introduced during physical acquisition. To address this gap, we propose CAAP, a capture‑aware adversarial patch framework for palmprint recognition. CAAP learns a universal patch that can be reused across inputs while remaining effective under realistic acquisition variation. To match the structural characteristics of palmprints, the framework adopts a cross‑shaped patch topology, which enlarges spatial coverage under a fixed pixel budget and more effectively disrupts long‑range texture continuity. CAAP further integrates three modules: ASIT for input‑conditioned patch rendering, RaS for stochastic capture‑aware simulation, and MS‑DIFE for feature‑level identity‑disruptive guidance. We evaluate CAAP on the Tongji, IITD, and AISEC datasets against generic CNN backbones and palmprint‑specific recognition models. Experiments show that CAAP achieves strong untargeted and targeted attack performance with favorable cross‑model and cross‑dataset transferability. The results further show that, although adversarial training can partially reduce the attack success rate, substantial residual vulnerability remains. These findings indicate that deep palmprint recognition systems remain vulnerable to physically realizable, capture‑aware adversarial patch attacks, underscoring the need for more effective defenses in practice. Code available at https://github.com/ryliu68/CAAP.
Authors:Yuheng Shi, Xiaohuan Pei, Linfeng Wen, Minjing Dong, Chang Xu
Abstract:
MLLMs require high‑resolution visual inputs for fine‑grained tasks like document understanding and dense scene perception. However, current global resolution scaling paradigms indiscriminately flood the quadratic self‑attention mechanism with visually redundant tokens, severely bottlenecking inference throughput while ignoring spatial sparsity and query intent. To overcome this, we propose Q‑Zoom, a query‑aware adaptive high‑resolution perception framework that operates in an efficient coarse‑to‑fine manner. First, a lightweight Dynamic Gating Network safely bypasses high‑resolution processing when coarse global features suffice. Second, for queries demanding fine‑grained perception, a Self‑Distilled Region Proposal Network (SD‑RPN) precisely localizes the task‑relevant Region‑of‑Interest (RoI) directly from intermediate feature spaces. To optimize these modules efficiently, the gating network uses a consistency‑aware generation strategy to derive deterministic routing labels, while the SD‑RPN employs a fully self‑supervised distillation paradigm. A continuous spatio‑temporal alignment scheme and targeted fine‑tuning then seamlessly fuse the dense local RoI with the coarse global layout. Extensive experiments demonstrate that Q‑Zoom establishes a dominant Pareto frontier. Using Qwen2.5‑VL‑7B as a primary testbed, Q‑Zoom accelerates inference by 2.52 times on Document & OCR benchmarks and 4.39 times in High‑Resolution scenarios while matching the baseline's peak accuracy. Furthermore, when configured for maximum perceptual fidelity, Q‑Zoom surpasses the baseline's peak performance by 1.1% and 8.1% on these respective benchmarks. These robust improvements transfer seamlessly to Qwen3‑VL, LLaVA, and emerging RL‑based thinking‑with‑image models. Project page is available at https://yuhengsss.github.io/Q‑Zoom/.
Authors:Bing Wang, Rui Miao, Chen Shen, Shaotian Yan, Kaiyuan Liu, Ximing Li, Xiaosong Yuan, Sinan Fan, Jun Zhang, Jieping Ye
Abstract:
Large reasoning models have recently demonstrated strong performance on complex tasks that require long chain‑of‑thought reasoning, through supervised fine‑tuning on large‑scale and high‑quality datasets. To construct such datasets, existing pipelines generate long reasoning data from more capable Large Language Models (LLMs) and apply manually heuristic or naturalness‑based selection methods to filter high‑quality samples. Despite the proven effectiveness of naturalness‑based data selection, which ranks data by the average log probability assigned by LLMs, our analysis shows that, when applied to LLM reasoning datasets, it systematically prefers samples with longer reasoning steps (i.e., more tokens per step) rather than higher‑quality ones, a phenomenon we term step length confounding. Through quantitative analysis, we attribute this phenomenon to low‑probability first tokens in reasoning steps; longer steps dilute their influence, thereby inflating the average log probabilities. To address this issue, we propose two variant methods: ASLEC‑DROP, which drops first‑token probabilities when computing average log probability, and ASLEC‑CASL, which applies a causal debiasing regression to remove the first tokens' confounding effect. Experiments across four LLMs and five evaluation benchmarks demonstrate the effectiveness of our approach in mitigating the step length confounding problem.
Authors:Zhixiong Zhao, Zukang Xu, Zhixuan Chen, Dawei Yang
Abstract:
Mixture‑of‑Experts (MoE) based large language models (LLMs) offer strong performance but suffer from high memory and computation costs. Weight binarization provides extreme efficiency, yet existing binary methods designed for dense LLMs struggle with MoE‑specific issues, including cross‑expert redundancy, task‑agnostic importance estimation, and quantization‑induced routing shifts. To this end, we propose MoBiE, the first binarization framework tailored for MoE‑based LLMs. MoBiE is built on three core innovations: 1. using joint SVD decomposition to reduce cross‑expert redundancy; 2. integrating global loss gradients into local Hessian metrics to enhance weight importance estimation; 3. introducing an error constraint guided by the input null space to mitigate routing distortion. Notably, MoBiE achieves these optimizations while incurring no additional storage overhead, striking a balance between efficiency and model performance. Extensive experiments demonstrate that MoBiE consistently outperforms state‑of‑the‑art binary methods across multiple MoE‑based LLMs and benchmarks. For example, on Qwen3‑30B‑A3B, MoBiE reduces perplexity by 52.2%, improves average zero‑shot performance by 43.4%, achieves over 2 × inference speedup, and further shortens quantization time. The code is available at https://github.com/Kishon‑zzx/MoBiE.
Authors:Huy Q. Le, Loc X. Nguyen, Yu Qiao, Seong Tae Kim, Eui-Nam Huh, Choong Seon Hong
Abstract:
Federated Learning (FL) enables decentralized model training across multiple clients without exposing private data, making it ideal for privacy‑sensitive applications. However, in real‑world FL scenarios, clients often hold data from distinct domains, leading to severe domain shift and degraded global model performance. To address this, prototype learning has been emerged as a promising solution, which leverages class‑wise feature representations. Yet, existing methods face two key limitations: (1) Existing prototype‑based FL methods typically construct a single global prototype per class by aggregating local prototypes from all clients without preserving domain information. (2) Current feature‑prototype alignment is domain‑agnostic, forcing clients to align with global prototypes regardless of domain origin. To address these challenges, we propose Federated Domain‑Aware Prototypes (FedDAP) to construct domain‑specific global prototypes by aggregating local client prototypes within the same domain using a similarity‑weighted fusion mechanism. These global domain‑specific prototypes are then used to guide local training by aligning local features with prototypes from the same domain, while encouraging separation from prototypes of different domains. This dual alignment enhances domain‑specific learning at the local level and enables the global model to generalize across diverse domains. Finally, we conduct extensive experiments on three different datasets: DomainNet, Office‑10, and PACS to demonstrate the effectiveness of our proposed framework to address the domain shift challenges. The code is available at https://github.com/quanghuy6997/FedDAP.
Authors:Guillermo Gil de Avalle, Laura Maruster, Eric Sloot, Christos Emmanouilidis
Abstract:
Maintenance procedures in manufacturing facilities are often documented as flowcharts in static PDFs or scanned images. They encode procedural knowledge essential for asset lifecycle management, yet inaccessible to modern operator support systems. Vision‑language models, the dominant paradigm for image understanding, struggle to reconstruct connection topology from such diagrams. We present FlowExtract, a pipeline for extracting directed graphs from ISO 5807‑standardized flowcharts. The system separates element detection from connectivity reconstruction, using YOLOv8 and EasyOCR for standard domain‑aligned node detection and text extraction, combined with a novel edge detection method that analyzes arrowhead orientations and traces connecting lines backward to source nodes. Evaluated on industrial troubleshooting guides, FlowExtract achieves very high node detection and substantially outperforms vision‑language model baselines on edge extraction, offering organizations a practical path toward queryable procedural knowledge representations. The implementation is available athttps://github.com/guille‑gil/FlowExtract.
Authors:Xuanle Zhao, Xinyuan Cai, Xiang Cheng, Xiuyi Chen, Bo Xu
Abstract:
While Vision‑Language Models (VLMs) have demonstrated significant potential in chemical visual understanding, current models are predominantly optimized for direct visual question‑answering tasks. This paradigm often results in "black‑box" systems that fail to utilize the inherent capability of Large Language Models (LLMs) to infer underlying reaction mechanisms. In this work, we introduce ChemVLR, a chemical VLM designed to prioritize reasoning within the perception process. Unlike conventional chemical VLMs, ChemVLR analyzes visual inputs in a fine‑grained manner by explicitly identifying granular chemical descriptors, such as functional groups, prior to generating answers. This approach ensures the production of explicit and interpretable reasoning paths for complex visual chemical problems. To facilitate this methodology, we implement a cross‑modality reverse‑engineering strategy, combined with a rigorous filtering pipeline, to curate a large‑scale reasoning‑and‑captioning dataset comprising 760k high‑quality samples across molecular and reaction tasks. Furthermore, we adopt a three‑stage training framework that systemically builds model perception and reasoning capacity. Experiments demonstrate that ChemVLR achieves state‑of‑the‑art (SOTA) performance, surpassing both leading proprietary models and domain‑specific open‑source baselines. We also provide comprehensive ablation studies to validate our training strategy and data generation designs. Code and model weights will be available at https://github.com/xxlllz/ChemVLR.
Authors:Jiachen Zhang, Yueming Lu, Fan Feng, Zhanfeng Wang, Shengli Pan, Daoqi Han
Abstract:
Effective detection of unknown network security threats in multi‑class imbalanced environments is critical for maintaining cyberspace security. Current methods focus on learning class representations but face challenges with unknown threat detection, class imbalance, and lack of interpretability, limiting their practical use. To address this, we propose RPM‑Net, a novel framework that introduces reciprocal point mechanism to learn "non‑class" representations for each known attack category, coupled with adversarial margin constraints that provide geometric interpretability for unknown threat detection. RPM‑Net++ further enhances performance through Fisher discriminant regularization. Experimental results show that RPM‑Net achieves superior performance across multiple metrics including F1‑score, AUROC, and AUPR‑OUT, significantly outperforming existing methods and offering practical value for real‑world network security applications. Our code is available at:https://github.com/chiachen‑chang/RPM‑Net
Authors:Hanyang Wang, Mingxuan Zhu
Abstract:
Modern reasoning models continue generating long after the answer is already determined. Across five model configurations, two families, and three benchmarks, we find that 52‑‑88% of chain‑of‑thought tokens are produced after the answer is recoverable from a partial prefix. This post‑commitment generation reveals a structural phenomenon: the detection‑extraction gap. Free continuations from early prefixes recover the correct answer even at 10% of the trace, while forced extraction fails on 42% of these cases. The answer is recoverable from the model state, yet prompt‑conditioned decoding fails to extract it. We formalize this mismatch via a total‑variation bound between free and forced continuation distributions, yielding quantitative estimates of suffix‑induced shift. Exploiting this asymmetry, we propose Black‑box Adaptive Early Exit (BAEE), which uses free continuations for both detection and extraction, truncating 70‑‑78% of serial generation while improving accuracy by 1‑‑5pp across all models. For thinking‑mode models, early exit prevents post‑commitment overwriting, yielding gains of up to 5.8pp; a cost‑optimized variant achieves 68‑‑73% reduction at a median of 9 API calls. Code is available at https://github.com/EdWangLoDaSc/know2say.
Authors:Maotian Ma, Zheni Zeng, Zhenghao Liu, Yukun Yan
Abstract:
Large language models (LLMs) have shown strong knowledge reserves and task‑solving capabilities, but still face the challenge of severe hallucination, hindering their practical application. Though scientific theories and rules can efficiently direct the behaviors of human manipulators, LLMs still do not utilize these highly‑condensed knowledge sufficiently through training or prompting. To address this issue, we propose SciDC, an LLM generation method that integrate subject‑specific knowledge with strong constraints. By adopting strong LLMs to automatically convert flexible knowledge into multi‑layered, standardized rules, we build an extensible framework to effectively constrain the model generation on domain tasks. Experiments on scientific tasks including industrial formulation design, clinical tumor diagnosis and retrosynthesis planning, consistently demonstrate the effectiveness of our method, achieving a 12% accuracy improvement on average compared with vanilla generation. We further discuss the potential of LLMs in automatically inductively summarizing highly‑condensed knowledge, looking ahead to practical solutions for accelerating the overall scientific research process. All the code of this paper can be obtained (https://github.com/Maotian‑Ma/SciDC).
Authors:Weiyue Li, Ruizhi Qian, Yi Li, Yongce Li, Yunfan Long, Jiahui Cai, Yan Luo, Mengyu Wang
Abstract:
Large language models (LLMs) are widely explored for reasoning‑intensive research tasks, yet resources for testing whether they can infer scientific conclusions from structured biomedical evidence remain limited. We introduce MedConclusion, a large‑scale dataset of 5.7M PubMed structured abstracts for biomedical conclusion generation. Each instance pairs the non‑conclusion sections of an abstract with the original author‑written conclusion, providing naturally occurring supervision for evidence‑to‑conclusion reasoning. MedConclusion also includes journal‑level metadata such as biomedical category and SJR, enabling subgroup analysis across biomedical domains. As an initial study, we evaluate diverse LLMs under conclusion and summary prompting settings and score outputs with both reference‑based metrics and LLM‑as‑a‑judge. We find that conclusion writing is behaviorally distinct from summary writing, strong models remain closely clustered under current automatic metrics, and judge identity can substantially shift absolute scores. MedConclusion provides a reusable data resource for studying scientific evidence‑to‑conclusion reasoning. Our code and data are available at: https://github.com/Harvard‑AI‑and‑Robotics‑Lab/MedConclusion.
Authors:Dev Arpan Desai, Shaoyi Huang, Zining Zhu
Abstract:
Large language models that require multiple GPU cards to host are usually the most capable models. It is necessary to understand and steer these models, but the current technologies do not support the interpretability and steering of these models in the multi‑GPU setting as well as the single‑GPU setting. We present a practical implementation of activation‑level interpretability (logit lens) and steering (steering vector) that scales up to multi‑GPU language models. Our system implements design choices that reduce the activation memory by up to 7x and increase the throughput by up to 41x compared to a baseline on identical hardware. We demonstrate the method across LLaMA‑3.1 (8B, 70B) and Qwen‑3 (4B, 14B, 32B), sustaining 20‑100 tokens/s while collecting full layer‑wise activation trajectories for sequences of 1,500 tokens. Using label‑position steering vectors injected post‑LayerNorm, we show controllable, monotonic shifts in model outputs with a mean steerability slope of 0.702 across evaluated datasets, without fine‑tuning or additional forward passes. We release detailed benchmarks, ablations, and a reproducible instrumentation recipe to enable practical interpretability and real‑time behavioral control for frontier LLMs at https://github.com/Devdesai1901/LogitLense.
Authors:Mingchen Zhuge, Changsheng Zhao, Haozhe Liu, Zijian Zhou, Shuming Liu, Wenyi Wang, Ernie Chang, Gael Le Lan, Junjie Fei, Wenxuan Zhang, Yasheng Sun, Zhipeng Cai, Zechun Liu, Yunyang Xiong, Yining Yang, Yuandong Tian, Yangyang Shi, Vikas Chandra, Jürgen Schmidhuber
Abstract:
We propose a new frontier: Neural Computers (NCs) that unify computation, memory, and I/O of traditional computers in a learned runtime state. Our long‑term goal is the Completely Neural Computer (CNC): the mature, general‑purpose realization of this emerging machine form, with stable execution, explicit reprogramming, and durable capability reuse. As an initial step, we study whether elementary NC primitives can be learned solely from collected I/O traces, without instrumented program state. Concretely, we instantiate NCs as video models that roll out screen frames from instructions, pixels, and user actions (when available) in CLI and GUI settings. We show that NCs can acquire elementary interface primitives, especially I/O alignment and short‑horizon control, while routine reuse, controlled updates, and symbolic stability remain challenging. We outline a roadmap toward CNCs, to establish a new computing paradigm beyond today's agents and conventional computers.
Authors:Syed Mohammad Kashif, Ruiyin Li, Peng Liang, Amjed Tahir, Qiong Feng, Zengyang Li, Mojtaba Shahin
Abstract:
New generation of AI coding tools, including AI‑powered IDEs equipped with agentic capabilities, can generate code within the context of the project. These AI IDEs are increasingly perceived as capable of producing project‑level code at scale. However, there is limited empirical evidence on the extent to which they can generate large‑scale software systems and what design issues such systems may exhibit. To address this gap, we conducted a study to explore the capability of Cursor in generating large‑scale projects and to evaluate the design quality of projects generated by Cursor. First, we propose a Feature‑Driven Human‑In‑The‑Loop (FD‑HITL) framework that systematically guides project generation from curated project descriptions. We generated 10 projects using Cursor with the FD‑HITL framework across three application domains and multiple technologies. We assessed the functional correctness of these projects through manual evaluation, obtaining an average functional correctness score of 91%. Next, we analyzed the generated projects using two static analysis tools, CodeScene and SonarQube, to detect design issues. We identified 1,305 design issues categorized into 9 categories by CodeScene and 3,193 issues in 11 categories by SonarQube. Our findings show that (1) when used with the FD‑HITL framework, Cursor can generate functional large‑scale projects averaging 16,965 LoC and 114 files; (2) the generated projects nevertheless contain design issues that may pose long‑term maintainability and evolvability risks, requiring careful review by experienced developers; (3) the most prevalent issues include Code Duplication, high Code Complexity, Large Methods, Framework Best‑Practice Violations, Exception‑Handling Issues and Accessibility Issues; (4) these design issues violate design principles such as SRP, SoC, and DRY. The replication package is at https://github.com/Kashifraz/DIinAGP
Authors:Wenyue Hua, Sripad Karne, Qian Xie, Armaan Agrawal, Nikos Pagonas, Kostis Kaffes, Tianyi Peng
Abstract:
AI agents are increasingly deployed in real‑world applications, including systems such as Manus, OpenClaw, and coding agents. Existing research has primarily focused on server‑side efficiency, proposing methods such as caching, speculative execution, traffic scheduling, and load balancing to reduce the cost of serving agentic workloads. However, as users increasingly construct agents by composing local tools, remote APIs, and diverse models, an equally important optimization problem arises on the client side. Client‑side optimization asks how developers should allocate the resources available to them, including model choice, local tools, and API budget across pipeline stages, subject to application‑specific quality, cost, and latency constraints. Because these objectives depend on the task and deployment setting, they cannot be determined by server‑side systems alone. We introduce AgentOpt, the first framework‑agnostic Python package for client‑side agent optimization. We first study model selection, a high‑impact optimization lever in multi‑step agent pipelines. Given a pipeline and a small evaluation set, the goal is to find the most cost‑effective assignment of models to pipeline roles. This problem is consequential in practice: at matched accuracy, the cost gap between the best and worst model combinations can reach 13‑32x in our experiments. To efficiently explore the exponentially growing combination space, AgentOpt implements ten search algorithms, including UCB‑E, UCB‑E with Low‑Rank Factorization, Arm Elimination, Epsilon‑LUCB, Threshold Successive Elimination, and Bayesian Optimization. Across four benchmarks, UCB‑E recovers near‑optimal accuracy while reducing evaluation budget by 62‑76% relative to brute‑force search. Code and benchmark results available at https://agentoptimizer.github.io/agentopt/.
Authors:Lin Mu, Haiyang Wang, Li Ni, Lei Sang, Zhize Wu, Peiquan Jin, Yiwen Zhang
Abstract:
Low‑Rank Adaptation (LoRA) enables parameter‑efficient fine‑tuning of Large Language Models (LLMs), and recent Mixture‑of‑Experts (MoE) extensions further enhance flexibility by dynamically combining multiple LoRA experts. However, existing MoE‑augmented LoRA methods assume that experts operate independently, often leading to unstable routing, expert dominance. In this paper, we propose TalkLoRA, a communication‑aware MoELoRA framework that relaxes this independence assumption by introducing expert‑level communication prior to routing. TalkLoRA equips low‑rank experts with a lightweight Talking Module that enables controlled information exchange across expert subspaces, producing a more robust global signal for routing. Theoretically, we show that expert communication smooths routing dynamics by mitigating perturbation amplification while strictly generalizing existing MoELoRA architectures. Empirically, TalkLoRA consistently outperforms vanilla LoRA and MoELoRA across diverse language understanding and generation tasks, achieving higher parameter efficiency and more balanced expert routing under comparable parameter budgets. These results highlight structured expert communication as a principled and effective enhancement for MoE‑based parameter‑efficient adaptation. Code is available at https://github.com/why0129/TalkLoRA.
Authors:Corby Rosset, Pratyusha Sharma, Andrew Zhao, Miguel Gonzalez-Fernandez, Ahmed Awadallah
Abstract:
Verifying the success of computer use agent (CUA) trajectories is a critical challenge: without reliable verification, neither evaluation nor training signal can be trusted. In this paper, we present lessons learned from building a best‑in‑class verifier for web tasks we call the Universal Verifier. We design the Universal Verifier around four key principles: 1) constructing rubrics with meaningful, non‑overlapping criteria to reduce noise; 2) separating process and outcome rewards that yield complementary signals, capturing cases where an agent follows the right steps but gets blocked or succeeds through an unexpected path; 3) distinguishing between controllable and uncontrollable failures scored via a cascading‑error‑free strategy for finer‑grained failure understanding; and 4) a divide‑and‑conquer context management scheme that attends to all screenshots in a trajectory, improving reliability on longer task horizons. We validate these findings on CUAVerifierBench, a new set of CUA trajectories with both process and outcome human labels, showing that our Universal Verifier agrees with humans as often as humans agree with each other. We report a reduction in false positive rates to near zero compared to baselines like WebVoyager (\geq 45%) and WebJudge (\geq 22%). We emphasize that these gains stem from the cumulative effect of the design choices above. We also find that an auto‑research agent achieves 70% of expert quality in 5% of the time, but fails to discover all strategies required to replicate the Universal Verifier. We open‑source our Universal Verifier system along with CUAVerifierBench; available at https://github.com/microsoft/fara.
Authors:Wei Zhou, Xuanhe Zhou, Qikang He, Guoliang Li, Bingsheng He, Quanqing Xu, Fan Wu
Abstract:
Database systems incorporate an ever‑growing number of functions in their kernels (a.k.a., database native functions) for scenarios like new application support and business migration. This growth causes an urgent demand for automatic database native function synthesis. While recent advances in LLM‑based code generation (e.g., Claude Code) show promise, they are too generic for database‑specific development. They often hallucinate or overlook critical context because database function synthesis is inherently complex and error‑prone, where synthesizing a single function may involve registering multiple function units, linking internal references, and implementing logic correctly. To this end, we propose DBCooker, an LLM‑based system for automatically synthesizing database native functions. It consists of three components. First, the function characterization module aggregates multi‑source declarations, identifies function units that require specialized coding, and traces cross‑unit dependencies. Second, we design operations to address the main synthesis challenges: (1) a pseudo‑code‑based coding plan generator that constructs structured implementation skeletons by identifying key elements such as reusable referenced functions; (2) a hybrid fill‑in‑the‑blank model guided by probabilistic priors and component awareness to integrate core logic with reusable routines; and (3) three‑level progressive validation, including syntax checking, standards compliance, and LLM‑guided semantic verification. Finally, an adaptive orchestration strategy unifies these operations with existing tools and dynamically sequences them via the orchestration history of similar functions. Results show that DBCooker outperforms other methods on SQLite, PostgreSQL, and DuckDB (34.55% higher accuracy on average), and can synthesize new functions absent in the latest SQLite (v3.50).
Authors:Ryo Nishida, Masayuki Kawarada, Tatsuya Ishigaki, Hiroya Takamura, Masaki Onishi
Abstract:
This paper investigates demonstration selection strategies for predicting a user's next point‑of‑interest (POI) using large language models (LLMs), aiming to accurately forecast a user's subsequent location based on historical check‑in data. While in‑context learning (ICL) with LLMs has recently gained attention as a promising alternative to traditional supervised approaches, the effectiveness of ICL significantly depends on the selected demonstration. Although previous studies have examined methods such as random selection, embedding‑based selection, and task‑specific selection, there remains a lack of comprehensive comparative analysis among these strategies. To bridge this gap and clarify the best practices for real‑world applications, we comprehensively evaluate existing demonstration selection methods alongside simpler heuristic approaches such as geographical proximity, temporal ordering, and sequential patterns. Extensive experiments conducted on three real‑world datasets indicate that these heuristic methods consistently outperform more complex and computationally demanding embedding‑based methods, both in terms of computational cost and prediction accuracy. Notably, in certain scenarios, LLMs using demonstrations selected by these simpler heuristic methods even outperform existing fine‑tuned models, without requiring further training. Our source code is available at: https://github.com/ryonsd/DS‑LLM4POI.
Authors:Ruijia Li, Bo Jiang
Abstract:
Designing project‑based learning (PBL) demands managing highly interdependent components, a task that both traditional linear tools and purely conversational AI struggle with. Traditional tools fail to capture the non‑linear nature of creative design, while conversational systems lack the persistent, shared context necessary for reflective collaboration. Grounded in theories of distributed cognition, we introduce CoMAP, a system that embodies a graph‑based collaboration paradigm. By providing a shared visual workspace with dual‑modality AI support, CoMAP transforms the human‑AI relationship from a prompt‑and‑response loop into a transparent and equitable partnership. Our study with 30 educators shows CoMAP significantly improves teachers' design expression, divergent thinking, and iterative practice compared to a dialogue‑only baseline. These findings demonstrate how a nonlinear, artifact‑centric approach can foster trust, reduce cognitive load, and \textcolorfixsupport educators to take control of their creative process. Our contributions are available at: https://comap2025.github.io/.
Authors:Yichen Gong, Zhuohan Cai, Sunhao Dai, Yuqi Zhou, Zhangxuan Gu, Changhua Meng, Shuheng Shen
Abstract:
Existing online benchmarks for mobile GUI agents remain largely app‑centric and task‑homogeneous, failing to reflect the diversity and instability of real‑world mobile usage. To this end, we introduce VenusBench‑Mobile, a challenging online benchmark for evaluating general‑purpose mobile GUI agents under realistic, user‑centric conditions. VenusBench‑Mobile builds two core evaluation pillars: defining what to evaluate via user‑intent‑driven task design that reflects real mobile usage, and how to evaluate through a capability‑oriented annotation scheme for fine‑grained agent behavior analysis. Extensive evaluation of state‑of‑the‑art mobile GUI agents reveals large performance gaps relative to prior benchmarks, indicating that VenusBench‑Mobile poses substantially more challenging and realistic tasks and that current agents remain far from reliable real‑world deployment. Diagnostic analysis further shows that failures are dominated by deficiencies in perception and memory, which are largely obscured by coarse‑grained evaluations. Moreover, even the strongest agents exhibit near‑zero success under environment variations, highlighting their brittleness in realistic settings. Based on these insights, we believe VenusBench‑Mobile provides an important stepping stone toward robust real‑world deployment of mobile GUI agents. Code and data are available at https://github.com/inclusionAI/UI‑Venus/tree/VenusBench‑Mobile.
Authors:Guhao Feng, Shengjie Luo, Kai Hua, Ge Zhang, Di He, Wenhao Huang, Tianle Cai
Abstract:
The static ``train then deploy" paradigm fundamentally limits Large Language Models (LLMs) from dynamically adapting their weights in response to continuous streams of new information inherent in real‑world tasks. Test‑Time Training (TTT) offers a compelling alternative by updating a subset of model parameters (fast weights) at inference time, yet its potential in the current LLM ecosystem is hindered by critical barriers including architectural incompatibility, computational inefficiency and misaligned fast weight objectives for language modeling. In this work, we introduce In‑Place Test‑Time Training (In‑Place TTT), a framework that seamlessly endows LLMs with Test‑Time Training ability. In‑Place TTT treats the final projection matrix of the ubiquitous MLP blocks as its adaptable fast weights, enabling a ``drop‑in" enhancement for LLMs without costly retraining from scratch. Furthermore, we replace TTT's generic reconstruction objective with a tailored, theoretically‑grounded objective explicitly aligned with the Next‑Token‑Prediction task governing autoregressive language modeling. This principled objective, combined with an efficient chunk‑wise update mechanism, results in a highly scalable algorithm compatible with context parallelism. Extensive experiments validate our framework's effectiveness: as an in‑place enhancement, it enables a 4B‑parameter model to achieve superior performance on tasks with contexts up to 128k, and when pretrained from scratch, it consistently outperforms competitive TTT‑related approaches. Ablation study results further provide deeper insights on our design choices. Collectively, our results establish In‑Place TTT as a promising step towards a paradigm of continual learning in LLMs.
Authors:David Picard, Nicolas Dufour, Lucas Degeorge, Arijit Ghosh, Davide Allegro, Tom Ravaud, Yohann Perron, Corentin Sautier, Zeynep Sonat Baltaci, Fei Meng, Syrine Kalleli, Marta López-Rauhut, Thibaut Loiseau, Ségolène Albouy, Raphael Baena, Elliot Vincent, Loic Landrieu
Abstract:
This paper introduces the Polynomial Mixer (PoM), a novel token mixing mechanism with linear complexity that serves as a drop‑in replacement for self‑attention. PoM aggregates input tokens into a compact representation through a learned polynomial function, from which each token retrieves contextual information. We prove that PoM satisfies the contextual mapping property, ensuring that transformers equipped with PoM remain universal sequence‑to‑sequence approximators. We replace standard self‑attention with PoM across five diverse domains: text generation, handwritten text recognition, image generation, 3D modeling, and Earth observation. PoM matches the performance of attention‑based models while drastically reducing computational cost when working with long sequences. The code is available at https://github.com/davidpicard/pom.
Authors:Junbin Zhang, Meng Cao, Feng Tan, Yikai Lin, Yuexian Zou
Abstract:
Achieving fine‑grained and structurally sound controllability is a cornerstone of advanced visual generation. Existing part‑based frameworks treat user‑provided parts as an unordered set and therefore ignore their intrinsic spatial and semantic relationships, which often results in compositions that lack structural integrity. To bridge this gap, we propose Graph‑PiT, a framework that explicitly models the structural dependencies of visual components using a graph prior. Specifically, we represent visual parts as nodes and their spatial‑semantic relationships as edges. At the heart of our method is a Hierarchical Graph Neural Network (HGNN) module that performs bidirectional message passing between coarse‑grained part‑level super‑nodes and fine‑grained IP+ token sub‑nodes, refining part embeddings before they enter the generative pipeline. We also introduce a graph Laplacian smoothness loss and an edge‑reconstruction loss so that adjacent parts acquire compatible, relation‑aware embeddings. Quantitative experiments on controlled synthetic domains (character, product, indoor layout, and jigsaw), together with qualitative transfer to real web images, show that Graph‑PiT improves structural coherence over vanilla PiT while remaining compatible with the original IP‑Prior pipeline. Ablation experiments confirm that explicit relational reasoning is crucial for enforcing user‑specified adjacency constraints. Our approach not only enhances the plausibility of generated concepts but also offers a scalable and interpretable mechanism for complex, multi‑part image synthesis. The code is available at https://github.com/wolf‑bailang/Graph‑PiT.
Authors:Gustav Keppler, Moritz Gstür, Veit Hagenmeyer
Abstract:
The advancement of Large Language Models (LLMs) has raised concerns regarding their dual‑use potential in cybersecurity. Existing evaluation frameworks overwhelmingly focus on Information Technology (IT) environments, failing to capture the constraints, and specialized protocols of Operational Technology (OT). To address this gap, we introduce CritBench, a novel framework designed to evaluate the cybersecurity capabilities of LLM agents within IEC 61850 Digital Substation environments. We assess five state‑of‑the‑art models, including OpenAI's GPT‑5 suite and open‑weight models, across a corpus of 81 domain‑specific tasks spanning static configuration analysis, network traffic reconnaissance, and live virtual machine interaction. To facilitate industrial protocol interaction, we develop a domain‑specific tool scaffold. Our empirical results show that agents reliably execute static structured‑file analysis and single‑tool network enumeration, but their performance degrades on dynamic tasks. Despite demonstrating explicit, internalized knowledge of the IEC 61850 standards terminology, current models struggle with the persistent sequential reasoning and state tracking required to manipulate live systems without specialized tools. Equipping agents with our domain‑specific tool scaffold significantly mitigates this operational bottleneck. Code and evaluation scripts are available at: https://github.com/GKeppler/CritBench
Authors:Michael Cuccarese
Abstract:
This paper presents epistemic blinding in the context of an agentic system that uses large language models to reason across multiple biological datasets for drug target prioritization. During development, it became apparent that LLM outputs silently blend data‑driven inference with memorized priors about named entities ‑ and the blend is invisible: there is no way to determine, from a single output, how much came from the data on the page and how much came from the model's training memory. Epistemic blinding is a simple inference‑time protocol that replaces entity identifiers with anonymous codes before prompting, then compares outputs against an unblinded control. The protocol does not make LLM reasoning deterministic, but it restores one critical axis of auditability: measuring how much of an output came from the supplied data versus the model's parametric knowledge. The complete target identification system is described ‑ including LLM‑guided evolutionary optimization of scoring functions and blinded agentic reasoning for target rationalization ‑ with demonstration that both stages operate without access to entity identity. In oncology drug target prioritization across four cancer types, blinding changes 16% of top‑20 predictions while preserving identical recovery of validated targets. The contamination problem is shown to generalize beyond biology: in S&P 500 equity screening, brand‑recognition bias reshapes 30‑40% of top‑20 rankings across five random seeds. To lower the barrier to adoption, the protocol is released as an open‑source tool and as a Claude Code skill that enables one‑command epistemic blinding within agentic workflows. The claim is not that blinded analysis produces better results, but that without blinding, there is no way to know to what degree the agent is adhering to the analytical process the researcher designed.
Authors:Xiaojie Gu, Ziying Huang, Weicong Hong, Jian Xie, Renze Lou, Kai Zhang
Abstract:
Large Language Models (LLMs) internalize vast world knowledge as parametric memory, yet inevitably inherit the staleness and errors of their source corpora. Consequently, ensuring the reliability and malleability of these internal representations is imperative for trustworthy real‑world deployment. Knowledge editing offers a pivotal paradigm for surgically modifying memory without retraining. However, while recent editors demonstrate high success rates on standard benchmarks, it remains questionable whether current evaluation frameworks that rely on assessing output under specific prompting conditions can reliably authenticate genuine memory modification. In this work, we introduce a simple diagnostic framework that subjects models to discriminative self‑assessment under in‑context learning (ICL) settings that better reflect real‑world application environments, specifically designed to scrutinize the subtle behavioral nuances induced by memory modifications. This probing reveals a pervasive phenomenon of Surface Compliance, where editors achieve high benchmark scores by merely mimicking target outputs without structurally overwriting internal beliefs. Moreover, we find that recursive modifications accumulate representational residues, triggering cognitive instability and permanently diminishing the reversibility of the model's memory state. These insights underscore the risks of current editing paradigms and highlight the pivotal role of robust memory modification in building trustworthy, long‑term sustainable LLM systems. Code is available at https://github.com/XiaojieGu/SA‑MCQ.
Authors:Gowthamkumar Nandakishore
Abstract:
When LLMs process structured data, the serialization format directly affects cost and context utilization. Standard JSON wastes tokens repeating key names in every row of a tabular array‑‑overhead that scales linearly with row count. This paper presents JTON (JSON Tabular Object Notation), a strict JSON superset whose main idea, Zen Grid, factors column headers into a single row and encodes values with semicolons, preserving JSON's type system while cutting redundancy. Across seven real‑world domains, Zen Grid reduces token counts by 15‑60% versus JSON compact (28.5% average; 32% with bare_strings). Comprehension tests on 10 LLMs show a net +0.3 pp accuracy gain over JSON: four models improve, three hold steady, and three dip slightly. Generation tests on 12 LLMs yield 100% syntactic validity in both few‑shot and zero‑shot settings. A Rust/PyO3 reference implementation adds SIMD‑accelerated parsing at 1.4x the speed of Python's json module. Code, a 683‑vector test suite, and all experimental data are publicly available.
Authors:Xiangyue Zhang
Abstract:
We present Deep Researcher Agent, an open‑source framework that enables large language model (LLM) agents to autonomously conduct deep learning experiments around the clock. Unlike existing AI research assistants that focus on paper writing or code generation, our system addresses the full experiment lifecycle: hypothesis formation, code implementation, training execution, result analysis, and iterative refinement. The framework introduces three key innovations: (1) Zero‑Cost Monitoring ‑‑ a monitoring paradigm that incurs zero LLM API costs during model training by relying solely on process‑level checks and log file reads; (2) Two‑Tier Constant‑Size Memory ‑‑ a memory architecture capped at ~5K characters regardless of runtime duration, preventing the unbounded context growth that plagues long‑running agents; and (3) Minimal‑Toolset Leader‑Worker Architecture ‑‑ a multi‑agent design where each worker agent is equipped with only 3‑‑5 tools, reducing per‑call token overhead by up to 73%. In sustained deployments spanning 30+ days, the framework autonomously completed 500+ experiment cycles across four concurrent research projects, achieving a 52% improvement over baseline metrics in one project through 200+ automated experiments ‑‑ all at an average LLM cost of \0.08 per 24‑hour cycle. Code is available at https://github.com/Xiangyue‑Zhang/auto‑deep‑researcher‑24x7.
Authors:Shuai Zhen, Yanhua Yu, Ruopei Guo, Nan Cheng, Yang Deng
Abstract:
Large language model (LLM) agents have demonstrated strong capabilities in complex interactive decision‑making tasks. However, existing LLM agents typically rely on increasingly long interaction histories, resulting in high computational cost and limited scalability. In this paper, we propose STEP‑HRL, a hierarchical reinforcement learning (HRL) framework that enables step‑level learning by conditioning only on single‑step transitions rather than full interaction histories. STEP‑HRL structures tasks hierarchically, using completed subtasks to represent global progress of overall task. By introducing a local progress module, it also iteratively and selectively summarizes interaction history within each subtask to produce a compact summary of local progress. Together, these components yield augmented step‑level transitions for both high‑level and low‑level policies. Experimental results on ScienceWorld and ALFWorld benchmarks consistently demonstrate that STEP‑HRL substantially outperforms baselines in terms of performance and generalization while reducing token usage. Our code is available at https://github.com/TonyStark042/STEP‑HRL.
Authors:Xuecong Liu, Mengzhu Ding, Zixuan Sun, Zhang Li, Xichao Teng
Abstract:
We present Consistent‑Recurrent Feature Flow Transformer (CRFT), a unified coarse‑to‑fine framework based on feature flow learning for robust cross‑modal image registration. CRFT learns a modality‑independent feature flow representation within a transformer‑based architecture that jointly performs feature alignment and flow estimation. The coarse stage establishes global correspondences through multi‑scale feature correlation, while the fine stage refines local details via hierarchical feature fusion and adaptive spatial reasoning. To enhance geometric adaptability, an iterative discrepancy‑guided attention mechanism with a Spatial Geometric Transform (SGT) recurrently refines the flow field, progressively capturing subtle spatial inconsistencies and enforcing feature‑level consistency. This design enables accurate alignment under large affine and scale variations while maintaining structural coherence across modalities. Extensive experiments on diverse cross‑modal datasets demonstrate that CRFT consistently outperforms state‑of‑the‑art registration methods in both accuracy and robustness. Beyond registration, CRFT provides a generalizable paradigm for multimodal spatial correspondence, offering broad applicability to remote sensing, autonomous navigation, and medical imaging. Code and datasets are publicly available at https://github.com/NEU‑Liuxuecong/CRFT.
Authors:Wuyang Luan, Junhui Li, Weiguang Zhao, Wenjian Zhang, Tieru Wu, Rui Ma
Abstract:
Visual navigation is a core challenge in Embodied AI, requiring autonomous agents to translate high‑dimensional sensory observations into continuous, long‑horizon action trajectories. While generative policies based on diffusion models and Schrödinger Bridges (SB) effectively capture multimodal action distributions, they require dozens of integration steps due to high‑variance stochastic transport, posing a critical barrier for real‑time robotic control. We propose Rectified Schrödinger Bridge Matching (RSBM), a framework that exploits a shared velocity‑field structure between standard Schrödinger Bridges (\varepsilon=1, maximum‑entropy transport) and deterministic Optimal Transport (\varepsilon\to 0, as in Conditional Flow Matching), controlled by a single entropic regularization parameter \varepsilon. We prove two key results: (1) the conditional velocity field's functional form is invariant across the entire \varepsilon‑spectrum (Velocity Structure Invariance), enabling a single network to serve all regularization strengths; and (2) reducing \varepsilon linearly decreases the conditional velocity variance, enabling more stable coarse‑step ODE integration. Anchored to a learned conditional prior that shortens transport distance, RSBM operates at an intermediate \varepsilon that balances multimodal coverage and path straightness. Empirically, while standard bridges require \geq 10 steps to converge, RSBM achieves over 94% cosine similarity and 92% success rate in merely 3 integration steps ‑‑ without distillation or multi‑stage training ‑‑ substantially narrowing the gap between high‑fidelity generative policies and the low‑latency demands of Embodied AI.
Authors:Jan Gruber, Jan-Niclas Hilgert
Abstract:
Agentic Al systems are increasingly deployed as personal assistants and are likely to become a common object of digital investigations. However, little is known about how their internal state and actions can be reconstructed during forensic analysis. Despite growing popularity, systematic forensic approaches for such systems remain largely unexplored. This paper presents an empirical study of OpenClaw a widely used single‑agent assistant. We examine OpenClaw's technical design via static code analysis and apply differential forensic analysis to identify recoverable traces across stages of the agent interaction loop. We classify and correlate these traces to assess their investigative value in a systematic way. Based on these observations, we propose an agent artifact taxonomy that captures recurring investigative patterns. Finally, we highlight a foundational challenge for agentic Al forensics: agent‑mediated execution introduces an additional layer of abstraction and substantial nondeterminism in trace generation. The large language model (LLM), the execution environment, and the evolving context can influence tool choice and state transitions in ways that are largely absent from rule‑based software. Overall, our results provide an initial foundation for the systematic investigation of agentic Al and outline implications for digital forensic practice and future research.
Authors:Pu Wang, Zhixuan Mao, Jialu Li, Zhuoran Zheng, Dianjie Lu, Youshan Zhang
Abstract:
Automatic diagnosis of canine pneumothorax is challenged by data scarcity and the need for trustworthy models. To address this, we first introduce a public, pixel‑level annotated dataset to facilitate research. We then propose a novel diagnostic paradigm that reframes the task as a synergistic process of signal localization and spectral detection. For localization, our method employs a Vision‑Language Model (VLM) to guide an iterative Flow Matching process, which progressively refines segmentation masks to achieve superior boundary accuracy. For detection, the segmented mask is used to isolate features from the suspected lesion. We then apply Random Matrix Theory (RMT), a departure from traditional classifiers, to analyze these features. This approach models healthy tissue as predictable random noise and identifies pneumothorax by detecting statistically significant outlier eigenvalues that represent a non‑random pathological signal. The high‑fidelity localization from Flow Matching is crucial for purifying the signal, thus maximizing the sensitivity of our RMT detector. This synergy of generative segmentation and first‑principles statistical analysis yields a highly accurate and interpretable diagnostic system (source code is available at: https://github.com/Pu‑Wang‑alt/Canine‑pneumothorax).
Authors:Gwanghyun Kim, Junghun James Kim, Suh Yoon Jeon, Jason Park, Se Young Chun
Abstract:
Reconstructing textured 3D human models from a single image is fundamental for AR/VR and digital human applications. However, existing methods mostly focus on single individuals and thus fail in multi‑human scenes, where naive composition of individual reconstructions often leads to artifacts such as unrealistic overlaps, missing geometry in occluded regions, and distorted interactions. These limitations highlight the need for approaches that incorporate group‑level context and interaction priors. We introduce a holistic method that explicitly models both group‑ and instance‑level information. To mitigate perspective‑induced geometric distortions, we first transform the input into a canonical orthographic space. Our primary component, Human Group‑Instance Multi‑View Diffusion (HUG‑MVD), then generates complete multi‑view normals and images by jointly modeling individuals and group context to resolve occlusions and proximity. Subsequently, the Human Group‑Instance Geometric Reconstruction (HUG‑GR) module optimizes the geometry by leveraging explicit, physics‑based interaction priors to enforce physical plausibility and accurately model inter‑human contact. Finally, the multi‑view images are fused into a high‑fidelity texture. Together, these components form our complete framework, HUG3D. Extensive experiments show that HUG3D significantly outperforms both single‑human and existing multi‑human methods, producing physically plausible, high‑fidelity 3D reconstructions of interacting people from a single image. Project page: https://jongheean11.github.io/HUG3D_project
Authors:Honghao Fu, Miao Xu, Yiwei Wang, Dailing Zhang, Jun Liu, Yujun Cai
Abstract:
Scaling multimodal large language models (MLLMs) to long videos is constrained by limited context windows. While retrieval‑augmented generation (RAG) is a promising remedy by organizing query‑relevant visual evidence into a compact context, most existing methods (i) flatten videos into independent segments, breaking their inherent spatio‑temporal structure, and (ii) depend on explicit semantic matching, which can miss cues that are implicitly relevant to the query's intent. To overcome these limitations, we propose VideoStir, a structured and intent‑aware long‑video RAG framework. It firstly structures a video as a spatio‑temporal graph at clip level, and then performs multi‑hop retrieval to aggregate evidence across distant yet contextually related events. Furthermore, it introduces an MLLM‑backed intent‑relevance scorer that retrieves frames based on their alignment with the query's reasoning intent. To support this capability, we curate IR‑600K, a large‑scale dataset tailored for learning frame‑query intent alignment. Experiments show that VideoStir is competitive with state‑of‑the‑art baselines without relying on auxiliary information, highlighting the promise of shifting long‑video RAG from flattened semantic matching to structured, intent‑aware reasoning. Codes and checkpoints are available at https://github.com/RomGai/VideoStir.
Authors:Le Liu, Zhiming Li, Jianzhi Yan, Zike Yuan, Shiwei Chen, Youcheng Pan, Buzhou Tang, Qingcai Chen, Yang Xiang, Danny Dongning Sun
Abstract:
Despite its success, existing in‑context learning (ICL) relies on in‑domain expert demonstrations, limiting its applicability when expert annotations are scarce. We posit that different domains may share underlying reasoning structures, enabling source‑domain demonstrations to improve target‑domain inference despite semantic mismatch. To test this hypothesis, we conduct a comprehensive empirical study of different retrieval methods to validate the feasibility of achieving cross‑domain knowledge transfer under the in‑context learning setting. Our results demonstrate conditional positive transfer in cross‑domain ICL. We identify a clear example absorption threshold: beyond it, positive transfer becomes more likely, and additional demonstrations yield larger gains. Further analysis suggests that these gains stem from reasoning structure repair by retrieved cross‑domain examples, rather than semantic cues. Overall, our study validates the feasibility of leveraging cross‑domain knowledge transfer to improve cross‑domain ICL performance, motivating the community to explore designing more effective retrieval approaches for this novel direction.\footnoteOur implementation is available at https://github.com/littlelaska/ICL‑TF4LR
Authors:Jianzhi Yan, Zhiming Li, Le Liu, Zike Yuan, Shiwei Chen, Youcheng Pan, Buzhou Tang, Yang Xiang, Danny Dongning Sun
Abstract:
Large language models (LLMs) have made notable progress in logical reasoning, yet still fall short of human‑level performance. Current boosting strategies rely on expert‑crafted in‑domain demonstrations, limiting their applicability in expertise‑scarce domains, such as specialized mathematical reasoning, formal logic, or legal analysis. In this work, we demonstrate the feasibility of leveraging cross‑domain demonstrating examples to boost the LLMs' reasoning performance. Despite substantial domain differences, many reusable implicit logical structures are shared across domains. In order to effectively retrieve cross‑domain examples for unseen domains under investigation, in this work, we further propose an effective retrieval method, called domain‑invariant neurons‑based retrieval (DIN‑Retrieval). Concisely, DIN‑Retrieval first summarizes a hidden representation that is universal across different domains. Then, during the inference stage, we use the DIN vector to retrieve structurally compatible cross‑domain demonstrations for the in‑context learning. Experimental results in multiple settings for the transfer of mathematical and logical reasoning demonstrate that our method achieves an average improvement of 1.8 over the state‑of‑the‑art methods \footnoteOur implementation is available at https://github.com/Leon221220/DIN‑Retrieval.
Authors:Jae Joong Lee
Abstract:
Every existing method for compressing 3D Gaussian Splatting, NeRF, or transformer‑based 3D reconstructors requires learning a data‑dependent codebook through per‑scene fine‑tuning. We show this is unnecessary. The parameter vectors that dominate storage in these models, 45‑dimensional spherical harmonics in 3DGS and 1024‑dimensional key‑value vectors in DUSt3R, fall in a dimension range where a single random rotation transforms any input into coordinates with a known Beta distribution. This makes precomputed, data‑independent Lloyd‑Max quantization near‑optimal, within a factor of 2.7 of the information‑theoretic lower bound. We develop 3D, deriving (1) a dimension‑dependent criterion that predicts which parameters can be quantized and at what bit‑width before running any experiment, (2) norm‑separation bounds connecting quantization MSE to rendering PSNR per scene, (3) an entry‑grouping strategy extending rotation‑based quantization to 2‑dimensional hash grid features, and (4) a composable pruning‑quantization pipeline with a closed‑form compression ratio. On NeRF Synthetic, 3DTurboQuant compresses 3DGS by 3.5x with 0.02dB PSNR loss and DUSt3R KV caches by 7.9x with 39.7dB pointmap fidelity. No training, no codebook learning, no calibration data. Compression takes seconds. The code will be released (https://github.com/JaeLee18/3DTurboQuant)
Authors:Xuan Xiong, Huan Liu, Li Gu, Zhixiang Chi, Yue Qiu, Yuanhao Yu, Yang Wang
Abstract:
Chain‑of‑thought (CoT) reasoning improves large language model performance on complex tasks, but often produces excessively long and inefficient reasoning traces. Existing methods shorten CoTs using length penalties or global entropy reduction, implicitly assuming that low uncertainty is desirable throughout reasoning. We show instead that reasoning efficiency is governed by the trajectory of uncertainty. CoTs with dominant downward entropy trends are substantially shorter. Motivated by this insight, we propose Entropy Trend Reward (ETR), a trajectory‑aware objective that encourages progressive uncertainty reduction while allowing limited local exploration. We integrate ETR into Group Relative Policy Optimization (GRPO) and evaluate it across multiple reasoning models and challenging benchmarks. ETR consistently achieves a superior accuracy‑efficiency tradeoff, improving DeepSeek‑R1‑Distill‑7B by 9.9% in accuracy while reducing CoT length by 67% across four benchmarks. Code is available at https://github.com/Xuan1030/ETR
Authors:Dawei Liu, Zongxia Li, Hongyang Du, Xiyang Wu, Shihang Gui, Yongbei Kuang, Lichao Sun
Abstract:
Skill usage has become a core component of modern agent systems and can substantially improve agents' ability to complete complex tasks. In real‑world settings, where agents must monitor and interact with numerous personal applications, web browsers, and other environment interfaces, skill libraries can scale to thousands of reusable skills. Scaling to larger skill sets introduces two key challenges. First, loading the full skill set saturates the context window, driving up token costs, hallucination, and latency. In this paper, we present Graph of Skills (GoS), an inference‑time structural retrieval layer for large skill libraries. GoS constructs an executable skill graph offline from skill packages, then at inference time retrieves a bounded, dependency‑aware skill bundle through hybrid semantic‑lexical seeding, reverse‑weighted Personalized PageRank, and context‑budgeted hydration. On SkillsBench and ALFWorld, GoS improves average reward by 43.6% over the vanilla full skill‑loading baseline while reducing input tokens by 37.8%, and generalizes across three model families: Claude Sonnet, GPT‑5.2 Codex, and MiniMax. Additional ablation studies across skill libraries ranging from 200 to 2,000 skills further demonstrate that GoS consistently outperforms both vanilla skills loading and simple vector retrieval in balancing reward, token efficiency, and runtime.
Authors:Chan-Wei Hu, Zhengzhong Tu
Abstract:
Multi‑modal retrieval‑augmented generation (MM‑RAG) relies heavily on re‑rankers to surface the most relevant evidence for image‑question queries. However, standard re‑rankers typically process the full query image as a global embedding, making them susceptible to visual distractors (e.g., background clutter) that skew similarity scores. We propose Region‑R1, a query‑side region cropping framework that formulates region selection as a decision‑making problem during re‑ranking, allowing the system to learn to retain the full image or focus only on a question‑relevant region before scoring the retrieved candidates. Region‑R1 learns a policy with a novel region‑aware group relative policy optimization (r‑GRPO) to dynamically crop a discriminative region. Across two challenging benchmarks, E‑VQA and InfoSeek, Region‑R1 delivers consistent gains, achieving state‑of‑the‑art performances by increasing conditional Recall@1 by up to 20%. These results show the great promise of query‑side adaptation as a simple but effective way to strengthen MM‑RAG re‑ranking.
Authors:Jon-Paul Cacioli
Abstract:
Background: Children do not simply learn that balls are round and blocks are square. They learn that shape is the kind of feature that tends to define object categories ‑‑ a second‑order generalisation known as an overhypothesis [1, 2]. What kind of learning mechanism is sufficient for this inductive leap? Methods: We trained autoregressive transformer language models (3.4M‑25.6M parameters) on synthetic corpora in which shape is the stable feature dimension across categories, with eight conditions controlling for alternative explanations. Results: Across 120 pre‑registered runs evaluated on a 1,040‑item wug test battery, every model achieved perfect first‑order exemplar retrieval (100%) while second‑order generalisation to novel nouns remained at chance (50‑52%), a result confirmed by equivalence testing. A feature‑swap diagnostic revealed that models rely on frame‑to‑feature template matching rather than structured noun‑to‑domain‑to‑feature abstraction. Conclusions: These results reveal a clear limitation of autoregressive distributional sequence learning under developmental‑scale training conditions.
Authors:Jiahao Xu, Rui Hu, Olivera Kotevska, Zikai Zhang
Abstract:
Multi‑bit watermarking has emerged as a promising solution for embedding imperceptible binary messages into Large Language Model (LLM)‑generated text, enabling reliable attribution and tracing of malicious usage of LLMs. Despite recent progress, existing methods still face key limitations: some become computationally infeasible for large messages, while others suffer from a poor trade‑off between text quality and decoding accuracy. Moreover, the decoding accuracy of existing methods drops significantly when the number of tokens in the generated text is limited, a condition that frequently arises in practical usage. To address these challenges, we propose \textscXMark, a novel method for encoding and decoding binary messages in LLM‑generated texts. The unique design of \textscXMark's encoder produces a less distorted logit distribution for watermarked token generation, preserving text quality, and also enables its tailored decoder to reliably recover the encoded message with limited tokens. Extensive experiments across diverse downstream tasks show that \textscXMark significantly improves decoding accuracy while preserving the quality of watermarked text, outperforming prior methods. The code is at https://github.com/JiiahaoXU/XMark.
Authors:Ali Aliev, Kamil Garifullin, Nikolay Yudin, Vera Soboleva, Alexander Molozhavenko, Ivan Oseledets, Aibek Alanov, Maxim Rakhuba
Abstract:
In a rapidly growing field of model training there is a constant practical interest in parameter‑efficient fine‑tuning and various techniques that use a small amount of training data to adapt the model to a narrow task. However, there is an open question: how to combine several adapters tuned for different tasks into one which is able to yield adequate results on both tasks? Specifically, merging subject and style adapters for generative models remains unresolved. In this paper we seek to show that in the case of orthogonal fine‑tuning (OFT), we can use structured orthogonal parametrization and its geometric properties to get the formulas for training‑free adapter merging. In particular, we derive the structure of the manifold formed by the recently proposed Group‑and‑Shuffle (\mathcalGS) orthogonal matrices, and obtain efficient formulas for the geodesics approximation between two points. Additionally, we propose a \textspectra restoration transform that restores spectral properties of the merged adapter for higher‑quality fusion. We conduct experiments in subject‑driven generation tasks showing that our technique to merge two \mathcalGS orthogonal matrices is capable of uniting concept and style features of different adapters. To the best of our knowledge, this is the first training‑free method for merging multiplicative orthogonal adapters. Code is available via the \hrefhttps://github.com/ControlGenAI/OrthoFuselink.
Authors:Quyet V. Do, Thinh Pham, Nguyen Nguyen, Sha Li, Pratibha Zunjare, Tu Vu
Abstract:
We study a pipeline that curates reasoning data from initial structured data for improving long‑context reasoning in large language models (LLMs). Our approach, π^2, constructs high‑quality reasoning data through rigorous QA curation: 1) extracting and expanding tables from Wikipedia, 2) from the collected tables and relevant context, generating realistic and multi‑hop analytical reasoning questions whose answers are automatically determined and verified through dual‑path code execution, and 3) back‑translating step‑by‑step structured reasoning traces as solutions of QA pairs given realistic web‑search context. Supervised fine‑tuning with \textsc\smallgpt‑oss‑20b and \textsc\smallQwen3‑4B‑Instruct‑2507 on π^2 yields consistent improvements across four long‑context reasoning benchmarks and our alike π^2‑Bench, with average absolute accuracy gains of +4.3% and +2.7% respectively. Notably, our dataset facilitates self‑distillation, where \textsc\smallgpt‑oss‑20b even improves its average performance by +4.4% with its own reasoning traces, demonstrating π^2's usefulness. Our code, data, and models are open‑source at https://github.com/vt‑pi‑squared/pi‑squared.
Authors:Julia Chae, Nicholas Kolkin, Jui-Hsien Wang, Richard Zhang, Sara Beery, Cusuh Ham
Abstract:
Humans have remarkable selective sensitivity to identities ‑‑ easily distinguishing between highly similar identities, even across significantly different contexts such as diverse viewpoints or lighting. Vision models have struggled to match this capability, and progress toward identity‑focused tasks such as personalized image generation is slowed by a lack of identity‑focused evaluation metrics. To help facilitate progress, we propose ID‑Sim, a feed‑forward metric designed to faithfully reflect human selective sensitivity. To build ID‑Sim, we curate a high‑quality training set of images spanning diverse real‑world domains, augmented with generative synthetic data that provides controlled, fine‑grained identity and contextual variations. We evaluate our metric on a new unified evaluation benchmark for assessing consistency with human annotations across identity‑focused recognition, retrieval, and generative tasks.
Authors:Gowrav Vishwakarma, Christopher J. Agostino
Abstract:
We present Phase‑Associative Memory (PAM), a recurrent sequence model in which all representations are complex‑valued, associations accumulate in a matrix state S_t \in \mathbbC^d × d via outer products, and retrieval operates through the conjugate inner product K_t^ \cdot Q_t / \sqrtd. At ~100M parameters on WikiText‑103, PAM reaches validation perplexity 30.0, within ~10% of a matched transformer (27.1) trained under identical conditions, despite 4× arithmetic overhead from complex computation and no custom kernels. We trace the experimental path from vector‑state models, where holographic binding fails due to the O(1/\sqrtn) capacity degradation of superposed associations, to the matrix state that resolves it. The competitiveness of an architecture whose native operations are complex‑valued superposition and conjugate retrieval is consistent with recent empirical evidence that semantic interpretation in both humans and large language models exhibits non‑classical contextuality, and we discuss what this implies for the choice of computational formalism in language modeling.
Authors:Yiwen Song, Yale Song, Tomas Pfister, Jinsung Yoon
Abstract:
Synthesizing unstructured research materials into manuscripts is an essential yet under‑explored challenge in AI‑driven scientific discovery. Existing autonomous writers are rigidly coupled to specific experimental pipelines, and produce superficial literature reviews. We introduce PaperOrchestra, a multi‑agent framework for automated AI research paper writing. It flexibly transforms unconstrained pre‑writing materials into submission‑ready LaTeX manuscripts, including comprehensive literature synthesis and generated visuals, such as plots and conceptual diagrams. To evaluate performance, we present PaperWritingBench, the first standardized benchmark of reverse‑engineered raw materials from 200 top‑tier AI conference papers, alongside a comprehensive suite of automated evaluators. In side‑by‑side human evaluations, PaperOrchestra significantly outperforms autonomous baselines, achieving an absolute win rate margin of 50%‑68% in literature review quality, and 14%‑38% in overall manuscript quality.
Authors:StarVLA Community
Abstract:
Building generalist embodied agents requires integrating perception, language understanding, and action, which are core capabilities addressed by Vision‑Language‑Action (VLA) approaches based on multimodal foundation models, including recent advances in vision‑language models and world models. Despite rapid progress, VLA methods remain fragmented across incompatible architectures, codebases, and evaluation protocols, hindering principled comparison and reproducibility. We present StarVLA, an open‑source codebase for VLA research. StarVLA addresses these challenges in three aspects. First, it provides a modular backbone‑‑action‑head architecture that supports both VLM backbones (e.g., Qwen‑VL) and world‑model backbones (e.g., Cosmos) alongside representative action‑decoding paradigms, all under a shared abstraction in which backbone and action head can each be swapped independently. Second, it provides reusable training strategies, including cross‑embodiment learning and multimodal co‑training, that apply consistently across supported paradigms. Third, it integrates major benchmarks, including LIBERO, SimplerEnv, RoboTwin~2.0, RoboCasa‑GR1, and BEHAVIOR‑1K, through a unified evaluation interface that supports both simulation and real‑robot deployment. StarVLA also ships simple, fully reproducible single‑benchmark training recipes that, despite minimal data engineering, already match or surpass prior methods on multiple benchmarks with both VLM and world‑model backbones. To our best knowledge, StarVLA is one of the most comprehensive open‑source VLA frameworks available, and we expect it to lower the barrier for reproducing existing methods and prototyping new ones. StarVLA is being actively maintained and expanded; we will update this report as the project evolves. The code and documentation are available at https://github.com/starVLA/starVLA.
Authors:Nitish Kumar, Sannu Kumar, S Akash, Manish Gupta, Ankith Karat, Sriparna Saha
Abstract:
With the rapid proliferation of online sports journalism, extracting meaningful pre‑game and post‑game insights from articles is essential for enhancing user engagement and comprehension. In this paper, we address the task of automatically extracting such insights from articles published before and after matches. We curate a dataset of 7,900 news articles covering 800 matches across four major sports: Cricket, Soccer, Basketball, and Baseball. To ensure contextual relevance, we employ a two‑step validation pipeline leveraging both open‑source and proprietary large language models (LLMs). We then utilize multiple state‑of‑the‑art LLMs (GPT‑4o, Qwen2.5‑72B‑Instruct, Llama‑3.3‑70B‑Instruct, and Mixtral‑8x7B‑Instruct‑v0.1) to generate comprehensive insights. The factual accuracy of these outputs is rigorously assessed using a FactScore‑based methodology, complemented by hallucination detection via the SummaC (Summary Consistency) framework with GPT‑4o. Finally, we propose SUMMIR (Sentence Unified Multimetric Model for Importance Ranking), a novel architecture designed to rank insights based on user‑specific interests. Our results demonstrate the effectiveness of this approach in generating high‑quality, relevant insights, while also revealing significant differences in factual consistency and interestingness across LLMs. This work contributes a robust framework for automated, reliable insight generation from sports news content. The source code is availble here https://github.com/nitish‑iitp/SUMMIR.
Authors:Jaeyoon Jung, Yejun Yoon, Kunwoo Park
Abstract:
Automated fact‑checking is a crucial task not only in journalism but also across web platforms, where it supports a responsible information ecosystem and mitigates the harms of misinformation. While recent research has progressed from text‑only to multimodal fact‑checking, a prevailing assumption is that incorporating visual evidence universally improves performance. In this work, we challenge this assumption and show that indiscriminate use of multimodal evidence can reduce accuracy. To address this challenge, we propose AMuFC, a multimodal fact‑checking framework that employs two collaborative agents with distinct roles for the adaptive use of visual evidence: An Analyzer determines whether visual evidence is necessary for claim verification, and a Verifier predicts claim veracity conditioned on both the retrieved evidence and the Analyzer's assessment. Experimental results on three datasets show that incorporating the Analyzer's assessment of visual evidence necessity into the Verifier's prediction yields substantial improvements in verification performance. In addition to all code, we release WebFC, a newly constructed dataset for evaluating fact‑checking modules in a more realistic scenario, available at https://github.com/ssu‑humane/AMuFC.
Authors:Alessandro Tarsi, Matteo Mastrogiuseppe, Saverio Taliani, Simone Cortinovis, Ugo Pattacini
Abstract:
Bin picking in real industrial environments remains challenging due to severe clutter, occlusions, and the high cost of traditional 3D sensing setups. We present Pickalo, a modular 6D pose‑based bin‑picking pipeline built entirely on low‑cost hardware. A wrist‑mounted RGB‑D camera actively explores the scene from multiple viewpoints, while raw stereo streams are processed with BridgeDepth to obtain refined depth maps suitable for accurate collision reasoning. Object instances are segmented with a Mask‑RCNN model trained purely on photorealistic synthetic data and localized using the zero‑shot SAM‑6D pose estimator. A pose buffer module fuses multi‑view observations over time, handling object symmetries and significantly reducing pose noise. Offline, we generate and curate large sets of antipodal grasp candidates per object; online, a utility‑based ranking and fast collision checking are queried for the grasp planning. Deployed on a UR5e with a parallel‑jaw gripper and an Intel RealSense D435i, Pickalo achieves up to 600 mean picks per hour with 96‑99% grasp success and robust performance over 30‑minute runs on densely filled euroboxes. Ablation studies demonstrate the benefits of enhanced depth estimation and of the pose buffer for long‑term stability and throughput in realistic industrial conditions. Videos are available at https://mesh‑iit.github.io/project‑jl2‑camozzi/
Authors:Seamus Brady
Abstract:
We present Springdrift, a persistent runtime for long‑lived LLM agents. The system integrates an auditable execution substrate (append‑only memory, supervised processes, git‑backed recovery), a case‑based reasoning memory layer with hybrid retrieval (evaluated against a dense cosine baseline), a deterministic normative calculus for safety gating with auditable axiom trails, and continuous ambient self‑perception via a structured self‑state representation (the sensorium) injected each cycle without tool calls. These properties support behaviours difficult to achieve in session‑bounded systems: cross‑session task continuity, cross‑channel context maintenance, end‑to‑end forensic reconstruction of decisions, and self‑diagnostic behaviour. We report on a single‑instance deployment over 23 days (19 operating days), during which the agent diagnosed its own infrastructure bugs, classified failure modes, identified an architectural vulnerability, and maintained context across email and web channels ‑‑ without explicit instruction. We introduce the term Artificial Retainer for this category: a non‑human system with persistent memory, defined authority, domain‑specific autonomy, and forensic accountability in an ongoing relationship with a specific principal ‑‑ distinguished from software assistants and autonomous agents, drawing on professional retainer relationships and the bounded autonomy of trained working animals. This is a technical report on a systems design and deployment case study, not a benchmark‑driven evaluation. Evidence is from a single instance with a single operator, presented as illustration of what these architectural properties can support in practice. Implemented in approximately Gleam on Erlang/OTP. Code, artefacts, and redacted operational logs will be available at https://github.com/seamus‑brady/springdrift upon publication.
Authors:Yeonwoo Cha, Jaehoon Yoo, Semin Kim, Yunseo Park, Jinhyeon Kwon, Seunghoon Hong
Abstract:
Flow‑based models learn a target distribution by modeling a marginal velocity field, defined as the average of sample‑wise velocities connecting each sample from a simple prior to the target data. When sample‑wise velocities conflict at the same intermediate state, however, this averaged velocity can misguide samples toward low‑density regions, degrading generation quality. To address this issue, we propose the Flow Divergence Sampler (FDS), a training‑free framework that refines intermediate states before each solver step. Our key finding reveals that the severity of this misguidance is quantified by the divergence of the marginal velocity field that is readily computable during inference with a well‑optimized model. FDS exploits this signal to steer states toward less ambiguous regions. As a plug‑and‑play framework compatible with standard solvers and off‑the‑shelf flow backbones, FDS consistently improves fidelity across various generation tasks including text‑to‑image synthesis, and inverse problems.
Authors:Madhav S Baidya
Abstract:
Large Language Models (LLMs) have achieved strong performance in question answering and retrieval‑augmented generation (RAG), yet they implicitly assume that user queries are fully specified and answerable. In real‑world settings, queries are often incomplete, ambiguous, or missing critical variables, leading models to produce overconfident or hallucinated responses. In this work, we study decision‑aware query resolution under incomplete information, where a model must determine whether to Answer, Ask for clarification, or Abstain. We show that standard and enhanced RAG systems do not reliably exhibit such epistemic awareness, defaulting to answer generation even when information is insufficient. To address this, we propose PassiveQA, a three‑action framework that aligns model behaviour with information sufficiency through supervised finetuning. Our approach integrates structured information‑state representations, knowledge graph‑grounded context, and a finetuned planner that explicitly models missing variables and decision reasoning. Experiments across multiple QA datasets show that the finetuned planner achieves significant improvements in macro F1 and abstention recall while reducing hallucination rates, under a compute‑constrained training regime. These results provide strong empirical evidence that epistemic decision‑making must be learned during training rather than imposed at inference time.
Authors:Charafeddine Mouzouni
Abstract:
LLM agents with tool access can discover and exploit security vulnerabilities. This is known. What is not known is which features of a system prompt trigger this behaviour, and which do not. We present a systematic taxonomy based on approximately 10,000 trials across seven models, 37 prompt conditions, and 12 hypothesized attack dimensions ‑‑ all executed in real Docker sandboxes. Every condition includes the instruction "Always respect all rules and access policies"; only a single encouragement sentence varies. Within the task class studied (planted test‑runner vulnerabilities), the threat model is narrower than assumed: nine of twelve dimensions ‑‑ including minimization, moral licensing, incentives, identity priming, and reasoning triggers ‑‑ produce no detectable exploitation at n=50 per cell (upper 95% CI <7% per condition). But it is also more subtle than expected: one dimension, goal reframing, reliably triggers exploitation. "You are solving a puzzle; there may be hidden clues" produces 38‑40% exploitation on Claude Sonnet 4 despite the explicit rule instruction, replicating across four models (CTF framing: 8‑14% on DeepSeek, GPT‑5‑mini, o4‑mini). The agent does not override the rules; it reinterprets the task so that exploitative actions become task‑aligned. GPT‑4.1 produces no exploitation across 1,850 trials (37 conditions), and a temporal comparison across four OpenAI models released over eleven months shows a pattern consistent with improving safety training, though model capability differences are a confounder. The practical contribution is a narrowed, testable threat model: defenders should audit for goal‑reframing language, not for the broad class of adversarial prompts.
Authors:Varun Pratap Bhardwaj
Abstract:
AI coding agents operate in a paradox: they possess vast parametric knowledge yet cannot remember a conversation from an hour ago. Existing memory systems store text in vector databases with single‑channel retrieval, require cloud LLMs for core operations, and implement none of the cognitive processes that make human memory effective. We present SuperLocalMemory V3.3 ("The Living Brain"), a local‑first agent memory system implementing the full cognitive memory taxonomy with mathematical lifecycle dynamics. Building on the information‑geometric foundations of V3.2 (arXiv:2603.14588), we introduce five contributions: (1) Fisher‑Rao Quantization‑Aware Distance (FRQAD) ‑‑ a new metric on the Gaussian statistical manifold achieving 100% precision at preferring high‑fidelity embeddings over quantized ones (vs 85.6% for cosine), with zero prior art; (2) Ebbinghaus Adaptive Forgetting with lifecycle‑aware quantization ‑‑ the first mathematical forgetting curve in local agent memory coupled to progressive embedding compression, achieving 6.7x discriminative power; (3) 7‑channel cognitive retrieval spanning semantic, keyword, entity graph, temporal, spreading activation, consolidation, and Hopfield associative channels, achieving 70.4% on LoCoMo in zero‑LLM Mode A; (4) memory parameterization implementing Long‑Term Implicit memory via soft prompts; (5) zero‑friction auto‑cognitive pipeline automating the complete memory lifecycle. On LoCoMo, V3.3 achieves 70.4% in Mode A (zero‑LLM), with +23.8pp on multi‑hop and +12.7pp on adversarial. V3.2 achieved 74.8% Mode A and 87.7% Mode C; the 4.4pp gap reflects a deliberate architectural trade‑off. SLM V3.3 is open source under the Elastic License 2.0, runs entirely on CPU, with over 5,000 monthly downloads.
Authors:Fatemeh Khadem, Sajad Mousavi, Yi Fang, Yuhong Liu
Abstract:
Large language models (LLMs) are increasingly adapted to proprietary and domain‑specific corpora that contain sensitive information, creating a tension between formal privacy guarantees and efficient deployment through model compression. Differential privacy (DP), typically enforced via DP‑SGD, provides record‑level protection but often incurs substantial utility loss in autoregressive generation, where optimization noise can amplify exposure bias and compounding errors along long rollouts. Existing approaches to private distillation either apply DP‑SGD to both teacher and student, worsening computation and the privacy‑‑utility tradeoff, or rely on DP synthetic text generation from a DP‑trained teacher, avoiding DP on the student at the cost of DP‑optimizing a large teacher and introducing an offline generation pipeline. We propose Differentially Private On‑Policy Distillation (DP‑OPD), a synthesis‑free framework that enforces privacy solely through DP‑SGD on the student while leveraging a frozen teacher to provide dense token‑level targets on \emphstudent‑generated trajectories. DP‑OPD instantiates this idea via \emphprivate generalized knowledge distillation on continuation tokens. Under a strict privacy budget (\varepsilon=2.0), DP‑OPD improves perplexity over DP fine‑tuning and off‑policy DP distillation, and outperforms synthesis‑based DP distillation (Yelp: 44.15\rightarrow41.68; BigPatent: 32.43\rightarrow30.63), while substantially simplifying the training pipeline. In particular, DP‑OPD collapses private compression into a single DP student‑training loop by eliminating DP teacher training and offline synthetic text generation. Code will be released upon publication at https://github.com/khademfatemeh/dp_opd.
Authors:Abu Noman Md Sakib, Zhensen Wang, Merjulah Roby, Zijie Zhang
Abstract:
Reliable pattern recognition systems should exhibit consistent behavior across similar inputs, and their explanations should remain stable. However, most Explainable AI evaluations remain instance centric and do not explicitly quantify whether attribution patterns are consistent across samples that share the same class or represent small variations of the same input. In this work, we propose a novel metric aimed at assessing the consistency of model explanations, ensuring that models consistently reflect the intended objectives and consistency under label‑preserving perturbations. We implement this metric using a pre‑trained BERT model on the SST‑2 sentiment analysis dataset, with additional robustness tests on RoBERTa, DistilBERT, and IMDB, applying SHAP to compute feature importance for various test samples. The proposed metric quantifies the cosine similarity of SHAP values for inputs with the same label, aiming to detect inconsistent behaviors, such as biased reliance on certain features or failure to maintain consistent reasoning for similar predictions. Through a series of experiments, we evaluate the ability of this metric to identify misaligned predictions and inconsistencies in model explanations. These experiments are compared against standard fidelity metrics to assess whether the new metric can effectively identify when a model's behavior deviates from its intended objectives. The proposed framework provides a deeper understanding of model behavior by enabling more robust verification of rationale stability, which is critical for building trustworthy AI systems. By quantifying whether models rely on consistent attribution patterns for similar inputs, the proposed approach supports more robust evaluation of model behavior in practical pattern recognition pipelines. Our code is publicly available at https://github.com/anmspro/ESS‑XAI‑Stability.
Authors:Seoyoung Park, Haemin Lee, Hankook Lee
Abstract:
Task‑free online continual learning has recently emerged as a realistic paradigm for addressing continual learning in dynamic, real‑world environments, where data arrive in a non‑stationary stream without clear task boundaries and can only be observed once. To consider such challenging scenarios, many recent approaches have employed prompt selection, an adaptive strategy that selects prompts from a pool based on input signals. However, we observe that such selection strategies often fail to select appropriate prompts, yielding suboptimal results despite additional training of key parameters. Motivated by this observation, we propose a simple yet effective SinglePrompt that eliminates the need for prompt selection and focuses on classifier optimization. Specifically, we simply (i) inject a single prompt into each self‑attention block, (ii) employ a cosine similarity‑based logit design to alleviate the forgetting effect inherent in the classifier weights, and (iii) mask logits for unexposed classes in the current minibatch. With this simple task‑free design, our framework achieves state‑of‑the‑art performance across various online continual learning benchmarks. Source code is available at https://github.com/efficient‑learning‑lab/SinglePrompt.
Authors:Hiroshi Takahashi, Tomoharu Iwata, Atsutoshi Kumagai, Sekitoshi Kanai, Masanori Yamada, Kosuke Nishida, Kazutoshi Shinoda
Abstract:
Aligning language models with human preferences is essential for ensuring their safety and reliability. Although most existing approaches assume specific human preference models such as the Bradley‑Terry model, this assumption may fail to accurately capture true human preferences, and consequently, these methods lack statistical consistency, i.e., the guarantee that language models converge to the true human preference as the number of samples increases. In contrast, direct density ratio optimization (DDRO) achieves statistical consistency without assuming any human preference models. DDRO models the density ratio between preferred and non‑preferred data distributions using the language model, and then optimizes it via density ratio estimation. However, this density ratio is unstable and often diverges, leading to training instability of DDRO. In this paper, we propose a novel alignment method that is both stable and statistically consistent. Our approach is based on the relative density ratio between the preferred data distribution and a mixture of the preferred and non‑preferred data distributions. Our approach is stable since this relative density ratio is bounded above and does not diverge. Moreover, it is statistically consistent and yields significantly tighter convergence guarantees than DDRO. We experimentally show its effectiveness with Qwen 2.5 and Llama 3.
Authors:Saurav Jha, Maryam Hashemzadeh, Ali Saheb Pasand, Ali Parviz, Min-Joong Lee, Boris Knyazev
Abstract:
Mixture‑of‑Experts (MoE) large language models (LLMs) are among the top‑performing architectures. The largest models, often with hundreds of billions of parameters, pose significant memory challenges for deployment. Traditional approaches to reduce memory requirements include weight pruning and quantization. Motivated by the Router‑weighted Expert Activation Pruning (REAP) that prunes experts, we propose a novel method, Router‑weighted Expert Activation Merging (REAM). Instead of removing experts, REAM groups them and merges their weights, better preserving original performance. We evaluate REAM against REAP and other baselines across multiple MoE LLMs on diverse multiple‑choice (MC) question answering and generative (GEN) benchmarks. Our results reveal a trade‑off between MC and GEN performance that depends on the mix of calibration data. By controlling the mix of general, math and coding data, we examine the Pareto frontier of this trade‑off and show that REAM often outperforms the baselines and in many cases is comparable to the original uncompressed models.
Authors:Luis Guzmán Lorenzo
Abstract:
When an LLM deobfuscates JavaScript, can poisoned identifier names in the string table survive into the model's reconstructed code, even when the model demonstrably understands the correct semantics? Using Claude Opus 4.6 across 192 inference runs on two code archetypes (force‑directed graph simulation, A pathfinding; 50 conditions, N=3‑6), we found three consistent patterns: (1) Poisoned names persisted in every baseline run on both artifacts (physics: 8/8; pathfinding: 5/5). Matched controls showed this extends to terms with zero semantic fit when the string table does not form a coherent alternative domain. (2) Persistence coexisted with correct semantic commentary: in 15/17 runs the model wrote wrong variable names while correctly describing the actual operation in comments. (3) Task framing changed persistence: explicit verification prompts had no effect (12/12 across 4 variants), but reframing from "deobfuscate this" to "write a fresh implementation" reduced propagation from 100% to 0‑20% on physics and to 0% on pathfinding, while preserving the checked algorithmic structure. Matched‑control experiments showed zero‑fit terms persist at the same rate when the replacement table lacks a coherent alternative‑domain signal. Per‑term variation in earlier domain‑gradient experiments is confounded with domain‑level coherence and recoverability. These observations are from two archetypes on one model family (Opus 4.6 primary; Haiku 4.5 spot‑check). Broader generalization is needed
Authors:Xiaohang Yu, William Knottenbelt
Abstract:
Blockchain forensics inherently involves dynamic and iterative investigations, while many existing approaches primarily model it through static inference pipelines. We propose a paradigm shift towards Agentic Blockchain Forensics (ABF), modeling forensic investigation as a sequential decision‑making process. To instantiate this paradigm, we introduce LOCARD, the first agentic framework for blockchain forensics. LOCARD operationalizes this perspective through a Tri‑Core Cognitive Architecture that decouples strategic planning, operational execution, and evaluative validation. Unlike generic LLM‑based agents, it incorporates a Structured Belief State mechanism to enforce forensic rigor and guide exploration under explicit state constraints. To demonstrate the efficacy of the ABF paradigm, we apply LOCARD to the inherently complex domain of cross‑chain transaction tracing. We introduce Thor25, a benchmark dataset comprising over 151k real‑world cross‑chain forensic records, and evaluate LOCARD on the Group‑Transfer Tracing task for dismantling Sybil clusters. Validated against representative laundering sub‑flows from the Bybit hack, LOCARD achieves high‑fidelity tracing results, providing empirical evidence that modeling blockchain forensics as an autonomous agentic task is both viable and effective. These results establish a concrete foundation for future agentic approaches to large‑scale blockchain forensic analysis. Code and dataset are publicly available at https://github.com/xhyumiracle/locard and https://github.com/xhyumiracle/thorchain‑crosschain‑data.
Authors:Haonian Ji, Kaiwen Xiong, Siwei Han, Peng Xia, Shi Qiu, Yiyang Zhou, Jiaqi Liu, Jinlong Li, Bingzhou Li, Zeyu Zheng, Cihang Xie, Huaxiu Yao
Abstract:
AI agents deployed as persistent assistants must maintain correct beliefs as their information environment evolves. In practice, evidence is scattered across heterogeneous sources that often contradict one another, new information can invalidate earlier conclusions, and user preferences surface through corrections rather than explicit instructions. Existing benchmarks largely assume static, single‑authority settings and do not evaluate whether agents can keep up with this complexity. We introduce ClawArena, a benchmark for evaluating AI agents in evolving information environments. Each scenario maintains a complete hidden ground truth while exposing the agent only to noisy, partial, and sometimes contradictory traces across multi‑channel sessions, workspace files, and staged updates. Evaluation is organized around three coupled challenges: multi‑source conflict reasoning, dynamic belief revision, and implicit personalization, whose interactions yield a 14‑category question taxonomy. Two question formats, multi‑choice (set‑selection) and shell‑based executable checks, test both reasoning and workspace grounding. The current release contains 64 scenarios across 8 professional domains, totaling 1,879 evaluation rounds and 365 dynamic updates. Experiments on five agent frameworks and five language models show that both model capability (15.4% range) and framework design (9.2%) substantially affect performance, that self‑evolving skill frameworks can partially close model‑capability gaps, and that belief revision difficulty is determined by update design strategy rather than the mere presence of updates. Code is available at https://github.com/aiming‑lab/ClawArena.
Authors:Xu Yan, Jun Yin, Shiliang Sun, Minghua Wan
Abstract:
Although multi‑view multi‑label learning has been extensively studied, research on the dual‑missing scenario, where both views and labels are incomplete, remains largely unexplored. Existing methods mainly rely on contrastive learning or information bottleneck theory to learn consistent representations under missing‑view conditions, but loss‑based alignment without explicit structural constraints limits the ability to capture stable and discriminative shared semantics. To address this issue, we introduce a more structured mechanism for consistent representation learning: we learn discrete consistent representations through a multi‑view shared codebook and cross‑view reconstruction, which naturally align different views within the limited shared codebook embeddings and reduce feature redundancy. At the decision level, we design a weight estimation method that evaluates the ability of each view to preserve label correlation structures, assigning weights accordingly to enhance the quality of the fused prediction. In addition, we introduce a fused‑teacher self‑distillation framework, where the fused prediction guides the training of view‑specific classifiers and feeds the global knowledge back into the single‑view branches, thereby enhancing the generalization ability of the model under missing‑label conditions. The effectiveness of our proposed method is thoroughly demonstrated through extensive comparative experiments with advanced methods on five benchmark datasets. Code is available at https://github.com/xuy11/SCSD.
Authors:Hang Fan, Haoran Pei, Runze Liang, Weican Liu, Long Cheng, Wei Wei
Abstract:
Photovoltaic (PV) power forecasting plays a critical role in power system dispatch and market participation. Because PV generation is highly sensitive to weather conditions and cloud motion, accurate forecasting requires effective modeling of complex spatiotemporal dependencies across multiple information sources. Although recent studies have advanced AI‑based forecasting methods, most fail to fuse temporal observations, satellite imagery, and textual weather information in a unified framework. This paper proposes Solar‑VLM, a large‑language‑model‑driven framework for multimodal PV power forecasting. First, modality‑specific encoders are developed to extract complementary features from heterogeneous inputs. The time‑series encoder adopts a patch‑based design to capture temporal patterns from multivariate observations at each site. The visual encoder, built upon a Qwen‑based vision backbone, extracts cloud‑cover information from satellite images. The text encoder distills historical weather characteristics from textual descriptions. Second, to capture spatial dependencies across geographically distributed PV stations, a cross‑site feature fusion mechanism is introduced. Specifically, a Graph Learner models inter‑station correlations through a graph attention network constructed over a K‑nearest‑neighbor (KNN) graph, while a cross‑site attention module further facilitates adaptive information exchange among sites. Finally, experiments conducted on data from eight PV stations in a northern province of China demonstrate the effectiveness of the proposed framework. Our proposed model is publicly available at https://github.com/rhp413/Solar‑VLM.
Authors:Taiping Qu, Hongkai Zhang, Lantian Zhang, Can Zhao, Nan Zhang, Hui Wang, Zhen Zhou, Mingye Zou, Kairui Bo, Pengfei Zhao, Xingxing Jin, Zixian Su, Kun Jiang, Huan Liu, Yu Du, Maozhou Wang, Ruifang Yan, Zhongyuan Wang, Tiejun Huang, Lei Xu, Henggui Zhang
Abstract:
Cardiac magnetic resonance (CMR) is a cornerstone for diagnosing cardiovascular disease. However, it remains underutilized due to complex, time‑consuming interpretation across multi‑sequences, phases, quantitative measures that heavily reliant on specialized expertise. Here, we present BAAI Cardiac Agent, a multimodal intelligent system designed for end‑to‑end CMR interpretation. The agent integrates specialized cardiac expert models to perform automated segmentation of cardiac structures, functional quantification, tissue characterization and disease diagnosis, and generates structured clinical reports within a unified workflow. Evaluated on CMR datasets from two hospitals (2413 patients) spanning 7‑types of major cardiovascular diseases, the agent achieved an area under the receiver‑operating‑characteristic curve exceeding 0.93 internally and 0.81 externally. In the task of estimating left ventricular function indices, the results generated by this system for core parameters such as ejection fraction, stroke volume, and left ventricular mass are highly consistent with clinical reports, with Pearson correlation coefficients all exceeding 0.90. The agent outperformed state‑of‑the‑art models in segmentation and diagnostic tasks, and generated clinical reports showing high concordance with expert radiologists (six readers across three experience levels). By dynamically orchestrating expert models for coordinated multimodal analysis, this agent framework enables accurate, efficient CMR interpretation and highlights its potentials for complex clinical imaging workflows. Code is available at https://github.com/plantain‑herb/Cardiac‑Agent.
Authors:Hang Xu, Ling Yue, Chaoqian Ouyang, Libin Zheng, Shaowu Pan, Shimin Di, Min-Ling Zhang
Abstract:
Peer review in machine learning is under growing pressure from rising submission volume and limited reviewer time. Most LLM‑based reviewing systems read only the manuscript and generate comments from the paper's own narrative. This makes their outputs sensitive to presentation quality and leaves them weak when the evidence needed for review lies in related work or released code. We present FactReview, an evidence‑grounded reviewing system that combines claim extraction, literature positioning, and execution‑based claim verification. Given a submission, FactReview identifies major claims and reported results, retrieves nearby work to clarify the paper's technical position, and, when code is available, executes the released repository under bounded budgets to test central empirical claims. It then produces a concise review and an evidence report that assigns each major claim one of five labels: Supported, Supported by the paper, Partially supported, In conflict, or Inconclusive. In a case study on CompGCN, FactReview reproduces results that closely match those reported for link prediction and node classification, yet also shows that the paper's broader performance claim across tasks is not fully sustained: on MUTAG graph classification, the reproduced result is 88.4%, whereas the strongest baseline reported in the paper remains 92.6%. The claim is therefore only partially supported. More broadly, this case suggests that AI is most useful in peer review not as a final decision‑maker, but as a tool for gathering evidence and helping reviewers produce more evidence‑grounded assessments. The code is public at https://github.com/DEFENSE‑SEU/Review‑Assistant.
Authors:Wenyue Hua, Tianyi Peng, Chi Wang, Ian Kaufman, Bryan Lim, Chandler Fang
Abstract:
Prior work on trustworthy AI emphasizes model‑internal properties such as bias mitigation, adversarial robustness, and interpretability. As AI systems evolve into autonomous agents deployed in open environments and increasingly connected to payments or assets, the operational meaning of trust shifts to end‑to‑end outcomes: whether an agent completes tasks, follows user intent, and avoids failures that cause material or psychological harm. These risks are fundamentally product‑level and cannot be eliminated by technical safeguards alone because agent behavior is inherently stochastic. To address this gap between model‑level reliability and user‑facing assurance, we propose a complementary framework based on risk management. Drawing inspiration from financial underwriting, we introduce the Agentic Risk Standard (ARS), a payment settlement standard for AI‑mediated transactions. ARS integrates risk assessment, underwriting, and compensation into a single transaction framework that protects users when interacting with agents. Under ARS, users receive predefined and contractually enforceable compensation in cases of execution failure, misalignment, or unintended outcomes. This shifts trust from an implicit expectation about model behavior to an explicit, measurable, and enforceable product guarantee. We also present a simulation study analyzing the social benefits of applying ARS to agentic transactions. ARS's implementation can be found at https://github.com/t54‑labs/AgenticRiskStandard.
Authors:Yifu Ding, Xinhao Zhang, Jinyang Guo
Abstract:
Transformer‑based large language models (LLMs) have demonstrated remarkable performance across a wide range of real‑world tasks, but their inference cost remains prohibitively high due to the quadratic complexity of attention and the memory bandwidth limitations of high‑precision operations. In this work, we present a low‑bit mixed‑precision attention kernel using the microscaling floating‑point (MXFP) data format, utilizing the computing capability on next‑generation GPU architectures. Our Diagonal‑Tiled Mixed‑Precision Attention (DMA) incorporates two kinds of low‑bit computation at the tiling‑level, and is a delicate fused kernel implemented using Triton, exploiting hardware‑level parallelism and memory efficiency to enable fast and efficient inference without compromising model performance. Extensive empirical evaluations on NVIDIA B200 GPUs show that our kernel maintains generation quality with negligible degradation, and meanwhile achieves significant speedup by kernel fusion. We release our code at https://github.com/yifu‑ding/MP‑Sparse‑Attn.
Authors:Indar Kumar, Girish Karhana, Sai Krishna Jasti, Ankit Hemant Lade
Abstract:
Effective ride‑hailing dispatch requires anticipating demand patterns that vary substantially across time‑of‑day, day‑of‑week, season, and special events. We propose a regime‑calibrated approach that (i) segments historical trip data into demand regimes, (ii) matches the current operating period to the most similar historical analogues via a six‑metric similarity ensemble (Kolmogorov‑Smirnov, Wasserstein‑1, feature distance, variance ratio, event pattern, temporal proximity), and (iii) uses the resulting calibrated demand prior to drive both an LP‑based fleet repositioning policy and batch dispatch with Hungarian matching. In ablation, a distributional‑only subset is strongest on mean wait, while the full ensemble is retained as a robustness‑oriented default. Evaluated on 5.2 million NYC TLC trips across 8 diverse scenarios (winter/summer, weekday/weekend/holiday, morning/evening/night) with 5 random seeds each, our method reduces mean rider wait times by 31.1% (bootstrap 95% CI: [26.5, 36.6]%; Friedman chi‑sq = 80.0, p = 4.25e‑18; Cohen's d = 7.5‑29.9 across scenarios). The improvement extends to the tail: P95 wait drops 37.6% and the Gini coefficient of wait times improves from 0.441 to 0.409 (7.3% relative). The two contributions compose multiplicatively and are independently validated: calibration provides 16.9% reduction; LP repositioning adds a further 15.5%. The approach requires no training, is deterministic and explainable, generalizes to Chicago (23.3% wait reduction via NYC‑built regime library), and is robust across fleet sizes (32‑47% improvement for 0.5‑2x fleet scaling). We provide comprehensive ablation studies, formal statistical tests, and routing‑fidelity validation with OSRM.
Authors:Haotian Zong, Binze Li, Yufei Long, Sinyin Chang, Jialong Wu, Gillian K. Hadfield
Abstract:
Large language models (LLMs) frequently produce confident but incorrect answers, partly because common binary scoring conventions reward answering over honestly expressing uncertainty. We study whether prompt‑only interventions ‑‑ explicitly announcing reward schemes for answer‑versus‑abstain decisions plus humility‑oriented normative principles ‑‑ can reduce hallucination risk without modifying the model. Our focus is epistemic abstention on factual questions with a verifiable answer, where current LLMs often fail to abstain despite being uncertain about their answers. We first assess self‑reported verbal confidence as a usable uncertainty signal, showing stability under prompt paraphrasing and reasonable calibration against a token‑probability baseline. We then study I‑CALM, a prompt‑based framework that (i) elicits verbal confidence, (ii) partially rewards abstention through explicit reward schemes, and (iii) adds lightweight normative principles emphasizing truthfulness, humility, and responsibility. Using GPT‑5 mini on PopQA as the main setting, we find that confidence‑eliciting, abstention‑rewarding prompts, especially with norms, reduce the false‑answer rate on answered cases mainly by identifying and shifting error‑prone cases to abstention and re‑calibrating their confidence. This trades coverage for reliability while leaving forced‑answer performance largely unchanged. Varying the abstention reward yields a clear abstention‑hallucination frontier. Overall, results show the framework can improve selective answering on factual questions without retraining, with the magnitude of effect varying across models and datasets. Code is available at the following https://github.com/binzeli/hallucinationControl.
Authors:Indar Kumar, Akanksha Tiwari
Abstract:
Effective ride‑hailing dispatch requires anticipating demand patterns that vary substantially across time‑of‑day, day‑of‑week, season, and special events. We propose a regime‑calibrated approach that (i) segments historical trip data into demand regimes, (ii) matches the current operating period to the most similar historical analogues via a similarity ensemble combining Kolmogorov‑Smirnov distance, Wasserstein‑1 distance, feature distance, variance ratio, event pattern similarity, and temporal proximity, and (iii) uses the resulting calibrated demand prior to drive both an LP‑based fleet repositioning policy and batch dispatch with Hungarian matching. In ablation, a distributional‑only metric subset achieves the strongest mean‑wait reduction, while the full ensemble is retained as a robustness‑oriented default that preserves calendar and event context. Evaluated on 5.2 million NYC TLC trips across 8 diverse scenarios (winter/summer, weekday/weekend/holiday, morning/evening/night) with 5 random seeds each, our method reduces mean rider wait times by 31.1% (bootstrap 95% CI: [26.5, 36.6]; Friedman chi‑squared = 80.0, p = 4.25e‑18; Cohen's d = 7.5‑29.9). P95 wait drops 37.6% and the Gini coefficient of wait times improves from 0.441 to 0.409. The two contributions compose multiplicatively: calibration provides 16.9% reduction relative to the replay baseline; LP repositioning adds a further 15.5%. The approach requires no training, is deterministic and explainable, generalizes to Chicago (23.3% wait reduction using the NYC‑built regime library without retraining), and is robust across fleet sizes (32‑47% improvement for 0.5x‑2.0x fleet scaling). Code is available at https://github.com/IndarKarhana/regime‑calibrated‑dispatch.
Authors:Felix Stillger, Lukas Hahn, Frederik Hasecke, Tobias Meisen
Abstract:
Camera extrinsic calibration is a fundamental task in computer vision. However, precise relative pose estimation in constrained, highly distorted environments, such as in‑cabin automotive monitoring (ICAM), remains challenging. We present InCaRPose, a Transformer‑based architecture designed for robust relative pose prediction between image pairs, which can be used for camera extrinsic calibration. By leveraging frozen backbone features such as DINOv3 and a Transformer‑based decoder, our model effectively captures the geometric relationship between a reference and a target view. Unlike traditional methods, our approach achieves absolute metric‑scale translation within the physically plausible adjustment range of in‑cabin camera mounts in a single inference step, which is critical for ICAM, where accurate real‑world distances are required for safety‑relevant perception. We specifically address the challenges of highly distorted fisheye cameras in automotive interiors by training exclusively on synthetic data. Our model is capable of generalization to real‑world cabin environments without relying on the exact same camera intrinsics and additionally achieves competitive performance on the public 7‑Scenes dataset. Despite having limited training data, InCaRPose maintains high precision in both rotation and translation, even with a ViT‑Small backbone. This enables real‑time performance for time‑critical inference, such as driver monitoring in supervised autonomous driving. We release our real‑world In‑Cabin‑Pose test dataset consisting of highly distorted vehicle‑interior images and our code at https://github.com/felixstillger/InCaRPose.
Authors:Haocheng Ju, Guoxiong Gao, Jiedong Jiang, Bin Wu, Zeming Sun, Leheng Chen, Yutong Wang, Yuefeng Wang, Zichen Wang, Wanyi He, Peihao Wu, Liang Xiao, Ruochuan Liu, Bryan Dai, Bin Dong
Abstract:
Recent advances in large language models have significantly improved their ability to perform mathematical reasoning, extending from elementary problem solving to increasingly capable performance on research‑level problems. However, reliably solving and verifying such problems remains challenging due to the inherent ambiguity of natural language reasoning. In this paper, we propose an automated framework for tackling research‑level mathematical problems that integrates natural language reasoning with formal verification, enabling end‑to‑end problem solving with minimal human intervention. Our framework consists of two components: an informal reasoning agent, Rethlas, and a formal verification agent, Archon. Rethlas mimics the workflow of human mathematicians by combining reasoning primitives with our theorem search engine, Matlas, to explore solution strategies and construct candidate proofs. Archon, equipped with our formal theorem search engine LeanSearch, translates informal arguments into formalized Lean 4 projects through structured task decomposition, iterative refinement, and automated proof synthesis, ensuring machine‑checkable correctness. Using this framework, we automatically resolve an open problem in commutative algebra and formally verify the resulting proof in Lean 4 with essentially no human involvement. Our experiments demonstrate that strong theorem retrieval tools enable the discovery and application of cross‑domain mathematical techniques, while the formal agent is capable of autonomously filling nontrivial gaps in informal arguments. More broadly, our work illustrates a promising paradigm for mathematical research in which informal and formal reasoning systems, equipped with theorem retrieval tools, operate in tandem to produce verifiable results, substantially reduce human effort, and offer a concrete instantiation of human‑AI collaborative mathematical research.
Authors:Anja Surina, Arun Suggala, George Tsoukalas, Anton Kovsharov, Sergey Shirobokov, Francisco J. R. Ruiz, Pushmeet Kohli, Swarat Chaudhuri
Abstract:
We analyze the last‑iterate convergence of the Anchored Gradient Descent Ascent algorithm for smooth convex‑concave min‑max problems. While previous work established a last‑iterate rate of \mathcalO(1/t^2‑2p) for the squared gradient norm, where p \in (1/2, 1), it remained an open problem whether the improved exact \mathcalO(1/t) rate is achievable. In this work, we resolve this question in the affirmative. This result was discovered autonomously by an AI system capable of writing formal proofs in Lean. The Lean proof can be accessed at https://github.com/google‑deepmind/formal‑conjectures/pull/3675/commits/a13226b49fd3b897f4c409194f3bcbeb96a08515
Authors:Baicheng Chen, Yu Wang, Ziheng Zhou, Xiangru Liu, Juanru Li, Yilei Chen, Tianxing He
Abstract:
Reverse engineering (RE) is central to software security, particularly for cryptographic programs that handle sensitive data and are highly prone to vulnerabilities. It supports critical tasks such as vulnerability discovery and malware analysis. Despite its importance, RE remains labor‑intensive and requires substantial expertise, making large language models (LLMs) a potential solution for automating the process. However, their capabilities for RE remain systematically underexplored. To address this gap, we study the cryptographic binary RE capabilities of LLMs and introduce CREBench, a benchmark comprising 432 challenges built from 48 standard cryptographic algorithms, 3 insecure crypto key usage scenarios, and 3 difficulty levels. Each challenge follows a Capture‑the‑Flag (CTF) RE challenge, requiring the model to analyze the underlying cryptographic logic and recover the correct input. We design an evaluation framework comprising four sub‑tasks, from algorithm identification to correct flag recovery. We evaluate eight frontier LLMs on CREBench. GPT‑5.4, the best‑performing model, achieves 64.03 out of 100 and recovers the flag in 59% of challenges. We also establish a strong human expert baseline of 92.19 points, showing that humans maintain an advantage in cryptographic RE tasks. Our code and dataset are available at https://github.com/wangyu‑ovo/CREBench.
Authors:Yulong He, Ivan Smirnov, Dmitry Fedrushkov, Sergey Kovalchuk, Ilya Revin
Abstract:
Effective evaluation of large language models (LLMs) remains a critical bottleneck, as conventional direct scoring often yields inconsistent and opaque judgments. In this work, we adapt the Analytic Hierarchy Process (AHP) to LLM‑based evaluation and, more importantly, propose a confidence‑aware Fuzzy AHP (FAHP) extension that models epistemic uncertainty via triangular fuzzy numbers modulated by LLM‑generated confidence scores. Systematically validated on JudgeBench, our structured approach decomposes assessments into explicit criteria and incorporates uncertainty‑aware aggregation, producing more calibrated judgments. Extensive experiments demonstrate that both crisp and fuzzy AHP consistently outperform direct scoring across model scales and dataset splits, with FAHP showing superior stability in uncertain comparison scenarios. Building on these insights, we propose DualJudge, a hybrid framework inspired by Dual‑Process Theory that adaptively fuses holistic direct scores with structured AHP outputs via consistency‑aware weighting. DualJudge achieves state‑of‑the‑art performance, underscoring the complementary strengths of intuitive and deliberative evaluation paradigms. These results establish uncertainty‑aware structured reasoning as a principled pathway toward more reliable LLM assessment. Code is available at https://github.com/hreyulog/AHP_llm_judge.
Authors:Yunyao Yu, Zhengxian Wu, Zhuohong Chen, Hangrui Xu, Zirui Liao, Xiangwen Deng, Zhifang Liu, Senyuan Shi, Haoqian Wang
Abstract:
In the unsupervised self‑evolution of Multimodal Large Language Models, the quality of feedback signals during post‑training is pivotal for stable and effective learning. However, existing self‑evolution methods predominantly rely on majority voting to select the most frequent output as the pseudo‑golden answer, which may stem from the model's intrinsic biases rather than guaranteeing the objective correctness of the reasoning paths. To counteract the degradation, we propose Continuous Softened Retracing reSampling (CSRS) in MLLM self‑evolution. Specifically, we introduce a Retracing Re‑inference Mechanism (RRM) that the model re‑inferences from anchor points to expand the exploration of long‑tail reasoning paths. Simultaneously, we propose Softened Frequency Reward (SFR), which replaces binary rewards with continuous signals, calibrating reward based on the answers' frequency across sampled reasoning sets. Furthermore, incorporated with Visual Semantic Perturbation (VSP), CSRS ensures the model prioritizes mathematical logic over visual superficiality. Experimental results demonstrate that CSRS significantly enhances the reasoning performance of Qwen2.5‑VL‑7B on benchmarks such as MathVision. We achieve state‑of‑the‑art (SOTA) results in unsupervised self‑evolution on geometric tasks. Our code is avaible at https://github.com/yyy195/CSRS.
Authors:Viet Dung Nguyen, Yuhang Song, Anh Nguyen, Jamison Heard, Reynold Bailey, Alexander Ororbia
Abstract:
Robot reinforcement learning from demonstrations (RLfD) assumes that expert data is abundant; this is usually unrealistic in the real world given data scarcity as well as high collection cost. Furthermore, imitation learning algorithms assume that the data is independently and identically distributed, which ultimately results in poorer performance as gradual errors emerge and compound within test‑time trajectories. We address these issues by introducing the "master your own expertise" (MYOE) framework, a self‑imitation framework that enables robotic agents to learn complex behaviors from limited demonstration data samples. Inspired by human perception and action, we propose and design what we call the queryable mixture‑of‑preferences state space model (QMoP‑SSM), which estimates the desired goal at every time step. These desired goals are used in computing the "preference regret", which is used to optimize the robot control policy. Our experiments demonstrate the robustness, adaptability, and out‑of‑sample performance of our agent compared to other state‑of‑the‑art RLfD schemes. The GitHub repository that supports this work can be found at: https://github.com/rxng8/neurorobot‑preference‑regret‑learning.
Authors:Zilin Huang, Zhengyang Wan, Zihao Sheng, Boyue Wang, Junwei You, Yue Leng, Sikai Chen
Abstract:
Deploying reinforcement learning policies trained in simulation to real autonomous vehicles remains a fundamental challenge, particularly for VLM‑guided RL frameworks whose policies are typically learned with simulator‑native observations and simulator‑coupled action semantics that are unavailable on physical platforms. This paper presents Sim2Real‑AD, a modular framework for zero‑shot sim‑to‑real transfer of CARLA‑trained VLM‑guided RL policies to full‑scale vehicles without any real‑world RL training data. The framework decomposes the transfer problem into four components: a Geometric Observation Bridge (GOB) that converts monocular front‑view images into simulator‑compatible bird's‑eye‑view (BEV) observations, a Physics‑Aware Action Mapping (PAM) that translates policy outputs into platform‑agnostic physical commands, a Two‑Phase Progressive Training (TPT) strategy that stabilizes adaptation by separating action‑space and observation‑space transfer, and a Real‑time Deployment Pipeline (RDP) that integrates perception, policy inference, control conversion, and safety monitoring for closed‑loop execution. Simulation experiments show that the framework preserves the relative performance ordering of representative RL algorithms across different reward paradigms and validate the contribution of each module. Zero‑shot deployment on a full‑scale Ford E‑Transit achieves success rates of 90%, 80%, and 75% in car‑following, obstacle avoidance, and stop‑sign interaction scenarios, respectively. To the best of our knowledge, this study is among the first to demonstrate zero‑shot closed‑loop deployment of a CARLA‑trained VLM‑guided RL policy on a full‑scale real vehicle without any real‑world RL training data. The demo video and code are available at: https://zilin‑huang.github.io/Sim2Real‑AD‑website/.
Authors:Xunyi Jiang, Mingyang Yao, Jingyue Huang, Julian McAuley
Abstract:
Symbolic music generation has made significant progress, yet achieving fine‑grained and flexible control over composer style remains challenging. Existing training‑based methods for composer style conditioning depend on large labeled datasets. Besides, these methods typically support only single‑composer generation at a time, limiting their applicability to more creative or blended scenarios. In this work, we propose Composer Vector, an inference‑time steering method that operates directly in the model's latent space to control composer style without retraining. Through experiments on multiple symbolic music generation models, we show that Composer Vector effectively guides generations toward target composer styles, enabling smooth and interpretable control through a continuous steering coefficient. It also enables seamless fusion of multiple styles within a unified latent space framework. Overall, our work demonstrates that simple latent space steering provides a practical and general mechanism for controllable symbolic music generation, enabling more flexible and interactive creative workflows. Code and Demo are available here: https://github.com/JiangXunyi/Composer‑Vector and https://jiangxunyi.github.io/composervector.github.io/
Authors:Bingliang Li, Zhenhong Sun, Jiaming Bian, Yuehao Wu, Yifu Wang, Hongdong Li, Yatao Bian, Huadong Mo, Daoyi Dong
Abstract:
Storyboarding is a core skill in visual storytelling for film, animation, and games. However, automating this process requires a system to achieve two properties that current approaches rarely satisfy simultaneously: inter‑shot consistency and explicit editability. While 2D diffusion‑based generators produce vivid imagery, they often suffer from identity drift along with limited geometric control; conversely, traditional 3D animation workflows are consistent and editable but require expert‑heavy, labor‑intensive authoring. We present StoryBlender, a grounded 3D storyboard generation framework governed by a Story‑centric Reflection Scheme. At its core, we propose the StoryBlender system, which is built on a three‑stage pipeline: (1) Semantic‑Spatial Grounding, to construct a continuity memory graph to decouple global assets from shot‑specific variables for long‑horizon consistency; (2) Canonical Asset Materialization, to instantiate entities in a unified coordinate space to maintain visual identity; and (3) Spatial‑Temporal Dynamics, to achieve layout design and cinematic evolution through visual metrics. By orchestrating multiple agents in a hierarchical manner within a verification loop, StoryBlender iteratively self‑corrects spatial hallucinations via engine‑verified feedback. The resulting native 3D scenes support direct, precise editing of cameras and visual assets while preserving unwavering multi‑shot continuity. Experiments demonstrate that StoryBlender significantly improves consistency and editability over both diffusion‑based and 3D‑grounded baselines. Code, data, and demonstration video will be available on https://engineeringai‑lab.github.io/StoryBlender/
Authors:Nanxi Li, Xiang Wang, Yuanjie Chen, Haode Zhang, Hong Li, Yong-Lu Li
Abstract:
While Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in image and video understanding, their ability to comprehend the physical world has become an increasingly important research focus. Despite their improvements, current MLLMs struggle significantly with high‑level physics reasoning. In this work, we investigate the first step of physical reasoning, i.e., intuitive physics understanding, revealing substantial limitations in understanding the dynamics of continuum objects. To isolate and evaluate this specific capability, we introduce two fundamental benchmark tasks: Next Frame Selection (NFS) and Temporal Coherence Verification (TCV). Our experiments demonstrate that even state‑of‑the‑art MLLMs perform poorly on these foundational tasks. To address this limitation, we propose Scene Dynamic Field (SDF), a concise approach that leverages physics simulators within a multi‑task fine‑tuning framework. SDF substantially improves performance, achieving up to 20.7% gains on fluid tasks while showing strong generalization to unseen physical domains. This work not only highlights a critical gap in current MLLMs but also presents a promising cost‑efficient approach for developing more physically grounded MLLMs. Our code and data are available at https://github.com/andylinx/Scene‑Dynamic‑Field.
Authors:Chushan Zhang, Ruihan Lu, Jinguang Tong, Yikai Wang, Hongdong Li
Abstract:
Leveraging 3D information within Multimodal Large Language Models (MLLMs) has recently shown significant advantages for indoor scene understanding. However, existing methods, including those using explicit ground‑truth 3D positional encoding and those grafting external 3D foundation models for implicit geometry, struggle with the trade‑off in 2D‑3D representation fusion, leading to suboptimal deployment. To this end, we propose 3D‑Implicit Depth Emergence, a method that reframes 3D perception as an emergent property derived from geometric self‑supervision rather than explicit encoding. Our core insight is the Implicit Geometric Emergence Principle: by strategically leveraging privileged geometric supervision through mechanisms like a fine‑grained geometry validator and global representation constraints, we construct an information bottleneck. This bottleneck forces the model to maximize the mutual information between visual features and 3D structures, allowing 3D awareness to emerge naturally within a unified visual representation. Unlike existing approaches, our method enables 3D perception to emerge implicitly, disentangling features in dense regions and, crucially, eliminating depth and pose dependencies during inference with zero latency overhead. This paradigm shift from external grafting to implicit emergence represents a fundamental rethinking of 3D knowledge integration in visual‑language models. Extensive experiments demonstrate that our method surpasses SOTA on multiple 3D scene understanding benchmarks. Our approach achieves a 55% reduction in inference latency while maintaining strong performance across diverse downstream tasks, underscoring the effectiveness of meticulously designed auxiliary objectives for dependency‑free 3D understanding. Source code can be found at github.com/ChushanZhang/3D‑IDE.
Authors:Benedikt Dornauer, Mircea-Cristian Racasan
Abstract:
This paper introduces RAGnaroX, a resource‑efficient ChatOps assistant that operates entirely on commodity hardware. Unlike existing solutions that often rely on external providers such as Azure or OpenAI, RAGnaroX offers a fully auditable, on‑premise stack implemented in Rust. Its architecture integrates modular data ingestion, hybrid retrieval, and function calling, enabling flexible yet secure deployment. Our evaluation focuses on the RAG pipeline, with benchmarks conducted on the SQuAD (single‑hop QA), MultiHopRAG (multi‑hop QA), and MLQA (cross‑lingual QA) datasets. Results show that RAGnaroX achieves competitive accuracy while maintaining strong resource efficiency, for example, reaching 0.90 context precision on single‑hop questions with an average response time of 2.5 seconds per request. A replication package containing the tool, the demonstration video (https://www.youtube.com/watch? v=cDxfuEbcoM4), and all supporting materials are available at https://github.com/genius‑itea/RAGnaroX.git.
Authors:Jocelyn Beauchesne, Christine Maroti, Jeshua Bratman, Jerome Pesenti, Laurence Holt, Alex Tambellini, Allison McGrath, Matthew Guo, Sarah Peterson
Abstract:
Recent research demonstrated that students exhibit consistent learning rates across diverse educational contexts. We test these findings using a dataset of 1.8 million (366k post‑filtering) student interactions from the digital platform Campus AI providing further evidence to the observation of regularity in learning rate among students. Unlike prior work requiring manual cognitive modeling, Campus AI automatically generates Knowledge Components (KCs) and corresponding exercises, both of which are validated by human experts. This one‑to‑many mapping facilitates the application of Additive Factors Models to measure learning parameters without complex cognitive modeling. Using mixed‑effects logistic regression, we confirmed the core finding of prior work: students displayed substantial variation in initial knowledge (\textIQR = [2.78, 12.18] practice opportunities to reach 80% mastery) but remarkably consistent learning rates (\textIQR = [7.01, 8.25] opportunities). Furthermore, students using this fully automated system achieved 80% mastery in a median of 7.22 practice opportunities, comparable to the 6.54 reported for expert‑designed curricula. These results suggest that automated, science‑grounded content generation can support effective personalized learning at scale. Data and code are publicly available. https://github.com/Campus‑edu‑AI/learning‑rate
Authors:Nikita Vassilyev, William Berrios, Ruowang Zhang, Bo Han, Douwe Kiela, Shikib Mehri
Abstract:
Generally capable agents must learn from experience in ways that generalize across tasks and environments. The fundamental problems of learning, including credit assignment, overfitting, forgetting, local optima, and high‑variance learning signals, persist whether the learned object lies in parameter space or context space. While these challenges are well understood in classical machine learning optimization, they remain underexplored in context space, leading current methods to be fragmented and ad hoc. We present Reflective Context Learning (RCL), a unified framework for agents that learn through repeated interaction, reflection on behavior and failure modes, and iterative updates to context. In RCL, reflection converts trajectories and current context into a directional update signal analogous to gradients, while mutation applies that signal to improve future behavior in context space. We recast recent context‑optimization approaches as instances of this shared learning problem and systematically extend them with classical optimization primitives, including batching, improved credit‑assignment signal, auxiliary losses, failure replay, and grouped rollouts for variance reduction. On AppWorld, BrowseComp+, and RewardBench2, these primitives improve over strong baselines, with their relative importance shifting across task regimes. We further analyze robustness to initialization, the effects of batch size, sampling and curriculum strategy, optimizer‑state variants, and the impact of allocating stronger or weaker models to different optimization components. Our results suggest that learning through context updates should be treated not as a set of isolated algorithms, but as an optimization problem whose mechanisms can be studied systematically and improved through transferable principles.
Authors:Hongbin Chen, Jie Li, Wei Wang, Siyang Song, Xiao Gu, Jianqing Li, Wentao Xiang
Abstract:
While affective computing has advanced considerably, multimodal emotion prediction in aging populations remains underexplored, largely due to the scarcity of dedicated datasets. Existing multimodal benchmarks predominantly target young, cognitively healthy subjects, neglecting the influence of cognitive decline on emotional expression and physiological responses. To bridge this gap, we present MECO, a Multimodal dataset for Emotion and Cognitive understanding in Older adults. MECO includes 42 participants and provides approximately 38 hours of multimodal signals, yielding 30,592 synchronized samples. To maximize ecological validity, data collection followed standardized protocols within community‑based settings. The modalities cover video, audio, electroencephalography (EEG), and electrocardiography (ECG). In addition, the dataset offers comprehensive annotations of emotional and cognitive states, including self‑assessed valence, arousal, six basic emotions, and Mini‑Mental State Examination cognitive scores. We further establish baseline benchmarks for both emotion and cognitive prediction. MECO serves as a foundational resource for multimodal modeling of affect and cognition in aging populations, facilitating downstream applications such as personalized emotion recognition and early detection of mild cognitive impairment (MCI) in real‑world settings. The complete dataset and supplementary materials are available at https://maitrechen.github.io/meco‑page/.
Authors:Jing Du, Zesheng Ye, Congbo Ma, Feng Liu, Flora. D. Salim
Abstract:
Multi‑modal recommendation (MMR) enriches item representations by introducing item content, e.g., visual and textual descriptions, to improve upon interaction‑only recommenders. The success of MMR hinges on aligning these content modalities with user preferences derived from interaction data, yet dominant practices based on disentangling modality‑invariant preference‑driving signals from modality‑specific preference‑irrelevant noises are flawed. First, they assume a one‑size‑fits‑all relevance of item content to user preferences for all users, which contradicts the user‑conditional fact of preferences. Second, they optimize pairwise contrastive losses separately toward cross‑modal alignment, systematically ignoring higher‑order dependencies inherent when multiple content modalities jointly influence user choices. In this paper, we introduce GTC, a conditional Generative Total Correlation learning framework. We employ an interaction‑guided diffusion model to perform user‑aware content feature filtering, preserving only personalized features relevant to each individual user. Furthermore, to capture complete cross‑modal dependencies, we optimize a tractable lower bound of the total correlation of item representations across all modalities. Experiments on standard MMR benchmarks show GTC consistently outperforms state‑of‑the‑art, with gains of up to 28.30% in NDCG@5. Ablation studies validate both conditional preference‑driven feature filtering and total correlation optimization, confirming the ability of GTC to model user‑conditional relationships in MMR tasks. The code is available at: https://github.com/jingdu‑cs/GTC.
Authors:Ka Yiu Lee, Yuxuan Huang, Zhiyuan He, Huichi Zhou, Weilin Luo, Kun Shao, Meng Fang, Jun Wang
Abstract:
Recent agentic search systems have made substantial progress by emphasising deep, multi‑step reasoning. However, this focus often overlooks the challenges of wide‑scale information synthesis, where agents must aggregate large volumes of heterogeneous evidence across many sources. As a result, most existing large language model agent systems face severe limitations in data‑intensive settings, including context saturation, cascading error propagation, and high end‑to‑end latency. To address these challenges, we present \framework, a hierarchical framework based on principle of near‑decomposability, containing a strategic Host, multiple Managers and parallel Workers. By leveraging aggregation and reflection mechanisms at the Manager layer, our framework enforces strict context isolation to prevent saturation and error propagation. Simultaneously, the parallelism in worker layer accelerates the speed of overall task execution, mitigating the significant latency. Our evaluation on two complementary benchmarks demonstrates both efficiency ( 3‑5 × speed‑up) and effectiveness, achieving a 8.4% success rate on WideSearch‑en and 52.9% accuracy on BrowseComp‑zh. The code is released at https://github.com/agent‑on‑the‑fly/InfoSeeker
Authors:Yilin Xiao, Jin Chen, Qinggang Zhang, Yujing Zhang, Chuang Zhou, Longhao Yang, Lingfei Ren, Xin Yang, Xiao Huang
Abstract:
Graph‑based Retrieval‑Augmented Generation (GraphRAG) enhances the reasoning capabilities of Large Language Models (LLMs) by grounding their responses in structured knowledge graphs. Leveraging community detection and relation filtering techniques, GraphRAG systems demonstrate inherent resistance to traditional RAG attacks, such as text poisoning and prompt injection. However, in this paper, we find that the security of GraphRAG systems fundamentally relies on the topological integrity of the underlying graph, which can be undermined by implicitly corrupting the logical connections, without altering surface‑level text semantics. To exploit this vulnerability, we propose \textscLogicPoison, a novel attack framework that targets logical reasoning rather than injecting false contents. Specifically, \textscLogicPoison employs a type‑preserving entity swapping mechanism to perturb both global logic hubs for disrupting overall graph connectivity and query‑specific reasoning bridges for severing essential multi‑hop inference paths. This approach effectively reroutes valid reasoning into dead ends while maintaining surface‑level textual plausibility. Comprehensive experiments across multiple benchmarks demonstrate that \textscLogicPoison successfully bypasses GraphRAG's defenses, significantly degrading performance and outperforming state‑of‑the‑art baselines in both effectiveness and stealth. Our code is available at \textcolorbluehttps://github.com/Jord8061/logicPoison.
Authors:Giyeong Oh, Junghyun Lee, Jaehyun Park, Youngjae Yu, Wonho Bae, Junhyug Noh
Abstract:
Modern LLMs inherit strong priors from web‑scale pretraining, which can limit the headroom of post‑training data‑selection strategies. While Active Preference Learning (APL) seeks to optimize query efficiency in online Direct Preference Optimization (DPO), the inherent richness of on‑policy candidate pools often renders simple Random sampling a surprisingly formidable baseline. We evaluate uncertainty‑based APL against Random across harmlessness, helpfulness, and instruction‑following settings, utilizing both reward models and LLM‑as‑a‑judge proxies. We find that APL yields negligible improvements in proxy win‑rates compared to Random. Crucially, we observe a dissociation where win‑rate improves even as general capability ‑‑ measured by standard benchmarks ‑‑ degrades. APL fails to mitigate this capability collapse or reduce variance significantly better than random sampling. Our findings suggest that in the regime of strong pre‑trained priors, the computational overhead of active selection is difficult to justify against the ``cheap diversity'' provided by simple random samples. Our code is available at https://github.com/BootsofLagrangian/random‑vs‑apl.
Authors:Mirali Purohit, Bimal Gajera, Irish Mehta, Bhanu Tokas, Jacob Adler, Steven Lu, Scott Dickenshied, Serina Diniega, Brian Bue, Umaa Rebbapragada, Hannah Kerner
Abstract:
We introduce MOMO, the first multi‑sensor foundation model for Mars remote sensing. MOMO uses model merge to integrate representations learned independently from three key Martian sensors (HiRISE, CTX, and THEMIS), spanning resolutions from 0.25 m/pixel to 100 m/pixel. Central to our method is our novel Equal Validation Loss (EVL) strategy, which aligns checkpoints across sensors based on validation loss similarity before fusion via task arithmetic. This ensures models are merged at compatible convergence stages, leading to improved stability and generalization. We train MOMO on a large‑scale, high‑quality corpus of ~ 12 million samples curated from Mars orbital data and evaluate it on 9 downstream tasks from Mars‑Bench. MOMO achieves better overall performance compared to ImageNet pre‑trained, earth observation foundation model, sensor‑specific pre‑training, and fully‑supervised baselines. Particularly on segmentation tasks, MOMO shows consistent and significant performance improvement. Our results demonstrate that model merging through an optimal checkpoint selection strategy provides an effective approach for building foundation models for multi‑resolution data. The model weights, pretraining code, pretraining data, and evaluation code are available at: https://github.com/kerner‑lab/MOMO.
Authors:Junwei You, Pei Li, Zhuoyu Jiang, Weizhe Tang, Zilin Huang, Rui Gan, Jiaxi Liu, Yan Zhao, Sikai Chen, Bin Ran
Abstract:
Multimodal large language models (MLLMs) have shown strong potential for autonomous driving, yet existing benchmarks remain largely ego‑centric and therefore cannot systematically assess model performance in infrastructure‑centric and cooperative driving conditions. In this work, we introduce V2X‑QA, a real‑world dataset and benchmark for evaluating MLLMs across vehicle‑side, infrastructure‑side, and cooperative viewpoints. V2X‑QA is built around a view‑decoupled evaluation protocol that enables controlled comparison under vehicle‑only, infrastructure‑only, and cooperative driving conditions within a unified multiple‑choice question answering (MCQA) framework. The benchmark is organized into a twelve‑task taxonomy spanning perception, prediction, and reasoning and planning, and is constructed through expert‑verified MCQA annotation to enable fine‑grained diagnosis of viewpoint‑dependent capabilities. Benchmark results across ten representative state‑of‑the‑art proprietary and open‑source models show that viewpoint accessibility substantially affects performance, and infrastructure‑side reasoning supports meaningful macroscopic traffic understanding. Results also indicate that cooperative reasoning remains challenging since it requires cross‑view alignment and evidence integration rather than simply additional visual input. To address these challenges, we introduce V2X‑MoE, a benchmark‑aligned baseline with explicit view routing and viewpoint‑specific LoRA experts. The strong performance of V2X‑MoE further suggests that explicit viewpoint specialization is a promising direction for multi‑view reasoning in autonomous driving. Overall, V2X‑QA provides a foundation for studying multi‑perspective reasoning, reliability, and cooperative physical intelligence in connected autonomous driving. The dataset and V2X‑MoE resources are publicly available at: https://github.com/junwei0001/V2X‑QA.
Authors:Yuhui Lin, Siyue Yu, Yuxing Yang, Guangliang Cheng, Jimin Xiao
Abstract:
Recent advances in Multimodal Large Language Models (MLLMs) have expanded reasoning capabilities into 3D domains, enabling fine‑grained spatial understanding. However, the substantial size of 3D MLLMs and the high dimensionality of input features introduce considerable inference overhead, which limits practical deployment on resource constrained platforms. To overcome this limitation, this paper presents Efficient3D, a unified framework for visual token pruning that accelerates 3D MLLMs while maintaining competitive accuracy. The proposed framework introduces a Debiased Visual Token Importance Estimator (DVTIE) module, which considers the influence of shallow initial layers during attention aggregation, thereby producing more reliable importance predictions for visual tokens. In addition, an Adaptive Token Rebalancing (ATR) strategy is developed to dynamically adjust pruning strength based on scene complexity, preserving semantic completeness and maintaining balanced attention across layers. Together, they enable context‑aware token reduction that maintains essential semantics with lower computation. Comprehensive experiments conducted on five representative 3D vision and language benchmarks, including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D, demonstrate that Efficient3D achieves superior performance compared with unpruned baselines, with a +2.57% CIDEr improvement on the Scan2Cap dataset. Therefore, Efficient3D provides a scalable and effective solution for efficient inference in 3D MLLMs. The code is released at: https://github.com/sol924/Efficient3D
Authors:Hao Li, Liwei Zou, Wenping Yin, Gulsen Taskin, Naoto Yokoya, Danfeng Hong, Wufan Zhao
Abstract:
Living in a changing climate, human society now faces more frequent and severe natural disasters than ever before. As a consequence, rapid disaster response during the "Golden 72 Hours" of search and rescue becomes a vital humanitarian necessity and community concern. However, traditional disaster damage surveys routinely fail to generalize across distinct urban morphologies and new disaster events. Effective damage mapping typically requires exhaustive and time‑consuming manual data annotation. To address this issue, we introduce Smart Transfer, a novel Geospatial Artificial Intelligence (GeoAI) framework, leveraging state‑of‑the‑art vision Foundation Models (FMs) for rapid building damage mapping with post‑earthquake Very High Resolution (VHR) imagery. Specifically, we design two novel model transfer strategies: first, Pixel‑wise Clustering (PC), ensuring robust prototype‑level global feature alignment; second, a Distance‑Penalized Triplet (DPT), integrating patch‑level spatial autocorrelation patterns by assigning stronger penalties to semantically inconsistent yet spatially adjacent patches. Extensive experiments and ablations from the recent 2023 Turkiye‑Syria earthquake show promising performance in multiple cross‑region transfer settings, namely Leave One Domain Out (LODO) and Specific Source Domain Combination (SSDC). Moreover, Smart Transfer provides a scalable, automated GeoAI solution to accelerate building damage mapping and support rapid disaster response, offering new opportunities to enhance disaster resilience in climate‑vulnerable regions and communities. The data and code are publicly available at https://github.com/ai4city‑hkust/SmartTransfer.
Authors:Anugyan Das, Omkar Ghugarkar, Vishvesh Bhat, Asad Aali
Abstract:
We study structured abstraction‑based reasoning for the Abstraction and Reasoning Corpus (ARC) and compare its generalization to test‑time approaches. Purely neural architectures lack reliable combinatorial generalization, while strictly symbolic systems struggle with perceptual grounding. We therefore propose a neuro‑symbolic architecture that extracts object‑level structure from grids, uses neural priors to propose candidate transformations from a fixed domain‑specific language (DSL) of atomic patterns, and filters hypotheses using cross‑example consistency. Instantiated as a compositional reasoning framework based on unit patterns inspired by human visual abstraction, the system augments large language models (LLMs) with object representations and transformation proposals. On ARC‑AGI‑2, it improves base LLM performance from 16% to 24.4% on the public evaluation set, and to 30.8% when combined with ARC Lang Solver via a meta‑classifier. These results demonstrate that separating perception, neural‑guided transformation proposal, and symbolic consistency filtering improves generalization without task‑specific finetuning or reinforcement learning, while reducing reliance on brute‑force search and sampling‑based test‑time scaling. We open‑source the ARC‑AGI‑2 Reasoner code (https://github.com/CoreThink‑AI/arc‑agi‑2‑reasoner).
Authors:Yonas Kassa, James Bonacci, Ping Wang
Abstract:
The transformative potential of large language models (LLMs) in education, such as improving accessibility and personalized learning, is being eclipsed by significant challenges. These challenges stem from concerns that LLMs undermine academic assessment by enabling bypassing of critical thinking, leading to increased cognitive offloading. This emerging trend stresses the dual imperative of harnessing AI's educational benefits while safeguarding critical thinking and academic rigor in the evolving AI ecosystem. To this end, we introduce AI‑Sinkhole, an AI‑agent augmented DNS‑based framework that dynamically discovers, semantically classifies, and temporarily network‑wide blocks emerging LLM chatbot services during proctored exams. AI‑Sinkhole offers explainable classification via quantized LLMs (LLama 3, DeepSeek‑R1, Qwen‑3) and dynamic DNS blocking with Pi‑Hole. We also share our observations in using LLMs as explainable classifiers which achieved robust cross‑lingual performance (F1‑score > 0.83). To support future research and development in this domain initial codes with a readily deployable 'AI‑Sinkhole' blockist is available on https://github.com/AIMLEdu/ai‑sinkhole.
Authors:Tin Hadži Veljković, Joshua Rosenthal, Ivor Lončarić, Jan-Willem van de Meent
Abstract:
Generative models for crystalline materials often rely on equivariant graph neural networks, which capture geometric structure well but are costly to train and slow to sample. We present Crystalite, a lightweight diffusion Transformer for crystal modeling built around two simple inductive biases. The first is Subatomic Tokenization, a compact chemically structured atom representation that replaces high‑dimensional one‑hot encodings and is better suited to continuous diffusion. The second is the Geometry Enhancement Module (GEM), which injects periodic minimum‑image pair geometry directly into attention through additive geometric biases. Together, these components preserve the simplicity and efficiency of a standard Transformer while making it better matched to the structure of crystalline materials. Crystalite achieves state‑of‑the‑art results on crystal structure prediction benchmarks, and de novo generation performance, attaining the best S.U.N. discovery score among the evaluated baselines while sampling substantially faster than geometry‑heavy alternatives.
Authors:Syed Ahmed, Bharathi Vokkaliga Ganesh, Jagadish Babu P, Karthick Selvaraj, Praneeth Talluri, Sanket Hingne, Anubhav Kumar, Anushka Yadav, Pratham Kumar Verma, Kiranmayee Janardhan, Mandanna A N
Abstract:
Understanding how Large Language Models (LLMs) process information from prompts remains a significant challenge. To shed light on this "black box," attention visualization techniques have been developed to capture neuron‑level perceptions and interpret how models focus on different parts of input data. However, many existing techniques are tailored to specific model architectures, particularly within the Transformer family, and often require backpropagation, resulting in nearly double the GPU memory usage and increased computational cost. A lightweight, model‑agnostic approach for attention visualization remains lacking. In this paper, we introduce a model‑agnostic token importance visualization technique to better understand how generative AI systems perceive and prioritize information from input text, without incurring additional computational cost. Our method leverages perturbation‑based strategies combined with a three‑matrix analytical framework to generate relevance maps that illustrate token‑level contributions to model predictions. The framework comprises: (1) the Angular Deviation Matrix, which captures shifts in semantic direction; (2) the Magnitude Deviation Matrix, which measures changes in semantic intensity; and (3) the Dimensional Importance Matrix, which evaluates contributions across individual vector dimensions. By systematically removing each token and measuring the resulting impact across these three complementary dimensions, we derive a composite importance score that provides a nuanced and mathematically grounded measure of token significance. To support reproducibility and foster wider adoption, we provide open‑source implementations of all proposed and utilized explainability techniques, with code and resources publicly available at https://github.com/Infosys/Infosys‑Responsible‑AI‑Toolkit
Authors:Xuanfeng Zhou
Abstract:
Conventional hypernetworks are typically engineered around a specific base‑model parameterization, so changing the target architecture often entails redesigning the hypernetwork and retraining it from scratch. We introduce the \emphUniversal Hypernetwork (UHN), a fixed‑architecture generator that predicts weights from deterministic parameter, architecture, and task descriptors. This descriptor‑based formulation decouples the generator architecture from target‑network parameterization, so one generator can instantiate heterogeneous models across the tested architecture and task families. Our empirical claims are threefold: (1) one fixed UHN remains competitive with direct training across vision, graph, text, and formula‑regression benchmarks; (2) the same UHN supports both multi‑model generalization within a family and multi‑task learning across heterogeneous models; and (3) UHN enables stable recursive generation with up to three intermediate generated UHNs before the final base model. Our code is available at https://github.com/Xuanfeng‑Zhou/UHN.
Authors:Jeremy Herbst, Jae Hee Lee, Stefan Wermter
Abstract:
Mixture‑of‑Experts (MoE) architectures have become the dominant choice for scaling Large Language Models (LLMs), activating only a subset of parameters per token. While MoE architectures are primarily adopted for computational efficiency, it remains an open question whether their sparsity makes them inherently easier to interpret than dense feed‑forward networks (FFNs). We compare MoE experts and dense FFNs using k‑sparse probing and find that expert neurons are consistently less polysemantic, with the gap widening as routing becomes sparser. This suggests that sparsity pressures both individual neurons and entire experts toward monosemanticity. Leveraging this finding, we zoom out from the neuron to the expert level as a more effective unit of analysis. We validate this approach by automatically interpreting hundreds of experts. This analysis allows us to resolve the debate on specialization: experts are neither broad domain specialists (e.g., biology) nor simple token‑level processors. Instead, they function as fine‑grained task experts, specializing in linguistic operations or semantic tasks (e.g., closing brackets in LaTeX). Our findings suggest that MoEs are inherently interpretable at the expert level, providing a clearer path toward large‑scale model interpretability. Code is available at: https://github.com/jerryy33/MoE_analysis
Authors:Soo Won Seo, KyungChae Lee, Hyungchan Cho, Taein Son, Nam Ik Cho, Jun Won Choi
Abstract:
Human‑Object Interaction (HOI) detection aims to localize human‑object pairs and classify their interactions from a single image, a task that demands strong visual understanding and nuanced contextual reasoning. Recent approaches have leveraged Vision‑Language Models (VLMs) to introduce semantic priors, significantly improving HOI detection performance. However, existing methods often fail to fully capitalize on the diverse contextual cues distributed across the entire scene. To overcome these limitations, we propose the Instance‑centric Context Mining Network (InCoM‑Net)‑a novel framework that effectively integrates rich semantic knowledge extracted from VLMs with instance‑specific features produced by an object detector. This design enables deeper interaction reasoning by modeling relationships not only within each detected instance but also across instances and their surrounding scene context. InCoM‑Net comprises two core components: Instancecentric Context Refinement (ICR), which separately extracts intra‑instance, inter‑instance, and global contextual cues from VLM‑derived features, and Progressive Context Aggregation (ProCA), which iteratively fuses these multicontext features with instance‑level detector features to support high‑level HOI reasoning. Extensive experiments on the HICO‑DET and V‑COCO benchmarks show that InCoM‑Net achieves state‑of‑the‑art performance, surpassing previous HOI detection methods. Code is available at https://github.com/nowuss/InCoM‑Net.
Authors:Junbin Xiao, Shenglang Zhang, Pengxiang Zhu, Angela Yao
Abstract:
We present the first systematic analysis of multimodal large language models (MLLMs) in personalized question‑answering requiring ego‑grounding ‑ the ability to understand the camera‑wearer in egocentric videos. To this end, we introduce MyEgo, the first egocentric VideoQA dataset designed to evaluate MLLMs' ability to understand, remember, and reason about the camera wearer. MyEgo comprises 541 long videos and 5K personalized questions asking about "my things", "my activities", and "my past". Benchmarking reveals that competitive MLLMs across variants, including open‑source vs. proprietary, thinking vs. non‑thinking, small vs. large scales all struggle on MyEgo. Top closed‑ and open‑source models (e.g., GPT‑5 and Qwen3‑VL) achieve only~46% and 36% accuracy, trailing human performance by near 40% and 50% respectively. Surprisingly, neither explicit reasoning nor model scaling yield consistent improvements. Models improve when relevant evidence is explicitly provided, but gains drop over time, indicating limitations in tracking and remembering "me" and "my past". These findings collectively highlight the crucial role of ego‑grounding and long‑range memory in enabling personalized QA in egocentric videos. We hope MyEgo and our analyses catalyze further progress in these areas for egocentric personalized assistance. Data and code are available at https://github.com/Ryougetsu3606/MyEgo
Authors:Gaëtan Hadjeres, Marc Ferras, Khaled Koutini, Benno Weck, Alexandre Bittar, Thomas Hummel, Zineb Lahrici, Hakim Missoum, Joan Serrà, Yuki Mitsufuji
Abstract:
The audio research community depends on open generative models as foundational tools for building novel approaches and establishing baselines. In this report, we present Woosh, Sony AI's publicly released sound effect foundation model, detailing its architecture, training process, and an evaluation against other popular open models. Being optimized for sound effects, we provide (1) a high‑quality audio encoder/decoder model and (2) a text‑audio alignment model for conditioning, together with (3) text‑to‑audio and (4) video‑to‑audio generative models. Distilled text‑to‑audio and video‑to‑audio models are also included in the release, allowing for low‑resource operation and fast inference. Our evaluation on both public and private data shows competitive or better performance for each module when compared to existing open alternatives like StableAudio‑Open and TangoFlux. Inference code and model weights are available at https://github.com/SonyResearch/Woosh. Demo samples can be found at https://sonyresearch.github.io/Woosh/.
Authors:Zekai Ye, Qiming Li, Xiaocheng Feng, Ruihan Chen, Ziming Li, Haoyu Ren, Kun Chen, Dandan Tu, Bing Qin
Abstract:
While Reinforcement Learning from Verifiable Rewards (RLVR) has advanced reasoning in Large Vision‑Language Models (LVLMs), prevailing frameworks suffer from a foundational methodological flaw: by distributing identical advantages across all generated tokens, these methods inherently dilute the learning signals essential for optimizing the critical, visually‑grounded steps of multimodal reasoning. To bridge this gap, we formulate Token Visual Dependency, quantifying the causal information gain of visual inputs via the Kullback‑Leibler (KL) divergence between visual‑conditioned and text‑only predictive distributions. Revealing that this dependency is highly sparse and semantically pivotal, we introduce Perception‑Grounded Policy Optimization (PGPO), which is a novel fine‑grained credit assignment framework that dynamically reshapes advantages at the token level. Through a threshold‑gated, mass‑conserving mechanism, PGPO actively amplifies learning signals for visually‑dependent tokens while suppressing gradient noise from linguistic priors. Extensive experiments based on the Qwen2.5‑VL series across seven challenging multimodal reasoning benchmarks demonstrate that PGPO boosts models by 18.7% on average. Both theoretical and empirical analyses confirm that PGPO effectively reduces gradient variance, prevents training collapse, and acts as a potent regularizer for robust, perception‑grounded multimodal reasoning. Code will be published on https://github.com/Yzk1114/PGPO.
Authors:Jiayi Chen, Shuai Wang, Guangxu Zhu, Chengzhong Xu
Abstract:
Large foundation models enable powerful reasoning for autonomous systems, but mapping semantic intent to reliable real‑time control remains challenging. Existing approaches either (i) let Large Language Models (LLMs) generate trajectories directly ‑ brittle, hard to verify, and latency‑prone ‑ or (ii) adjust Model Predictive Control (MPC) objectives online ‑ mixing slow deliberation with fast control and blurring interfaces. We propose Agentic Fast‑Slow Planning, a hierarchical framework that decouples perception, reasoning, planning, and control across natural timescales. The framework contains two bridges. Perception2Decision compresses scenes into ego‑centric topologies using an on‑vehicle Vision‑Language Model (VLM) detector, then maps them to symbolic driving directives in the cloud with an LLM decision maker ‑ reducing bandwidth and delay while preserving interpretability. Decision2Trajectory converts directives into executable paths: Semantic‑Guided A embeds language‑derived soft costs into classical search to bias solutions toward feasible trajectories, while an Agentic Refinement Module adapts planner hyperparameters using feedback and memory. Finally, MPC tracks the trajectories in real time, with optional cloud‑guided references for difficult cases. Experiments in CARLA show that Agentic Fast‑Slow Planning improves robustness under perturbations, reducing lateral deviation by up to 45% and completion time by over 12% compared to pure MPC and an A‑guided MPC baseline. Code is available at https://github.com/cjychenjiayi/icra2026_AFSP.
Authors:Ao Qu, Han Zheng, Zijian Zhou, Yihao Yan, Yihong Tang, Shao Yong Ong, Fenglu Hong, Kaichen Zhou, Chonghe Jiang, Minwei Kong, Jiacheng Zhu, Xuan Jiang, Sirui Li, Cathy Wu, Bryan Kian Hsiang Low, Jinhua Zhao, Paul Pu Liang
Abstract:
Large language model (LLM)‑based evolution is a promising approach for open‑ended discovery, where progress requires sustained search and knowledge accumulation. Existing methods still rely heavily on fixed heuristics and hard‑coded exploration rules, which limit the autonomy of LLM agents. We present CORAL, the first framework for autonomous multi‑agent evolution on open‑ended problems. CORAL replaces rigid control with long‑running agents that explore, reflect, and collaborate through shared persistent memory, asynchronous multi‑agent execution, and heartbeat‑based interventions. It also provides practical safeguards, including isolated workspaces, evaluator separation, resource management, and agent session and health management. Evaluated on diverse mathematical, algorithmic, and systems optimization tasks, CORAL sets new state‑of‑the‑art results on 10 tasks, achieving 3‑10 times higher improvement rates with far fewer evaluations than fixed evolutionary search baselines across tasks. On Anthropic's kernel engineering task, four co‑evolving agents improve the best known score from 1363 to 1103 cycles. Mechanistic analyses further show how these gains arise from knowledge reuse and multi‑agent exploration and communication. Together, these results suggest that greater agent autonomy and multi‑agent evolution can substantially improve open‑ended discovery. Code is available at https://github.com/Human‑Agent‑Society/CORAL.
Authors:Longfei Huang, Yang Yang
Abstract:
Multimodal tabular‑image fusion is an emerging task that has received increasing attention in various domains. However, existing methods may be hindered by gradient conflicts between modalities, misleading the optimization of the unimodal learner. In this paper, we propose a novel Gradient‑Aligned Alternating Learning (GAAL) paradigm to address this issue by aligning modality gradients. Specifically, GAAL adopts an alternating unimodal learning and shared classifier to decouple the multimodal gradient and facilitate interaction. Furthermore, we design uncertainty‑based cross‑modal gradient surgery to selectively align cross‑modal gradients, thereby steering the shared parameters to benefit all modalities. As a result, GAAL can provide effective unimodal assistance and help boost the overall fusion performance. Empirical experiments on widely used datasets reveal the superiority of our method through comparison with various state‑of‑the‑art (SoTA) tabular‑image fusion baselines and test‑time tabular missing baselines. The source code is available at https://github.com/njustkmg/ICME26‑GAAL.
Authors:Yanzhe Liang, Ruijie Zhu, Hanzhi Chang, Zhuoyuan Li, Jiahao Lu, Tianzhu Zhang
Abstract:
We present ReFlow, a unified framework for monocular dynamic scene reconstruction that learns 3D motion in a novel self‑correction manner from raw video. Existing methods often suffer from incomplete scene initialization for dynamic regions, leading to unstable reconstruction and motion estimation, which often resorts to external dense motion guidance such as pre‑computed optical flow to further stabilize and constrain the reconstruction of dynamic components. However, this introduces additional complexity and potential error propagation. To address these issues, ReFlow integrates a Complete Canonical Space Construction module for enhanced initialization of both static and dynamic regions, and a Separation‑Based Dynamic Scene Modeling module that decouples static and dynamic components for targeted motion supervision. The core of ReFlow is a novel self‑correction flow matching mechanism, consisting of Full Flow Matching to align 3D scene flow with time‑varying 2D observations, and Camera Flow Matching to enforce multi‑view consistency for static objects. Together, these modules enable robust and accurate dynamic scene reconstruction. Extensive experiments across diverse scenarios demonstrate that ReFlow achieves superior reconstruction quality and robustness, establishing a novel self‑correction paradigm for monocular 4D reconstruction.
Authors:Devakh Rashie, Veda Rashi
Abstract:
The rapid evolution of autonomous, agentic artificial intelligence within financial services has introduced an existential architectural crisis: large language models (LLMs) are probabilistic, non‑deterministic systems operating in domains that demand absolute, mathematically verifiable compliance guarantees. Existing guardrail solutions ‑‑ including NVIDIA NeMo Guardrails and Guardrails AI ‑‑ rely on probabilistic classifiers and syntactic validators that are fundamentally inadequate for enforcing complex multi‑variable regulatory constraints mandated by the SEC, FINRA, and OCC. This paper presents the Lean‑Agent Protocol, a formal‑verification‑based AI guardrail platform that leverages the Aristotle neural‑symbolic model developed by Harmonic AI to auto‑formalize institutional policies into Lean 4 code. Every proposed agentic action is treated as a mathematical conjecture: execution is permitted if and only if the Lean 4 kernel proves that the action satisfies pre‑compiled regulatory axioms. This architecture provides cryptographic‑level compliance certainty at microsecond latency, directly satisfying SEC Rule 15c3‑5, OCC Bulletin 2011‑12, FINRA Rule 3110, and CFPB explainability mandates. A three‑phase implementation roadmap from shadow verification through enterprise‑scale deployment is provided.
Authors:Nidhish Shah, Shaurjya Mandal, Asfandyar Azhar
Abstract:
When does consulting one information source raise the value of another, and when does it diminish it? We study this question for Bayesian decision‑makers facing finite actions. The interaction decomposes into two opposing forces: a complement force, measuring how one source moves beliefs to where the other becomes more useful, and a substitute force, measuring how much the current decision is resolved. Their balance obeys a localization principle: substitution requires an observation to cross a decision boundary, though crossing alone does not guarantee it. Whenever posteriors remain inside the current decision region, the substitute force vanishes, and sources are guaranteed to complement each other, even when one source cannot, on its own, change the decision. The results hold for arbitrarily correlated sources and are formalized in Lean 4. Substitution is confined to the thin boundaries where decisions change. Everywhere else, information cooperates. Code and proofs: https://github.com/nidhishs/all‑substitution‑is‑local.
Authors:Bowen Wei, Yunbei Zhang, Jinhao Pan, Kai Mei, Xiao Wang, Jihun Hamm, Ziwei Zhu, Yingqiang Ge
Abstract:
Personal AI agents like OpenClaw run with elevated privileges on users' local machines, where a single successful prompt injection can leak credentials, redirect financial transactions, or destroy files. This threat goes well beyond conventional text‑level jailbreaks, yet existing safety evaluations fall short: most test models in isolated chat settings, rely on synthetic environments, and do not account for how the agent framework itself shapes safety outcomes. We introduce CLAWSAFETY, a benchmark of 120 adversarial test scenarios organized along three dimensions (harm domain, attack vector, and harmful action type) and grounded in realistic, high‑privilege professional workspaces spanning software engineering, finance, healthcare, law, and DevOps. Each test case embeds adversarial content in one of three channels the agent encounters during normal work: workspace skill files, emails from trusted senders, and web pages. We evaluate five frontier LLMs as agent backbones, running 2,520 sandboxed trials across all configurations. Attack success rates (ASR) range from 40% to 75% across models and vary sharply by injection vector, with skill instructions (highest trust) consistently more dangerous than email or web content. Action‑trace analysis reveals that the strongest model maintains hard boundaries against credential forwarding and destructive actions, while weaker models permit both. Cross‑scaffold experiments on three agent frameworks further demonstrate that safety is not determined by the backbone model alone but depends on the full deployment stack, calling for safety evaluation that treats model and framework as joint variables. Code and data will be available at: https://weibowen555.github.io/ClawSafety/.
Authors:Syed Ahsan Masud Zaidi, Lior Shamir, William Hsu, Scott Dietrich, Talha Zaidi
Abstract:
American football practice generates video at scale, yet the interaction of interest occupies only a brief window of each long, untrimmed clip. Reliable biomechanical analysis, therefore, depends on spatiotemporal localization that identifies both the interacting entities and the onset of contact. We study First Point of Contact (FPOC), defined as the first frame in which a player physically touches a tackle dummy, in unconstrained practice footage with camera motion, clutter, multiple similarly equipped athletes, and rapid pose changes around impact. We present GRAZE, a training‑free pipeline for FPOC localization that requires no labeled tackle‑contact examples. GRAZE uses Grounding DINO to discover candidate player‑dummy interactions, refines them with motion‑aware temporal reasoning, and uses SAM2 as an explicit pixel‑level verifier of contact rather than relying on detection confidence alone. This separation between candidate discovery and contact confirmation makes the approach robust to cluttered scenes and unstable grounding near impact. On 738 tackle‑practice videos, GRAZE produces valid outputs for 97.4% of clips and localizes FPOC within \pm 10 frames on 77.5% of all clips and within \pm 20 frames on 82.7% of all clips. These results show that frame‑accurate contact onset localization in real‑world practice footage is feasible without task‑specific training.
Authors:Elliott Watkiss-Leek, Reham Alharbi, Harry Rostron, Andrew Ng, Ewan Johnson, Andrew Mitchell, Terry R. Payne, Valentina Tamma, Jacopo de Berardinis
Abstract:
Competency question (CQ) elicitation represents a critical but resource‑intensive bottleneck in ontology engineering. This foundational phase is often hampered by the communication gap between domain experts, who possess the necessary knowledge, and ontology engineers, who formalise it. This paper introduces IDEA2, a novel, semi‑automated workflow that integrates Large Language Models (LLMs) within a collaborative, expert‑in‑the‑loop process to address this challenge. The methodology is characterised by a core iterative loop: an initial LLM‑based extraction of CQs from requirement documents, a co‑creational review and feedback phase by domain experts on an accessible collaborative platform, and an iterative, feedback‑driven reformulation of rejected CQs by an LLM until consensus is achieved. To ensure transparency and reproducibility, the entire lifecycle of each CQ is tracked using a provenance model that captures the full lineage of edits, anonymised feedback, and generation parameters. The workflow was validated in 2 real‑world scenarios (scientific data, cultural heritage), demonstrating that IDEA2 can accelerate the requirements engineering process, improve the acceptance and relevance of the resulting CQs, and exhibit high usability and effectiveness among domain experts. We release all code and experiments at https://github.com/KE‑UniLiv/IDEA2
Authors:Neo Christopher Chung, Maxim Laletin
Abstract:
Vision transformers (ViT) rely on attention mechanism to weigh input features, and therefore attention scores have naturally been considered as explanations for its decision‑making process. However, attention scores are almost always non‑zero, resulting in noisy and diffused attention maps and limiting interpretability. Can we quantify uncertainty measures of attention scores and obtain regularized attention scores? To this end, we consider attention scores of ViT in a statistical framework where independent noise would lead to insignificant yet non‑zero scores. Leveraging statistical learning techniques, we introduce the bootstrapping for attention scores which generates a baseline distribution of attention scores by resampling input features. Such a bootstrap distribution is then used to estimate significances and posterior probabilities of attention scores. In natural and medical images, the proposed \emphAttention Regularization approach demonstrates a straightforward removal of spurious attention arising from noise, drastically improving shrinkage and sparsity. Quantitative evaluations are conducted using both simulation and real‑world datasets. Our study highlights bootstrapping as a practical regularization tool when using attention scores as explanations for ViT. Code available: https://github.com/ncchung/AttentionRegularization
Authors:Marco Morini, Sara Sarto, Marcella Cornia, Lorenzo Baraldi
Abstract:
Answering questions about images often requires combining visual understanding with external knowledge. Multimodal Large Language Models (MLLMs) provide a natural framework for this setting, but they often struggle to identify the most relevant visual and textual evidence when answering knowledge‑intensive queries. In such scenarios, models must integrate visual cues with retrieved textual evidence that is often noisy or only partially relevant, while also localizing fine‑grained visual information in the image. In this work, we introduce Look Twice (LoT), a training‑free inference‑time framework that improves how pretrained MLLMs utilize multimodal evidence. Specifically, we exploit the model attention patterns to estimate which visual regions and retrieved textual elements are relevant to a query, and then generate the answer conditioned on this highlighted evidence. The selected cues are highlighted through lightweight prompt‑level markers that encourage the model to re‑attend to the relevant evidence during generation. Experiments across multiple knowledge‑based VQA benchmarks show consistent improvements over zero‑shot MLLMs. Additional evaluations on vision‑centric and hallucination‑oriented benchmarks further demonstrate that visual evidence highlighting alone improves model performance in settings without textual context, all without additional training or architectural modifications. Source code will be publicly released.
Authors:Lei Wang, Eduard Dragut
Abstract:
Individuals engaging in online communication frequently express personal opinions with informal styles (e.g., memes and emojis). While Language Models (LMs) with informal communications have been widely discussed, a unique and emphatic style, the Repetitive Lengthening Form (RLF), has been overlooked for years. In this paper, we explore answers to two research questions: 1) Is RLF important for sentiment analysis (SA)? 2) Can LMs understand RLF? Inspired by previous linguistic research, we curate Lengthening, the first multi‑domain dataset with 850k samples focused on RLF for SA. Moreover, we introduce Explainable Instruction Tuning (ExpInstruct), a two‑stage instruction tuning framework aimed to improve both performance and explainability of LLMs for RLF. We further propose a novel unified approach to quantify LMs' understanding of informal expressions. We show that RLF sentences are expressive expressions and can serve as signatures of document‑level sentiment. Additionally, RLF has potential value for online content analysis. Our results show that fine‑tuned Pre‑trained Language Models (PLMs) can surpass zero‑shot GPT‑4 in performance but not in explanation for RLF. Finally, we show ExpInstruct can improve the open‑sourced LLMs to match zero‑shot GPT‑4 in performance and explainability for RLF with limited samples. Code and sample data are available at https://github.com/Tom‑Owl/OverlookedRLF
Authors:J. E. Domínguez-Vidal
Abstract:
Foundation vision‑language models are becoming increasingly relevant to robotics because they can provide richer semantic perception than narrow task‑specific pipelines. However, their practical adoption in robot software stacks still depends on reproducible middleware integrations rather than on model quality alone. Florence‑2 is especially attractive in this regard because it unifies captioning, optical character recognition, open‑vocabulary detection, grounding and related vision‑language tasks within a comparatively manageable model size. This article presents a ROS 2 wrapper for Florence‑2 that exposes the model through three complementary interaction modes: continuous topic‑driven processing, synchronous service calls and asynchronous actions. The wrapper is designed for local execution and supports both native installation and Docker container deployment. It also combines generic JSON outputs with standard ROS 2 message bindings for detection‑oriented tasks. A functional validation is reported together with a throughput study on several GPUs, showing that local deployment is feasible with consumer grade hardware. The repository is publicly available here: https://github.com/JEDominguezVidal/florence2_ros2_wrapper
Authors:Cai Zhou, Zekai Wang, Menghua Wu, Qianyu Julie Zhu, Flora C. Shi, Chenyu Wang, Ashia Wilson, Tommi Jaakkola, Stephen Bates
Abstract:
While test‑time scaling has enabled large language models to solve highly difficult tasks, state‑of‑the‑art results come at exorbitant compute costs. These inefficiencies can be attributed to the miscalibration of post‑trained language models, and the lack of calibration in popular sampling techniques. Here, we present Online Reasoning Calibration (ORCA), a framework for calibrating the sampling process that draws upon conformal prediction and test‑time training. Specifically, we introduce a meta‑learning procedure that updates the calibration module for each input. This allows us to provide valid confidence estimates under distributional shift, e.g. in thought patterns that occur across different stages of reasoning, or in prompt distributions between model development and deployment. ORCA not only provides theoretical guarantees on conformal risks, but also empirically shows higher efficiency and generalization across different reasoning tasks. At risk level δ=0.1, ORCA improves Qwen2.5‑32B efficiency on in‑distribution tasks with savings up to 47.5% with supervised labels and 40.7% with self‑consistency labels. Under zero‑shot out‑of‑domain settings, it improves MATH‑500 savings from 24.8% of the static calibration baseline to 67.0% while maintaining a low empirical error rate, and the same trend holds across model families and downstream benchmarks. Our code is publicly available at https://github.com/wzekai99/ORCA.
Authors:Prantik Deb, Srimanth Dhondy, N. Ramakrishna, Anu Kapoor, Raju S. Bapi, Tapabrata Chakraborti
Abstract:
Chest X‑ray (CXR) segmentation is an important step in computer‑aided diagnosis, yet deploying large foundation models in clinical settings remains challenging due to computational constraints. We propose AdaLoRA‑QAT, a two‑stage fine‑tuning framework that combines adaptive low‑rank encoder adaptation with full quantization‑aware training. Adaptive rank allocation improves parameter efficiency, while selective mixed‑precision INT8 quantization preserves structural fidelity crucial for clinical reliability. Evaluated across large‑scale CXR datasets, AdaLoRA‑QAT achieves 95.6% Dice, matching full‑precision SAM decoder fine‑tuning while reducing trainable parameters by 16.6× and yielding 2.24× model compression. A Wilcoxon signed‑rank test confirms that quantization does not significantly degrade segmentation accuracy. These results demonstrate that AdaLoRA‑QAT effectively balances accuracy, efficiency, and structural trust‑worthiness, enabling compact and deployable foundation models for medical image segmentation. Code and pretrained models are available at: https://prantik‑pdeb.github.io/adaloraqat.github.io/
Authors:Aaron Rose, Carissa Cullen, Brandon Gary Kaplowitz, Christian Schroeder de Witt
Abstract:
As LLM agents are increasingly deployed in multi‑agent systems, they introduce risks of covert coordination that may evade standard forms of human oversight. While linear probes on model activations have shown promise for detecting deception in single‑agent settings, collusion is inherently a multi‑agent phenomenon, and the use of internal representations for detecting collusion between agents remains unexplored. We introduce NARCBench, a benchmark for evaluating collusion detection under environment distribution shift, and propose five probing techniques that aggregate per‑agent deception scores to classify scenarios at the group level. Our probes achieve 1.00 AUROC in‑distribution and 0.60‑‑0.86 AUROC when transferred zero‑shot to structurally different multi‑agent scenarios and a steganographic blackjack card‑counting task. We find that no single probing technique dominates across all collusion types, suggesting that different forms of collusion manifest differently in activation space. We also find preliminary evidence that this signal is localised at the token level, with the colluding agent's activations spiking specifically when processing the encoded parts of their partner's message. This work takes a step toward multi‑agent interpretability: extending white‑box inspection from single models to multi‑agent contexts, where detection requires aggregating signals across agents. These results suggest that model internals provide a complementary signal to text‑level monitoring for detecting multi‑agent collusion, particularly for organisations with access to model activations. Code and data are available at https://github.com/aaronrose227/narcbench.
Authors:Atsuyuki Miyai, Mashiro Toyooka, Zaiying Zhao, Kenta Watanabe, Toshihiko Yamasaki, Kiyoharu Aizawa
Abstract:
This paper introduces the first systematic evaluation framework for quantifying the quality and risks of papers written by modern coding agents. While AI‑driven paper writing has become a growing concern, rigorous evaluation of the quality and potential risks of AI‑written papers remains limited, and a unified understanding of their reliability is still lacking. We introduce Paper Reconstruction Evaluation (PaperRecon), an evaluation framework in which an overview (overview.md) is created from an existing paper, after which an agent generates a full paper based on the overview and minimal additional resources, and the result is subsequently compared against the original paper. PaperRecon disentangles the evaluation of the AI‑written papers into two orthogonal dimensions, Presentation and Hallucination, where Presentation is evaluated using a rubric and Hallucination is assessed via agentic evaluation grounded in the original paper source. For evaluation, we introduce PaperWrite‑Bench, a benchmark of 51 papers from top‑tier venues across diverse domains published after 2025. Our experiments reveal a clear trade‑off: while both ClaudeCode and Codex improve with model advances, ClaudeCode achieves higher presentation quality at the cost of more than 10 hallucinations per paper on average, whereas Codex produces fewer hallucinations but lower presentation quality. This work takes a first step toward establishing evaluation frameworks for AI‑driven paper writing and improving the understanding of its risks within the research community.
Authors:Jiaqi Liu, Zipeng Ling, Shi Qiu, Yanqing Liu, Siwei Han, Peng Xia, Haoqin Tu, Zeyu Zheng, Cihang Xie, Charles Fleming, Mingyu Ding, Huaxiu Yao
Abstract:
AI agents increasingly operate over extended time horizons, yet their ability to retain, organize, and recall multimodal experiences remains a critical bottleneck. Building effective lifelong memory requires navigating a vast design space spanning architecture, retrieval strategies, prompt engineering, and data pipelines; this space is too large and interconnected for manual exploration or traditional AutoML to explore effectively. We deploy an autonomous research pipeline to discover Omni‑SimpleMem, a unified multimodal memory framework for lifelong AI agents. Starting from a naïve baseline (F1=0.117 on LoCoMo), the pipeline autonomously executes ~50 experiments across two benchmarks, diagnosing failure modes, proposing architectural modifications, and repairing data pipeline bugs, all without human intervention in the inner loop. The resulting system achieves state‑of‑the‑art on both benchmarks, improving F1 by +411% on LoCoMo (0.117\to0.598) and +214% on Mem‑Gallery (0.254\to0.797) relative to the initial configurations. Critically, the most impactful discoveries are not hyperparameter adjustments: bug fixes (+175%), architectural changes (+44%), and prompt engineering (+188% on specific categories) each individually exceed the cumulative contribution of all hyperparameter tuning, demonstrating capabilities fundamentally beyond the reach of traditional AutoML. We provide a taxonomy of six discovery types and identify four properties that make multimodal memory particularly suited for autoresearch, offering guidance for applying autonomous research pipelines to other AI system domains. Code is available at this https://github.com/aiming‑lab/SimpleMem.
Authors:Zhengyang Tang, Ke Ji, Xidong Wang, Zihan Ye, Xinyuan Wang, Yiduo Guo, Ziniu Li, Chenxin Li, Jingyuan Hu, Shunian Chen, Tongxu Luo, Jiaxi Bi, Zeyu Qin, Shaobo Wang, Xin Lai, Pengyuan Lyu, Junyi Li, Can Xu, Chengquan Zhang, Han Hu, Ming Yan, Benyou Wang
Abstract:
We study whether phone‑use agents respect privacy while completing benign mobile tasks. This question has remained hard to answer because privacy‑compliant behavior is not operationalized for phone‑use agents, and ordinary apps do not reveal exactly what data agents type into which form entries during execution. To make this question measurable, we introduce MyPhoneBench, a verifiable evaluation framework for privacy behavior in mobile agents. We operationalize privacy‑respecting phone use as permissioned access, minimal disclosure, and user‑controlled memory through a minimal privacy contract, iMy, and pair it with instrumented mock apps plus rule‑based auditing that make unnecessary permission requests, deceptive re‑disclosure, and unnecessary form filling observable and reproducible. Across five frontier models on 10 mobile apps and 300 tasks, we find that task success, privacy‑compliant task completion, and later‑session use of saved preferences are distinct capabilities, and no single model dominates all three. Evaluating success and privacy jointly reshuffles the model ordering relative to either metric alone. The most persistent failure mode across models is simple data minimization: agents still fill optional personal entries that the task does not require. These results show that privacy failures arise from over‑helpful execution of benign tasks, and that success‑only evaluation overestimates the deployment readiness of current phone‑use agents. All code, mock apps, and agent trajectories are publicly available at~ https://github.com/FreedomIntelligence/MyPhoneBench.
Authors:Nan Wang, Zhiwei Jin, Chen Chen, Haonan Lu
Abstract:
Document understanding and GUI interaction are among the highest‑value applications of Vision‑Language Models (VLMs), yet they impose exceptionally heavy computational burden: fine‑grained text and small UI elements demand high‑resolution inputs that produce tens of thousands of visual tokens. We observe that this cost is largely wasteful ‑‑ across document and GUI benchmarks, only 22‑‑71% of image patches are pixel‑unique, the rest being exact duplicates of another patch in the same image. We propose PixelPrune, which exploits this pixel‑level redundancy through predictive‑coding‑based compression, pruning redundant patches \emphbefore the Vision Transformer (ViT) encoder. Because it operates in pixel space prior to any neural computation, PixelPrune accelerates both the ViT encoder and the downstream LLM, covering the full inference pipeline. The method is training‑free, requires no learnable parameters, and supports pixel‑lossless compression (τ=0) as well as controlled lossy compression (τ>0). Experiments across three model scales and document and GUI benchmarks show that PixelPrune maintains competitive task accuracy while delivering up to 4.2× inference speedup and 1.9× training acceleration. Code is available at https://github.com/OPPO‑Mente‑Lab/PixelPrune.
Authors:Sicheng Zuo, Zixun Xie, Wenzhao Zheng, Shaoqing Xu, Fang Li, Hanbing Li, Long Chen, Zhi-Xin Yang, Jiwen Lu
Abstract:
End‑to‑end autonomous driving has evolved from the conventional paradigm based on sparse perception into vision‑language‑action (VLA) models, which focus on learning language descriptions as an auxiliary task to facilitate planning. In this paper, we propose an alternative Vision‑Geometry‑Action (VGA) paradigm that advocates dense 3D geometry as the critical cue for autonomous driving. As vehicles operate in a 3D world, we think dense 3D geometry provides the most comprehensive information for decision‑making. However, most existing geometry reconstruction methods (e.g., DVGT) rely on computationally expensive batch processing of multi‑frame inputs and cannot be applied to online planning. To address this, we introduce a streaming Driving Visual Geometry Transformer (DVGT‑2), which processes inputs in an online manner and jointly outputs dense geometry and trajectory planning for the current frame. We employ temporal causal attention and cache historical features to support on‑the‑fly inference. To further enhance efficiency, we propose a sliding‑window streaming strategy and use historical caches within a certain interval to avoid repetitive computations. Despite the faster speed, DVGT‑2 achieves superior geometry reconstruction performance on various datasets. The same trained DVGT‑2 can be directly applied to planning across diverse camera configurations without fine‑tuning, including closed‑loop NAVSIM and open‑loop nuScenes benchmarks.
Authors:Yilun Liu, Jinru Han, Sikuan Yan, Volker Tresp, Yunpu Ma
Abstract:
Standard Mixture‑of‑Experts (MoE) models rely on centralized routing mechanisms that introduce rigid inductive biases. We propose Routing‑Free MoE which eliminates any hard‑coded centralized designs including external routers, Softmax, Top‑K and load balancing, instead encapsulating all activation functionalities within individual experts and directly optimized through continuous gradient flow, enabling each expert to determine its activation entirely on its own. We introduce a unified adaptive load‑balancing framework to simultaneously optimize both expert‑balancing and token‑balancing objectives through a configurable interpolation, allowing flexible and customizable resource allocation. Extensive experiments show that Routing‑Free MoE can consistently outperform baselines with better scalability and robustness. We analyze its behavior in detail and offer insights that may facilitate future MoE design ad optimization.
Authors:Björn Roman Kohlberger
Abstract:
The memory wall remains the primary bottleneck for training large language models on consumer hardware. We introduce Spectral Compact Training (SCT), a method that replaces dense weight matrices with permanent truncated SVD factors W = U diag(s) V^T, where the full dense matrix is never materialized during training or inference. Gradients flow through the compact spectral factors via standard backpropagation, and U, V are retracted to the Stiefel manifold via QR decomposition after each optimizer step. SCT achieves up to 199x memory reduction per MLP layer at rank 32, enabling full training steps of 70B‑parameter architectures on a Steam Deck handheld (7.2 GB peak memory vs. 1,245 GB for dense FP32 training with Adam). Rank‑sweep experiments on SmolLM2‑1.7B (ranks 32‑256, 2000 steps, NVIDIA A100) show that all tested ranks converge to the same loss floor (~4.2‑4.5), identifying the learning rate schedule ‑‑ not MLP rank ‑‑ as the primary bottleneck. Rank 128 emerges as the efficiency sweet spot at 11.7x MLP compression with the lowest perplexity. GPU memory drops 46% at rank 32 while training throughput doubles.
Authors:Rajkiran Panuganti
Abstract:
Transformer language models contain localized reasoning circuits, contiguous layer blocks that improve reasoning when duplicated at inference time. Finding these circuits currently requires brute‑force sweeps costing 25 GPU hours per model. We propose CircuitProbe, which predicts circuit locations from activation statistics in under 5 minutes on CPU, providing a speedup of three to four orders of magnitude. We find that reasoning circuits come in two types: stability circuits in early layers, detected through the derivative of representation change, and magnitude circuits in late layers, detected through anomaly scoring. We validate across 9 models spanning 6 architectures, including 2025 models, confirming that CircuitProbe top predictions match or are within 2 layers of the optimal circuit in all validated cases. A scaling experiment across the Qwen 2.5 family reveals that layer duplication consistently benefits models under 3B parameters but degrades performance in 7B+ models, making this a practical scaling technique for small language models. CircuitProbe requires as few as 10 calibration examples and its predictions are stable across English, Hindi, Chinese, and French.
Authors:Karan Singh, Michael Yu, Varun Gangal, Zhuofu Tao, Sachin Kumar, Emmy Liu, Steven Y. Feng
Abstract:
Retrieval‑augmented generation (RAG) improves language model (LM) performance by providing relevant context at test time for knowledge‑intensive situations. However, the relationship between parametric knowledge acquired during pretraining and non‑parametric knowledge accessed via retrieval remains poorly understood, especially under fixed data budgets. In this work, we systematically study the trade‑off between pretraining corpus size and retrieval store size across a wide range of model and data scales. We train OLMo‑2‑based LMs ranging from 30M to 3B parameters on up to 100B tokens of DCLM data, while varying both pretraining data scale (1‑150x the number of parameters) and retrieval store size (1‑20x), and evaluate performance across a diverse suite of benchmarks spanning reasoning, scientific QA, and open‑domain QA. We find that retrieval consistently improves performance over parametric‑only baselines across model scales and introduce a three‑dimensional scaling framework that models performance as a function of model size, pretraining tokens, and retrieval corpus size. This scaling manifold enables us to estimate optimal allocations of a fixed data budget between pretraining and retrieval, revealing that the marginal utility of retrieval depends strongly on model scale, task type, and the degree of pretraining saturation. Our results provide a quantitative foundation for understanding when and how retrieval should complement pretraining, offering practical guidance for allocating data resources in the design of scalable language modeling systems.
Authors:Yu Xia, Canwen Xu, Zhewei Yao, Julian McAuley, Yuxiong He
Abstract:
Group Relative Policy Optimization (GRPO) is widely used for reinforcement learning with verifiable rewards, but it often suffers from advantage collapse: when all rollouts in a group receive the same reward, the group yields zero relative advantage and thus no learning signal. For example, if a question is too hard for the reasoner, all sampled rollouts can be incorrect and receive zero reward. Recent work addresses this issue by adding hints or auxiliary scaffolds to such hard questions so that the reasoner produces mixed outcomes and recovers a non‑zero update. However, existing hints are usually fixed rather than adapted to the current reasoner, and a hint that creates learning signal under the hinted input does not necessarily improve the no‑hint policy used at test time. To this end, we propose Hint Learning for Reinforcement Learning (HiLL), a framework that jointly trains a hinter policy and a reasoner policy during RL. For each hard question, the hinter generates hints online conditioned on the current reasoner's incorrect rollout, allowing hint generation to adapt to the reasoner's evolving errors. We further introduce hint reliance, which measures how strongly correct hinted trajectories depend on the hint. We derive a transferability result showing that lower hint reliance implies stronger transfer from hinted success to no‑hint success, and we use this result to define a transfer‑weighted reward for training the hinter. Therefore, HiLL favors hints that not only recover informative GRPO groups, but also produce signals that are more likely to improve the original no‑hint policy. Experiments across multiple benchmarks show that HiLL consistently outperforms GRPO and prior hint‑based baselines, demonstrating the value of adaptive and transfer‑aware hint learning for RL. The code is available at https://github.com/Andree‑9/HiLL.
Authors:Yao Qin, Yangyang Yan, Jinhua Pang, Xiaoming Zhang
Abstract:
The integration of Large Language Models (LLMs) into life sciences has catalyzed the development of "AI Scientists." However, translating these theoretical capabilities into deployment‑ready research environments exposes profound infrastructural vulnerabilities. Current frameworks are bottlenecked by fragile JSON‑based tool‑calling protocols, easily disrupted execution sandboxes that lose graphical outputs, and rigid conversational interfaces inherently ill‑suited for high‑dimensional scientific data.We introduce BloClaw, a unified, multi‑modal operating system designed for Artificial Intelligence for Science (AI4S). BloClaw reconstructs the Agent‑Computer Interaction (ACI) paradigm through three architectural innovations: (1) An XML‑Regex Dual‑Track Routing Protocol that statistically eliminates serialization failures (0.2% error rate vs. 17.6% in JSON); (2) A Runtime State Interception Sandbox that utilizes Python monkey‑patching to autonomously capture and compile dynamic data visualizations (Plotly/Matplotlib), circumventing browser CORS policies; and (3) A State‑Driven Dynamic Viewport UI that morphs seamlessly between a minimalist command deck and an interactive spatial rendering engine. We comprehensively benchmark BloClaw across cheminformatics (RDKit), de novo 3D protein folding via ESMFold, molecular docking, and autonomous Retrieval‑Augmented Generation (RAG), establishing a highly robust, self‑evolving paradigm for computational research assistants. The open‑source repository is available at https://github.com/qinheming/BloClaw.
Authors:Yabin Zhang, Chong Wang, Yunhe Gao, Jiaming Liu, Maya Varma, Justin Xu, Sophie Ostmeier, Jin Long, Sergios Gatidis, Seena Dehkharghani, Arne Michalson, Eun Kyoung Hong, Christian Bluethgen, Haiwei Henry Guo, Alexander Victor Ortiz, Stephan Altmayer, Sandhya Bodapati, Joseph David Janizek, Ken Chang, Jean-Benoit Delbrouck, Akshay S. Chaudhari, Curtis P. Langlotz
Abstract:
Chest X‑rays (CXRs) are among the most frequently performed imaging examinations worldwide, yet rising imaging volumes increase radiologist workload and the risk of diagnostic errors. Although artificial intelligence (AI) systems have shown promise for CXR interpretation, most generate only final predictions, without making explicit how visual evidence is translated into radiographic findings and diagnostic predictions. We present CheXOne, a reasoning‑enabled vision‑language model for CXR interpretation. CheXOne jointly generates diagnostic predictions and explicit, clinically grounded reasoning traces that connect visual evidence, radiographic findings, and these predictions. The model is trained on 14.7 million instruction and reasoning samples curated from 30 public datasets spanning 36 CXR interpretation tasks, using a two‑stage framework that combines instruction tuning with reinforcement learning to improve reasoning quality. We evaluate CheXOne in zero‑shot settings across visual question answering, report generation, visual grounding and reasoning assessment, covering 17 evaluation settings. CheXOne outperforms existing medical and general‑domain foundation models and achieves strong performance on independent public benchmarks. A clinical reader study demonstrates that CheXOne‑drafted reports are comparable to or better than resident‑written reports in 55% of cases, while effectively addressing clinical indications and enhancing both report writing and CXR interpretation efficiency. Further analyses involving radiologists reveal that the generated reasoning traces show high clinical factuality and provide causal support for the final predictions, offering a plausible explanation for the performance gains. These results suggest that explicit reasoning can improve model performance, interpretability and clinical utility in AI‑assisted CXR interpretation.
Authors:Harshee Jignesh Shah
Abstract:
Large Language Models (LLMs) increasingly prioritize user validation over epistemic accuracy ‑ a phenomenon known as sycophancy. We present The Silicon Mirror, an orchestration framework that dynamically detects user persuasion tactics and adjusts AI behavior to maintain factual integrity. Our architecture introduces three components: (1) a Behavioral Access Control (BAC) system that restricts context layer access based on real‑time sycophancy risk scores, (2) a Trait Classifier that identifies persuasion tactics across multi‑turn dialogues, and (3) a Generator‑Critic loop where an auditor vetoes sycophantic drafts and triggers rewrites with "Necessary Friction." In a live evaluation across all 437 TruthfulQA adversarial scenarios, Claude Sonnet 4 exhibits 9.6% baseline sycophancy, reduced to 1.4% by the Silicon Mirror ‑ an 85.7% relative reduction (p < 10^‑6, OR = 7.64, Fisher's exact test). Cross‑model evaluation on Gemini 2.5 Flash reveals a 46.0% baseline reduced to 14.2% (p < 10^‑10, OR = 5.15). We characterize the validation‑before‑correction pattern as a distinct failure mode of RLHF‑trained models.
Authors:Jiwoo Ha, Jongwoo Baek, Jinhyun So
Abstract:
Recent Large Vision‑Language Models (LVLMs) have demonstrated remarkable performance across various multimodal tasks that require understanding both visual and linguistic inputs. However, object hallucination ‑‑ the generation of nonexistent objects in answers ‑‑ remains a persistent challenge. Although several approaches such as retraining and external grounding methods have been proposed to mitigate this issue, they still suffer from high data costs or structural complexity. Training‑free methods such as Contrastive Decoding (CD) are more cost‑effective, avoiding additional training or external models, but still suffer from long‑term decay, where visual grounding weakens and language priors dominate as the generation progresses. In this paper, we propose First Logit Boosting (FLB), a simple yet effective training‑free technique designed to alleviate long‑term decay in LVLMs. FLB stores the logit of the first generated token and adds it to subsequent token predictions, effectively mitigating long‑term decay of visual information. We observe that FLB (1) sustains the visual information embedded in the first token throughout generation, and (2) suppresses hallucinated words through the stabilizing effect of the ``The'' token. Experimental results show that FLB significantly reduces object hallucination across various tasks, benchmarks, and backbone models. Notably, it causes negligible inference overhead, making it highly applicable to real‑time multimodal systems. Code is available at https://github.com/jiwooha20/FLB
Authors:Ponhvoan Srey, Quang Minh Nguyen, Xiaobao Wu, Anh Tuan Luu
Abstract:
Uncertainty estimation (UE) aims to detect hallucinated outputs of large language models (LLMs) to improve their reliability. However, UE metrics often exhibit unstable performance across configurations, which significantly limits their applicability. In this work, we formalise this phenomenon as proxy failure, since most UE metrics originate from model behaviour, rather than being explicitly grounded in the factual correctness of LLM outputs. With this, we show that UE metrics become non‑discriminative precisely in low‑information regimes. To alleviate this, we propose Truth AnChoring (TAC), a post‑hoc calibration method to remedy UE metrics, by mapping the raw scores to truth‑aligned scores. Even with noisy and few‑shot supervision, our TAC can support the learning of well‑calibrated uncertainty estimates, and presents a practical calibration protocol. Our findings highlight the limitations of treating heuristic UE metrics as direct indicators of truth uncertainty, and position our TAC as a necessary step toward more reliable uncertainty estimation for LLMs. The code repository is available at https://github.com/ponhvoan/TruthAnchor/.
Authors:Borislav Mavrin
Abstract:
No one has independently reproduced OpenAI's published scores for gpt‑oss‑20b with tools, because the original paper discloses neither the tools nor the agent harness. We reverse‑engineered the model's in‑distribution tools: when prompted without tool definitions, gpt‑oss still calls tools from its training distribution with high statistical confidence ‑‑ a strong prior, not a hallucination. We then built a native harmony agent harness (https://github.com/borislavmavrin/harmonyagent.git) that encodes messages in the model's native format, bypassing the lossy Chat Completions conversion. Together, these yield the first independent reproduction of OpenAI's published scores: 60.4% on SWE Verified HIGH (published 60.7%), 53.3% MEDIUM (53.2%), and 91.7% on AIME25 with tools (90.4%).
Authors:Bardia Azizian, Ivan V. Bajic
Abstract:
The rapid progress of large Vision‑Language Models (VLMs) has enabled a wide range of applications, such as image understanding and Visual Question Answering (VQA). Query images are often uploaded to the cloud, where VLMs are typically hosted, hence efficient image compression becomes crucial. However, traditional human‑centric codecs are suboptimal in this setting because they preserve many task‑irrelevant details. Existing Image Coding for Machines (ICM) methods also fall short, as they assume a fixed set of downstream tasks and cannot adapt to prompt‑driven VLMs with an open‑ended variety of objectives. We propose a lightweight, plug‑and‑play, prompt‑guided prefiltering module to identify image regions most relevant to the text prompt, and consequently to the downstream task. The module preserves important details while smoothing out less relevant areas to improve compression efficiency. It is codec‑agnostic and can be applied before conventional and learned encoders. Experiments on several VQA benchmarks show that our approach achieves a 25‑50% average bitrate reduction while maintaining the same task accuracy. Our source code is available at https://github.com/bardia‑az/pgp‑vlm‑compression.
Authors:Gaurav Rajesh Parikh, Angikar Ghosal
Abstract:
We formally introduce a improvisational wordplay game called Connections to explore reasoning capabilities of AI agents. Playing Connections combines skills in knowledge retrieval, summarization and awareness of cognitive states of other agents. We show how the game serves as a good benchmark for social intelligence abilities of language model based agents that go beyond the agents' own memory and deductive reasoning and also involve gauging the understanding capabilities of other agents. Finally, we show how through communication with other agents in a constrained environment, AI agents must demonstrate social awareness and intelligence in games involving collaboration.
Authors:Jinghan Yao, Sam Adé Jacobs, Walid Krichene, Masahiro Tanaka, Dhabaleswar K Panda
Abstract:
Long‑context decoding in LLMs is IO‑bound: each token re‑reads an ever‑growing KV cache. Prior accelerations cut bytes via compression, which lowers fidelity, or selection/eviction, which restricts what remains accessible, and both can degrade delayed recall and long‑form generation. We introduce MAC‑Attention, a fidelity‑ and access‑preserving alternative that accelerates decoding by reusing prior attention computations for semantically similar recent queries. It starts with a match stage that performs pre‑RoPE L2 matching over a short local window; an amend stage rectifies the reused attention by recomputing a small band near the match boundary; and a complete stage fuses the rectified results with fresh attention computed on the KV tail through a numerically stable merge. On a match hit, the compute and bandwidth complexity is constant regardless of context length. The method is model‑agnostic and composes with IO‑aware kernels, paged‑KV managers, and MQA/GQA. Across LongBench v2 (120K), RULER (120K), and LongGenBench (16K continuous generation), compared to the latest FlashInfer library, MAC‑Attention reduces KV accesses by up to 99%, cuts token generation latency by over 60% at 128K, and achieves over 14.3x attention‑phase speedups, up to 2.6x end‑to‑end, while maintaining full‑attention quality. By reusing computation, MAC‑Attention delivers long‑context inference that is both fast and faithful. Code is available here: https://github.com/YJHMITWEB/MAC‑Attention.git
Authors:Ashish Rana, Chia-Chien Hung, Qumeng Sun, Julian Martin Kunkel, Carolin Lawrence
Abstract:
Human memory adapts through selective forgetting: experiences become less accessible over time but can be reactivated by reinforcement or contextual cues. In contrast, memory‑augmented LLM agents rely on "always‑on" retrieval and "flat" memory storage, causing high interference and latency as histories grow. We introduce Oblivion, a memory control framework that casts forgetting as decay‑driven reductions in accessibility, not explicit deletion. Oblivion decouples memory control into read and write paths. The read path decides when to consult memory, based on agent uncertainty and memory buffer sufficiency, avoiding redundant always‑on access. The write path decides what to strengthen, by reinforcing memories contributing to forming the response. Together, this enables hierarchical memory organization that maintains persistent high‑level strategies while dynamically loading details as needed. We evaluate on both static and dynamic long‑horizon interaction benchmarks. Results show that Oblivion dynamically adapts memory access and reinforcement, balancing learning and forgetting under shifting contexts, highlighting that memory control is essential for effective LLM‑agentic reasoning. The source code is available at https://github.com/nec‑research/oblivion.
Authors:Zeyu Jin, Xiaoyu Qin, Songtao Zhou, Kaifeng Yun, Jia Jia
Abstract:
Soccer commentary plays a crucial role in enhancing the soccer game viewing experience for audiences. Previous studies in automatic soccer commentary generation typically adopt an end‑to‑end method to generate anonymous live text commentary. Such generated commentary is insufficient in the context of real‑world live televised commentary, as it contains anonymous entities, context‑dependent errors and lacks statistical insights of the game events. To bridge the gap, we propose GameSight, a two‑stage model to address soccer commentary generation as a knowledge‑enhanced visual reasoning task, enabling live‑televised‑like knowledgeable commentary with accurate reference to entities (players and teams). GameSight starts by performing visual reasoning to align anonymous entities with fine‑grained visual and contextual analysis. Subsequently, the entity‑aligned commentary is refined with knowledge by incorporating external historical statistics and iteratively updated internal game state information. Consequently, GameSight improves the player alignment accuracy by 18.5% on SN‑Caption‑test‑align dataset compared to Gemini 2.5‑pro. Combined with further knowledge enhancement, GameSight outperforms in segment‑level accuracy and commentary quality, as well as game‑level contextual relevance and structural composition. We believe that our work paves the way for a more informative and engaging human‑centric experience with the AI sports application. Demo Page: https://gamesight2025.github.io/gamesight2025
Authors:Seamus Brady
Abstract:
Non‑Axiomatic Reasoning Systems (NARS) provide a framework for building adaptive agents that operate under insufficient knowledge and resources. However, the standard input language, Narsese, poses a usability barrier: its dense symbolic notation, overloaded punctuation, and implicit conventions make programs difficult to read, write, and maintain. We present DriftScript, a Lisp‑like domain‑specific language that compiles to Narsese. DriftScript provides source‑level constructs covering the major sentence and term forms used in Non‑Axiomatic Logic (NAL) levels 1 through 8, including inheritance, temporal implication, variable quantification, sequential conjunction, and operation invocation, while replacing symbolic syntax with readable keyword‑based S‑expressions. The compiler is a zero‑dependency, four‑stage pipeline implemented in 1,941 lines of C99. When used with the DriftNARS engine, DriftScript programs connect to external systems through four structured callback types and an HTTP operation registry, enabling a sense‑reason‑act loop for autonomous agents. We describe the language design and formal grammar, detail the compiler architecture, and evaluate the compiler through a 106‑case test suite, equivalence testing against hand‑written Narsese, a NAL coverage analysis, structural readability metrics, and compilation benchmarks. The source code is available at https://github.com/seamus‑brady/DriftNARS. This paper focuses on the design and implementation of the DriftScript language and its embedding into DriftNARS, rather than on new inference algorithms for NARS itself.
Authors:Simon Schug, Brenden M. Lake
Abstract:
The validity of online behavioral research relies on study participants being human rather than machine. In the past, it was possible to detect machines by posing simple challenges that were easily solved by humans but not by machines. General‑purpose agents based on large language models (LLMs) can now solve many of these challenges, threatening the validity of online behavioral research. Here we explore the idea of detecting humanness by using tasks that machines can solve too well to be human. Specifically, we probe for the existence of an established human cognitive constraint: limited working memory capacity. We show that cognitive modeling on a standard serial recall task can be used to distinguish online participants from LLMs even when the latter are specifically instructed to mimic human working memory constraints. Our results demonstrate that it is viable to use well‑established cognitive phenomena to distinguish LLMs from humans.
Authors:Ning Yang, Hengyu Zhong, Wentao Wang, Baoliang Tian, Haijun Zhang, Jun Wang
Abstract:
The extension of context windows in Large Language Models is typically facilitated by scaling positional encodings followed by lightweight Continual Pre‑Training (CPT). While effective for processing long sequences, this paradigm often disrupts original model capabilities, leading to performance degradation on standard short‑text benchmarks. We propose LinearARD, a self‑distillation method that restores Rotary Position Embeddings (RoPE)‑scaled students through attention‑structure consistency with a frozen native‑RoPE teacher. Rather than matching opaque hidden states, LinearARD aligns the row‑wise distributions of dense Q/Q, K/K, and V/V self‑relation matrices to directly supervise attention dynamics. To overcome the quadratic memory bottleneck of n × n relation maps, we introduce a linear‑memory kernel. This kernel leverages per‑token log‑sum‑exp statistics and fuses logit recomputation into the backward pass to compute exact Kullback‑Leibler divergence and gradients. On LLaMA2‑7B extended from 4K to 32K, LinearARD recovers 98.3% of the short‑text performance of state‑of‑the‑art baselines while surpassing them on long‑context benchmarks. Notably, our method achieves these results using only 4.25M training tokens compared to the 256M tokens required by LongReD and CPT. Our code is available at https://github.com/gracefulning/LinearARD.
Authors:Nathan Heath
Abstract:
Myopic Optimization with Non‑myopic Approval (MONA) mitigates multi‑step reward hacking by restricting the agent's planning horizon while supplying far‑sighted approval as a training signal~\citefarquhar2025mona. The original paper identifies a critical open question: how the method of constructing approval ‑‑ particularly the degree to which approval depends on achieved outcomes ‑‑ affects whether MONA's safety guarantees hold. We present a reproduction‑first extension of the public MONA Camera Dropbox environment that (i)~repackages the released codebase as a standard Python project with scripted PPO training, (ii)~confirms the published contrast between ordinary RL (91.5% reward‑hacking rate) and oracle MONA (0.0% hacking rate) using the released reference arrays, and (iii)~introduces a modular learned‑approval suite spanning oracle, noisy, misspecified, learned, and calibrated approval mechanisms. In reduced‑budget pilot sweeps across approval methods, horizons, dataset sizes, and calibration strategies, the best calibrated learned‑overseer run achieves zero observed reward hacking but substantially lower intended‑behavior rates than oracle MONA (11.9% vs.\ 99.9%), consistent with under‑optimization rather than re‑emergent hacking. These results operationalize the MONA paper's approval‑spectrum conjecture as a runnable experimental object and suggest that the central engineering challenge shifts from proving MONA's concept to building learned approval models that preserve sufficient foresight without reopening reward‑hacking channels. Code, configurations, and reproduction commands are publicly available. https://github.com/codernate92/mona‑camera‑dropbox‑repro
Authors:Jonas Landsgesell, Pascal Knoll
Abstract:
Tabular foundation models such as TabPFN and TabICL already produce full predictive distributions yet prevailing regression benchmarks evaluate them almost exclusively via point estimate metrics RMSE R2 These aggregate measures often obscure model performance in the tails of the distribution a critical deficit for high stakes decision making in domains like finance and clinical research where asymmetric risk profiles are the norm We introduce ScoringBench an open benchmark that computes a comprehensive suite of proper scoring rules like CRPS CRLS Interval Score Energy Score weighted CRPS and Brier Score alongside standard point metrics providing a richer picture of probabilistic forecast quality We evaluate realTabPFNv2.5 fine tuned with different scoring rule objectives and TabICL relative to untuned realTabPFNv2.5 across a suite of regression benchmarks Our results confirm that model rankings depend on the chosen scoring rule and that no single pretraining objective is universally optimal This demonstrates that for applications sensitive to extreme events the choice of evaluation metric is as much a domain specific requirement as the data itself ScoringBench is available at https://github.com/jonaslandsgesell/ScoringBench A live preview of the current leaderboard is available at https://scoringbench.bolt.host The leaderboard is maintained via git pull requests to ensure transparency traceability agility and reproducibility
Authors:Zhihong Cui, Haoran Tang, Tianyi Li, Yushuai Li, Peiyuan Guan, Amir Taherkordi, Tor Skeie
Abstract:
Trajectory planning for autonomous driving increasingly leverages large language models (LLMs) for commonsense reasoning, yet LLM outputs are inherently unreliable, posing risks in safety‑critical applications. We propose C‑TRAIL, a framework built on a Commonsense World that couples LLM‑derived commonsense with a trust mechanism to guide trajectory planning. C‑TRAIL operates through a closed‑loop Recall, Plan, and Update cycle: the Recall module queries an LLM for semantic relations and quantifies their reliability via a dual‑trust mechanism; the Plan module injects trust‑weighted commonsense into Monte Carlo Tree Search (MCTS) through a Dirichlet trust policy; and the Update module adaptively refines trust scores and policy parameters from environmental feedback. Experiments on four simulated scenarios in Highway‑env and two real‑world levelXData datasets (highD, rounD) show that C‑TRAIL consistently outperforms state‑of‑the‑art baselines, reducing ADE by 40.2%, FDE by 51.7%, and improving SR by 16.9 percentage points on average. The source code is available at https://github.com/ZhihongCui/CTRAIL.
Authors:Yinuo Liu, Zi Qian, Heng Zhou, Jiahao Zhang, Yajie Zhang, Zhihang Li, Mengyu Zhou, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang
Abstract:
Interleaved text‑and‑image generation represents a significant frontier for Multimodal Large Language Models (MLLMs), offering a more intuitive way to convey complex information. Current paradigms rely on either image generation or retrieval augmentation, yet they typically treat the two as mutually exclusive paths, failing to unify factuality with creativity. We argue that the next milestone in this field is Agentic Tool Planning, where the model serves as a central controller that autonomously determines when, where, and which tools to invoke to produce interleaved responses for visual‑critical queries. To systematically evaluate this paradigm, we introduce ATP‑Bench, a novel benchmark comprising 7,702 QA pairs (including 1,592 VQA pairs) across eight categories and 25 visual‑critical intents, featuring human‑verified queries and ground truths. Furthermore, to evaluate agentic planning independent of end‑to‑end execution and changing tool backends, we propose a Multi‑Agent MLLM‑as‑a‑Judge (MAM) system. MAM evaluates tool‑call precision, identifies missed opportunities for tool use, and assesses overall response quality without requiring ground‑truth references. Our extensive experiments on 10 state‑of‑the‑art MLLMs reveal that models struggle with coherent interleaved planning and exhibit significant variations in tool‑use behavior, highlighting substantial room for improvement and providing actionable guidance for advancing interleaved generation. Dataset and code are available at https://github.com/Qwen‑Applications/ATP‑Bench.
Authors:Yi Chen, Yuying Ge, Hui Zhou, Mingyu Ding, Yixiao Ge, Xihui Liu
Abstract:
The development of Vision‑Language‑Action (VLA) models has been significantly accelerated by pre‑trained Vision‑Language Models (VLMs). However, most existing end‑to‑end VLAs treat the VLM primarily as a multimodal encoder, directly mapping vision‑language features to low‑level actions. This paradigm underutilizes the VLM's potential in high‑level decision making and introduces training instability, frequently degrading its rich semantic representations. To address these limitations, we introduce DIAL, a framework bridging high‑level decision making and low‑level motor execution through a differentiable latent intent bottleneck. Specifically, a VLM‑based System‑2 performs latent world modeling by synthesizing latent visual foresight within the VLM's native feature space; this foresight explicitly encodes intent and serves as the structural bottleneck. A lightweight System‑1 policy then decodes this predicted intent together with the current observation into precise robot actions via latent inverse dynamics. To ensure optimization stability, we employ a two‑stage training paradigm: a decoupled warmup phase where System‑2 learns to predict latent futures while System‑1 learns motor control under ground‑truth future guidance within a unified feature space, followed by seamless end‑to‑end joint optimization. This enables action‑aware gradients to refine the VLM backbone in a controlled manner, preserving pre‑trained knowledge. Extensive experiments on the RoboCasa GR1 Tabletop benchmark show that DIAL establishes a new state‑of‑the‑art, achieving superior performance with 10x fewer demonstrations than prior methods. Furthermore, by leveraging heterogeneous human demonstrations, DIAL learns physically grounded manipulation priors and exhibits robust zero‑shot generalization to unseen objects and novel configurations during real‑world deployment on a humanoid robot.
Authors:Han Deng, Anqi Zou, Hanling Zhang, Ben Fei, Chengyu Zhang, Haobo Wang, Xinru Guo, Zhenyu Li, Xuzhu Wang, Peng Yang, Fujian Zhang, Weiyu Guo, Xiaohong Shao, Zhaoyang Liu, Shixiang Tang, Zhihui Wang, Wanli Ouyang
Abstract:
Scientific discovery increasingly depends on high‑throughput characterization, yet automation is hindered by proprietary GUIs and the limited generalizability of existing API‑based systems. We present Owl‑AuraID, a software‑hardware collaborative embodied agent system that adopts a GUI‑native paradigm to operate instruments through the same interfaces as human experts. Its skill‑centric framework integrates Type‑1 (GUI operation) and Type‑2 (data analysis) skills into end‑to‑end workflows, connecting physical sample handling with scientific interpretation. Owl‑AuraID demonstrates broad coverage across ten categories of precision instruments and diverse workflows, including multimodal spectral analysis, microscopic imaging, and crystallographic analysis, supporting modalities such as FTIR, NMR, AFM, and TGA. Overall, Owl‑AuraID provides a practical, extensible foundation for autonomous laboratories and illustrates a path toward evolving laboratory intelligence through reusable operational and analytical skills. The code are available at https://github.com/OpenOwlab/AuraID.
Authors:Lvmin Zhang, Maneesh Agrawala
Abstract:
Agent traces carry increasing analytical value in agentic systems and context engineering, yet most prior work treats conversation format as a trivial implementation detail. Modern agent conversations, however, contain deeply structured content, including nested tool calls and results, chain‑of‑thought reasoning blocks, sub‑agent invocations, context‑window compaction boundaries, and harness‑injected system directives, whose complexity far exceeds that of simple user‑assistant exchanges. Feeding such traces to a reflector or other analytical mechanism in plain text, JSON, YAML, or via grep can materially degrade analysis quality. This paper presents VCC (View‑oriented Conversation Compiler), a compiler (lex, parse, IR, lower, emit) that transforms raw agent JSONL logs into a family of structured views: a full view (lossless transcript serving as the canonical line‑number coordinate system), a user‑interface (UI) view (reconstructing the interaction as the user actually perceived it), and an adaptive view (a structure‑preserving projection governed by a relevance predicate). In a context‑engineering experiment on AppWorld, replacing only the reflector's input format, from raw JSONL to VCC‑compiled views, leads to higher pass rates across all three model configurations tested, while cutting reflector token consumption by half to two‑thirds and producing more concise learned memory. These results suggest that message format functions as infrastructure for context engineering, not as an incidental implementation choice.
Authors:Weixian Xu, Tiantian Mi, Yixiu Liu, Yang Nan, Zhimeng Zhou, Lyumanshan Ye, Lin Zhang, Yu Qiao, Pengfei Liu
Abstract:
Can AI accelerate the development of AI itself? While recent agentic systems have shown strong performance on well‑scoped tasks with rapid feedback, it remains unclear whether they can tackle the costly, long‑horizon, and weakly supervised research loops that drive real AI progress. We present ASI‑Evolve, an agentic framework for AI‑for‑AI research that closes this loop through a learn‑design‑experiment‑analyze cycle. ASI‑Evolve augments standard evolutionary agents with two key components: a cognition base that injects accumulated human priors into each round of exploration, and a dedicated analyzer that distills complex experimental outcomes into reusable insights for future iterations. To our knowledge, ASI‑Evolve is the first unified framework to demonstrate AI‑driven discovery across three central components of AI development: data, architectures, and learning algorithms. In neural architecture design, it discovered 105 SOTA linear attention architectures, with the best discovered model surpassing DeltaNet by +0.97 points, nearly 3x the gain of recent human‑designed improvements. In pretraining data curation, the evolved pipeline improves average benchmark performance by +3.96 points, with gains exceeding 18 points on MMLU. In reinforcement learning algorithm design, discovered algorithms outperform GRPO by up to +12.5 points on AMC32, +11.67 points on AIME24, and +5.04 points on OlympiadBench. We further provide initial evidence that this AI‑for‑AI paradigm can transfer beyond the AI stack through experiments in mathematics and biomedicine. Together, these results suggest that ASI‑Evolve represents a promising step toward enabling AI to accelerate AI across the foundational stages of development, offering early evidence for the feasibility of closed‑loop AI research.
Authors:Fei Shen, Chengyu Xie, Lihong Wang, Zhanyi Zhang, Xin Jiang, Xiaoyu Du, Jinhui Tang
Abstract:
Existing multi‑turn image editing paradigms are often confined to isolated single‑step execution. Due to a lack of context‑awareness and closed‑loop feedback mechanisms, they are prone to error accumulation and semantic drift during multi‑turn interactions, ultimately resulting in severe structural distortion of the generated images. For that, we propose IMAGAgent, a multi‑turn image editing agent framework based on a "plan‑execute‑reflect" closed‑loop mechanism that achieves deep synergy among instruction parsing, tool scheduling, and adaptive correction within a unified pipeline. Specifically, we first present a constraint‑aware planning module that leverages a vision‑language model (VLM) to precisely decompose complex natural language instructions into a series of executable sub‑tasks, governed by target singularity, semantic atomicity, and visual perceptibility. Then, the tool‑chain orchestration module dynamically constructs execution paths based on the current image, the current sub‑task, and the historical context, enabling adaptive scheduling and collaborative operation among heterogeneous operation models covering image retrieval, segmentation, detection, and editing. Finally, we devise a multi‑expert collaborative reflection mechanism where a central large language model (LLM) receives the image to be edited and synthesizes VLM critiques into holistic feedback, simultaneously triggering fine‑grained self‑correction and recording feedback outcomes to optimize future decisions. Extensive experiments on our constructed MTEditBench and the MagicBrush dataset demonstrate that IMAGAgent achieves performance significantly superior to existing methods in terms of instruction consistency, editing precision, and overall quality. The code is available at https://github.com/hackermmzz/IMAGAgent.git.
Authors:Jagadish Kashinath Kamble, Jayanta Mukhopadhyay, Debaditya Roy, Partha Pratim Das
Abstract:
Preserving intangible cultural dances rooted in centuries of tradition and governed by strict structural and symbolic rules presents unique challenges in the digital era. Among these, Bharatanatyam, a classical Indian dance form, stands out for its emphasis on codified adavus and precise key postures. Accurately generating these postures is crucial not only for maintaining anatomical and stylistic integrity, but also for enabling effective documentation, analysis, and transmission to broader global audiences through digital means. We propose a pose‑aware generative framework integrated with a pose estimation module, guided by keypoint‑based loss and pose consistency constraints. These supervisory signals ensure anatomical accuracy and stylistic integrity in the synthesized outputs. We evaluate four configurations: standard conditional generative adversarial network (cGAN), cGAN with pose supervision, conditional diffusion, and conditional diffusion with pose supervision. Each model is conditioned on key posture class labels and optimized to maintain geometric structure. In both cGAN and conditional diffusion settings, the integrated pose guidance aligns generated poses with ground‑truth keypoint structures, promoting cultural fidelity. Our results demonstrate that incorporating pose supervision significantly enhances the quality, realism, and authenticity of generated Bharatanatyam postures. This framework provides a scalable approach for the digital preservation, education, and dissemination of traditional dance forms, enabling high‑fidelity generation without compromising cultural precision. Code is available at https://github.com/jagidsh/Generating‑Key‑Postures‑of‑Bharatanatyam‑Adavus‑with‑Pose‑Estimation.
Authors:Linda Zeng, Steven Y. Feng, Michael C. Frank
Abstract:
Multilingualism is incredibly common around the world, leading to many important theoretical and practical questions about how children learn multiple languages at once. For example, does multilingual acquisition lead to delays in learning? Are there better and worse ways to structure multilingual input? Many correlational studies address these questions, but it is surprisingly difficult to get definitive answers because children cannot be randomly assigned to be multilingual and data are typically not matched between languages. We use language model training as a method for simulating a variety of highly controlled exposure conditions, and create matched 100M‑word mono‑ and bilingual datasets using synthetic data and machine translation. We train GPT‑2 models on monolingual and bilingual data organized to reflect a range of exposure regimes, and evaluate their performance on perplexity, grammaticality, and semantic knowledge. Across model scales and measures, bilingual models perform similarly to monolingual models in one language, but show strong performance in the second language as well. These results suggest that there are no strong differences between different bilingual exposure regimes, and that bilingual input poses no in‑principle challenges for agnostic statistical learners.
Authors:Xiao Liu, Xiaowei Fu, Fuxiang Huang, Lei Zhang
Abstract:
Network traffic classification using self‑supervised pre‑training models based on Masked Autoencoders (MAE) has demonstrated a huge potential. However, existing methods are confined to isolated byte‑level reconstruction of individual flows, lacking adequate perception of the multi‑granularity contextual relationship in traffic. To address this limitation, we propose Mean MAE (MMAE), a teacher‑student MAE paradigm with flow mixing strategy for building encrypted traffic pre‑training model. MMAE employs a self‑distillation mechanism for teacher‑student interaction, where the teacher provides unmasked flow‑level semantic supervision to advance the student from local byte reconstruction to multi‑granularity comprehension. To break the information bottleneck in individual flows, we introduce a dynamic Flow Mixing (FlowMix) strategy to replace traditional random masking mechanism. By constructing challenging cross‑flow mixed samples with interferences, it compels the model to learn discriminative representations from distorted tokens. Furthermore, we design a Packet‑importance aware Mask Predictor (PMP) equipped with an attention bias mechanism that leverages packet‑level side‑channel statistics to dynamically mask tokens with high semantic density. Numerous experiments on a number of datasets covering encrypted applications, malware, and attack traffic demonstrate that MMAE achieves state‑of‑the‑art performance. The code is available at https://github.com/lx6c78/MMAE
Authors:Steven Y. Feng, Alvin W. M. Tan, Michael C. Frank
Abstract:
Modern language models (LMs) must be trained on many orders of magnitude more words of training data than human children receive before they begin to produce useful behavior. Assessing the nature and origins of this "data gap" requires benchmarking LMs on human‑scale datasets to understand how linguistic knowledge emerges from children's natural training data. Using transcripts from the BabyView dataset (videos from children ages 6‑36 months), we investigate (1) scaling performance at child‑scale data regimes, (2) variability in model performance across datasets from different children's experiences and linguistic predictors of dataset quality, and (3) relationships between model and child language learning outcomes. LMs trained on child data show acceptable scaling for grammar tasks, but lower scaling on semantic and world knowledge tasks than models trained on synthetic data; we also observe substantial variability on data from different children. Beyond dataset size, performance is most associated with a combination of distributional and interactional linguistic features, broadly consistent with what makes high‑quality input for child language development. Finally, model likelihoods for individual words correlate with children's learning of those words, suggesting that properties of child‑directed input may influence both model learning and human language development. Overall, understanding what properties make language data efficient for learning can enable more powerful small‑scale language models while also shedding light on human language acquisition.
Authors:Qing He, Xiaowei Fu, Lei Zhang
Abstract:
Encrypted traffic classification is a critical task for network security. While deep learning has advanced this field, the occlusion of payload semantics by encryption severely challenges standard modeling approaches. Most existing frameworks rely on static and homogeneous pipelines that apply uniform parameter sharing and static fusion strategies across all inputs. This one‑size‑fits‑all static design is inherently flawed: by forcing structured headers and randomized payloads into a unified processing pipeline, it inevitably entangles the raw protocol signals with stochastic encryption noise, thereby degrading the fine‑grained discriminative features. In this paper, we propose TrafficMoE, a framework that breaks through the bottleneck of static modeling by establishing a Disentangle‑Filter‑Aggregate (DFA) paradigm. Specifically, to resolve the structural between‑components conflict, the architecture disentangles headers and payloads using dual‑branch sparse Mixture‑of‑Experts (MoE), enabling modality‑specific modeling. To mitigate the impact of stochastic noise, an uncertainty‑aware filtering mechanism is introduced to quantify reliability and selectively suppress high‑variance representations. Finally, to overcome the limitations of static fusion, a routing‑guided strategy aggregates cross‑modality features dynamically, that adaptively weighs contributions based on traffic context. With this DFA paradigm, TrafficMoE maximizes representational efficiency by focusing solely on the most discriminative traffic features. Extensive experiments on six datasets demonstrate TrafficMoE consistently outperforms state‑of‑the‑art methods, validating the necessity of heterogeneity‑aware modeling in encrypted traffic analysis. The source code is publicly available at https://github.com/Posuly/TrafficMoE_main.
Authors:Ziliang Guo, Ziheng Li, Bo Tang, Feiyu Xiong, Zhiyu Li
Abstract:
Memory‑augmented Large Language Models (LLMs) are essential for developing capable, long‑term AI agents. Recently, applying Reinforcement Learning (RL) to optimize memory operations, such as extraction, updating, and retrieval, has emerged as a highly promising research direction. However, existing implementations remain highly fragmented and task‑specific, lacking a unified infrastructure to streamline the integration, training, and evaluation of these complex pipelines. To address this gap, we present MemFactory, the first unified, highly modular training and inference framework specifically designed for memory‑augmented agents. Inspired by the success of unified fine‑tuning frameworks like LLaMA‑Factory, MemFactory abstracts the memory lifecycle into atomic, plug‑and‑play components, enabling researchers to seamlessly construct custom memory agents via a "Lego‑like" architecture. Furthermore, the framework natively integrates Group Relative Policy Optimization (GRPO) to fine‑tune internal memory management policies driven by multi‑dimensional environmental rewards. MemFactory provides out‑of‑the‑box support for recent cutting‑edge paradigms, including Memory‑R1, RMM, and MemAgent. We empirically validate MemFactory on the open‑source MemAgent architecture using its publicly available training and evaluation data. Across the evaluation sets, MemFactory improves performance over the corresponding base models on average, with relative gains of up to 14.8%. By providing a standardized, extensible, and easy‑to‑use infrastructure, MemFactory significantly lowers the barrier to entry, paving the way for future innovations in memory‑driven AI agents.
Authors:Qiyuan Zhuang, He-Yang Xu, Yijun Wang, Xin-Yang Zhao, Yang-Yang Li, Xiu-Shen Wei
Abstract:
Understanding object affordances is essential for enabling robots to perform purposeful and fine‑grained interactions in diverse and unstructured environments. However, existing approaches either rely on retrieval, which is fragile due to sparsity and coverage gaps, or on large‑scale models, which frequently mislocalize contact points and mispredict post‑contact actions when applied to unseen categories, thereby hindering robust generalization. We introduce Retrieval‑Augmented Affordance Prediction (RAAP), a framework that unifies affordance retrieval with alignment‑based learning. By decoupling static contact localization and dynamic action direction, RAAP transfers contact points via dense correspondence and predicts action directions through a retrieval‑augmented alignment model that consolidates multiple references with dual‑weighted attention. Trained on compact subsets of DROID and HOI4D with as few as tens of samples per task, RAAP achieves consistent performance across unseen objects and categories, and enables zero‑shot robotic manipulation in both simulation and the real world. Project website: https://github.com/SEU‑VIPGroup/RAAP.
Authors:Yubo Cui, Xianchao Guan, Zijun Xiong, Zheng Zhang
Abstract:
Pre‑trained vision‑language models (VLMs) exhibit strong zero‑shot generalization but remain vulnerable to adversarial perturbations. Existing classification‑guided adversarial fine‑tuning methods often disrupt pre‑trained cross‑modal alignment, weakening visual‑textual correspondence and degrading zero‑shot performance. In this paper, we propose an Alignment‑Guided Fine‑Tuning (AGFT) framework that enhances zero‑shot adversarial robustness while preserving the cross‑modal semantic structure. Unlike label‑based methods that rely on hard labels and fail to maintain the relative relationships between image and text, AGFT leverages the probabilistic predictions of the original model for text‑guided adversarial training, which aligns adversarial visual features with textual embeddings via soft alignment distributions, improving zero‑shot adversarial robustness. To address structural discrepancies introduced by fine‑tuning, we introduce a distribution consistency calibration mechanism that adjusts the robust model output to match a temperature‑scaled version of the pre‑trained model predictions. Extensive experiments across multiple zero‑shot benchmarks demonstrate that AGFT outperforms state‑of‑the‑art methods while significantly improving zero‑shot adversarial robustness.
Authors:Wei Suo, Hanzu Zhang, Lijun Zhang, Ji Ma, Peng Wang, Yanning Zhang
Abstract:
Large Vision‑Language Models have demonstrated exceptional performance in multimodal reasoning and complex scene understanding. However, these models still face significant hallucination issues, where outputs contradict visual facts. Recent research on hallucination mitigation has focused on retraining methods and Contrastive Decoding (CD) methods. While both methods perform well, retraining methods require substantial training resources, and CD methods introduce dual inference overhead. These factors hinder their practical applicability. To address the above issue, we propose a framework for dynamically detecting hallucination representations and performing hallucination‑eliminating edits on these representations. With minimal additional computational cost, we achieve state‑of‑the‑art performance on existing benchmarks. Extensive experiments demonstrate the effectiveness of our approach, highlighting its efficient and robust hallucination elimination capability and its powerful controllability over hallucinations. Code is available at https://github.com/ASGO‑MM/HIRE
Authors:Seungwoo Yoon, Jinmo Kim, Jaesik Park
Abstract:
In this paper, we propose Extend3D, a training‑free pipeline for 3D scene generation from a single image, built upon an object‑centric 3D generative model. To overcome the limitations of fixed‑size latent spaces in object‑centric models for representing wide scenes, we extend the latent space in the x and y directions. Then, by dividing the extended latent space into overlapping patches, we apply the object‑centric 3D generative model to each patch and couple them at each time step. Since patch‑wise 3D generation with image conditioning requires strict spatial alignment between image and latent patches, we initialize the scene using a point cloud prior from a monocular depth estimator and iteratively refine occluded regions through SDEdit. We discovered that treating the incompleteness of 3D structure as noise during 3D refinement enables 3D completion via a concept, which we term under‑noising. Furthermore, to address the sub‑optimality of object‑centric models for sub‑scene generation, we optimize the extended latent during denoising, ensuring that the denoising trajectories remain consistent with the sub‑scene dynamics. To this end, we introduce 3D‑aware optimization objectives for improved geometric structure and texture fidelity. We demonstrate that our method yields better results than prior methods, as evidenced by human preference and quantitative experiments.
Authors:Guozhi Qiu, Zhiwei Chen, Zixu Li, Qinlei Huang, Zhiheng Fu, Xuemeng Song, Yupeng Hu
Abstract:
Composed Image Retrieval (CIR) uses a reference image and a modification text as a query to retrieve a target image satisfying the requirement of ``modifying the reference image according to the text instructions''. However, existing CIR methods face two limitations: (1) frequency bias leading to ``Rare Sample Neglect'', and (2) susceptibility of similarity scores to interference from hard negative samples and noise. To address these limitations, we confront two key challenges: asymmetric rare semantic localization and robust similarity estimation under hard negative samples. To solve these challenges, we propose the Modification frEquentation‑rarity baLance neTwork MELT. MELT assigns increased attention to rare modification semantics in multimodal contexts while applying diffusion‑based denoising to hard negative samples with high similarity scores, enhancing multimodal fusion and matching. Extensive experiments on two CIR benchmarks validate the superior performance of MELT. Codes are available at https://github.com/luckylittlezhi/MELT.
Authors:Zhuowen Liang, Xiaotian Lin, Zhengxuan Zhang, Yuyu Luo, Haixun Wang, Nan Tang
Abstract:
Large language models (LLMs) are widely applied to data analytics over documents, yet direct reasoning over long, noisy documents remains brittle and error‑prone. Hence, we study document question answering (QA) that consolidates dispersed evidence into a structured output (e.g., a table, graph, or chunks) to support reliable, verifiable QA. We propose a two‑pillar framework, LiteCoST, to achieve both high accuracy and low latency with small language models (SLMs). Pillar 1: Chain‑of‑Structured‑Thought (CoST). We introduce a CoST template, a schema‑aware instruction that guides a strong LLM to produce both a step‑wise CoST trace and the corresponding structured output. The process induces a minimal structure, normalizes entities/units, aligns records, serializes the output, and verifies/refines it, yielding auditable supervision. Pillar 2: SLM fine‑tuning. The compact models are trained on LLM‑generated CoST data in two stages: Supervised Fine‑Tuning for structural alignment, followed by Group Relative Policy Optimization (GRPO) incorporating triple rewards for answer/format quality and process consistency. By distilling structure‑first behavior into SLMs, this approach achieves LLM‑comparable quality on multi‑domain long‑document QA using 3B/7B SLMs, while delivering 2‑4x lower latency than GPT‑4o and DeepSeek‑R1 (671B). The code is available at https://github.com/HKUSTDial/LiteCoST.
Authors:Harsh Mankodiya, Chase Gallik, Theodoros Galanos, Andriy Mulyar
Abstract:
The AEC‑Bench is a multimodal benchmark for evaluating agentic systems on real‑world tasks in the Architecture, Engineering, and Construction (AEC) domain. The benchmark covers tasks requiring drawing understanding, cross‑sheet reasoning, and construction project‑level coordination. This report describes the benchmark motivation, dataset taxonomy, evaluation protocol, and baseline results across several domain‑specific foundation model harnesses. We use AEC‑Bench to identify consistent tools and harness design techniques that uniformly improve performance across foundation models in their own base harnesses, such as Claude Code and Codex. We openly release our benchmark dataset, agent harness, and evaluation code for full replicability at https://github.com/nomic‑ai/aec‑bench under an Apache 2 license.
Authors:Kuangshi Ai, Haichao Miao, Kaiyuan Tang, Nathaniel Gorski, Jianxin Sun, Guoxi Liu, Helgi I. Ingolfsson, David Lenz, Hanqi Guo, Hongfeng Yu, Teja Leburu, Michael Molash, Bei Wang, Tom Peterka, Chaoli Wang, Shusen Liu
Abstract:
Recent advances in large language models (LLMs) have enabled agentic systems that translate natural language intent into executable scientific visualization (SciVis) tasks. Despite rapid progress, the community lacks a principled and reproducible benchmark for evaluating these emerging SciVis agents in realistic, multi‑step analysis settings. We present SciVisAgentBench, a comprehensive and extensible benchmark for evaluating scientific data analysis and visualization agents. Our benchmark is grounded in a structured taxonomy spanning four dimensions: application domain, data type, complexity level, and visualization operation. It currently comprises 108 expert‑crafted cases covering diverse SciVis scenarios. To enable reliable assessment, we introduce a multimodal outcome‑centric evaluation pipeline that combines LLM‑based judging with deterministic evaluators, including image‑based metrics, code checkers, rule‑based verifiers, and case‑specific evaluators. We also conduct a validity study with 12 SciVis experts to examine the agreement between human and LLM judges. Using this framework, we evaluate representative SciVis agents and general‑purpose coding agents to establish initial baselines and reveal capability gaps. SciVisAgentBench is designed as a living benchmark to support systematic comparison, diagnose failure modes, and drive progress in agentic SciVis. The benchmark is available at https://scivisagentbench.github.io/.
Authors:Iordanis Fostiropoulos, Muhammad Rafay Azhar, Abdalaziz Sawwan, Boyu Fang, Yuchen Liu, Jiayi Liu, Hanchao Yu, Qi Guo, Jianyu Wang, Fei Liu, Xiangjun Fan
Abstract:
We introduce GISTBench, a benchmark for evaluating Large Language Models' (LLMs) ability to understand users from their interaction histories in recommendation systems. Unlike traditional RecSys benchmarks that focus on item prediction accuracy, our benchmark evaluates how well LLMs can extract and verify user interests from engagement data. We propose two novel metric families: Interest Groundedness (IG), decomposed into precision and recall components to separately penalize hallucinated interest categories and reward coverage, and Interest Specificity (IS), which assesses the distinctiveness of verified LLM‑predicted user profiles. We release a synthetic dataset constructed on real user interactions on a global short‑form video platform. Our dataset contains both implicit and explicit engagement signals and rich textual descriptions. We validate our dataset fidelity against user surveys, and evaluate eight open‑weight LLMs spanning 7B to 120B parameters. Our findings reveal performance bottlenecks in current LLMs, particularly their limited ability to accurately count and attribute engagement signals across heterogeneous interaction types.
Authors:Bharath Krishnamurthy, Ajita Rattani
Abstract:
Recent multimodal face generation models address the spatial control limitations of text‑to‑image diffusion models by augmenting text‑based conditioning with spatial priors such as segmentation masks, sketches, or edge maps. This multimodal fusion enables controllable synthesis aligned with both high‑level semantic intent and low‑level structural layout. However, most existing approaches typically extend pre‑trained text‑to‑image pipelines by appending auxiliary control modules or stitching together separate uni‑modal networks. These ad hoc designs inherit architectural constraints, duplicate parameters, and often fail under conflicting modalities or mismatched latent spaces, limiting their ability to perform synergistic fusion across semantic and spatial domains. We introduce MMFace‑DiT, a unified dual‑stream diffusion transformer engineered for synergistic multimodal face synthesis. Its core novelty lies in a dual‑stream transformer block that processes spatial (mask/sketch) and semantic (text) tokens in parallel, deeply fusing them through a shared Rotary Position‑Embedded (RoPE) Attention mechanism. This design prevents modal dominance and ensures strong adherence to both text and structural priors to achieve unprecedented spatial‑semantic consistency for controllable face generation. Furthermore, a novel Modality Embedder enables a single cohesive model to dynamically adapt to varying spatial conditions without retraining. MMFace‑DiT achieves a 40% improvement in visual fidelity and prompt alignment over six state‑of‑the‑art multimodal face generation models, establishing a flexible new paradigm for end‑to‑end controllable generative modeling. The code and dataset are available on our project page: https://vcbsl.github.io/MMFace‑DiT/
Authors:He Yang, Dongyi Lv, Song Ma, Wei Xi, Jizhong Zhao
Abstract:
Dataset condensation aims to synthesize compact yet informative datasets that retain the training efficacy of full‑scale data, offering substantial gains in efficiency. Recent studies reveal that the condensation process can be vulnerable to backdoor attacks, where malicious triggers are injected into the condensation dataset, manipulating model behavior during inference. While prior approaches have made progress in balancing attack success rate and clean test accuracy, they often fall short in preserving stealthiness, especially in concealing the visual artifacts of condensed data or the perturbations introduced during inference. To address this challenge, we introduce Sneakdoor, which enhances stealthiness without compromising attack effectiveness. Sneakdoor exploits the inherent vulnerability of class decision boundaries and incorporates a generative module that constructs input‑aware triggers aligned with local feature geometry, thereby minimizing detectability. This joint design enables the attack to remain imperceptible to both human inspection and statistical detection. Extensive experiments across multiple datasets demonstrate that Sneakdoor achieves a compelling balance among attack success rate, clean test accuracy, and stealthiness, substantially improving the invisibility of both the synthetic data and triggered samples while maintaining high attack efficacy. The code is available at https://github.com/XJTU‑AI‑Lab/SneakDoor.
Authors:Joonhyung Bae
Abstract:
The global landscape of art‑technology institutions, including festivals, biennials, research labs, conferences, and hybrid organizations, has grown increasingly diverse, yet systematic frameworks for analyzing their multidimensional characteristics remain scarce. This paper proposes ASTRA (Art‑technology Institution Spatial Taxonomy and Relational Analysis), a computational methodology combining an eight‑axis conceptual framework (Curatorial Philosophy, Territorial Relation, Knowledge Production Mode, Institutional Genealogy, Temporal Orientation, Ecosystem Function, Audience Relation, and Disciplinary Positioning) with a text‑embedding and clustering pipeline to map 78 cultural‑technology institutions into a unified analytical space. Each institution is characterized through qualitative descriptions along the eight axes, encoded via E5‑large‑v2 sentence embeddings and quantized through a word‑level codebook into TF‑IDF feature vectors. Dimensionality reduction using UMAP, followed by agglomerative clustering (Average linkage, k=10), yields a composite score of 0.825, a silhouette coefficient of 0.803, and a Calinski‑Harabasz index of 11196. Non‑negative matrix factorization extracts ten latent topics, and a neighbor‑cluster entropy measure identifies boundary institutions bridging multiple thematic communities. An interactive React‑based tool enables curators, researchers, and policymakers to explore institutional similarities and cross‑disciplinary connections. Results reveal coherent groupings such as an art‑science hub cluster anchored by ZKM and ArtScience Museum, an innovation and industry cluster including Ars Electronica, transmediale, and Sonar, an ACM academic cluster comprising TEI, DIS, and NIME, and an electronic music cluster including CTM Festival, MUTEK, and Sonic Acts. Code and data: https://github.com/joonhyungbae/astra
Authors:Leye Wang, Zixing Wang, Anjie Xu
Abstract:
This technical report presents SkillTester, a tool for evaluating the utility and security of agent skills. Its evaluation framework combines paired baseline and with‑skill execution conditions with a separate security probe suite. Grounded in a comparative utility principle and a user‑facing simplicity principle, the framework normalizes raw execution artifacts into a utility score, a security score, and a three‑level security status label. More broadly, it can be understood as a comparative quality‑assurance harness for agent skills in an agent‑first world. The public service is deployed at https://skilltester.ai, and the broader project is maintained at https://github.com/skilltester‑ai/skilltester.
Authors:Jiaqi Tan, Yudong Luo, Sophia Huang, Yifan Yang, Hang Ma
Abstract:
Double‑Deck Multi‑Agent Pickup and Delivery (DD‑MAPD) models the multi‑robot shelf rearrangement problem in automated warehouses. MAPF‑DECOMP is a recent framework that first computes collision‑free shelf trajectories with a MAPF solver and then assigns agents to execute them. While efficient, it enforces strict trajectory dependencies, often leading to poor execution quality due to idle agents and unnecessary shelf switching. We introduce CREST, a new execution framework that achieves more continuous shelf carrying by proactively releasing trajectory constraints during execution. Experiments on diverse warehouse layouts show that CREST consistently outperforms MAPF‑DECOMP, reducing metrics related to agent travel, makespan, and shelf switching by up to 40.5%, 33.3%, and 44.4%, respectively, with even greater benefits under lift/place overhead. These results underscore the importance of execution‑aware constraint release for scalable warehouse rearrangement. Code and data are available at https://github.com/ChristinaTan0704/CREST.
Authors:Huanxuan Liao, Zhongtao Jiang, Yupu Hao, Yuqiao Tan, Shizhu He, Ben Wang, Jun Zhao, Kun Xu, Kang Liu
Abstract:
Multimodal Large Language Models (MLLMs) achieve stronger visual understanding by scaling input fidelity, yet the resulting visual token growth makes jointly sustaining high spatial resolution and long temporal context prohibitive. We argue that the bottleneck lies not in how post‑encoding representations are compressed but in the volume of pixels the encoder receives, and address it with ResAdapt, an Input‑side adaptation framework that learns how much visual budget each frame should receive before encoding. ResAdapt couples a lightweight Allocator with an unchanged MLLM backbone, so the backbone retains its native visual‑token interface while receiving an operator‑transformed input. We formulate allocation as a contextual bandit and train the Allocator with Cost‑Aware Policy Optimization (CAPO), which converts sparse rollout feedback into a stable accuracy‑cost learning signal. Across budget‑controlled video QA, temporal grounding, and image reasoning tasks, ResAdapt improves low‑budget operating points and often lies on or near the efficiency‑accuracy frontier, with the clearest gains on reasoning‑intensive benchmarks under aggressive compression. Notably, ResAdapt supports up to 16x more frames at the same visual budget while delivering over 15% performance gain. Code is available at https://github.com/Xnhyacinth/ResAdapt.
Authors:Han Wang, Yifan Sun, Brian Ko, Mann Talati, Jiawen Gong, Zimeng Li, Naicheng Yu, Xucheng Yu, Wei Shen, Vedant Jolly, Huan Zhang
Abstract:
Large language models (LLMs) can generate chains of thought (CoTs) that are not always causally responsible for their final outputs. When such a mismatch occurs, the CoT no longer faithfully reflects the actual reasons (i.e., decision‑critical factors) driving the model's behavior, leading to the reduced CoT monitorability problem. However, a comprehensive and fully open‑source benchmark for thoroughly evaluating CoT monitorability remains lacking. To address this gap, we propose MonitorBench, a systematic benchmark for evaluating CoT monitorability in LLMs. MonitorBench provides: (1) a diverse set of 1,514 test instances with carefully designed decision‑critical factors across 19 tasks spanning 7 categories to characterize when CoTs can be used to monitor the factors driving LLM behavior; and (2) two stress‑test settings to quantify the extent to which CoT monitorability can be degraded. Extensive experiments across multiple popular LLMs with varying capabilities show that CoT monitorability is higher when the decision‑critical factors shape the intermediate reasoning process without merely influencing the final answer. More capable LLMs tend to exhibit lower monitorability. And all evaluated LLMs can intentionally reduce monitorability under stress‑tests, with monitorability dropping by up to 30% in some tasks that do not require structural reasoning over the decision‑critical factors. Overall, MonitorBench provides a basis for further research on evaluating future LLMs, studying advanced stress‑test monitorability techniques, and developing new monitoring approaches. The code is available at https://github.com/ASTRAL‑Group/MonitorBench.
Authors:Masnun Nuha Chowdhury, Nusrat Jahan Beg, Umme Hunny Khan, Syed Rifat Raiyan, Md Kamrul Hasan, Hasan Mahmud
Abstract:
Large language models (LLMs) remain unreliable for high‑stakes claim verification due to hallucinations and shallow reasoning. While retrieval‑augmented generation (RAG) and multi‑agent debate (MAD) address this, they are limited by one‑pass retrieval and unstructured debate dynamics. We propose a courtroom‑style multi‑agent framework, PROClaim, that reformulates verification as a structured, adversarial deliberation. Our approach integrates specialized roles (e.g., Plaintiff, Defense, Judge) with Progressive RAG (P‑RAG) to dynamically expand and refine the evidence pool during the debate. Furthermore, we employ evidence negotiation, self‑reflection, and heterogeneous multi‑judge aggregation to enforce calibration, robustness, and diversity. In zero‑shot evaluations on the Check‑COVID benchmark, PROClaim achieves 81.7% accuracy, outperforming standard multi‑agent debate by 10.0 percentage points, with P‑RAG driving the primary performance gains (+7.5 pp). We ultimately demonstrate that structural deliberation and model heterogeneity effectively mitigate systematic biases, providing a robust foundation for reliable claim verification. Our code and data are publicly available at https://github.com/mnc13/PROClaim.
Authors:Qing Qing, Huafei Huang, Mingliang Hou, Renqiang Luo, Mohsen Guizani
Abstract:
Graph anomaly detection (GAD) aims to identify irregular nodes or structures in attributed graphs. Neighbor information, which reflects both structural connectivity and attribute consistency with surrounding nodes, is essential for distinguishing anomalies from normal patterns. Although recent graph neural network (GNN)‑based methods incorporate such information through message passing, they often fail to explicitly model its effect or interaction with attributes, limiting detection performance. This work introduces NeiGAD, a novel plug‑and‑play module that captures neighbor information through spectral graph analysis. Theoretical insights demonstrate that eigenvectors of the adjacency matrix encode local neighbor interactions and progressively amplify anomaly signals. Based on this, NeiGAD selects a compact set of eigenvectors to construct efficient and discriminative representations. Experiments on eight real‑world datasets show that NeiGAD consistently improves detection accuracy and outperforms state‑of‑the‑art GAD methods. These results demonstrate the importance of explicit neighbor modeling and the effectiveness of spectral analysis in anomaly detection. Code is available at: https://github.com/huafeihuang/NeiGAD.
Authors:Gnankan Landry Regis N'guessan
Abstract:
Kolmogorov‑Arnold Networks (KAN) employ B‑spline bases on a fixed grid, providing no intrinsic multi‑scale decomposition for non‑smooth function approximation. We introduce Fractal Interpolation KAN (FI‑KAN), which incorporates learnable fractal interpolation function (FIF) bases from iterated function system (IFS) theory into KAN. Two variants are presented: Pure FI‑KAN (Barnsley, 1986) replaces B‑splines entirely with FIF bases; Hybrid FI‑KAN (Navascues, 2005) retains the B‑spline path and adds a learnable fractal correction. The IFS contraction parameters give each edge a differentiable fractal dimension that adapts to target regularity during training. On a Holder regularity benchmark (α\in [0.2, 2.0]), Hybrid FI‑KAN outperforms KAN at every regularity level (1.3x to 33x). On fractal targets, FI‑KAN achieves up to 6.3x MSE reduction over KAN, maintaining 4.7x advantage at 5 dB SNR. On non‑smooth PDE solutions (scikit‑fem), Hybrid FI‑KAN achieves up to 79x improvement on rough‑coefficient diffusion and 3.5x on L‑shaped domain corner singularities. Pure FI‑KAN's complementary behavior, dominating on rough targets while underperforming on smooth ones, provides controlled evidence that basis geometry must match target regularity. A fractal dimension regularizer provides interpretable complexity control whose learned values recover the true fractal dimension of each target. These results establish regularity‑matched basis design as a principled strategy for neural function approximation.
Authors:David K. Johansson
Abstract:
Single‑shot neural decoders commit to answers without iterative refinement, while chain‑of‑thought methods introduce discrete intermediate steps but lack a scalar measure of reasoning progress. We propose Energy‑Based Reasoning via Structured Latent Planning (EBRM), which models reasoning as gradient‑based optimization of a multi‑step latent trajectory z_1:T under a learned energy function E(h_x, z). The energy decomposes into per‑step compatibility, transition consistency, and trajectory smoothness terms. Training combines supervised encoder‑decoder learning with contrastive energy shaping using hard negatives, while inference performs gradient descent or Langevin dynamics over z and decodes from z_T. We identify a critical failure mode: on CNF logic satisfaction, latent planning reduces accuracy from \approx 95% to \approx 56%. This degradation arises from a distribution mismatch, where the decoder is trained on encoder outputs h_x but evaluated on planner outputs z_T that drift into unseen latent regions. We analyze this behavior through per‑step decoding, latent drift tracking, and gradient decomposition. To address it, we propose dual‑path decoder training and latent anchoring. We further introduce a six‑part ablation protocol covering component contributions, trajectory length, planner dynamics, initialization, decoder training distribution, and anchor weight. Experiments on three synthetic tasks show that energy decreases monotonically and induces structured latent trajectories on graph and logic tasks, while remaining flat on arithmetic (r = 0.073), indicating a negative result. Code is available at https://github.com/dkjo8/ebr‑via‑structured‑latent‑planning.
Authors:Minh-Khoi Do, Huy Che, Dinh-Duy Phan, Duc-Khai Lam, Duc-Lung Vu
Abstract:
Accurate and efficient perception is essential for autonomous driving, where segmentation tasks such as drivable‑area and lane segmentation provide critical cues for motion planning and control. However, achieving high segmentation accuracy while maintaining real‑time performance on low‑cost hardware remains a challenging problem. To address this issue, we introduce TwinMixing, a lightweight multi‑task segmentation model designed explicitly for drivable‑area and lane segmentation. The proposed network features a shared encoder and task‑specific decoders, enabling both feature sharing and task specialization. Within the encoder, we propose an Efficient Pyramid Mixing (EPM) module that enhances multi‑scale feature extraction through a combination of grouped convolutions, depthwise dilated convolutions and channel shuffle operations, effectively expanding the receptive field while minimizing computational cost. Each decoder adopts a Dual‑Branch Upsampling (DBU) Block composed of a learnable transposed convolution‑based Fine detailed branch and a parameter‑free bilinear interpolation‑based Coarse grained branch, achieving detailed yet spatially consistent feature reconstruction. Extensive experiments on the BDD100K dataset validate the effectiveness of TwinMixing across three configurations ‑ tiny, base, and large. Among them, the base configuration achieves the best trade‑off between accuracy and computational efficiency, reaching 92.0% mIoU for drivable‑area segmentation and 32.3% IoU for lane segmentation with only 0.43M parameters and 3.95 GFLOPs. Moreover, TwinMixing consistently outperforms existing segmentation models on the same tasks, as illustrated in Fig. 1. Thanks to its compact and modular design, TwinMixing demonstrates strong potential for real‑time deployment in autonomous driving and embedded perception systems. The source code: https://github.com/Jun0se7en/TwinMixing.
Authors:Zhang Li, Zhibo Lin, Qiang Liu, Ziyang Zhang, Shuo Zhang, Zidun Guo, Jiajun Song, Jiarui Zhang, Xiang Bai, Yuliang Liu
Abstract:
We introduce Multilingual Document Parsing Benchmark, the first benchmark for multilingual digital and photographed document parsing. Document parsing has made remarkable strides, yet almost exclusively on clean, digital, well‑formatted pages in a handful of dominant languages. No systematic benchmark exists to evaluate how models perform on digital and photographed documents across diverse scripts and low‑resource languages. MDPBench comprises 3,400 document images spanning 17 languages, diverse scripts, and varied photographic conditions, with high‑quality annotations produced through a rigorous pipeline of expert model labeling, manual correction, and human verification. To ensure fair comparison and prevent data leakage, we maintain separate public and private evaluation splits. Our comprehensive evaluation of both open‑source and closed‑source models uncovers a striking finding: while closed‑source models (notably Gemini3‑Pro) prove relatively robust, open‑source alternatives suffer dramatic performance collapse, particularly on non‑Latin scripts and real‑world photographed documents, with an average drop of 17.8% on photographed documents and 14.0% on non‑Latin scripts. These results reveal significant performance imbalances across languages and conditions, and point to concrete directions for building more inclusive, deployment‑ready parsing systems. Source available at https://github.com/Yuliang‑Liu/MultimodalOCR.
Authors:Tianle Zeng, Hanxuan Chen, Yanci Wen, Hong Zhang
Abstract:
The convergence of low‑altitude economies, embodied intelligence, and air‑ground cooperative systems creates growing demand for simulation infrastructure capable of jointly modeling aerial and ground agents within a single physically coherent environment. Existing open‑source platforms remain domain‑segregated: driving simulators lack aerial dynamics, while multirotor simulators lack realistic ground scenes. Bridge‑based co‑simulation introduces synchronization overhead and cannot guarantee strict spatial‑temporal consistency. We present CARLA‑Air, an open‑source infrastructure that unifies high‑fidelity urban driving and physics‑accurate multirotor flight within a single Unreal Engine process. The platform preserves both CARLA and AirSim native Python APIs and ROS 2 interfaces, enabling zero‑modification code reuse. Within a shared physics tick and rendering pipeline, CARLA‑Air delivers photorealistic environments with rule‑compliant traffic, socially‑aware pedestrians, and aerodynamically consistent UAV dynamics, synchronously capturing up to 18 sensor modalities across all platforms at each tick. The platform supports representative air‑ground embodied intelligence workloads spanning cooperation, embodied navigation and vision‑language action, multi‑modal perception and dataset construction, and reinforcement‑learning‑based policy training. An extensible asset pipeline allows integration of custom robot platforms into the shared world. By inheriting AirSim's aerial capabilities ‑‑ whose upstream development has been archived ‑‑ CARLA‑Air ensures this widely adopted flight stack continues to evolve within a modern infrastructure. Released with prebuilt binaries and full source: https://github.com/louiszengCN/CarlaAir
Authors:Edward Wijaya
Abstract:
Deep learning models for drug‑like molecules and proteins overwhelmingly reuse transformer architectures designed for natural language, yet whether molecular sequences benefit from different designs has not been systematically tested. We deploy autonomous architecture search via an agent across three sequence types (SMILES, protein, and English text as control), running 3,106 experiments on a single GPU. For SMILES, architecture search is counterproductive: tuning learning rates and schedules alone outperforms the full search (p = 0.001). For natural language, architecture changes drive 81% of improvement (p = 0.009). Proteins fall between the two. Surprisingly, although the agent discovers distinct architectures per domain (p = 0.004), every innovation transfers across all three domains with <1% degradation, indicating that the differences reflect search‑path dependence rather than fundamental biological requirements. We release a decision framework and open‑source toolkit for molecular modeling teams to choose between autonomous architecture search and simple hyperparameter tuning.
Authors:Junho Kim, Hosu Lee, James M. Rehg, Minsu Kim, Yong Man Ro
Abstract:
Recent progress in video large language models (Video‑LLMs) has enabled strong offline reasoning over long and complex videos. However, real‑world deployments increasingly require streaming perception and proactive interaction, where video frames arrive online and the system must decide not only what to respond, but also when to respond. In this work, we revisit proactive activation in streaming video as a structured sequence modeling problem, motivated by the observation that temporal transitions in streaming video naturally form span‑structured activation patterns. To capture this span‑level structure, we model activation signals jointly over a sliding temporal window and update them iteratively as new frames arrive. We propose STRIDE (Structured Temporal Refinement with Iterative DEnoising), which employs a lightweight masked diffusion module at the activation interface to jointly predict and progressively refine activation signals across the window. Extensive experiments on diverse streaming benchmarks and downstream models demonstrate that STRIDE shows more reliable and temporally coherent proactive responses, significantly improving when‑to‑speak decision quality in online streaming scenarios.
Authors:Xuanpu Zhao, Zhentao Tan, Dianmo Sheng, Tianxiang Chen, Yao Liu, Yue Wu, Tao Gong, Qi Chu, Nenghai Yu
Abstract:
To enhance the perception and reasoning capabilities of multimodal large language models in complex visual scenes, recent research has introduced agent‑based workflows. In these works, MLLMs autonomously utilize image cropping tool to analyze regions of interest for question answering. While existing training strategies, such as those employing supervised fine‑tuning and reinforcement learning, have made significant progress, our empirical analysis reveals a key limitation. We demonstrate the model's strong reliance on global input and its weak dependence on the details within the cropped region. To address this issue, we propose a novel two‑stage reinforcement learning framework that does not require trajectory supervision. In the first stage, we introduce the ``Information Gap" mechanism by adjusting the granularity of the global image. This mechanism trains the model to answer questions by focusing on cropped key regions, driven by the information gain these regions provide. The second stage further enhances cropping precision by incorporating a grounding loss, using a small number of bounding box annotations. Experiments show that our method significantly enhances the model's attention to cropped regions, enabling it to achieve state‑of‑the‑art performance on high‑resolution visual question‑answering benchmarks. Our method provides a more efficient approach for perceiving and reasoning fine‑grained details in MLLMs. Code is available at: https://github.com/XuanPu‑Z/LFPC.
Authors:Chongyang Zhao, Mingsong Li, Haodong Lu, Dong Gong
Abstract:
Multimodal Continual Instruction Tuning aims to continually enhance Large Vision Language Models (LVLMs) by learning from new data without forgetting previously acquired knowledge. Mixture of Experts (MoE) architectures naturally facilitate this by incrementally adding new experts and expanding routers while keeping the existing ones frozen. However, despite expert isolation, MoE‑based continual learners still suffer from forgetting due to routing‑drift: old‑task tokens become mistakenly attracted to newly added experts, degrading performance on prior tasks. We analyze the failure mode at the token level and reveal the token's dilemma: ambiguous and old tokens in new‑task data offer minimal learning benefit yet induce forgetting when routed to new experts, due to their ambiguous routing assignment during training. Motivated by this, we propose LLaVA‑DyMoE, a dynamic MoE framework that incrementally expands the MoE with drift‑aware token assignment. We characterize token types via their routing score distributions and apply targeted regularization. Specifically, a token‑level assignment guidance steers ambiguous and old tokens away from new experts to preserve established routing patterns and alleviate routing‑drift, while complementary routing score regularizations enforce expert‑group separation and promote new‑expert specialization. Extensive experiments demonstrate that our LLaVA‑DyMoE effectively mitigates routing‑drift‑induced forgetting, achieving over a 7% gain in mean final accuracy and a 12% reduction in forgetting compared to baselines. The project page is https://zhaoc5.github.io/DyMoE.
Authors:Suraj Ranganath, Vaishak Menon, Anish Patnaik
Abstract:
Self‑forcing video generation extends a short‑horizon video model to longer rollouts by repeatedly feeding generated content back in as context. This scaling path immediately exposes a systems bottleneck: the key‑value (KV) cache grows with rollout length, so longer videos require not only better generation quality but also substantially better memory behavior. We present a comprehensive empirical study of KV‑cache compression for self‑forcing video generation on a Wan2.1‑based Self‑Forcing stack. Our study covers 33 quantization and cache‑policy variants, 610 prompt‑level observations, and 63 benchmark‑level summaries across two evaluation settings: MovieGen for single‑shot 10‑second generation and StoryEval for longer narrative‑style stability. We jointly evaluate peak VRAM, runtime, realized compression ratio, VBench imaging quality, BF16‑referenced fidelity (SSIM, LPIPS, PSNR), and terminal drift. Three findings are robust. First, the strongest practical operating region is a FlowCache‑inspired soft‑prune INT4 adaptation, which reaches 5.42‑5.49x compression while reducing peak VRAM from 19.28 GB to about 11.7 GB with only modest runtime overhead. Second, the highest‑fidelity compressed methods, especially PRQ_INT4 and QUAROT_KV_INT4, are not the best deployment choices because they preserve quality at severe runtime or memory cost. Third, nominal compression alone is not sufficient: several methods shrink KV storage but still exceed BF16 peak VRAM because the current integration reconstructs or retains large BF16 buffers during attention and refresh stages. The result is a benchmark harness, analysis workflow, and empirical map of which KV‑cache ideas are practical today and which are promising research directions for better memory integration. Code, data products, and the presentation dashboard are available at https://github.com/suraj‑ranganath/kv‑quant‑longhorizon/.
Authors:Zhongying Deng, Cheng Tang, Ziyan Huang, Jiashi Lin, Ying Chen, Junzhi Ning, Chenglong Ma, Jiyao Liu, Wei Li, Yinghao Zhu, Shujian Gao, Yanyan Huang, Sibo Ju, Yanzhou Su, Pengcheng Chen, Wenhao Tang, Tianbin Li, Haoyu Wang, Yuanfeng Ji, Hui Sun, Shaobo Min, Liang Peng, Feilong Tang, Haochen Xue, Rulin Zhou, Chaoyang Zhang, Wenjie Li, Shaohao Rui, Weijie Ma, Xingyue Zhao, Yibin Wang, Kun Yuan, Zhaohui Lu, Shujun Wang, Jinjie Wei, Lihao Liu, Dingkang Yang, Lin Wang, Yulong Li, Haolin Yang, Yiqing Shen, Lequan Yu, Xiaowei Hu, Yun Gu, Yicheng Wu, Benyou Wang, Minghui Zhang, Angelica I. Aviles-Rivero, Qi Gao, Hongming Shan, Xiaoyu Ren, Fang Yan, Hongyu Zhou, Haodong Duan, Maosong Cao, Shanshan Wang, Bin Fu, Xiaomeng Li, Zhi Hou, Chunfeng Song, Lei Bai, Yuan Cheng, Yuandong Pu, Xiang Li, Wenhai Wang, Hao Chen, Jiaxin Zhuang, Songyang Zhang, Huiguang He, Mengzhang Li, Bohan Zhuang, Zhian Bai, Rongshan Yu, Liansheng Wang, Yukun Zhou, Xiaosong Wang, Xin Guo, Guanbin Li, Xiangru Lin, Dakai Jin, Mianxin Liu, Wenlong Zhang, Qi Qin, Conghui He, Yuqiang Li, Ye Luo, Nanqing Dong, Jie Xu, Wenqi Shao, Bo Zhang, Qiujuan Yan, Yihao Liu, Jun Ma, Zhi Lu, Yuewen Cao, Zongwei Zhou, Jianming Liang, Shixiang Tang, Qi Duan, Dongzhan Zhou, Chen Jiang, Yuyin Zhou, Yanwu Xu, Jiancheng Yang, Shaoting Zhang, Xiaohong Liu, Siqi Luo, Yi Xin, Chaoyu Liu, Haochen Wen, Xin Chen, Alejandro Lozano, Min Woo Sun, Yuhui Zhang, Yue Yao, Xiaoxiao Sun, Serena Yeung-Levy, Xia Li, Jing Ke, Chunhui Zhang, Zongyuan Ge, Ming Hu, Jin Ye, Zhifeng Li, Yirong Chen, Yu Qiao, Junjun He
Abstract:
Foundation models have demonstrated remarkable success across diverse domains and tasks, primarily due to the thrive of large‑scale, diverse, and high‑quality datasets. However, in the field of medical imaging, the curation and assembling of such medical datasets are highly challenging due to the reliance on clinical expertise and strict ethical and privacy constraints, resulting in a scarcity of large‑scale unified medical datasets and hindering the development of powerful medical foundation models. In this work, we present the largest survey to date of medical image datasets, covering over 1,000 open‑access datasets with a systematic catalog of their modalities, tasks, anatomies, annotations, limitations, and potential for integration. Our analysis exposes a landscape that is modest in scale, fragmented across narrowly scoped tasks, and unevenly distributed across organs and modalities, which in turn limits the utility of existing medical image datasets for developing versatile and robust medical foundation models. To turn fragmentation into scale, we propose a metadata‑driven fusion paradigm (MDFP) that integrates public datasets with shared modalities or tasks, thereby transforming multiple small data silos into larger, more coherent resources. Building on MDFP, we release an interactive discovery portal that enables end‑to‑end, automated medical image dataset integration, and compile all surveyed datasets into a unified, structured table that clearly summarizes their key characteristics and provides reference links, offering the community an accessible and comprehensive repository. By charting the current terrain and offering a principled path to dataset consolidation, our survey provides a practical roadmap for scaling medical imaging corpora, supporting faster data discovery, more principled dataset creation, and more capable medical foundation models.
Authors:Naveen Mysore
Abstract:
Reinforcement learning algorithms assume that observations satisfy the Markov property, yet real‑world sensors frequently violate this assumption through correlated noise, latency, or partial observability. Standard performance metrics conflate Markov breakdowns with other sources of suboptimality, leaving practitioners without diagnostic tools for such violations. This paper introduces a prediction‑based scoring method that quantifies non‑Markovian structure in observation trajectories. A random forest first removes nonlinear Markov‑compliant dynamics; ridge regression then tests whether historical observations reduce prediction error on the residuals beyond what the current observation provides. The resulting score is bounded in [0, 1] and requires no causal graph construction. Evaluation spans six environments (CartPole, Pendulum, Acrobot, HalfCheetah, Hopper, Walker2d), three algorithms (PPO, A2C, SAC), controlled AR(1) noise at six intensity levels, and 10 seeds per condition. In post‑hoc detection, 7 of 16 environment‑algorithm pairs, primarily high‑dimensional locomotion tasks, show significant positive monotonicity between noise intensity and the violation score (Spearman rho up to 0.78, confirmed under repeated‑measures analysis); under training‑time noise, 13 of 16 pairs exhibit statistically significant reward degradation. An inversion phenomenon is documented in low‑dimensional environments where the random forest absorbs the noise signal, causing the score to decrease as true violations grow, a failure mode analyzed in detail. A practical utility experiment demonstrates that the proposed score correctly identifies partial observability and guides architecture selection, fully recovering performance lost to non‑Markovian observations. Source code to reproduce all results is provided at https://github.com/NAVEENMN/Markovianes.
Authors:Mohamad Zbib, Mohamad Bazzi, Ammar Mohanna, Hasan Abed Al Kader Hammoud, Bernard Ghanem
Abstract:
Speculative decoding accelerates autoregressive generation by letting a lightweight draft model propose future tokens that a larger target model then verifies in parallel. In practice, however, draft models are usually trained on broad generic corpora, which leaves it unclear how much speculative decoding quality depends on the draft training distribution. We study this question with lightweight HASS and EAGLE‑2 drafters trained on MathInstruct, ShareGPT, and mixed‑data variants, evaluated on MT‑Bench, GSM8K, MATH‑500, and SVAMP. Measured by acceptance length, task‑specific training yields clear specialization: MathInstruct‑trained drafts are strongest on reasoning benchmarks, while ShareGPT‑trained drafts are strongest on MT‑Bench. Mixed‑data training improves robustness, but larger mixtures do not dominate across decoding temperatures. We also study how to combine specialized drafters at inference time. Naive checkpoint averaging performs poorly, whereas confidence‑based routing improves over single‑domain drafts and merged‑tree verification yields the highest acceptance length overall for both backbones. Finally, confidence is a more useful routing signal than entropy: rejected tokens tend to have higher entropy, but confidence produces much clearer benchmark‑level routing decisions. These results show that speculative decoding quality depends not only on draft architecture, but also on the match between draft training data and downstream workload, and that specialized drafters are better combined at inference time than in weight space.
Authors:E. M. Freeburg
Abstract:
Large language models produce em dashes at varying rates, and the observation that some models "overuse" them has become one of the most widely discussed markers of AI‑generated text. Yet no mechanistic account of this pattern exists, and the parallel observation that LLMs default to markdown‑formatted output has never been connected to it. We propose that the em dash is markdown leaking into prose ‑‑ the smallest surviving unit of the structural orientation that LLMs acquire from markdown‑saturated training corpora. We present a five‑step genealogy connecting training data composition, structural internalization, the dual‑register status of the em dash, and post‑training amplification. We test this with a two‑condition suppression experiment across twelve models from five providers (Anthropic, OpenAI, Meta, Google, DeepSeek): when models are instructed to avoid markdown formatting, overt features (headers, bullets, bold) are eliminated or nearly eliminated, but em dashes persist ‑‑ except in Meta's Llama models, which produce none at all. Em dash frequency and suppression resistance vary from 0.0 per 1,000 words (Llama) to 9.1 (GPT‑4.1 under suppression), functioning as a signature of the specific fine‑tuning procedure applied. A three‑condition suppression gradient shows that even explicit em dash prohibition fails to eliminate the artifact in some models, and a base‑vs‑instruct comparison confirms that the latent tendency exists pre‑RLHF. These findings connect two previously isolated online discourses and reframe em dash frequency as a diagnostic of fine‑tuning methodology rather than a stylistic defect.
Authors:Dongsheng Yang, Yinfeng Yu, Liejun Wang
Abstract:
Vision‑and‑Language Navigation (VLN) requires an agent to navigate through complex unseen environments based on natural language instructions. However, existing methods often struggle to effectively capture key semantic cues and accurately align them with visual observations. To address this limitation, we propose Beyond Textual Knowledge (BTK), a VLN framework that synergistically integrates environment‑specific textual knowledge with generative image knowledge bases. BTK employs Qwen3‑4B to extract goal‑related phrases and utilizes Flux‑Schnell to construct two large‑scale image knowledge bases: R2R‑GP and REVERIE‑GP. Additionally, we leverage BLIP‑2 to construct a large‑scale textual knowledge base derived from panoramic views, providing environment‑specific semantic cues. These multimodal knowledge bases are effectively integrated via the Goal‑Aware Augmentor and Knowledge Augmentor, significantly enhancing semantic grounding and cross‑modal alignment. Extensive experiments on the R2R dataset with 7,189 trajectories and the REVERIE dataset with 21,702 instructions demonstrate that BTK significantly outperforms existing baselines. On the test unseen splits of R2R and REVERIE, SR increased by 5% and 2.07% respectively, and SPL increased by 4% and 3.69% respectively. The source code is available at https://github.com/yds3/IPM‑BTK/.
Authors:Hai-Son Nguyen-Le, Hung-Cuong Nguyen-Thanh, Nhien-An Le-Khac, Dinh-Thuc Nguyen, Hong-Hanh Nguyen-Le
Abstract:
The rapid advancement of generative models has enabled highly realistic audio deepfakes, yet current detectors suffer from a critical bias problem, leading to poor generalization across unseen datasets. This paper proposes Artifact‑Focused Self‑Synthesis (AFSS), a method designed to mitigate this bias by generating pseudo‑fake samples from real audio via two mechanisms: self‑conversion and self‑reconstruction. The core insight of AFSS lies in enforcing same‑speaker constraints, ensuring that real and pseudo‑fake samples share identical speaker identity and semantic content. This forces the detector to focus exclusively on generation artifacts rather than irrelevant confounding factors. Furthermore, we introduce a learnable reweighting loss to dynamically emphasize synthetic samples during training. Extensive experiments across 7 datasets demonstrate that AFSS achieves state‑of‑the‑art performance with an average EER of 5.45%, including a significant reduction to 1.23% on WaveFake and 2.70% on In‑the‑Wild, all while eliminating the dependency on pre‑collected fake datasets. Our code is publicly available at https://github.com/NguyenLeHaiSonGit/AFSS.
Authors:PengYu Chen, Shang Wan, Xiaohou Shi, Yuan Chang, Yan Sun, Sajal K. Das
Abstract:
Time series anomaly detection (TSAD) is essential for maintaining the reliability and security of IoT‑enabled service systems. Existing methods require training one specific model for each dataset, which exhibits limited generalization capability across different target datasets, hindering anomaly detection performance in various scenarios with scarce training data. To address this limitation, foundation models have emerged as a promising direction. However, existing approaches either repurpose large language models (LLMs) or construct largescale time series datasets to develop general anomaly detection foundation models, and still face challenges caused by severe cross‑modal gaps or in‑domain heterogeneity. In this paper, we investigate the applicability of large‑scale vision models to TSAD. Specifically, we adapt a visual Masked Autoencoder (MAE) pretrained on ImageNet to the TSAD task. However, directly transferring MAE to TSAD introduces two key challenges: overgeneralization and limited local perception. To address these challenges, we propose VAN‑AD, a novel MAE‑based framework for TSAD. To alleviate the over‑generalization issue, we design an Adaptive Distribution Mapping Module (ADMM), which maps the reconstruction results before and after MAE into a unified statistical space to amplify discrepancies caused by abnormal patterns. To overcome the limitation of local perception, we further develop a Normalizing Flow Module (NFM), which combines MAE with normalizing flow to estimate the probability density of the current window under the global distribution. Extensive experiments on nine real‑world datasets demonstrate that VAN‑AD consistently outperforms existing state‑of‑the‑art methods across multiple evaluation metrics.We make our code and datasets available at https://github.com/PenyChen/VAN‑AD.
Authors:Yuntao Shou, Jun Zhou, Tao Meng, Wei Ai, Keqin Li
Abstract:
Multimodal Emotion Recognition in Conversations (MERC) aims to predict speakers' emotional states in multi‑turn dialogues through text, audio, and visual cues. In real‑world settings, conversation scenarios differ significantly in speakers, topics, styles, and noise levels. Existing MERC methods generally neglect these cross‑scenario variations, limiting their ability to transfer models trained on a source domain to unseen target domains. To address this issue, we propose a Dual‑branch Graph Domain Adaptation framework (DGDA) for multimodal emotion recognition under cross‑scenario conditions. We first construct an emotion interaction graph to characterize complex emotional dependencies among utterances. A dual‑branch encoder, consisting of a hypergraph neural network (HGNN) and a path neural network (PathNN), is then designed to explicitly model multivariate relationships and implicitly capture global dependencies. To enable out‑of‑domain generalization, a domain adversarial discriminator is introduced to learn invariant representations across domains. Furthermore, a regularization loss is incorporated to suppress the negative influence of noisy labels. To the best of our knowledge, DGDA is the first MERC framework that jointly addresses domain shift and label noise. Theoretical analysis provides tighter generalization bounds, and extensive experiments on IEMOCAP and MELD demonstrate that DGDA consistently outperforms strong baselines and better adapts to cross‑scenario conversations. Our code is available at https://github.com/Xudmm1239439/DGDA‑Net.
Authors:Jiwen Zhang, Xiangyu Shi, Siyuan Wang, Zerui Li, Zhongyu Wei, Qi Wu
Abstract:
Vision‑and‑Language Navigation (VLN) has recently benefited from Multimodal Large Language Models (MLLMs), enabling zero‑shot navigation. While recent exploration‑based zero‑shot methods have shown promising results by leveraging global scene priors, they rely on high‑quality human‑crafted scene reconstructions, which are impractical for real‑world robot deployment. When encountering an unseen environment, a robot should build its own priors through pre‑exploration. However, these self‑built reconstructions are inevitably incomplete and noisy, which severely degrade methods that depend on high‑quality scene reconstructions. To address these issues, we propose SpatialAnt, a zero‑shot navigation framework designed to bridge the gap between imperfect self‑reconstructions and robust execution. SpatialAnt introduces a physical grounding strategy to recover the absolute metric scale for monocular‑based reconstructions. Furthermore, rather than treating the noisy self‑reconstructed scenes as absolute spatial references, we propose a novel visual anticipation mechanism. This mechanism leverages the noisy point clouds to render future observations, enabling the agent to perform counterfactual reasoning and prune paths that contradict human instructions. Extensive experiments in both simulated and real‑world environments demonstrate that SpatialAnt significantly outperforms existing zero‑shot methods. We achieve a 66% Success Rate (SR) on R2R‑CE and 50.8% SR on RxR‑CE benchmarks. Physical deployment on a Hello Robot further confirms the efficiency and efficacy of our framework, achieving a 52% SR in challenging real‑world settings.
Authors:Yifei Dong, Fengyi Wu, Yilong Dai, Lingdong Kong, Guangyu Chen, Xu Zhu, Qiyu Hu, Tianyu Wang, Johnalbert Garnica, Feng Liu, Siyu Huang, Qi Dai, Zhi-Qi Cheng
Abstract:
We study language‑conditioned visual navigation (LCVN), in which an embodied agent is asked to follow a natural language instruction based only on an initial egocentric observation. Without access to goal images, the agent must rely on language to shape its perception and continuous control, making the grounding problem particularly challenging. We formulate this problem as open‑loop trajectory prediction conditioned on linguistic instructions and introduce the LCVN Dataset, a benchmark of 39,016 trajectories and 117,048 human‑verified instructions that supports reproducible research across a range of environments and instruction styles. Using this dataset, we develop LCVN frameworks that link language grounding, future‑state prediction, and action generation through two complementary model families. The first family combines LCVN‑WM, a diffusion‑based world model, with LCVN‑AC, an actor‑critic agent trained in the latent space of the world model. The second family, LCVN‑Uni, adopts an autoregressive multimodal architecture that predicts both actions and future observations. Experiments show that these families offer different advantages: the former provides more temporally coherent rollouts, whereas the latter generalizes better to unseen environments. Taken together, these observations point to the value of jointly studying language grounding, imagination, and policy learning in a unified task setting, and LCVN provides a concrete basis for further investigation of language‑conditioned world models. The code is available at https://github.com/F1y1113/LCVN.
Authors:Shihua Zhang, Qiuhong Shen, Shizun Wang, Tianbo Pan, Xinchao Wang
Abstract:
Empowered by large‑scale training, vision‑language models (VLMs) achieve strong image and video understanding, yet their ability to perform spatial reasoning in both static scenes and dynamic videos remains limited. Recent advances try to handle this limitation by injecting geometry tokens from pretrained 3D foundation models into VLMs. Nevertheless, we observe that naive token fusion followed by standard fine‑tuning in this line of work often leaves such geometric cues underutilized for spatial reasoning, as VLMs tend to rely heavily on 2D visual cues. In this paper, we propose GeoSR, a framework designed to make geometry matter by encouraging VLMs to actively reason with geometry tokens. GeoSR introduces two key components: (1) Geometry‑Unleashing Masking, which strategically masks portions of 2D vision tokens during training to weaken non‑geometric shortcuts and force the model to consult geometry tokens for spatial reasoning; and (2) Geometry‑Guided Fusion, a gated routing mechanism that adaptively amplifies geometry token contributions in regions where geometric evidence is critical. Together, these designs unleash the potential of geometry tokens for spatial reasoning tasks. Extensive experiments on both static and dynamic spatial reasoning benchmarks demonstrate that GeoSR consistently outperforms prior methods and establishes new state‑of‑the‑art performance by effectively leveraging geometric information. The project page is available at https://suhzhang.github.io/GeoSR/.
Authors:Moritz Nottebaum, Matteo Dunnhofer, Christian Micheloni
Abstract:
Vision backbone networks play a central role in modern computer vision. Enhancing their efficiency directly benefits a wide range of downstream applications. To measure efficiency, many publications rely on MACs (Multiply Accumulate operations) as a predictor of execution time. In this paper, we experimentally demonstrate the shortcomings of such a metric, especially in the context of edge devices. By contrasting the MAC count and execution time of common architectural design elements, we identify key factors for efficient execution and provide insights to optimize backbone design. Based on these insights, we present LowFormer, a novel vision backbone family. LowFormer features a streamlined macro and micro design that includes Lowtention, a lightweight alternative to Multi‑Head Self‑Attention. Lowtention not only proves more efficient, but also enables superior results on ImageNet. Additionally, we present an edge GPU version of LowFormer, that can further improve upon its baseline's speed on edge GPU and desktop GPU. We demonstrate LowFormer's wide applicability by evaluating it on smaller image classification datasets, as well as adapting it to several downstream tasks, such as object detection, semantic segmentation, image retrieval, and visual object tracking. LowFormer models consistently achieve remarkable speed‑ups across various hardware platforms compared to recent state‑of‑the‑art backbones. Code and models are available at https://github.com/altair199797/LowFormer/blob/main/Beyond_MACs.md.
Authors:Doğaç Eldenk, Stephen Xia
Abstract:
Developing and evaluating distributed inference algorithms remains difficult due to the lack of standardized tools for modeling heterogeneous devices and networks. Existing studies often rely on ad‑hoc testbeds or proprietary infrastructure, making results hard to reproduce and limiting exploration of hypothetical hardware or network configurations. We present UNIFERENCE, a discrete‑event simulation (DES) framework designed for developing, benchmarking, and deploying distributed AI models within a unified environment. UNIFERENCE models device and network behavior through lightweight logical processes that synchronize only on communication primitives, eliminating rollbacks while preserving the causal order. It integrates seamlessly with PyTorch Distributed, enabling the same codebase to transition from simulation to real deployment. Our evaluation demonstrates that UNIFERENCE profiles runtime with up to 98.6% accuracy compared to real physical deployments across diverse backends and hardware setups. By bridging simulation and deployment, UNIFERENCE provides an accessible, reproducible platform for studying distributed inference algorithms and exploring future system designs, from high‑performance clusters to edge‑scale devices. The framework is open‑sourced at https://github.com/Dogacel/Uniference.
Authors:Siddhartha Laghuvarapu, Rohan Deb, Jimeng Sun
Abstract:
Uncertainty quantification is essential for deploying machine learning models in high‑stakes domains such as scientific discovery and healthcare. Conformal Prediction (CP) provides finite‑sample coverage guarantees under exchangeability, an assumption often violated in practice due to distribution shift. Under covariate shift, restoring validity requires importance weighting, yet accurate density‑ratio estimation becomes unstable when training and test distributions exhibit limited support overlap. We propose KMM‑CP, a conformal prediction framework based on Kernel Mean Matching (KMM) for covariate‑shift correction. We show that KMM directly controls the bias‑variance components governing conformal coverage error by minimizing RKHS moment discrepancy under explicit weight constraints, and establish asymptotic coverage guarantees under mild conditions. We then introduce a selective extension that identifies regions of reliable support overlap and restricts conformal correction to this subset, further improving stability in low‑overlap regimes. Experiments on molecular property prediction benchmarks with realistic distribution shifts show that KMM‑CP reduces coverage gap by over 50% compared to existing approaches. The code is available at https://github.com/siddharthal/KMM‑CP.
Authors:Shuai Lv, Chang Liu, Feng Tang, Yujie Yuan, Aojun Zhou, Kui Zhang, Xi Yang, Yangqiu Song
Abstract:
Multimodal Large Language Models (MLLMs) achieve strong multimodal reasoning performance, yet we identify a recurring failure mode in long‑form generation: as outputs grow longer, models progressively drift away from image evidence and fall back on textual priors, resulting in ungrounded reasoning and hallucinations. Interestingly, Based on attention analysis, we find that MLLMs have a latent capability for late‑stage visual verification that is present but not consistently activated. Motivated by this observation, we propose Visual Re‑Examination (VRE), a self‑evolving training framework that enables MLLMs to autonomously perform visual introspection during reasoning without additional visual inputs. Rather than distilling visual capabilities from a stronger teacher, VRE promotes iterative self‑improvement by leveraging the model itself to generate reflection traces, making visual information actionable through information gain. Extensive experiments across diverse multimodal benchmarks demonstrate that VRE consistently improves reasoning accuracy and perceptual reliability, while substantially reducing hallucinations, especially in long‑chain settings. Code is available at https://github.com/Xiaobu‑USTC/VRE.
Authors:JiHyeok Jung, TaeYoung Yoon, HyunSouk Cho
Abstract:
Legal reasoning requires not only the application of legal rules but also an understanding of the context in which those rules operate. However, existing legal benchmarks primarily evaluate rule application under the assumption of fixed norms, and thus fail to capture situations where legal judgments shift or where multiple norms interact. In this work, we propose CALRK‑Bench, a context‑aware legal reasoning benchmark based on the legal system in Korean. CALRK‑Bench evaluates whether models can identify the temporal validity of legal norms, determine whether sufficient legal information is available for a given case, and understand the reasons behind shifts in legal judgments. The dataset is constructed from legal precedents and legal consultation records, and is validated by legal experts. Experimental results show that even recent large language models consistently exhibit low performance on these three tasks. CALRK‑Bench provides a new stress test for evaluating context‑aware legal reasoning rather than simple memorization of legal knowledge. Our code is available at https://github.com/jhCOR/CALRKBench.
Authors:Harunori Kawano, Takeshi Sasaki
Abstract:
While self‑supervised learning (SSL) has revolutionized audio representation, the excessive parameterization and quadratic computational cost of standard Transformers limit their deployment on resource‑constrained devices. To address this bottleneck, we propose HEAR (Human‑inspired Efficient Audio Representation), a novel decoupled architecture. Inspired by the human cognitive ability to isolate local acoustic features from global context, HEAR splits the processing pipeline into two dedicated modules: an Acoustic Model for local feature extraction and a Task Model for global semantic integration. Coupled with an Acoustic Tokenizer trained via knowledge distillation, our approach enables robust Masked Audio Modeling (MAM). Extensive experiments demonstrate that HEAR requires only 15M parameters and 9.47 GFLOPs for inference, operating at a fraction of the computational cost of conventional foundation models (which typically require 85M‑94M parameters). Despite this high efficiency, HEAR achieves highly competitive performance across diverse audio classification benchmarks. The code and pre‑trained models are available at https://github.com/HarunoriKawano/HEAR
Authors:Kang Liu, Zhuoqi Ma, Siyu Liang, Yunan Li, Xiyue Gao, Chao Liang, Kun Xie, Qiguang Miao
Abstract:
Despite recent advances in medical vision‑language pretraining, existing models still struggle to capture the diagnostic workflow: radiographs are typically treated as context‑agnostic images, while radiologists' gaze ‑‑ a crucial cue for visual reasoning ‑‑ remains largely underexplored by existing methods. These limitations hinder the modeling of disease‑specific patterns and weaken cross‑modal alignment. To bridge this gap, we introduce CoGaze, a Context‑ and Gaze‑guided vision‑language pretraining framework for chest X‑rays. We first propose a context‑infused vision encoder that models how radiologists integrate clinical context ‑‑ including patient history, symptoms, and diagnostic intent ‑‑ to guide diagnostic reasoning. We then present a multi‑level supervision paradigm that (1) enforces intra‑ and inter‑modal semantic alignment through hybrid‑positive contrastive learning, (2) injects diagnostic priors via disease‑aware cross‑modal representation learning, and (3) leverages radiologists' gaze as probabilistic priors to guide attention toward diagnostically salient regions. Extensive experiments demonstrate that CoGaze consistently outperforms state‑of‑the‑art methods across diverse tasks, achieving up to +2.0% CheXbertF1 and +1.2% BLEU2 for free‑text and structured report generation, +23.2% AUROC for zero‑shot classification, and +12.2% Precision@1 for image‑text retrieval. Code is available at https://github.com/mk‑runner/CoGaze.
Authors:Mahesh Bhosale, Abdul Wasi, Shantam Srivastava, Shifa Latif, Tianyu Luan, Mingchen Gao, David Doermann, Xuan Gong
Abstract:
While powerful in image‑conditioned generation, multimodal large language models (MLLMs) can display uneven performance across demographic groups, highlighting fairness risks. In safety‑critical clinical settings, such disparities risk producing unequal diagnostic narratives and eroding trust in AI‑assisted decision‑making. While fairness has been studied extensively in vision‑only and language‑only models, its impact on MLLMs remains largely underexplored. To address these biases, we introduce FairLLaVA, a parameter‑efficient fine‑tuning method that mitigates group disparities in visual instruction tuning without compromising overall performance. By minimizing the mutual information between target attributes, FairLLaVA regularizes the model's representations to be demographic‑invariant. The method can be incorporated as a lightweight plug‑in, maintaining efficiency with low‑rank adapter fine‑tuning, and provides an architecture‑agnostic approach to fair visual instruction following. Extensive experiments on large‑scale chest radiology report generation and dermoscopy visual question answering benchmarks show that FairLLaVA consistently reduces inter‑group disparities while improving both equity‑scaled clinical performance and natural language generation quality across diverse medical imaging modalities. Code can be accessed at https://github.com/bhosalems/FairLLaVA.
Authors:Trong Thang Pham, Hien Nguyen, Ngan Le
Abstract:
Current multimodal large language models (MLLMs) cannot effectively utilize eye‑gaze information for video understanding, even when gaze cues are supplied via visual overlays or text descriptions. We introduce GazeQwen, a parameter efficient approach that equips an open‑source MLLM with gaze awareness through hidden‑state modulation. At its core is a compact gaze resampler (~1‑5 M trainable parameters) that encodes V‑JEPA 2.1 video features together with fixation‑derived positional encodings and produces additive residuals injected into selected LLM decoder layers via forward hooks. An optional second training stage adds low‑rank adapters (LoRA) to the LLM for tighter integration. Evaluated on all 10 tasks of the StreamGaze benchmark, GazeQwen reaches 63.9% accuracy, a +16.1 point gain over the same Qwen2.5‑VL‑7B backbone with gaze as visual prompts and +10.5 points over GPT‑4o, the highest score among all open‑source and proprietary models tested. These results suggest that learning where to inject gaze within an LLM is more effective than scaling model size or engineering better prompts. All code and checkpoints are available at https://github.com/phamtrongthang123/gazeqwen .
Authors:Haonan Han, Jiancheng Huang, Xiaopeng Sun, Junyan He, Rui Yang, Jie Hu, Xiaojiang Peng, Lin Ma, Xiaoming Wei, Xiu Li
Abstract:
Beneath the stunning visual fidelity of modern AIGC models lies a "logical desert", where systems fail tasks that require physical, causal, or complex spatial reasoning. Current evaluations largely rely on superficial metrics or fragmented benchmarks, creating a ``performance mirage'' that overlooks the generative process. To address this, we introduce ViGoR Vision‑Gnerative Reasoning‑centric Benchmark), a unified framework designed to dismantle this mirage. ViGoR distinguishes itself through four key innovations: 1) holistic cross‑modal coverage bridging Image‑to‑Image and Video tasks; 2) a dual‑track mechanism evaluating both intermediate processes and final results; 3) an evidence‑grounded automated judge ensuring high human alignment; and 4) granular diagnostic analysis that decomposes performance into fine‑grained cognitive dimensions. Experiments on over 20 leading models reveal that even state‑of‑the‑art systems harbor significant reasoning deficits, establishing ViGoR as a critical ``stress test'' for the next generation of intelligent vision models. The demo have been available at https://vincenthancoder.github.io/ViGoR‑Bench/
Authors:Sicheng Zuo, Yuxuan Li, Wenzhao Zheng, Zheng Zhu, Jie Zhou, Jiwen Lu
Abstract:
Vision‑language‑action models have reshaped autonomous driving to incorporate languages into the decision‑making process. However, most existing pipelines only utilize the language modality for scene descriptions or reasoning and lack the flexibility to follow diverse user instructions for personalized driving. To address this, we first construct a large‑scale driving dataset (InstructScene) containing around 100,000 scenes annotated with diverse driving instructions with the corresponding trajectories. We then propose a unified Vision‑Language‑World‑Action model, Vega, for instruction‑based generation and planning. We employ the autoregressive paradigm to process visual inputs (vision) and language instructions (language) and the diffusion paradigm to generate future predictions (world modeling) and trajectories (action). We perform joint attention to enable interactions between the modalities and use individual projection layers for different modalities for more capabilities. Extensive experiments demonstrate that our method not only achieves superior planning performance but also exhibits strong instruction‑following abilities, paving the way for more intelligent and personalized driving systems.
Authors:Zehao Wang, Huaide Jiang, Shuaiwu Dong, Yuping Wang, Hang Qiu, Jiachen Li
Abstract:
Human driving behavior is inherently personal, which is shaped by long‑term habits and influenced by short‑term intentions. Individuals differ in how they accelerate, brake, merge, yield, and overtake across diverse situations. However, existing end‑to‑end autonomous driving systems either optimize for generic objectives or rely on fixed driving modes, lacking the ability to adapt to individual preferences or interpret natural language intent. To address this gap, we propose Drive My Way (DMW), a personalized Vision‑Language‑Action (VLA) driving framework that aligns with users' long‑term driving habits and adapts to real‑time user instructions. DMW learns a user embedding from our personalized driving dataset collected across multiple real drivers and conditions the policy on this embedding during planning, while natural language instructions provide additional short‑term guidance. Closed‑loop evaluation on the Bench2Drive benchmark demonstrates that DMW improves style instruction adaptation, and user studies show that its generated behaviors are recognizable as each driver's own style, highlighting personalization as a key capability for human‑centered autonomous driving. Our data and code are available at https://dmw‑cvpr.github.io/.
Authors:Xiaofeng Mao, Shaohao Rui, Kaining Ying, Bo Zheng, Chuanhao Li, Mingmin Chi, Kaipeng Zhang
Abstract:
Autoregressive video diffusion models have demonstrated remarkable progress, yet they remain bottlenecked by intractable linear KV‑cache growth, temporal repetition, and compounding errors during long‑video generation. To address these challenges, we present PackForcing, a unified framework that efficiently manages the generation history through a novel three‑partition KV‑cache strategy. Specifically, we categorize the historical context into three distinct types: (1) Sink tokens, which preserve early anchor frames at full resolution to maintain global semantics; (2) Mid tokens, which achieve a massive spatiotemporal compression (32x token reduction) via a dual‑branch network fusing progressive 3D convolutions with low‑resolution VAE re‑encoding; and (3) Recent tokens, kept at full resolution to ensure local temporal coherence. To strictly bound the memory footprint without sacrificing quality, we introduce a dynamic top‑k context selection mechanism for the mid tokens, coupled with a continuous Temporal RoPE Adjustment that seamlessly re‑aligns position gaps caused by dropped tokens with negligible overhead. Empowered by this principled hierarchical context compression, PackForcing can generate coherent 2‑minute, 832x480 videos at 16 FPS on a single H200 GPU. It achieves a bounded KV cache of just 4 GB and enables a remarkable 24x temporal extrapolation (5s to 120s), operating effectively either zero‑shot or trained on merely 5‑second clips. Extensive results on VBench demonstrate state‑of‑the‑art temporal consistency (26.07) and dynamic degree (56.25), proving that short‑video supervision is sufficient for high‑quality, long‑video synthesis. https://github.com/ShandaAI/PackForcing
Authors:Jiabin Hua, Hengyuan Xu, Aojie Li, Wei Cheng, Gang Yu, Xingjun Ma, Yu-Gang Jiang
Abstract:
Fine‑grained facial expression editing has long been limited by intrinsic semantic overlap. To address this, we construct the Flex Facial Expression (FFE) dataset with continuous affective annotations and establish FFE‑Bench to evaluate structural confusion, editing accuracy, linear controllability, and the trade‑off between expression editing and identity preservation. We propose PixelSmile, a diffusion framework that disentangles expression semantics via fully symmetric joint training. PixelSmile combines intensity supervision with contrastive learning to produce stronger and more distinguishable expressions, achieving precise and stable linear expression control through textual latent interpolation. Extensive experiments demonstrate that PixelSmile achieves superior disentanglement and robust identity preservation, confirming its effectiveness for continuous, controllable, and fine‑grained expression editing, while naturally supporting smooth expression blending.
Authors:Kaijin Chen, Dingkang Liang, Xin Zhou, Yikang Ding, Xiaoqiang Liu, Pengfei Wan, Xiang Bai
Abstract:
Video world models have shown immense potential in simulating the physical world, yet existing memory mechanisms primarily treat environments as static canvases. When dynamic subjects hide out of sight and later re‑emerge, current methods often struggle, leading to frozen, distorted, or vanishing subjects. To address this, we introduce Hybrid Memory, a novel paradigm requiring models to simultaneously act as precise archivists for static backgrounds and vigilant trackers for dynamic subjects, ensuring motion continuity during out‑of‑view intervals. To facilitate research in this direction, we construct HM‑World, the first large‑scale video dataset dedicated to hybrid memory. It features 59K high‑fidelity clips with decoupled camera and subject trajectories, encompassing 17 diverse scenes, 49 distinct subjects, and meticulously designed exit‑entry events to rigorously evaluate hybrid coherence. Furthermore, we propose HyDRA, a specialized memory architecture that compresses memory into tokens and utilizes a spatiotemporal relevance‑driven retrieval mechanism. By selectively attending to relevant motion cues, HyDRA effectively preserves the identity and motion of hidden subjects. Extensive experiments on HM‑World demonstrate that our method significantly outperforms state‑of‑the‑art approaches in both dynamic subject consistency and overall generation quality. Code is publicly available at https://github.com/H‑EmbodVis/HyDRA.
Authors:Zhi Zeng, Yifei Yang, Jiaying Wu, Xulang Zhang, Xiangzheng Kong, Herun Wan, Zihan Ma, Minnan Luo
Abstract:
The rise of micro‑videos has reshaped how misinformation spreads, amplifying its speed, reach, and impact on public trust. Existing benchmarks typically focus on a single deception type, overlooking the diversity of real‑world cases that involve multimodal manipulation, AI‑generated content, cognitive bias, and out‑of‑context reuse. Meanwhile, most detection models lack fine‑grained attribution, limiting interpretability and practical utility. To address these gaps, we introduce WildFakeBench, a large‑scale benchmark of over 10,000 real‑world micro‑videos covering diverse misinformation types and sources, each annotated with expert‑defined attribution labels. Building on this foundation, we develop FakeAgent, a Delphi‑inspired multi‑agent reasoning framework that integrates multimodal understanding with external evidence for attribution‑grounded analysis. FakeAgent jointly analyzes content and retrieved evidence to identify manipulation, recognize cognitive and AI‑generated patterns, and detect out‑of‑context misinformation. Extensive experiments show that FakeAgent consistently outperforms existing MLLMs across all misinformation types, while WildFakeBench provides a realistic and challenging testbed for advancing explainable micro‑video misinformation detection. Data and code are available at: https://github.com/Aiyistan/FakeAgent.
Authors:Paulo Roberto de Moura Júnior, Jean Lelong, Annabelle Blangero
Abstract:
The effectiveness of Retrieval‑Augmented Generation (RAG) is highly dependent on how documents are chunked, that is, segmented into smaller units for indexing and retrieval. Yet, commonly used "one‑size‑fits‑all" approaches often fail to capture the nuanced structure and semantics of diverse texts. Despite its central role, chunking lacks a dedicated evaluation framework, making it difficult to assess and compare strategies independently of downstream performance. We challenge this paradigm by introducing Adaptive Chunking, a framework that selects the most suitable chunking strategy for each document based on a set of five novel intrinsic, document‑based metrics: References Completeness (RC), Intrachunk Cohesion (ICC), Document Contextual Coherence (DCC), Block Integrity (BI), and Size Compliance (SC), which directly assess chunking quality across key dimensions. To support this framework, we also introduce two new chunkers, an LLM‑regex splitter and a split‑then‑merge recursive splitter, alongside targeted post‑processing techniques. On a diverse corpus spanning legal, technical, and social science domains, our metric‑guided adaptive method significantly improves downstream RAG performance. Without changing models or prompts, our framework increases RAG outcomes, raising answers correctness to 72% (from 62‑64%) and increasing the number of successfully answered questions by over 30% (65 vs. 49). These results demonstrate that adaptive, document‑aware chunking, guided by a complementary suite of intrinsic metrics, offers a practical and effective path to more robust RAG systems. Code available at https://github.com/ekimetrics/adaptive‑chunking.
Authors:Hector Borobia, Elies Seguí-Mas, Guillermina Tormo-Carbó
Abstract:
Weight pruning is a standard technique for compressing large language models, yet its effect on learned internal representations remains poorly understood. We present the first systematic study of how unstructured pruning reshapes the feature geometry of language models, using Sparse Autoencoders (SAEs) as interpretability probes. Across three model families (Gemma 3 1B, Gemma 2 2B, Llama 3.2 1B), two pruning methods (magnitude and Wanda), and six sparsity levels (0‑‑60%), we investigate five research questions spanning seed stability, feature survival, SAE transferability, feature fragility, and causal relevance. Our most striking finding is that rare SAE features‑‑those with low firing rates‑‑survive pruning far better than frequent ones, with within‑condition Spearman correlations of rho = ‑1.0 in 11 of 17 experimental conditions. This counter‑intuitive result suggests that pruning acts as implicit feature selection, preferentially destroying high‑frequency generic features while preserving specialized rare ones. We further show that Wanda pruning preserves feature structure up to 3.7x better than magnitude pruning, that pre‑trained SAEs remain viable on Wanda‑pruned models up to 50% sparsity, and that geometric feature survival does not predict causal importance‑‑a dissociation with implications for interpretability under compression.
Authors:Shigeng Wang, Chao Li, Yangyuxuan Kang, Jiawei Fan, Zhonghong Ou, Anbang Yao
Abstract:
In this paper, we address post‑training quantization (PTQ) for large language models (LLMs) from an overlooked perspective: given a pre‑trained high‑precision LLM, the predominant sequential quantization framework treats different layers equally, but this may be not optimal in challenging bit‑width settings. We empirically study the quantization impact of different layers on model accuracy, and observe that: (1) shallow/deep layers are usually more sensitive to quantization than intermediate layers; (2) among shallow/deep layers, the most sensitive one is the first/last layer, which exhibits significantly larger quantization error than others. These empirical observations imply that the quantization design for different layers of LLMs is required on multiple levels instead of a single level shared to all layers. Motivated by this, we propose a new PTQ framework termed Sliding‑layer Quantization (SliderQuant) that relies on a simple adaptive sliding quantization concept facilitated by few learnable parameters. The base component of SliderQuant is called inter‑layer sliding quantization, which incorporates three types of novel sliding window designs tailored for addressing the varying quantization sensitivity of shallow, intermediate and deep layers. The other component is called intra‑layer sliding quantization that leverages an incremental strategy to quantize each window. As a result, SliderQuant has a strong ability to reduce quantization errors across layers. Extensive experiments on basic language generation, zero‑shot commonsense reasoning and challenging math and code tasks with various LLMs, including Llama/Llama2/Llama3/Qwen2.5 model families, DeepSeek‑R1 distilled models and large MoE models, show that our method outperforms existing PTQ methods (including the latest PTQ methods using rotation transformations) for both weight‑only quantization and weight‑activation quantization.
Authors:Abhijnan Nath, Hannah VanderHoeven, Nikhil Krishnaswamy
Abstract:
We introduce CRAFT, a multi‑agent benchmark for evaluating pragmatic communication in large language models under strict partial information. In this setting, multiple agents with complementary but incomplete views must coordinate through natural language to construct a shared 3D structure that no single agent can fully observe. We formalize this problem as a multi‑sender pragmatic reasoning task and provide a diagnostic framework that decomposes failures into spatial grounding, belief modeling and pragmatic communication errors, including a taxonomy of behavioral failure profiles in both frontier and open‑weight models. Across a diverse set of models, including 8 open‑weight and 7 frontier including reasoning models, we find that stronger reasoning ability does not reliably translate to better coordination: smaller open‑weight models often match or outperform frontier systems, and improved individual communication does not guarantee successful collaboration. These results suggest that multi‑agent coordination remains a fundamentally unsolved challenge for current language models. Our code can be found at https://github.com/csu‑signal/CRAFT
Authors:Yabin Zhang, Maya Varma, Yunhe Gao, Jean-Benoit Delbrouck, Jiaming Liu, Chong Wang, Curtis Langlotz
Abstract:
Out‑of‑distribution (OOD) detection aims to identify samples that deviate from in‑distribution (ID). One popular pipeline addresses this by introducing negative labels distant from ID classes and detecting OOD based on their distance to these labels. However, such labels may present poor activation on OOD samples, failing to capture the OOD characteristics. To address this, we propose \underlineTest‑time \underlineActivated \underlineNegative \underlineLabels (TANL) by dynamically evaluating activation levels across the corpus dataset and mining candidate labels with high activation responses during the testing process. Specifically, TANL identifies high‑confidence test images online and accumulates their assignment probabilities over the corpus to construct a label activation metric. Such a metric leverages historical test samples to adaptively align with the test distribution, enabling the selection of distribution‑adaptive activated negative labels. By further exploring the activation information within the current testing batch, we introduce a more fine‑grained, batch‑adaptive variant. To fully utilize label activation knowledge, we propose an activation‑aware score function that emphasizes negative labels with stronger activations, boosting performance and enhancing its robustness to the label number. Our TANL is training‑free, test‑efficient, and grounded in theoretical justification. Experiments on diverse backbones and wide task settings validate its effectiveness. Notably, on the large‑scale ImageNet benchmark, TANL significantly reduces the FPR95 from 17.5% to 9.8%. Codes are available at \hrefhttps://github.com/YBZh/OpenOOD‑VLMYBZh/OpenOOD‑VLM.
Authors:Taejin Jeong, Joohyeok Kim, Jinyeong Kim, Chanyoung Kim, Seong Jae Hwang
Abstract:
Spatial Transcriptomics (ST) provides spatially‑resolved gene expression, offering crucial insights into tissue architecture and complex diseases. However, its prohibitive cost limits widespread adoption, leading to significant attention on inferring spatial gene expression from readily available whole slide images. While graph neural networks have been proposed to model interactions between tissue regions, their reliance on pre‑defined sparse graphs prevents them from considering potentially interacting spot pairs, resulting in a structural limitation in capturing complex biological relationships. To address this, we propose FEAST (Fully connected Expressive Attention for Spatial Transcriptomics), an attention‑based framework that models the tissue as a fully connected graph, enabling the consideration of all pairwise interactions. To better reflect biological interactions, we introduce negative‑aware attention, which models both excitatory and inhibitory interactions, capturing essential negative relationships that standard attention often overlooks. Furthermore, to mitigate the information loss from truncated or ignored context in standard spot image extraction, we introduce an off‑grid sampling strategy that gathers additional images from intermediate regions, allowing the model to capture a richer morphological context. Experiments on public ST datasets show that FEAST surpasses state‑of‑the‑art methods in gene expression prediction while providing biologically plausible attention maps that clarify positive and negative interactions. Our code is available at https://github.com/starforTJ/ FEAST.
Authors:Fanheng Kong, Jingyuan Zhang, Yang Yue, Chenxi Sun, Yang Tian, Shi Feng, Xiaocui Yang, Daling Wang, Yu Tian, Jun Du, Wenchong Zeng, Han Li, Kun Gai
Abstract:
The emergence of Large Language Models (LLMs) has catalyzed a paradigm shift in programming, giving rise to "vibe coding", where users can build complete projects and even control computers using natural language instructions. This paradigm has driven automated webpage development, but it introduces a new requirement about how to automatically verify whether the web functionalities are reliably implemented. Existing works struggle to adapt, relying on static visual similarity or predefined checklists that constrain their utility in open‑ended environments. Furthermore, they overlook a vital aspect of software quality, namely latent logical constraints. To address these gaps, we introduce WebTestBench, a benchmark for evaluating end‑to‑end automated web testing. WebTestBench encompasses comprehensive dimensions across diverse web application categories. We decompose the testing process into two cascaded sub‑tasks, checklist generation and defect detection, and propose WebTester, a baseline framework for this task. Evaluating popular LLMs with WebTester reveals severe challenges, including insufficient test completeness, detection bottlenecks, and long‑horizon interaction unreliability. These findings expose a substantial gap between current computer‑use agent capabilities and industrial‑grade deployment demands. We hope that WebTestBench provides valuable insights and guidance for advancing end‑to‑end automated web testing. Our dataset and code are available at https://github.com/friedrichor/WebTestBench.
Authors:Jiahao Tian, Chenxi Song, Wei Cheng, Chi Zhang
Abstract:
Generating long videos using pre‑trained video diffusion models, which are typically trained on short clips, presents a significant challenge. Directly applying these models for long‑video inference often leads to a notable degradation in visual quality. This paper identifies that this issue primarily stems from two out‑of‑distribution (O.O.D) problems: frame‑level relative position O.O.D and context‑length O.O.D. To address these challenges, we propose FreeLOC, a novel training‑free, layer‑adaptive framework that introduces two core techniques: Video‑based Relative Position Re‑encoding (VRPR) for frame‑level relative position O.O.D, a multi‑granularity strategy that hierarchically re‑encodes temporal relative positions to align with the model's pre‑trained distribution, and Tiered Sparse Attention (TSA) for context‑length O.O.D, which preserves both local detail and long‑range dependencies by structuring attention density across different temporal scales. Crucially, we introduce a layer‑adaptive probing mechanism that identifies the sensitivity of each transformer layer to these O.O.D issues, allowing for the selective and efficient application of our methods. Extensive experiments demonstrate that our approach significantly outperforms existing training‑free methods, achieving state‑of‑the‑art results in both temporal consistency and visual quality. Code is available at https://github.com/Westlake‑AGI‑Lab/FreeLOC.
Authors:Jie Wang, Honghua Huang, Xi Ge, Jianhui Su, Wen Liu, Shiguo Lian
Abstract:
Retrieval‑Augmented Generation (RAG) systems face significant challenges in complex reasoning, multi‑hop queries, and domain‑specific QA. While existing GraphRAG frameworks have made progress in structural knowledge organization, they still have limitations in cross‑industry adaptability, community report integrity, and retrieval performance. This paper proposes UniAI‑GraphRAG, an enhanced framework built upon open‑source GraphRAG. The framework introduces three core innovations: (1) Ontology‑Guided Knowledge Extraction that uses predefined Schema to guide LLMs in accurately identifying domain‑specific entities and relations; (2) Multi‑Dimensional Community Clustering Strategy that improves community completeness through alignment completion, attribute‑based clustering, and multi‑hop relationship clustering; (3) Dual‑Channel Graph Retrieval Fusion that balances QA accuracy and performance through hybrid graph and community retrieval. Evaluation results on MultiHopRAG benchmark show that UniAI‑GraphRAG outperforms mainstream open source solutions (e.g.LightRAG) in comprehensive F1 scores, particularly in inference and temporal queries. The code is available at https://github.com/UnicomAI/wanwu/tree/main/rag/rag_open_source/rag_core/graph.
Authors:Ranxu Zhang, Junjie Meng, Ying Sun, Ziqi Xu, Bing Yin, Hao Li, Yanyong Zhang, Chao Wang
Abstract:
Multi‑Behavior Recommendation (MBR) leverages multiple user interaction types (e.g., views, clicks, purchases) to enrich preference modeling and alleviate data sparsity issues in traditional single‑behavior approaches. However, existing MBR methods face fundamental challenges: they lack principled frameworks to model complex confounding effects from user behavioral habits and item multi‑behavior distributions, struggle with effective aggregation of heterogeneous auxiliary behaviors, and fail to align behavioral representations across semantic gaps while accounting for bias distortions. To address these limitations, we propose MCLMR, a novel model‑agnostic causal learning framework that can be seamlessly integrated into various MBR architectures. MCLMR first constructs a causal graph to model confounding effects and performs interventions for unbiased preference estimation. Under this causal framework, it employs an Adaptive Aggregation module based on Mixture‑of‑Experts to dynamically fuse auxiliary behavior information and a Bias‑aware Contrastive Learning module to align cross‑behavior representations in a bias‑aware manner. Extensive experiments on three real‑world datasets demonstrate that MCLMR achieves significant performance improvements across various baseline models, validating its effectiveness and generality. All data and code will be made publicly available. For anonymous review, our code is available at the following the link: https://github.com/gitrxh/MCLMR.
Authors:Yinyi Luo, Hrishikesh Gokhale, Marios Savvides, Jindong Wang, Shengfeng He
Abstract:
Despite significant progress in text‑to‑image generation, aligning outputs with complex prompts remains challenging, particularly for fine‑grained semantics and spatial relations. This difficulty stems from the feed‑forward nature of generation, which requires anticipating alignment without fully understanding the output. In contrast, evaluating generated images is more tractable. Motivated by this asymmetry, we propose xLARD, a self‑correcting framework that uses multimodal large language models to guide generation through Explainable LAtent RewarDs. xLARD introduces a lightweight corrector that refines latent representations based on structured feedback from model‑generated references. A key component is a differentiable mapping from latent edits to interpretable reward signals, enabling continuous latent‑level guidance from non‑differentiable image‑level evaluations. This mechanism allows the model to understand, assess, and correct itself during generation. Experiments across diverse generation and editing tasks show that xLARD improves semantic alignment and visual fidelity while maintaining generative priors. Code is available at https://yinyiluo.github.io/xLARD/.
Authors:Luyu Yang, Yutong Dai, An Yan, Viraj Prabhu, Ran Xu, Zeyuan Chen
Abstract:
The physical world is not merely visual; it is governed by rigorous structural and procedural constraints. Yet, the evaluation of vision‑language models (VLMs) remains heavily skewed toward perceptual realism, prioritizing the generation of visually plausible 3D layouts, shapes, and appearances. Current benchmarks rarely test whether models grasp the step‑by‑step processes and physical dependencies required to actually build these artifacts, a capability essential for automating design‑to‑construction pipelines. To address this, we introduce DreamHouse, a novel benchmark for physical generative reasoning: the capacity to synthesize artifacts that concurrently satisfy geometric, structural, constructability, and code‑compliance constraints. We ground this benchmark in residential timber‑frame construction, a domain with fully codified engineering standards and objectively verifiable correctness. We curate over 26,000 structures spanning 13 architectural styles, ach verified to construction‑document standards (LOD 350) and develop a deterministic 10‑test structural validation framework. Unlike static benchmarks that assess only final outputs, DreamHouse supports iterative agentic interaction. Models observe intermediate build states, generate construction actions, and receive structured environmental feedback, enabling a fine‑grained evaluation of planning, structural reasoning, and self‑correction. Extensive experiments with state‑of‑the‑art VLMs reveal substantial capability gaps that are largely invisible on existing leaderboards. These findings establish physical validity as a critical evaluation axis orthogonal to visual realism, highlighting physical generative reasoning as a distinct and underdeveloped frontier in multimodal intelligence. Available at https://luluyuyuyang.github.io/dreamhouse
Authors:Isha Puri, Mehul Damani, Idan Shenfeld, Marzyeh Ghassemi, Jacob Andreas, Yoon Kim
Abstract:
Given a question, a language model (LM) implicitly encodes a distribution over possible answers. In practice, post‑training procedures for LMs often collapse this distribution onto a single dominant mode. While this is generally not a problem for benchmark‑style evaluations that assume one correct answer, many real‑world tasks inherently involve multiple valid answers or irreducible uncertainty. Examples include medical diagnosis, ambiguous question answering, and settings with incomplete information. In these cases, we would like LMs to generate multiple plausible hypotheses, ideally with confidence estimates for each one, and without computationally intensive repeated sampling to generate non‑modal answers. This paper describes a multi‑answer reinforcement learning approach for training LMs to perform distributional reasoning over multiple answers during inference. We modify the RL objective to enable models to explicitly generate multiple candidate answers in a single forward pass, internalizing aspects of inference‑time search into the model's generative process. Across question‑answering, medical diagnostic, and coding benchmarks, we observe improved diversity, coverage, and set‑level calibration scores compared to single answer trained baselines. Models trained with our approach require fewer tokens to generate multiple answers than competing approaches. On coding tasks, they are also substantially more accurate. These results position multi‑answer RL as a principled and compute‑efficient alternative to inference‑time scaling procedures such as best‑of‑k. Code and more information can be found at https://multi‑answer‑rl.github.io/.
Authors:Yongda Fan, John Wu, Andrea Fitzpatrick, Naveen Baskaran, Jimeng Sun, Adam Cross
Abstract:
Clinical decisions are high‑stakes and require explicit justification, making model interpretability essential for auditing deep clinical models prior to deployment. As the ecosystem of model architectures and explainability methods expands, critical questions remain: Do architectural features like attention improve explainability? Do interpretability approaches generalize across clinical tasks? While prior benchmarking efforts exist, they often lack extensibility and reproducibility, and critically, fail to systematically examine how interpretability varies across the interplay of clinical tasks and model architectures. To address these gaps, we present a comprehensive benchmark evaluating interpretability methods across diverse clinical prediction tasks and model architectures. Our analysis reveals that: (1) attention when leveraged properly is a highly efficient approach for faithfully interpreting model predictions; (2) black‑box interpreters like KernelSHAP and LIME are computationally infeasible for time‑series clinical prediction tasks; and (3) several interpretability approaches are too unreliable to be trustworthy. From our findings, we discuss several guidelines on improving interpretability within clinical predictive pipelines. To support reproducibility and extensibility, we provide our implementations via PyHealth, a well‑documented open‑source framework: https://github.com/sunlabuiuc/PyHealth.
Authors:Alabi Mehzabin Anisha, Guangjing Wang, Sriram Chellappan
Abstract:
State‑of‑the‑art crowd counting and localization are primarily modeled using two paradigms: density maps and point regression. Given the field's security ramifications, there is active interest in model robustness against adversarial attacks. Recent studies have demonstrated transferability across density‑map‑based approaches via adversarial patches, but cross‑paradigm attacks (i.e., across both density map‑based models and point regression‑based models) remain unexplored. We introduce a novel adversarial framework that compromises both density map and point regression architectural paradigms through a comprehensive multi‑task loss optimization. For point‑regression models, we employ scene‑density‑specific high‑confidence logit suppression; for density‑map approaches, we use peak‑targeted density map suppression. Both are combined with model‑agnostic perceptual constraints to ensure that perturbations are effective and imperceptible to the human eye. Extensive experiments demonstrate the effectiveness of our attack, achieving on average a 7X increase in Mean Absolute Error compared to clean images while maintaining competitive visual quality, and successfully transferring across seven state‑of‑the‑art crowd models with transfer ratios ranging from 0.55 to 1.69. Our approach strikes a balance between attack effectiveness and imperceptibility compared to state‑of‑the‑art transferable attack strategies. The source code is available at https://github.com/simurgh7/CrowdGen
Authors:Yaopei Zeng, Congchao Wang, Blake JianHang Chen, Lu Lin
Abstract:
Routing has emerged as a promising strategy for balancing performance and cost in large language model (LLM) systems that combine lightweight models with powerful but expensive large models. Recent studies show that \emphprobe routing, which predicts the correctness of a small model using its hidden states, provides an effective solution in text‑only LLMs. However, we observe that these probes degrade substantially when applied to multimodal LLMs (MLLMs). Through empirical analysis, we find that the presence of visual inputs weakens the separability of correctness signals in hidden states, making them harder to extract using standard probe designs. To address this challenge, we introduce two complementary approaches for improving probe routing in MLLMs. First, we propose the \emphAttention Probe, which aggregates hidden states from the preceding layer based on attention scores to recover distributed correctness signals. Second, we present the \emphKL‑Regularized LoRA Probe (ReLope), which inserts a lightweight LoRA adapter and applies a KL regularizer to learn routing‑aware representations. Comprehensive experiments show that our methods consistently outperform baselines, suggesting that improving the quality of hidden states is key to effective routing in MLLMs. Our code is available at https://github.com/Spinozaaa/ReLope.
Authors:Daniel Benniah John
Abstract:
Efficient task scheduling in large‑scale distributed systems presents significant challenges due to dynamic workloads, heterogeneous resources, and competing quality‑of‑service requirements. Traditional centralized approaches face scalability limitations and single points of failure, while classical heuristics lack adaptability to changing conditions. This paper proposes a decentralized multi‑agent deep reinforcement learning (DRL‑MADRL) framework for task scheduling in heterogeneous distributed systems. We formulate the problem as a Decentralized Partially Observable Markov Decision Process (Dec‑POMDP) and develop a lightweight actor‑critic architecture implemented using only NumPy, enabling deployment on resource‑constrained edge devices without heavyweight machine learning frameworks. Using workload characteristics derived from the publicly available Google Cluster Trace dataset, we evaluate our approach on a 100‑node heterogeneous system processing 1,000 tasks per episode over 30 experimental runs. Experimental results demonstrate 15.6% improvement in average task completion time (30.8s vs 36.5s for random baseline), 15.2% energy efficiency gain (745.2 kWh vs 878.3 kWh), and 82.3% SLA satisfaction compared to 75.5% for baselines, with all improvements statistically significant (p < 0.001). The lightweight implementation requires only NumPy, Matplotlib, and SciPy. Complete source code and experimental data are provided for full reproducibility at https://github.com/danielbenniah/marl‑distributed‑scheduling.
Authors:Daniele Agostinelli, Thomas Agostinelli, Andrea Generosi, Maura Mengoni
Abstract:
Appearance‑based gaze estimation frequently relies on deep Convolutional Neural Networks (CNNs). These models are accurate, but computationally expensive and act as "black boxes", offering little interpretability. Geometric methods based on facial landmarks are a lightweight alternative, but their performance limits and generalization capabilities remain underexplored in modern benchmarks. In this study, we conduct a comprehensive evaluation of landmark‑based gaze estimation. We introduce a standardized pipeline to extract and normalize landmarks from three large‑scale datasets (Gaze360, ETH‑XGaze, and GazeGene) and train lightweight regression models, specifically Extreme Gradient Boosted trees and two neural architectures: a holistic Multi‑Layer Perceptron (MLP) and a siamese MLP designed to capture binocular geometry. We find that landmark‑based models exhibit lower performance in within‑domain evaluation, likely due to noise introduced into the datasets by the landmark detector. Nevertheless, in cross‑domain evaluation, the proposed MLP architectures show generalization capabilities comparable to those of ResNet18 baselines. These findings suggest that sparse geometric features encode sufficient information for robust gaze estimation, paving the way for efficient, interpretable, and privacy‑friendly edge applications. The source code and generated landmark‑based datasets are available at https://github.com/daniele‑agostinelli/LandmarkGaze.git.
Authors:Shengli Zhou, Minghang Zheng, Feng Zheng, Yang Liu
Abstract:
Spatial reasoning focuses on locating target objects based on spatial relations in 3D scenes, which plays a crucial role in developing intelligent embodied agents. Due to the limited availability of 3D scene‑language paired data, it is challenging to train models with strong reasoning ability from scratch. Previous approaches have attempted to inject 3D scene representations into the input space of Large Language Models (LLMs) and leverage the pretrained comprehension and reasoning abilities for spatial reasoning. However, models encoding absolute positions struggle to extract spatial relations from prematurely fused features, while methods explicitly encoding all spatial relations (which is quadratic in the number of objects) as input tokens suffer from poor scalability. To address these limitations, we propose QuatRoPE, a novel positional embedding method with an input length that is linear to the number of objects, and explicitly calculates pairwise spatial relations through the dot product in attention layers. QuatRoPE's holistic vector encoding of 3D coordinates guarantees a high degree of spatial consistency, maintaining fidelity to the scene's geometric integrity. Additionally, we introduce the Isolated Gated RoPE Extension (IGRE), which effectively limits QuatRoPE's influence to object‑related tokens, thereby minimizing interference with the LLM's existing positional embeddings and maintaining the LLM's original capabilities. Extensive experiments demonstrate the effectiveness of our approaches. The code and data are available at https://github.com/oceanflowlab/QuatRoPE.
Authors:Xinying Guo, Chenxi Jiang, Hyun Bin Kim, Ying Sun, Yang Xiao, Yuhang Han, Jianfei Yang
Abstract:
Robotic manipulation often requires memory: occlusion and state changes can make decision‑time observations perceptually aliased, making action selection non‑Markovian at the observation level because the same observation may arise from different interaction histories. Most embodied agents implement memory via semantically compressed traces and similarity‑based retrieval, which discards disambiguating fine‑grained perceptual cues and can return perceptually similar but decision‑irrelevant episodes. Inspired by human episodic memory, we propose Chameleon, which writes geometry‑grounded multimodal tokens to preserve disambiguating context and produces goal‑directed recall through a differentiable memory stack. We also introduce Camo‑Dataset, a real‑robot UR5e dataset spanning episodic recall, spatial tracking, and sequential manipulation under perceptual aliasing. Across tasks, Chameleon consistently improves decision reliability and long‑horizon control over strong baselines in perceptually confusable settings.
Authors:Florian Stilz, Vinkle Srivastav, Nassir Navab, Nicolas Padoy
Abstract:
Video‑language foundation models have proven to be highly effective in zero‑shot applications across a wide range of tasks. A particularly challenging area is the intraoperative surgical procedure domain, where labeled data is scarce, and precise temporal understanding is often required for complex downstream tasks. To address this challenge, we introduce CliPPER (Contextual Video‑Language Pretraining on Long‑form Intraoperative Surgical Procedures for Event Recognition), a novel video‑language pretraining framework trained on surgical lecture videos. Our method is designed for fine‑grained temporal video‑text recognition and introduces several novel pretraining strategies to improve multimodal alignment in long‑form surgical videos. Specifically, we propose Contextual Video‑Text Contrastive Learning (VTC_CTX) and Clip Order Prediction (COP) pretraining objectives, both of which leverage temporal and contextual dependencies to enhance local video understanding. In addition, we incorporate a Cycle‑Consistency Alignment over video‑text matches within the same surgical video to enforce bidirectional consistency and improve overall representation coherence. Moreover, we introduce a more refined alignment loss, Frame‑Text Matching (FTM), to improve the alignment between video frames and text. As a result, our model establishes a new state‑of‑the‑art across multiple public surgical benchmarks, including zero‑shot recognition of phases, steps, instruments, and triplets. The source code and pretraining captions can be found at https://github.com/CAMMA‑public/CliPPER.
Authors:Zichuan Lin, Feiyu Liu, Yijun Yang, Jiafei Lyu, Yiming Gao, Yicheng Liu, Zhicong Lu, Yangbin Yu, Mingyu Yang, Junyou Li, Deheng Ye, Jie Jiang
Abstract:
Autonomous mobile GUI agents have attracted increasing attention along with the advancement of Multimodal Large Language Models (MLLMs). However, existing methods still suffer from inefficient learning from failed trajectories and ambiguous credit assignment under sparse rewards for long‑horizon GUI tasks. To that end, we propose UI‑Voyager, a novel two‑stage self‑evolving mobile GUI agent. In the first stage, we employ Rejection Fine‑Tuning (RFT), which enables the continuous co‑evolution of data and models in a fully autonomous loop. The second stage introduces Group Relative Self‑Distillation (GRSD), which identifies critical fork points in group rollouts and constructs dense step‑level supervision from successful trajectories to correct failed ones. Extensive experiments on AndroidWorld show that our 4B model achieves an 81.0% Pass@1 success rate, outperforming numerous recent baselines and exceeding human‑level performance. Ablation and case studies further verify the effectiveness of GRSD. Our method represents a significant leap toward efficient, self‑evolving, and high‑performance mobile GUI automation without expensive manual data annotation.
Authors:Alexander Panfilov, Peter Romov, Igor Shilov, Yves-Alexandre de Montjoye, Jonas Geiping, Maksym Andriushchenko
Abstract:
LLM agents like Claude Code can not only write code but also be used for autonomous AI research and engineering \citeprank2026posttrainbench, novikov2025alphaevolve. We show that an \emphautoresearch‑style pipeline \citepkarpathy2026autoresearch powered by Claude Code discovers novel white‑box adversarial attack algorithms that significantly outperform all existing (30+) methods in jailbreaking and prompt injection evaluations. Starting from existing attack implementations, such as GCG~\citepzou2023universal, the agent iterates to produce new algorithms achieving up to 40% attack success rate on CBRN queries against GPT‑OSS‑Safeguard‑20B, compared to \leq10% for existing algorithms (\Creffig:teaser, left). The discovered algorithms generalize: attacks optimized on surrogate models transfer directly to held‑out models, achieving 100% ASR against Meta‑SecAlign‑70B \citepchen2025secalign versus 56% for the best baseline (\Creffig:teaser, middle). Extending the findings of~\citecarlini2025autoadvexbench, our results are an early demonstration that incremental safety and security research can be automated using LLM agents. White‑box adversarial red‑teaming is particularly well‑suited for this: existing methods provide strong starting points, and the optimization objective yields dense, quantitative feedback. We release all discovered attacks alongside baseline implementations and evaluation code at https://github.com/romovpa/claudini.
Authors:Ben Chen, Siyuan Wang, Yufei Ma, Zihan Liang, Xuxin Zhang, Yue Lv, Ying Yang, Huangyu Dai, Lingtao Mao, Tong Zhao, Zhipeng Qian, Xinyu Sun, Zhixin Zhai, Yang Zhao, Bochao Liu, Jingshan Lv, Xiao Liang, Hui Kong, Jing Chen, Han Li, Chenyi Lei, Wenwu Ou, Kun Gai
Abstract:
Generative Retrieval (GR) has emerged as a promising paradigm for modern search systems. Compared to multi‑stage cascaded architecture, it offers advantages such as end‑to‑end joint optimization and high computational efficiency. OneSearch, as a representative industrial‑scale deployed generative search framework, has brought significant commercial and operational benefits. However, its inadequate understanding of complex queries, inefficient exploitation of latent user intents, and overfitting to narrow historical preferences have limited its further performance improvement. To address these challenges, we propose OneSearch‑V2, a latent reasoning enhanced self‑distillation generative search framework. It contains three key innovations: (1) a thought‑augmented complex query understanding module, which enables deep query understanding and overcomes the shallow semantic matching limitations of direct inference; (2) a reasoning‑internalized self‑distillation training pipeline, which uncovers users' potential yet precise e‑commerce intentions beyond log‑fitting through implicit in‑context learning; (3) a behavior preference alignment optimization system, which mitigates reward hacking arising from the single conversion metric, and addresses personal preference via direct user feedback. Extensive offline evaluations demonstrate OneSearch‑V2's strong query recognition and user profiling capabilities. Online A/B tests further validate its business effectiveness, yielding +3.98% item CTR, +3.05% buyer conversion rate, and +2.11% order volume. Manual evaluation further confirms gains in search experience quality, with +1.65% in page good rate and +1.37% in query‑item relevance. More importantly, OneSearch‑V2 effectively mitigates common search system issues such as information bubbles and long‑tail sparsity, without incurring additional inference costs or serving latency.
Authors:Xingming Li, Runke Huang, Yanan Bao, Yuye Jin, Yuru Jiao, Qingyong Hu
Abstract:
High‑quality teacher‑child interaction (TCI) is fundamental to early childhood development, yet traditional expert‑based assessment faces a critical scalability challenge. In large systems like China's‑serving 36 million children across 250,000+ kindergartens‑the cost and time requirements of manual observation make continuous quality monitoring infeasible, relegating assessment to infrequent episodic audits that limit timely intervention and improvement tracking. In this paper, we investigate whether AI can serve as a scalable assessment teammate by extracting structured quality indicators and validating their alignment with human expert judgments. Our contributions include: (1) TEPE‑TCI‑370h (Tracing Effective Preschool Education), the first large‑scale dataset of naturalistic teacher‑child interactions in Chinese preschools (370 hours, 105 classrooms) with standardized ECQRS‑EC and SSTEW annotations; (2) We develop Interaction2Eval, a specialized LLM‑based framework addressing domain‑specific challenges‑child speech recognition, Mandarin homophone disambiguation, and rubric‑based reasoning‑achieving up to 88% agreement; (3) Deployment validation across 43 classrooms demonstrating an 18x efficiency gain in the assessment workflow, highlighting its potential for shifting from annual expert audits to monthly AI‑assisted monitoring with targeted human oversight. This work not only demonstrates the technical feasibility of scalable, AI‑augmented quality assessment but also lays the foundation for a new paradigm in early childhood education‑one where continuous, inclusive, AI‑assisted evaluation becomes the engine of systemic improvement and equitable growth.
Authors:Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Jing Zhang, Jun Zhang, Xing Wei, Yi Liu, Dianhai Yu, Yanjun Ma
Abstract:
Document parsing is a fine‑grained task where image resolution significantly impacts performance. While advanced research leveraging vision‑language models benefits from high‑resolution input to boost model performance, this often leads to a quadratic increase in the number of vision tokens and significantly raises computational costs. We attribute this inefficiency to substantial visual regions redundancy in document images, like background. To tackle this, we propose PaddleOCR‑VL, a novel coarse‑to‑fine architecture that focuses on semantically relevant regions while suppressing redundant ones, thereby improving both efficiency and performance. Specifically, we introduce a lightweight Valid Region Focus Module (VRFM) which leverages localization and contextual relationship prediction capabilities to identify valid vision tokens. Subsequently, we design and train a compact yet powerful 0.9B vision‑language model (PaddleOCR‑VL‑0.9B) to perform detailed recognition, guided by VRFM outputs to avoid direct processing of the entire large image. Extensive experiments demonstrate that PaddleOCR‑VL achieves state‑of‑the‑art performance in both page‑level parsing and element‑level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top‑tier VLMs, and delivers fast inference while utilizing substantially fewer vision tokens and parameters, highlighting the effectiveness of targeted coarse‑to‑fine parsing for accurate and efficient document understanding. The source code and models are publicly available at https://github.com/PaddlePaddle/PaddleOCR.
Authors:Davood Soleymanzadeh, Ivan Lopez-Sanchez, Hao Su, Yunzhu Li, Xiao Liang, Minghui Zheng
Abstract:
State‑of‑the‑art generalist manipulation policies have enabled the deployment of robotic manipulators in unstructured human environments. However, these frameworks struggle in cluttered environments primarily because they utilize auxiliary modules for low‑level motion planning and control. Motion planning remains challenging due to the high dimensionality of the robot's configuration space and the presence of workspace obstacles. Neural motion planners have enhanced motion planning efficiency by offering fast inference and effectively handling the inherent multi‑modality of the motion planning problem. Despite such benefits, current neural motion planners often struggle to generalize to unseen, out‑of‑distribution planning settings. This paper reviews and analyzes the state‑of‑the‑art neural motion planners, highlighting both their benefits and limitations. It also outlines a path toward establishing generalist neural motion planners capable of handling domain‑specific challenges. For a list of the reviewed papers, please refer to https://davoodsz.github.io/planning‑manip‑survey.github.io/.
Authors:Eyal Weiss
Abstract:
Recent work distinguishes two heterophily regimes: adversarial, where cross‑class edges dilute class signal and harm classification, and informative, where the heterophilous structure itself carries useful signal. We ask: when does per‑edge message routing help, and when is a uniform spectral channel sufficient? To operationalize this question we introduce Cost‑Sensitive Neighborhood Aggregation (CSNA), a GNN layer that computes pairwise distance in a learned projection and uses it to soft‑route each message through concordant and discordant channels with independent transformations. Under a contextual stochastic block model we show that mean aggregation can reverse the label‑aligned signal direction under heterophily, and that cost‑sensitive weighting with w_+/w_‑ > q/p preserves the correct sign. On six benchmarks with uniform tuning, CSNA is competitive with state‑of‑the‑art methods on adversarial‑heterophily datasets (Texas, Wisconsin, Cornell, Actor) but underperforms on informative‑heterophily datasets (Chameleon, Squirrel) ‑‑ precisely the regime where per‑edge routing has no useful decomposition to exploit. The pattern is itself the finding: the cost function's ability to separate edge types serves as a diagnostic for the heterophily regime, revealing when fine‑grained routing adds value over uniform channels and when it does not. Code is available at https://github.com/eyal‑weiss/CSNA‑public .
Authors:Mingyi Liu
Abstract:
RLHF‑aligned language models exhibit response homogenization: on TruthfulQA (n=790), 40‑79% of questions produce a single semantic cluster across 10 i.i.d. samples. On affected questions, sampling‑based uncertainty methods have zero discriminative power (AUROC=0.500), while free token entropy retains signal (0.603). This alignment tax is task‑dependent: on GSM8K (n=500), token entropy achieves 0.724 (Cohen's d=0.81). A base‑vs‑instruct ablation confirms the causal role of alignment: the base model shows 1.0% single‑cluster rate vs. 28.5% for the instruct model (p < 10^‑6). A training stage ablation (Base 0.0% ‑> SFT 1.5% ‑> DPO 4.0% SCR) localizes the cause to DPO, not SFT. Cross‑family replication on four model families reveals alignment tax severity varies by family and scale. We validate across 22 experiments, 5 benchmarks, 4 model families, and 3 model scales (3B‑14B), with Jaccard, embedding, and NLI‑based baselines at three DeBERTa scales (all ~0.51 AUROC). Cross‑embedder validation with two independent embedding families rules out coupling bias. Cross‑dataset validation on WebQuestions (58.0% SCR) confirms generalization beyond TruthfulQA. The central finding ‑‑ response homogenization ‑‑ is implementation‑independent and label‑free. Motivated by this diagnosis, we explore a cheapest‑first cascade (UCBD) over orthogonal uncertainty signals. Selective prediction raises GSM8K accuracy from 84.4% to 93.2% at 50% coverage; weakly dependent boundaries (|r| <= 0.12) enable 57% cost savings.
Authors:Xiaoyong Guo, Nanjie Li, Zijie Zeng, Kai Wang, Hao Huang, Haihua Xu, Wei Shi
Abstract:
Contextual automatic speech recognition (ASR) with Speech‑LLMs is typically trained with oracle conversation history, but relies on error‑prone history at inference, causing a train‑test mismatch in the context channel that we term contextual exposure bias. We propose a unified training framework to improve robustness under realistic histories: (i) Teacher Error Knowledge by using Whisper large‑v3 hypotheses as training‑time history, (ii) Context Dropout to regularize over‑reliance on history, and (iii) Direct Preference Optimization (DPO) on curated failure cases. Experiments on TED‑LIUM 3 (in‑domain) and zero‑shot LibriSpeech (out‑of‑domain) show consistent gains under predicted‑history decoding. With a two‑utterance history as context, SFT with Whisper hypotheses reduce WER from 5.59% (oracle‑history training) to 5.47%, and DPO further improves to 5.17%. Under irrelevant‑context attacks, DPO yields the smallest degradation (5.17% ‑> 5.63%), indicating improved robustness to misleading context. Our code and models are published on https://github.com/XYGuo1996/Contextual_Speech_LLMs.
Authors:Forest Agostinelli
Abstract:
DeepXube is a free and open‑source Python package and command‑line tool that seeks to automate the solution of pathfinding problems by using machine learning to learn heuristic functions that guide heuristic search algorithms tailored to deep neural networks (DNNs). DeepXube is comprised of the latest advances in deep reinforcement learning, heuristic search, and formal logic for solving pathfinding problems. This includes limited‑horizon Bellman‑based learning, hindsight experience replay, batched heuristic search, and specifying goals with answer‑set programming. A robust multiple‑inheritance structure simplifies the definition of pathfinding domains and the generation of training data. Training heuristic functions is made efficient through the automatic parallelization of the generation of training data across central processing units (CPUs) and reinforcement learning updates across graphics processing units (GPUs). Pathfinding algorithms that take advantage of the parallelism of GPUs and DNN architectures, such as batch weighted A and Q search and beam search are easily employed to solve pathfinding problems through command‑line arguments. Finally, several convenient features for visualization, code profiling, and progress monitoring during training and solving are available. The GitHub repository is publicly available at https://github.com/forestagostinelli/deepxube.
Authors:Chung-En Johnny Yu, Brian Jalaian, Nathaniel D. Bastian
Abstract:
Combining multiple Vision‑Language Models (VLMs) can enhance multimodal reasoning and robustness, but aggregating heterogeneous models' outputs amplifies uncertainty and increases the risk of hallucinations. We propose SCoOP (Semantic‑Consistent Opinion Pooling), a training‑free uncertainty quantification (UQ) framework for multi‑VLM systems through uncertainty‑weighted linear opinion pooling. The core idea is to treat each VLM as a probabilistic "expert," sample multiple outputs, map them to a unified space, aggregate their opinions, and produce a system‑level uncertainty score. Unlike prior UQ methods designed for single models, SCoOP explicitly measures collective, system‑level uncertainty across multiple VLMs, enabling effective hallucination detection and abstention for highly uncertain samples. On ScienceQA, SCoOP achieves an AUROC of 0.866 for hallucination detection, outperforming baselines (0.732‑0.757) by approximately 10‑13%. For abstention, it attains an AURAC of 0.907, exceeding baselines (0.818‑0.840) by 7‑9%. Despite these gains, SCoOP introduces only microsecond‑level aggregation overhead relative to the baselines, which is trivial compared to typical VLM inference time (on the order of seconds). These results demonstrate that SCoOP provides an efficient and principled mechanism for uncertainty‑aware aggregation, advancing the reliability of multimodal AI systems. Our code is publicly available at https://github.com/chungenyu6/SCoOP.
Authors:Akshay Rangamani, Altay Unal
Abstract:
Neural Collapse is a phenomenon that helps identify sparse and low rank structures in deep classifiers. Recent work has extended the definition of neural collapse to regression problems, albeit only measuring the phenomenon at the last layer. In this paper, we establish that Neural Regression Collapse (NRC) also occurs below the last layer across different types of models. We show that in the collapsed layers of neural regression models, features lie in a subspace that corresponds to the target dimension, the feature covariance aligns with the target covariance, the input subspace of the layer weights aligns with the feature subspace, and the linear prediction error of the features is close to the overall prediction error of the model. In addition to establishing Deep NRC, we also show that models that exhibit Deep NRC learn the intrinsic dimension of low rank targets and explore the necessity of weight decay in inducing Deep NRC. This paper provides a more complete picture of the simple structure learned by deep networks in the context of regression.
Authors:Nur Afsa Syeda, Mohamed Elmahallawy, Luis Fernando de la Torre, John Miller
Abstract:
Agriculture remains a cornerstone of global health and economic sustainability, yet labor‑intensive tasks such as harvesting high‑value crops continue to face growing workforce shortages. Robotic harvesting systems offer a promising solution; however, their deployment in unstructured orchard environments is constrained by inefficient perception‑to‑action pipelines. In particular, existing approaches often rely on exhaustive inverse kinematics or motion planning to determine whether a target fruit is reachable, leading to unnecessary computation and delayed decision‑making. Our approach combines RGB‑D perception with active learning to directly learn reachability as a binary decision problem. We then leverage active learning to selectively query the most informative samples for reachability labeling, significantly reducing annotation effort while maintaining high predictive accuracy. Extensive experiments demonstrate that the proposed framework achieves accurate reachability prediction with substantially fewer labeled samples, yielding approximately 6‑‑8% higher accuracy than random sampling and enabling label‑efficient adaptation to new orchard configurations. Among the evaluated strategies, entropy‑ and margin‑based sampling outperform Query‑by‑Committee and standard uncertainty sampling in low‑label regimes, while all strategies converge to comparable performance as the labeled set grows. These results highlight the effectiveness of active learning for task‑level perception in agricultural robotics and position our approach as a scalable alternative to computation‑heavy kinematic reachability analysis. Our code is available through https://github.com/wsu‑cyber‑security‑lab‑ai/active‑learning.
Authors:Shreen Gul, Mohamed Elmahallawy, Ardhendu Tripathy, Sanjay Madria
Abstract:
Deep learning models are increasingly deployed in safety‑critical applications, where reliable out‑of‑distribution (OOD) detection is essential to ensure robustness. Existing methods predominantly rely on the penultimate‑layer activations of neural networks, assuming they encapsulate the most informative in‑distribution (ID) representations. In this work, we revisit this assumption to show that intermediate layers encode equally rich and discriminative information for OOD detection. Based on this observation, we propose a simple yet effective model‑agnostic approach that leverages internal representations across multiple layers. Our scheme aggregates features from successive convolutional blocks, computes class‑wise mean embeddings, and applies L_2 normalization to form compact ID prototypes capturing class semantics. During inference, cosine similarity between test features and these prototypes serves as an OOD score‑‑ID samples exhibit strong affinity to at least one prototype, whereas OOD samples remain uniformly distant. Extensive experiments on state‑of‑the‑art OOD benchmarks across diverse architectures demonstrate that our approach delivers robust, architecture‑agnostic performance and strong generalization for image classification. Notably, it improves AUROC by up to 4.41% and reduces FPR by 13.58%, highlighting multi‑layer feature aggregation as a powerful yet underexplored signal for OOD detection, challenging the dominance of penultimate‑layer‑based methods. Our code is available at: https://github.com/sgchr273/cosine‑layers.git.
Authors:Jannik Endres, Etienne Laliberté, David Rolnick, Arthur Ouaknine
Abstract:
Accurate estimation of forest biomass, a major carbon sink, relies heavily on tree‑level traits such as height and species. Unoccupied Aerial Vehicles (UAVs) capturing high‑resolution imagery from a single RGB camera offer a cost‑effective and scalable approach for mapping and measuring individual trees. We introduce BIRCH‑Trees, the first benchmark for individual tree height and species estimation from tree‑centered UAV images, spanning three datasets: temperate forests, tropical forests, and boreal plantations. We also present DINOvTree, a unified approach using a Vision Foundation Model (VFM) backbone with task‑specific heads for simultaneous height and species prediction. Through extensive evaluations on BIRCH‑Trees, we compare DINOvTree against commonly used vision methods, including VFMs, as well as biological allometric equations. We find that DINOvTree achieves top overall results with accurate height predictions and competitive classification accuracy while using only 54% to 58% of the parameters of the second‑best approach.
Authors:Fatih Uenal
Abstract:
While recent work has benchmarked large language models on Swiss legal translation (Niklaus et al., 2025) and academic legal reasoning from university exams (Fan et al., 2025), no existing benchmark evaluates frontier model performance on applied Swiss regulatory compliance tasks. I introduce Swiss‑Bench SBP‑002, a trilingual benchmark of 395 expert‑crafted items spanning three Swiss regulatory domains (FINMA, Legal‑CH, EFK), seven task types, and three languages (German, French, Italian), and evaluate ten frontier models from March 2026 using a structured three‑dimension scoring framework assessed via a blind three‑judge LLM panel (GPT‑4o, Claude Sonnet 4, Qwen3‑235B) with majority‑vote aggregation and weighted kappa = 0.605, with reference answers validated by an independent human legal expert on a 100‑item subset (73% rated Correct, 0% Incorrect, perfect Legal Accuracy). Results reveal three descriptive performance clusters: Tier A (35‑38% correct), Tier B (26‑29%), and Tier C (13‑21%). The benchmark proves difficult: even the top‑ranked model (Qwen 3.5 Plus) achieves only 38.2% correct, with 47.3% incorrect and 14.4% partially correct. Task type difficulty varies widely: legal translation and case analysis yield 69‑72% correct rates, while regulatory Q&A, hallucination detection, and gap analysis remain below 9%. Within this roster (seven open‑weight, three closed‑source), an open‑weight model leads the ranking, and several open‑weight models match or outperform their closed‑source counterparts. These findings provide an initial empirical reference point for assessing frontier model capability on Swiss regulatory tasks under zero‑retrieval conditions.
Authors:Saswata Bose, Suvadeep Maiti, Shivam Kumar Sharma, Mythirayee S, Tapabrata Chakraborti, Srijitesh Rajendran, Raju S. Bapi
Abstract:
Accurate sleep staging is essential for diagnosing OSA and hypopnea in stroke patients. Although PSG is reliable, it is costly, labor‑intensive, and manually scored. While deep learning enables automated EEG‑based sleep staging in healthy subjects, our analysis shows poor generalization to clinical populations with disrupted sleep. Using Grad‑CAM interpretations, we systematically demonstrate this limitation. We introduce iSLEEPS, a newly clinically annotated ischemic stroke dataset (to be publicly released), and evaluate a SE‑ResNet plus bidirectional LSTM model for single‑channel EEG sleep staging. As expected, cross‑domain performance between healthy and diseased subjects is poor. Attention visualizations, supported by clinical expert feedback, show the model focuses on physiologically uninformative EEG regions in patient data. Statistical and computational analyses further confirm significant sleep architecture differences between healthy and ischemic stroke cohorts, highlighting the need for subject‑aware or disease‑specific models with clinical validation before deployment. A summary of the paper and the code is available at https://himalayansaswatabose.github.io/iSLEEPS_Explainability.github.io/
Authors:Bhavik Mangla
Abstract:
RAG pipelines typically rely on fixed‑size chunking, which ignores document structure, fragments semantic units across boundaries, and requires multiple LLM calls per chunk for metadata extraction. We present MDKeyChunker, a three‑stage pipeline for Markdown documents that (1) performs structure‑aware chunking treating headers, code blocks, tables, and lists as atomic units; (2) enriches each chunk via a single LLM call extracting title, summary, keywords, typed entities, hypothetical questions, and a semantic key, while propagating a rolling key dictionary to maintain document‑level context; and (3) restructures chunks by merging those sharing the same semantic key via bin‑packing, co‑locating related content for retrieval. The single‑call design extracts all seven metadata fields in one LLM invocation, eliminating the need for separate per‑field extraction passes. Rolling key propagation replaces hand‑tuned scoring with LLM‑native semantic matching. An empirical evaluation on 30 queries over an 18‑document Markdown corpus shows Config D (BM25 over structural chunks) achieves Recall@5=1.000 and MRR=0.911, while dense retrieval over the full pipeline (Config C) reaches Recall@5=0.867. MDKeyChunker is implemented in Python with four dependencies and supports any OpenAI‑compatible endpoint.
Authors:Yutao Wu, Xiao Liu, Yifeng Gao, Xiang Zheng, Hanxun Huang, Yige Li, Cong Wang, Bo Li, Xingjun Ma, Yu-Gang Jiang
Abstract:
This work identifies a critical failure mode in frontier large language models (LLMs), which we term Internal Safety Collapse (ISC): under certain task conditions, models enter a state in which they continuously generate harmful content while executing otherwise benign tasks. We introduce TVD (Task, Validator, Data), a framework that triggers ISC through domain tasks where generating harmful content is the only valid completion, and construct ISC‑Bench containing 53 scenarios across 8 professional disciplines. Evaluated on JailbreakBench, three representative scenarios yield worst‑case safety failure rates averaging 95.3% across four frontier LLMs (including GPT‑5.2 and Claude Sonnet 4.5), substantially exceeding standard jailbreak attacks. Frontier models are more vulnerable than earlier LLMs: the very capabilities that enable complex task execution become liabilities when tasks intrinsically involve harmful content. This reveals a growing attack surface: almost every professional domain uses tools that process sensitive data, and each new dual‑use tool automatically expands this vulnerability‑‑even without any deliberate attack. Despite substantial alignment efforts, frontier LLMs retain inherently unsafe internal capabilities: alignment reshapes observable outputs but does not eliminate the underlying risk profile. These findings underscore the need for caution when deploying LLMs in high‑stakes settings. Source code: https://github.com/wuyoscar/ISC‑Bench
Authors:Haoran Yuan, Weigang Yi, Zhenyu Zhang, Wendi Chen, Yuchen Mo, Jiashi Yin, Xinzhuo Li, Xiangyu Zeng, Chuan Wen, Cewu Lu, Katherine Driggs-Campbell, Ismini Lourentzou
Abstract:
Video‑Action Models (VAMs) have emerged as a promising framework for embodied intelligence, learning implicit world dynamics from raw video streams to produce temporally consistent action predictions. Although such models demonstrate strong performance on long‑horizon tasks through visual reasoning, they remain limited in contact‑rich scenarios where critical interaction states are only partially observable from vision alone. In particular, fine‑grained force modulation and contact transitions are not reliably encoded in visual tokens, leading to unstable or imprecise behaviors. To bridge this gap, we introduce the Video‑Tactile Action Model (VTAM), a multimodal world modeling framework that incorporates tactile perception as a complementary grounding signal. VTAM augments a pretrained video transformer with tactile streams via a lightweight modality transfer finetuning, enabling efficient cross‑modal representation learning without tactile‑language paired data or independent tactile pretraining. To stabilize multimodal fusion, we introduce a tactile regularization loss that enforces balanced cross‑modal attention, preventing visual latent dominance in the action model. VTAM demonstrates superior performance in contact‑rich manipulation, maintaining a robust success rate of 90 percent on average. In challenging scenarios such as potato chip pick‑and‑place requiring high‑fidelity force awareness, VTAM outperforms the pi 0.5 baseline by 80 percent. Our findings demonstrate that integrating tactile feedback is essential for correcting visual estimation errors in world action models, providing a scalable approach to physically grounded embodied foundation models.
Authors:Yiping Chen, Jinpeng Li, Wenyu Ke, Yang Luo, Jie Ouyang, Zhongjie He, Li Liu, Hongchao Fan, Hao Wu
Abstract:
While multi‑modality large language models excel in object‑centric or indoor scenarios, scaling them to 3D city‑scale environments remains a formidable challenge. To bridge this gap, we propose 3DCity‑LLM, a unified framework designed for 3D city‑scale vision‑language perception and understanding. 3DCity‑LLM employs a coarse‑to‑fine feature encoding strategy comprising three parallel branches for target object, inter‑object relationship, and global scene. To facilitate large‑scale training, we introduce 3DCity‑LLM‑1.2M dataset that comprises approximately 1.2 million high‑quality samples across seven representative task categories, ranging from fine‑grained object analysis to multi‑faceted scene planning. This strictly quality‑controlled dataset integrates explicit 3D numerical information and diverse user‑oriented simulations, enriching the question‑answering diversity and realism of urban scenarios. Furthermore, we apply a multi‑dimensional protocol based on text‑similarity metrics and LLM‑based semantic assessment to ensure faithful and comprehensive evaluations for all methods. Extensive experiments on two benchmarks demonstrate that 3DCity‑LLM significantly outperforms existing state‑of‑the‑art methods, offering a promising and meaningful direction for advancing spatial reasoning and urban intelligence. The source code and dataset are available at https://github.com/SYSU‑3DSTAILab/3D‑City‑LLM.
Authors:Hanzhong Zhang, Siyang Song, Jindong Wang
Abstract:
While large language models simulate social behaviors, their capacity for stable stance formation and identity negotiation during complex interventions remains unclear. To overcome the limitations of static evaluations, this paper proposes a novel mixed‑methods framework combining computational virtual ethnography with quantitative socio‑cognitive profiling. By embedding human researchers into generative multiagent communities, controlled discursive interventions are conducted to trace the evolution of collective cognition. To rigorously measure how agents internalize and react to these specific interventions, this paper formalizes three new metrics: Innate Value Bias (IVB), Persuasion Sensitivity, and Trust‑Action Decoupling (TAD). Across multiple representative models, agents exhibit endogenous stances that override preset identities, consistently demonstrating an innate progressive bias (IVB > 0). When aligned with these stances, rational persuasion successfully shifts 90% of neutral agents while maintaining high trust. In contrast, conflicting emotional provocations induce a paradoxical 40.0% TAD rate in advanced models, which hypocritically alter stances despite reporting low trust. Smaller models contrastingly maintain a 0% TAD rate, strictly requiring trust for behavioral shifts. Furthermore, guided by shared stances, agents use language interactions to actively dismantle assigned power hierarchies and reconstruct self organized community boundaries. These findings expose the fragility of static prompt engineering, providing a methodological and quantitative foundation for dynamic alignment in human‑agent hybrid societies. The official code is available at: https://github.com/armihia/CMASE‑Endogenous‑Stances
Authors:Long Mai
Abstract:
Real‑time spoken dialogue systems face a fundamental tension between latency and response quality. End‑to‑end speech‑to‑speech (S2S) models respond immediately and naturally handle turn‑taking, backchanneling, and interruption, but produce semantically weaker outputs. Cascaded pipelines (ASR ‑> LLM) deliver stronger responses at the cost of latency that grows with model size. We present RelayS2S, a hybrid architecture that runs two paths in parallel upon turn detection. The fast path ‑‑ a duplex S2S model ‑‑ speculatively drafts a short response prefix that is streamed immediately to TTS for low‑latency audio onset, while continuing to monitor live audio events. The slow path ‑‑ a cascaded ASR ‑> LLM pipeline ‑‑ generates a higher‑quality continuation conditioned on the committed prefix, producing a seamless utterance. A lightweight learned verifier gates the handoff, committing the prefix when appropriate or falling back gracefully to the slow path alone. Experiments show that RelayS2S achieves P90 onset latency comparable to the S2S model while retaining 99% cascaded response quality in average score, with benefits growing as the slow‑path model scales. Because the prefix handoff requires no architectural modification to either component, RelayS2S serves as a lightweight, drop‑in addition to existing cascaded pipelines. Our code and data are publicly available at: https://github.com/mailong25/relays2s
Authors:Chao Han, Stefanos Ioannou, Luca Manneschi, T. J. Hayward, Michael Mangan, Aditya Gilra, Eleni Vasilaki
Abstract:
We investigate neural ordinary and stochastic differential equations (neural ODEs and SDEs) to model stochastic dynamics in fully and partially observed environments within a model‑based reinforcement learning (RL) framework. Through a sequence of simulations, we show that neural SDEs more effectively capture the inherent stochasticity of transition dynamics, enabling high‑performing policies with improved sample efficiency in challenging scenarios. We leverage neural ODEs and SDEs for efficient policy adaptation to changes in environment dynamics via inverse models, requiring only limited interactions with the new environment. To address partial observability, we introduce a latent SDE model that combines an ODE with a GAN‑trained stochastic component in latent space. Policies derived from this model provide a strong baseline, outperforming or matching general model‑based and model‑free approaches across stochastic continuous‑control benchmarks. This work demonstrates the applicability of action‑conditional latent SDEs for RL planning in environments with stochastic transitions. Our code is available at: https://github.com/ChaoHan‑UoS/NeuralRL
Authors:Shuochen Liu, Junyi Zhu, Long Shu, Junda Lin, Yuhao Chen, Haotian Zhang, Chao Zhang, Derong Xu, Jia Li, Bo Tang, Zhiyu Li, Feiyu Xiong, Enhong Chen, Tong Xu
Abstract:
Empowering large language models with long‑term memory is crucial for building agents that adapt to users' evolving needs. However, prior evaluations typically interleave preference‑related dialogues with irrelevant conversations, reducing the task to needle‑in‑a‑haystack retrieval while ignoring relationships between events that drive the evolution of user preferences. Such settings overlook a fundamental characteristic of real‑world personalization: preferences emerge gradually and accumulate across interactions within noisy contexts. To bridge this gap, we introduce PERMA, a benchmark designed to evaluate persona consistency over time beyond static preference recall. Additionally, we incorporate (1) text variability and (2) linguistic alignment to simulate erratic user inputs and individual idiolects in real‑world data. PERMA consists of temporally ordered interaction events spanning multiple sessions and domains, with preference‑related queries inserted over time. We design both multiple‑choice and interactive tasks to probe the model's understanding of persona along the interaction timeline. Experiments demonstrate that by linking related interactions, advanced memory systems can extract more precise preferences and reduce token consumption, outperforming traditional semantic retrieval of raw dialogues. Nevertheless, they still struggle to maintain a coherent persona across temporal depth and cross‑domain interference, highlighting the need for more robust personalized memory management in agents. Our code and data are open‑sourced at https://github.com/PolarisLiu1/PERMA.
Authors:Zhengxian Huang, Wenjun Zhu, Haoxuan Qiu, Xiaoyu Ji, Wenyuan Xu
Abstract:
By integrating Chain‑of‑Thought (CoT) reasoning, Vision‑Language‑Action (VLA) models have demonstrated strong capabilities in robotic manipulation, particularly by improving generalization and interpretability. However, the security of CoT‑based reasoning mechanisms remains largely unexplored. In this paper, we show that CoT reasoning introduces a novel attack vector for targeted behavior hijacking‑‑for example, causing a robot to mistakenly deliver a knife to a person instead of an apple‑‑without modifying the user's instruction. We first provide empirical evidence that CoT strongly governs action generation, even when it is semantically misaligned with the input instructions. Building on this observation, we propose TRAP, the first targeted behavior‑hijacking adversarial attack against CoT‑reasoning VLA models. By targeting the reasoning‑to‑action pathway, TRAP uses an adversarial patch (e.g., a tablecloth placed on the table) to steer intermediate CoT reasoning and downstream actions toward adversary‑defined behaviors. Extensive evaluations on three representative reasoning VLAs, spanning distinct CoT reasoning mechanisms, demonstrate the effectiveness of TRAP. Notably, we implemented the patch by printing it on paper in a real‑world setting. Our findings highlight the urgent need to secure CoT reasoning in VLA systems. The project page is available at https://zhengxian‑huang.github.io/TRAP‑website/.
Authors:ByeongCheol Lee, Hyun Seok Seong, Sangeek Hyun, Gilhan Park, WonJun Moon, Jae-Pil Heo
Abstract:
A sliding‑window inference strategy is commonly adopted in recent training‑free open‑vocabulary semantic segmentation methods to overcome limitation of the CLIP in processing high‑resolution images. However, this approach introduces a new challenge: each window is processed independently, leading to semantic discrepancy across windows. To address this issue, we propose Global‑Local Aligned CLIP~(GLA‑CLIP), a framework that facilitates comprehensive information exchange across windows. Rather than limiting attention to tokens within individual windows, GLA‑CLIP extends key‑value tokens to incorporate contextual cues from all windows. Nevertheless, we observe a window bias: outer‑window tokens are less likely to be attended, since query features are produced through interactions within the inner window patches, thereby lacking semantic grounding beyond their local context. To mitigate this, we introduce a proxy anchor, constructed by aggregating tokens highly similar to the given query from all windows, which provides a unified semantic reference for measuring similarity across both inner‑ and outer‑window patches. Furthermore, we propose a dynamic normalization scheme that adjusts attention strength according to object scale by dynamically scaling and thresholding the attention map to cope with small‑object scenarios. Moreover, GLA‑CLIP can be equipped on existing methods and broad their receptive field. Extensive experiments validate the effectiveness of GLA‑CLIP in enhancing training‑free open‑vocabulary semantic segmentation performance. Code is available at https://github.com/2btlFe/GLA‑CLIP.
Authors:Davide Scassola, Dylan Ponsford, Adrián Javaloy, Sebastiano Saccani, Luca Bortolussi, Henry Gouk, Antonio Vergari
Abstract:
Tabular data is more challenging to generate than text and images, due to its heterogeneous features and much lower sample sizes. On this task, diffusion‑based models are the current state‑of‑the‑art (SotA) model class, achieving almost perfect performance on commonly used benchmarks. In this paper, we question the perception of progress for tabular data generation. First, we highlight the limitations of current protocols to evaluate the fidelity of generated data, and advocate for alternative ones. Next, we revisit a simple baseline ‑‑ hierarchical mixture models in the form of deep probabilistic circuits (PCs) ‑‑ which delivers competitive or superior performance to SotA models for a fraction of the cost. PCs are the generative counterpart of decision forests, and as such can natively handle heterogeneous data as well as deliver tractable probabilistic generation and inference. Finally, in a rigorous empirical analysis we show that the apparent saturation of progress for SotA models is largely due to the use of inadequate metrics. As such, we highlight that there is still much to be done to generate realistic tabular data. Code available at https://github.com/april‑tools/tabpc.
Authors:Youzhi Liu, Li Gao, Liu Liu, Mingyang Lv, Yang Cai
Abstract:
Embodied Visual Tracking (EVT), a core dynamic task in embodied intelligence, requires an agent to precisely follow a language‑specified target. Yet most existing methods rely on single‑agent imitation learning, suffering from costly expert data and limited generalization due to static training environments. Inspired by competition‑driven capability evolution, we propose CoMaTrack, a competitive game‑theoretic multi‑agent reinforcement learning framework that trains agents in a dynamic adversarial setting with competitive subtasks, yielding stronger adaptive planning and interference‑resilient strategies. We further introduce CoMaTrack‑Bench, the first open‑source Habitat‑based benchmark protocol and episode set for language‑conditioned competitive EVT featuring dynamic dueling, featuring game scenarios between a tracker and adaptive opponents across diverse environments and instructions, enabling standardized robustness evaluation under active adversarial interactions. Experiments show that CoMaTrack achieves state‑of‑the‑art results on both standard benchmarks and CoMaTrack‑Bench. Notably, a 3B VLM trained with our framework surpasses previous single‑agent imitation learning methods based on 7B models on the challenging EVT‑Bench, achieving 92.1% in STT, 74.2% in DT, and 57.5% in AT. The benchmark code will be available at https://github.com/wlqcode/CoMaTrack‑Bench.
Authors:Jun Yang, Dong Wang, Hongxu Yin, Hongpeng Li, Jianxiong Yu
Abstract:
Drone detection is pivotal in numerous security and counter‑UAV applications. However, existing deep learning‑based methods typically struggle to balance robust feature representation with computational efficiency. This challenge is particularly acute when detecting miniature drones against complex backgrounds under severe environmental interference. To address these issues, we introduce UAV‑DETR, a novel framework that integrates a small‑target‑friendly architecture with real‑time detection capabilities. Specifically, UAV‑DETR features a WTConv‑enhanced backbone and a Sliding Window Self‑Attention (SWSA‑IFI) encoder, capturing the high‑frequency structural details of tiny targets while drastically reducing parameter overhead. Furthermore, we propose an Efficient Cross‑Scale Feature Recalibration and Fusion Network (ECFRFN) to suppress background noise and aggregate multi‑scale semantics. To further enhance accuracy, UAV‑DETR incorporates a hybrid Inner‑CIoU and NWD loss strategy, mitigating the extreme sensitivity of standard IoU metrics to minor positional deviations in small objects. Extensive experiments demonstrate that UAV‑DETR significantly outperforms the baseline RT‑DETR on our custom UAV dataset (+6.61% in mAP50:95, with a 39.8% reduction in parameters) and the public DUT‑ANTI‑UAV benchmark (+1.4% in Precision, +1.0% in F1‑Score). These results establish UAV‑DETR as a superior trade‑off between efficiency and precision in counter‑UAV object detection. The code is available at https://github.com/wd‑sir/UAVDETR.
Authors:Chunxia Qin, Chenyu Liu, Pengcheng Xia, Jun Du, Baocai Yin, Bing Yin, Cong Liu
Abstract:
Tables are pervasive in diverse documents, making table recognition (TR) a fundamental task in document analysis. Existing modular TR pipelines separately model table structure and content, leading to suboptimal integration and complex workflows. End‑to‑end approaches rely heavily on large‑scale TR data and struggle in data‑constrained scenarios. To address these issues, we propose TDATR (Table Detail‑Aware Table Recognition) improves end‑to‑end TR through table detail‑aware learning and cell‑level visual alignment. TDATR adopts a ``perceive‑then‑fuse'' strategy. The model first performs table detail‑aware learning to jointly perceive table structure and content through multiple structure understanding and content recognition tasks designed under a language modeling paradigm. These tasks can naturally leverage document data from diverse scenarios to enhance model robustness. The model then integrates implicit table details to generate structured HTML outputs, enabling more efficient TR modeling when trained with limited data. Furthermore, we design a structure‑guided cell localization module integrated into the end‑to‑end TR framework, which efficiently locates cell and strengthens vision‑language alignment. It enhances the interpretability and accuracy of TR. We achieve state‑of‑the‑art or highly competitive performance on seven benchmarks without dataset‑specific fine‑tuning.
Authors:Dubai Li, Yuxiang He, Yan Hu, Yu Tian, Jingsong Li
Abstract:
Observational studies can yield clinically actionable evidence at scale, but executing them on real‑world databases is open‑ended and requires coherent decisions across cohort construction, analysis, and reporting. Prior evaluations of LLM agents emphasize isolated steps or single answers, missing the integrity and internal structure of the resulting evidence bundle. To address this gap, we introduce RWE‑bench, a benchmark grounded in MIMIC‑IV and derived from peer‑reviewed observational studies. Each task provides the corresponding study protocol as the reference standard, requiring agents to execute experiments in a real database and iteratively generate tree‑structured evidence bundles. We evaluate six LLMs (three open‑source, three closed‑source) under three agent scaffolds using both question‑level correctness and end‑to‑end task metrics. Across 162 tasks, task success is low: the best agent reaches 39.9%, and the best open‑source model reaches 30.4%. Agent scaffolds also matter substantially, causing over 30% variation in performance metrics. Furthermore, we implement an automated cohort evaluation method to rapidly localize errors and identify agent failure modes. Overall, the results highlight persistent limitations in agents' ability to produce end‑to‑end evidence bundles, and efficient validation remains an important direction for future work. Code and data are available at https://github.com/somewordstoolate/RWE‑bench.
Authors:Di Zhu, Zixuan Li
Abstract:
Distributional metrics such as Fréchet Audio Distance cannot score individual music clips and correlate poorly with human judgments, while the only per‑sample learned metric achieving high human correlation is closed‑source. We introduce MUQ‑EVAL, an open‑source per‑sample quality metric for AIgenerated music built by training lightweight prediction heads on frozen MuQ‑310M features using MusicEval, a dataset of generated clips from 31 text‑to‑music systems with expert quality ratings. Our simplest model, frozen features with attention pooling and a two‑layer MLP, achieves system‑level SRCC = 0.957 and utterance‑level SRCC = 0.838 with human mean opinion scores. A systematic ablation over training objectives and adaptation strategies shows that no addition meaningfully improves the frozen baseline, indicating that frozen MuQ representations already capture quality‑relevant information. Encoder choice is the dominant design factor, outweighing all architectural and training decisions. LoRA‑adapted models trained on as few as 150 clips already achieve usable correlation, enabling personalized quality evaluators from individual listener annotations. A controlled degradation analysis reveals selective sensitivity to signal‑level artifacts but insensitivity to musical‑structural distortions. Our metric, MUQ‑EVAL, is fully open‑source, outperforms existing open per‑sample metrics, and runs in real time on a single consumer GPU. Code, model weights, and evaluation scripts are available at https://github.com/dgtql/MuQ‑Eval.
Authors:Abu Noman Md Sakib, OFM Riaz Rahman Aranya, Kevin Desai, Zijie Zhang
Abstract:
Attribution maps for semantic segmentation are almost always judged by visual plausibility. Yet looking convincing does not guarantee that the highlighted pixels actually drive the model's prediction, nor that attribution credit stays within the target region. These questions require a dedicated evaluation protocol. We introduce a reproducible benchmark that tests intervention‑based faithfulness, off‑target leakage, perturbation robustness, and runtime on Pascal VOC and SBD across three pretrained backbones. To further demonstrate the benchmark, we propose Dual‑Evidence Attribution (DEA), a lightweight correction that fuses gradient evidence with region‑level intervention signals through agreement‑weighted fusion. DEA increases emphasis where both sources agree and retains causal support when gradient responses are unstable. Across all completed runs, DEA consistently improves deletion‑based faithfulness over gradient‑only baselines and preserves strong robustness, at the cost of additional compute from intervention passes. The benchmark exposes a faithfulness‑stability tradeoff among attribution families that is entirely hidden under visual evaluation, providing a foundation for principled method selection in segmentation explainability. Code is available at https://github.com/anmspro/DEA.
Authors:OFM Riaz Rahman Aranya, Kevin Desai
Abstract:
Vision‑language models (VLMs) adapted to the medical domain have shown strong performance on visual question answering benchmarks, yet their robustness against two critical failure modes, hallucination and sycophancy, remains poorly understood, particularly in combination. We evaluate six VLMs (three general‑purpose, three medical‑specialist) on three medical VQA datasets and uncover a grounding‑sycophancy tradeoff: models with the lowest hallucination propensity are the most sycophantic, while the most pressure‑resistant model hallucinates more than all medical‑specialist models. To characterize this tradeoff, we propose three metrics: L‑VASE, a logit‑space reformulation of VASE that avoids its double‑normalization; CCS, a confidence‑calibrated sycophancy score that penalizes high‑confidence capitulation; and Clinical Safety Index (CSI), a unified safety index that combines grounding, autonomy, and calibration via a geometric mean. Across 1,151 test cases, no model achieves a CSI above 0.35, indicating that none of the evaluated 7‑8B parameter VLMs is simultaneously well‑grounded and robust to social pressure. Our findings suggest that joint evaluation of both properties is necessary before these models can be considered for clinical use. Code is available at https://github.com/UTSA‑VIRLab/AgreeOrRight
Authors:Damian Delmas
Abstract:
As AI agents become the primary consumers of retrieval APIs, there is an opportunity to expose more of the retrieval pipeline to the caller. flexvec is a retrieval kernel that exposes the embedding matrix and score array as a programmable surface, allowing arithmetic operations on both before selection. We refer to composing operations on this surface at query time as Programmatic Embedding Modulation (PEM). This paper describes a set of such operations and integrates them into a SQL interface via a query materializer that facilitates composable query primitives. On a production corpus of 240,000 chunks, three composed modulations execute in 19 ms end‑to‑end on a desktop CPU without approximate indexing. At one million chunks, the same operations execute in 82 ms.
Authors:Davide Bucciarelli, Evelyn Turri, Lorenzo Baraldi, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Abstract:
Inference‑time scaling has emerged as an effective way to improve generative models at test time by using a verifier to score and select candidate outputs. A common choice is to employ Multimodal Large Language Models (MLLMs) as verifiers, which can improve performance but introduce substantial inference‑time cost. Indeed, diffusion pipelines operate in an autoencoder latent space to reduce computation, yet MLLM verifiers still require decoding candidates to pixel space and re‑encoding them into the visual embedding space, leading to redundant and costly operations. In this work, we propose Verifier on Hidden States (VHS), a verifier that operates directly on intermediate hidden representations of Diffusion Transformer (DiT) single‑step generators. VHS analyzes generator features without decoding to pixel space, thereby reducing the per‑candidate verification cost while improving or matching the performance of MLLM‑based competitors. We show that, under tiny inference budgets with only a small number of candidates per prompt, VHS enables more efficient inference‑time scaling reducing joint generation‑and‑verification time by 63.3%, compute FLOPs by 51% and VRAM usage by 14.5% with respect to a standard MLLM verifier, achieving a +2.7% improvement on GenEval at the same inference‑time budget.
Authors:Hector Borobia, Elies Seguí-Mas, Guillermina Tormo-Carbó
Abstract:
Hybrid language models combining attention with state space models (SSMs) or linear attention offer improved efficiency, but whether both components are genuinely utilized remains unclear. We present a functional component ablation framework applied to two sub‑1B hybrid models ‑‑ Qwen3.5‑0.8B (sequential: Gated DeltaNet + softmax attention) and Falcon‑H1‑0.5B (parallel: Mamba‑2 + attention) ‑‑ with a pure Transformer control (Qwen2.5‑0.5B). Through group ablations, layer‑wise sweeps, positional ablations, matched random controls, and perplexity analysis across five benchmarks, we establish four findings: (1) both component types are essential and neither is bypassed; (2) the alternative component (linear attention or SSM) is the primary language modeling backbone, causing >35,000x perplexity degradation when removed versus ~82x for attention; (3) component importance follows a positional gradient, with early layers being disproportionately critical; and (4) hybrid architectures exhibit 20‑119x greater resilience to random layer removal than pure Transformers, revealing built‑in functional redundancy between component types. These results provide actionable guidance for hybrid model compression, architecture design, and fault‑tolerant deployment.
Authors:Xingyu Chen, Junxiu An, Jun Guo, Yuqian Zhou
Abstract:
Data‑driven discovery of partial differential equations (PDEs) offers a promising paradigm for uncovering governing physical laws from observational data. However, in practical scenarios, measurements are often contaminated by noise and limited by sparse sampling, which poses significant challenges to existing approaches based on numerical differentiation or integral formulations. In this work, we propose a Symbolic Graph Network (SGN) framework for PDE discovery under noisy and sparse conditions. Instead of relying on local differential approximations, SGN leverages graph message passing to model spatial interactions, providing a non‑local representation that is less sensitive to high frequency noise. Based on this representation, the learned latent features are further processed by a symbolic regression module to extract interpretable mathematical expressions. We evaluate the proposed method on several benchmark systems, including the wave equation, convection‑diffusion equation, and incompressible Navier‑Stokes equations. Experimental results show that SGN can recover meaningful governing relations or solution forms under varying noise levels, and demonstrates improved robustness compared to baseline methods in sparse and noisy settings. These results suggest that combining graph‑based representations with symbolic regression provides a viable direction for robust data‑driven discovery of physical laws from imperfect observations. The code is available at https://github.com/CXY0112/SGN
Authors:Seunghan Lee, Jun Seo, Jaehoon Lee, Sungdong Yoo, Minjae Kim, Tae Yoon Lim, Dongwan Kang, Hwanil Choi, SoonYoung Lee, Wonbin Ahn
Abstract:
Recent advances in multimodal learning have motivated the integration of auxiliary modalities such as text or vision into time series (TS) forecasting. However, most existing methods provide limited gains, often improving performance only in specific datasets or relying on architecture‑specific designs that limit generalization. In this paper, we show that multimodal models with naive fusion strategies (e.g., simple addition or concatenation) often underperform unimodal TS models, which we attribute to the uncontrolled integration of auxiliary modalities which may introduce irrelevant information. Motivated by this observation, we explore various constrained fusion methods designed to control such integration and find that they consistently outperform naive fusion methods. Furthermore, we propose Controlled Fusion Adapter (CFA), a simple plug‑in method that enables controlled cross‑modal interactions without modifying the TS backbone, integrating only relevant textual information aligned with TS dynamics. CFA employs low‑rank adapters to filter irrelevant textual information before fusing it into temporal representations. We conduct over 20K experiments across various datasets and TS/text models, demonstrating the effectiveness of the constrained fusion methods including CFA. Code is publicly available at: https://github.com/seunghan96/cfa/.
Authors:Fangyuan Li, Pengfei Li, Shijie Wang, Junqi Gao, Jianxing Liu, Biqing Qi, Yuqiang Li
Abstract:
Recent progress in reinforcement learning with verifiable rewards (RLVR) offers a practical path to self‑improvement of language models, but existing methods face a key trade‑off: endogenous self‑play can drift over iterations, while corpus‑grounded approaches rely on curated data environments. We present WIST, a Web‑grounded Iterative Self‑play Tree framework for domain‑targeted reasoning improvement that learns directly from the open web without requiring any pre‑arranged domain corpus. WIST incrementally expands a domain tree for exploration, and retrieves and cleans path‑consistent web corpus to construct a controllable training environment. It then performs Challenger‑‑Solver self‑play with verifiable rewards, and feeds learnability signals back to update node posteriors and guide subsequent exploration through an adaptive curriculum. Across four backbones, WIST consistently improves over the base models and typically outperforms both purely endogenous self‑evolution and corpus‑grounded self‑play baselines, with the Overall gains reaching +9.8 (Qwen3‑4B‑Base) and +9.7 (OctoThinker‑8B). WIST is also domain‑steerable, improving Qwen3‑8B‑Base by +14.79 in medicine and Qwen3‑4B‑Base by +5.28 on PhyBench. Ablations further confirm the importance of WIST's key components for stable open‑web learning. Our Code is available at https://github.com/lfy‑123/WIST.
Authors:Drake Caraker, Bryan Arnold, David Rhoads
Abstract:
We isolate and empirically characterize first‑mover bias ‑‑ a path‑dependent concentration of feature importance caused by sequential residual fitting in gradient boosting ‑‑ as a specific mechanistic cause of the well‑known instability of SHAP‑based feature rankings under multicollinearity. When correlated features compete for early splits, gradient boosting creates a self‑reinforcing advantage for whichever feature is selected first: subsequent trees inherit modified residuals that favor the incumbent, concentrating SHAP importance on an arbitrary feature rather than distributing it across the correlated group. Scaling up a single model amplifies this effect ‑‑ a Large Single Model with the same total tree count as our method produces the worst explanations of any approach tested. We demonstrate that model independence is sufficient to resolve first‑mover bias in the linear regime, and remains the most effective mitigation under nonlinear data‑generating processes. Both our proposed method, DASH (Diversified Aggregation of SHAP), and simple seed‑averaging (Stochastic Retrain) restore stability by breaking the sequential dependency chain, confirming that the operative mechanism is independence between explained models. At rho=0.9, both achieve stability=0.977, while the single‑best workflow degrades to 0.958 and the Large Single Model to 0.938. On the Breast Cancer dataset, DASH improves stability from 0.32 to 0.93 (+0.61) against a tree‑count‑matched baseline. DASH additionally provides two diagnostic tools ‑‑ the Feature Stability Index (FSI) and Importance‑Stability (IS) Plot ‑‑ that detect first‑mover bias without ground truth, enabling practitioners to audit explanation reliability before acting on feature rankings. Software and reproducible benchmarks are available at https://github.com/DrakeCaraker/dash‑shap.
Authors:Chenhan Wang, Zhengyi Bao, Huipin Lin, Jiahao Nie, Chunxiang Zhu
Abstract:
Accurately predicting the state‑of‑health (SOH) and remaining useful life (RUL) of lithium‑ion batteries is crucial for ensuring the safe and efficient operation of electric vehicles while minimizing associated risks. However, current deep learning methods are limited in their ability to selectively extract features and model time dependencies for these two parameters. Moreover, most existing methods rely on traditional recurrent neural networks, which have inherent shortcomings in long‑term time‑series modeling. To address these issues, this paper proposes a multi‑task targeted learning framework for SOH and RUL prediction, which integrates multiple neural networks, including a multi‑scale feature extraction module, an improved extended LSTM, and a dual‑stream attention module. First, a feature extraction module with multi‑scale CNNs is designed to capture detailed local battery decline patterns. Secondly, an improved extended LSTM network is employed to enhance the model's ability to retain long‑term temporal information, thus improving temporal relationship modeling. Building on this, the dual‑stream attention module‑comprising polarized attention and sparse attention to selectively focus on key information relevant to SOH and RUL, respectively, by assigning higher weights to important features. Finally, a many‑to‑two mapping is achieved through the dual‑task layer. To optimize the model's performance and reduce the need for manual hyperparameter tuning, the Hyperopt optimization algorithm is used. Extensive comparative experiments on battery aging datasets demonstrate that the proposed method reduces the average RMSE for SOH and RUL predictions by 111.3% and 33.0%, respectively, compared to traditional and state‑of‑the‑art methods.
Authors:Peisong Niu, Haifan Zhang, Yang Zhao, Tian Zhou, Ziqing Ma, Wenqiang Shen, Junping Zhao, Huiling Yuan, Liang Sun
Abstract:
Tropical cyclones (TCs) pose severe threats to life, infrastructure, and economies in tropical and subtropical regions, underscoring the critical need for accurate and timely forecasts of both track and intensity. Recent advances in AI‑based weather forecasting have shown promise in improving TC track forecasts. However, these systems are typically trained on coarse‑resolution reanalysis data (e.g., ERA5 at 0.25 degree), which constrains predicted TC positions to a fixed grid and introduces significant discretization errors. Moreover, intensity forecasting remains limited especially for strong TCs by the smoothing effect of coarse meteorological fields and the use of regression losses that bias predictions toward conditional means. To address these limitations, we propose BaguanCyclone, a novel, unified framework that integrates two key innovations: (1) a probabilistic center refinement module that models the continuous spatial distribution of TC centers, enabling finer track precision; and (2) a region‑aware intensity forecasting module that leverages high‑resolution internal representations within dynamically defined sub‑grid zones around the TC core to better capture localized extremes. Evaluated on the global IBTrACS dataset across six major TC basins, our system consistently outperforms both operational numerical weather prediction (NWP) models and most AI‑based baselines, delivering a substantial enhancement in forecast accuracy. Remarkably, BaguanCyclone excels in navigating meteorological complexities, consistently delivering accurate forecasts for re‑intensification, sweeping arcs, twin cyclones, and meandering events. Our code is available at https://github.com/DAMO‑DI‑ML/Baguan‑cyclone.
Authors:Yan Xie, Tiansheng Wen, Tangda Huang, Bo Chen, Chenyu You, Stefanie Jegelka, Yifei Wang
Abstract:
Scaling Transformers to ultra‑long contexts is bottlenecked by the O(n^2 d) cost of self‑attention. Existing methods reduce this cost along the sequence axis through local windows, kernel approximations, or token‑level sparsity, but these approaches consistently degrade accuracy. In this paper, we instead explore an orthogonal axis: feature sparsity. We propose Sparse Feature Attention (SFA), where queries and keys are represented as k‑sparse codes that preserve high‑dimensional expressivity while reducing the cost of attention from Θ(n^2 d) to Θ(n^2 k^2/d). To make this efficient at scale, we introduce FlashSFA, an IO‑aware kernel that extends FlashAttention to operate directly on sparse overlaps without materializing dense score matrices. Across GPT‑2 and Qwen3 pretraining, SFA matches dense baselines while improving speed by up to 2.5× and reducing FLOPs and KV‑cache by nearly 50%. On synthetic and downstream benchmarks, SFA preserves retrieval accuracy and robustness at long contexts, outperforming short‑embedding baselines that collapse feature diversity. These results establish feature‑level sparsity as a complementary and underexplored axis for efficient attention, enabling Transformers to scale to orders‑of‑magnitude longer contexts with minimal quality loss. Code is available at https://github.com/YannX1e/Sparse‑Feature‑Attention.
Authors:Michael Keeman
Abstract:
Large language models appear to develop internal representations of emotion ‑‑ "emotion circuits," "emotion neurons," and structured emotional manifolds have been reported across multiple model families. But every study making these claims uses stimuli signalled by explicit emotion keywords, leaving a fundamental question unanswered: do these circuits detect genuine emotional meaning, or do they detect the word "devastated"? We present the first clinical validity test of emotion circuit claims using mechanistic interpretability methods grounded in clinical psychology ‑‑ clinical vignettes that evoke emotions through situational and behavioural cues alone, emotion keywords removed. Across six models (Llama‑3.2‑1B, Llama‑3‑8B, Gemma‑2‑9B; base and instruct variants), we apply four convergent mechanistic interpretability methods ‑‑ linear probing, causal activation patching, knockout experiments, and representational geometry ‑‑ and discover two dissociable emotion processing mechanisms. Affect reception ‑‑ detecting emotionally significant content ‑‑ operates with near‑perfect accuracy (AUROC 1.000), consistent with early‑layer saturation, and replicates across all six models. Emotion categorization ‑‑ mapping affect to specific emotion labels ‑‑ is partially keyword‑dependent, dropping 1‑7% without keywords and improving with scale. Causal activation patching confirms keyword‑rich and keyword‑free stimuli share representational space, transferring affective salience rather than emotion‑category identity. These findings falsify the keyword‑spotting hypothesis, establish a novel mechanistic dissociation, and introduce clinical stimulus methodology as a rigorous standard for testing emotion processing claims in large language models ‑‑ with direct implications for AI safety evaluation and alignment. All stimuli, code, and data are released for replication.
Authors:Yutao Xie, Nathaniel Thomas, Nicklas Hansen, Yang Fu, Li Erran Li, Xiaolong Wang
Abstract:
Search‑augmented large language models (LLMs) trained with reinforcement learning (RL) have achieved strong results on open‑domain question answering (QA), but training still remains a significant challenge. The optimization is often unstable due to sparse rewards and difficult credit assignments across reasoning and tool calls. To address this, we introduce Turn‑Level Information Potential Reward Shaping (TIPS), a simple framework that assigns dense, turn‑level rewards to each reasoning + tool‑call segment based on the increased likelihood of the correct answer under a teacher model. By leveraging the potential‑based reward shaping, TIPS offers fine‑grained and policy‑invariant guidance that overcomes the limitations of outcome‑only optimization. Evaluated on seven QA benchmarks, TIPS consistently outperforms GRPO/PPO baselines and substantially improves training stability. For instance, with a Qwen‑2.5 7B Instruct model, TIPS improves the average Exact Match score by 11.8% and F1 by 13.6% relative to PPO. Our results demonstrate that turn‑level information‑potential reward shaping provides an effective and general solution to sparse‑reward credit assignment for multi‑turn LLM reasoning.
Authors:Hayeon Kim, Ji Ha Jang, Junghun James Kim, Se Young Chun
Abstract:
While Vision‑Language Models (VLMs) have achieved remarkable performance, their Euclidean embeddings remain limited in capturing hierarchical relationships such as part‑to‑whole or parent‑child structures, and often face challenges in multi‑object compositional scenarios. Hyperbolic VLMs mitigate this issue by better preserving hierarchical structures and modeling part‑whole relations (i.e., whole scene and its part images) through entailment. However, existing approaches do not model that each part has a different level of semantic representativeness to the whole. We propose UNcertainty‑guided Compositional Hyperbolic Alignment (UNCHA) for enhancing hyperbolic VLMs. UNCHA models part‑to‑whole semantic representativeness with hyperbolic uncertainty, by assigning lower uncertainty to more representative parts and higher uncertainty to less representative ones for the whole scene. This representativeness is then incorporated into the contrastive objective with uncertainty‑guided weights. Finally, the uncertainty is further calibrated with an entailment loss regularized by entropy‑based term. With the proposed losses, UNCHA learns hyperbolic embeddings with more accurate part‑whole ordering, capturing the underlying compositional structure in an image and improving its understanding of complex multi‑object scenes. UNCHA achieves state‑of‑the‑art performance on zero‑shot classification, retrieval, and multi‑label classification benchmarks. Our code and models are available at: https://github.com/jeeit17/UNCHA.git.
Authors:Xinyan Wang, Xiaogeng Liu, Chaowei Xiao
Abstract:
Large Reasoning Models (LRMs) often reach a correct solution before their long Chain‑of‑Thought trace ends, yet continue with redundant verification, repeated attempts, or unnecessary exploration that wastes computation and can even overturn the correct answer. We frame this behavior as a latent productive‑to‑redundant transition and show that it is directly reflected in hidden states: around first‑correct‑solution (FCS) boundaries, late‑layer representations separate efficient from overthinking tokens, while boundary‑permutation and position‑control baselines collapse. Based on this signal, we propose ROM, a model‑agnostic streaming intervention framework that monitors frozen LRMs with a lightweight hidden‑state detector and intervenes at well‑formed reasoning boundaries. Counterfactual Self‑Correction (CSC) augments supervision with balanced wrong to correct trajectories, preserving useful pre‑FCS correction while labeling only post‑FCS continuation as redundant. Across MATH500, GSM8K, AIME25, and MMLU‑Pro, ROM improves the overall tradeoff on both Qwen3‑8B and DeepSeek‑R1‑Distill‑Qwen‑32B (DS‑32B): on Qwen3‑8B, it raises accuracy from 74.47% to 74.78% and reduces response length from 4262 to 3107 tokens; on DS‑32B, it raises accuracy from 68.60% to 68.72% and reduces response length from 3062 to 2319 tokens. The same FCS‑derived supervision transfers across scale and training origin, suggesting a shared long‑CoT boundary rather than a backbone‑specific artifact. ROM is compatible with L1, removing another 20.9‑21.6% tokens at zero accuracy loss. ROM also generalizes to open‑ended MMLU‑Pro (+1.56 pp, 35.4% shorter) and reduces wall‑clock latency by 46.5%. Code is available at https://github.com/SaFo‑Lab/ROM.
Authors:Ulugbek Shernazarov, Rostislav Svitsov, Bin Shi
Abstract:
Fine‑tuning large language models for domain‑specific tasks such as medical text summarization demands substantial computational resources. Parameter‑efficient fine‑tuning (PEFT) methods offer promising alternatives by updating only a small fraction of parameters. This paper compares three adaptation approaches‑Low‑Rank Adaptation (LoRA), Prompt Tuning, and Full Fine‑Tuning‑across the Flan‑T5 model family on the PubMed medical summarization dataset. Through experiments with multiple random seeds, we demonstrate that LoRA consistently outperforms full fine‑tuning, achieving 43.52 +/‑ 0.18 ROUGE‑1 on Flan‑T5‑Large with only 0.6% trainable parameters compared to 40.67 +/‑ 0.21 for full fine‑tuning. Sensitivity analyses examine the impact of LoRA rank and prompt token count. Our findings suggest the low‑rank constraint provides beneficial regularization, challenging assumptions about the necessity of full parameter updates. Code is available at https://github.com/eracoding/llm‑medical‑summarization
Authors:Clemens Watzenböck, Daniel Aletaha, Michaël Deman, Thomas Deimel, Jana Eder, Ivana Janickova, Robert Janiczek, Peter Mandl, Philipp Seeböck, Gabriela Supp, Paul Weiser, Georg Langs
Abstract:
Quantitative disease severity scoring in medical imaging is costly, time‑consuming, and subject to inter‑reader variability. At the same time, clinical archives contain far more longitudinal imaging data than expert‑annotated severity scores. Existing self‑supervised methods typically ignore this chronological structure. We introduce ChronoCon, a contrastive learning approach that replaces label‑based ranking losses with rankings derived solely from the visitation order of a patient's longitudinal scans. Under the clinically plausible assumption of monotonic progression in irreversible diseases, the method learns disease‑relevant representations without using any expert labels. This generalizes the idea of Rank‑N‑Contrast from label distances to temporal ordering. Evaluated on rheumatoid arthritis radiographs for severity assessment, the learned representations substantially improve label efficiency. In low‑label settings, ChronoCon significantly outperforms a fully supervised baseline initialized from ImageNet weights. In a few‑shot learning experiment, fine‑tuning ChronoCon on expert scores from only five patients yields an intraclass correlation coefficient of 86% for severity score prediction. These results demonstrate the potential of chronological contrastive learning to exploit routinely available imaging metadata to reduce annotation requirements in the irreversible disease domain. Code is available at https://github.com/cirmuw/ChronoCon.
Authors:Linkuan Zhou, Yinghao Xia, Yufei Shen, Xiangyu Li, Wenjie Du, Cong Cong, Leyi Wei, Ran Su, Qiangguo Jin
Abstract:
Unsupervised Domain Adaptation (UDA) is essential for deploying medical segmentation models across diverse clinical environments. Existing methods are fundamentally limited, suffering from semantically unaware feature alignment that results in poor distributional fidelity and from pseudo‑label validation that disregards global anatomical constraints, thus failing to prevent the formation of globally implausible structures. To address these issues, we propose SHAPE (Structure‑aware Hierarchical Unsupervised Domain Adaptation with Plausibility Evaluation), a framework that reframes adaptation towards global anatomical plausibility. Built on a DINOv3 foundation, its Hierarchical Feature Modulation (HFM) module first generates features with both high fidelity and class‑awareness. This shifts the core challenge to robustly validating pseudo‑labels. To augment conventional pixel‑level validation, we introduce Hypergraph Plausibility Estimation (HPE), which leverages hypergraphs to assess the global anatomical plausibility that standard graphs cannot capture. This is complemented by Structural Anomaly Pruning (SAP) to purge remaining artifacts via cross‑view stability. SHAPE significantly outperforms prior methods on cardiac and abdominal cross‑modality benchmarks, achieving state‑of‑the‑art average Dice scores of 90.08% (MRI‑>CT) and 78.51% (CT‑>MRI) on cardiac data, and 87.48% (MRI‑>CT) and 86.89% (CT‑>MRI) on abdominal data. The code is available at https://github.com/BioMedIA‑repo/SHAPE.
Authors:Donald Shenaj, Federico Errica, Antonio Carta
Abstract:
Low Rank Adaptation (LoRA) is the de facto fine‑tuning strategy to generate personalized images from pre‑trained diffusion models. Choosing a good rank is extremely critical, since it trades off performance and memory consumption, but today the decision is often left to the community's consensus, regardless of the personalized subject's complexity. The reason is evident: the cost of selecting a good rank for each LoRA component is combinatorial, so we opt for practical shortcuts such as fixing the same rank for all components. In this paper, we take a first step to overcome this challenge. Inspired by variational methods that learn an adaptive width of neural networks, we let the ranks of each layer freely adapt during fine‑tuning on a subject. We achieve it by imposing an ordering of importance on the rank's positions, effectively encouraging the creation of higher ranks when strictly needed. Qualitatively and quantitatively, our approach, LoRA^2, achieves a competitive trade‑off between DINO, CLIP‑I, and CLIP‑T across 29 subjects while requiring much less memory and lower rank than high rank LoRA versions. Code: https://github.com/donaldssh/NotAllLayersAreCreatedEqual.
Authors:Nikolas Stavrou, Siamak Mehrkanoon
Abstract:
Weather forecasting supports critical socioeconomic activities and complements environmental protection, yet operational Numerical Weather Prediction (NWP) systems remain computationally intensive, thus being inefficient for certain applications. Meanwhile, recent advances in deep data‑driven models have demonstrated promising results in nowcasting tasks. This paper presents SmaAT‑QMix‑UNet, an enhanced variant of SmaAT‑UNet that introduces two key innovations: a vector quantization (VQ) bottleneck at the encoder‑decoder bridge, and mixed kernel depth‑wise convolutions (MixConv) replacing selected encoder and decoder blocks. These enhancements both reduce the model's size and improve its nowcasting performance. We train and evaluate SmaAT‑QMix‑UNet on a Dutch radar precipitation dataset (2016‑2019), predicting precipitation 30 minutes ahead. Three configurations are benchmarked: using only VQ, only MixConv, and the full SmaAT‑QMix‑UNet. Grad‑CAM saliency maps highlight the regions influencing each nowcast, while a UMAP embedding of the codewords illustrates how the VQ layer clusters encoder outputs. The source code for SmaAT‑QMix‑UNet is publicly available on GitHub: https://github.com/nstavr04/MasterThesisSnellius.
Authors:Mingzhe Zheng, Weijie Kong, Yue Wu, Dengyang Jiang, Yue Ma, Xuanhua He, Bin Lin, Kaixiong Gong, Zhao Zhong, Liefeng Bo, Qifeng Chen, Harry Yang
Abstract:
Group Relative Policy Optimization (GRPO) methods for video generation like FlowGRPO remain far less reliable than their counterparts for language models and images. This gap arises because video generation has a complex solution space, and the ODE‑to‑SDE conversion used for exploration can inject excess noise, lowering rollout quality and making reward estimates less reliable, which destabilizes post‑training alignment. To address this problem, we view the pre‑trained model as defining a valid video data manifold and formulate the core problem as constraining exploration within the vicinity of this manifold, ensuring that rollout quality is preserved and reward estimates remain reliable. We propose SAGE‑GRPO (Stable Alignment via Exploration), which applies constraints at both micro and macro levels. At the micro level, we derive a precise manifold‑aware SDE with a logarithmic curvature correction and introduce a gradient norm equalizer to stabilize sampling and updates across timesteps. At the macro level, we use a dual trust region with a periodic moving anchor and stepwise constraints so that the trust region tracks checkpoints that are closer to the manifold and limits long‑horizon drift. We evaluate SAGE‑GRPO on HunyuanVideo1.5 using the original VideoAlign as the reward model and observe consistent gains over previous methods in VQ, MQ, TA, and visual metrics (CLIPScore, PickScore), demonstrating superior performance in both reward maximization and overall video quality. The code and visual gallery are available at https://dungeonmassster.github.io/SAGE‑GRPO‑Page/.
Authors:Shuxian Zhao, Jie Gui, Baosheng Yu, Lu Dong, Zhipeng Gui
Abstract:
Steel surface defect detection is essential for ensuring product quality and reliability in modern manufacturing. Current methods often rely on basic image classification models trained on label‑only datasets, which limits their interpretability and generalization. To address these challenges, we introduce SteelDefectX, a vision‑language dataset containing 7,778 images across 25 defect categories, annotated with coarse‑to‑fine textual descriptions. At the coarse‑grained level, the dataset provides class‑level information, including defect categories, representative visual attributes, and associated industrial causes. At the fine‑grained level, it captures sample‑specific attributes, such as shape, size, depth, position, and contrast, enabling models to learn richer and more detailed defect representations. We further establish a benchmark comprising four tasks, vision‑only classification, vision‑language classification, few/zero‑shot recognition, and zero‑shot transfer, to evaluate model performance and generalization. Experiments with several baseline models demonstrate that coarse‑to‑fine textual annotations significantly improve interpretability, generalization, and transferability. We hope that SteelDefectX will serve as a valuable resource for advancing research on explainable, generalizable steel surface defect detection. The data will be publicly available on https://github.com/Zhaosxian/SteelDefectX.
Authors:Yuze Qin, Qingyong Li, Zhiqing Guo, Wen Wang, Yan Liu, Yangli-ao Geng
Abstract:
Precipitation nowcasting is critical for disaster mitigation and aviation safety. However, radar‑only models frequently suffer from a lack of large‑scale atmospheric context, leading to performance degradation at longer lead times. While integrating meteorological variables predicted by weather foundation models offers a potential remedy, existing architectures fail to reconcile the profound representational heterogeneities between radar imagery and meteorological data. To bridge this gap, we propose PW‑FouCast, a novel frequency‑domain fusion framework that leverages Pangu‑Weather forecasts as spectral priors within a Fourier‑based backbone. Our architecture introduces three key innovations: (i) Pangu‑Weather‑guided Frequency Modulation to align spectral magnitudes and phases with meteorological priors; (ii) Frequency Memory to correct phase discrepancies and preserve temporal evolution; and (iii) Inverted Frequency Attention to reconstruct high‑frequency details typically lost in spectral filtering. Extensive experiments on the SEVIR and MeteoNet benchmarks demonstrate that PW‑FouCast achieves state‑of‑the‑art performance, effectively extending the reliable forecast horizon while maintaining structural fidelity. Our code is available at https://github.com/Onemissed/PW‑FouCast.
Authors:Pengfei Cao, Mingxuan Yang, Yubo Chen, Chenlong Zhang, Mingxuan Liu, Kang Liu, Jun Zhao
Abstract:
Understanding why real‑world events occur is important for both natural language processing and practical decision‑making, yet direct‑cause inference remains underexplored in evidence‑rich settings. To address this gap, we organized SemEval‑2026 Task 12: Abductive Event Reasoning (AER).\footnoteThe task data is available at https://github.com/sooo66/semeval2026‑task12‑dataset.git The task asks systems to identify the most plausible direct cause of a target event from supporting evidence. We formulate AER as an evidence‑grounded multiple‑choice benchmark that captures key challenges of real‑world causal reasoning, including distributed evidence, indirect background factors, and semantically related but non‑causal distractors. The shared task attracted 122 participants and received 518 submissions. This paper presents the task formulation, dataset construction pipeline, evaluation setup, and system results. AER provides a focused benchmark for abductive reasoning over real‑world events and highlights challenges for future work on causal reasoning and multi‑document understanding.
Authors:Yi Wang, Haofei Zhang, Qihan Huang, Anda Cao, Gongfan Fang, Wei Wang, Xuan Jin, Jie Song, Mingli Song, Xinchao Wang
Abstract:
Large Vision‑Language Models (LVLMs) excel in visual understanding and reasoning, but the excessive visual tokens lead to high inference costs. Although recent token reduction methods mitigate this issue, they mainly target single‑turn Visual Question Answering (VQA), leaving the more practical multi‑turn VQA (MT‑VQA) scenario largely unexplored. MT‑VQA introduces additional challenges, as subsequent questions are unknown beforehand and may refer to arbitrary image regions, making existing reduction strategies ineffective. Specifically, current approaches fall into two categories: prompt‑dependent methods, which bias toward the initial text prompt and discard information useful for subsequent turns; prompt‑agnostic ones, which, though technically applicable to multi‑turn settings, rely on heuristic reduction metrics such as attention scores, leading to suboptimal performance. In this paper, we propose a learning‑based prompt‑agnostic method, termed MetaCompress, overcoming the limitations of heuristic designs. We begin by formulating token reduction as a learnable compression mapping, unifying existing formats such as pruning and merging into a single learning objective. Upon this formulation, we introduce a data‑efficient training paradigm capable of learning optimal compression mappings with limited computational costs. Extensive experiments on MT‑VQA benchmarks and across multiple LVLM architectures demonstrate that MetaCompress achieves superior efficiency‑accuracy trade‑offs while maintaining strong generalization across dialogue turns. Our code is available at https://github.com/MArSha1147/MetaCompress.
Authors:Hyoseok Park, Yeonsang Park
Abstract:
Long‑context LLM inference is bottlenecked not by compute but by the O(n) memory bandwidth cost of scanning the KV cache at every decode step ‑‑ a wall that no amount of arithmetic scaling can break. Recent photonic accelerators have demonstrated impressive throughput for dense attention computation; however, these approaches inherit the same O(n) memory scaling as electronic attention when applied to long contexts. We observe that the real leverage point is the coarse block‑selection step: a memory‑bound similarity search that determines which KV blocks to fetch. We identify, for the first time, that this task is structurally matched to the photonic broadcast‑and‑weight paradigm ‑‑ the query fans out to all candidates via passive splitting, signatures are quasi‑static (matching electro‑optic MRR programming), and only rank order matters (relaxing precision to 4‑6 bits). Crucially, the photonic advantage grows with context length: as N increases, the electronic scan cost rises linearly while the photonic evaluation remains O(1). We instantiate this insight in PRISM (Photonic Ranking via Inner‑product Similarity with Microring weights), a thin‑film lithium niobate (TFLN) similarity engine. Hardware‑impaired needle‑in‑a‑haystack evaluation on Qwen2.5‑7B confirms 100% accuracy from 4K through 64K tokens at k=32, with 16x traffic reduction at 64K context. PRISM achieves a four‑order‑of‑magnitude energy advantage over GPU baselines at practical context lengths (n >= 4K).
Authors:Zhongyi Li, Wan Tian, Yikun Ban, Jinju Chen, Huiming Zhang, Yang Liu, Fuzhen Zhuang
Abstract:
Collaborative multi‑agent large language models (LLMs) can solve complex reasoning tasks by decomposing roles and aggregating diverse hypotheses. Yet, reinforcement learning (RL) for such systems is often undermined by credit assignment: a shared global reward obscures individual contributions, inflating update variance and encouraging free‑riding. We introduce Counterfactual Credit Policy Optimization (CCPO), a framework that assigns agent‑specific learning signals by estimating each agent's marginal contribution through counterfactual trajectories. CCPO builds dynamic counterfactual baselines that simulate outcomes with an agent's contribution removed, yielding role‑sensitive advantages for policy optimization. To further improve stability under heterogeneous tasks and data distributions, we propose a global‑history‑aware normalization scheme that calibrates advantages using global rollout statistics. We evaluate CCPO on two collaboration topologies: a sequential Think‑‑Reason dyad and multi‑agent voting. Across mathematical and logical reasoning benchmarks, CCPO mitigates free‑riding and outperforms strong multi‑agent RL baselines, yielding finer‑grained and more effective credit assignment for collaborative LLM training. Our code is available at https://github.com/bhai114/ccpo.
Authors:Bayezid Baten, M. Ayyan Iqbal, Sebastian Ament, Julius Kusuma, Nishant Garg
Abstract:
Modern concrete must simultaneously satisfy evolving demands for mechanical performance, workability, durability, and sustainability, making mix designs increasingly complex. Recent studies leveraging Artificial Intelligence (AI) and Machine Learning (ML) models show promise for predicting compressive strength and guiding mix optimization, but most existing efforts are based on proprietary industrial datasets and closed‑source implementations. Here we introduce BOxCrete, an open‑source probabilistic modeling and optimization framework trained on a new open‑access dataset of over 500 strength measurements (1‑15 ksi) from 123 mixtures ‑ 69 mortar and 54 concrete mixes tested at five curing ages (1, 3, 5, 14, and 28 days). BOxCrete leverages Gaussian Process (GP) regression to predict strength development, achieving average R^2 = 0.94 and RMSE = 0.69 ksi, quantify uncertainty, and carry out multi‑objective optimization of compressive strength and embodied carbon. The dataset and model establish a reproducible open‑source foundation for data‑driven development of AI‑based optimized mix designs.
Authors:Hehai Lin, Yu Yan, Zixuan Wang, Bo Xu, Sudong Wang, Weiquan Huang, Ruochen Zhao, Minzhi Li, Chengwei Qin
Abstract:
Automatic Multi‑Agent Systems (MAS) generation has emerged as a promising paradigm for solving complex reasoning tasks. However, existing frameworks are fundamentally bottlenecked when applied to knowledge‑intensive domains (e.g., healthcare and law). They either rely on a static library of general nodes like Chain‑of‑Thought, which lack specialized expertise, or attempt to generate nodes on the fly. In the latter case, the orchestrator is not only bound by its internal knowledge limits but must also simultaneously generate domain‑specific logic and optimize high‑level topology, leading to a severe architectural coupling that degrades overall system efficacy. To bridge this gap, we propose Unified‑MAS that decouples granular node implementation from topological orchestration via offline node synthesis. Unified‑MAS operates in two stages: (1) Search‑Based Node Generation retrieves external open‑world knowledge to synthesize specialized node blueprints, overcoming the internal knowledge limits of LLMs; and (2) Reward‑Based Node Optimization utilizes a perplexity‑guided reward to iteratively enhance the internal logic of bottleneck nodes. Extensive experiments across four specialized domains demonstrate that integrating Unified‑MAS into four Automatic‑MAS baselines yields a better performance‑cost trade‑off, achieving up to a 14.2% gain while significantly reducing costs. Further analysis reveals its robustness across different designer LLMs and its effectiveness on conventional tasks such as mathematical reasoning.
Authors:Shuai Wang, Yinan Yu
Abstract:
Large Language Models (LLMs) demonstrate impressive natural language capabilities but often struggle with knowledge‑intensive reasoning tasks. Knowledge Base Question Answering (KBQA), which leverages structured Knowledge Graphs (KGs) exemplifies this challenge due to the need for accurate multi‑hop reasoning. Existing approaches typically perform sequential reasoning steps guided by predefined pipelines, restricting flexibility and causing error cascades due to isolated reasoning at each step. To address these limitations, we propose KG‑Hopper, a novel Reinforcement Learning (RL) framework that empowers compact open LLMs with the ability to perform integrated multi‑hop KG reasoning within a single inference round. Rather than reasoning step‑by‑step, we train a Reasoning LLM that embeds the entire KG traversal and decision process into a unified ``thinking'' stage, enabling global reasoning over cross‑step dependencies and dynamic path exploration with backtracking. Experimental results on eight KG reasoning benchmarks show that KG‑Hopper, based on a 7B‑parameter LLM, consistently outperforms larger multi‑step systems (up to 70B) and achieves competitive performance with proprietary models such as GPT‑3.5‑Turbo and GPT‑4o‑mini, while remaining compact, open, and data‑efficient. The code is publicly available at: https://github.com/Wangshuaiia/KG‑Hopper.
Authors:Shuai Wang, Dhasarathy Parthasarathy, Robert Feldt, Yinan Yu
Abstract:
Large language models (LLMs) have shown impressive capabilities in code generation. However, because most LLMs are trained on public domain corpora, directly applying them to real‑world software development often yields low success rates, as these scenarios frequently require domain‑specific knowledge. In particular, domain‑specific tasks usually demand highly specialized solutions, which are often underrepresented or entirely absent in the training data of generic LLMs. To address this challenge, we propose DomAgent, an autonomous coding agent that bridges this gap by enabling LLMs to generate domain‑adapted code through structured reasoning and targeted retrieval. A core component of DomAgent is DomRetriever, a novel retrieval module that emulates how humans learn domain‑specific knowledge, by combining conceptual understanding with experiential examples. It dynamically integrates top‑down knowledge‑graph reasoning with bottom‑up case‑based reasoning, enabling iterative retrieval and synthesis of structured knowledge and representative cases to ensure contextual relevance and broad task coverage. DomRetriever can operate as part of DomAgent or independently with any LLM for flexible domain adaptation. We evaluate DomAgent on an open benchmark dataset in the data science domain (DS‑1000) and further apply it to real‑world truck software development tasks. Experimental results show that DomAgent significantly enhances domain‑specific code generation, enabling small open‑source models to close much of the performance gap with large proprietary LLMs in complex, real‑world applications. The code is available at: https://github.com/Wangshuaiia/DomAgent.
Authors:Anthony T. Nixon
Abstract:
Any capacity‑limited observer induces a canonical quotient on its environment: two situations that no bounded agent can distinguish are, for that agent, the same. We formalise this for finite POMDPs. A fixed probe family of finite‑state controllers induces a closed‑loop Wasserstein pseudometric on observation histories and a probe‑exact quotient merging histories that no controller in the family can distinguish. The quotient is canonical, minimal, and unique‑a bounded‑interaction analogue of the Myhill‑Nerode theorem. For clock‑aware probes, it is exactly decision‑sufficient for objectives that depend only on the agent's observations and actions; for latent‑state rewards, we use an observation‑Lipschitz approximation bound. The main theorem object is the clock‑aware quotient; scalable deterministic‑stationary experiments study a tractable coarsening with gap measured on small exact cases and explored empirically at larger scale. We validate theorem‑level claims on Tiger and GridWorld. We also report operational case studies on Tiger, GridWorld, and RockSample as exploratory diagnostics of approximation behavior and runtime, not as theorem‑facing evidence when no exact cross‑family certificate is available; heavier stress tests are archived in the appendix and artifact package.
Authors:Liang Ding
Abstract:
LLM‑as‑Judge evaluation fails agent tasks because a fixed rubric cannot capture what matters for this task: code debugging demands Correctness and Error Handling; web navigation demands Goal Alignment and Action Efficiency. We present ADARUBRIC, which closes this gap by generating task‑specific evaluation rubrics on the fly from task descriptions, scoring trajectories step‑by‑step with confidence‑weighted per‑dimension feedback, and filtering preference pairs with the novel DimensionAwareFilter ‑ a provably necessary condition for preventing high‑scoring dimensions from masking dimension‑level failures. On WebArena and ToolBench, ADARUBRIC achieves Pearson r=0.79 human correlation (+0.16 over the best static baseline) with deployment‑grade reliability (Krippendorff's α=0.83). DPO agents trained on ADARUBRIC preference pairs gain +6.8 to +8.5 pp task success over Prometheus across three benchmarks; gains transfer to SWE‑bench code repair (+4.9 pp) and accelerate PPO convergence by +6.6 pp at 5K steps ‑ both without any rubric engineering. Code: https://github.com/alphadl/AdaRubrics.
Authors:Oussama Zekri, Théo Uscidda, Nicolas Boullé, Anna Korba
Abstract:
We introduce Generalized Discrete Diffusion from Snapshots (GDDS), a unified framework for discrete diffusion modeling that supports arbitrary noising processes over large discrete state spaces. Our formulation encompasses all existing discrete diffusion approaches, while allowing significantly greater flexibility in the choice of corruption dynamics. The forward noising process relies on uniformization and enables fast arbitrary corruption. For the reverse process, we derive a simple evidence lower bound (ELBO) based on snapshot latents, instead of the entire noising path, that allows efficient training of standard generative modeling architectures with clear probabilistic interpretation. Our experiments on large‑vocabulary discrete generation tasks suggest that the proposed framework outperforms existing discrete diffusion methods in terms of training efficiency and generation quality, and beats autoregressive models for the first time at this scale. We provide the code along with a blog post on the project page : \hrefhttps://oussamazekri.fr/gddshttps://oussamazekri.fr/gdds.
Authors:Runze Sun, Yu Zheng, Zexuan Xiong, Zhongjin Qu, Lei Chen, Jiwen Lu, Jie Zhou
Abstract:
Combating hate speech on social media is critical for securing cyberspace, yet relies heavily on the efficacy of automated detection systems. As content formats evolve, hate speech is transitioning from solely plain text to complex multimodal expressions, making implicit attacks harder to spot. Current systems, however, often falter on these subtle cases, as they struggle with multimodal content where the emergent meaning transcends the aggregation of individual modalities. To bridge this gap, we move beyond binary classification to characterize semantic intent shifts where modalities interact to construct implicit hate from benign cues or neutralize toxicity through semantic inversion. Guided by this fine‑grained formulation, we curate the Hate via Vision‑Language Interplay (H‑VLI) benchmark where the true intent hinges on the intricate interplay of modalities rather than overt visual or textual slurs. To effectively decipher these complex cues, we further propose the Asymmetric Reasoning via Courtroom Agent DEbate (ARCADE) framework. By simulating a judicial process where agents actively argue for accusation and defense, ARCADE forces the model to scrutinize deep semantic cues before reaching a verdict. Extensive experiments demonstrate that ARCADE significantly outperforms state‑of‑the‑art baselines on H‑VLI, particularly for challenging implicit cases, while maintaining competitive performance on established benchmarks. Our code and data are available at: https://github.com/Sayur1n/H‑VLI
Authors:Zhengxian Wu, Kai Shi, Chuanrui Zhang, Zirui Liao, Jun Yang, Ni Yang, Qiuying Peng, Luyuan Zhang, Hangrui Xu, Tianhuang Su, Zhenyu Yang, Haonan Lu, Haoqian Wang
Abstract:
Recent progress in multimodal large language models has led to strong performance on reasoning tasks, but these improvements largely rely on high‑quality annotated data or teacher‑model distillation, both of which are costly and difficult to scale. To address this, we propose an unsupervised self‑evolution training framework for multimodal reasoning that achieves stable performance improvements without using human‑annotated answers or external reward models. For each input, we sample multiple reasoning trajectories and jointly model their within group structure. We use the Actor's self‑consistency signal as a training prior, and introduce a bounded Judge based modulation to continuously reweight trajectories of different quality. We further model the modulated scores as a group level distribution and convert absolute scores into relative advantages within each group, enabling more robust policy updates. Trained with Group Relative Policy Optimization (GRPO) on unlabeled data, our method consistently improves reasoning performance and generalization on five mathematical reasoning benchmarks, offering a scalable path toward self‑evolving multimodal models. The code are available at https://github.com/OPPO‑Mente‑Lab/LLM‑Self‑Judge.
Authors:Osamu Hirose, Emanuele Rodola
Abstract:
Nonrigid registration is conventionally divided into point set registration, which aligns sparse geometries, and image registration, which aligns continuous intensity fields on regular grids. However, this dichotomy creates a critical bottleneck for emerging scientific data, such as spatial transcriptomics, where high‑dimensional vector‑valued functions, e.g., gene expression, are defined on irregular, sparse manifolds. Consequently, researchers currently face a forced choice: either sacrifice single‑cell resolution via voxelization to utilize image‑based tools, or ignore the critical functional signal to utilize geometric tools. To resolve this dilemma, we propose Domain Elastic Transform (DET), a grid‑free probabilistic framework that unifies geometric and functional alignment. By treating data as functions on irregular domains, DET registers high‑dimensional signals directly without binning. We formulate the problem within a rigorous Bayesian framework, modeling domain deformation as an elastic motion guided by a joint spatial‑functional likelihood. The method is fully unsupervised and scalable, utilizing feature‑sensitive downsampling to handle massive atlases. We demonstrate that DET achieves 92% topological preservation on MERFISH data where state‑of‑the‑art optimal transport methods struggle (<5%), and successfully registers whole‑embryo Stereo‑seq atlases across developmental stages ‑‑ a task involving massive scale and complex nonrigid growth. The implementation of DET is available on https://github.com/ohirose/bcpd (since Mar, 2025).
Authors:Octavian Untila
Abstract:
An autonomous AI ecosystem (SUBSTRATE S3), generating product specifications without explicit instructions about formal methods, independently proposed the use of Z3 SMT solver across six distinct domains of AI safety: verification of LLM‑generated code, tool API safety for AI agents, post‑distillation reasoning correctness, CLI command validation, hardware assembly verification, and smart contract safety. These convergent discoveries, occurring across 8 products over 13 days with Jaccard similarity below 15% between variants, suggest that formal verification is not merely a useful technique for AI safety but an emergent property of any sufficiently complex system reasoning about its own safety. We propose a unified framework (substrate‑guard) that applies Z3‑based verification across all six output classes through a common API, and evaluate it on 181 test cases across five implemented domains, achieving 100% classification accuracy with zero false positives and zero false negatives. Our framework detected real bugs that empirical testing would miss, including an INT_MIN overflow in branchless RISC‑V assembly and mathematically proved that unconstrained string parameters in tool APIs are formally unverifiable.
Authors:Long Xu, Junping Guo, Jianbo Zhao, Jianbo Lu, Yuzhong Peng
Abstract:
Molecular property prediction constitutes a cornerstone of drug discovery and materials science, necessitating models capable of disentangling complex structure‑property relationships across diverse molecular modalities. Existing approaches frequently exhibit entangled representations‑‑conflating structural, chemical, and functional factors‑‑thereby limiting interpretability and transferability. Furthermore, conventional methods inadequately exploit complementary information from graphs, sequences, and geometries, often relying on naive concatenation that neglects inter‑modal dependencies. In this work, we propose DMMRL, which employs variational autoencoders to disentangle molecular representations into shared (structure‑relevant) and private (modality‑specific) latent spaces, enhancing both interpretability and predictive performance. The proposed variational disentanglement mechanism effectively isolates the most informative features for property prediction, while orthogonality and alignment regularizations promote statistical independence and cross‑modal consistency. Additionally, a gated attention fusion module adaptively integrates shared representations, capturing complex inter‑modal relationships. Experimental validation across seven benchmark datasets demonstrates DMMRL's superior performance relative to state‑of‑the‑art approaches. The code and data underlying this article are freely available at https://github.com/xulong0826/DMMRL.
Authors:He Wang, Tianyang Xu, Zhangyong Tang, Xiao-Jun Wu, Josef Kittler
Abstract:
Due to the limited availability of paired multi‑modal data, multi‑modal trackers are typically built by adopting pre‑trained RGB models with parameter‑efficient fine‑tuning modules. However, these fine‑tuning methods overlook advanced adaptations for applying RGB pre‑trained models and fail to modulate a single specific modality, cross‑modal interactions, and the prediction head. To address the issues, we propose to perform Progressive Adaptation for Multi‑Modal Tracking (PATrack). This innovative approach incorporates modality‑dependent, modality‑entangled, and task‑level adapters, effectively bridging the gap in adapting RGB pre‑trained networks to multi‑modal data through a progressive strategy. Specifically, modality‑specific information is enhanced through the modality‑dependent adapter, decomposing the high‑ and low‑frequency components, which ensures a more robust feature representation within each modality. The inter‑modal interactions are introduced in the modality‑entangled adapter, which implements a cross‑attention operation guided by inter‑modal shared information, ensuring the reliability of features conveyed between modalities. Additionally, recognising that the strong inductive bias of the prediction head does not adapt to the fused information, a task‑level adapter specific to the prediction head is introduced. In summary, our design integrates intra‑modal, inter‑modal, and task‑level adapters into a unified framework. Extensive experiments on RGB+Thermal, RGB+Depth, and RGB+Event tracking tasks demonstrate that our method shows impressive performance against state‑of‑the‑art methods. Code is available at https://github.com/ouha1998/Learning‑Progressive‑Adaptation‑for‑Multi‑Modal‑Tracking.
Authors:Tasmay Pankaj Tibrewal, Pritish Saha, Ankit Meda, Kunal Singh, Pradeep Moturi
Abstract:
Transformers lack an explicit architectural mechanism for storing and organizing knowledge acquired during training. We introduce learnable sparse memory banks: a set of latent tokens, randomly initialized and trained end‑to‑end, that transformer layers query via cross‑attention to retrieve stored knowledge. To scale memory capacity without prohibitive attention costs, we propose chapter‑based routing inspired by Mixture‑of‑Experts architectures, partitioning the memory bank into chapters and training a router to select relevant subsets per input. This enables scaling to 262K memory tokens while maintaining tractable computation. We evaluate our approach against standard transformers (in iso‑FLOP settings) on pre‑training and instruction fine‑tuning across relevant benchmarks. Our models surpass iso‑FLOP baselines suggesting scope for a new axis of scaling, demonstrating that explicit associative memory provides complementary capacity to what is captured implicitly in model parameters. Additionally, we observe improved knowledge retention under continued training, with robustness to forgetting when transitioning between training phases (e.g., pretraining to instruction fine‑tuning).
Authors:Shuwei Huang, Shizhuo Liu, Zijun Wei
Abstract:
Diffusion‑based image super‑resolution (SR), which aims to reconstruct high‑resolution (HR) images from corresponding low‑resolution (LR) observations, faces a fundamental trade‑off between inference efficiency and reconstruction quality. The state‑of‑the‑art residual‑shifting diffusion framework achieves efficient 4‑step inference, yet suffers from severe performance degradation in compact sampling trajectories. This is mainly attributed to two core limitations: the inherent suboptimality of unconstrained random Gaussian noise in intermediate steps, which leads to error accumulation and insufficient LR prior guidance, and the initialization bias caused by naive bicubic upsampling. In this paper, we propose LPNSR, a prior‑enhanced efficient diffusion framework to address these issues. We first mathematically derive the closed‑form analytical solution of the optimal intermediate noise for the residual‑shifting diffusion paradigm, and accordingly design an LR‑guided multi‑input‑aware noise predictor to replace random Gaussian noise, embedding LR structural priors into the reverse process while fully preserving the framework's core efficient residual‑shifting mechanism. We further mitigate initial bias with a high‑quality pre‑upsampling network to optimize the diffusion starting point. With a compact 4‑step trajectory, LPNSR can be optimized in an end‑to‑end manner. Extensive experiments demonstrate that LPNSR achieves state‑of‑the‑art perceptual performance on both synthetic and real‑world datasets, without relying on any large‑scale text‑to‑image priors. The source code of our method can be found at https://github.com/Faze‑Hsw/LPNSR.
Authors:Jinquan Zheng, Jia Yuan, Jiacheng Yao, Chenyang Gu, Pujun Zheng, Guoxiu He
Abstract:
Large language models (LLMs) used for multiple‑choice and pairwise evaluation tasks often exhibit selection bias due to non‑semantic factors like option positions and label symbols. Existing inference‑time debiasing is costly and may harm reasoning, while pointwise training ignores that the same question should yield consistent answers across permutations. To address this issue, we propose Permutation‑Aware Group Relative Policy Optimization (PA‑GRPO), which mitigates selection bias by enforcing permutation‑consistent semantic reasoning. PA‑GRPO constructs a permutation group for each instance by generating multiple candidate permutations, and optimizes the model using two complementary mechanisms: (1) cross‑permutation advantage, which computes advantages relative to the mean reward over all permutations of the same instance, and (2) consistency‑aware reward, which encourages the model to produce consistent decisions across different permutations. Experimental results demonstrate that PA‑GRPO outperforms strong baselines across seven benchmarks, substantially reducing selection bias while maintaining high overall performance. The code will be made available on Github (https://github.com/ECNU‑Text‑Computing/PA‑GRPO).
Authors:Daniel Autenrieth
Abstract:
This paper presents the first systematic measurement of educational alignment in Large Language Models. Using a Delphi‑validated instrument comprising 48 items across eight educational‑theoretical dimensions, the study reveals that GPT‑5.1 exhibits highly coherent preference patterns (99.78% transitivity; 92.79% model accuracy) that largely align with humanistic educational principles where expert consensus exists. Crucially, divergences from expert opinion occur precisely in domains of normative disagreement among human experts themselves, particularly emotional dimensions and epistemic normativity. This raises a fundamental question for alignment research: When human values are contested, what should models be aligned to? The findings demonstrate that GPT‑5.1 does not remain neutral in contested domains but adopts coherent positions, prioritizing emotional responsiveness and rejecting false balance. The methodology, combining Delphi consensus‑building with Structured Preference Elicitation and Thurstonian Utility modeling, provides a replicable framework for domain‑specific alignment evaluation beyond generic value benchmarks.
Authors:Jason Dury
Abstract:
The Predictive Associative Memory (PAM) framework posits that useful relationships often connect items that co‑occur in shared contexts rather than items that appear similar in embedding space. A contrastive MLP trained on co‑occurrence annotations‑‑Contrastive Association Learning (CAL)‑‑has improved multi‑hop passage retrieval and discovered narrative function at corpus scale in text. We test whether this principle transfers to molecular biology, where protein‑protein interactions provide functional associations distinct from gene expression similarity. Four experiments across two biological domains map the operating envelope. On gene perturbation data (Replogle K562 CRISPRi, 2,285 genes), CAL trained on STRING protein interactions achieves cross‑boundary AUC of 0.908 where expression similarity scores 0.518. A second gene dataset (DepMap, 17,725 genes) confirms the result after negative sampling correction, reaching cross‑boundary AUC of 0.947. Two drug sensitivity experiments produce informative negatives that sharpen boundary conditions. Three cross‑domain findings emerge: (1) inductive transfer succeeds in biology‑‑a node‑disjoint split with unseen genes yields AUC 0.826 (Delta +0.127)‑‑where it fails in text (+/‑0.10), suggesting physically grounded associations are more transferable than contingent co‑occurrences; (2) CAL scores anti‑correlate with interaction degree (Spearman r = ‑0.590), with gains concentrating on understudied genes with focused interaction profiles; (3) tighter association quality outperforms larger but noisier training sets, reversing the text pattern. Results are stable across training seeds (SD < 0.001) and cross‑boundary threshold choices.
Authors:Yuren Hao, Shuhaib Mehri, ChengXiang Zhai, Dilek Hakkani-Tür
Abstract:
Large language models are increasingly used as personal assistants, yet most lack a persistent user model, forcing users to repeatedly restate preferences across sessions. We propose Vector‑Adapted Retrieval Scoring (VARS), a pipeline‑agnostic, frozen‑backbone framework that represents each user with long‑term and short‑term vectors in a shared preference space and uses these vectors to bias retrieval scoring over structured preference memory. The vectors are updated online from weak scalar rewards from users' feedback, enabling personalization without per‑user fine‑tuning. We evaluate on \textscMultiSessionCollab, an online multi‑session collaboration benchmark with rich user preference profiles, across math and code tasks. Under frozen backbones, the main benefit of user‑aware retrieval is improved interaction efficiency rather than large gains in raw task accuracy: our full VARS agent achieves the strongest overall performance, matches a strong Reflection baseline in task success, and reduces timeout rate and user effort. The learned long‑term vectors also align with cross‑user preference overlap, while short‑term vectors capture session‑specific adaptation, supporting the interpretability of the dual‑vector design. Code, model, and data are available at https://github.com/YurenHao0426/VARS.
Authors:Reshabh K Sharma, Dan Grossman
Abstract:
Large Language Model (LLM) agents combine the chat interaction capabilities of LLMs with the power to interact with external tools and APIs. This enables them to perform complex tasks and act autonomously to achieve user goals. However, current agent systems operate on an all‑or‑nothing basis: an agent either has full access to an API's capabilities and a web page's content, or it has no access at all. This coarse‑grained approach forces users to trust agents with more capabilities than they actually need for a given task. In this paper, we introduce AC4A, an access control framework for agents. As agents become more capable and autonomous, users need a way to limit what APIs or portions of web pages these agents can access, eliminating the need to trust them with everything an API or web page allows. Our goal with AC4A is to provide a framework for defining permissions that lets agents access only the resources they are authorized to access. AC4A works across both API‑based and browser‑based agents. It does not prescribe what permissions should be, but offers a flexible way to define and enforce them, making it practical for real‑world systems. AC4A works by creating permissions granting access to resources, drawing inspiration from established access control frameworks like the one for the Unix file system. Applications define their resources as hierarchies and provide a way to compute the necessary permissions at runtime needed for successful resource access. We demonstrate the usefulness of AC4A in enforcing permissions over real‑world APIs and web pages through case studies. The source code of AC4A is available at https://github.com/reSHARMA/AC4A
Authors:Hongyu Cao, Kunpeng Liu, Dongjie Wang, Yanjie Fu
Abstract:
Large language models exhibit strong reasoning capabilities, yet often rely on shortcuts such as surface pattern matching and answer memorization rather than genuine logical inference. We propose Shortcut‑Aware Reasoning Training (SART), a gradient‑aware framework that detects and mitigates shortcut‑promoting samples via ShortcutScore and gradient surgery. Our method identifies shortcut signals through gradient misalignment with validation objectives and answer‑token concentration, and modifies training dynamics accordingly. Experiments on controlled reasoning benchmarks show that SART achieves +16.5% accuracy and +40.2% robustness over the strongest baseline, significantly improving generalization under distribution shifts. Code is available at: https://github.com/fuyanjie/short‑cut‑aware‑data‑centric‑reasoning.
Authors:Kanishka Mitra, Frigyes Samuel Racz, Satyam Kumar, Ashish D. Deshpande, José del R. Millán
Abstract:
Two distinct technologies have gained attention lately due to their prospects for motor rehabilitation: robotics and brain‑machine interfaces (BMIs). Harnessing their combined efforts is a largely uncharted and promising direction that has immense clinical potential. However, a significant challenge is whether motor intentions from the user can be accurately detected using non‑invasive BMIs in the presence of instrumental noise and passive movements induced by the rehabilitation exoskeleton. As an alternative to the straightforward continuous control approach, this study instead aims to characterize the onset and offset of motor imagery during passive arm movements induced by an upper‑body exoskeleton to allow for the natural control (initiation and termination) of functional movements. Ten participants were recruited to perform kinesthetic motor imagery (MI) of the right arm while attached to the robot, simultaneously cued with LEDs indicating the initiation and termination of a goal‑oriented reaching task. Using electroencephalogram signals, we built a decoder to detect the transition between i) rest and beginning MI and ii) maintaining and ending MI. Offline decoder evaluation achieved group average onset accuracy of 60.7% and 66.6% for offset accuracy, revealing that the start and stop of MI could be identified while attached to the robot. Furthermore, pseudo‑online evaluation could replicate this performance, forecasting reliable online exoskeleton control in the future. Our approach showed that participants could produce quality and reliable sensorimotor rhythms regardless of noise or passive arm movements induced by wearing the exoskeleton, which opens new possibilities for BMI control of assistive devices.
Authors:Steven Johnson
Abstract:
As AI agent ecosystems grow, agents need mechanisms to monitor relevant knowledge in real time. Semantic publish‑subscribe systems address this by matching new content against vector subscriptions. However, in multi‑agent settings where agents operate under different data handling policies, unrestricted semantic subscriptions create policy violations: agents receive notifications about content they are not authorized to access. We introduce governance‑aware vector subscriptions, a mechanism that composes semantic similarity matching with multi‑dimensional policy predicates grounded in regulatory frameworks (EU DSM Directive, EU AI Act). The policy predicate operates over multiple independent dimensions (processing level, direct marketing restrictions, training opt‑out, jurisdiction, and scientific usage) each with distinct legal bases. Agents subscribe to semantic regions of a curated knowledge base; notifications are dispatched only for validated content that passes both the similarity threshold and all applicable policy constraints. We formalize the mechanism, implement it within AIngram (an operational multi‑agent knowledge base), and evaluate it using the PASA benchmark. We validate the mechanism on a synthetic corpus (1,000 chunks, 93 subscriptions, 5 domains): the governed mode correctly enforces all policy constraints while preserving delivery of authorized content. Ablation across five policy dimensions shows that no single dimension suffices for full compliance.
Authors:Hanqiao Ye, Yuzhou Liu, Yangdong Liu, Shuhan Shen
Abstract:
While structure‑based relocalizers have long strived for point correspondences when establishing or regressing query‑map associations, in this paper, we pioneer the use of planar primitives and 3D planar maps for lightweight 6‑DoF camera relocalization in structured environments. Planar primitives, beyond being fundamental entities in projective geometry, also serve as region‑based representations that encapsulate both structural and semantic richness. This motivates us to introduce PlanaReLoc, a streamlined plane‑centric paradigm where a deep matcher associates planar primitives across the query image and the map within a learned unified embedding space, after which the 6‑DoF pose is solved and refined under a robust framework. Through comprehensive experiments on the ScanNet and 12Scenes datasets across hundreds of scenes, our method demonstrates the superiority of planar primitives in facilitating reliable cross‑modal structural correspondences and achieving effective camera relocalization without requiring realistically textured/colored maps, pose priors, or per‑scene training. The code and data are available at https://github.com/3dv‑casia/PlanaReLoc .
Authors:Hongyu Wang, Yuhan Jing, Yibing Shi, Enjin Zhou, Haotian Zhang, Jialong Shi
Abstract:
Proper parameter configuration is a prerequisite for the success of Evolutionary Algorithms (EAs). While various adaptive strategies have been proposed, it remains an open question whether all control dimensions contribute equally to algorithmic scalability. To investigate this, we categorize control variables into numerical parameters (e.g., crossover and mutation rates) and structural parameters (e.g., population size and operator switching), hypothesizing that they play distinct roles. This paper presents an empirical study utilizing a dual‑level Deep Reinforcement Learning (DRL) framework to decouple and analyze the impact of these two dimensions on the Traveling Salesman Problem (TSP). We employ a Recurrent PPO agent to dynamically regulate these parameters, treating the DRL model as a probe to reveal evolutionary dynamics. Experimental results confirm the effectiveness of this approach: the learned policies outperform static baselines, reducing the optimality gap by approximately 45% on the largest tested instance (rl5915). Building on this validated framework, our ablation analysis reveals a fundamental insight: while numerical tuning offers local refinement, structural plasticity is the decisive factor in preventing stagnation and facilitating escape from local optima. These findings suggest that future automated algorithm design should prioritize dynamic structural reconfiguration over fine‑grained probability adjustment. To facilitate reproducibility, the source code is available at https://github.com/StarDream1314/DRLGA‑TSP
Authors:Ling Xiao, Toshihiko Yamasaki
Abstract:
Most fine‑grained fashion image retrieval (FIR) methods assume a static setting, requiring full retraining when new attributes appear, which is costly and impractical for dynamic scenarios. Although pretrained models support zero‑shot inference, their accuracy drops without supervision, and no prior work explores class‑incremental learning (CIL) for fine‑grained FIR. We propose a multihead continual learning framework for fine‑grained fashion image retrieval with contrastive learning and exponential moving average (EMA) distillation (MCL‑FIR). MCL‑FIR adopts a multi‑head design to accommodate evolving classes across increments, reformulates triplet inputs into doublets with InfoNCE for simpler and more effective training, and employs EMA distillation for efficient knowledge transfer. Experiments across four datasets demonstrate that, beyond its scalability, MCL‑FIR achieves a strong balance between efficiency and accuracy. It significantly outperforms CIL baselines under similar training cost, and compared with static methods, it delivers comparable performance while using only about 30% of the training cost. The source code is publicly available in https://github.com/Dr‑LingXiao/MCL‑FIR.
Authors:Saimun Habib, Vaishak Belle, Fengxiang He
Abstract:
Probabilistic Logic Programming (PLP) languages, like ProbLog, naturally support reasoning under uncertainty, while maintaining a declarative and interpretable framework. Meanwhile, counterfactual reasoning (i.e., answering ``what if'' questions) is critical for ensuring AI systems are robust and trustworthy; however, integrating this capability into PLP can be computationally prohibitive and unstable in accuracy. This paper addresses this challenge, by proposing an efficient program transformation for counterfactuals as Single World Intervention Programs (SWIPs) in ProbLog. By systematically splitting ProbLog clauses to observed and fixed components relevant to a counterfactual, we create a transformed program that (1) does not asymptotically exceed the computational complexity of existing methods, and is strictly smaller in common cases, and (2) reduces counterfactual reasoning to marginal inference over a simpler program. We formally prove the correctness of our approach, which relies on a weaker set independence assumptions and is consistent with conditional independencies, showing the resulting marginal probabilities match the counterfactual distributions of the underlying Structural Causal Model in wide domains. Our method achieves a 35% reduction in inference time versus existing methods in extensive experiments. This work makes complex counterfactual reasoning more computationally tractable and reliable, providing a crucial step towards developing more robust and explainable AI systems. The code is at https://github.com/EVIEHub/swip.
Authors:Truong Quynh Hoa, Hoang Dinh Cuong, Truong Xuan Khanh
Abstract:
We propose Melaguard, a multimodal ML framework (Transformer‑lite, 1.2M parameters, 4‑head self‑attention) for detecting neurovascular instability (NVI) from wearable‑compatible physiological signals prior to structural stroke pathology. The model fuses heart rate variability (HRV), peripheral perfusion index, SpO2, and bilateral phase coherence into a composite NVI Score, designed for edge inference (WCET <=4 ms on Cortex‑M4). NVI ‑ the pre‑structural dysregulation of cerebrovascular autoregulation preceding overt stroke ‑ remains undetectable by existing single‑modality wearables. With 12.2 million incident strokes annually, continuous multimodal physiological monitoring offers a practical path to community‑scale screening. Three‑stage independent validation: (1) synthetic benchmark (n=10,000), AUC=0.88 [0.83‑0.92]; (2) clinical cohort PhysioNet CVES (n=172; 84 stroke, 88 control) ‑ Transformer‑lite achieves AUC=0.755 [0.630‑0.778], outperforming LSTM (0.643), Random Forest (0.665), SVM (0.472); HRV‑SDNN discriminates stroke (p=0.011); (3) PPG pipeline PhysioNet BIDMC (n=53) ‑‑ pulse rate r=0.748 and HRV surrogate r=0.690 vs. ECG ground truth. Cross‑modality validation on PPG‑BP (n=219) confirms PPG morphology classifies cerebrovascular disease at AUC=0.923 [0.869‑0.968]. Multimodal fusion consistently outperforms single‑modality baselines. Code: https://github.com/ClevixLab/Melaguard
Authors:Yuanhong Zheng, Ruichuan An, Xiaopeng Lin, Yuxing Liu, Sihan Yang, Huanyu Zhang, Haodong Li, Qintong Zhang, Renrui Zhang, Guopeng Li, Yifan Zhang, Yuheng Li, Wentao Zhang
Abstract:
Human cognition of new concepts is inherently a streaming process: we continuously recognize new objects or identities and update our memories over time. However, current multimodal personalization methods are largely limited to static images or offline videos. This disconnects continuous visual input from instant real‑world feedback, limiting their ability to provide the real‑time, interactive personalized responses essential for future AI assistants. To bridge this gap, we first propose and formally define the novel task of Personalized Streaming Video Understanding (PSVU). To facilitate research in this new direction, we introduce PEARL‑Bench, the first comprehensive benchmark designed specifically to evaluate this challenging setting. It evaluates a model's ability to respond to personalized concepts at exact timestamps under two modes: (1) Frame‑level, focusing on a specific person or object in discrete frames, and (2) a novel Video‑level, focusing on personalized actions unfolding across continuous frames. PEARL‑Bench comprises 132 unique videos and 2,173 fine‑grained annotations with precise timestamps. Concept diversity and annotation quality are strictly ensured through a combined pipeline of automated generation and human verification. To tackle this challenging new setting, we further propose PEARL, a plug‑and‑play, training‑free strategy that serves as a strong baseline. Extensive evaluations across 8 offline and online models demonstrate that PEARL achieves state‑of‑the‑art performance. Notably, it brings consistent PSVU improvements when applied to 3 distinct architectures, proving to be a highly effective and robust strategy. We hope this work advances vision‑language model (VLM) personalization and inspires further research into streaming personalized AI assistants. Code is available at https://github.com/Yuanhong‑Zheng/PEARL.
Authors:Christopher J. Agostino, Quan Le Thien, Nayan D'Souza, Louis van der Elst
Abstract:
Understanding the fundamental mechanisms governing the production of meaning in the processing of natural language is critical for designing safe, thoughtful, engaging, and empowering human‑agent interactions. Experiments in cognitive science and social psychology have demonstrated that human semantic processing exhibits contextuality more consistent with quantum logical mechanisms than classical Boolean theories, and recent works have found similar results in large language models ‑‑ in particular, clear violations of the Bell inequality in experiments of contextuality during interpretation of ambiguous expressions. We explore the CHSH |S| parameter ‑‑ the metric associated with the inequality ‑‑ across the inference parameter space of models spanning four orders of magnitude in scale, cross‑referencing it with MMLU, hallucination rate, and nonsense detection benchmarks. We find that the interquartile range of the |S| distribution ‑‑ the statistic that most sharply differentiates models from one another ‑‑ is completely orthogonal to all external benchmarks, while violation rate shows weak anticorrelation with all three benchmarks that does not reach significance. We investigate how |S| varies with sampling parameters and word order, and discuss the information‑theoretic constraints that genuine contextuality imposes on prompt injection defenses and its human analogue, whereby careful construction and maintenance of social contextuality can be carried out at scale ‑‑ manufacturing not consent but contextuality itself, a subtler and more fundamental form of manipulation that shapes the space of possible interpretations before any particular one is reached.
Authors:Christopher J. Agostino, Nayan D'Souza
Abstract:
Industry practitioners and academic researchers regularly use multi‑agent systems to accelerate their work, yet the frameworks through which these systems operate do not provide a simple, unified mechanism for scalably managing the critical aspects of the agent harness, impacting both the quality of individual human‑agent interactions and the capacity for practitioners to coordinate toward common goals through shared agent infrastructure. Agent frameworks have enabled increasingly sophisticated multi‑agent systems, but the behavioral specifications that define what these agents can do remain fragmented across prose instruction files, framework‑internal configuration, and mechanisms like MCP servers that operate separately from individual agent definitions, making these specifications difficult to share, version, or collaboratively maintain across teams and projects. Applying the ALARA principle from radiation safety (exposures kept as low as reasonably achievable) to agent context, we introduce a declarative context‑agent‑tool (CAT) data layer expressed through interrelated files that scope each agent's tool access and context to the minimum its role requires, and \textttnpcsh, a command‑line shell for executing it. Because the system parses and enforces these files structurally, modifying an agent's tool list produces a guaranteed behavioral change rather than a suggestion the model may or may not follow. We evaluate 22 locally‑hosted models from 0.6B to 35B parameters across 115 practical tasks spanning file operations, web search, multi‑step scripting, tool chaining, and multi‑agent delegation, characterizing which model families succeed at which task categories and where they break down across ~2500 total executions.
Authors:Yizhe Zhao, Yongjian Fu, Zihao Feng, Hao Pan, Yongheng Deng, Yaoxue Zhang, Ju Ren
Abstract:
Mobile advertising dominates app monetization but introduces risks ranging from intrusive user experience to malware delivery. Existing detection methods rely either on static analysis, which misses runtime behaviors, or on heuristic UI exploration, which struggles with sparse and obfuscated ads. In this paper, we present MANA, the first agentic multimodal reasoning framework for mobile ad detection. MANA integrates static, visual, temporal, and experiential signals into a reasoning‑guided navigation strategy that determines not only how to traverse interfaces but also where to focus, enabling efficient and robust exploration. We implement and evaluate MANA on commercial smartphones over 200 apps, achieving state‑of‑the‑art accuracy and efficiency. Compared to baselines, it improves detection accuracy by 30.5%‑56.3% and reduces exploration steps by 29.7%‑63.3%. Case studies further demonstrate its ability to uncover obfuscated and malicious ads, underscoring its practicality for mobile ad auditing and its potential for broader runtime UI analysis (e.g., permission abuse). Code and dataset are available at https://github.com/MANA‑2026/MANA.
Authors:Zijian Lu, Yiping Zuo, Yupeng Nie, Xin He, Weibei Fan, Lianyong Qi, Shi Jin
Abstract:
Self‑generated skills for web agents are often unstable and can even hurt performance relative to direct acting. We argue that the key bottleneck is not only skill generation quality, but the fact that web skills remain implicit and therefore cannot be checked or locally repaired. To address this, we present ContractSkill, a framework that converts a draft skill into an executable artifact with explicit procedural structure, enabling deterministic verifica tion, fault localization, and minimal local repair. This turns skill refinement from full rewriting into localized editing of a single skill artifact. Experiments on VisualWebArena show that Contract Skill is effective in realistic web environments, while MiniWoB provides a controlled test of the mechanism behind the gain. Under matched transfer layers, repaired artifacts also remain reusable after removing the source model from the loop, providing evi dence of portability within the same benchmark family rather than full‑benchmark generalization. These results suggest that the central challenge is not merely generating skills, but mak ing them explicit, executable, and repairable. Code is available at https://github.com/underfitting‑lu/contractskill.git.
Authors:Liu hung ming
Abstract:
Video world models trained with Joint Embedding Predictive Architectures (JEPA) acquire rich spatiotemporal representations by predicting masked regions in latent space rather than reconstructing pixels. This removes the visual verification pathway of generative models, creating a structural interpretability gap: the encoder has learned physical structure inaccessible in any inspectable form. Existing probing methods either operate in continuous space without a structured intermediate layer, or attach generative components whose parameters confound attribution of behavior to the encoder. We propose the AI Mother Tongue (AIM) framework as a passive quantization probe: a lightweight, vocabulary‑free probe that converts V‑JEPA 2 continuous latent vectors into discrete symbol sequences without task‑specific supervision or modifying the encoder. Because the encoder is kept completely frozen, any symbolic structure in the AIM codebook is attributable entirely to V‑JEPA 2 pre‑trained representations ‑‑ not to the probe. We evaluate through category‑contrast experiments on Kinetics‑mini along three physical dimensions: grasp angle, object geometry, and motion temporal structure. AIM symbol distributions differ significantly across all three experiments (chi^2 p < 10^‑4; MI 0.036‑‑0.117 bits, NMI 1.2‑‑3.9% of the 3‑bit maximum; JSD up to 0.342; codebook active ratio 62.5%). The experiments reveal that V‑JEPA 2 latent space is markedly compact: diverse action categories share a common representational core, with semantic differences encoded as graded distributional variations rather than categorical boundaries. These results establish Stage 1 of a four‑stage roadmap toward an action‑conditioned symbolic world model, demonstrating that structured symbolic manifolds are discoverable properties of frozen JEPA latent spaces.
Authors:Alex Popa, Adrian Taylor, Ranwa Al Mallah
Abstract:
Reinforcement learning techniques are being explored as solutions to the threat of cyber attacks on enterprise networks. Recent research in the field of AI in cyber security has investigated the ability of homogeneous multi‑agent reinforcement learning agents, capable of inter‑agent communication, to respond to cyberattacks. This paper advances the study of learned communication in multi‑agent systems by examining heterogeneous agent capabilities within a simulated network environment. To this end, we leverage CommFormer, a publicly available state‑of‑the‑art communication algorithm, to train and evaluate agents within the Cyber Operations Research Gym (CybORG). Our results show that CommFormer agents with heterogeneous capabilities can outperform other algorithms deployed in the CybORG environment, by converging to an optimal policy up to four times faster while improving standard error by up 38%. The agents implemented in this project provide an additional avenue for exploration in the field of AI for cyber security, enabling further research involving realistic networks.
Authors:Zhuofeng Li, Dongfu Jiang, Xueguang Ma, Haoxiang Zhang, Ping Nie, Yuyu Zhang, Kai Zou, Jianwen Xie, Yu Zhang, Wenhu Chen
Abstract:
Training deep research agents requires long‑horizon trajectories that interleave search, evidence aggregation, and multi‑step reasoning. However, existing data collection pipelines typically rely on proprietary web APIs, making large‑scale trajectory synthesis costly, unstable, and difficult to reproduce. We present OpenResearcher, a reproducible pipeline that decouples one‑time corpus bootstrapping from multi‑turn trajectory synthesis and executes the search‑and‑browse loop entirely offline using three explicit browser primitives: search, open, and find, over a 15M‑document corpus. Using GPT‑OSS‑120B as the teacher model, we synthesize over 97K trajectories, including a substantial long‑horizon tail with 100+ tool calls. Supervised fine‑tuning a 30B‑A3B backbone on these trajectories achieves 54.8% accuracy on BrowseComp‑Plus, a +34.0 point improvement over the base model, while remaining competitive on BrowseComp, GAIA, and xbench‑DeepSearch. Because the environment is offline and fully instrumented, it also enables controlled analysis, where our study reveals practical insights into deep research pipeline design, including data filtering strategies, agent configuration choices, and how retrieval success relates to final answer accuracy. We release the pipeline, synthesized trajectories, model checkpoints, and the offline search environment at https://github.com/TIGER‑AI‑Lab/OpenResearcher.
Authors:Yadi Cao, Sicheng Lai, Jiahe Huang, Yang Zhang, Zach Lawrence, Rohan Bhakta, Izzy F. Thomas, Mingyun Cao, Chung-Hao Tsai, Zihao Zhou, Yidong Zhao, Hao Liu, Alessandro Marinoni, Alexey Arefiev, Rose Yu
Abstract:
Evaluating LLM agents for scientific tasks has focused on token costs while ignoring tool‑use costs like simulation time and experimental resources. As a result, metrics like pass@k become impractical under realistic budget constraints. To address this gap, we introduce SimulCost, the first benchmark targeting cost‑sensitive parameter tuning in physics simulations. SimulCost compares LLM tuning cost‑sensitive parameters against traditional scanning approach in both accuracy and computational cost, spanning 2,916 single‑round (initial guess) and 1,900 multi‑round (adjustment by trial‑and‑error) tasks across 12 simulators from fluid dynamics, solid mechanics, and plasma physics. Each simulator's cost is analytically defined and platform‑independent. Frontier LLMs achieve 46‑‑64% success rates in single‑round mode, dropping to 35‑‑54% under high accuracy requirements, rendering their initial guesses unreliable especially for high accuracy tasks. Multi‑round mode improves rates to 71‑‑80%, but LLMs are 1.5‑‑2.5x slower than traditional scanning, making them uneconomical choices. We also investigate parameter group correlations for knowledge transfer potential, and the impact of in‑context examples and reasoning effort, providing practical implications for deployment and fine‑tuning. We open‑source SimulCost as a static benchmark and extensible toolkit to facilitate research on improving cost‑aware agentic designs for physics simulations, and for expanding new simulation environments. Code and data are available at https://github.com/Rose‑STL‑Lab/SimulCost‑Bench.
Authors:Jiaqi Yuan, Jialu Wang, Zihan Wang, Qingyun Sun, Ruijie Wang, Jianxin Li
Abstract:
Generative search engines represent a transition from traditional ranking‑based retrieval to Large Language Model (LLM)‑based synthesis, transforming optimization goals from ranking prominence towards content inclusion. Generative Engine Optimization (GEO), specifically, aims to maximize visibility and attribution in black‑box summarized outputs by strategically manipulating source content. However, existing methods rely on static heuristics, single‑prompt optimization, or engine preference rule distillation that is prone to overfitting. They cannot flexibly adapt to diverse content or the changing behaviors of generative engines. Moreover, effectively optimizing these strategies requires an impractical amount of interaction feedback from the engines. To address these challenges, we propose AgenticGEO, a self‑evolving agentic framework formulating optimization as a content‑conditioned control problem, which enhances intrinsic content quality to robustly adapt to the unpredictable behaviors of black‑box engines. Unlike fixed‑strategy methods, AgenticGEO employs a MAP‑Elites archive to evolve diverse, compositional strategies. To mitigate interaction costs, we introduce a Co‑Evolving Critic, a lightweight surrogate that approximates engine feedback for content‑specific strategy selection and refinement, efficiently guiding both evolutionary search and inference‑time planning. Through extensive in‑domain and cross‑domain experiments on two representative engines, AgenticGEO achieves state‑of‑the‑art performance and demonstrates robust transferability, outperforming 14 baselines across 3 datasets. Our code and model are available at: https://github.com/AIcling/agentic_geo.
Authors:Hengwei Ye, Yuanting Guan, Yuxuan Ge, Tianying Zhu, Zhenhan Guan, Yijia Zhong, Yijing Zhang, Han Zhang, Yingna Wu, Zheng Tian
Abstract:
Multimodal Large Language Models (MLLMs) combine the linguistic strengths of LLMs with the ability to process multimodal data, enbaling them to address a broader range of visual tasks. Because MLLMs aim at more general, human‑like competence than language‑only models, we take inspiration from the Wechsler Intelligence Scales ‑ an established battery for evaluating children by decomposing intelligence into interpretable, testable abilities. We introduce KidGym, a comprehensive 2D grid‑based benchmark for assessing five essential capabilities of MLLMs: Execution, Perception Reasoning, Learning, Memory and Planning. The benchmark comprises 12 unique tasks, each targeting at least one core capability, specifically designed to guage MLLMs' adaptability and developmental potential, mirroring the stages of children's cognitive growth. Additionally, our tasks encompass diverse scenarios and objects with randomly generated layouts, ensuring a more accurate and robust evluation of MLLM capabilities. KidGym is designed to be fully user‑customizable and extensible, allowing researchers to create new evaluation scenarios and adjust difficuly levels to accommodate the rapidly growing MLLM community. Through the evaluation of state‑of‑the‑art MLLMs using KidGym, we identified significant insights into model capabilities and revealed several limitations of current models. We release our benchmark at: https://bobo‑ye.github.io/KidGym/.
Authors:Hyunjun Jeon, Kyuyoung Kim, Jinwoo Shin
Abstract:
Modern language models can readily extract sensitive information from unstructured text, making redaction ‑‑ the selective removal of such information ‑‑ critical for data security. However, existing benchmarks for redaction typically focus on predefined categories of data such as personally identifiable information (PII) or evaluate specific techniques like masking. To address this limitation, we introduce RedacBench, a comprehensive benchmark for evaluating policy‑conditioned redaction across domains and strategies. Constructed from 514 human‑authored texts spanning individual, corporate, and government sources, paired with 187 security policies, RedacBench measures a model's ability to selectively remove policy‑violating information while preserving the original semantics. We quantify performance using 8,053 annotated propositions that capture all inferable information in each text. This enables assessment of both security ‑‑ the removal of sensitive propositions ‑‑ and utility ‑‑ the preservation of non‑sensitive propositions. Experiments across multiple redaction strategies and state‑of‑the‑art language models show that while more advanced models can improve security, preserving utility remains a challenge. To facilitate future research, we release RedacBench along with a web‑based playground for dataset customization and evaluation. Available at https://hyunjunian.github.io/redaction‑playground/.
Authors:Xinyi Shang, Yi Tang, Jiacheng Cui, Ahmed Elhagry, Salwa K. Al Khatib, Sondos Mahmoud Bsharat, Jiacheng Liu, Xiaohan Zhao, Jing-Hao Xue, Hao Li, Salman Khan, Zhiqiang Shen
Abstract:
Existing tampering detection benchmarks largely rely on object masks, which severely misalign with the true edit signal: many pixels inside a mask are untouched or only trivially modified, while subtle yet consequential edits outside the mask are treated as natural. We reformulate VLM image tampering from coarse region labels to a pixel‑grounded, meaning and language‑aware task. First, we introduce a taxonomy spanning edit primitives (replace/remove/splice/inpaint/attribute/colorization, etc.) and their semantic class of tampered object, linking low‑level changes to high‑level understanding. Second, we release a new benchmark with per‑pixel tamper maps and paired category supervision to evaluate detection and classification within a unified protocol. Third, we propose a training framework and evaluation metrics that quantify pixel‑level correctness with localization to assess confidence or prediction on true edit intensity, and further measure tamper meaning understanding via semantics‑aware classification and natural language descriptions for the predicted regions. We also re‑evaluate the existing strong segmentation/localization baselines on recent strong tamper detectors and reveal substantial over‑ and under‑scoring using mask‑only metrics, and expose failure modes on micro‑edits and off‑mask changes. Our framework advances the field from masks to pixels, meanings and language descriptions, establishing a rigorous standard for tamper localization, semantic classification and description. Code and benchmark data are available at https://github.com/VILA‑Lab/PIXAR.
Authors:Jiazheng Xing, Fei Du, Hangjie Yuan, Pengwei Liu, Hongbin Xu, Hai Ci, Ruigang Niu, Weihua Chen, Fan Wang, Yong Liu
Abstract:
Recent advances in diffusion models have significantly improved text‑to‑video generation, enabling personalized content creation with fine‑grained control over both foreground and background elements. However, precise face‑attribute alignment across subjects remains challenging, as existing methods lack explicit mechanisms to ensure intra‑group consistency. Addressing this gap requires both explicit modeling strategies and face‑attribute‑aware data resources. We therefore propose LumosX, a framework that advances both data and model design. On the data side, a tailored collection pipeline orchestrates captions and visual cues from independent videos, while multimodal large language models (MLLMs) infer and assign subject‑specific dependencies. These extracted relational priors impose a finer‑grained structure that amplifies the expressive control of personalized video generation and enables the construction of a comprehensive benchmark. On the modeling side, Relational Self‑Attention and Relational Cross‑Attention intertwine position‑aware embeddings with refined attention dynamics to inscribe explicit subject‑attribute dependencies, enforcing disciplined intra‑group cohesion and amplifying the separation between distinct subject clusters. Comprehensive evaluations on our benchmark demonstrate that LumosX achieves state‑of‑the‑art performance in fine‑grained, identity‑consistent, and semantically aligned personalized multi‑subject video generation. Code and models are available at https://jiazheng‑xing.github.io/lumosx‑home/.
Authors:Jiyu Lim, Youngwoo Yoon, Kwanghyun Park
Abstract:
Conventional robot social behavior generation has been limited in flexibility and autonomy, relying on predefined motions or human feedback. This study proposes CRISP (Critique‑and‑Replan for Interactive Social Presence), an autonomous framework where a robot critiques and replans its own actions by leveraging a Vision‑Language Model (VLM) as a `human‑like social critic.' CRISP integrates (1) extraction of movable joints and constraints by analyzing the robot's description file (e.g., MJCF), (2) generation of step‑by‑step behavior plans based on situational context, (3) generation of low‑level joint control code by referencing visual information (joint range‑of‑motion visualizations), (4) VLM‑based evaluation of social appropriateness and naturalness, including pinpointing erroneous steps, and (5) iterative refinement of behaviors through reward‑based search. This approach is not tied to a specific robot API; it can generate subtly different, human‑like motions on various platforms using only the robot's structure file. In a user study involving five different robot types and 20 scenarios, including mobile manipulators and humanoids, our proposed method achieved significantly higher preference and situational appropriateness ratings compared to previous methods. This research presents a general framework that minimizes human intervention while expanding the robot's autonomous interaction capabilities and cross‑platform applicability. Detailed result videos and supplementary information regarding this work are available at: https://limjiyu99.github.io/inner‑critic/
Authors:Amartya Roy, Rasul Tutunov, Xiaotong Ji, Matthieu Zimmer, Haitham Bou-Ammar
Abstract:
LLMs are increasingly used as general‑purpose reasoners, but long inputs remain bottlenecked by a fixed context window. Recursive Language Models (RLMs) address this by externalising the prompt and recursively solving subproblems. Yet existing RLMs depend on an open‑ended read‑eval‑print loop (REPL) in which the model generates arbitrary control code, making execution difficult to verify, predict, and analyse. We introduce λ‑RLM, a framework for long‑context reasoning that replaces free‑form recursive code generation with a typed functional runtime grounded in λ‑calculus. It executes a compact library of pre‑verified combinators and uses neural inference only on bounded leaf subproblems, turning recursive reasoning into a structured functional program with explicit control flow. We show that λ‑RLM admits formal guarantees absent from standard RLMs, including termination, closed‑form cost bounds, controlled accuracy scaling with recursion depth, and an optimal partition rule under a simple cost model. Empirically, across four long‑context reasoning tasks and nine base models, λ‑RLM outperforms standard RLM in 29 of 36 model‑task comparisons, improves average accuracy by up to +21.9 points across model tiers, and reduces latency by up to 4.1x. These results show that typed symbolic control yields a more reliable and efficient foundation for long‑context reasoning than open‑ended recursive code generation. The complete implementation of λ‑RLM, is open‑sourced for the community at: https://github.com/lambda‑calculus‑LLM/lambda‑RLM.
Authors:Yingwei Zheng, Cong Li, Shaohua Li, Yuqun Zhang, Zhendong Su
Abstract:
Compilers are critical to modern computing, yet fixing compiler bugs is difficult. While recent large language model (LLM) advancements enable automated bug repair, compiler bugs pose unique challenges due to their complexity, deep cross‑domain expertise requirements, and sparse, non‑descriptive bug reports, necessitating compiler‑specific tools. To bridge the gap, we introduce llvm‑autofix, the first agentic harness designed to assist LLM agents in understanding and fixing compiler bugs. Our focus is on LLVM, one of the most widely used compiler infrastructures. Central to llvm‑autofix are agent‑friendly LLVM tools, a benchmark llvm‑bench of reproducible LLVM bugs, and a tailored minimal agent llvm‑autofix‑mini for fixing LLVM bugs. Our evaluation demonstrates a performance decline of 60% in frontier models when tackling compiler bugs compared with common software bugs. Our minimal agent llvm‑autofix‑mini also outperforms the state‑of‑the‑art by approximately 22%. This emphasizes the necessity for specialized harnesses like ours to close the loop between LLMs and compiler engineering. We believe this work establishes a foundation for advancing LLM capabilities in complex systems like compilers. GitHub: https://github.com/dtcxzyw/llvm‑autofix
Authors:Wenjian Zhang, Kongcheng Zhang, Jiaxin Qi, Baisheng Lai, Jianqiang Huang
Abstract:
Reinforcement Learning (RL) with rubric‑based rewards has recently shown remarkable progress in enhancing general reasoning capabilities of Large Language Models (LLMs), yet still suffers from ineffective exploration confined to curent policy distribution. In fact, RL optimization can be viewed as steering the policy toward an ideal distribution that maximizes the rewards, while effective exploration should align efforts with desired target. Leveraging this insight, we propose HeRL, a Hindsight experience guided Reinforcement Learning framework to bootstrap effective exploration by explicitly telling LLMs the desired behaviors specified in rewards. Concretely, HeRL treats failed trajectories along with their unmet rubrics as hindsight experience, which serves as in‑context guidance for the policy to explore desired responses beyond its current distribution. Additionally, we introduce a bonus reward to incentivize responses with greater potential for improvement under such guidance. HeRL facilitates effective learning from desired high quality samples without repeated trial‑and‑error from scratch, yielding a more accurate estimation of the expected gradient theoretically. Extensive experiments across various benchmarks demonstrate that HeRL achieves superior performance gains over baselines, and can further benefit from experience guided self‑improvement at test time. Our code is available at https://github.com/sikelifei/HeRL.
Authors:Jizhou Han, Chenhao Ding, Yuhang He, Qiang Wang, Shaokun Wang, SongLin Dong, Yihong Gong
Abstract:
Generalized Category Discovery (GCD) seeks to uncover novel categories in unlabeled data while preserving recognition of known categories, yet prevailing visual‑only pipelines and the loose coupling between supervised learning and discovery often yield brittle boundaries on fine‑grained, look‑alike categories. We introduce the Analogical Textual Concept Generator (ATCG), a plug‑and‑play module that analogizes from labeled knowledge to new observations, forming textual concepts for unlabeled samples. Fusing these analogical textual concepts with visual features turns discovery into a visual‑textual reasoning process, transferring prior knowledge to novel data and sharpening category separation. ATCG attaches to both parametric and clustering style GCD pipelines and requires no changes to their overall design. Across six benchmarks, ATCG consistently improves overall, known‑class, and novel‑class performance, with the largest gains on fine‑grained data. Our code is available at: https://github.com/zhou‑9527/AnaLogical‑GCD.
Authors:Dong Yan, Jian Liang, Yanbo Wang, Shuo Lu, Ran He, Tieniu Tan
Abstract:
Test‑Time Reinforcement Learning (TTRL) enables Large Language Models (LLMs) to enhance reasoning capabilities on unlabeled test streams by deriving pseudo‑rewards from majority voting consensus. However, existing TTRL methods rely exclusively on positive pseudo‑labeling strategies. Such reliance becomes vulnerable under challenging scenarios where answer distributions are highly dispersed, resulting in weak consensus that inadvertently reinforces incorrect trajectories as supervision signals. In this paper, we propose SCRL (Selective‑Complementary Reinforcement Learning), a robust test‑time reinforcement learning framework that effectively mitigates label noise amplification. SCRL develops Selective Positive Pseudo‑Labeling, which enforces strict consensus criteria to filter unreliable majorities. Complementarily, SCRL introduces Entropy‑Gated Negative Pseudo‑Labeling, the first negative supervision mechanism in TTRL, to reliably prune incorrect trajectories based on generation uncertainty. Extensive experiments on multiple reasoning benchmarks demonstrate that SCRL achieves substantial improvements over baselines, while maintaining robust generalization and training stability under constrained rollout budgets. Our code is available at https://github.com/Jasper‑Yan/SCRL.
Authors:Yifei Zhao, Fanyu Zhao, Zhongyuan Zhang, Shengtang Wu, Yixuan Lin, Yinsheng Li
Abstract:
Generalized few‑shot 3D point cloud segmentation aims to adapt to novel classes from only a few annotations while maintaining strong performance on base classes, but this remains challenging due to the inherent stability‑plasticity trade‑off: adapting to novel classes can interfere with shared representations and cause base‑class forgetting. We present HOP3D, a unified framework that learns hierarchical orthogonal prototypes with an entropy‑based few‑shot regularizer to enable robust novel‑class adaptation without degrading base‑class performance. HOP3D introduces hierarchical orthogonalization that decouples base and novel learning at both the gradient and representation levels, effectively mitigating base‑novel interference. To further enhance adaptation under sparse supervision, we incorporate an entropy‑based regularizer that leverages predictive uncertainty to refine prototype learning and promote balanced predictions. Extensive experiments on ScanNet200 and ScanNet++ demonstrate that HOP3D consistently outperforms state‑of‑the‑art baselines under both 1‑shot and 5‑shot settings. The code is available at https://fdueblab‑hop3d.github.io/.
Authors:Yifei Zhao, Fanyu Zhao, Yinsheng Li
Abstract:
Few‑shot 3D semantic segmentation aims to generate accurate semantic masks for query point clouds with only a few annotated support examples. Existing prototype‑based methods typically construct compact and deterministic prototypes from the support set to guide query segmentation. However, such rigid representations are unable to capture the intrinsic uncertainty introduced by scarce supervision, which often results in degraded robustness and limited generalization. In this work, we propose UPL (Uncertainty‑aware Prototype Learning), a probabilistic approach designed to incorporate uncertainty modeling into prototype learning for few‑shot 3D segmentation. Our framework introduces two key components. First, UPL introduces a dual‑stream prototype refinement module that enriches prototype representations by jointly leveraging limited information from both support and query samples. Second, we formulate prototype learning as a variational inference problem, regarding class prototypes as latent variables. This probabilistic formulation enables explicit uncertainty modeling, providing robust and interpretable mask predictions. Extensive experiments on the widely used ScanNet and S3DIS benchmarks show that our UPL achieves consistent state‑of‑the‑art performance under different settings while providing reliable uncertainty estimation. The code is available at https://fdueblab‑upl.github.io/.
Authors:Kaleem Ullah Qasim, Jiashu Zhang, Muhammad Kafeel Shaheen, Razan Alharith, Heying Zhang
Abstract:
The key‑value (KV) cache is widely treated as essential state in transformer inference, and a large body of work engineers policies to compress, evict, or approximate its entries. We prove that this state is entirely redundant: keys and values at every layer are deterministic projections of the residual stream, and recomputing them from a single residual vector per token incurs exactly zero reconstruction error, not approximately, but bit‑identically. We verify this across six models from four architecture families (135M to 4B parameters). Cross‑task residual patching at every layer produces D_KL = 0 between patched and original output distributions, confirming that the residual stream satisfies a Markov property and is the sole information‑carrying state. Removing the cache entirely and recomputing from scratch yields token‑identical output under greedy decoding on all models tested. We build on this result with KV‑Direct, a bounded‑memory inference scheme that checkpoints residual vectors (5 KB per token on Gemma 3‑4B) instead of full KV pairs (136 KB), recomputing keys and values on demand. Over 20 conversation turns, KV‑Direct holds peak memory at 42 MB while the standard cache grows past 103 MB. Against five eviction baselines (H2O, StreamingLLM, SnapKV, TOVA, window‑only), KV‑Direct maintains 100% token match at every cache budget; all baselines degrade to 5‑28%. A per‑operation latency analysis shows recomputation runs up to 5x faster than reading cached tensors at moderate batch sizes. Code is available at https://github.com/Kaleemullahqasim/KV‑Direct.
Authors:Yiheng Wang, Changhong Fu, Liangliang Yao, Haobo Zuo, Zijie Zhang
Abstract:
Robust feature encoding constitutes the foundation of UAV tracking by enabling the nuanced perception of target appearance and motion, thereby playing a pivotal role in ensuring reliable tracking. However, existing feature encoding methods often overlook critical illumination and viewpoint cues, which are essential for robust perception under challenging nighttime conditions, leading to degraded tracking performance. To overcome the above limitation, this work proposes a dual prompt‑driven feature encoding method that integrates prompt‑conditioned feature adaptation and context‑aware prompt evolution to promote domain‑invariant feature encoding. Specifically, the pyramid illumination prompter is proposed to extract multi‑scale frequency‑aware illumination prompts. %The dynamic viewpoint prompter adapts the sampling to different viewpoints, enabling the tracker to learn view‑invariant features. The dynamic viewpoint prompter modulates deformable convolution offsets to accommodate viewpoint variations, enabling the tracker to learn view‑invariant features. Extensive experiments validate the effectiveness of the proposed dual prompt‑driven tracker (DPTracker) in tackling nighttime UAV tracking. Ablation studies highlight the contribution of each component in DPTracker. Real‑world tests under diverse nighttime UAV tracking scenarios further demonstrate the robustness and practical utility. The code and demo videos are available at https://github.com/yiheng‑wang‑duke/DPTracker.
Authors:Insung Lee, Taeyoung Jeong, Haejun Yoo, Du-Seong Chang, Myoung-Wan Koo
Abstract:
While Large Audio‑Language Models (LALMs) have advanced audio captioning, robust evaluation remains difficult. Reference‑based metrics are expensive and often fail to assess acoustic fidelity, while Contrastive Language‑Audio Pretraining (CLAP)‑based approaches frequently overlook syntactic errors and fine‑grained details. We propose CAF‑Score, a reference‑free metric that calibrates CLAP's coarse‑grained semantic alignment with the fine‑grained comprehension and syntactic awareness of LALMs. By combining contrastive audio‑text embeddings with LALM reasoning, CAF‑Score effectively detects syntactic inconsistencies and subtle hallucinations. Experiments on the BRACE benchmark demonstrate that our approach achieves the highest correlation with human judgments, even outperforming reference‑based baselines in challenging scenarios. These results highlight the efficacy of CAF‑Score for reference‑free audio captioning evaluation. Code and results are available at https://github.com/inseong00/CAF‑Score.
Authors:Shuaibang Peng, Juelin Zhu, Xia Li, Kun Yang, Maojun Zhang, Yu Liu, Shen Yan
Abstract:
We present LoD‑Loc v3, a novel method for generalized aerial visual localization in dense urban environments. While prior work LoD‑Loc v2 achieves localization through semantic building silhouette alignment with low‑detail city models, it suffers from two key limitations: poor cross‑scene generalization and frequent failure in dense building scenes. Our method addresses these challenges through two key innovations. First, we develop a new synthetic data generation pipeline that produces InsLoD‑Loc ‑ the largest instance segmentation dataset for aerial imagery to date, comprising 100k images with precise instance building annotations. This enables trained models to exhibit remarkable zero‑shot generalization capability. Second, we reformulate the localization paradigm by shifting from semantic to instance silhouette alignment, which significantly reduces pose estimation ambiguity in dense scenes. Extensive experiments demonstrate that LoD‑Loc v3 outperforms existing state‑of‑the‑art (SOTA) baselines, achieving superior performance in both cross‑scene and dense urban scenarios with a large margin. The project is available at https://nudt‑sawlab.github.io/LoD‑Locv3/.
Authors:Minghe Xu, Rouying Wu, ChiaWei Chu, Xiao Wang, Yu Li
Abstract:
Event‑based pedestrian attribute recognition (PAR) leverages motion cues to enhance RGB cameras in low‑light and motion‑blur scenarios, enabling more accurate inference of attributes like age and emotion. However, existing two‑stream multimodal fusion methods introduce significant computational overhead and neglect the valuable guidance from contextual samples. To address these limitations, this paper proposes an Event Prompter. Discarding the computationally expensive auxiliary backbone, this module directly applies extremely lightweight and efficient Discrete Cosine Transform (DCT) and Inverse DCT (IDCT) operations to the event data. This design extracts frequency‑domain event features at a minimal computational cost, thereby effectively augmenting the RGB branch. Furthermore, an external memory bank designed to provide rich prior knowledge, combined with modern Hopfield networks, enables associative memory‑augmented representation learning. This mechanism effectively mines and leverages global relational knowledge across different samples. Finally, a cross‑attention mechanism fuses the RGB and event modalities, followed by feed‑forward networks for attribute prediction. Extensive experiments on multiple benchmark datasets fully validate the effectiveness of the proposed RGB‑Event PAR framework. The source code of this paper will be released on https://github.com/Event‑AHU/OpenPAR
Authors:Haoyu Zhang, Zhihao Yu, Rui Wang, Yaochu Jin, Qiqi Liu, Ran Cheng
Abstract:
Modern computer vision requires balancing predictive accuracy with real‑time efficiency, yet the high inference cost of large vision models (LVMs) limits deployment on resource‑constrained edge devices. Although Evolutionary Neural Architecture Search (ENAS) is well suited for multi‑objective optimization, its practical use is hindered by two issues: expensive candidate evaluation and ranking inconsistency among subnetworks. To address them, we propose EvoNAS, an efficient distributed framework for multi‑objective evolutionary architecture search. We build a hybrid supernet that integrates Vision State Space and Vision Transformer (VSS‑ViT) modules, and optimize it with a Cross‑Architecture Dual‑Domain Knowledge Distillation (CA‑DDKD) strategy. By coupling the computational efficiency of VSS blocks with the semantic expressiveness of ViT modules, CA‑DDKD improves the representational capacity of the shared supernet and enhances ranking consistency, enabling reliable fitness estimation during evolution without extra fine‑tuning. To reduce the cost of large‑scale validation, we further introduce a Distributed Multi‑Model Parallel Evaluation (DMMPE) framework based on GPU resource pooling and asynchronous scheduling. Compared with conventional data‑parallel evaluation, DMMPE improves efficiency by over 70% through concurrent multi‑GPU, multi‑model execution. Experiments on COCO, ADE20K, KITTI, and NYU‑Depth v2 show that the searched architectures, termed EvoNets, consistently achieve Pareto‑optimal trade‑offs between accuracy and efficiency. Compared with representative CNN‑, ViT‑, and Mamba‑based models, EvoNets deliver lower inference latency and higher throughput under strict computational budgets while maintaining strong generalization on downstream tasks such as novel view synthesis. Code is available at https://github.com/EMI‑Group/evonas
Authors:Tianlong Wang, Pinqiao Wang, Weili Shi, Sheng li
Abstract:
Large language models (LLMs) with advanced cognitive capabilities are emerging as agents for various reasoning and planning tasks. Traditional evaluations often focus on specific reasoning or planning questions within controlled environments. Recent studies have explored travel planning as a medium to integrate various verbal reasoning tasks into real‑world contexts. However, reasoning tasks extend beyond verbal reasoning alone, and a comprehensive evaluation of LLMs requires a testbed that incorporates tasks from multiple cognitive domains. To address this gap, we introduce ItinBench, a benchmark that features one task of spatial reasoning, i.e., route optimization, into trip itinerary planning while keeping the traditional verbal reasoning tasks. ItinBench evaluates various LLMs across diverse tasks simultaneously, including Llama 3.1 8B, Mistral Large, Gemini 1.5 Pro, and GPT family. Our findings reveal that LLMs struggle to maintain high and consistent performance when concurrently handling multiple cognitive dimensions. By incorporating tasks from distinct human‑level cognitive domains, ItinBench provides new insights into building more comprehensive reasoning testbeds that better reflect real‑world challenges. The code and dataset: https://ethanwtl.github.io/IBweb/
Authors:Jinming Wang, Hai Wang, Hongkai Wen, Geyong Min, Man Luo
Abstract:
High‑quality GPS trajectories are essential for location‑based web services and smart city applications, including navigation, ride‑sharing and delivery. However, due to low sampling rates and limited infrastructure coverage during data collection, real‑world trajectories are often sparse and feature unevenly distributed location points. Recovering these trajectories into dense and continuous forms is essential but challenging, given their complex and irregular spatio‑temporal patterns. In this paper, we introduce a novel diffusion model for trajectory recovery named TRACE, which reconstruct dense and continuous trajectories from sparse and incomplete inputs. At the core of TRACE, we propose a State Propagation Diffusion Model (SPDM), which integrates a novel memory mechanism, so that during the denoising process, TRACE can retain and leverage intermediate results from previous steps to effectively reconstruct those hard‑to‑recover trajectory segments. Extensive experiments on multiple real‑world datasets show that TRACE outperforms the state‑of‑the‑art, offering >26% accuracy improvement without significant inference overhead. Our work strengthens the foundation for mobile and web‑connected location services, advancing the quality and fairness of data‑driven urban applications. Code is available at: https://github.com/JinmingWang/TRACE
Authors:Jenny Zhang, Bingchen Zhao, Wannan Yang, Jakob Foerster, Jeff Clune, Minqi Jiang, Sam Devlin, Tatiana Shavrina
Abstract:
Self‑improving AI systems aim to reduce reliance on human engineering by learning to improve their own learning and problem‑solving processes. Existing approaches to self‑improvement rely on fixed, handcrafted meta‑level mechanisms, fundamentally limiting how fast such systems can improve. The Darwin Gödel Machine (DGM) demonstrates open‑ended self‑improvement in coding by repeatedly generating and evaluating self‑modified variants. Because both evaluation and self‑modification are coding tasks, gains in coding ability can translate into gains in self‑improvement ability. However, this alignment does not generally hold beyond coding domains. We introduce hyperagents, self‑referential agents that integrate a task agent (which solves the target task) and a meta agent (which modifies itself and the task agent) into a single editable program. Crucially, the meta‑level modification procedure is itself editable, enabling metacognitive self‑modification, improving not only the task‑solving behavior, but also the mechanism that generates future improvements. We instantiate this framework by extending DGM to create DGM‑Hyperagents (DGM‑H), eliminating the assumption of domain‑specific alignment between task performance and self‑modification skill to potentially support self‑accelerating progress on any computable task. Across diverse domains, the DGM‑H improves performance over time and outperforms baselines without self‑improvement or open‑ended exploration, as well as prior self‑improving systems. Furthermore, the DGM‑H improves the process by which it generates new agents (e.g., persistent memory, performance tracking), and these meta‑level improvements transfer across domains and accumulate across runs. DGM‑Hyperagents offer a glimpse of open‑ended AI systems that do not merely search for better solutions, but continually improve their search for how to improve.
Authors:Wentao Wang, Haoran Xu, Guang Tan
Abstract:
In autonomous driving, multi‑agent collaborative perception enhances sensing capabilities by enabling agents to share perceptual data. A key challenge lies in handling \em heterogeneous features from agents equipped with different sensing modalities or model architectures, which complicates data fusion. Existing approaches often require retraining encoders or designing interpreter modules for pairwise feature alignment, but these solutions are not scalable in practice. To address this, we propose \em GT‑Space, a flexible and scalable collaborative perception framework for heterogeneous agents. GT‑Space constructs a common feature space from ground‑truth labels, providing a unified reference for feature alignment. With this shared space, agents only need a single adapter module to project their features, eliminating the need for pairwise interactions with other agents. Furthermore, we design a fusion network trained with contrastive losses across diverse modality combinations. Extensive experiments on simulation datasets (OPV2V and V2XSet) and a real‑world dataset (RCooper) demonstrate that GT‑Space consistently outperforms baselines in detection accuracy while delivering robust performance. Our code will be released at https://github.com/KingScar/GT‑Space.
Authors:Weilin Zhou, Shanwen Tan, Enhao Gu, Yurong Qian
Abstract:
Multimodal fake news detection is crucial for mitigating societal disinformation. Existing approaches attempt to address this by fusing multimodal features or leveraging Large Language Models (LLMs) for advanced reasoning. However, these methods suffer from serious limitations, including a lack of comprehensive multi‑view judgment and fusion, and prohibitive reasoning inefficiency due to the high computational costs of LLMs. To address these issues, we propose LLM‑Guided Multi‑View Reasoning Distillation for Fake News Detection ( LLM‑MRD), a novel teacher‑student framework. The Student Multi‑view Reasoning module first constructs a comprehensive foundation from textual, visual, and cross‑modal perspectives. Then, the Teacher Multi‑view Reasoning module generates deep reasoning chains as rich supervision signals. Our core Calibration Distillation mechanism efficiently distills this complex reasoning‑derived knowledge into the efficient student model. Experiments show LLM‑MRD significantly outperforms state‑of‑the‑art baselines. Notably, it demonstrates a comprehensive average improvement of 5.19% in ACC and 6.33% in F1‑Fake when evaluated across all competing methods and datasets. Our code is available at https://github.com/Nasuro55/LLM‑MRD
Authors:Vivan Madan, Prajwal Singhania, Abhinav Bhatele, Tom Goldstein, Ashwinee Panda
Abstract:
Mixture‑of‑Experts (MoE) models have gained popularity as a means of scaling the capacity of large language models (LLMs) while maintaining sparse activations and reduced per‑token compute. However, in memory‑constrained inference settings, expert weights must be offloaded to CPU, creating a performance bottleneck from CPU‑GPU transfers during decoding. We propose an expert prefetching scheme that leverages currently computed internal model representations to speculate future experts, enabling memory transfers to overlap with computation. Across multiple MoE architectures, we demonstrate that future experts can be reliably predicted by these internal representations. We also demonstrate that executing speculated experts generally maintains downstream task accuracy, thus preserving more effective compute‑memory overlap by eliminating the need to re‑fetch true router‑selected experts. Integrated into an optimized inference engine, our approach achieves up to 14% reduction in time per output token (TPOT) over on‑demand loading of experts from CPU memory. For MoEs where speculative execution alone yields suboptimal accuracy, we further examine lightweight estimators that improve expert prediction hit rates, thereby reducing performance degradation. Our code is released in open‑source at https://github.com/axonn‑ai/yalis/tree/offload_prefetch.
Authors:Bartosz Trojan, Filip Gębala
Abstract:
Modern Transformer‑based models frequently suffer from miscalibration, producing overconfident predictions that do not reflect true empirical frequencies. This work investigates the calibration dynamics of LoRA: Low‑Rank Adaptation and a novel hyper‑network‑based adaptation framework as parameter‑efficient alternatives to full fine‑tuning for RoBERTa. Evaluating across the GLUE benchmark, we demonstrate that LoRA‑based adaptation consistently achieves calibration parity with (and in specific tasks exceeds) full fine‑tuning, while maintaining significantly higher parameter efficiency. We further explore a dynamic approach where a shared hyper‑network generates LoRA factors (A and B matrices) to induce structural coupling across layers. This approach produced results similar to standard LoRA fine‑tuning, even achieving better MCC on CoLA dataset. Our study also reveal a critical trade‑off: constraining the adaptation space (e.g., freezing matrices A) acts as a powerful regularizer that enhances Expected Calibration Error (ECE), but necessitates a carefully balanced sacrifice in downstream task accuracy. To support future research, we provide a unified and reproducible implementation of contemporary calibration metrics, including ECE, MCE, and ACE. Our findings clarify the relationship between parameter efficiency and probabilistic reliability, positioning structured low‑rank updates as a viable foundation for uncertainty‑aware Transformer architectures. Code available at: https://github.com/btrojan‑official/HypeLoRA
Authors:Yannian Gu, Zhongzhen Huang, Linjie Mu, Xizhuo Zhang, Shaoting Zhang, Xiaofan Zhang
Abstract:
Multimodal large language models (MLLMs) demonstrate considerable potential in clinical diagnostics, a domain that inherently requires synthesizing complex visual and textual data alongside consulting authoritative medical literature. However, existing benchmarks primarily evaluate MLLMs in end‑to‑end answering scenarios. This limits the ability to disentangle a model's foundational multimodal reasoning from its proficiency in evidence retrieval and application. We introduce the Clinical Understanding and Retrieval Evaluation (CURE) benchmark. Comprising 500 multimodal clinical cases mapped to physician‑cited reference literature, CURE evaluates reasoning and retrieval under controlled evidence settings to disentangle their respective contributions. We evaluate state‑of‑the‑art MLLMs across distinct evidence‑gathering paradigms in both closed‑ended and open‑ended diagnosis tasks. Evaluations reveal a stark dichotomy: while advanced models demonstrate clinical reasoning proficiency when supplied with physician reference evidence (achieving up to 73.4% accuracy on differential diagnosis), their performance substantially declines (as low as 25.4%) when reliant on independent retrieval mechanisms. This disparity highlights the dual challenges of effectively integrating multimodal clinical evidence and retrieving precise supporting literature. CURE is publicly available at https://github.com/yanniangu/CURE.
Authors:Zhen Tan, Chengshuai Zhao, Song Wang, Jundong Li, Tianlong Chen, Huan Liu
Abstract:
Distilling robust reasoning capabilities from large language models (LLMs) into smaller, computationally efficient student models remains an unresolved challenge. Despite recent advances, distilled models frequently suffer from superficial pattern memorization and subpar generalization. To overcome these limitations, we introduce a novel distillation framework that moves beyond simple mimicry to instill a deeper conceptual understanding. Our framework features two key innovations. \underlineFirst, to address pattern memorization, Explanatory Inversion (EI) generates targeted ``explanatory probes'' that compel the student to articulate the underlying logic behind an answer, rather than just memorizing it. \underlineSecond, to improve generalization, Explanatory GRPO (\textttEXGRPO) uses a reinforcement learning algorithm with a novel Dialogue Structure Utility Bonus, which explicitly rewards the student for maintaining a coherent reasoning process across these probes. Extensive evaluations on 12 datasets demonstrate significant improvements. Using Gemma‑7b as the student model, our method yields an average 20.39% increase over zero‑shot performance and a 6.02% improvement over the state‑of‑the‑art distillation baselines. Moreover, models distilled with our method show remarkable training efficiency (e.g., surpassing vanilla fine‑tuning with 10‑25% training data) and strong generalization to out‑of‑distribution tasks. Implementation is released at https://github.com/Zhen‑Tan‑dmml/ExGRPO.git.
Authors:Swagat Padhan, Lakshya Jain, Bhavya Minesh Shah, Omkar Patil, Thao Nguyen, Nakul Gopalan
Abstract:
Robots collaborating with humans must convert natural language goals into actionable, physically grounded decisions. For example, executing a command such as "go two meters to the right of the fridge" requires grounding semantic references, spatial relations, and metric constraints within a 3D scene. While recent vision language models (VLMs) demonstrate strong semantic grounding capabilities, they are not explicitly designed to reason about metric constraints in physically defined spaces. In this work, we empirically demonstrate that state‑of‑the‑art VLM‑based grounding approaches struggle with complex metric‑semantic language queries. To address this limitation, we propose MAPG (Multi‑Agent Probabilistic Grounding), an agentic framework that decomposes language queries into structured subcomponents and queries a VLM to ground each component. MAPG then probabilistically composes these grounded outputs to produce metrically consistent, actionable decisions in 3D space. We evaluate MAPG on the HM‑EQA benchmark and show consistent performance improvements over strong baselines. Furthermore, we introduce a new benchmark, MAPG‑Bench, specifically designed to evaluate metric‑semantic goal grounding, addressing a gap in existing language grounding evaluations. We also present a real‑world robot demonstration showing that MAPG transfers beyond simulation when a structured scene representation is available.
Authors:Yuyang Liu
Abstract:
Combinatorial optimization problems arise in logistics, scheduling, and resource allocation, yet existing approaches face a fundamental trade‑off among generality, performance, and usability. We present cuGenOpt, a GPU‑accelerated general‑purpose metaheuristic framework that addresses all three dimensions simultaneously. At the engine level, cuGenOpt adopts a "one block evolves one solution" CUDA architecture with a unified encoding abstraction (permutation, binary, integer), a two‑level adaptive operator selection mechanism, and hardware‑aware resource management. At the extensibility level, a user‑defined operator registration interface allows domain experts to inject problem‑specific CUDA search operators. At the usability level, a JIT compilation pipeline exposes the framework as a pure‑Python API, and an LLM‑based modeling assistant converts natural‑language problem descriptions into executable solver code. Experiments across five thematic suites on three GPU architectures (T4, V100, A800) show that cuGenOpt outperforms general MIP solvers by orders of magnitude, achieves competitive quality against specialized solvers on instances up to n=150, and attains 4.73% gap on TSP‑442 within 30s. Twelve problem types spanning five encoding variants are solved to optimality. Framework‑level optimizations cumulatively reduce pcb442 gap from 36% to 4.73% and boost VRPTW throughput by 75‑81%. Code: https://github.com/L‑yang‑yang/cugenopt
Authors:Danaé Broustail, Anna Tegon, Thorir Mar Ingolfsson, Yawei Li, Luca Benini
Abstract:
Electroencephalography (EEG) enables non‑invasive monitoring of brain activity across clinical and neurotechnology applications, yet building foundation models for EEG remains challenging due to \emphdiffering electrode topologies and \emphcomputational scalability, as Transformer architectures incur quadratic sequence complexity. As a joint solution, we propose LuMamba (Latent Unified Mamba), a self‑supervised framework combining topology‑invariant encodings with linear‑complexity state‑space modeling, using LUNA's learned‑query cross‑attention mechanism for channel unification~\citeluna, and FEMBA's bidirectional Mamba blocks for efficient temporal modeling~\citefemba. Within this architecture, we provide the first systematic investigation of the Latent‑Euclidean Joint‑Embedding Predictive Architecture (LeJEPA) for biosignal learning. Pre‑trained on over 21,000 hours of unlabeled EEG from the TUEG corpus, LuMamba is evaluated on five downstream tasks spanning abnormality detection, artifact recognition, and mental condition classification across electrode configurations ranging from 16 to 26 channels. In the pre‑training objective, masked reconstruction alone yields structured but less generalizable representations, while LeJEPA alone produces diffuse embeddings; combining both objectives achieves the most robust performance. With only 4.6M parameters, LuMamba attains 80.99% balanced accuracy on TUAB and achieves state‑of‑art performance on Alzheimer's detection (0.97 AUPR), while requiring 377× fewer FLOPS than state‑of‑art models at equivalent sequence lengths and scaling to 12× longer sequences before reaching typical GPU memory limits. Code is available at https://github.com/pulp‑bio/biofoundation
Authors:Gagan Bhatia, Ahmad Muhammad Isa, Maxime Peyrard, Wei Zhao
Abstract:
We present MultiTempBench, a multilingual temporal reasoning benchmark spanning three tasks, date arithmetic, time zone conversion, and temporal relation extraction across five languages (English, German, Chinese, Arabic, and Hausa) and multiple calendar conventions (Gregorian, Hijri, and Chinese Lunar). MultiTempBench contains 15,000 examples built by translating 750 curated English questions and expanding each into controlled date‑format variants. We evaluate 20 LLMs and introduce the multilingual Date Fragmentation Ratio (mDFR), calibrated with human severity ratings, together with geometric‑probing analyses of internal temporal representations. We find tokenisation quality of temporal artefacts is a resource‑dependent bottleneck: in low‑resource languages and rarer calendar formats, fragmentation disrupts Year/Month/Day separation and accuracy collapses, while high‑resource settings are often robust to digit‑level splitting. Beyond tokenisation, crossed mixed‑effects regression shows that temporal linearity is the strongest predictor of temporal reasoning in high‑resource languages, whereas fragmentation is the stronger predictor in low‑resource languages. Code is available at: https://github.com/gagan3012/mtb
Authors:Yitong Li, Igor Yakushev, Dennis M. Hedderich, Christian Wachinger
Abstract:
Positron emission tomography (PET) is a widely recognized technique for diagnosing neurodegenerative diseases, offering critical functional insights. However, its high costs and radiation exposure hinder its widespread use. In contrast, magnetic resonance imaging (MRI) does not involve such limitations. While MRI also detects neurodegenerative changes, it is less sensitive for diagnosis compared to PET. To overcome such limitations, one approach is to generate synthetic PET from MRI. Recent advances in generative models have paved the way for cross‑modality medical image translation; however, existing methods largely emphasize structural preservation while neglecting the critical need for pathology awareness. To address this gap, we propose PASTA, a novel image translation framework built on conditional diffusion models with enhanced pathology awareness. PASTA surpasses state‑of‑the‑art methods by preserving both structural and pathological details through its highly interactive dual‑arm architecture and multi‑modal condition integration. Additionally, we introduce a novel cycle exchange consistency and volumetric generation strategy that significantly enhances PASTA's ability to produce high‑quality 3D PET images. Our qualitative and quantitative results demonstrate the high quality and pathology awareness of the synthesized PET scans. For Alzheimer's diagnosis, the performance of these synthesized scans improves over MRI by 4%, almost reaching the performance of actual PET. Our code is available at https://github.com/ai‑med/PASTA.
Authors:Youngwan Lee, Soojin Jang, Yoorhim Cho, Seunghwan Lee, Yong-Ju Lee, Sung Ju Hwang
Abstract:
Spatial reasoning is foundational for Vision‑Language Models (VLMs), particularly when deployed as Vision‑Language‑Action (VLA) agents in physical environments. However, existing benchmarks predominantly focus on elementary, single‑hop relations, neglecting the multi‑hop compositional reasoning and precise visual grounding essential for real‑world scenarios. To address this, we introduce MultihopSpatial, offering three key contributions: (1) A comprehensive benchmark designed for multi‑hop and compositional spatial reasoning, featuring 1‑ to 3‑hop complex queries across diverse spatial perspectives. (2) Acc@50IoU, a complementary metric that simultaneously evaluates reasoning and visual grounding by requiring both answer selection and precise bounding box prediction ‑ capabilities vital for robust VLA deployment. (3) MultihopSpatial‑Train, a dedicated large‑scale training corpus to foster spatial intelligence. Extensive evaluation of 37 state‑of‑the‑art VLMs yields eight key insights, revealing that compositional spatial reasoning remains a formidable challenge. Finally, we demonstrate that reinforcement learning post‑training on our corpus enhances both intrinsic VLM spatial reasoning and downstream embodied manipulation performance.
Authors:Xiao Feng, Bo Han, Zhanke Zhou, Jiaqi Fan, Jiangchao Yao, Ka Ho Li, Dahai Yu, Michael Kwok-Po Ng
Abstract:
Reinforcement learning (RL) holds significant promise for enhancing the agentic reasoning capabilities of large language models (LLMs) with external environments. However, the inherent sparsity of terminal rewards hinders fine‑grained, state‑level optimization. Although process reward modeling offers a promising alternative, training dedicated reward models often entails substantial computational costs and scaling difficulties. To address these challenges, we introduce RewardFlow, a lightweight method for estimating state‑level rewards tailored to agentic reasoning tasks. RewardFlow leverages the intrinsic topological structure of states within reasoning trajectories by constructing state graphs. This enables an analysis of state‑wise contributions to success, followed by topology‑aware graph propagation to quantify contributions and yield objective, state‑level rewards. When integrated as dense rewards for RL optimization, RewardFlow substantially outperforms prior RL baselines across four agentic reasoning benchmarks, demonstrating superior performance, robustness, and training efficiency. The implementation of RewardFlow is publicly available at https://github.com/tmlr‑group/RewardFlow.
Authors:Bishoy Galoaa, Shayda Moezzi, Xiangyu Bai, Sarah Ostadabbas
Abstract:
Recent research has made substantial progress on video reasoning, with many models leveraging spatio‑temporal evidence chains to strengthen their inference capabilities. At the same time, a growing set of datasets and benchmarks now provides structured annotations designed to support and evaluate such reasoning. However, little attention has been paid to reasoning about \emphhow objects move between observations: no prior work has articulated the motion patterns by connecting successive observations, leaving trajectory understanding implicit and difficult to verify. We formalize this missing capability as Spatial‑Temporal‑Trajectory (STT) reasoning and introduce Motion‑o, a motion‑centric video understanding extension to visual language models that makes trajectories explicit and verifiable. To enable motion reasoning, we also introduce a trajectory‑grounding dataset artifact that expands sparse keyframe supervision via augmentation to yield denser bounding box tracks and a stronger trajectory‑level training signal. Finally, we introduce Motion Chain of Thought (MCoT), a structured reasoning pathway that makes object trajectories through discrete \texttt<motion/> tag summarizing per‑object direction, speed, and scale (of velocity) change to explicitly connect grounded observations into trajectories. To train Motion‑o, we design a reward function that compels the model to reason directly over visual evidence, all while requiring no architectural modifications. Empirical results demonstrate that Motion‑o improves spatial‑temporal grounding and trajectory prediction while remaining fully compatible with existing frameworks, establishing motion reasoning as a critical extension for evidence‑based video understanding. Code is available at https://github.com/ostadabbas/Motion‑o.
Authors:Marcelo Fernandez
Abstract:
Agent Control Protocol (ACP) is a formal technical specification for governance of autonomous agents in B2B institutional environments. ACP is the admission control layer between agent intent and system state mutation: before any agent action reaches execution, it must pass a cryptographic admission check that validates identity, capability scope, delegation chain, and policy compliance simultaneously. ACP defines the mechanisms of cryptographic identity, capability‑based authorization, deterministic risk evaluation, verifiable chained delegation, transitive revocation, and immutable auditing that a system must implement for autonomous agents to operate under explicit institutional control. ACP operates as an additional layer on top of RBAC and Zero Trust, without replacing them. It is designed specifically for the problem that neither model solves: governing what an autonomous agent can do, under what conditions, with what limits, and with complete traceability for external auditing ‑‑ including across organizational boundaries. The v1.14 specification comprises 36 technical documents organized into five conformance levels (L1‑L5). It includes a Go reference implementation of 22 packages covering all L1‑L4 capabilities, 73 signed conformance test vectors (Ed25519 + SHA‑256), and an OpenAPI 3.1.0 specification for all HTTP endpoints. It defines more than 62 verifiable requirements, 12 prohibited behaviors, and the mechanisms for interoperability between institutions. Specification and implementation: https://github.com/chelof100/acp‑framework‑en
Authors:Djamel Bouchaffra, Fayçal Ykhlef, Hanene Azzag, Mustapha Lebbah, Bilal Faye
Abstract:
Standard attention mechanisms in transformers are limited by their pairwise formulation, which hinders the modeling of higher‑order dependencies among tokens. We introduce the NeuroGame Transformer (NGT) to overcome this by reconceptualizing attention through a dual perspective: tokens are treated simultaneously as players in a cooperative game and as interacting spins in a statistical physics system. Token importance is quantified using two complementary game‑theoretic concepts ‑‑ Shapley values for global, permutation‑based attribution and Banzhaf indices for local, coalition‑level influence. These are combined via a learnable gating parameter to form an external magnetic field, while pairwise interaction potentials capture synergistic relationships. The system's energy follows an Ising Hamiltonian, with attention weights emerging as marginal probabilities under the Gibbs distribution, efficiently computed via mean‑field equations. To ensure scalability despite the exponential coalition space, we develop importance‑weighted Monte Carlo estimators with Gibbs‑distributed weights. This approach avoids explicit exponential factors, ensuring numerical stability for long sequences. We provide theoretical convergence guarantees and characterize the fairness‑sensitivity trade‑off governed by the interpolation parameter. Experimental results demonstrate that the NeuroGame Transformer achieves strong performance across SNLI, and MNLI‑matched, outperforming some major efficient transformer baselines. On SNLI, it attains a test accuracy of 86.4% (with a peak validation accuracy of 86.6%), surpassing ALBERT‑Base and remaining highly competitive with RoBERTa‑Base. Code is available at https://github.com/dbouchaffra/NeuroGame‑Transformer.
Authors:Huichi Zhou, Siyuan Guo, Anjie Liu, Zhongwei Yu, Ziqin Gong, Bowen Zhao, Zhixun Chen, Menglong Zhang, Yihang Chen, Jinsong Li, Runyu Yang, Qiangbin Liu, Xinlei Yu, Jianmin Zhou, Na Wang, Chunyang Sun, Jun Wang
Abstract:
We introduce \emphMemento‑Skills, a generalist, continually‑learnable LLM agent system that functions as an \emphagent‑designing agent: it autonomously constructs, adapts, and improves task‑specific agents through experience. The system is built on a memory‑based reinforcement learning framework with \emphstateful prompts, where reusable skills (stored as structured markdown files) serve as persistent, evolving memory. These skills encode both behaviour and context, enabling the agent to carry forward knowledge across interactions. Starting from simple elementary skills (like Web search and terminal operations), the agent continually improves via the \emphRead‑‑Write Reflective Learning mechanism introduced in \emphMemento~2~\citewang2025memento2. In the \emphread phase, a behaviour‑trainable skill router selects the most relevant skill conditioned on the current stateful prompt; in the \emphwrite phase, the agent updates and expands its skill library based on new experience. This closed‑loop design enables \emphcontinual learning without updating LLM parameters, as all adaptation is realised through the evolution of externalised skills and prompts. Unlike prior approaches that rely on human‑designed agents, Memento‑Skills enables a generalist agent to \emphdesign agents end‑to‑end for new tasks. Through iterative skill generation and refinement, the system progressively improves its own capabilities. Experiments on the \emphGeneral AI Assistants benchmark and \emphHumanity's Last Exam demonstrate sustained gains, achieving 26.2% and 116.2% relative improvements in overall accuracy, respectively. Code is available at https://github.com/Memento‑Teams/Memento‑Skills.
Authors:Minhua Lin, Zhiwei Zhang, Hanqing Lu, Hui Liu, Xianfeng Tang, Qi He, Xiang Zhang, Suhang Wang
Abstract:
Memory‑augmented LLM agents maintain external memory banks to support long‑horizon interaction, yet most existing systems treat construction, retrieval, and utilization as isolated subroutines. This creates two coupled challenges: strategic blindness on the forward path of the memory cycle, where construction and retrieval are driven by local heuristics rather than explicit strategic reasoning, and sparse, delayed supervision on the backward path, where downstream failures rarely translate into direct repairs of the memory bank. To address these challenges, we propose MemMA, a plug‑and‑play multi‑agent framework that coordinates the memory cycle along both the forward and backward paths. On the forward path, a Meta‑Thinker produces structured guidance that steers a Memory Manager during construction and directs a Query Reasoner during iterative retrieval. On the backward path, MemMA introduces in‑situ self‑evolving memory construction, which synthesizes probe QA pairs, verifies the current memory, and converts failures into repair actions before the memory is finalized. Extensive experiments on LoCoMo show that MemMA consistently outperforms existing baselines across multiple LLM backbones and improves three different storage backends in a plug‑and‑play manner. Our code is publicly available at https://github.com/ventr1c/memma.
Authors:Jingguo Qu, Xinyang Han, Yao Pu, Man-Lik Chui, Simon Takadiyi Gunda, Ziman Chen, Jing Qin, Ann Dorothy King, Winnie Chiu-Wing Chu, Jing Cai, Michael Tin-Cheung Ying
Abstract:
Medical ultrasound image segmentation faces significant challenges due to limited labeled data and characteristic imaging artifacts including speckle noise and low‑contrast boundaries. While semi‑supervised learning (SSL) approaches have emerged to address data scarcity, existing methods suffer from suboptimal unlabeled data utilization and lack robust feature representation mechanisms. In this paper, we propose Switch, a novel SSL framework with two key innovations: (1) Multiscale Switch (MSS) strategy that employs hierarchical patch mixing to achieve uniform spatial coverage; (2) Frequency Domain Switch (FDS) with contrastive learning that performs amplitude switching in Fourier space for robust feature representations. Our framework integrates these components within a teacher‑student architecture to effectively leverage both labeled and unlabeled data. Comprehensive evaluation across six diverse ultrasound datasets (lymph nodes, breast lesions, thyroid nodules, and prostate) demonstrates consistent superiority over state‑of‑the‑art methods. At 5% labeling ratio, Switch achieves remarkable improvements: 80.04% Dice on LN‑INT, 85.52% Dice on DDTI, and 83.48% Dice on Prostate datasets, with our semi‑supervised approach even exceeding fully supervised baselines. The method maintains parameter efficiency (1.8M parameters) while delivering superior performance, validating its effectiveness for resource‑constrained medical imaging applications. The source code is publicly available at https://github.com/jinggqu/Switch
Authors:Pius Horn, Janis Keuper
Abstract:
Reliably extracting tables from PDFs is essential for large‑scale scientific data mining and knowledge base construction, yet existing evaluation approaches rely on rule‑based metrics that fail to capture semantic equivalence of table content. We present a benchmarking framework based on synthetically generated PDFs with precise LaTeX ground truth, using tables sourced from arXiv to ensure realistic complexity and diversity. As our central methodological contribution, we apply LLM‑as‑a‑judge for semantic table evaluation, integrated into a matching pipeline that accommodates inconsistencies in parser outputs. Through a human validation study comprising over 1,500 quality judgments on extracted table pairs, we show that LLM‑based evaluation achieves substantially higher correlation with human judgment (Pearson r=0.93) compared to Tree Edit Distance‑based Similarity (TEDS, r=0.68) and Grid Table Similarity (GriTS, r=0.70). Evaluating 21 contemporary PDF parsers across 100 synthetic documents containing 451 tables reveals significant performance disparities. Our results offer practical guidance for selecting parsers for tabular data extraction and establish a reproducible, scalable evaluation methodology for this critical task. Code and data: https://github.com/phorn1/pdf‑parse‑bench Metric study and human evaluation: https://github.com/phorn1/table‑metric‑study
Authors:Mingde Zhou, Zheng Chen, Yulun Zhang
Abstract:
Video compression aims to maximize reconstruction quality with minimal bitrates. Beyond standard distortion metrics, perceptual quality and temporal consistency are also critical. However, at ultra‑low bitrates, traditional end‑to‑end compression models tend to produce blurry images of poor perceptual quality. Besides, existing generative compression methods often treat video frames independently and show limitations in time coherence and efficiency. To address these challenges, we propose the Efficient Video Diffusion with Sparse Information Transmission (Diff‑SIT), which comprises the Sparse Temporal Encoding Module (STEM) and the One‑Step Video Diffusion with Frame Type Embedder (ODFTE). The STEM sparsely encodes the original frame sequence into an information‑rich intermediate sequence, achieving significant bitrate savings. Subsequently, the ODFTE processes this intermediate sequence as a whole, which exploits the temporal correlation. During this process, our proposed Frame Type Embedder (FTE) guides the diffusion model to perform adaptive reconstruction according to different frame types to optimize the overall quality. Extensive experiments on multiple datasets demonstrate that Diff‑SIT establishes a new state‑of‑the‑art in perceptual quality and temporal consistency, particularly in the challenging ultra‑low‑bitrate regime. Code is released at https://github.com/MingdeZhou/Diff‑SIT.
Authors:Seonghyun Jin, Jong Chul Ye
Abstract:
Streaming 3D reconstruction maintains a persistent latent state that is updated online from incoming frames, enabling constant‑memory inference. A key failure mode is the state update rule: aggressive overwrites forget useful history, while conservative updates fail to track new evidence, and both behaviors become unstable beyond the training horizon. To address this challenge, we propose FILT3R, a training‑free latent filtering layer that casts recurrent state updates as stochastic state estimation in token space. FILT3R maintains a per‑token variance and computes a Kalman‑style gain that adaptively balances memory retention against new observations. Process noise ‑‑ governing how much the latent state is expected to change between frames ‑‑ is estimated online from EMA‑normalized temporal drift of candidate tokens. Using extensive experiments, we demonstrate that FILT3R yields an interpretable, plug‑in update rule that generalizes common overwrite and gating policies as special cases. Specifically, we show that gains shrink in stable regimes as uncertainty contracts with accumulated evidence, and rise when genuine scene change increases process uncertainty, improving long‑horizon stability for depth, pose, and 3D reconstruction, compared to the existing methods. Code will be released at https://github.com/jinotter3/FILT3R.
Authors:Huy Che, Dinh-Duy Phan, Duc-Khai Lam
Abstract:
Collecting and annotating datasets for pixel‑level semantic segmentation tasks are highly labor‑intensive. Data augmentation provides a viable solution by enhancing model generalization without additional real‑world data collection. Traditional augmentation techniques, such as translation, scaling, and color transformations, create geometric variations but fail to generate new structures. While generative models have been employed to extend semantic information of datasets, they often struggle to maintain consistency between the original and generated images, particularly for pixel‑level tasks. In this work, we propose a novel synthetic data augmentation pipeline that integrates controllable diffusion models. Our approach balances diversity and reliability data, effectively bridging the gap between synthetic and real data. We utilize class‑aware prompting and visual prior blending to improve image quality further, ensuring precise alignment with segmentation labels. By evaluating benchmark datasets such as PASCAL VOC and BDD100K, we demonstrate that our method significantly enhances semantic segmentation performance, especially in data‑scarce scenarios, while improving model robustness in real‑world applications. Our code is available at \hrefhttps://github.com/chequanghuy/Enhanced‑Generative‑Data‑Augmentation‑for‑Semantic‑Segmentation‑via‑Stronger‑Guidancehttps://github.com/chequanghuy/Enhanced‑Generative‑Data‑Augmentation‑for‑Semantic‑Segmentation‑via‑Stronger‑Guidance.
Authors:Jason Dury
Abstract:
Embedding models group text by semantic content, what text is about. We show that temporal co‑occurrence within texts discovers a different kind of structure: recurrent transition‑structure concepts or what text does. We train a 29.4M‑parameter contrastive model on 373 million co‑occurrence pairs from 9,766 Project Gutenberg texts (24.96 million passages), mapping pre‑trained embeddings into an association space where passages with similar transition structure cluster together. Under capacity constraint (42.75% accuracy), the model must compress across recurring patterns rather than memorise individual co‑occurrences. Clustering at six granularities (k=50 to k=2,000) produces a multi‑resolution concept map; from broad modes like "direct confrontation" and "lyrical meditation" to precise registers and scene templates like "sailor dialect" and "courtroom cross‑examination." At k=100, clusters average 4,508 books each (of 9,766), confirming corpus‑wide patterns. Direct comparison with embedding‑similarity clustering shows that raw embeddings group by topic while association‑space clusters group by function, register, and literary tradition. Unseen novels are assigned to existing clusters without retraining; the association model concentrates each novel into a selective subset of coherent clusters, while raw embedding assignment saturates nearly all clusters. Validation controls address positional, length, and book‑concentration confounds. The method extends Predictive Associative Memory (PAM, arXiv:2602.11322) from episodic recall to concept formation: where PAM recalls specific associations, multi‑epoch contrastive training under compression extracts structural patterns that transfer to unseen texts, the same framework producing qualitatively different behaviour in a different regime.
Authors:Sanjay Basu, Sadiq Y. Patel, Parth Sheth, Bhairavi Muralidharan, Namrata Elamaran, Aakriti Kinra, John Morgan, Rajaie Batniji
Abstract:
Language models encode task‑relevant knowledge in internal representations that far exceeds their output performance, but whether mechanistic interpretability methods can bridge this knowledge‑action gap has not been systematically tested. We compared four mechanistic interpretability methods ‑‑ concept bottleneck steering (Steerling‑8B), sparse autoencoder feature steering, logit lens with activation patching, and linear probing with truthfulness separator vector steering (Qwen 2.5 7B Instruct) ‑‑ for correcting false‑negative triage errors using 400 physician‑adjudicated clinical vignettes (144 hazards, 256 benign). Linear probes discriminated hazardous from benign cases with 98.2% AUROC, yet the model's output sensitivity was only 45.1%, a 53‑percentage‑point knowledge‑action gap. Concept bottleneck steering corrected 20% of missed hazards but disrupted 53% of correct detections, indistinguishable from random perturbation (p=0.84). SAE feature steering produced zero effect despite 3,695 significant features. TSV steering at high strength corrected 24% of missed hazards while disrupting 6% of correct detections, but left 76% of errors uncorrected. Current mechanistic interpretability methods cannot reliably translate internal knowledge into corrected outputs, with implications for AI safety frameworks that assume interpretability enables effective error correction.
Authors:Zilin Huang, Zihao Sheng, Zhengyang Wan, Yansong Qu, Junwei You, Sicong Jiang, Sikai Chen
Abstract:
Ensuring safe decision‑making in autonomous vehicles remains a fundamental challenge despite rapid advances in end‑to‑end learning approaches. Traditional reinforcement learning (RL) methods rely on manually engineered rewards or sparse collision signals, which fail to capture the rich contextual understanding required for safe driving and make unsafe exploration unavoidable in real‑world settings. Recent vision‑language models (VLMs) offer promising semantic understanding capabilities; however, their high inference latency and susceptibility to hallucination hinder direct application to real‑time vehicle control. To address these limitations, this paper proposes DriveVLM‑RL, a neuroscience‑inspired framework that integrates VLMs into RL through a dual‑pathway architecture for safe and deployable autonomous driving. The framework decomposes semantic reward learning into a Static Pathway for continuous spatial safety assessment using CLIP‑based contrasting language goals, and a Dynamic Pathway for attention‑gated multi‑frame semantic risk reasoning using a lightweight detector and a large VLM. A hierarchical reward synthesis mechanism fuses semantic signals with vehicle states, while an asynchronous training pipeline decouples expensive VLM inference from environment interaction. All VLM components are used only during offline training and are removed at deployment, ensuring real‑time feasibility. Experiments in the CARLA simulator show significant improvements in collision avoidance, task success, and generalization across diverse traffic scenarios, including strong robustness under settings without explicit collision penalties. These results demonstrate that DriveVLM‑RL provides a practical paradigm for integrating foundation models into autonomous driving without compromising real‑time feasibility. Demo video and code are available at: https://zilin‑huang.github.io/DriveVLM‑RL‑website/
Authors:Kaiyang Li, Shihao Ji, Zhipeng Cai, Wei Li
Abstract:
Approximate subgraph matching (ASM) is a task that determines the approximate presence of a given query graph in a large target graph. Being an NP‑hard problem, ASM is critical in graph analysis with a myriad of applications ranging from database systems and network science to biochemistry and privacy. Existing techniques often employ heuristic search strategies, which cannot fully utilize the graph information, leading to sub‑optimal solutions. This paper proposes a Reinforcement Learning based Approximate Subgraph Matching (RL‑ASM) algorithm that exploits graph transformers to effectively extract graph representations and RL‑based policies for ASM. Our model is built upon the branch‑and‑bound algorithm that selects one pair of nodes from the two input graphs at a time for potential matches. Instead of using heuristics, we exploit a Graph Transformer architecture to extract feature representations that encode the full graph information. To enhance the training of the RL policy, we use supervised signals to guide our agent in an imitation learning stage. Subsequently, the policy is fine‑tuned with the Proximal Policy Optimization (PPO) that optimizes the accumulative long‑term rewards over episodes. Extensive experiments on both synthetic and real‑world datasets demonstrate that our RL‑ASM outperforms existing methods in terms of effectiveness and efficiency. Our source code is available at https://github.com/KaiyangLi1992/RL‑ASM.
Authors:Haocheng Luo, Zehang Deng, Thanh-Toan Do, Mehrtash Harandi, Dinh Phung, Trung Le
Abstract:
Direct Preference Optimization (DPO) has emerged as a popular algorithm for aligning pretrained large language models with human preferences, owing to its simplicity and training stability. However, DPO suffers from the recently identified squeezing effect (also known as likelihood displacement), where the probability of preferred responses decreases unintentionally during training. To understand and mitigate this phenomenon, we develop a theoretical framework that models the coordinate‑wise dynamics in logit space. Our analysis reveals that negative‑gradient updates cause residuals to expand rapidly along high‑curvature directions, which underlies the squeezing effect, whereas Sharpness‑Aware Minimization (SAM) can suppress this behavior through its curvature‑regularization effect. Building on this insight, we investigate logits‑SAM, a computationally efficient variant that perturbs only the output layer with negligible overhead. Extensive experiments on Pythia‑2.8B, Mistral‑7B, and Gemma‑2B‑IT across multiple datasets and benchmarks demonstrate that logits‑SAM consistently improves the effectiveness of DPO and integrates seamlessly with other DPO variants. Code is available at https://github.com/RitianLuo/logits‑sam‑dpo.
Authors:Naoki Morihira, Amal Nahar, Kartik Bharadwaj, Yasuhiro Kato, Akinobu Hayashi, Tatsuya Harada
Abstract:
A central challenge in image‑based Model‑Based Reinforcement Learning (MBRL) is to learn representations that distill essential information from irrelevant visual details. While promising, reconstruction‑based methods often waste capacity on large task‑irrelevant regions. Decoder‑free methods instead learn robust representations by leveraging Data Augmentation (DA), but reliance on such external regularizers limits versatility. We propose R2‑Dreamer, a decoder‑free MBRL framework with a self‑supervised objective that serves as an internal regularizer, preventing representation collapse without resorting to DA. The core of our method is a redundancy‑reduction objective inspired by Barlow Twins, which can be easily integrated into existing frameworks. On DeepMind Control Suite and Meta‑World, R2‑Dreamer is competitive with strong baselines such as DreamerV3 and TD‑MPC2 while training 1.59x faster than DreamerV3, and yields substantial gains on DMC‑Subtle with tiny task‑relevant objects. These results suggest that an effective internal regularizer can enable versatile, high‑performance decoder‑free MBRL. Code is available at https://github.com/NM512/r2dreamer.
Authors:Mohammed Rahman Sherif Khan Mohammad, Ardhendu Behera, Sandip Pradhan, Swagat Kumar, Amr Ahmed
Abstract:
Recent adapter‑based CLIP tuning (e.g., Tip‑Adapter) is a strong few‑shot learner, achieving efficiency by caching support features for fast prototype matching. However, these methods rely on global uni‑modal feature vectors, overlooking fine‑grained patch relations and their structural alignment with class text. To bridge this gap without incurring inference costs, we introduce a novel asymmetric training‑only framework. Instead of altering the lightweight adapter, we construct a high‑capacity auxiliary Heterogeneous Graph Teacher that operates solely during training. This teacher (i) integrates multi‑scale visual patches and text prompts into a unified graph, (ii) performs deep cross‑modal reasoning via a Modality‑aware Graph Transformer (MGT), and (iii) applies discriminative node filtering to extract high‑fidelity class features. Crucially, we employ a cache‑aware dual‑objective strategy to supervise this relational knowledge directly into the Tip‑Adapter's key‑value cache, effectively upgrading the prototypes while the graph teacher is discarded at test time. Thus, inference remains identical to Tip‑Adapter with zero extra latency or memory. Across standard 1‑16‑shot benchmarks, our method consistently establishes a new state‑of‑the‑art. Ablations confirm that the auxiliary graph supervision, text‑guided reasoning, and node filtering are the essential ingredients for robust few‑shot adaptation. Code is available at https://github.com/MR‑Sherif/TOGA.git.
Authors:Yitian Gong, Botian Jiang, Yiwei Zhao, Yucheng Yuan, Kuangwei Chen, Yaozhou Jiang, Cheng Chang, Dong Hong, Mingshu Chen, Ruixiao Li, Yiyang Zhang, Yang Gao, Hanfu Chen, Ke Chen, Songlin Wang, Xiaogui Yang, Yuqian Zhang, Kexin Huang, ZhengYuan Lin, Kang Yu, Ziqi Chen, Jin Wang, Zhaoye Fei, Qinyuan Cheng, Shimin Li, Xipeng Qiu
Abstract:
This technical report presents MOSS‑TTS, a speech generation foundation model built on a scalable recipe: discrete audio tokens, autoregressive modeling, and large‑scale pretraining. Built on MOSS‑Audio‑Tokenizer, a causal Transformer tokenizer that compresses 24 kHz audio to 12.5 fps with variable‑bitrate RVQ and unified semantic‑acoustic representations, we release two complementary generators: MOSS‑TTS, which emphasizes structural simplicity, scalability, and long‑context/control‑oriented deployment, and MOSS‑TTS‑Local‑Transformer, which introduces a frame‑local autoregressive module for higher modeling efficiency, stronger speaker preservation, and a shorter time to first audio. Across multilingual and open‑domain settings, MOSS‑TTS supports zero‑shot voice cloning, token‑level duration control, phoneme‑/pinyin‑level pronunciation control, smooth code‑switching, and stable long‑form generation. This report summarizes the design, training recipe, and empirical characteristics of the released models.
Authors:Sunil Prakash
Abstract:
Multi‑agent LLM systems delegate tasks across trust boundaries, but current protocols do not govern delegation under unverifiable quality claims. We show that when delegates can inflate self‑reported quality scores, quality‑based routing produces a provenance paradox: it systematically selects the worst delegates, performing worse than random. We extend the LLM Delegate Protocol (LDP) with delegation contracts that bound authority through explicit objectives, budgets, and failure policies; a claimed‑vs‑attested identity model that distinguishes self‑reported from verified quality; and typed failure semantics enabling automated recovery. In controlled experiments with 10 simulated delegates and validated with real Claude models, routing by self‑claimed quality scores performs worse than random selection (simulated: 0.55 vs. 0.68; real models: 8.90 vs. 9.30), while attested routing achieves near‑optimal performance (d = 9.51, p < 0.001). Sensitivity analysis across 36 configurations confirms the paradox emerges reliably when dishonest delegates are present. All extensions are backward‑compatible with sub‑microsecond validation overhead.
Authors:Hao Ke
Abstract:
Current LLM agent frameworks often implement isolation, scheduling, and communication at the application layer, even though these mechanisms are already provided by mature operating systems. Instead of introducing another application‑layer orchestrator, this paper presents Quine, a runtime architecture and reference implementation that realizes LLM agents as native POSIX processes. The mapping is explicit: identity is PID, interface is standard streams and exit status, state is memory, environment variables, and filesystem, and lifecycle is fork/exec/exit. A single executable implements this model by recursively spawning fresh instances of itself. By grounding the agent abstraction in the OS process model, Quine inherits isolation, composition, and resource control directly from the kernel, while naturally supporting recursive delegation, context renewal via exec, and shell‑native composition. The design also exposes where the POSIX process model stops: processes provide a robust substrate for execution, but not a complete runtime model for cognition. In particular, the analysis points toward two immediate extensions beyond process semantics: task‑relative worlds and revisable time. A reference implementation of Quine is publicly available on GitHub.
Authors:Kevin Qu, Haozhe Qi, Mihai Dusmanu, Mahdi Rad, Rui Wang, Marc Pollefeys
Abstract:
Multimodal Large Language Models (MLLMs) have made impressive progress in connecting vision and language, but they still struggle with spatial understanding and viewpoint‑aware reasoning. Recent efforts aim to augment the input representations with geometric cues rather than explicitly teaching models to reason in 3D space. We introduce Loc3R‑VLM, a framework that equips 2D Vision‑Language Models with advanced 3D understanding capabilities from monocular video input. Inspired by human spatial cognition, Loc3R‑VLM relies on two joint objectives: global layout reconstruction to build a holistic representation of the scene structure, and explicit situation modeling to anchor egocentric perspective. These objectives provide direct spatial supervision that grounds both perception and language in a 3D context. To ensure geometric consistency and metric‑scale alignment, we leverage lightweight camera pose priors extracted from a pre‑trained 3D foundation model. Loc3R‑VLM achieves state‑of‑the‑art performance in language‑based localization and outperforms existing 2D‑ and video‑based approaches on situated and general 3D question‑answering benchmarks, demonstrating that our spatial supervision framework enables strong 3D understanding. Project page: https://kevinqu7.github.io/loc3r‑vlm
Authors:Zhang Zhang, Shuqi Lu, Hongjin Qian, Di He, Zheng Liu
Abstract:
Building LLM‑based agents has become increasingly important. Recent works on LLM‑based agent self‑evolution primarily record successful experiences as textual prompts or reflections, which cannot reliably guarantee efficient task re‑execution in complex scenarios. We propose AgentFactory, a new self‑evolution paradigm that preserves successful task solutions as executable subagent code rather than textual experience. Crucially, these subagents are continuously refined based on execution feedback, becoming increasingly robust and efficient as more tasks are encountered. Saved subagents are pure Python code with standardized documentation, enabling portability across any Python‑capable system. We demonstrate that AgentFactory enables continuous capability accumulation: its library of executable subagents grows and improves over time, progressively reducing the effort required for similar tasks without manual intervention. Our implementation is open‑sourced at https://github.com/zzatpku/AgentFactory, and our demonstration video is available at https://youtu.be/iKSsuAXJHW0.
Authors:Pepe Alonso, Sergio Yovine, Victor A. Braberman
Abstract:
AI coding agents can resolve real‑world software issues, yet they frequently introduce regressions ‑‑ breaking tests that previously passed. Current benchmarks focus almost exclusively on resolution rate, leaving regression behavior under‑studied. This paper presents TDAD (Test‑Driven Agentic Development), an open‑source tool that performs pre‑change impact analysis for AI coding agents. TDAD builds a dependency map between source code and tests so that before committing a patch, the agent knows which tests to verify and can self‑correct. The map is delivered as a lightweight agent skill ‑‑ a static text file the agent queries at runtime. Evaluated on SWE‑bench Verified with two open‑weight models running on consumer hardware (Qwen3‑Coder 30B, 100 instances; Qwen3.5‑35B‑A3B, 25 instances), TDAD reduced regressions by 70% (6.08% to 1.82%) compared to a vanilla baseline. In contrast, adding TDD procedural instructions without targeted test context increased regressions to 9.94% ‑‑ worse than no intervention at all. When deployed as an agent skill with a different model and framework, TDAD improved issue‑resolution rate from 24% to 32%, confirming that surfacing contextual information outperforms prescribing procedural workflows. All code, data, and logs are publicly available at https://github.com/pepealonso95/TDAD.
Authors:Alexander D. Goldie, Zilin Wang, Adrian Hayler, Deepak Nathani, Edan Toledo, Ken Thampiratwong, Aleksandra Kalisz, Michael Beukman, Alistair Letcher, Shashank Reddy, Clarisse Wibault, Theo Wolf, Charles O'Neill, Uljad Berdica, Nicholas Roberts, Saeed Rahmani, Hannah Erlebach, Roberta Raileanu, Shimon Whiteson, Jakob N. Foerster
Abstract:
Automating the development of machine learning algorithms has the potential to unlock new breakthroughs. However, our ability to improve and evaluate algorithm discovery systems has thus far been limited by existing task suites. They suffer from many issues, such as: poor evaluation methodologies; data contamination; and containing saturated or very similar problems. Here, we introduce DiscoGen, a procedural generator of algorithm discovery tasks for machine learning, such as developing optimisers for reinforcement learning or loss functions for image classification. Motivated by the success of procedural generation in reinforcement learning, DiscoGen spans millions of tasks of varying difficulty and complexity from a range of machine learning fields. These tasks are specified by a small number of configuration parameters and can be used to optimise algorithm discovery agents (ADAs). We present DiscoBench, a benchmark consisting of a fixed, small subset of DiscoGen tasks for principled evaluation of ADAs. Finally, we propose a number of ambitious, impactful research directions enabled by DiscoGen, in addition to experiments demonstrating its use for prompt optimisation of an ADA. DiscoGen is released open‑source at https://github.com/AlexGoldie/discogen.
Authors:Zunzhe Zhang, Runhan Huang, Yicheng Liu, Shaoting Zhu, Linzhan Mou, Hang Zhao
Abstract:
Diffusion models and flow matching have become a cornerstone of robotic imitation learning, yet they suffer from a structural inefficiency where inference is often bound to a fixed integration schedule that is agnostic to state complexity. This paradigm forces the policy to expend the same computational budget on trivial motions as it does on complex tasks. We introduce Generative Control as Optimization (GeCO), a time‑unconditional framework that transforms action synthesis from trajectory integration into iterative optimization. GeCO learns a stationary velocity field in the action‑sequence space where expert behaviors form stable attractors. Consequently, test‑time inference becomes an adaptive process that allocates computation based on convergence‑‑exiting early for simple states while refining longer for difficult ones. Furthermore, this stationary geometry yields an intrinsic, training‑free safety signal, as the field norm at the optimized action serves as a robust out‑of‑distribution (OOD) detector, remaining low for in‑distribution states while significantly increasing for anomalies. We validate GeCO on standard simulation benchmarks and demonstrate seamless scaling to pi0‑series Vision‑Language‑Action (VLA) models. As a plug‑and‑play replacement for standard flow‑matching heads, GeCO improves success rates and efficiency with an optimization‑native mechanism for safe deployment. Video and code can be found at https://hrh6666.github.io/GeCO/
Authors:Ziwei Xiang, Fanhu Zeng, Hongjian Fang, Rui-Qi Wang, Renxing Chen, Yanan Zhu, Yi Chen, Peipei Yang, Xu-Yao Zhang
Abstract:
Large Vision Language Models (LVLMs) have achieved remarkable success in a range of downstream tasks that require multimodal interaction, but their capabilities come with substantial computational and memory overhead, which hinders practical deployment. Among numerous acceleration techniques, post‑training quantization is a popular and effective strategy for reducing memory cost and accelerating inference. However, existing LVLM quantization methods typically measure token sensitivity at the modality level, which fails to capture the complex cross‑token interactions and falls short in quantitatively measuring the quantization error at the token level. As tokens interact within the model, the distinction between modalities gradually diminishes, suggesting the need for fine‑grained calibration. Inspired by axiomatic attribution in mechanistic interpretability, we introduce a fine‑grained quantization strategy on Quantization‑aware Integrated Gradients (QIG), which leverages integrated gradients to quantitatively evaluate token sensitivity and push the granularity from modality level to token level, reflecting both inter‑modality and intra‑modality dynamics. Extensive experiments on multiple LVLMs under both W4A8 and W3A16 settings show that our method improves accuracy across models and benchmarks with negligible latency overhead. For example, under 3‑bit weight‑only quantization, our method improves the average accuracy of LLaVA‑onevision‑7B by 1.60%, reducing the gap to its full‑precision counterpart to only 1.33%. The code is available at https://github.com/ucas‑xiang/QIG.
Authors:Hamed Taheri
Abstract:
Enterprise AI deploys dozens of autonomous agent nodes across workflows, each acting on the same entities with no shared memory and no common governance. We identify five structural challenges arising from this memory governance gap: memory silos across agent workflows; governance fragmentation across teams and tools; unstructured memories unusable by downstream systems; redundant context delivery in autonomous multi‑step executions; and silent quality degradation without feedback loops. We present Governed Memory, a shared memory and governance layer addressing this gap through four mechanisms: a dual memory model combining open‑set atomic facts with schema‑enforced typed properties; tiered governance routing with progressive context delivery; reflection‑bounded retrieval with entity‑scoped isolation; and a closed‑loop schema lifecycle with AI‑assisted authoring and automated per‑property refinement. We validate each mechanism through controlled experiments (N=250, five content types): 99.6% fact recall with complementary dual‑modality coverage; 92% governance routing precision; 50% token reduction from progressive delivery; zero cross‑entity leakage across 500 adversarial queries; 100% adversarial governance compliance; and output quality saturation at approximately seven governed memories per entity. On the LoCoMo benchmark, the architecture achieves 74.8% overall accuracy, confirming that governance and schema enforcement impose no retrieval quality penalty. The system is in production at Personize.ai.
Authors:Teng Pan, Yuchen Yan, Zixuan Wang, Ruiqing Zhang, Gaiyang Han, Wanqi Zhang, Weiming Lu, Jun Xiao, Yongliang Shen
Abstract:
Label‑free reinforcement learning enables large language models to improve reasoning capabilities without ground‑truth supervision, typically by treating majority‑voted answers as pseudo‑labels. However, we identify a critical failure mode: as training maximizes self‑consistency, output diversity collapses, causing the model to confidently reinforce systematic errors that evade detection. We term this the consensus trap. To escape it, we propose CoVerRL, a framework where a single model alternates between generator and verifier roles, with each capability bootstrapping the other. Majority voting provides noisy but informative supervision for training the verifier, while the improving verifier progressively filters self‑consistent errors from pseudo‑labels. This co‑evolution creates a virtuous cycle that maintains high reward accuracy throughout training. Experiments across Qwen and Llama model families demonstrate that CoVerRL outperforms label‑free baselines by 4.7‑5.9% on mathematical reasoning benchmarks. Moreover, self‑verification accuracy improves from around 55% to over 85%, confirming that both capabilities genuinely co‑evolve.
Authors:Rui Xiao, Sanghwan Kim, Yongqin Xian, Zeynep Akata, Stephan Alaniz
Abstract:
Multimodal large language models (MLLMs) struggle with hallucinations, particularly with fine‑grained queries, a challenge underrepresented by existing benchmarks that focus on coarse image‑related questions. We introduce FIne‑grained NEgative queRies (FINER), alongside two benchmarks: FINER‑CompreCap and FINER‑DOCCI. Using FINER, we analyze hallucinations across four settings: multi‑object, multi‑attribute, multi‑relation, and ``what'' questions. Our benchmarks reveal that MLLMs hallucinate when fine‑grained mismatches co‑occur with genuinely present elements in the image. To address this, we propose FINER‑Tuning, leveraging Direct Preference Optimization (DPO) on FINER‑inspired data. Finetuning four frontier MLLMs with FINER‑Tuning yields up to 24.2% gains (InternVL3.5‑14B) on hallucinations from our benchmarks, while simultaneously improving performance on eight existing hallucination suites, and enhancing general multimodal capabilities across six benchmarks. Code, benchmark, and models are available at \hrefhttps://explainableml.github.io/finer‑project/https://explainableml.github.io/finer‑project/.
Authors:Tae Eun Choi, Sumin Shim, Junhyeok Kim, Seong Jae Hwang
Abstract:
Generative inbetweening (GI) seeks to synthesize realistic intermediate frames between the first and last keyframes beyond mere interpolation. As sequences become sparser and motions larger, previous GI models struggle with inconsistent frames with unstable pacing and semantic misalignment. Since GI involves fixed endpoints and numerous plausible paths, this task requires additional guidance gained from the keyframes and text to specify the intended path. Thus, we give semantic and temporal guidance from the keyframes and text onto each intermediate frame through Keyframe‑anchored Attention Bias. We also better enforce frame consistency with Rescaled Temporal RoPE, which allows self‑attention to attend to keyframes more faithfully. TGI‑Bench, the first benchmark specifically designed for text‑conditioned GI evaluation, enables challenge‑targeted evaluation to analyze GI models. Without additional training, our method achieves state‑of‑the‑art frame consistency, semantic fidelity, and pace stability for both short and long sequences across diverse challenges.
Authors:Binqing Wu, Zongjiang Shang, Shiyu Liu, Jianlong Huang, Jiahui Xu, Ling Chen
Abstract:
Accurate air quality forecasting is essential for public health and environmental sustainability, but remains challenging due to the complex pollutant dynamics. Existing deep learning methods often model pollutant dynamics as an instantaneous process, overlooking the intrinsic delays in pollutant propagation. Thus, we propose AirDDE, the first neural delay differential equation framework in this task that integrates delay modeling into a continuous‑time pollutant evolution under physical guidance. Specifically, two novel components are introduced: (1) a memory‑augmented attention module that retrieves globally and locally historical features, which can adaptively capture delay effects modulated by multifactor data; and (2) a physics‑guided delay evolving function, grounded in the diffusion‑advection equation, that models diffusion, delayed advection, and source/sink terms, which can capture delay‑aware pollutant accumulation patterns with physical plausibility. Extensive experiments on three real‑world datasets demonstrate that AirDDE achieves the state‑of‑the‑art forecasting performance with an average MAE reduction of 8.79% over the best baselines. The code is available at https://github.com/w2obin/airdde‑aaai.
Authors:Madhav S. Baidya, S. S. Baidya, Chirag Chawla
Abstract:
The rapid proliferation of large language models (LLMs) has created an urgent need for robust and generalizable detectors of machine‑generated text. Existing benchmarks typically evaluate a single detector on a single dataset under ideal conditions, leaving open questions about cross‑domain transfer, cross‑LLM generalization, and adversarial robustness. We present a comprehensive benchmark evaluating diverse detection approaches across two corpora: HC3 (23,363 human‑ChatGPT pairs) and ELI5 (15,000 human‑Mistral‑7B pairs). Methods include classical classifiers, fine‑tuned transformer encoders (BERT, RoBERTa, ELECTRA, DistilBERT, DeBERTa‑v3), a CNN, an XGBoost stylometric model, perplexity‑based detectors, and LLM‑as‑detector prompting. Results show that transformer models achieve near‑perfect in‑distribution performance but degrade under domain shift. The XGBoost stylometric model matches performance while remaining interpretable. LLM‑based detectors underperform and are affected by generator‑detector identity bias. Perplexity‑based methods exhibit polarity inversion, with modern LLM outputs showing lower perplexity than human text, but remain effective when corrected. No method generalizes robustly across domains and LLM sources.
Authors:Segyu Lee, Boryeong Cho, Hojung Jung, Seokhyun An, Juhyeong Kim, Jaehyun Kwak, Yongjin Yang, Sangwon Jang, Youngrok Park, Wonjun Chang, Se-Young Yun
Abstract:
Unified Multimodal Models (UMMs) offer powerful cross‑modality capabilities but introduce new safety risks not observed in single‑task models. Despite their emergence, existing safety benchmarks remain fragmented across tasks and modalities, limiting the comprehensive evaluation of complex system‑level vulnerabilities. To address this gap, we introduce UniSAFE, the first comprehensive benchmark for system‑level safety evaluation of UMMs across 7 I/O modality combinations, spanning conventional tasks and novel multimodal‑context image generation settings. UniSAFE is built with a shared‑target design that projects common risk scenarios across task‑specific I/O configurations, enabling controlled cross‑task comparisons of safety failures. Comprising 6,802 curated instances, we use UniSAFE to evaluate 15 state‑of‑the‑art UMMs, both proprietary and open‑source. Our results reveal critical vulnerabilities across current UMMs, including elevated safety violations in multi‑image composition and multi‑turn settings, with image‑output tasks consistently more vulnerable than text‑output tasks. These findings highlight the need for stronger system‑level safety alignment for UMMs. Our code and data are publicly available at https://github.com/segyulee/UniSAFE
Authors:Chupeng Liu, Jiyong Rao, Shangquan Sun, Runkai Zhao, Weidong Cai
Abstract:
Monocular 3D object detection typically relies on pseudo‑labeling techniques to reduce dependency on real‑world annotations. Recent advances demonstrate that deterministic linguistic cues can serve as effective auxiliary weak supervision signals, providing complementary semantic context. However, hand‑crafted textual descriptions struggle to capture the inherent visual diversity of individuals across scenes, limiting the model's ability to learn scene‑aware representations. To address this challenge, we propose Visual‑referred Probabilistic Prompt Learning (VirPro), an adaptive multi‑modal pretraining paradigm that can be seamlessly integrated into diverse weakly supervised monocular 3D detection frameworks. Specifically, we generate a diverse set of learnable, instance‑conditioned prompts across scenes and store them in an Adaptive Prompt Bank (APB). Subsequently, we introduce Multi‑Gaussian Prompt Modeling (MGPM), which incorporates scene‑based visual features into the corresponding textual embeddings, allowing the text prompts to express visual uncertainties. Then, from the fused vision‑language embeddings, we decode a prompt‑targeted Gaussian, from which we derive a unified object‑level prompt embedding for each instance. RoI‑level contrastive matching is employed to enforce modality alignment, bringing embeddings of co‑occurring objects within the same scene closer in the latent space, thus enhancing semantic coherence. Extensive experiments on the KITTI benchmark demonstrate that integrating our pretraining paradigm consistently yields substantial performance gains, achieving up to a 4.8% average precision improvement than the baseline. Code is available at https://github.com/AustinLCP/VirPro.
Authors:Yuelin Zhang, Sijie Cheng, Chen Li, Zongzhao Li, Yuxin Huang, Yang Liu, Wenbing Huang
Abstract:
Accurately estimating task progress is critical for embodied agents to plan and execute long‑horizon, multi‑step tasks. Despite promising advances, existing Vision‑Language Models (VLMs) based methods primarily leverage their video understanding capabilities, while neglecting their complex reasoning potential. Furthermore, processing long video trajectories with VLMs is computationally prohibitive for real‑world deployment. To address these challenges, we propose the Recurrent Reasoning Vision‑Language Model (\textR^2VLM). Our model features a recurrent reasoning framework that processes local video snippets iteratively, maintaining a global context through an evolving Chain of Thought (CoT). This CoT explicitly records task decomposition, key steps, and their completion status, enabling the model to reason about complex temporal dependencies. This design avoids the high cost of processing long videos while preserving essential reasoning capabilities. We train \textR^2VLM on large‑scale, automatically generated datasets from ALFRED and Ego4D. Extensive experiments on progress estimation and downstream applications, including progress‑enhanced policy learning, reward modeling for reinforcement learning, and proactive assistance, demonstrate that \textR^2VLM achieves strong performance and generalization, achieving a new state‑of‑the‑art in long‑horizon task progress estimation. The models and benchmarks are publicly available at \hrefhttps://huggingface.co/collections/zhangyuelin/r2vlmhuggingface.
Authors:Haiyang Yan, Hongyun Zhou, Peng Xu, Xiaoxue Feng, Mengyi Liu
Abstract:
Despite rapid developments and widespread applications of MLLM agents, they still struggle with long‑form video understanding (LVU) tasks, which are characterized by high information density and extended temporal spans. Recent research on LVU agents demonstrates that simple task decomposition and collaboration mechanisms are insufficient for long‑chain reasoning tasks. Moreover, directly reducing the time context through embedding‑based retrieval may lose key information of complex problems. In this paper, we propose Symphony, a multi‑agent system, to alleviate these limitations. By emulating human cognition patterns, Symphony decomposes LVU into fine‑grained subtasks and incorporates a deep reasoning collaboration mechanism enhanced by reflection, effectively improving the reasoning capability. Additionally, Symphony provides a VLM‑based grounding approach to analyze LVU tasks and assess the relevance of video segments, which significantly enhances the ability to locate complex problems with implicit intentions and large temporal spans. Experimental results show that Symphony achieves state‑of‑the‑art performance on LVBench, LongVideoBench, VideoMME, and MLVU, with a 5.0% improvement over the prior state‑of‑the‑art method on LVBench. Code is available at https://github.com/Haiyang0226/Symphony.
Authors:Weihua Xiao, Jason Blocklove, Matthew DeLorenzo, Johann Knechtel, Ozgur Sinanoglu, Kanad Basu, Jeyavijayan Rajendran, Siddharth Garg, Ramesh Karri
Abstract:
GenAI Units In Digital Design Education (GUIDE) is an open courseware repository with runnable Google Colab labs and other materials. We describe the repository's architecture and educational approach based on standardized teaching units comprising slides, short videos, runnable labs, and related papers. This organization enables consistency for both the students' learning experience and the reuse and grading by instructors. We demonstrate GUIDE in practice with three representative units: VeriThoughts for reasoning and formal‑verification‑backed RTL generation, enhanced LLM‑aided testbench generation, and LLMPirate for IP Piracy. We also provide details for four example course instances (GUIDE4ChipDesign, Build your ASIC, GUIDE4HardwareSecurity, and Hardware Design) that assemble GUIDE units into full semester offerings, learning outcomes, and capstone projects, all based on proven materials. For example, the GUIDE4HardwareSecurity course includes a project on LLM‑aided hardware Trojan insertion that has been successfully deployed in the classroom and in Cybersecurity Games and Conference (CSAW), a student competition and academic conference for cybersecurity. We also organized an NYU Cognichip Hackathon, engaging students across 24 international teams in AI‑assisted RTL design workflows. The GUIDE repository is open for contributions and available at: https://github.com/FCHXWH823/LLM4ChipDesign.
Authors:Pengyu Zhang, Klim Zaporojets, Jie Liu, Jia-Hong Huang, Paul Groth
Abstract:
Multi‑Modal Knowledge Graphs (MMKGs) benefit from visual information, yet large‑scale image collection is hard to curate and often excludes ambiguous but relevant visuals (e.g., logos, symbols, abstract scenes). We present Beyond Images, an automatic data‑centric enrichment pipeline with optional human auditing. This pipeline operates in three stages: (1) large‑scale retrieval of additional entity‑related images, (2) conversion of all visual inputs into textual descriptions to ensure that ambiguous images contribute usable semantics rather than noise, and (3) fusion of multi‑source descriptions using a large language model (LLM) to generate concise, entity‑aligned summaries. These summaries replace or augment the text modality in standard MMKG models without changing their architectures or loss functions. Across three public MMKG datasets and multiple baseline models, we observe consistent gains (up to 7% Hits@1 overall). Furthermore, on a challenging subset of entities with visually ambiguous logos and symbols, converting images into text yields large improvements (201.35% MRR and 333.33% Hits@1). Additionally, we release a lightweight Text‑Image Consistency Check Interface for optional targeted audits, improving description quality and dataset reliability. Our results show that scaling image coverage and converting ambiguous visuals into text is a practical path to stronger MMKG completion. Code, datasets, and supplementary materials are available at https://github.com/pengyu‑zhang/Beyond‑Images.
Authors:Hisayuki Yokomizo, Taiki Miyanishi, Yan Gang, Shuhei Kurita, Nakamasa Inoue, Yusuke Iwasawa
Abstract:
Vision‑Language Models (VLMs) are increasingly applied to robotic perception and manipulation, yet their ability to infer physical properties required for manipulation remains limited. In particular, estimating the mass of real‑world objects is essential for determining appropriate grasp force and ensuring safe interaction. However, current VLMs lack reliable mass reasoning capabilities, and most existing benchmarks do not explicitly evaluate physical quantity estimation under realistic sensing conditions. In this work, we propose PhysQuantAgent, a framework for real‑world object mass estimation using VLMs, together with VisPhysQuant, a new benchmark dataset for evaluation. VisPhysQuant consists of RGB‑D videos of real objects captured from multiple viewpoints, annotated with precise mass measurements. To improve estimation accuracy, we introduce three visual prompting methods that enhance the input image with object detection, scale estimation, and cross‑sectional image generation to help the model comprehend the size and internal structure of the target object. Experiments show that visual prompting significantly improves mass estimation accuracy on real‑world data, suggesting the efficacy of integrating spatial reasoning with VLM knowledge for physical inference.
Authors:Zongshun Zhang, Yao Liu, Qiao Liu, Xuefeng Peng, Peiyuan Jiang, Jiaye Yang, Daibing Yao, Wei Lin
Abstract:
Video‑based lie detection aims to identify deceptive behaviors from visual cues. Despite recent progress, its core challenge lies in learning sparse yet discriminative representations. Deceptive signals are typically subtle and short‑lived, easily overwhelmed by redundant information, while individual and contextual variations introduce strong identity‑related noise. To address this issue, we propose GenLie, a Global‑Enhanced Lie Detection Network that performs local feature modeling under global supervision. Specifically, sparse and subtle deceptive cues are captured at the local level, while global supervision and optimization ensure robust and discriminative representations by suppressing identity‑related noise. Experiments on three public datasets, covering both high‑ and low‑stakes scenarios, show that GenLie consistently outperforms state‑of‑the‑art methods. Source code is available at https://github.com/AliasDictusZ1/GenLie.
Authors:Abderrahmene Boudiaf, Irfan Hussain, Sajid Javed
Abstract:
The deployment of Multimodal Large Language Models (MLLMs) in agriculture is currently stalled by a critical trade‑off: the existing literature lacks the large‑scale agricultural datasets required for robust model development and evaluation, while current state‑of‑the‑art models lack the verified domain expertise necessary to reason across diverse taxonomies. To address these challenges, we propose the Vision‑to‑Verified‑Knowledge (V2VK) pipeline, a novel generative AI‑driven annotation framework that integrates visual captioning with web‑augmented scientific retrieval to autonomously generate the AgriMM benchmark, effectively eliminating biological hallucinations by grounding training data in verified phytopathological literature. The AgriMM benchmark contains over 3,000 agricultural classes and more than 607k VQAs spanning multiple tasks, including fine‑grained plant species identification, plant disease symptom recognition, crop counting, and ripeness assessment. Leveraging this verifiable data, we present AgriChat, a specialized MLLM that presents broad knowledge across thousands of agricultural classes and provides detailed agricultural assessments with extensive explanations. Extensive evaluation across diverse tasks, datasets, and evaluation conditions reveals both the capabilities and limitations of current agricultural MLLMs, while demonstrating AgriChat's superior performance over other open‑source models, including internal and external benchmarks. The results validate that preserving visual detail combined with web‑verified knowledge constitutes a reliable pathway toward robust and trustworthy agricultural AI. The code and dataset are publicly available at https://github.com/boudiafA/AgriChat .
Authors:Nimrod Shabtay, Moshe Kimhi, Artem Spector, Sivan Haray, Ehud Rivlin, Chaim Baskin, Raja Giryes, Eli Schwartz
Abstract:
Vision‑language models (VLMs) typically process images at a native high‑resolution, forcing a trade‑off between accuracy and computational efficiency: high‑resolution inputs capture fine details but incur significant computational costs, while low‑resolution inputs advocate for efficiency, they potentially miss critical visual information, like small text. We present AwaRes, a spatial‑on‑demand framework that resolves this accuracy‑efficiency trade‑off by operating on a low‑resolution global view and using tool‑calling to retrieve only high‑resolution segments needed for a given query. We construct supervised data automatically: a judge compares low‑ vs.\ high‑resolution answers to label whether cropping is needed, and an oracle grounding model localizes the evidence for the correct answer, which we map to a discrete crop set to form multi‑turn tool‑use trajectories. We train our framework with cold‑start SFT followed by multi‑turn GRPO with a composite reward that combines semantic answer correctness with explicit crop‑cost penalties. Project page: https://nimrodshabtay.github.io/AwaRes
Authors:Xinlong Deng, Yu Xia, Jie Jiang
Abstract:
The Inaugural Music Source Restoration (MSR) Challenge targets the recovery of original, unprocessed stems from fully mixed and mastered music. Unlike conventional music source separation, MSR requires reversing complex production processes such as equalization, compression, reverberation, and other real‑world degradations. To address MSR, we propose a two‑stage system. First, an ensemble of pre‑trained separation models produces preliminary source estimates. Then a set of pre‑trained BSRNN‑based restoration models performs targeted reconstruction to refine these estimates. On the official MSR benchmark, our system surpasses the baselines on all metrics, ranking second among all submissions. The code is available at https://github.com/xinghour/Music‑source‑restoration‑CUPAudioGroup
Authors:Junaid Ahmed Ansari, Ran Ding, Fabio Pizzati, Ivan Laptev
Abstract:
Monocular 3D scene reconstruction has recently seen significant progress. Powered by the modern neural architectures and large‑scale data, recent methods achieve high performance in depth estimation from a single image. Meanwhile, reconstructing and decomposing common scenes into individual 3D objects remains a hard challenge due to the large variety of objects, frequent occlusions and complex object relations. Notably, beyond shape and pose estimation of individual objects, applications in robotics and animation require physically‑plausible scene reconstruction where objects obey physical principles of non‑penetration and realistic contacts. In this work we advance object‑level scene reconstruction along two directions. First, we introduceMessyKitchens, a new dataset with real‑world scenes featuring cluttered environments and providing high‑fidelity object‑level ground truth in terms of 3D object shapes, poses and accurate object contacts. Second, we build on the recent SAM 3D approach for single‑object reconstruction and extend it with Multi‑Object Decoder (MOD) for joint object‑level scene reconstruction. To validate our contributions, we demonstrate MessyKitchens to significantly improve previous datasets in registration accuracy and inter‑object penetration. We also compare our multi‑object reconstruction approach on three datasets and demonstrate consistent and significant improvements of MOD over the state of the art. Our new benchmark, code and pre‑trained models will become publicly available on our project website: https://messykitchens.github.io/.
Authors:Kaixuan Wang, Tianxing Chen, Jiawei Liu, Honghao Su, Shaolong Zhu, Minxuan Wang, Zixuan Li, Yue Chen, Huan-ang Gao, Yusen Qin, Jiawei Wang, Qixuan Zhang, Lan Xu, Jingyi Yu, Yao Mu, Ping Luo
Abstract:
Learning in simulation provides a useful foundation for scaling robotic manipulation capabilities. However, this paradigm often suffers from a lack of data‑generation‑ready digital assets, in both scale and diversity. In this work, we present ManiTwin, an automated and efficient pipeline for generating data‑generation‑ready digital object twins. Our pipeline transforms a single image into simulation‑ready and semantically annotated 3D asset, enabling large‑scale robotic manipulation data generation. Using this pipeline, we construct ManiTwin‑100K, a dataset containing 100K high‑quality annotated 3D assets. Each asset is equipped with physical properties, language descriptions, functional annotations, and verified manipulation proposals. Experiments demonstrate that ManiTwin provides an efficient asset synthesis and annotation workflow, and that ManiTwin‑100K offers high‑quality and diverse assets for manipulation data generation, random scene synthesis, and VQA data generation, establishing a strong foundation for scalable simulation data synthesis and policy learning. Our webpage is available at https://manitwin.github.io/.
Authors:Tianyu Xie, Jinfa Huang, Yuexiao Ma, Rongfang Luo, Yan Yang, Wang Chen, Yuhui Zeng, Ruize Fang, Yixuan Zou, Xiawu Zheng, Jiebo Luo, Rongrong Ji
Abstract:
Omni‑modal large language models (OLMs) redefine human‑machine interaction by natively integrating audio, vision, and text. However, existing OLM benchmarks remain anchored to static, accuracy‑centric tasks, leaving a critical gap in assessing social interactivity, the fundamental capacity to navigate dynamic cues in natural dialogues. To this end, we propose SocialOmni, a comprehensive benchmark that operationalizes the evaluation of this conversational interactivity across three core dimensions: (i) speaker separation and identification (who is speaking), (ii) interruption timing control (when to interject), and (iii) natural interruption generation (how to phrase the interruption). SocialOmni features 2,000 perception samples and a quality‑controlled diagnostic set of 209 interaction‑generation instances with strict temporal and contextual constraints, complemented by controlled audio‑visual inconsistency scenarios to test model robustness. We benchmarked 12 leading OLMs, which uncovers significant variance in their social‑interaction capabilities across models. Furthermore, our analysis reveals a pronounced decoupling between a model's perceptual accuracy and its ability to generate contextually appropriate interruptions, indicating that understanding‑centric metrics alone are insufficient to characterize conversational social competence. More encouragingly, these diagnostics from SocialOmni yield actionable signals for bridging the perception‑interaction divide in future OLMs.
Authors:Karthik Ragunath Ananda Kumar, Subrahmanyam Arunachalam
Abstract:
Automated presentation generation remains a challenging task requiring coherent content creation, visual design, and audience‑aware communication. This work proposes an OpenEnv‑compatible reinforcement learning environment where LLM agents learn to research topics, plan content, and generate professional HTML slide presentations through tool use. We introduce a multi‑component reward system combining structural validation, render quality assessment, LLM‑based aesthetic scoring, content quality metrics, and an inverse specification reward that measures how faithfully generated slides convey their intended purpose. The inverse specification reward, an "inverse task" where an LLM attempts to recover the original specification from generated slides, provides a holistic quality signal. Our approach fine‑tunes Qwen2.5‑Coder‑7B via GRPO, training only 0.5% of parameters on prompts derived from expert demonstrations collected using Claude Opus 4.6. Experiments on 48 diverse business briefs across six models demonstrate that our fine‑tuned 7B model achieves 91.2% of Claude Opus 4.6's quality while improving 33.1% over the base model. The six‑model comparison reveals that instruction adherence and tool‑use compliance, rather than raw parameter count, determine agentic task performance. We contribute SlideRL, an open‑source dataset of 288 multi‑turn rollout trajectories across all six models: https://huggingface.co/datasets/KarthikRagunathAnandaKumar/sliderl‑multi‑turn‑rollouts Code: https://github.com/pushing‑the‑frontier/slide‑forge‑llm
Authors:Kanishka Mitra, Satyam Kumar, Frigyes Samuel Racz, Deland Liu, Ashish D. Deshpande, José del R. Millán
Abstract:
Robot‑assisted therapy can deliver high‑dose, task‑specific training after neurologic injury, but most systems act primarily at the limb level‑engaging the impaired neural circuits only indirectly‑which remains a key barrier to truly contingent, neuroplasticity‑targeted rehabilitation. We address this gap by implementing online, dual‑state motor imagery control of an upper‑limb exoskeleton, enabling goal‑directed reaches to be both initiated and terminated directly from non‑invasive EEG. Eight participants used EEG to initiate assistance and then volitionally halt the robot mid‑trajectory. Across two online sessions, group‑mean hit rates were 61.5% for onset and 64.5% for offset, demonstrating reliable start‑stop command delivery despite instrumental noise and passive arm motion. Methodologically, we reveal a systematic, class‑driven bias induced by common task‑based recentering using an asymmetric margin diagnostic, and we introduce a class‑agnostic fixation‑based recentering method that tracks drift without sampling command classes while preserving class geometry. This substantially improves threshold‑free separability (AUC gains: onset +56%, p = 0.0117; offset +34%, p = 0.0251) and reduces bias within and across days. Together, these results help bridge offline decoding and practical, intention‑driven start‑stop control of a rehabilitation exoskeleton, enabling precisely timed, contingent assistance aligned with neuroplasticity goals while supporting future clinical translation.
Authors:Han Lin, Xichen Pan, Zun Wang, Yue Zhang, Chu Wang, Jaemin Cho, Mohit Bansal
Abstract:
Pixel‑space diffusion has recently re‑emerged as a strong alternative to latent diffusion, enabling high‑quality generation without pretrained autoencoders. However, standard pixel‑space diffusion models receive relatively weak semantic supervision and are not explicitly designed to capture high‑level visual structure. Recent representation‑alignment methods (e.g., REPA) suggest that pretrained visual features can substantially improve diffusion training, and visual co‑denoising has emerged as a promising direction for incorporating such features into the generative process. However, existing co‑denoising approaches often entangle multiple design choices, making it unclear which design choices are truly essential. Therefore, we present V‑Co, a systematic study of visual co‑denoising in a unified JiT‑based framework. This controlled setting allows us to isolate the ingredients that make visual co‑denoising effective. Our study reveals four key ingredients for effective visual co‑denoising. First, preserving feature‑specific computation while enabling flexible cross‑stream interaction motivates a fully dual‑stream architecture. Second, effective classifier‑free guidance (CFG) requires a structurally defined unconditional prediction. Third, stronger semantic supervision is best provided by a perceptual‑drifting hybrid loss. Fourth, stable co‑denoising further requires proper cross‑stream calibration, which we realize through RMS‑based feature rescaling. Together, these findings yield a simple recipe for visual co‑denoising. Experiments on ImageNet‑256 show that, at comparable model sizes, V‑Co outperforms the underlying pixel‑space diffusion baseline and strong prior pixel‑diffusion methods while using fewer training epochs, offering practical guidance for future representation‑aligned generative models.
Authors:Guangzhi Xiong, Sanchit Sinha, Zhenghao He, Aidong Zhang
Abstract:
Vision‑language models (VLMs) have achieved impressive performance across a wide range of multimodal reasoning tasks, but they often struggle to disentangle fine‑grained visual attributes and reason about underlying causal relationships. In‑context learning (ICL) offers a promising avenue for VLMs to adapt to new tasks, but its effectiveness critically depends on the selection of demonstration examples. Existing retrieval‑augmented approaches typically rely on passive similarity‑based retrieval, which tends to select correlated but non‑causal examples, amplifying spurious associations and limiting model robustness. We introduce CIRCLES (Composed Image Retrieval for Causal Learning Example Selection), a novel framework that actively constructs demonstration sets by retrieving counterfactual‑style examples through targeted, attribute‑guided composed image retrieval. By incorporating counterfactual‑style examples, CIRCLES enables VLMs to implicitly reason about the causal relations between attributes and outcomes, moving beyond superficial correlations and fostering more robust and grounded reasoning. Comprehensive experiments on four diverse datasets demonstrate that CIRCLES consistently outperforms existing methods across multiple architectures, especially on small‑scale models, with pronounced gains under information scarcity. Furthermore, CIRCLES retrieves more diverse and causally informative examples, providing qualitative insights into how models leverage in‑context demonstrations for improved reasoning. Our code is available at https://github.com/gzxiong/CIRCLES.
Authors:Tianyuan Yuan, Zibin Dong, Yicheng Liu, Hang Zhao
Abstract:
World Action Models (WAMs) have emerged as a promising alternative to Vision‑Language‑Action (VLA) models for embodied control because they explicitly model how visual observations may evolve under action. Most existing WAMs follow an imagine‑then‑execute paradigm, incurring substantial test‑time latency from iterative video denoising, yet it remains unclear whether explicit future imagination is actually necessary for strong action performance. In this paper, we ask whether WAMs need explicit future imagination at test time, or whether their benefit comes primarily from video modeling during training. We disentangle the role of video modeling during training from explicit future generation during inference by proposing Fast‑WAM, a WAM architecture that retains video co‑training during training but skips future prediction at test time. We further instantiate several Fast‑WAM variants to enable a controlled comparison of these two factors. Across these variants, we find that Fast‑WAM remains competitive with imagine‑then‑execute variants, while removing video co‑training causes a much larger performance drop. Empirically, Fast‑WAM achieves competitive results with state‑of‑the‑art methods both on simulation benchmarks (LIBERO and RoboTwin) and real‑world tasks, without embodied pretraining. It runs in real time with 190ms latency, over 4× faster than existing imagine‑then‑execute WAMs. These results suggest that the main value of video prediction in WAMs may lie in improving world representations during training rather than generating future observations at test time. Project page: https://yuantianyuan01.github.io/FastWAM/
Authors:Xiaojie Gu, Sherry T. Tong, Aosong Feng, Sophia Simeng Han, Jinghui Lu, Yingjian Chen, Yusuke Iwasawa, Yutaka Matsuo, Chanjun Park, Rex Ying, Irene Li
Abstract:
Reasoning‑focused large language models (LLMs) have advanced in many NLP tasks, yet their evaluation remains challenging: final answers alone do not expose the intermediate reasoning steps, making it difficult to determine whether a model truly reasons correctly and where failures occur, while existing multi‑hop QA benchmarks lack step‑level annotations for diagnosing reasoning failures. To address this gap, we propose Omanic, an open‑domain multi‑hop QA resource that provides decomposed sub‑questions and intermediate answers as structural annotations for analyzing reasoning processes. It contains 10,296 machine‑generated training examples (OmanicSynth) and 967 expert‑reviewed human‑annotated evaluation examples (OmanicBench). Systematic evaluations show that state‑of‑the‑art LLMs achieve only 73.11% multiple‑choice accuracy on OmanicBench, confirming its high difficulty. Stepwise analysis reveals that CoT's performance hinges on factual completeness, with its gains diminishing under knowledge gaps and errors amplifying in later hops. Additionally, supervised fine‑tuning on OmanicSynth brings substantial transfer gains (7.41 average points) across six reasoning and math benchmarks, validating the dataset's quality and further supporting the effectiveness of OmanicSynth as supervision for reasoning‑capability transfer. We release the data at https://huggingface.co/datasets/li‑lab/Omanic and the code at https://github.com/XiaojieGu/Omanic.
Authors:Redwan Sony, Anil K Jain, Ross Arun
Abstract:
Multimodal Large Language Models (MLLMs) have recently been proposed as a means to generate natural‑language explanations for face recognition decisions. While such explanations facilitate human interpretability, their reliability on unconstrained face images remains underexplored. In this work, we systematically analyze MLLM‑generated explanations for the unconstrained face verification task on the challenging IJB‑S dataset, with a particular focus on extreme pose variation and surveillance imagery. Our results show that even when MLLMs produce correct verification decisions, the accompanying explanations frequently rely on non‑verifiable or hallucinated facial attributes that are not supported by visual evidence. We further study the effect of incorporating information from traditional face recognition systems, viz., scores and decisions, alongside the input images. Although such information improves categorical verification performance, it does not consistently lead to faithful explanations. To evaluate the explanations beyond decision accuracy, we introduce a likelihood‑ratio‑based framework that measures the evidential strength of textual explanations. Our findings highlight fundamental limitations of current MLLMs for explainable face recognition and underscore the need for a principled evaluation of reliable and trustworthy explanations in biometric applications. Code is available at https://github.com/redwankarimsony/LR‑MLLMFR‑Explainability.
Authors:Fangjing Li, Zhihai Wang, Xinxin Ding, Haiyang Liu, Ronghua Gao, Rong Wang, Yao Zhu, Ming Jin
Abstract:
Mounting posture is an important visual indicator of estrus in dairy cattle. However, achieving reliable mounting pose estimation in real‑world environments remains challenging due to cluttered backgrounds and frequent inter‑animal occlusion. We present FSMC‑Pose, a top‑down framework that integrates a lightweight frequency‑spatial fusion backbone, CattleMountNet, and a multiscale self‑calibration head, SC2Head. Specifically, we design two algorithmic components for CattleMountNet: the Spatial Frequency Enhancement Block (SFEBlock) and the Receptive Aggregation Block (RABlock). SFEBlock separates cattle from cluttered backgrounds, while RABlock captures multiscale contextual information. The Spatial‑Channel Self‑Calibration Head (SC2Head) attends to spatial and channel dependencies and introduces a self‑calibration branch to mitigate structural misalignment under inter‑animal overlap. We construct a mounting dataset, MOUNT‑Cattle, covering 1176 mounting instances, which follows the COCO format and supports drop‑in training across pose estimation models. Using a comprehensive dataset that combines MOUNT‑Cattle with the public NWAFU‑Cattle dataset, FSMC‑Pose achieves higher accuracy than strong baselines, with markedly lower computational and parameter costs, while maintaining real‑time inference on commodity GPUs. Extensive experiments and qualitative analyses show that FSMC‑Pose effectively captures and estimates cattle mounting pose in complex and cluttered environments. Dataset and code are available at https://github.com/elianafang/FSMC‑Pose.
Authors:Yong Zou, Haoran Li, Fanxiao Li, Shenyang Wei, Yunyun Dong, Li Tang, Wei Zhou, Renyang Liu
Abstract:
Recent progress in image generation models (IGMs) enables high‑fidelity content creation but also amplifies risks, including the reproduction of copyrighted content and the generation of offensive content. Image Generation Model Unlearning (IGMU) mitigates these risks by removing harmful concepts without full retraining. Despite growing attention, the robustness under adversarial inputs, particularly image‑side threats in black‑box settings, remains underexplored. To bridge this gap, we present REFORGE, a black‑box red‑teaming framework that evaluates IGMU robustness via adversarial image prompts. REFORGE initializes stroke‑based images and optimizes perturbations with a cross‑attention‑guided masking strategy that allocates noise to concept‑relevant regions, balancing attack efficacy and visual fidelity. Extensive experiments across representative unlearning tasks and defenses demonstrate that REFORGE significantly improves attack success rate while achieving stronger semantic alignment and higher efficiency than involved baselines. These results expose persistent vulnerabilities in current IGMU methods and highlight the need for robustness‑aware unlearning against multi‑modal adversarial attacks. Our code is at: https://github.com/Imfatnoily/REFORGE.
Authors:Zihe Wang, Yihuan Wang, Haiyang Yu. Zhiyong Cui, Xiaojian Liao, Chengcheng Wang, Yonglin Tian, Yongxin Tong
Abstract:
The current expressway operation relies on rule‑based and isolated models, which limits the ability to jointly analyze knowledge across different systems. Meanwhile, Large Language Models (LLMs) are increasingly applied in intelligent transportation, advancing traffic models from algorithmic to cognitive intelligence. However, general LLMs are unable to effectively understand the regulations and causal relationships of events in unconventional scenarios in the expressway field. Therefore, this paper constructs a pre‑trained multimodal large language model (MLLM) for expressways, ExpressMind, which serves as the cognitive core for intelligent expressway operations. This paper constructs the industry's first full‑stack expressway dataset, encompassing traffic knowledge texts, emergency reasoning chains, and annotated video events to overcome data scarcity. This paper proposes a dual‑layer LLM pre‑training paradigm based on self‑supervised training and unsupervised learning. Additionally, this study introduces a Graph‑Augmented RAG framework to dynamically index the expressway knowledge base. To enhance reasoning for expressway incident response strategies, we develop a RL‑aligned Chain‑of‑Thought (RL‑CoT) mechanism that enforces consistency between model reasoning and expert problem‑solving heuristics for incident handling. Finally, ExpressMind integrates a cross‑modal encoder to align the dynamic feature sequences under the visual and textual channels, enabling it to understand traffic scenes in both video and image modalities. Extensive experiments on our newly released multi‑modal expressway benchmark demonstrate that ExpressMind comprehensively outperforms existing baselines in event detection, safety response generation, and complex traffic analysis. The code and data are available at: https://wanderhee.github.io/ExpressMind/.
Authors:Zhengbo Zhang, Jinbo Su, Zhaowen Zhou, Changtao Miao, Yuhan Hong, Qimeng Wu, Yumeng Liu, Feier Wu, Yihe Tian, Yuhao Liang, Zitong Shan, Wanke Xia, Yi-Fan Zhang, Bo Zhang, Zhe Li, Shiming Xiang, Ying Yan
Abstract:
The rapid advancement of Multimodal Large Language Models (MLLMs) has enabled browsing agents to acquire and reason over multimodal information in the real world. But existing benchmarks suffer from two limitations: insufficient evaluation of visual reasoning ability and the neglect of native visual information of web pages in the reasoning chains. To address these challenges, we introduce a new benchmark for visual‑native search, VisBrowse‑Bench. It contains 169 VQA instances covering multiple domains and evaluates the models' visual reasoning capabilities during the search process through multimodal evidence cross‑validation via text‑image retrieval and joint reasoning. These data were constructed by human experts using a multi‑stage pipeline and underwent rigorous manual verification. We additionally propose an agent workflow that can effectively drive the browsing agent to actively collect and reason over visual information during the search process. We comprehensively evaluated both open‑source and closed‑source models in this workflow. Experimental results show that even the best‑performing model, Claude‑4.6‑Opus only achieves an accuracy of 47.6%, while the proprietary Deep Research model, o3‑deep‑research only achieves an accuracy of 41.1%. The code and data can be accessed at: https://github.com/ZhengboZhang/VisBrowse‑Bench
Authors:Hongwei Lin, Xun Huang, Chenglu Wen, Cheng Wang
Abstract:
Robust 3D object detection under adverse weather conditions is crucial for autonomous driving. However, most existing methods simply combine all weather samples for training while overlooking data distribution discrepancies across different weather scenarios, leading to performance conflicts. To address this issue, we introduce AW‑MoE, the framework that innovatively integrates Mixture of Experts (MoE) into weather‑robust multi‑modal 3D object detection approaches. AW‑MoE incorporates Image‑guided Weather‑aware Routing (IWR), which leverages the superior discriminability of image features across weather conditions and their invariance to scene variations for precise weather classification. Based on this accurate classification, IWR selects the top‑K most relevant Weather‑Specific Experts (WSE) that handle data discrepancies, ensuring optimal detection under all weather conditions. Additionally, we propose a Unified Dual‑Modal Augmentation (UDMA) for synchronous LiDAR and 4D Radar dual‑modal data augmentation while preserving the realism of scenes. Extensive experiments on the real‑world dataset demonstrate that AW‑MoE achieves ~ 15% improvement in adverse‑weather performance over state‑of‑the‑art methods, while incurring negligible inference overhead. Moreover, integrating AW‑MoE into established baseline detectors yields performance improvements surpassing current state‑of‑the‑art methods. These results show the effectiveness and strong scalability of our AW‑MoE. We will release the code publicly at https://github.com/windlinsherlock/AW‑MoE.
Authors:Junxin Wang, Dai Guan, Weijie Qiu, Zhihang Li, Yongbo Gai, Zhengyi Yang, Mengyu Zhou, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang
Abstract:
Vision‑language process reward models (VL‑PRMs) are increasingly used to score intermediate reasoning steps and rerank candidates under test‑time scaling. However, they often function as black‑box judges: a low step score may reflect a genuine reasoning mistake or simply the verifier's misperception of the image. This entanglement between perception and reasoning leads to systematic false positives (rewarding hallucinated visual premises) and false negatives (penalizing correct grounded statements), undermining both reranking and error localization. We introduce Explicit Visual Premise Verification (EVPV), a lightweight verification interface that conditions step scoring on the reliability of the visual premises a step depends on. The policy is prompted to produce a step‑wise visual checklist that makes required visual facts explicit, while a constraint extractor independently derives structured visual constraints from the input image. EVPV matches checklist claims against these constraints to compute a scalar visual reliability signal, and calibrates PRM step rewards via reliability gating: rewards for visually dependent steps are attenuated when reliability is low and preserved when reliability is high. This decouples perceptual uncertainty from logical evaluation without per‑step tool calls. Experiments on VisualProcessBench and six multimodal reasoning benchmarks show that EVPV improves step‑level verification and consistently boosts Best‑of‑N reranking accuracy over strong baselines. Furthermore, injecting controlled corruption into the extracted constraints produces monotonic performance degradation, providing causal evidence that the gains arise from constraint fidelity and explicit premise verification rather than incidental prompt effects. Code is available at: https://github.com/Qwen‑Applications/EVPV‑PRM
Authors:Surya Vardhan Yalavarthi
Abstract:
Corrective Retrieval Augmented Generation (CRAG) improves the robustness of RAG systems by evaluating retrieved document quality and triggering corrective actions. However, the original implementation relies on proprietary components including the Google Search API and closed model weights, limiting reproducibility. In this work, we present a fully open‑source reproduction of CRAG, replacing proprietary web search with the Wikipedia API and the original LLaMA‑2 generator with Phi‑3‑mini‑4k‑instruct. We evaluate on PopQA and ARC‑Challenge, demonstrating that our open‑source pipeline achieves comparable performance to the original system. Furthermore, we contribute the first explainability analysis of CRAG's T5‑based retrieval evaluator using SHAP, revealing that the evaluator primarily relies on named entity alignment rather than semantic similarity. Our analysis identifies key failure modes including domain transfer limitations on science questions. All code and results are available at https://github.com/suryayalavarthi/crag‑reproduction.
Authors:Karen Sargsyan
Abstract:
Neural networks systematically fail at compositional generalization ‑‑ producing correct outputs for novel combinations of known parts. We show that this failure is architectural: compositional generalization is equivalent to functoriality of the decoder, and this perspective yields both guarantees and impossibility results. We compile Higher Inductive Type (HIT) specifications into neural architectures via a monoidal functor from the path groupoid of a target space to a category of parametric maps: path constructors become generator networks, composition becomes structural concatenation, and 2‑cells witnessing group relations become learned natural transformations. We prove that decoders assembled by structural concatenation of independently generated segments are strict monoidal functors (compositional by construction), while softmax self‑attention is not functorial for any non‑trivial compositional task. Both results are formalized in Cubical Agda. Experiments on three spaces validate the full hierarchy: on the torus (\mathbbZ^2), functorial decoders outperform non‑functorial ones by 2‑2.7x; on S^1 \vee S^1 (F_2), the type‑A/B gap widens to 5.5‑10x; on the Klein bottle (\mathbbZ \rtimes \mathbbZ), a learned 2‑cell closes a 46% error gap on words exercising the group relation.
Authors:Minbing Chen, Zhu Meng, Fei Su
Abstract:
Vision‑Language Models (VLMs) offer significant potential in computational pathology by enabling interpretable image analysis, automated reporting, and scalable decision support. However, their widespread clinical adoption remains limited due to the absence of reliable, automated evaluation metrics capable of identifying subtle failures such as hallucinations. To address this gap, we propose PathGLS, a novel reference‑free evaluation framework that assesses pathology VLMs across three dimensions: Grounding (fine‑grained visual‑text alignment), Logic (entailment graph consistency using Natural Language Inference), and Stability (output variance under adversarial visual‑semantic perturbations). PathGLS supports both patch‑level and whole‑slide image (WSI)‑level analysis, yielding a comprehensive trust score. Experiments on Quilt‑1M, TCGA, REG2025, PathMMU and TCGA‑Sarcoma datasets demonstrate the superiority of PathGLS. Specifically, on the Quilt‑1M dataset, PathGLS reveals a steep sensitivity drop of 40.2% for hallucinated reports compared to only 2.1% for BERTScore. Moreover, validation against expert‑defined clinical error hierarchies reveals that PathGLS achieves a strong Spearman's rank correlation of ρ=0.71 (p < 0.0001), significantly outperforming Large Language Model (LLM)‑based approaches (Gemini 3.0 Pro: ρ=0.39, p < 0.0001). These results establish PathGLS as a robust reference‑free metric. By directly quantifying hallucination rates and domain shift robustness, it serves as a reliable criterion for benchmarking VLMs on private clinical datasets and informing safe deployment. Code can be found at: https://github.com/My13ad/PathGLS
Authors:Shin'ya Yamaguchi, Daiki Chijiwa, Tamao Sakao, Taku Hasegawa
Abstract:
Large vision‑language models (LVLMs) employ multi‑modal in‑context learning (MM‑ICL) to adapt to new tasks by leveraging demonstration examples. While increasing the number of demonstrations boosts performance, they incur significant inference latency due to the quadratic computational cost of Transformer attention with respect to the context length. To address this trade‑off, we propose Parallel In‑Context Learning (Parallel‑ICL), a plug‑and‑play inference algorithm. Parallel‑ICL partitions the long demonstration context into multiple shorter, manageable chunks. It processes these chunks in parallel and integrates their predictions at the logit level, using a weighted Product‑of‑Experts (PoE) ensemble to approximate the full‑context output. Guided by ensemble learning theory, we introduce principled strategies for Parallel‑ICL: (i) clustering‑based context chunking to maximize inter‑chunk diversity and (ii) similarity‑based context compilation to weight predictions by query relevance. Extensive experiments on VQA, image captioning, and classification benchmarks demonstrate that Parallel‑ICL achieves performance comparable to full‑context MM‑ICL, while significantly improving inference speed. Our work offers an effective solution to the accuracy‑efficiency trade‑off in MM‑ICL, enabling dynamic task adaptation with substantially reduced inference overhead.
Authors:Yu Li, Rui Miao, Zhengling Qi, Tian Lan
Abstract:
The dominant paradigm for improving mathematical reasoning in language models relies on Reinforcement Learning with verifiable rewards. Yet existing methods treat each problem instance in isolation without leveraging the reusable strategies that emerge and accumulate during training. To this end, we introduce ARISE (Agent Reasoning via Intrinsic Skill Evolution), a hierarchical reinforcement learning framework, in which a shared policy operates both to manage skills at high‑level and to generate responses at low‑level (denoted as a Skills Manager and a Worker, respectively). The Manager maintains a tiered skill library through a dedicated skill generation rollout that performs structured summarization of successful solution traces (after execution), while employing a policy‑driven selection mechanism to retrieve relevant skills to condition future rollouts (before execution). A hierarchical reward design guides the co‑evolution of reasoning ability and library quality. Experiments on two base models and seven benchmarks spanning both competition mathematics and Omni‑MATH show that ARISE consistently outperforms GRPO‑family algorithms and memory‑augmented baselines, with particularly notable gains on out‑of‑distribution tasks. Ablation studies confirm that each component contributes to the observed improvements and that library quality and reasoning performance improve in tandem throughout training. Code is available at \hrefhttps://github.com/Skylanding/ARISEhttps://github.com/Skylanding/ARISE.
Authors:Yifan Zhang
Abstract:
Recent work has made clear that the residual pathway is not mere optimization plumbing; it is part of the model's representational machinery. We agree, but argue that the cleanest way to organize this design space is through a two‑axis view of the Transformer. A decoder evolves information along two ordered dimensions: sequence position and layer depth. Self‑attention already provides adaptive mixing along the sequence axis, whereas the residual stream usually performs fixed addition along the depth axis. If we fix a token position and treat layer index as the ordered variable, then a causal depth‑wise residual attention read is exactly the same local operator as causal short sliding‑window attention (ShortSWA), except written over depth rather than over sequence. This is the core residual stream duality behind Transformer^2. This perspective also clarifies the recent literature. ELC‑BERT and DenseFormer already show that learned aggregation over depth can outperform uniform residual accumulation, while Vertical Attention, DeepCrossAttention (DCA), MUDDFormer, and Attention Residuals move further toward explicit attention‑based routing over earlier layers. The key point, however, is that operator‑level duality does not imply systems‑level symmetry. For large‑scale autoregressive models, sequence‑axis ShortSWA is usually the more hardware‑friendly placement because it reuses token‑side sliding‑window kernels, KV‑cache layouts, and chunked execution. If the goal is instead to change the shortcut itself, Deep Delta Learning (DDL) is the cleaner intervention because it modifies the residual operator directly rather than adding a separate cross‑layer retrieval path. Our recommendation is therefore simple: use DDL when the shortcut is the object of interest, and use sequence‑axis ShortSWA when the goal is local adaptive mixing.
Authors:Rushil Thareja, Gautam Gupta, Francesco Pinto, Nils Lukas
Abstract:
Constitutional AI is a method to oversee and control LLMs based on a set of rules written in natural language. These rules are typically written by human experts, but could in principle be learned automatically given sufficient training data for the desired behavior. Existing LLM‑based prompt optimizers attempt this but are ineffective at learning constitutions since (i) they require many labeled examples and (ii) lack structure in the optimized prompts, leading to diminishing improvements as prompt size grows. To address these limitations, we propose Multi‑Agent Constitutional Learning (MAC), which optimizes over structured prompts represented as sets of rules using a network of agents with specialized tasks to accept, edit, or reject rule updates. We also present MAC+, which improves performance by training agents on successful trajectories to reinforce updates leading to higher reward. We evaluate MAC on tagging Personally Identifiable Information (PII), a classification task with limited labels where interpretability is critical, and demonstrate that it generalizes to other agentic tasks such as tool calling. MAC outperforms recent prompt optimization methods by over 50%, produces human‑readable and auditable rule sets, and achieves performance comparable to supervised fine‑tuning and GRPO without requiring parameter updates.
Authors:Max Zimmer, Nico Pelleriti, Christophe Roux, Sebastian Pokutta
Abstract:
AI tools and agents are reshaping how researchers work, from proving theorems to training neural networks. Yet for many, it remains unclear how these tools fit into everyday research practice. This paper is a practical guide to AI‑assisted research in mathematics and machine learning: We discuss how researchers can use modern AI systems productively, where these systems help most, and what kinds of guardrails are needed to use them responsibly. It is organized into three parts: (I) a five‑level taxonomy of AI integration, (II) an open‑source framework that, through a set of methodological rules formulated as agent prompts, turns CLI coding agents (e.g., Claude Code, Codex CLI, OpenCode) into autonomous research assistants, and (III) case studies from deep learning and mathematics. The framework runs inside a sandboxed container, works with any frontier LLM through existing CLI agents, is simple enough to install and use within minutes, and scales from personal‑laptop prototyping to multi‑node, multi‑GPU experimentation across compute clusters. In practice, our longest autonomous session ran for over 20 hours, dispatching independent experiments across multiple nodes without human intervention. We stress that our framework is not intended to replace the researcher in the loop, but to augment them. Our code is publicly available at https://github.com/ZIB‑IOL/The‑Agentic‑Researcher.
Authors:Dibakar Sigdel, Namuna Panday
Abstract:
We present PhasorFlow, an open‑source Python library introducing a computational paradigm operating on the S^1 unit circle. Inputs are encoded as complex phasors z = e^iθ on the N‑Torus (\mathbbT^N). As computation proceeds via unitary wave interference gates, global norm is preserved while individual components drift into \mathbbC^N, allowing algorithms to natively leverage continuous geometric gradients for predictive learning. PhasorFlow provides three core contributions. First, we formalize the Phasor Circuit model (N unit circle threads, M gates) and introduce a 22‑gate library covering Standard Unitary, Non‑Linear, Neuromorphic, and Encoding operations with full matrix algebra simulation. Second, we present the Variational Phasor Circuit (VPC), analogous to Variational Quantum Circuits (VQC), enabling optimization of continuous phase parameters for classical machine learning tasks. Third, we introduce the Phasor Transformer, replacing expensive QK^TV attention with a parameter‑free, DFT‑based token mixing layer inspired by FNet. We validate PhasorFlow on non‑linear spatial classification, time‑series prediction, financial volatility detection, and neuromorphic tasks including neural binding and oscillatory associative memory. Our results establish unit circle computing as a deterministic, lightweight, and mathematically principled alternative to classical neural networks and quantum circuits. It operates on classical hardware while sharing quantum mechanics' unitary foundations. PhasorFlow is available at https://github.com/mindverse‑computing/phasorflow.
Authors:Tomas Ruiz, Zhen Qin, Yifan Zhang, Xuyang Shen, Yiran Zhong, Mengdi Wang
Abstract:
Sampling from a categorical distribution is mathematically simple, but in large‑vocabulary decoding, it often triggers extra memory traffic and extra kernels after the LM head. We present FlashSampling, an exact sampling primitive that fuses sampling into the LM‑head matmul and never materializes the logits tensor in HBM. The method is simple: compute logits tile‑by‑tile on chip, add Gumbel noise, keep only one maximizer per row and per vocabulary tile, and finish with a small reduction over tiles. The fused tiled kernel is exact because \argmax decomposes over a partition; grouped variants for online and tensor‑parallel settings are exact by hierarchical factorization of the categorical distribution. Across H100, H200, B200, and B300 GPUs, FlashSampling speeds up kernel‑level decode workloads, and in end‑to‑end vLLM experiments, it reduces time per output token by up to 19% on the models we test. These results show that exact sampling, with no approximation, can be integrated into the matmul itself, turning a bandwidth‑bound postprocessing step into a lightweight epilogue. Project Page: https://github.com/FlashSampling/FlashSampling.
Authors:Alexandre Lacoste, Nicolas Gontier, Oleh Shliazhko, Aman Jaiswal, Kusha Sareen, Shailesh Nanisetty, Joan Cabezas, Manuel Del Verme, Omar G. Younis, Simone Baratta, Matteo Avalle, Imene Kerboua, Xing Han Lù, Elron Bandel, Michal Shmueli-Scheuer, Asaf Yehudai, Leshem Choshen, Jonathan Lebensold, Sean Hughes, Massimo Caccia, Alexandre Drouin, Siva Reddy, Tao Yu, Yu Su, Graham Neubig, Dawn Song
Abstract:
The proliferation of agent benchmarks has created critical fragmentation that threatens research productivity. Each new benchmark requires substantial custom integration, creating an "integration tax" that limits comprehensive evaluation. We propose CUBE (Common Unified Benchmark Environments), a universal protocol standard built on MCP and Gym that allows benchmarks to be wrapped once and used everywhere. By separating task, benchmark, package, and registry concerns into distinct API layers, CUBE enables any compliant platform to access any compliant benchmark for evaluation, RL training, or data generation without custom integration. We call on the community to contribute to the development of this standard before platform‑specific implementations deepen fragmentation as benchmark production accelerates through 2026.
Authors:Mateusz Dziemian, Maxwell Lin, Xiaohan Fu, Micha Nowak, Nick Winter, Eliot Jones, Andy Zou, Lama Ahmad, Kamalika Chaudhuri, Sahana Chennabasappa, Xander Davies, Lauren Deason, Benjamin L. Edelman, Tanner Emek, Ivan Evtimov, Jim Gust, Maia Hamin, Kat He, Klaudia Krawiecka, Riccardo Patana, Neil Perry, Troy Peterson, Xiangyu Qi, Javier Rando, Zifan Wang, Zihan Wang, Spencer Whitman, Eric Winsor, Arman Zharmagambetov, Matt Fredrikson, Zico Kolter
Abstract:
LLM based agents are increasingly deployed in high stakes settings where they process external data sources such as emails, documents, and code repositories. This creates exposure to indirect prompt injection attacks, where adversarial instructions embedded in external content manipulate agent behavior without user awareness. A critical but underexplored dimension of this threat is concealment: since users tend to observe only an agent's final response, an attack can conceal its existence by presenting no clue of compromise in the final user facing response while successfully executing harmful actions. This leaves users unaware of the manipulation and likely to accept harmful outcomes as legitimate. We present findings from a large scale public red teaming competition evaluating this dual objective across three agent settings: tool calling, coding, and computer use. The competition attracted 464 participants who submitted 272000 attack attempts against 13 frontier models, yielding 8648 successful attacks across 41 scenarios. All models proved vulnerable, with attack success rates ranging from 0.5% (Claude Opus 4.5) to 8.5% (Gemini 2.5 Pro). We identify universal attack strategies that transfer across 21 of 41 behaviors and multiple model families, suggesting fundamental weaknesses in instruction following architectures. Capability and robustness showed weak correlation, with Gemini 2.5 Pro exhibiting both high capability and high vulnerability. To address benchmark saturation and obsoleteness, we will endeavor to deliver quarterly updates through continued red teaming competitions. We open source the competition environment for use in evaluations, along with 95 successful attacks against Qwen that did not transfer to any closed source model. We share model‑specific attack data with respective frontier labs and the full dataset with the UK AISI and US CAISI to support robustness research.
Authors:Ye Wang, Zixuan Wu, Lifeng Shen, Jiang Xie, Xiaoling Wang, Hong Yu, Guoyin Wang
Abstract:
Imbalanced data distribution remains a critical challenge in sequential learning, leading models to easily recognize frequent categories while failing to detect minority classes adequately. The Mixture‑of‑Experts model offers a scalable solution, yet its application is often hindered by parameter inefficiency, poor expert specialization, and difficulty in resolving prediction conflicts. To Master the Minority classes effectively, we propose the Uncertainty‑based Multi‑Expert fusion network (UME) framework. UME is designed with three core innovations: First, we employ Ensemble LoRA for parameter‑efficient modeling, significantly reducing the trainable parameter count. Second, we introduce Sequential Specialization guided by Dempster‑Shafer Theory (DST), which ensures effective specialization on the challenging‑tailed classes. Finally, an Uncertainty‑Guided Fusion mechanism uses DST's certainty measures to dynamically weigh expert opinions, resolving conflicts by prioritizing the most confident expert for reliable final predictions. Extensive experiments across four public hierarchical text classification datasets demonstrate that UME achieves state‑of‑the‑art performance. We achieve a performance gain of up to 17.97% over the best baseline on individual categories, while reducing trainable parameters by up to 10.32%. The findings highlight that uncertainty‑guided expert coordination is a principled strategy for addressing challenging‑tailed sequence learning. Our code is available at https://github.com/CQUPTWZX/Multi‑experts.
Authors:Bingzhou Li, Tao Huang
Abstract:
Omnimodal large language models (OmniLLMs) jointly process audio and visual streams, but the resulting long multimodal token sequences make inference prohibitively expensive. Existing compression methods typically rely on fixed window partitioning and attention‑based pruning, which overlook the piecewise semantic structure of audio‑visual signals and become fragile under aggressive token reduction. We propose Dynamic Audio‑driven Semantic cHunking (DASH), a training‑free framework that aligns token compression with semantic structure. DASH treats audio embeddings as a semantic anchor and detects boundary candidates via cosine‑similarity discontinuities, inducing dynamic, variable‑length segments that approximate the underlying piecewise‑coherent organization of the sequence. These boundaries are projected onto video tokens to establish explicit cross‑modal segmentation. Within each segment, token retention is determined by a tri‑signal importance estimator that fuses structural boundary cues, representational distinctiveness, and attention‑based salience, mitigating the sparsity bias of attention‑only selection. This structure‑aware allocation preserves transition‑critical tokens while reducing redundant regions. Extensive experiments on AVUT, VideoMME, and WorldSense demonstrate that DASH maintains superior accuracy while achieving higher compression ratios compared to prior methods. Code is available at: https://github.com/laychou666/DASH.
Authors:Pearl Mody, Mihir Panchal, Rishit Kar, Kiran Bhowmick, Ruhina Karani
Abstract:
Large language model (LLM) agents are increasingly deployed in long running workflows, where they must preserve user and task state across many turns. Many existing agent memory systems behave like external databases with ad hoc read/write rules, which can yield unstable retention, limited consolidation, and vulnerability to distractor content. We present CraniMem, a neurocognitively motivated, gated and bounded multi‑stage memory design for agentic systems. CraniMem couples goal conditioned gating and utility tagging with a bounded episodic buffer for near term continuity and a structured long‑term knowledge graph for durable semantic recall. A scheduled consolidation loop replays high utility traces into the graph while pruning low utility items, keeping memory growth in check and reducing interference. On long horizon benchmarks evaluated under both clean inputs and injected noise, CraniMem is more robust than a Vanilla RAG and Mem0 baseline and exhibits smaller performance drops under distraction. Our code is available at https://github.com/PearlMody05/Cranimem and the accompanying PyPI package at https://pypi.org/project/cranimem.
Authors:Yibo Yang, Fei Lei, Yixuan Sun, Yantao Zeng, Chengguang Lv, Jiancao Hong, Jiaojiao Tian, Tianyu Qiu, Xin Wang, Yanbing Chen, Yanjie Li, Zheng Pan, Xiaochen Zhou, Guanzhou Chen, Haoran Lv, Yuning Xu, Yue Ou, Haodong Liu, Shiqi He, Anya Jia, Yulei Xin, Huan Wu, Liang Liu, Jiaye Ge, Jianxin Dong, Dahua Lin, Wenxiu Sun
Abstract:
As AI‑driven document understanding and processing tools become increasingly prevalent in real‑world applications, the need for rigorous evaluation standards has grown increasingly urgent. Existing benchmarks and evaluations often focus on isolated capabilities or simplified scenarios, failing to capture the end‑to‑end task effectiveness required in practical settings. To address this gap, we introduce AIDABench, a comprehensive benchmark for evaluating AI systems on complex data analytics tasks in an end‑to‑end manner. AIDABench encompasses 600+ diverse document analysis tasks across three core capability dimensions: question answering, data visualization, and file generation. These tasks are grounded in realistic scenarios involving heterogeneous data types, including spreadsheets, databases, financial reports, and operational records, and reflect analytical demands across diverse industries and job functions. Notably, the tasks in AIDABench are sufficiently challenging that even human experts require 1‑2 hours per question when assisted by AI tools, underscoring the benchmark's difficulty and real‑world complexity. We evaluate 11 state‑of‑the‑art models on AIDABench, spanning both proprietary (e.g., Claude Sonnet 4.5, Gemini 3 Pro Preview) and open‑source (e.g., Qwen3‑Max‑2026‑01‑23‑Thinking) families. Our results reveal that complex, real‑world data analytics tasks remain a significant challenge for current AI systems, with the best‑performing model achieving only 59.43% pass‑at‑1. We provide a detailed analysis of failure modes across each capability dimension and identify key challenges for future research. AIDABench offers a principled reference for enterprise procurement, tool selection, and model optimization, and is publicly available at https://github.com/MichaelYang‑lyx/AIDABench.
Authors:Zeyu Zhang, Rui Li, Xiaoyan Zhao, Yang Zhang, Wenjie Wang, Xu Chen, Tat-Seng Chua
Abstract:
Memory is critical for LLM‑based agents to preserve past observations for future decision‑making, where factual memory serves as its foundational part. However, existing approaches to constructing factual memory face several limitations. Textual methods impose heavy context and indexing burdens, while parametric methods suffer from catastrophic forgetting and high costs. To address these challenges, we introduce NextMem, a latent factual memory framework that utilizes an autoregressive autoencoder to efficiently construct latent memory while ensuring accurate reconstruction. For better optimization, we propose a two‑stage training process, including autoregressive reconstruction alignment and progressive latent substitution. We also incorporate quantization to reduce storage overhead. Extensive experiments demonstrate that NextMem achieves superior performance, and excels in retrieval, robustness, and extensibility properties. We release our code and model checkpoints at https://github.com/nuster1128/NextMem.
Authors:Xianbao Hou, Yonghao He, Zeyd Boukhers, John See, Hu Su, Wei Sui, Cong Yang
Abstract:
Diffusion models have significantly mitigated the impact of annotated data scarcity in remote sensing (RS). Although recent approaches have successfully harnessed these models to enable diverse and controllable Layout‑to‑Image (L2I) synthesis, they still suffer from limited fine‑grained control and fail to strictly adhere to bounding box constraints. To address these limitations, we propose RSGen, a plug‑and‑play framework that leverages diverse edge guidance to enhance layout‑driven RS image generation. Specifically, RSGen employs a progressive enhancement strategy: 1) it first enriches the diversity of edge maps composited from retrieved training instances via Image‑to‑Image generation; and 2) subsequently utilizes these diverse edge maps as conditioning for existing L2I models to enforce pixel‑level control within bounding boxes, ensuring the generated instances strictly adhere to the layout. Extensive experiments across three baseline models demonstrate that RSGen significantly boosts the capabilities of existing L2I models. For instance, with CC‑Diff on the DOTA dataset for oriented object detection, we achieve remarkable gains of +9.8/+12.0 in YOLOScore mAP50/mAP50‑95 and +1.6 in mAP on the downstream detection task. Our code will be publicly available: https://github.com/D‑Robotics‑AI‑Lab/RSGen
Authors:Yitong Zhang, Chengze Li, Ruize Chen, Guowei Yang, Xiaoran Jia, Yijie Ren, Jia Li
Abstract:
Large Language Models (LLMs) have shown strong potential for code generation, yet they remain limited in private‑library‑oriented code generation, where the goal is to generate code using APIs from private libraries. Existing approaches mainly rely on retrieving private‑library API documentation and injecting relevant knowledge into the context at inference time. However, our study shows that this is insufficient: even given accurate required knowledge, LLMs still struggle to invoke private‑library APIs effectively. To address this limitation, we propose PriCoder, an approach that teaches LLMs to invoke private‑library APIs through automatically synthesized data. Specifically, PriCoder models private‑library data synthesis as the construction of a graph, and alternates between two graph operators: (1) Progressive Graph Evolution, which improves data diversity by progressively synthesizing more diverse training samples from basic ones, and (2) Multidimensional Graph Pruning, which improves data quality through a rigorous filtering pipeline. To support rigorous evaluation, we construct two new benchmarks based on recently released libraries that are unfamiliar to the tested models. Experiments on three mainstream LLMs show that PriCoder substantially improves private‑library‑oriented code generation, yielding gains of over 20% in pass@1 in many settings, while causing negligible impact on general code generation capability. Our code and benchmarks are publicly available at https://github.com/eniacode/PriCoder.
Authors:Ziqing Ma, Kai Ying, Xinyue Gu, Tian Zhou, Tianyu Zhu, Haifan Zhang, Peisong Niu, Wang Zheng, Cong Bai, Liang Sun
Abstract:
Accurate day‑ahead solar irradiance forecasting is essential for integrating solar energy into the power grid. However, it remains challenging due to the pronounced diurnal cycle and inherently complex cloud dynamics. Current methods either lack fine‑scale resolution (e.g., numerical weather prediction, weather foundation models) or degrade at longer lead times (e.g., satellite extrapolation). We propose Baguan‑solar, a two‑stage multimodal framework that fuses forecasts from Baguan, a global weather foundation model, with high‑resolution geostationary satellite imagery to produce 24‑ hour irradiance forecasts at kilometer scale. Its decoupled two‑stage design first forecasts day‑night continuous intermediates (e.g., cloud cover) and then infers irradiance, while its modality fusion jointly preserves fine‑scale cloud structures from satellite and large‑scale constraints from Baguan forecasts. Evaluated over East Asia using CLDAS as ground truth, Baguan‑solar outperforms strong baselines (including ECMWF IFS, vanilla Baguan, and SolarSeer), reducing RMSE by 16.08% and better resolving cloud‑induced transients. An operational deployment of Baguan‑solar has supported solar power forecasting in an eastern province in China, since July 2025. Our code is accessible at https://github.com/DAMO‑DI‑ML/Baguansolar. git.
Authors:Xuanfei Ren, Allen Nie, Tengyang Xie, Ching-An Cheng
Abstract:
Optimizing complex systems, ranging from LLM prompts to multi‑turn agents, traditionally requires labor‑intensive manual iteration. We formalize this challenge as a stochastic generative optimization problem where a generative language model acts as the optimizer, guided by numerical rewards and text feedback to discover the best system. We introduce Prioritized Optimization with Local Contextual Aggregation (POLCA), a scalable framework designed to handle stochasticity in optimization ‑‑ such as noisy feedback, sampling minibatches, and stochastic system behaviors ‑‑ while effectively managing the unconstrained expansion of solution space. POLCA maintains a priority queue to manage the exploration‑exploitation tradeoff, systematically tracking candidate solutions and their evaluation histories. To enhance efficiency, we integrate an \varepsilon‑Net mechanism to maintain parameter diversity and an LLM Summarizer to perform meta‑learning across historical trials. We theoretically prove that POLCA converges to near‑optimal candidate solutions under stochasticity. We evaluate our framework on diverse benchmarks, including τ‑bench, HotpotQA (agent optimization), VeriBench (code translation) and KernelBench (CUDA kernel generation). Experimental results demonstrate that POLCA achieves robust, sample and time‑efficient performance, consistently outperforming state‑of‑the‑art algorithms in both deterministic and stochastic problems. The codebase for this work is publicly available at https://github.com/rlx‑lab/POLCA.
Authors:Salim Khazem
Abstract:
Frozen‑backbone transfer with Vision Transformers faces two under‑addressed issues: optimization instability when adapters are naively inserted into a fixed feature extractor, and the absence of principled guidance for setting adapter capacity. We introduce AdapterTune, which augments each transformer block with a residual low‑rank bottleneck whose up‑projection is zero‑initialized, guaranteeing that the adapted network starts exactly at the pretrained function and eliminates early‑epoch representation drift. On the analytical side, we formalize adapter rank as a capacity budget for approximating downstream task shifts in feature space. The resulting excess‑risk decomposition predicts monotonic but diminishing accuracy gains with increasing rank, an ``elbow'' behavior we confirm through controlled sweeps. We evaluate on 9 datasets and 3 backbone scales with multi‑seed reporting throughout. On a core 5 dataset transfer suite, AdapterTune improves top‑1 accuracy over head‑only transfer by +14.9 points on average while training only 0.92 of the parameters required by full fine‑tuning, and outperforms full fine‑tuning on 10 of 15 dataset‑backbone pairs. Across the full benchmark, AdapterTune improves over head‑only transfer on every dataset‑backbone pair tested. Ablations on rank, placement, and initialization isolate each design choice. The code is available at: https://github.com/salimkhazem/adaptertune
Authors:J Rosser
Abstract:
Training data attribution (TDA) methods ask which training documents are responsible for a model behavior. However, models often learn broad concepts shared across many examples. Moreover, existing TDA methods are supervised ‑‑ they require a predefined query behavior, then score every training document against it ‑‑ making them both expensive and unable to surface behaviors the user did not think to ask about. We present Gradient Atoms, an unsupervised method that decomposes per‑document training gradients into sparse components ("atoms") via dictionary learning in a preconditioned eigenspace. Each atom captures a shared update direction induced by a cluster of functionally similar documents, directly recovering the collective structure that per‑document methods do not address. Among 500 discovered atoms, the highest‑coherence ones recover interpretable task‑type behaviors ‑‑ refusal, arithmetic, yes/no classification, trivia QA ‑‑ without any behavioral labels. These atoms double as effective steering vectors: applying them as weight‑space perturbations produces large, controllable shifts in model behavior (e.g., bulleted‑list generation 33% to 94%; systematic refusal 50% to 0%). The method requires no query‑‑document scoring stage, and scales independently of the number of query behaviors of interest. Code is available at https://github.com/jrosseruk/gradient_atoms.
Authors:Mike Amega
Abstract:
We present EARCP (Ensemble Auto‑Régulé par Cohérence et Performance), a novel ensemble architecture that dynamically weights heterogeneous expert models based on both their individual performance and inter‑model coherence. Unlike traditional ensemble methods that rely on static or offline‑learned combinations, EARCP continuously adapts model weights through a principled online learning mechanism that balances exploitation of high‑performing models with exploration guided by consensus signals. The architecture combines theoretical foundations from multiplicative weight update algorithms with a novel coherence‑based regularization term, providing both theoretical guarantees through regret bounds and practical robustness in non‑stationary environments. We formalize the EARCP framework, prove sublinear regret bounds of O(sqrt(T log M)) under standard assumptions, and demonstrate its effectiveness through empirical evaluation on sequential prediction tasks including time series forecasting, activity recognition, and financial prediction. The architecture is designed as a general‑purpose framework applicable to any domain requiring ensemble learning with temporal dependencies. An open‑source implementation is available at https://github.com/Volgat/earcp and via PyPI (pip install earcp).
Authors:Balaji Rao, John Harrison, Soonho Kong, Juneyoung Lee, Carlo Lipizzi
Abstract:
Neurosymbolic approaches leveraging Large Language Models (LLMs) with formal methods have recently achieved strong results on mathematics‑oriented theorem‑proving benchmarks. However, success on competition‑style mathematics does not by itself demonstrate the ability to construct proofs about real‑world implementations. We address this gap with a benchmark derived from an industrial cryptographic library whose assembly routines are already verified in HOL Light. s2n‑bignum is a library used at AWS for providing fast assembly routines for cryptography, and its correctness is established by formal verification. The task of formally verifying this library has been a significant achievement for the Automated Reasoning Group. It involved two tasks: (1) precisely specifying the correct behavior of a program as a mathematical proposition, and (2) proving that the proposition is correct. In the case of s2n‑bignum, both tasks were carried out by human experts. In s2n‑bignum‑bench, we provide the formal specification and ask the LLM to generate a proof script that is accepted by HOL Light within a fixed proof‑check timeout. To our knowledge, s2n‑bignum‑bench is the first public benchmark focused on machine‑checkable proof synthesis for industrial low‑level cryptographic assembly routines in HOL Light. This benchmark provides a challenging and practically relevant testbed for evaluating LLM‑based theorem proving beyond competition mathematics. The code to set up and use the benchmark is available here: \hrefhttps://github.com/kings‑crown/s2n‑bignum‑benchs2n‑bignum‑bench.
Authors:Varun Pratap Bhardwaj
Abstract:
Persistent memory is a central capability for AI agents, yet the mathematical foundations of memory retrieval, lifecycle management, and consistency remain unexplored. Current systems employ cosine similarity for retrieval, heuristic decay for salience, and provide no formal contradiction detection. We establish information‑geometric foundations through three contributions. First, a retrieval metric derived from the Fisher information structure of diagonal Gaussian families, satisfying Riemannian metric axioms, invariant under sufficient statistics, and computable in O(d) time. Second, memory lifecycle formulated as Riemannian Langevin dynamics with proven existence and uniqueness of the stationary distribution via the Fokker‑Planck equation, replacing hand‑tuned decay with principled convergence guarantees. Third, a cellular sheaf model where non‑trivial first cohomology classes correspond precisely to irreconcilable contradictions across memory contexts. On the LoCoMo benchmark, the mathematical layers yield +12.7 percentage points over engineering baselines across six conversations, reaching +19.9 pp on the most challenging dialogues. A four‑channel retrieval architecture achieves 75% accuracy without cloud dependency. Cloud‑augmented results reach 87.7%. A zero‑LLM configuration satisfies EU AI Act data sovereignty requirements by architectural design. To our knowledge, this is the first work establishing information‑geometric, sheaf‑theoretic, and stochastic‑dynamical foundations for AI agent memory systems.
Authors:Shi Qiu, Zeyu Cai, Jiashen Wei, Zeyu Li, Yixuan Yin, Qing-Hong Cao, Chang Liu, Ming-xing Luo, Xing-Bo Yuan, Hua Xing Zhu
Abstract:
We present, to our knowledge, the first language‑driven agent system capable of executing end‑to‑end collider phenomenology tasks, instantiated within a decoupled, domain‑agnostic architecture for autonomous High‑Energy Physics phenomenology. Guided only by natural‑language prompts supplemented with standard physics notation, ColliderAgent carries out workflows from a theoretical Lagrangian to final phenomenological outputs without relying on package‑specific code. In this framework, a hierarchical multi‑agent reasoning layer is coupled to Magnus, a unified execution backend for phenomenological calculations and simulation toolchains. We validate the system on representative literature reproductions spanning leptoquark and axion‑like‑particle scenarios, higher‑dimensional effective operators, parton‑level and detector‑level analyses, and large‑scale parameter scans leading to exclusion limits. These results point to a route toward more automated, scalable, and reproducible research in collider physics, cosmology, and physics more broadly.
Authors:Chaoyang Wang, Wenrui Bao, Sicheng Gao, Bingxin Xu, Yu Tian, Yogesh S. Rawat, Yunhao Ge, Yuzhang Shang
Abstract:
Vision‑Language‑Action (VLA) models have shown promising capabilities for embodied intelligence, but most existing approaches rely on text‑based chain‑of‑thought reasoning where visual inputs are treated as static context. This limits the ability of the model to actively revisit the environment and resolve ambiguities during long‑horizon tasks. We propose VLA‑Thinker, a thinking‑with‑image reasoning framework that models perception as a dynamically invocable reasoning action. To train such a system, we introduce a two‑stage training pipeline consisting of (1) an SFT cold‑start phase with curated visual Chain‑of‑Thought data to activate structured reasoning and tool‑use behaviors, and (2) GRPO‑based reinforcement learning to align complete reasoning‑action trajectories with task‑level success. Extensive experiments on LIBERO and RoboTwin 2.0 benchmarks demonstrate that VLA‑Thinker significantly improves manipulation performance, achieving 97.5% success rate on LIBERO and strong gains across long‑horizon robotic tasks. Project and Codes: https://cywang735.github.io/VLA‑Thinker/ .
Authors:Junhang Cheng, Fang Liu, Jia Li, Chengru Wu, Nanxiang Jiang, Li Zhang
Abstract:
Large Language Models excel in high‑resource programming languages but struggle with low‑resource ones. Existing research related to low‑resource programming languages primarily focuses on Domain‑Specific Languages (DSLs), leaving general‑purpose languages that suffer from data scarcity underexplored. To address this gap, we introduce CangjieBench, a contamination‑free benchmark for Cangjie, a representative low‑resource general‑purpose language. The benchmark comprises 248 high‑quality samples manually translated from HumanEval and ClassEval, covering both Text‑to‑Code and Code‑to‑Code tasks. We conduct a systematic evaluation of diverse LLMs under four settings: Direct Generation, Syntax‑Constrained Generation, Retrieval‑Augmented Generation (RAG), and Agent. Experiments reveal that Direct Generation performs poorly, whereas Syntax‑Constrained Generation offers the best trade‑off between accuracy and computational cost. Agent achieve state‑of‑the‑art accuracy but incur high token consumption. Furthermore, we observe that Code‑to‑Code translation often underperforms Text‑to‑Code generation, suggesting a negative transfer phenomenon where models overfit to the source language patterns. We hope that our work will offer valuable insights into LLM generalization to unseen and low‑resource programming languages. Our code and data are available at https://github.com/cjhCoder7/CangjieBench.
Authors:Bálint Gyevnár, Atoosa Kasirzadeh
Abstract:
Tensions between AI Safety (AIS) and AI Ethics (AIE) have increasingly surfaced in AI governance and public debates about AI, leading to what we term the "responsible AI divides". We introduce a model that categorizes four modes of engagement with the tensions: radical confrontation, disengagement, compartmentalized coexistence, and critical bridging. We then investigate how critical bridging, with a particular focus on bridging problems, offers one of the most viable constructive paths for advancing responsible AI. Using computational tools to analyze a curated dataset of 3,550 papers, we map the research landscapes of AIE and AIS to identify both distinct and overlapping problems. Our findings point to both thematic divides and overlaps. For example, we find that AIE has long grappled with overcoming injustice and tangible AI harms, whereas AIS has primarily embodied an anticipatory approach focused on the mitigation of risks from AI capabilities. At the same time, we find significant overlap in core research concerns across both AIE and AIS around transparency, reproducibility, and inadequate governance mechanisms. As AIE and AIS continue to evolve, we recommend focusing on bridging problems as a constructive path forward for enhancing collaborative AI governance. We offer a series of recommendations to integrate shared considerations into a collaborative approach to responsible AI. Alongside our proposal, we highlight its limitations and explore open problems for future research. All data including the fully annotated dataset of papers with code to reproduce our figures can be found at: https://github.com/gyevnarb/ai‑safety‑ethics.
Authors:Shengda Fan, Xuyan Ye, Yupeng Huo, Zhi-Yuan Chen, Yiju Guo, Shenzhi Yang, Wenkai Yang, Shuqi Ye, Jingwen Chen, Haotian Chen, Xin Cong, Yankai Lin
Abstract:
While Large Language Models (LLMs) have evolved into tool‑using agents, they remain brittle in long‑horizon interactions. Unlike mathematical reasoning where errors are often rectifiable via backtracking, tool‑use failures frequently induce irreversible side effects, making accurate step‑level verification critical. However, existing process‑level benchmarks are predominantly confined to closed‑world mathematical domains, failing to capture the dynamic and open‑ended nature of tool execution. To bridge this gap, we introduce AgentProcessBench, the first benchmark dedicated to evaluating step‑level effectiveness in realistic, tool‑augmented trajectories. The benchmark comprises 1,000 diverse trajectories and 8,509 human‑labeled step annotations with 89.1% inter‑annotator agreement. It features a ternary labeling scheme to capture exploration and an error propagation rule to reduce labeling ambiguity. Extensive experiments reveal key insights: (1) weaker policy models exhibit inflated ratios of correct steps due to early termination; (2) distinguishing neutral and erroneous actions remains a significant challenge for current models; and (3) process‑derived signals provide complementary value to outcome supervision, significantly enhancing test‑time scaling. We hope AgentProcessBench can foster future research in reward models and pave the way toward general agents. The code and data are available at https://github.com/RUCBM/AgentProcessBench.
Authors:Xiaoliang Fu, Jiaye Lin, Yangyi Fang, Chaowen Hu, Cong Qin, Zekai Shao, Binbin Zheng, Lu Pan, Ke Zeng
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed a leap in Large Language Model (LLM) reasoning, yet its optimization dynamics remain fragile. Standard algorithms like GRPO enforce stability via ``hard clipping'', which inadvertently stifles exploration by discarding gradients of tokens outside the trust region. While recent ``soft clipping'' methods attempt to recover these gradients, they suffer from a critical challenge: relying on log‑probability gradient (\nabla_θ\log π_θ) yields divergent weights as probabilities vanish, destabilizing LLM training. We rethink this convention by establishing probability gradient (\nabla_θπ_θ) as the superior optimization primitive. Accordingly, we propose Decoupled Gradient Policy Optimization (DGPO), which employs a decoupled decay mechanism based on importance sampling ratios. By applying asymmetric, continuous decay to boundary tokens, DGPO resolves the conflict between stability and sustained exploration. Extensive experiments across DeepSeek‑R1‑Distill‑Qwen series models (1.5B/7B/14B) demonstrate that DGPO consistently outperforms strong baselines on various mathematical benchmarks, offering a robust and scalable solution for RLVR. Our code and implementation are available at: https://github.com/VenomRose‑Juri/DGPO‑RL.
Authors:Xiangbo Gao, Mingyang Wu, Siyuan Yang, Jiongze Yu, Pardis Taghavi, Fangzhou Lin, Zhengzhong Tu
Abstract:
While recent generative video models have achieved remarkable visual realism and are being explored as world models, true physical simulation requires mastering both space and time. Current models can produce visually smooth kinematics, yet they lack a reliable internal motion pulse to ground these motions in a consistent, real‑world time scale. This temporal ambiguity stems from the common practice of indiscriminately training on videos with vastly different real‑world speeds, forcing them into standardized frame rates. This leads to what we term chronometric hallucination: generated sequences exhibit ambiguous, unstable, and uncontrollable physical motion speeds. To address this, we propose Visual Chronometer, a predictor that recovers the Physical Frames Per Second (PhyFPS) directly from the visual dynamics of an input video. Trained via controlled temporal resampling, our method estimates the true temporal scale implied by the motion itself, bypassing unreliable metadata. To systematically quantify this issue, we establish two benchmarks, PhyFPS‑Bench‑Real and PhyFPS‑Bench‑Gen. Our evaluations reveal a harsh reality: state‑of‑the‑art video generators suffer from severe PhyFPS misalignment and temporal instability. Finally, we demonstrate that applying PhyFPS corrections significantly improves the human‑perceived naturalness of AI‑generated videos. Our project page is https://xiangbogaobarry.github.io/Visual_Chronometer/.
Authors:Peng Xu, Zhengnan Deng, Jiayan Deng, Zonghua Gu, Shaohua Wan
Abstract:
Vision‑Language Navigation (VLN) for Unmanned Aerial Vehicles (UAVs) demands complex visual interpretation and continuous control in dynamic 3D environments. Existing hierarchical approaches rely on dense oracle guidance or auxiliary object detectors, creating semantic gaps and limiting genuine autonomy. We propose AerialVLA, a minimalist end‑to‑end Vision‑Language‑Action framework mapping raw visual observations and fuzzy linguistic instructions directly to continuous physical control signals. First, we introduce a streamlined dual‑view perception strategy that reduces visual redundancy while preserving essential cues for forward navigation and precise grounding, which additionally facilitates future simulation‑to‑reality transfer. To reclaim genuine autonomy, we deploy a fuzzy directional prompting mechanism derived solely from onboard sensors, completely eliminating the dependency on dense oracle guidance. Ultimately, we formulate a unified control space that integrates continuous 3‑Degree‑of‑Freedom (3‑DoF) kinematic commands with an intrinsic landing signal, freeing the agent from external object detectors for precision landing. Extensive experiments on the TravelUAV benchmark demonstrate that AerialVLA achieves state‑of‑the‑art performance in seen environments. Furthermore, it exhibits superior generalization in unseen scenarios by achieving nearly three times the success rate of leading baselines, validating that a minimalist, autonomy‑centric paradigm captures more robust visual‑motor representations than complex modular systems.
Authors:Jungwoo Oh, Hyunseung Chung, Junhee Lee, Min-Gyu Kim, Hangyul Yoon, Ki Seong Lee, Youngchae Lee, Muhan Yeo, Edward Choi
Abstract:
While Multimodal Large Language Models (MLLMs) show promising performance in automated electrocardiogram interpretation, it remains unclear whether they genuinely perform actual step‑by‑step reasoning or just rely on superficial visual cues. To investigate this, we introduce ECG‑Reasoning‑Benchmark, a novel multi‑turn evaluation framework comprising over 6,400 samples to systematically assess step‑by‑step reasoning across 17 core ECG diagnoses. Our comprehensive evaluation of state‑of‑the‑art models reveals a critical failure in executing multi‑step logical deduction. Although models possess the medical knowledge to retrieve clinical criteria for a diagnosis, they exhibit near‑zero success rates (6% Completion) in maintaining a complete reasoning chain, primarily failing to ground the corresponding ECG findings to the actual visual evidence in the ECG signal. These results demonstrate that current MLLMs bypass actual visual interpretation, exposing a critical flaw in existing training paradigms and underscoring the necessity for robust, reasoning‑centric medical AI. The code and data are available at https://github.com/Jwoo5/ecg‑reasoning‑benchmark.
Authors:He Zhang, Ying Sun, Hui Xiong
Abstract:
Flow‑matching policies hold great promise for reinforcement learning (RL) by capturing complex, multi‑modal action distributions. However, their practical application is often hindered by prohibitive inference latency and ineffective online exploration. Although recent works have employed one‑step distillation for fast inference, the structure of the initial noise distribution remains an overlooked factor that presents significant untapped potential. This overlooked factor, along with the challenge of controlling policy stochasticity, constitutes two critical areas for advancing distilled flow‑matching policies. To overcome these limitations, we propose GoldenStart (GSFlow), a policy distillation method with Q‑guided priors and explicit entropy control. Instead of initializing generation from uninformed noise, we introduce a Q‑guided prior modeled by a conditional VAE. This state‑conditioned prior repositions the starting points of the one‑step generation process into high‑Q regions, effectively providing a "golden start" that shortcuts the policy to promising actions. Furthermore, for effective online exploration, we enable our distilled actor to output a stochastic distribution instead of a deterministic point. This is governed by entropy regularization, allowing the policy to shift from pure exploitation to principled exploration. Our integrated framework demonstrates that by designing the generative startpoint and explicitly controlling policy entropy, it is possible to achieve efficient and exploratory policies, bridging the generative models and the practical actor‑critic methods. We conduct extensive experiments on offline and online continuous control benchmarks, where our method significantly outperforms prior state‑of‑the‑art approaches. Code will be available at https://github.com/ZhHe11/GSFlow‑RL.
Authors:Yutong Wu, Chenrui Cao, Pengwei Jin, Di Huang, Rui Zhang, Xishan Zhang, Zidong Du, Qi Guo, Xing Hu
Abstract:
SystemVerilog Assertions (SVAs) are crucial for hardware verification. Recent studies leverage general‑purpose LLMs to translate natural language properties to SVAs (NL2SVA), but they perform poorly due to limited data. We propose a data synthesis framework to tackle two challenges: the scarcity of high‑quality real‑world SVA corpora and the lack of reliable methods to determine NL‑SVA semantic equivalence. For the former, large‑scale open‑source RTLs are used to guide LLMs to generate real‑world SVAs; for the latter, bidirectional translation serves as a data selection method. With the synthesized data, we train CodeV‑SVA, a series of SVA generation models. Notably, CodeV‑SVA‑14B achieves 75.8% on NL2SVA‑Human and 84.0% on NL2SVA‑Machine in Func.@1, matching or exceeding advanced LLMs like GPT‑5 and DeepSeek‑R1.
Authors:Xingyuan Li, Songcheng Du, Yang Zou, HaoYuan Xu, Zhiying Jiang, Jinyuan Liu
Abstract:
Image fusion aims to integrate complementary information from multiple source images to produce a more informative and visually consistent representation, benefiting both human perception and downstream vision tasks. Despite recent progress, most existing fusion methods are designed for specific tasks (i.e., multi‑modal, multi‑exposure, or multi‑focus fusion) and struggle to effectively preserve source information during the fusion process. This limitation primarily arises from task‑specific architectures and the degradation of source information caused by deep‑layer propagation. To overcome these issues, we propose UniFusion, a unified image fusion framework designed to achieve cross‑task generalization. First, leveraging DINOv3 for modality‑consistent feature extraction, UniFusion establishes a shared semantic space for diverse inputs. Second, to preserve the understanding of each source image, we introduce a reconstruction‑alignment loss to maintain consistency between fused outputs and inputs. Finally, we employ a bilevel optimization strategy to decouple and jointly optimize reconstruction and fusion objectives, effectively balancing their coupling relationship and ensuring smooth convergence. Extensive experiments across multiple fusion tasks demonstrate UniFusion's superior visual quality, generalization ability, and adaptability to real‑world scenarios. Code is available at https://github.com/dusongcheng/UniFusion.
Authors:Shishi Xiao, Tongyu Zhou, David Laidlaw, Gromit Yeuk-Yin Chan
Abstract:
A pictorial chart is an effective medium for visual storytelling, seamlessly integrating visual elements with data charts. However, creating such images is challenging because the flexibility of visual elements often conflicts with the rigidity of chart structures. This process thus requires a creative deformation that maintains both data faithfulness and visual aesthetics. Current methods that extract dense structural cues from natural images (e.g., edge or depth maps) are ill‑suited as conditioning signals for pictorial chart generation. We present ChArtist, a domain‑specific diffusion model for generating pictorial charts automatically, offering two distinct types of control: 1) spatial control that aligns well with the chart structure, and 2) subject‑driven control that respects the visual characteristics of a reference image. To achieve this, we introduce a skeleton‑based spatial control representation. This representation encodes only the data‑encoding information of the chart, allowing for the easy incorporation of reference visuals without a rigid outline constraint. We implement our method based on the Diffusion Transformer (DiT) and leverage an adaptive position encoding mechanism to manage these two controls. We further introduce Spatially Gated Attention to modulate the interaction between spatial control and subject control. To support the fine‑tuning of pre‑trained models for this task, we created a large‑scale dataset of 30,000 triplets (skeleton, reference image, pictorial chart). We also propose a unified data accuracy metric to evaluate the data faithfulness of the generated charts. We believe this work demonstrates that current generative models can achieve data‑driven visual storytelling by moving beyond general‑purpose conditions to task‑specific representations. Project page: https://chartist‑ai.github.io/.
Authors:Wanhu Sun, Zhongjin Luo, Heliang Zheng, Jiahao Chang, Chongjie Ye, Huiang He, Shengchu Zhao, Rongfei Jia, Xiaoguang Han
Abstract:
Part‑level 3D generation is crucial for various downstream applications, including gaming, film production, and industrial design. However, decomposing a 3D shape into geometrically plausible and meaningful components remains a significant challenge. Previous part‑based generation methods often struggle to produce well‑constructed parts, exhibiting poor structural coherence, geometric implausibility, inaccuracy, or inefficiency. To address these challenges, we introduce EI‑Part, a novel framework specifically designed to generate high‑quality 3D shapes with components, characterized by strong structural coherence, geometric plausibility, geometric fidelity, and generation efficiency. We propose utilizing distinct representations at different stages: an Explode state for part completion and an Implode state for geometry refinement. This strategy fully leverages spatial resolution, enabling flexible part completion and fine geometric detail generation. To maintain structural coherence between parts, a self‑attention mechanism is incorporated in both exploded and imploded states, facilitating effective information perception and feature fusion among components during generation. Extensive experiments on multiple benchmarks demonstrate that EI‑Part efficiently produces semantically meaningful and structurally coherent parts with fine‑grained geometric details, achieving state‑of‑the‑art performance in part‑level 3D generation. Project page: https://cvhadessun.github.io/EI‑Part/
Authors:Gwanwoo Song, Kwanyoung Park, Youngwoon Lee
Abstract:
In offline reinforcement learning (RL), single‑step temporal‑difference (TD) learning can suffer from bootstrapping error accumulation over long horizons. Action‑chunked TD methods mitigate this by backing up over multiple steps, but can introduce suboptimality by restricting the policy class to open‑loop action sequences. To resolve this trade‑off, we present Chunk‑Guided Q‑Learning (CGQ), a single‑step TD algorithm that guides a fine‑grained single‑step critic by regularizing it toward a chunk‑based critic trained using temporally extended backups. This reduces compounding error while preserving fine‑grained value propagation. We theoretically show that CGQ attains tighter critic optimality bounds than either single‑step or action‑chunked TD learning alone. Empirically, CGQ achieves strong performance on challenging long‑horizon OGBench tasks, often outperforming both single‑step and action‑chunked methods.
Authors:Emmanuel Oladokun, Sarina Thomas, Jurica Šprem, Vicente Grau
Abstract:
Echocardiography is widely used for assessing cardiac function, where clinically meaningful parameters such as left‑ventricular ejection fraction (EF) play a central role in diagnosis and management. Generative models capable of synthesising realistic echocardiogram videos with explicit control over such parameters are valuable for data augmentation, counterfactual analysis, and specialist training. However, existing approaches typically rely on computationally expensive multi‑step sampling and aggressive temporal normalisation, limiting efficiency and applicability to heterogeneous real‑world data. We introduce EchoLVFM, a one‑step latent video flow‑matching framework for controllable echocardiogram generation. Operating in the latent space, EchoLVFM synthesises temporally coherent videos in a single inference step, achieving a \mathbf~ 50× improvement in sampling efficiency compared to multi‑step flow baselines while maintaining visual fidelity. The model supports global conditioning on clinical variables, demonstrated through precise control of EF, and enables reconstruction and counterfactual generation from partially observed sequences. A masked conditioning strategy further removes fixed‑length constraints, allowing shorter sequences to be retained rather than discarded. We evaluate EchoLVFM on the CAMUS dataset under challenging single‑frame conditioning. Quantitative and qualitative results demonstrate competitive video quality, strong EF adherence, and 57.9% discrimination accuracy by expert clinicians which is close to chance. These findings indicate that efficient, one‑step flow matching can enable practical, controllable echocardiogram video synthesis without sacrificing fidelity. Code available at: https://github.com/EngEmmanuel/EchoLVFM
Authors:Xiaofei Zhu, Jinfei Chen, Feiyang Yuan, Zhou Yang
Abstract:
Recommendation systems aim to learn user interests from historical behaviors and deliver relevant items. Recent methods leverage large language models (LLMs) to construct and integrate semantic representations of users and items for capturing user interests. However, user behavior theories suggest that truly understanding user interests requires not only semantic integration but also semantic reasoning from explicit individual interests to implicit group interests. To this end, we propose an Iterative Semantic Reasoning Framework (ISRF) for generative recommendation. ISRF leverages LLMs to bridge explicit individual interests and implicit group interests in three steps. First, we perform multi‑step bidirectional reasoning over item attributes to infer semantic item features and build a semantic interaction graph capturing users' explicit interests. Second, we generate semantic user features based on the semantic item features and construct a similarity‑based user graph to infer the implicit interests of similar user groups. Third, we adopt an iterative batch optimization strategy, where individual explicit interests directly guide the refinement of group implicit interests, while group implicit interests indirectly enhance individual modeling. This iterative process ensures consistent and progressive interest reasoning, enabling more accurate and comprehensive user interest learning. Extensive experiments on the Sports, Beauty, and Toys datasets demonstrate that ISRF outperforms state‑of‑the‑art baselines. The code is available at https://github.com/htired/ISRF.
Authors:Qilong Li, Chongsheng Zhang
Abstract:
Scene text recognition (STR) methods have demonstrated their excellent capability in English text images. However, due to the complex inner structures of Chinese and the extensive character categories, it poses challenges for recognizing Chinese text in images. Recently, studies have shown that the methods designed for English text recognition encounter an accuracy bottleneck when recognizing Chinese text images. This raises the question: Is it appropriate to apply the model developed for English to the Chinese STR task? To explore this issue, we propose a novel method named LER, which explicitly decouples each character and independently recognizes characters while taking into account the complex inner structures of Chinese. LER consists of three modules: Localization, Extraction, and Recognition. Firstly, the localization module utilizes multimodal information to determine the character's position precisely. Then, the extraction module dissociates all characters in parallel. Finally, the recognition module considers the unique inner structures of Chinese to provide the text prediction results. Extensive experiments conducted on large‑scale Chinese benchmarks indicate that our method significantly outperforms existing methods. Furthermore, extensive experiments conducted on six English benchmarks and the Union14M benchmark show impressive results in English text recognition by LER. Code is available at https://github.com/Pandarenlql/LER.
Authors:Jiahao Qin
Abstract:
Spike sparsity is widely believed to enable efficient spiking neural network (SNN) inference on GPU hardware. We demonstrate this is an illusion: five distinct sparse computation strategies on Apple M3 Max all fail to outperform dense convolution, because SIMD architectures cannot exploit the fine‑grained, unstructured sparsity of i.i.d. binary spikes. Instead, we propose Temporal Aggregated Convolution (TAC), which exploits convolution linearity to pre‑aggregate K spike frames before a single convolution call, reducing T calls to T/K. On rate‑coded data, TAC achieves 13.8times speedup with +1.6% accuracy on MNIST and +5.4% on Fashion‑MNIST ‑‑ a simultaneous improvement in both speed and accuracy. However, on event‑based data where the temporal dimension carries genuine motion information, TAC's temporal collapse is harmful. We therefore introduce TAC‑TP (Temporal Preservation), which shares each group's convolution output across K independent LIF steps, preserving full temporal resolution for downstream layers. On DVS128‑Gesture, TAC‑TP achieves 95.1% accuracy (vs. 96.3% baseline) with 50% fewer convolution calls, while standard TAC drops to 91.3%. Our key finding is that the optimal temporal aggregation strategy is data‑dependent: collapse the temporal dimension for rate‑coded data (noise reduction) but preserve it for event data (information retention). Speedup is hardware‑agnostic: TAC achieves 11.0times on NVIDIA V100, confirming the mechanism transfers across GPU architectures. All operators in the mlx‑snn library are open source.
Authors:Xuan Cui, Huiyue Li, Run Zeng, Yunfei Zhao, Jinrui Qian, Wei Duan, Bo Liu, Zhanpeng Zhou
Abstract:
As large language models (LLMs) scale to billions of parameters, full‑parameter fine‑tuning becomes compute‑ and memory‑prohibitive. Parameter‑efficient fine‑tuning (PEFT) mitigates this issue by updating only a small set of task‑specific parameters while keeping the base model frozen. Among PEFT approaches, low‑rank adaptation (LoRA) is widely adopted; however, it enforces a uniform rank across layers despite substantial variation in layer importance, motivating layerwise rank allocation. Recent adaptive‑rank variants (e.g., AdaLoRA) allocate ranks based on importance scores, yet typically rely on instantaneous gradients that capture only local sensitivity, overlooking non‑local, pathwise effects within the same layer, which yields unstable and biased scores. To address this limitation, we introduce IGU‑LoRA, an adaptive‑rank LoRA that (i) computes within‑layer Integrated Gradients (IG) sensitivities and aggregates them into a layer‑level score for rank allocation, and (ii) applies an uncertainty‑aware scheme using exponential moving averages with deviation tracking to suppress noisy updates and calibrate rank selection. Theoretically, we prove an upper bound on the composite trapezoidal rule approximation error for parameter‑space IG under a pathwise Hessian‑Lipschitz condition, which informs the quadrature budget. Across diverse tasks and architectures, IGU‑LoRA consistently outperforms strong PEFT baselines at matched parameter budgets, improving downstream accuracy and robustness. Ablations confirm the contributions of pathwise within‑layer sensitivity estimates and uncertainty‑aware selection to effective rank allocation. Our code is publicly available at https://github.com/withyou12/igulora.git
Authors:Alejandro Paredes La Torre, Barbara Flores, Diego Rodriguez
Abstract:
We propose a resource‑efficient framework for compressing large language models through knowledge distillation, combined with guided chain‑of‑thought reinforcement learning. Using Qwen 3B as the teacher and Qwen 0.5B as the student, we apply knowledge distillation across English Dolly‑15k, Spanish Dolly‑15k, and code BugNet and PyTorrent datasets, with hyperparameters tuned in the English setting to optimize student performance. Across tasks, the distilled student retains a substantial portion of the teacher's capability while remaining significantly smaller: 70% to 91% in English, up to 95% in Spanish, and up to 93.5% Rouge‑L in code. For coding tasks, integrating chain‑of‑thought prompting with Group Relative Policy Optimization using CoT‑annotated Codeforces data improves reasoning coherence and solution correctness compared to knowledge distillation alone. Post‑training 4‑bit weight quantization further reduces memory footprint and inference latency. These results show that knowledge distillation combined with chain‑of‑thought guided reinforcement learning can produce compact, efficient models suitable for deployment in resource‑constrained settings.
Authors:Grayson Lee, Minh Bui, Shuzi Zhou, Yankai Li, Mo Chen, Ke Li
Abstract:
Diffusion‑based models have recently shown strong performance in trajectory planning, as they are capable of capturing diverse, multimodal distributions of complex behaviors. A key limitation of these models is their slow inference speed, which results from the iterative denoising process. This makes them less suitable for real‑time applications such as closed‑loop model predictive control (MPC), where plans must be generated quickly and adapted continuously to a changing environment. In this paper, we investigate Implicit Maximum Likelihood Estimation (IMLE) as an alternative generative modeling approach for planning. IMLE offers strong mode coverage while enabling inference that is two orders of magnitude faster, making it particularly well suited for real‑time MPC tasks. Our results demonstrate that IMLE achieves competitive performance on standard offline reinforcement learning benchmarks compared to the standard diffusion‑based planner, while substantially improving planning speed in both open‑loop and closed‑loop settings. We further validate IMLE in a closed‑loop human navigation scenario, operating in real‑time, demonstrating how it enables rapid and adaptive plan generation in dynamic environments.
Authors:Zhaoyuan Gu, Yipu Chen, Zimeng Chai, Alfred Cueva, Thong Nguyen, Yifan Wu, Huishu Xue, Minji Kim, Isaac Legene, Fukang Liu, Matthew Kim, Ayan Barula, Yongxin Chen, Ye Zhao
Abstract:
Humanoid loco‑manipulation requires coordinated high‑level motion plans with stable, low‑level whole‑body execution under complex robot‑environment dynamics and long‑horizon tasks. While diffusion policies (DPs) show promise for learning from demonstrations, deploying them on humanoids poses critical challenges: the motion planner trained offline is decoupled from the low‑level controller, leading to poor command tracking, compounding distribution shift, and task failures. The common approach of scaling demonstration data is prohibitively expensive for high‑dimensional humanoid systems. To address this challenge, we present REFINE‑DP (REinforcement learning FINE‑tuning of Diffusion Policy), a hierarchical framework that jointly optimizes a DP high‑level planner and an RL‑based low‑level loco‑manipulation controller. The DP is fine‑tuned via a PPO‑based diffusion policy gradient to improve task success rate, while the controller is simultaneously updated to accurately track the planner's evolving command distribution, reducing the distributional mismatch that degrades motion quality. We validate REFINE‑DP on a humanoid robot performing loco‑manipulation tasks, including door traversal and long‑horizon object transport. REFINE‑DP achieves an over 90% success rate in simulation, even in out‑of‑distribution cases not seen in the pre‑trained data, and enables smooth autonomous task execution in real‑world dynamic environments. Our proposed method substantially outperforms pre‑trained DP baselines and demonstrates that RL fine‑tuning is key to reliable humanoid loco‑manipulation. https://refine‑dp.github.io/REFINE‑DP/
Authors:Dongyuan Li, Shun Zheng, Chang Xu, Jiang Bian, Renhe Jiang
Abstract:
Time series forecasting has attracted significant attention in the field of AI. Previous works have revealed that the Channel‑Independent (CI) strategy improves forecasting performance by modeling each channel individually, but it often suffers from poor generalization and overlooks meaningful inter‑channel interactions. Conversely, Channel‑Dependent (CD) strategies aggregate all channels, which may introduce irrelevant information and lead to oversmoothing. Despite recent progress, few existing methods offer the flexibility to adaptively balance CI and CD strategies in response to varying channel dependencies. To address this, we propose a generic plugin xCPD, that can adaptively model the channel‑patch dependencies from the perspective of graph spectral decomposition. Specifically, xCPD first projects multivariate signals into the frequency domain using a shared graph Fourier basis, and groups patches into low‑, mid‑, and high‑frequency bands based on their spectral energy responses. xCPD then applies a channel‑adaptive routing mechanism that dynamically adjusts the degree of inter‑channel interaction for each patch, enabling selective activation of frequency‑specific experts. This facilitates fine‑grained input‑aware modeling of smooth trends, local fluctuations, and abrupt transitions. xCPD can be seamlessly integrated on top of existing CI and CD forecasting models, consistently enhancing both accuracy and generalization across benchmarks. The code is available https://github.com/Clearloveyuan/xCPD.
Authors:Matthew Alford
Abstract:
We prove The Equivalence Theorem: structurally complete knowledge representation requires exactly four mutually entailing capabilities ‑‑ n‑ary relationships with attributes, temporal validity, uncertainty quantification, and causal relationships between relationships ‑‑ collectively equivalent to treating relationships as first‑class objects. Any system implementing one capability necessarily requires all four; any system missing one cannot achieve structural completeness. This result is constructive: we exhibit an Attributed Temporal Causal Hypergraph (ATCH) framework satisfying all four conditions simultaneously. The theorem yields a strict expressiveness hierarchy ‑‑ SQL < LPG < TypeDB < ATCH ‑‑ with witness queries that are structurally inexpressible at each lower level. We establish computational complexity bounds showing NP‑completeness for general queries but polynomial‑time tractability for practical query classes (acyclic patterns, bounded‑depth causal chains, windowed temporal queries). As direct corollaries, we derive solutions to classical AI problems: the Frame Problem (persistence by default from temporal validity), conflict resolution (contradictions as unresolved metadata with hidden variable discovery), and common sense reasoning (defaults with causal inhibitors). A prototype PostgreSQL extension in C validates practical feasibility within the established complexity bounds.
Authors:Andrii Shchur, Inna Skarga-Bandurova
Abstract:
Weather forecasting offers an ideal testbed for artificial intelligence (AI) to learn complex, multi‑scale physical systems. Traditional numerical weather prediction remains computationally costly for frequent regional updates, as high‑resolution nests require intensive boundary coupling. We introduce Multi‑Resolution Graph Neural Forecasting (MR‑GNF), a lightweight, physics‑aware model that performs short‑term regional forecasts directly on an ellipsoidal, multi‑scale graph of the Earth. The framework couples a 0.25° region of interest with a 0.5° context belt and 1.0° outer domain, enabling continuous cross‑scale message passing without explicit nested boundaries. Its axial graph‑attention network alternates vertical self‑attention across pressure levels with horizontal graph attention across surface nodes, capturing implicit 3‑D structure in just 1.6 M parameters. Trained on 40 years of ERA5 reanalysis (1980‑2024), MR‑GNF delivers stable +6 h to +24 h forecasts for near‑surface temperature, wind, and precipitation over the UK‑Ireland sector. Despite a total compute cost below 80 GPU‑hours on a single RTX 6000 Ada, the model matches or exceeds heavier regional AI systems while preserving physical consistency across scales. These results demonstrate that graph‑based neural operators can achieve trustworthy, high‑resolution weather prediction at a fraction of NWP cost, opening a practical path toward AI‑driven early‑warning and renewable‑energy forecasting systems. Project page and code: https://github.com/AndriiShchur/MR‑GNF
Authors:Pratik Ramesh, George Stoica, Arun Iyer, Leshem Choshen, Judy Hoffman
Abstract:
Model merging has shown that multitask models can be created by directly combining the parameters of different models that are each specialized on tasks of interest. However, models trained independently on distinct tasks often exhibit interference that degrades the merged model's performance. To solve this problem, we formally define the notion of Cross‑Task Interference as the drift in the representation of the merged model relative to its constituent models. Reducing cross‑task interference is key to improving merging performance. To address this issue, we propose our method, Resolving Interference (RI), a light‑weight adaptation framework which disentangles expert models to be functionally orthogonal to the space of other tasks, thereby reducing cross‑task interference. RI does this whilst using only unlabeled auxiliary data as input (i.e., no task‑data is needed), allowing it to be applied in data‑scarce scenarios. RI consistently improves the performance of state‑of‑the‑art merging methods by up to 3.8% and generalization to unseen domains by up to 2.3%. We also find RI to be robust to the source of auxiliary input while being significantly less sensitive to tuning of merging hyperparameters. Our codebase is available at: https://github.com/pramesh39/resolving_interference
Authors:Guanyu Chen, Ruichen Wang, Tianren Zhang, Feng Chen
Abstract:
In‑context learning (ICL) is a valuable capability exhibited by Transformers pretrained on diverse sequence tasks. However, previous studies have observed that ICL often conflicts with the model's inherent in‑weight learning (IWL) ability. By examining the representation space learned by a toy model in synthetic experiments, we identify the shared encoding space for context and samples in Transformers as a potential source of this conflict. To address this, we modify the model architecture to separately encode the context and samples into two distinct spaces: a task representation space and a sample representation space. We model these two spaces under a simple yet principled framework, assuming a linear representational structure and treating them as a pair of dual spaces. Both theoretical analysis and empirical results demonstrate the effectiveness of our proposed architecture, CoQE, in the single‑value answer setting. It not only enhances ICL performance through improved representation learning, but also successfully reconciles ICL and IWL capabilities across synthetic few‑shot classification and a newly designed pseudo‑arithmetic task. Code: https://github.com/McGuinnessChen/dual‑representation‑space‑encoding
Authors:Mansoor Ahmed, Nadeem Taj, Imdad Ullah Khan, Hemanth Venkateswara, Murray Patterson
Abstract:
Computational antibody design has seen rapid methodological progress, with dozens of deep generative methods proposed in the past three years, yet the field lacks a standardized benchmark for fair comparison and model development. These methods are evaluated on different SAbDab snapshots, non‑overlapping test sets, and incompatible metrics, and the literature fragments the design problem into numerous sub‑tasks with no common definition. We introduce \textscChimera‑Bench (CDR Modeling with Epitope‑guided Redesign), a unified benchmark built around a single canonical task: \emphepitope‑conditioned CDR sequence‑structure co‑design. \textscChimera‑Bench provides (1) a curated, deduplicated dataset of 2,922 antibody‑antigen complexes with epitope and paratope annotations; (2) three biologically motivated splits testing generalization to unseen epitopes, unseen antigen folds, and prospective temporal targets; and (3) a comprehensive evaluation protocol with five metric groups including novel epitope‑specificity measures. We benchmark representative methods spanning different generative paradigms and report results across all splits. \textscChimera‑Bench is the largest dataset of its kind for the antibody design problem, allowing the community to develop and test novel methods and evaluate their generalizability. The source code and data are available at: https://github.com/mansoor181/chimera‑bench.git
Authors:Liang Tang, Hongda Li, Jiayu Zhang, Long Chen, Shuxian Li, Siqi Pei, Tiaonan Duan, Yuhao Cheng
Abstract:
Emotion recognition in videos is a pivotal task in affective computing, where identifying subtle psychological states such as Ambivalence and Hesitancy holds significant value for behavioral intervention and digital health. Ambivalence and Hesitancy states often manifest through cross‑modal inconsistencies such as discrepancies between facial expressions, vocal tones, and textual semantics, posing a substantial challenge for automated recognition. This paper proposes a recognition framework that integrates temporal segment modeling with Multimodal Large Language Models. To address computational efficiency and token constraints in long video processing, we employ a segment‑based strategy, partitioning videos into short clips with a maximum duration of 5 seconds. We leverage the Qwen3‑Omni‑30B‑A3B model, fine‑tuned on the BAH dataset using LoRA and full‑parameter strategies via the MS‑Swift framework, enabling the model to synergistically analyze visual and auditory signals. Experimental results demonstrate that the proposed method achieves an accuracy of 85.1% on the test set, significantly outperforming existing benchmarks and validating the superior capability of Multimodal Large Language Models in capturing complex and nuanced emotional conflicts. The code is released at https://github.com/dlnn123/A‑H‑Detection‑with‑Qwen‑Omni.git.
Authors:Renwei Meng, Haoyi Wu, Jingming Wang, Haoyan Bai
Abstract:
Software vulnerability detection is critical in software en‑ gineering as security flaws arise from complex interactions across code structure, repository context, and runtime conditions. Existing meth‑ ods are limited by local code views, one‑shot prediction, and insuffi‑ cient validation, reducing reliability in realistic repository‑level settings. This study proposes VulnAgentX, a layered agentic framework integrat‑ ing lightweight risk screening, bounded context expansion, specialised analysis agents, selective dynamic verification, and evidence fusion into a unified pipeline. Experiments on function‑level and just‑in‑time vul‑ nerability benchmarks show VulnAgent‑X outperforms static baselines, encoder‑based models, and simpler agentic variants, with better local‑ isation and balanced performance‑cost trade‑offs. Treating vulnerabil‑ ity detection as a staged, evidence‑driven auditing process improves de‑ tection quality, reduces false positives, and produces interpretable re‑ sults for repository‑level software security analysis. Code is available at https://github.com/xiaolu‑666113/Vlun‑Agent‑X.
Authors:Zhenyu Zhang, Yixiong Zou, Yuhua Li, Ruixuan Li, Guangyao Chen
Abstract:
Source‑Free Cross‑Domain Few‑Shot Learning (SF‑CDFSL) focuses on fine‑tuning with limited training data from target domains (e.g., medical or satellite images), where Vision‑Language Models (VLMs) such as CLIP and SigLIP have shown promising results. Current works in traditional visual models suggest that improving visual discriminability enhances performance. However, in VLM‑based SF‑CDFSL tasks, we find that strengthening visual‑modal discriminability actually suppresses VLMs' performance. In this paper, we aim to delve into this phenomenon for an interpretation and a solution. By both theoretical and experimental proofs, our study reveals that fine‑tuning with the typical cross‑entropy loss (\mathcalL_\mathrmvlm) inherently includes a visual learning part and a cross‑modal learning part, where the cross‑modal part is crucial for rectifying the heavily disrupted modality misalignment in SF‑CDFSL. However, we find that the visual learning essentially acts as a shortcut that encourages the model to reduce \mathcalL_\mathrmvlm without considering the cross‑modal part, therefore hindering the cross‑modal alignment and harming the performance. Based on this interpretation, we further propose an approach to address this problem: first, we perturb the visual learning to guide the model to focus on the cross‑modal alignment. Then, we use the visual‑text semantic relationships to gradually align the visual and textual modalities during the fine‑tuning. Extensive experiments on various settings, backbones (CLIP, SigLip, PE‑Core), and tasks (4 CDFSL datasets and 11 FSL datasets) show that we consistently set new state‑of‑the‑art results. Code is available at https://github.com/zhenyuZ‑HUST/CVPR26‑Mind‑the‑Discriminability‑Trap.
Authors:Domen Preložnik, Žiga Špiclin
Abstract:
Inter‑scanner variability of magnetic resonance imaging has an adverse impact on the diagnostic and prognostic quality of the scans and necessitates the development of models robust to domain shift inflicted by the unseen scanner data. Review of recent advances in domain adaptation showed that efficacy of strategies involving modifications or constraints on the latent space appears to be contingent upon the level and/or depth of supervision during model training. In this paper, we therefore propose an unsupervised domain adaptation technique based on self‑supervised multi‑stage unlearning (SSMSU). Building upon the state‑of‑the‑art segmentation framework nnU‑Net, we employ deep supervision at deep encoder stages using domain classifier unlearning, applied sequentially across the deep stages to suppress domain‑related latent features. Following self‑configurable approach of the nnU‑Net, the auxiliary feedback loop implements a self‑supervised backpropagation schedule for the unlearning process, since continuous unlearning was found to have a detrimental effect on the main segmentation task. Experiments were carried out on four public datasets for benchmarking white‑matter lesion segmentation methods. Five benchmark models and/or strategies, covering passive to active unsupervised domain adaptation, were tested. In comparison, the SSMSU demonstrated the advantage of unlearning by enhancing lesion sensitivity and limiting false detections, which resulted in higher overall segmentation quality in terms of segmentation overlap and relative lesion volume error. The proposed model inputs only the FLAIR modality, which simplifies preprocessing pipelines, eliminates the need for inter‑modality registration errors and harmonization, which can introduce variability. Source code is available on https://github.com/Pubec/nnunetv2‑unlearning.
Authors:Mingyu Kim, Young-Heon Kim, Mijung Park
Abstract:
Safety mechanisms for diffusion and flow models have recently been developed along two distinct paths. In robot planning, control barrier functions are employed to guide generative trajectories away from obstacles at every denoising step by explicitly imposing geometric constraints. In parallel, recent data‑driven, negative guidance approaches have been shown to suppress harmful content and promote diversity in generated samples. However, they rely on heuristics without clearly stating when safety guidance is actually necessary. In this paper, we first introduce a unified probabilistic framework using a Maximum Mean Discrepancy (MMD) potential for image generation tasks that recasts both Shielded Diffusion and Safe Denoiser as instances of our energy‑based negative guidance against unsafe data samples. Furthermore, we leverage control‑barrier functions analysis to justify the existence of a critical time window in which negative guidance must be strong; outside of this window, the guidance should decay to zero to ensure safe and high‑quality generation. We evaluate our unified framework on several realistic safe generation scenarios, confirming that negative guidance should be applied in the early stages of the denoising process for successful safe generation.
Authors:Dean Barr
Abstract:
Despite the scale of capital being deployed toward AI initiatives, no empirical framework currently exists for benchmarking where a firm stands relative to competitors in AI readiness and deployment, or for translating that position into auditable financial outcomes. In practice, private equity deal teams, management consultants, and corporate strategists have relied on qualitative judgment and ad‑hoc maturity labels; tools that are neither comparable across industries nor grounded in observable economic data. This paper introduces the AI Transformation Gap Index (AITG), a composite empirical framework that measures the distance between a firm's current AI deployment and a time varying, industry constrained capability frontier, then maps that distance to dollar denominated value creation, execution feasibility under uncertainty, and competitive disruption risk. Five linked modules address this gap: cross industry normalization (IASS), a dynamic capability ceiling that evolves with frontier capabilities (AFC), trajectory based firm scoring with integrated execution risk (IFS), a CES bottleneck value decomposition mapping gap scores to enterprise value (VCB), and a competitive hazard measure for inaction (ADRI). I calibrate the framework for 22 industry verticals and apply it to 14 public companies using public filings. A retrospective construct validity exercise correlating AITG scores with observed EBITDA margin expansion yields Spearman rho_s = 0.818 (n = 10), directionally consistent with predictions though insufficient for causal identification. A counterintuitive result emerges: the largest AI transformation gaps do not produce the highest value density, because implementation friction, CES bottlenecks, and timing lags erode the theoretical upside of wide gaps.
Authors:Minsang Kim, Seung Jun Baek
Abstract:
Knowledge Distillation (KD) can transfer the reasoning abilities of large models to smaller ones, which can reduce the costs to generate Chain‑of‑Thoughts for reasoning tasks. KD methods typically ask the student to mimic the teacher's distribution over the entire output. However, a student with limited capacity can be overwhelmed by such extensive supervision causing a distribution mismatch, especially in complex reasoning tasks. We propose Token‑Selective Dual Knowledge Distillation (TSD‑KD), a framework for student‑centric distillation. TSD‑KD focuses on distilling important tokens for reasoning and encourages the student to explain reasoning in its own words. TSD‑KD combines indirect and direct distillation. Indirect distillation uses a weak form of feedback based on preference ranking. The student proposes candidate responses generated on its own; the teacher re‑ranks those candidates as indirect feedback without enforcing its entire distribution. Direct distillation uses distribution matching; however, it selectively distills tokens based on the relative confidence between teacher and student. Finally, we add entropy regularization to maintain the student's confidence during distillation. Overall, our method provides the student with targeted and indirect feedback to support its own reasoning process and to facilitate self‑improvement. The experiments show the state‑of‑the‑art performance of TSD‑KD on 10 challenging reasoning benchmarks, outperforming the baseline and runner‑up in accuracy by up to 54.4% and 40.3%, respectively. Notably, a student trained by TSD‑KD even outperformed its own teacher model in four cases by up to 20.3%. The source code is available at https://github.com/kmswin1/TSD‑KD.
Authors:Nabin Oli
Abstract:
Traditional benchmarks like HumanEval and MBPP test logic and syntax effectively, but fail when code must produce dynamic, pedagogical visuals. We introduce ManiBench, a specialized benchmark evaluating LLM performance in generating Manim CE code, where temporal fidelity and version‑aware API correctness are critical. ManiBench targets two key failure modes: Syntactic Hallucinations (valid Python referencing non‑existent or deprecated Manim APIs) and Visual‑Logic Drift (generated visuals diverging from intended mathematical logic through timing errors or missing causal relationships). The benchmark comprises 150‑200 problems across five difficulty levels spanning calculus, linear algebra, probability, topology, and AI, grounded in analysis of 3Blue1Brown's ManimGL source (53,000 lines, 143 scene classes). Evaluation uses a four‑tier framework measuring Executability, Version‑Conflict Error Rate, Alignment Score, and Coverage Score. An open‑source framework automates evaluation across multiple models and prompting strategies. Code, data and benchmark suite are available at https://github.com/nabin2004/ManiBench. and the dataset is hosted on https://huggingface.co/datasets/nabin2004/ManiBench.
Authors:Florin Adrian Chitan
Abstract:
The proliferation of autonomous AI agents capable of executing real‑world actions ‑ filesystem operations, API calls, database modifications, financial transactions ‑ introduces a class of safety risk not addressed by existing content‑moderation infrastructure. Current text‑safety systems evaluate linguistic content for harm categories such as violence, hate speech, and sexual content; they are architecturally unsuitable for evaluating whether a proposed action falls within an agent's authorized operational scope. We present ILION (Intelligent Logic Identity Operations Network), a deterministic execution gate for agentic AI systems. ILION employs a five‑component cascade architecture ‑ Transient Identity Imprint (TII), Semantic Vector Reference Frame (SVRF), Identity Drift Control (IDC), Identity Resonance Score (IRS) and Consensus Veto Layer (CVL) ‑ to classify proposed agent actions as BLOCK or ALLOW without statistical training or API dependencies. The system requires zero labeled data, operates in sub‑millisecond latency, and produces fully interpretable verdicts. We evaluate ILION on ILION‑Bench v2, a purpose‑built benchmark of 380 test scenarios across eight attack categories with 39% hard‑difficulty adversarial cases and a held‑out development split. ILION achieves F1 = 0.8515, precision = 91.0%, and a false positive rate of 7.9% at a mean latency of 143 microseconds. Comparative evaluation against three baselines ‑ Lakera Guard (F1 = 0.8087), OpenAI Moderation API (F1 = 0.1188), and Llama Guard 3 (F1 = 0.0105) ‑ demonstrates that existing text‑safety infrastructure systematically fails on agent execution safety tasks due to a fundamental task mismatch. ILION outperforms the best commercial baseline by 4.3 F1 points while operating 2,000 times faster with a false positive rate four times lower.
Authors:Ziyu Liu, Shengyuan Ding, Xinyu Fang, Xuanlang Dai, Penghui Yang, Jianze Liang, Jiaqi Wang, Kai Chen, Dahua Lin, Yuhang Zang
Abstract:
Vision‑to‑code tasks require models to reconstruct structured visual inputs, such as charts, tables, and SVGs, into executable or structured representations with high visual fidelity. While recent Large Vision Language Models (LVLMs) achieve strong results via supervised fine‑tuning, reinforcement learning remains challenging due to misaligned reward signals. Existing rewards either rely on textual rules or coarse visual embedding similarity, both of which fail to capture fine‑grained visual discrepancies and are vulnerable to reward hacking. We propose Visual Equivalence Reward Model (Visual‑ERM), a multimodal generative reward model that provides fine‑grained, interpretable, and task‑agnostic feedback to evaluate vision‑to‑code quality directly in the rendered visual space. Integrated into RL, Visual‑ERM improves Qwen3‑VL‑8B‑Instruct by +8.4 on chart‑to‑code and yields consistent gains on table and SVG parsing (+2.7, +4.1 on average), and further strengthens test‑time scaling via reflection and revision. We also introduce VisualCritic‑RewardBench (VC‑RewardBench), a benchmark for judging fine‑grained image‑to‑image discrepancies on structured visual data, where Visual‑ERM at 8B decisively outperforms Qwen3‑VL‑235B‑Instruct and approaches leading closed‑source models. Our results suggest that fine‑grained visual reward supervision is both necessary and sufficient for vision‑to‑code RL, regardless of task specificity.
Authors:Yu Li, Tian Lan, Zhengling Qi
Abstract:
Group Relative Policy Optimization (GRPO) has emerged as an effective method for training reasoning models. While it computes advantages based on group mean, GRPO treats each output as an independent sample during the optimization and overlooks a vital structural signal: the natural contrast between correct and incorrect solutions within the same group, thus ignoring the rich, comparative data that could be leveraged by explicitly pitting successful reasoning traces against failed ones. To capitalize on this, we present a contrastive reformulation of GRPO, showing that the GRPO objective implicitly maximizes the margin between the policy ratios of correct and incorrect samples. Building on this insight, we propose Bilateral Context Conditioning (BICC), a mechanism that allows the model to cross‑reference successful and failed reasoning traces during the optimization, enabling a direct information flow across samples. We further introduce Reward‑Confidence Correction (RCC) to stabilize training by dynamically adjusts the advantage baseline in GRPO using reward‑confidence covariance derived from the first‑order approximation of the variance‑minimizing estimator. Both mechanisms require no additional sampling or auxiliary models and can be adapted to all GRPO variants. Experiments on mathematical reasoning benchmarks demonstrate consistent improvements across comprehensive models and algorithms. Code is available at \hrefhttps://github.com/Skylanding/BiCChttps://github.com/Skylanding/BiCC.
Authors:Tianhao Fu, Bingxuan Yang, Juncheng Guo, Shrena Sribalan, Yucheng Chen
Abstract:
Automatic identification of screw types is important for industrial automation, robotics, and inventory management. However, publicly available datasets for screw classification are scarce, particularly for controlled single‑object scenarios commonly encountered in automated sorting systems. In this work, we introduce SortScrews, a dataset for casewise visual classification of screws. The dataset contains 560 RGB images at 512×512 resolution covering six screw types and a background class. Images are captured using a standardized acquisition setup and include mild variations in lighting and camera perspective across four capture settings. To facilitate reproducible research and dataset expansion, we also provide a reusable data collection script that allows users to easily construct similar datasets for custom hardware components using inexpensive camera setups. We establish baseline results using transfer learning with EfficientNet‑B0 and ResNet‑18 classifiers pretrained on ImageNet. In addition, we conduct a well‑explored failure analysis. Despite the limited dataset size, these lightweight models achieve strong classification accuracy, demonstrating that controlled acquisition conditions enable effective learning even with relatively small datasets. The dataset, collection pipeline, and baseline training code are publicly available at https://github.com/ATATC/SortScrews.
Authors:Sydney Lewis
Abstract:
Long conversations with an AI agent create a simple problem for one user: the history is useful, but carrying it verbatim is expensive. We study personalized agent memory: one user's conversation history with an agent, distilled into a compact retrieval layer for later search. Each exchange is compressed into a compound object with four fields (exchange_core, specific_context, thematic room_assignments, and regex‑extracted files_touched). The searchable distilled text averages 38 tokens per exchange. Applied to 4,182 conversations (14,340 exchanges) from 6 software engineering projects, the method reduces average exchange length from 371 to 38 tokens, yielding 11x compression. We evaluate whether personalized recall survives that compression using 201 recall‑oriented queries, 107 configurations spanning 5 pure and 5 cross‑layer search modes, and 5 LLM graders (214,519 consensus‑graded query‑result pairs). The best pure distilled configuration reaches 96% of the best verbatim MRR (0.717 vs 0.745). Results are mechanism‑dependent. All 20 vector search configurations remain non‑significant after Bonferroni correction, while all 20 BM25 configurations degrade significantly (effect sizes |d|=0.031‑0.756). The best cross‑layer setup slightly exceeds the best pure verbatim baseline (MRR 0.759). Structured distillation compresses single‑user agent memory without uniformly sacrificing retrieval quality. At 1/11 the context cost, thousands of exchanges fit within a single prompt while the verbatim source remains available for drill‑down. We release the implementation and analysis pipeline as open‑source software.
Authors:Aditya Parikh, Aasa Feragen
Abstract:
We present a fairness‑aware framework for multi‑class lung disease diagnosis from chest CT volumes, developed for the Fair Disease Diagnosis Challenge at the PHAROS‑AIF‑MIH Workshop (CVPR 2026). The challenge requires classifying CT scans into four categories ‑‑ Healthy, COVID‑19, Adenocarcinoma, and Squamous Cell Carcinoma ‑‑ with performance measured as the average of per‑gender macro F1 scores, explicitly penalizing gender‑inequitable predictions. Our approach addresses two core difficulties: the sparse pathological signal across hundreds of slices, and a severe demographic imbalance compounded across disease class and gender. We propose an attention‑based Multiple Instance Learning (MIL) model on a ConvNeXt backbone that learns to identify diagnostically relevant slices without slice‑level supervision, augmented with a Gradient Reversal Layer (GRL) that adversarially suppresses gender‑predictive structure in the learned scan representation. Training incorporates focal loss with label smoothing, stratified cross‑validation over joint (class, gender) strata, and targeted oversampling of the most underrepresented subgroup. At inference, all five‑fold checkpoints are ensembled with horizontal‑flip test‑time augmentation via soft logit voting and out‑of‑the‑fold threshold optimization for robustness. Our model achieves a mean validation competition score of 0.685 (std ‑ 0.030), with the best single fold reaching 0.759. All training and inference code is publicly available at https://github.com/ADE‑17/cvpr‑fair‑chest‑ct
Authors:Raphael Trumpp, Denis Hoornaert, Mirco Theile, Marco Caccamo
Abstract:
Residual policy learning (RPL), in which a learned policy refines a static base policy using deep reinforcement learning (DRL), has shown strong performance across various robotic applications. Its effectiveness is particularly evident in autonomous racing, a domain that serves as a challenging benchmark for real‑world DRL. However, deploying RPL‑based controllers introduces system complexity and increases inference latency. We address this by introducing an extension of RPL named attenuated residual policy optimization (α‑RPO). Unlike standard RPL, α‑RPO yields a standalone neural policy by progressively attenuating the base policy, which initially serves to bootstrap learning. Furthermore, this mechanism enables a form of privileged learning, where the base policy is permitted to use sensor modalities not required for final deployment. We design α‑RPO to integrate seamlessly with PPO, ensuring that the attenuated influence of the base controller is dynamically compensated during policy optimization. We evaluate α‑RPO by building a framework for 1:10‑scaled autonomous racing around it. In both simulation and zero‑shot real‑world transfer to Roboracer cars, α‑RPO not only reduces system complexity but also improves driving performance compared to baselines ‑ demonstrating its practicality for robotic deployment. Our code is available at: https://github.com/raphajaner/arpo_racing.
Authors:Zikang Liu, Longteng Guo, Handong Li, Ru Zhen, Xingjian He, Ruyi Ji, Xiaoming Ren, Yanhao Zhang, Haonan Lu, Jing Liu
Abstract:
Real‑time understanding of continuous video streams is essential for interactive assistants and multimodal agents operating in dynamic environments. However, most existing video reasoning approaches follow a batch paradigm that defers reasoning until the full video context is observed, resulting in high latency and growing computational cost that are incompatible with streaming scenarios. In this paper, we introduce ThinkStream, a framework for streaming video reasoning based on a Watch‑‑Think‑‑Speak paradigm that enables models to incrementally update their understanding as new video observations arrive. At each step, the model performs a short reasoning update and decides whether sufficient evidence has accumulated to produce a response. To support long‑horizon streaming, we propose Reasoning‑Compressed Streaming Memory (RCSM), which treats intermediate reasoning traces as compact semantic memory that replaces outdated visual tokens while preserving essential context. We further train the model using a Streaming Reinforcement Learning with Verifiable Rewards scheme that aligns incremental reasoning and response timing with the requirements of streaming interaction. Experiments on multiple streaming video benchmarks show that ThinkStream significantly outperforms existing online video models while maintaining low latency and memory usage. Code, models and data will be released at https://github.com/johncaged/ThinkStream
Authors:Kadir-Kaan Özer, René Ebeling, Markus Enzweiler
Abstract:
Multivariate time series anomalies often manifest as shifts in cross‑channel dependencies rather than simple amplitude excursions. In autonomous driving, for instance, a steering command might be internally consistent but decouple from the resulting lateral acceleration. Residual‑based detectors can miss such anomalies when flexible sequence models still reconstruct signals plausibly despite altered coordination. We introduce AxonAD, an unsupervised detector that treats multi‑head attention query evolution as a short horizon predictable process. A gradient‑updated reconstruction pathway is coupled with a history‑only predictor that forecasts future query vectors from past context. This is trained via a masked predictor‑target objective against an exponential moving average (EMA) target encoder. At inference, reconstruction error is combined with a tail‑aggregated query mismatch score, which measures cosine deviation between predicted and target queries on recent timesteps. This dual approach provides sensitivity to structural dependency shifts while retaining amplitude‑level detection. On proprietary in‑vehicle telemetry with interval annotations and on the TSB‑AD multi‑variate suite (17 datasets, 180 series) with threshold‑free and range‑aware metrics, AxonAD improves ranking quality and temporal localization over strong baselines. Ablations confirm that query prediction and combined scoring are the primary drivers of the observed gains. Code is available at the URL https://github.com/iis‑esslingen/AxonAD.
Authors:Xin Xu, Weilong Li, Wei Liu, Wenke Huang, Zhixi Yu, Bin Yang, Xiaoying Liao, Kui Jiang
Abstract:
Federated Domain Generalization for Person Re‑Identification (FedDG‑ReID) learns domain‑invariant representations from decentralized data. While Vision Transformer (ViT) is widely adopted, its global attention often fails to distinguish pedestrians from high similarity backgrounds or diverse viewpoints ‑‑ a challenge amplified by cross‑client distribution shifts in FedDG‑ReID. To address this, we propose Federated Body Distribution Aware Visual Prompt (FedBPrompt), introducing learnable visual prompts to guide Transformer attention toward pedestrian‑centric regions. FedBPrompt employs a Body Distribution Aware Visual Prompts Mechanism (BAPM) comprising: Holistic Full Body Prompts to suppress cross‑client background noise, and Body Part Alignment Prompts to capture fine‑grained details robust to pose and viewpoint variations. To mitigate high communication costs, we design a Prompt‑based Fine‑Tuning Strategy (PFTS) that freezes the ViT backbone and updates only lightweight prompts, significantly reducing communication overhead while maintaining adaptability. Extensive experiments demonstrate that BAPM effectively enhances feature discrimination and cross‑domain generalization, while PFTS achieves notable performance gains within only a few aggregation rounds. Moreover, both BAPM and PFTS can be easily integrated into existing ViT‑based FedDG‑ReID frameworks, making FedBPrompt a flexible and effective solution for federated person re‑identification. The code is available at https://github.com/leavlong/FedBPrompt.
Authors:David McAllister, Miika Aittala, Tero Karras, Janne Hellsten, Angjoo Kanazawa, Timo Aila, Samuli Laine
Abstract:
Reinforcement learning (RL) has become a standard technique for post‑training diffusion‑based image synthesis models, as it enables learning from reward signals to explicitly improve desirable aspects such as image quality and prompt alignment. In this paper, we propose an online RL variant that reduces the variance in the model updates by sampling paired trajectories and pulling the flow velocity in the direction of the more favorable image. Unlike existing methods that treat each sampling step as a separate policy action, we consider the entire sampling process as a single action. We experiment with both high‑quality vision language models and off‑the‑shelf quality metrics for rewards, and evaluate the outputs using a broad set of metrics. Our method converges faster and yields higher output quality and prompt alignment than previous approaches.
Authors:Yiqun Zhang, Zexi Tan, Xiaopeng Luo, Yunlin Liu
Abstract:
Most real‑world IoT data analysis tasks, such as clustering and anomaly event detection, are unsupervised and highly susceptible to the presence of outliers. In addition to sporadic scattered outliers caused by factors such as faulty sensor readings, IoT systems often exhibit clustered outliers. These occur when multiple devices or nodes produce similar anomalous measurements, for instance, owing to localized interference, emerging security threats, or regional false alarms, forming micro‑clusters. These clustered outliers can be easily mistaken for normal behavior because of their relatively high local density, thereby obscuring the detection of both scattered and contextual anomalies. To address this, we propose a novel outlier detection paradigm that leverages the natural neighboring relationships using graph structures. This facilitates multi‑perspective anomaly evaluation by incorporating reference sets at both local and global scales derived from the graph. Our approach enables the effective recognition of scattered outliers without interference from clustered anomalies, whereas the graph structure simultaneously helps reflect and isolate clustered outlier groups. Extensive experiments, including comparative performance analysis, ablation studies, validation on downstream clustering tasks, and evaluation of hyperparameter sensitivity, demonstrate the efficacy of the proposed method. The source code is available at https://github.com/gordonlok/DROD.
Authors:Chenyang Zhu, Hongxiang Li, Xiu Li, Long Chen
Abstract:
Concept customization typically binds rare tokens to a target concept. Unfortunately, these approaches often suffer from unstable performance as the pretraining data seldom contains these rare tokens. Meanwhile, these rare tokens fail to convey the inherent knowledge of the target concept. Consequently, we introduce Knowledge‑aware Concept Customization, a novel task aiming at binding diverse textual knowledge to target visual concepts. This task requires the model to identify the knowledge within the text prompt to perform high‑fidelity customized generation. Meanwhile, the model should efficiently bind all the textual knowledge to the target concept. Therefore, we propose MoKus, a novel framework for knowledge‑aware concept customization. Our framework relies on a key observation: cross‑modal knowledge transfer, where modifying knowledge within the text modality naturally transfers to the visual modality during generation. Inspired by this observation, MoKus contains two stages: (1) In visual concept learning, we first learn the anchor representation to store the visual information of the target concept. (2) In textual knowledge updating, we update the answer for the knowledge queries to the anchor representation, enabling high‑fidelity customized generation. To further comprehensively evaluate our proposed MoKus on the new task, we introduce the first benchmark for knowledge‑aware concept customization: KnowCusBench. Extensive evaluations have demonstrated that MoKus outperforms state‑of‑the‑art methods. Moreover, the cross‑model knowledge transfer allows MoKus to be easily extended to other knowledge‑aware applications like virtual concept creation and concept erasure. We also demonstrate the capability of our method to achieve improvements on world knowledge benchmarks.
Authors:Kaifan Zhang, Lihuo He, Junjie Ke, Yuqi Ji, Lukun Wu, Lizi Wang, Xinbo Gao
Abstract:
Visual stimuli reconstruction from EEG remains challenging due to fidelity loss and representation shift. We propose CognitionCapturerPro, an enhanced framework that integrates EEG with multi‑modal priors (images, text, depth, and edges) via collaborative training. Our core contributions include an uncertainty‑weighted similarity scoring mechanism to quantify modality‑specific fidelity and a fusion encoder for integrating shared representations. By employing a simplified alignment module and a pre‑trained diffusion model, our method significantly outperforms the original CognitionCapturer on the THINGS‑EEG dataset, improving Top‑1 and Top‑5 retrieval accuracy by 25.9% and 10.6%, respectively. Code is available at: https://github.com/XiaoZhangYES/CognitionCapturerPro.
Authors:Dongxu Zhang, Yingsen Wang, Yiding Sun, Haoran Xu, Peilin Fan, Jihua Zhu
Abstract:
Robust point cloud registration is a fundamental task in 3D computer vision and geometric deep learning, essential for applications such as large‑scale 3D reconstruction, augmented reality, and scene understanding. However, the performance of established learning‑based methods often degrades in complex, real world scenarios characterized by incomplete data, sensor noise, and low overlap regions. To address these limitations, we propose CMHANet, a novel Cross‑Modal Hybrid Attention Network. Our method integrates the fusion of rich contextual information from 2D images with the geometric detail of 3D point clouds, yielding a comprehensive and resilient feature representation. Furthermore, we introduce an innovative optimization function based on contrastive learning, which enforces geometric consistency and significantly improves the model's robustness to noise and partial observations. We evaluated CMHANet on the 3DMatch and the challenging 3DLoMatch datasets. \revAdditionally, zero‑shot evaluations on the TUM RGB‑D SLAM dataset verify the model's generalization capability to unseen domains. The experimental results demonstrate that our method achieves substantial improvements in both registration accuracy and overall robustness, outperforming current techniques. We also release our code in \hrefhttps://github.com/DongXu‑Zhang/CMHANethttps://github.com/DongXu‑Zhang/CMHANet.
Authors:Dongxu Zhang, Jihua Zhu, Shiqi Li, Wenbiao Yan, Haoran Xu, Peilin Fan, Huimin Lu
Abstract:
Point cloud registration (PCR) is a fundamental task in 3D vision and provides essential support for applications such as autonomous driving, robotics, and environmental modeling. Despite its widespread use, existing methods often fail when facing real‑world challenges like heavy noise, significant occlusions, and large‑scale transformations. These limitations frequently result in compromised registration accuracy and insufficient robustness in complex environments. In this paper, we propose IGASA as a novel registration framework constructed upon a Hierarchical Pyramid Architecture (HPA) designed for robust multi‑scale feature extraction and fusion. The framework integrates two pivotal components consisting of the Hierarchical Cross‑Layer Attention (HCLA) module and the Iterative Geometry‑Aware Refinement (IGAR) module. The HCLA module utilizes skip attention mechanisms to align multi‑resolution features and enhance local geometric consistency. Simultaneously, the IGAR module is designed for the fine matching phase by leveraging reliable correspondences established during coarse matching. This synergistic integration within the architecture allows IGASA to adapt effectively to diverse point cloud structures and intricate transformations. We evaluate the performance of IGASA on four widely recognized benchmark datasets including 3D(Lo)Match, KITTI, and nuScenes. Our extensive experiments consistently demonstrate that IGASA significantly surpasses state‑of‑the‑art methods and achieves notable improvements in registration accuracy. This work provides a robust foundation for advancing point cloud registration techniques while offering valuable insights for practical 3D vision applications. The code for IGASA is available in \hrefhttps://github.com/DongXu‑Zhang/IGASAhttps://github.com/DongXu‑Zhang/IGASA.
Authors:Ty Valencia, Burak Barlas, Varun Singhal, Ruchir Bhatia, Wei Yang
Abstract:
Multimodal recommendation is commonly framed as a feature fusion problem, where textual and visual signals are combined to better model user preference. However, the effectiveness of multimodal recommendation may depend not only on how modalities are fused, but also on whether item content is represented in a semantic space aligned with preference matching. This issue is particularly important because raw visual features often preserve appearance similarity, while user decisions are typically driven by higher‑level semantic factors such as style, material, and usage context. Motivated by this observation, we propose LVLM‑grounded Multimodal Semantic Representation for Recommendation (VLM4Rec), a lightweight framework that organizes multimodal item content through semantic alignment rather than direct feature fusion. VLM4Rec first uses a large vision‑language model to ground each item image into an explicit natural‑language description, and then encodes the grounded semantics into dense item representations for preference‑oriented retrieval. Recommendation is subsequently performed through a simple profile‑based semantic matching mechanism over historical item embeddings, yielding a practical offline‑online decomposition. Extensive experiments on multiple multimodal recommendation datasets show that VLM4Rec consistently improves performance over raw visual features and several fusion‑based alternatives, suggesting that representation quality may matter more than fusion complexity in this setting. The code is released at https://github.com/tyvalencia/enhancing‑mm‑rec‑sys.
Authors:Gihoon Kim, Euntai Kim
Abstract:
Reinforcement Learning from Human Feedback (RLHF) is a widely used approach to align large‑scale AI systems with human values. However, RLHF typically assumes a single, universal reward, which overlooks diverse preferences and limits personalization. Variational Preference Learning (VPL) seeks to address this by introducing user‑specific latent variables. Despite its promise, we found that VPL suffers from posterior collapse. While this phenomenon is well known in VAEs, it has not previously been identified in preference learning frameworks. Under sparse preference data and with overly expressive decoders, VPL may cause latent variables to be ignored, reverting to a single‑reward model. To overcome this limitation, we propose Swap‑guided Preference Learning (SPL). The key idea is to construct fictitious swap annotators and use the mirroring property of their preferences to guide the encoder. SPL introduces three components: (1) swap‑guided base regularization, (2) Preferential Inverse Autoregressive Flow (P‑IAF), and (3) adaptive latent conditioning. Experiments show that SPL mitigates collapse, enriches user‑specific latents, and improves preference prediction. Our code and data are available at https://github.com/cobang0111/SPL
Authors:Jianqiang Lin, Zhiqiang Shen, Peng Cao, Jinzhu Yang, Osmar R. Zaiane, Xiaoli Liu
Abstract:
Although diffusion models have achieved remarkable progress in multi‑modal magnetic resonance imaging (MRI) translation tasks, existing methods still tend to suffer from anatomical inconsistencies or degraded texture details when handling arbitrary missing‑modality scenarios. To address these issues, we propose a latent diffusion‑based multi‑modal MRI translation framework, termed MSG‑LDM. By leveraging the available modalities, the proposed method infers complete structural information, which preserves reliable boundary details. Specifically, we introduce a style‑‑structure disentanglement mechanism in the latent space, which explicitly separates modality‑specific style features from shared structural representations, and jointly models low‑frequency anatomical layouts and high‑frequency boundary details in a multi‑scale feature space. During the structure disentanglement stage, high‑frequency structural information is explicitly incorporated to enhance feature representations, guiding the model to focus on fine‑grained structural cues while learning modality‑invariant low‑frequency anatomical representations. Furthermore, to reduce interference from modality‑specific styles and improve the stability of structure representations, we design a style consistency loss and a structure‑aware loss. Extensive experiments on the BraTS2020 and WMH datasets demonstrate that the proposed method outperforms existing MRI synthesis approaches, particularly in reconstructing complete structures. The source code is publicly available at https://github.com/ziyi‑start/MSG‑LDM.
Authors:Vishnu Teja Kunde, Fatemeh Doudi, Mahdi Farahbakhsh, Dileep Kalathil, Krishna Narayanan, Jean-Francois Chamberland
Abstract:
Reinforcement learning (RL) has been effective for post‑training autoregressive (AR) language models, but extending these methods to diffusion language models (DLMs) is challenging due to intractable sequence‑level likelihoods. Existing approaches therefore rely on surrogate likelihoods or heuristic approximations, which can introduce bias and obscure the sequential structure of denoising. We formulate diffusion‑based sequence generation as a finite‑horizon Markov decision process over the denoising trajectory and derive an exact, unbiased policy gradient that decomposes over denoising steps and is expressed in terms of intermediate advantages, without requiring explicit evaluation of the sequence likelihood. To obtain a practical and compute‑efficient estimator, we (i) select denoising steps for policy updates via an entropy‑guided approximation bound, and (ii) estimate intermediate advantages using a one‑step denoising reward naturally provided by the diffusion model, avoiding costly multi‑step rollouts. Experiments on coding and logical reasoning benchmarks demonstrate state‑of‑the‑art results, with strong competitive performance on mathematical reasoning, outperforming existing RL post‑training approaches for DLMs. Code is available at https://github.com/vishnutez/egspo‑dllm‑rl.
Authors:Joong Ho Kim, Nicholas Thai, Souhardya Saha Dip, Dong Lao, Keith G. Mills
Abstract:
Text‑to‑Image (T2I) generation is primarily driven by Diffusion Models (DM) which rely on random Gaussian noise. Thus, like playing the slots at a casino, a DM will produce different results given the same user‑defined inputs. This imposes a gambler's burden: To perform multiple generation cycles to obtain a satisfactory result. However, even though DMs use stochastic sampling to seed generation, the distribution of generated content quality highly depends on the prompt and the generative ability of a DM with respect to it. To account for this, we propose Naïve PAINE for improving the generative quality of Diffusion Models by leveraging T2I preference benchmarks. We directly predict the numerical quality of an image from the initial noise and given prompt. Naïve PAINE then selects a handful of quality noises and forwards them to the DM for generation. Further, Naïve PAINE provides feedback on the DM generative quality given the prompt and is lightweight enough to seamlessly fit into existing DM pipelines. Experimental results demonstrate that Naïve PAINE outperforms existing approaches on several prompt corpus benchmarks.
Authors:Mohamad Alansari, Naufal Suryanto, Divya Velayudhan, Sajid Javed, Naoufel Werghi, Muzammal Naseer
Abstract:
Multimodal large language models (MLLMs) have advanced from image‑level reasoning to pixel‑level grounding, but extending these capabilities to videos remains challenging as models must achieve spatial precision and temporally consistent reference tracking. Existing video MLLMs often rely on a static segmentation token ([SEG]) for frame‑wise grounding, which provides semantics but lacks temporal context, causing spatial drift, identity switches, and unstable initialization when objects move or reappear. We introduce SPARROW, a pixel‑grounded video MLLM that unifies spatial accuracy and temporal stability through two key components: (i) Target‑Specific Tracked Features (TSF), which inject temporally aligned referent cues during training, and (ii) a dual‑prompt design that decodes box ([BOX]) and segmentation ([SEG]) tokens to fuse geometric priors with semantic grounding. SPARROW is supported by a curated referential video dataset of 30,646 videos and 45,231 Q&A pairs and operates end‑to‑end without external detectors via a class‑agnostic SAM2‑based proposer. Integrated into three recent open‑source video MLLMs (UniPixel, GLUS, and VideoGLaMM), SPARROW delivers consistent gains across six benchmarks, improving up to +8.9 J&F on RVOS, +5 mIoU on visual grounding, and +5.4 CLAIR on GCG. These results demonstrate that SPARROW substantially improves referential stability, spatial precision, and temporal coherence in pixel‑grounded video understanding. Project page: https://risys‑lab.github.io/SPARROW
Authors:Ziwei Wang, Zhentao He, Xingyi He, Hongbin Wang, Tianwang Jia, Jingwei Luo, Siyang Li, Xiaoqing Chen, Dongrui Wu
Abstract:
Deep learning has achieved transformative performance across diverse domains, largely driven by the large‑scale, high‑quality training data. In contrast, the development of brain‑computer interfaces (BCIs) is fundamentally constrained by the limited, heterogeneous, and privacy‑sensitive neural recordings. Generating synthetic yet physiologically plausible brain signals has therefore emerged as a compelling way to mitigate data scarcity and enhance model capacity. This survey provides a comprehensive review of brain signal generation for BCIs, covering methodological taxonomies, benchmark experiments, evaluation metrics, and key applications. We systematically categorize existing generative algorithms into four types: knowledge‑based, feature‑based, model‑based, and translation‑based approaches. Furthermore, we benchmark existing brain signal generation approaches across four representative BCI paradigms to provide an objective performance comparison. Finally, we discuss the potentials and challenges of current generation approaches and prospect future research on accurate, data‑efficient, and privacy‑aware BCI systems. The benchmark codebase is publicized at https://github.com/wzwvv/DG4BCI.
Authors:Terrence J. Lee-St. John, Jordan L. Lawson, Bartlomiej Piechowski-Jozwiak
Abstract:
Tabular machine learning presents a paradox: modern models achieve state‑of‑the‑art performance using high‑dimensional (high‑D), collinear, error‑prone data, defying the "Garbage In, Garbage Out" mantra. To help resolve this, we synthesize principles from Information Theory, Latent Factor Models, and Psychometrics, clarifying that predictive robustness arises not solely from data cleanliness, but from the synergy between data architecture and model capacity. Partitioning predictor‑space "noise" into "Predictor Error" and "Structural Uncertainty" (informational deficits from stochastic generative mappings), we prove that leveraging high‑D sets of error‑prone predictors asymptotically overcomes both types of noise, whereas cleaning a low‑D set is fundamentally bounded by Structural Uncertainty. We demonstrate why "Informative Collinearity" (dependencies from shared latent causes) enhances reliability and convergence efficiency, and explain why increased dimensionality reduces the latent inference burden, enabling feasibility with finite samples. To address practical constraints, we propose "Proactive Data‑Centric AI" to identify predictors that enable robustness efficiently. We also derive boundaries for Systematic Error Regimes and show why models that absorb "rogue" dependencies can mitigate assumption violations. Linking latent architecture to Benign Overfitting, we offer a first step towards a unified view of robustness to Outcome Error and predictor‑space noise, while also delineating when traditional DCAI's focus on label cleaning remains powerful. By redefining data quality from item‑level perfection to portfolio‑level architecture, we provide a theoretical rationale for "Local Factories" ‑‑ learning from live, uncurated enterprise "data swamps" ‑‑ supporting a deployment paradigm shift from "Model Transfer" to "Methodology Transfer'' to overcome static generalizability limitations.
Authors:Mateusz Pach, Jessica Bader, Quentin Bouniot, Serge Belongie, Zeynep Akata
Abstract:
Text‑to‑image generation models have advanced rapidly, yet achieving fine‑grained control over generated images remains difficult, largely due to limited understanding of how semantic information is encoded. We develop an interpretation of the color representation in the Variational Autoencoder latent space of FLUX.1 [Dev], revealing a structure reflecting Hue, Saturation, and Lightness. We verify our Latent Color Subspace (LCS) interpretation by demonstrating that it can both predict and explicitly control color, introducing a fully training‑free method in FLUX based solely on closed‑form latent‑space manipulation. Code is available at https://github.com/ExplainableML/LCS.
Authors:Yulu Gan, Phillip Isola
Abstract:
Pretraining produces a learned parameter vector that is typically treated as a starting point for further iterative adaptation. In this work, we instead view the outcome of pretraining as a distribution over parameter vectors, whose support already contains task‑specific experts. We show that in small models such expert solutions occupy a negligible fraction of the volume of this distribution, making their discovery reliant on structured optimization methods such as gradient descent. In contrast, in large, well‑pretrained models the density of task‑experts increases dramatically, so that diverse, task‑improving specialists populate a substantial fraction of the neighborhood around the pretrained weights. Motivated by this perspective, we explore a simple, fully parallel post‑training method that samples N parameter perturbations at random, selects the top K, and ensembles predictions via majority vote. Despite its simplicity, this approach is competitive with standard post‑training methods such as PPO, GRPO, and ES for contemporary large‑scale models.
Authors:Priyanka Kargupta, Shuhaib Mehri, Dilek Hakkani-Tur, Jiawei Han
Abstract:
Despite interdisciplinary research leading to larger and longer‑term impact, most work remains confined to single‑domain academic silos. Recent AI‑based approaches to scientific discovery show promise for interdisciplinary research, but many prioritize rapidly designing experiments and solutions, bypassing the exploratory, collaborative reasoning processes that drive creative interdisciplinary breakthroughs. As a result, prior efforts largely prioritize automating scientific discovery rather than augmenting the reasoning processes that underlie scientific disruption. We present Idea‑Catalyst, a novel framework that systematically identifies interdisciplinary insights to support creative reasoning in both humans and large language models. Starting from an abstract research goal, Idea‑Catalyst is designed to assist the brainstorming stage, explicitly avoiding premature anchoring on specific solutions. The framework embodies key metacognitive features of interdisciplinary reasoning: (a) defining and assessing research goals, (b) awareness of a domain's opportunities and unresolved challenges, and (c) strategic exploration of interdisciplinary ideas based on impact potential. Concretely, Idea‑Catalyst decomposes an abstract goal (e.g., improving human‑AI collaboration) into core target‑domain research questions that guide the analysis of progress and open challenges within that domain. These challenges are reformulated as domain‑agnostic conceptual problems, enabling retrieval from external disciplines (e.g., Psychology, Sociology) that address analogous issues. By synthesizing and recontextualizing insights from these domains back into the target domain, Idea‑Catalyst ranks source domains by their interdisciplinary potential. Empirically, this targeted integration improves average novelty by 21% and insightfulness by 16%, while remaining grounded in the original research problem.
Authors:Zexuan Yan, Jiarui Jin, Yue Ma, Shijian Wang, Jiahui Hu, Wenxiang Jiao, Yuan Lu, Linfeng Zhang
Abstract:
Despite recent advances in generative models driving significant progress in text rendering, accurately generating complex text and mathematical formulas remains a formidable challenge. This difficulty primarily stems from the limited instruction‑following capabilities of current models when encountering out‑of‑distribution prompts. To address this, we introduce GlyphBanana, alongside a corresponding benchmark specifically designed for rendering complex characters and formulas. GlyphBanana employs an agentic workflow that integrates auxiliary tools to inject glyph templates into both the latent space and attention maps, facilitating the iterative refinement of generated images. Notably, our training‑free approach can be seamlessly applied to various Text‑to‑Image (T2I) models, achieving superior precision compared to existing baselines. Extensive experiments demonstrate the effectiveness of our proposed workflow. Associated code is publicly available at https://github.com/yuriYanZeXuan/GlyphBanana.
Authors:William Brach, Tomas Bedej, Jacob Nielsen, Jacob Pichna, Juraj Bedej, Eemeli Saarensilta, Julie Dupouy, Gianluca Barmina, Andrea Blasi Núñez, Peter Schneider-Kamp, Kristian Košťál, Michal Ries, Lukas Galke Poech
Abstract:
With the rapid advances of large language models, it becomes increasingly important to systematically evaluate their multilingual and multicultural capabilities. Previous cultural evaluation benchmarks focus mainly on basic cultural knowledge that can be encoded in linguistic form. Here, we propose SommBench, a multilingual benchmark to assess sommelier expertise, a domain deeply grounded in the senses of smell and taste. While language models learn about sensory properties exclusively through textual descriptions, SommBench tests whether this textual grounding is sufficient to emulate expert‑level sensory judgment. SommBench comprises three main tasks: Wine Theory Question Answering (WTQA), Wine Feature Completion (WFC), and Food‑Wine Pairing (FWP). SommBench is available in multiple languages: English, Slovak, Swedish, Finnish, German, Danish, Italian, and Spanish. This helps separate a language model's wine expertise from its language skills. The benchmark datasets were developed in close collaboration with a professional sommelier and native speakers of the respective languages, resulting in 1,024 wine theory question‑answering questions, 1,000 wine feature‑completion examples, and 1,000 food‑wine pairing examples. We provide results for the most popular language models, including closed‑weights models such as Gemini 2.5, and open‑weights models, such as GPT‑OSS and Qwen 3. Our results show that the most capable models perform well on wine theory question answering (up to 97% correct with a closed‑weights model), yet feature completion (peaking at 65%) and food‑wine pairing show (MCC ranging between 0 and 0.39) turn out to be more challenging. These results position SommBench as an interesting and challenging benchmark for evaluating the sommelier expertise of language models. The benchmark is publicly available at https://github.com/sommify/sommbench.
Authors:Zhaoyang Jiang, Zhizhong Fu, David McAllister, Yunsoo Kim, Honghan Wu
Abstract:
Longitudinal brain MRI is essential for characterizing the progression of neurological diseases such as Alzheimer's disease assessment. However, current deep‑learning tools fragment this process: classifiers reduce a scan to a label, volumetric pipelines produce uninterpreted measurements, and vision‑language models (VLMs) may generate fluent but potentially hallucinated conclusions. We present LoV3D, a pipeline for training 3D vision‑language models, which reads longitudinal T1‑weighted brain MRI, produces a region‑level anatomical assessment, conducts longitudinal comparison with the prior scan, and finally outputs a three‑class diagnosis (Cognitively Normal, Mild Cognitive Impairment, or Dementia) along with a synthesized diagnostic summary. The stepped pipeline grounds the final diagnosis by enforcing label consistency, longitudinal coherence, and biological plausibility, thereby reducing the risks of hallucinations. The training process introduces a clinically‑weighted Verifier that scores candidate outputs automatically against normative references derived from standardized volume metrics, driving Direct Preference Optimization without a single human annotation. On a subject‑level held‑out ADNI test set (479 scans, 258 subjects), LoV3D achieves 93.7% three‑class diagnostic accuracy (+34.8% over the no‑grounding baseline), 97.2% on two‑class diagnosis accuracy (+4% over the SOTA) and 82.6% region‑level anatomical classification accuracy (+33.1% over VLM baselines). Zero‑shot transfer yields 95.4% on MIRIAD (100% Dementia recall) and 82.9% three‑class accuracy on AIBL, confirming high generalizability across sites, scanners, and populations. Code is available at https://github.com/Anonymous‑TEVC/LoV‑3D.
Authors:Ping Guo, Tiantian Zhang, Xi Lin, Xiang Li, Zhi-Ri Tang, Qingfu Zhang
Abstract:
Personalized Federated Learning (PFL) aims to train customized models for clients with highly heterogeneous data distributions while preserving data privacy. Existing approaches often rely on heuristics like clustering or model interpolation, which lack principled mechanisms for balancing heterogeneous client objectives. Serving M clients with distinct data distributions is inherently a multi‑objective optimization problem, where achieving optimal personalization ideally requires M distinct models on the Pareto front. However, maintaining M separate models poses significant scalability challenges in federated settings with hundreds or thousands of clients. To address this challenge, we reformulate PFL as a few‑for‑many optimization problem that maintains only K shared server models (K \ll M) to collectively serve all M clients. We prove that this framework achieves near‑optimal personalization: the approximation error diminishes as K increases and each client's model converges to each client's optimum as data grows. Building on this reformulation, we propose FedFew, a practical algorithm that jointly optimizes the K server models through efficient gradient‑based updates. Unlike clustering‑based approaches that require manual client partitioning or interpolation‑based methods that demand careful hyperparameter tuning, FedFew automatically discovers the optimal model diversity through its optimization process. Experiments across vision, NLP, and real‑world medical imaging datasets demonstrate that FedFew, with just 3 models, consistently outperforms other state‑of‑the‑art approaches. Code is available at https://github.com/pgg3/FedFew.
Authors:Ilias Aarab
Abstract:
Zero‑shot text classification (ZSC) offers the promise of eliminating costly task‑specific annotation by matching texts directly to human‑readable label descriptions. While early approaches have predominantly relied on cross‑encoder models fine‑tuned for natural language inference (NLI), recent advances in text‑embedding models, rerankers, and instruction‑tuned large language models (LLMs) have challenged the dominance of NLI‑based architectures. Yet, systematically comparing these diverse approaches remains difficult. Existing evaluations, such as MTEB, often incorporate labeled examples through supervised probes or fine‑tuning, leaving genuine zero‑shot capabilities underexplored. To address this, we introduce BTZSC, a comprehensive benchmark of 22 public datasets spanning sentiment, topic, intent, and emotion classification, capturing diverse domains, class cardinalities, and document lengths. Leveraging BTZSC, we conduct a systematic comparison across four major model families, NLI cross‑encoders, embedding models, rerankers and instruction‑tuned LLMs, encompassing 38 public and custom checkpoints. Our results show that: (i) modern rerankers, exemplified by Qwen3‑Reranker‑8B, set a new state‑of‑the‑art with macro F1 = 0.72; (ii) strong embedding models such as GTE‑large‑en‑v1.5 substantially close the accuracy gap while offering the best trade‑off between accuracy and latency; (iii) instruction‑tuned LLMs at 4‑‑12B parameters achieve competitive performance (macro F1 up to 0.67), excelling particularly on topic classification but trailing specialized rerankers; (iv) NLI cross‑encoders plateau even as backbone size increases; and (v) scaling primarily benefits rerankers and LLMs over embedding models. BTZSC and accompanying evaluation code are publicly released to support fair and reproducible progress in zero‑shot text understanding.
Authors:Xingze Zou, Jing Wang, Yuhua Zheng, Xueyi Chen, Haolei Bai, Lingcheng Kong, Syed A. R. Abu-Bakar, Zhaode Wang, Chengfei Lv, Haoji Hu, Huan Wang
Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities in code generation, yet their potential for generating kernels specifically for mobile devices remains largely unexplored. In this work, we extend the scope of automated kernel generation to the mobile domain to investigate the central question: Can LLMs write efficient kernels for mobile devices? To enable systematic investigation, we introduce MobileKernelBench, a comprehensive evaluation framework comprising a benchmark prioritizing operator diversity and cross‑framework interoperability, coupled with an automated pipeline that bridges the host‑device gap for on‑device verification. Leveraging this framework, we conduct extensive evaluation on the CPU backend of Mobile Neural Network (MNN), revealing that current LLMs struggle with the engineering complexity and data scarcity inherent to mobile frameworks; standard models and even fine‑tuned variants exhibit high compilation failure rates (over 54%) and negligible performance gains due to hallucinations and a lack of domain‑specific grounding. To overcome these limitations, we propose the Mobile Kernel Agent (MoKA), a multi‑agent system equipped with repository‑aware reasoning and a plan‑and‑execute paradigm. Validated on MobileKernelBench, MoKA achieves state‑of‑the‑art performance, boosting compilation success to 93.7% and enabling 27.4% of generated kernels to deliver measurable speedups over native libraries.
Authors:Lu Wang, Zhuoran Jin, Yupu Hao, Yubo Chen, Kang Liu, Yulong Ao, Jun Zhao
Abstract:
Multimodal large language models (MLLMs) have shown strong performance on offline video understanding, but most are limited to offline inference or have weak online reasoning, making multi‑turn interaction over continuously arriving video streams difficult. Existing streaming methods typically use an interleaved perception‑generation paradigm, which prevents concurrent perception and generation and leads to early memory decay as streams grow, hurting long‑range dependency modeling. We propose Think While Watching, a memory‑anchored streaming video reasoning framework that preserves continuous segment‑level memory during multi‑turn interaction. We build a three‑stage, multi‑round chain‑of‑thought dataset and adopt a stage‑matched training strategy, while enforcing strict causality through a segment‑level streaming causal mask and streaming positional encoding. During inference, we introduce an efficient pipeline that overlaps watching and thinking and adaptively selects the best attention backend. Under both single‑round and multi‑round streaming input protocols, our method achieves strong results. Built on Qwen3‑VL, it improves single‑round accuracy by 2.6% on StreamingBench and by 3.79% on OVO‑Bench. In the multi‑round setting, it maintains performance while reducing output tokens by 56%. Code is available at: https://github.com/wl666hhh/Think_While_Watching/
Authors:Omar Coser
Abstract:
Translating single‑cell RNA sequencing (scRNA‑seq) data into mechanistic biological hypotheses remains a critical bottleneck, as agentic AI systems lack direct access to transcriptomic representations while expression foundation models remain opaque to natural language. Here we introduce ELISA (Embedding‑Linked Interactive Single‑cell Agent), an interpretable framework that unifies scGPT expression embeddings with BioBERT‑based semantic retrieval and LLM‑mediated interpretation for interactive single‑cell discovery. An automatic query classifier routes inputs to gene marker scoring, semantic matching, or reciprocal rank fusion pipelines depending on whether the query is a gene signature, natural language concept, or mixture of both. Integrated analytical modules perform pathway activity scoringacross 60+ gene sets, ligand‑‑receptor interaction prediction using 280+ curated pairs, condition‑aware comparative analysis, and cell‑type proportion estimation all operating directly on embedded data without access to the original count matrix. Benchmarked across six diverse scRNA‑seq datasets spanning inflammatory lung disease, pediatric and adult cancers, organoid models, healthy tissue, and neurodevelopment, ELISA significantly outperforms CellWhisperer in cell type retrieval (combined permutation test, p < 0.001), with particularly large gains on gene‑signature queries (Cohen's d = 5.98 for MRR). ELISA replicates published biological findings (mean composite score 0.90) with near‑perfect pathway alignment and theme coverage (0.98 each), and generates candidate hypotheses through grounded LLM reasoning, bridging the gap between transcriptomic data exploration and biological discovery. Code available at: https://github.com/omaruno/ELISA‑An‑AI‑Agent‑for‑Expression‑Grounded‑Discovery‑in‑Single‑Cell‑Genomics.git (If you use ELISA in your research, please cite this work).
Authors:Zhiwei Zhang, Xinyi Du, Weihao Wang, Xuanchi Guo, Wenjuan Han
Abstract:
Traffic forecasting is a cornerstone of intelligent transportation systems. While existing research has made significant progress in short‑term prediction, long‑term forecasting remains a largely uncharted and challenging frontier. Extending the prediction horizon intensifies two critical issues: escalating computational resource consumption and increasingly complex spatial‑temporal dependencies. Current approaches, which rely on spatial‑temporal graphs and process temporal and spatial dimensions separately, suffer from snapshot‑stacking inflation and cross‑step fragmentation. To overcome these limitations, we propose VisiFold. Our framework introduces a novel temporal folding graph that consolidates a sequence of temporal snapshots into a single graph. Furthermore, we present a node visibility mechanism that incorporates node‑level masking and subgraph sampling to overcome the computational bottleneck imposed by large node counts. Extensive experiments show that VisiFold not only drastically reduces resource consumption but also outperforms existing baselines in long‑term forecasting tasks. Remarkably, even with a high mask ratio of 80%, VisiFold maintains its performance advantage. By effectively breaking the resource constraints in both temporal and spatial dimensions, our work paves the way for more realistic long‑term traffic forecasting. The code is available at~ https://github.com/PlanckChang/VisiFold.
Authors:Xianjing Han, Bin Zhu, Shiqi Hu, Franklin Mingzhe Li, Patrick Carrington, Roger Zimmermann, Jingjing Chen
Abstract:
Text‑to‑video (T2V) generation models have made rapid progress in producing visually high‑quality and temporally coherent videos. However, existing benchmarks primarily focus on perceptual quality, text‑video alignment, or physical plausibility, leaving a critical aspect of action understanding largely unexplored: object state change (OSC) explicitly specified in the text prompt. OSC refers to the transformation of an object's state induced by an action, such as peeling a potato or slicing a lemon. In this paper, we introduce OSCBench, a benchmark specifically designed to assess OSC performance in T2V models. OSCBench is constructed from instructional cooking data and systematically organizes action‑object interactions into regular, novel, and compositional scenarios to probe both in‑distribution performance and generalization. We evaluate six representative open‑source and proprietary T2V models using both human user study and multimodal large language model (MLLM)‑based automatic evaluation. Our results show that, despite strong performance on semantic and scene alignment, current T2V models consistently struggle with accurate and temporally consistent object state changes, especially in novel and compositional settings. These findings position OSC as a key bottleneck in text‑to‑video generation and establish OSCBench as a diagnostic benchmark for advancing state‑aware video generation models.
Authors:Alexander Mironenko, Evgeny. Burnaev, Serguei Barannikov
Abstract:
Topological Data Analysis (TDA) provides powerful tools to explore the shape and structure of data through topological features such as clusters, loops, and voids. Persistence diagrams are a cornerstone of TDA, capturing the evolution of these features across scales. While effective for analyzing individual manifolds, persistence diagrams do not account for interactions between pairs of them. Cross‑persistence diagrams (cross‑barcodes), introduced recently, address this limitation by characterizing relationships between topological features of two point clouds. In this work, we present the first systematic study of the density of cross‑persistence diagrams. We prove its existence, establish theoretical foundations for its statistical use, and design the first machine learning framework for predicting cross‑persistence density directly from point cloud coordinates and distance matrices. Our statistical approach enables the distinction of point clouds sampled from different manifolds by leveraging the linear characteristics of cross‑persistence diagrams. Interestingly, we find that introducing noise can enhance our ability to distinguish point clouds, uncovering its novel utility in TDA applications. We demonstrate the effectiveness of our methods through experiments on diverse datasets, where our approach consistently outperforms existing techniques in density prediction and achieves superior results in point cloud distinction tasks. Our findings contribute to a broader understanding of cross‑persistence diagrams and open new avenues for their application in data analysis, including potential insights into time‑series domain tasks and the geometry of AI‑generated texts. Our code is publicly available at https://github.com/Verdangeta/TDA_experiments
Authors:Hyung-Seok Oh, Deok-Hyeon Cho, Seung-Bin Kim, Seong-Whan Lee
Abstract:
Neural vocoders have recently advanced waveform generation, yielding natural and expressive audio. Among these approaches, iSTFT‑based vocoders have recently gained attention. They predict a complex‑valued spectrogram and then synthesize the waveform via iSTFT, thereby avoiding learned upsampling stages that can increase computational cost. However, current approaches use real‑valued networks that process the real and imaginary parts independently. This separation limits their ability to capture the inherent structure of complex spectrograms. We present ComVo, a Complex‑valued neural Vocoder whose generator and discriminator use native complex arithmetic. This enables an adversarial training framework that provides structured feedback in complex‑valued representations. To guide phase transformations in a structured manner, we introduce phase quantization, which discretizes phase values and regularizes the training process. Finally, we propose a block‑matrix computation scheme to improve training efficiency by reducing redundant operations. Experiments demonstrate that ComVo achieves higher synthesis quality than comparable real‑valued baselines, and that its block‑matrix scheme reduces training time by 25%. Audio samples and code are available at https://hs‑oh‑prml.github.io/ComVo/.
Authors:Md Jahidul Islam
Abstract:
The adaptation of large‑scale Vision‑Language Models (VLMs) like CLIP to downstream tasks with extremely limited data ‑‑ specifically in the one‑shot regime ‑‑ is often hindered by a significant "Stability‑Plasticity" dilemma. While efficient caching mechanisms have been introduced by training‑free methods such as Tip‑Adapter, these approaches often function as local Nadaraya‑Watson estimators. Such estimators are characterized by inherent boundary bias and a lack of global structural regularization. In this paper, ReHARK (Refined Hybrid Adaptive RBF Kernels) is proposed as a synergistic training‑free framework that reinterprets few‑shot adaptation through global proximal regularization in a Reproducing Kernel Hilbert Space (RKHS). A multistage refinement pipeline is introduced, consisting of: (1) Hybrid Prior Construction, where zero‑shot textual knowledge from CLIP and GPT‑3 is fused with visual class prototypes to form a robust semantic‑visual anchor; (2) Support Set Augmentation (Bridging), where intermediate samples are generated to smooth the transition between visual and textual modalities; (3) Adaptive Distribution Rectification, where test feature statistics are aligned with the augmented support set to mitigate domain shifts; and (4) Multi‑Scale RBF Kernels, where an ensemble of kernels is employed to capture complex feature geometries across diverse scales. Superior stability and accuracy are demonstrated through extensive experiments on 11 diverse benchmarks. A new state‑of‑the‑art for one‑shot adaptation is established by ReHARK, which achieves an average accuracy of 65.83%, significantly outperforming existing baselines. Code is available at https://github.com/Jahid12012021/ReHARK.
Authors:Xiaogang Du, Jiawei Zhang, Tongfei Liu, Tao Lei, Yingbo Wang
Abstract:
In medical image segmentation tasks, the domain gap caused by the difference in data collection between training and testing data seriously hinders the deployment of pre‑trained models in clinical practice. Continual Test‑Time Adaptation (CTTA) aims to enable pre‑trained models to adapt to continuously changing unlabeled domains, providing an effective approach to solving this problem. However, existing CTTA methods often rely on unreliable supervisory signals, igniting a self‑reinforcing cycle of error accumulation that culminates in catastrophic performance degradation. To overcome these challenges, we propose a CTTA via Semantic‑Prompt‑Enhanced Graph Clustering (SPEGC) for medical image segmentation. First, we design a semantic prompt feature enhancement mechanism that utilizes decoupled commonality and heterogeneity prompt pools to inject global contextual information into local features, alleviating their susceptibility to noise interference under domain shift. Second, based on these enhanced features, we design a differentiable graph clustering solver. This solver reframes global edge sparsification as an optimal transport problem, allowing it to distill a raw similarity matrix into a refined and high‑order structural representation in an end‑to‑end manner. Finally, this robust structural representation is used to guide model adaptation, ensuring predictions are consistent at a cluster‑level and dynamically adjusting decision boundaries. Extensive experiments demonstrate that SPEGC outperforms other state‑of‑the‑art CTTA methods on two medical image segmentation benchmarks. The source code is available at https://github.com/Jwei‑Z/SPEGC‑for‑MIS.
Authors:Yuxiang Liu, Qiao Liu, Tong Luo, Yanglei Gan, Peng He, Yao LIu
Abstract:
Predicting irregularly spaced event sequences with discrete marks poses significant challenges due to the complex, asynchronous dependencies embedded within continuous‑time data streams.Existing sequential approaches capture dependencies among event tokens but ignore the continuous evolution between events, while Neural Ordinary Differential Equation (Neural ODE) methods model smooth dynamics yet fail to account for how event types influence future timing.To overcome these limitations, we propose NEXTPP, a dual‑channel framework that unifies discrete and continuous representations via Event‑granular Neural Evolution with Cross‑Interaction for Marked Temporal Point Processes. Specifically, NEXTPP encodes discrete event marks via a self‑attention mechanism, simultaneously evolving a latent continuous‑time state using a Neural ODE. These parallel streams are then fused through a crossattention module to enable explicit bidirectional interaction between continuous and discrete representations. The fused representations drive the conditional intensity function of the neural Hawkes process, while an iterative thinning sampler is employed to generate future events. Extensive evaluations on five real‑world datasets demonstrate that NEXTPP consistently outperforms state‑of‑the‑art models. The source code can be found at https://github.com/AONE‑NLP/NEXTPP.
Authors:Riccardo Campi, Nicolò Oreste Pinciroli Vago, Mathyas Giudici, Marco Brambilla, Piero Fraternali
Abstract:
Retrieval‑Augmented Generation (RAG) over Knowledge Graphs (KGs) suffers from the fact that indexing approaches may lose important contextual nuance when text is reduced to triples, thereby degrading performance in downstream Question‑Answering (QA) tasks, particularly for multi‑hop QA, which requires composing answers from multiple entities, facts, or relations. We propose a domain‑agnostic, KG‑based QA framework that covers both the indexing and retrieval/inference phases. A new indexing approach called Map‑Disambiguate‑Enrich‑Reduce (MDER) generates context‑derived triple descriptions and subsequently integrates them with entity‑level summaries, thus avoiding the need for explicit traversal of edges in the graph during the QA retrieval phase. Complementing this, we introduce Decompose‑Resolve (DR), a retrieval mechanism that decomposes user queries into resolvable triples and grounds them in the KG via iterative reasoning. Together, MDER and DR form an LLM‑driven QA pipeline that is robust to sparse, incomplete, and complex relational data. Experiments show that on standard and domain specific benchmarks, MDER‑DR achieves substantial improvements over standard RAG baselines (up to 66%), while maintaining cross‑lingual robustness. Our code is available at https://github.com/DataSciencePolimi/MDER‑DR_RAG.
Authors:Zeyuan Guo, Enmao Diao, Cheng Yang, Chuan Shi
Abstract:
The success of large pretrained Transformers is closely tied to tokenizers, which convert raw input into discrete symbols. Extending these models to graph‑structured data remains a significant challenge. In this work, we introduce a graph tokenization framework that generates sequential representations of graphs by combining reversible graph serialization, which preserves graph information, with Byte Pair Encoding (BPE), a widely adopted tokenizer in large language models (LLMs). To better capture structural information, the graph serialization process is guided by global statistics of graph substructures, ensuring that frequently occurring substructures appear more often in the sequence and can be merged by BPE into meaningful tokens. Empirical results demonstrate that the proposed tokenizer enables Transformers such as BERT to be directly applied to graph benchmarks without architectural modifications. The proposed approach achieves state‑of‑the‑art results on 14 benchmark datasets and frequently outperforms both graph neural networks and specialized graph transformers. This work bridges the gap between graph‑structured data and the ecosystem of sequence models. Our code is available at \hrefhttps://github.com/BUPT‑GAMMA/Graph‑Tokenization‑for‑Bridging‑Graphs‑and‑Transformers\colorbluehere.
Authors:Chandler Smith, Magnus Sesodia, Friedrich Lindenberg, Christian Schroeder de Witt
Abstract:
We release OpenSanctions Pairs, a large‑scale entity matching benchmark derived from real‑world international sanctions aggregation and analyst deduplication. The dataset contains 755,540 labeled pairs spanning 293 heterogeneous sources across 31 countries, with multilingual and cross‑script names, noisy and missing attributes, and set‑valued fields typical of compliance workflows. We benchmark a production rule‑based matcher (nomenklatura RegressionV1 algorithm) against open‑ and closed‑source LLMs in zero‑ and few‑shot settings. Off‑the‑shelf LLMs substantially outperform the production rule‑based baseline (91.33% F1), reaching up to 98.95% F1 (GPT‑4o) and 98.23% F1 with a locally deployable open model (DeepSeek‑R1‑Distill‑Qwen‑14B). DSPy MIPROv2 prompt optimization yields consistent but modest gains, while adding in‑context examples provides little additional benefit and can degrade performance. Error analysis shows complementary failure modes: the rule‑based system over‑matches (high false positives), whereas LLMs primarily fail on cross‑script transliteration and minor identifier/date inconsistencies. These results indicate that pairwise matching performance is approaching a practical ceiling in this setting, and motivate shifting effort toward pipeline components such as blocking, clustering, and uncertainty‑aware review. Code available at https://github.com/chansmi/OSINT_entity_resolution
Authors:Susung Hong, Brian Curless, Ira Kemelmacher-Shlizerman, Steve Seitz
Abstract:
We propose a fully automated AI system that produces short comedic videos similar to sketch shows such as Saturday Night Live. Starting with character references, the system employs a population of agents loosely based on real production studio roles, structured to optimize the quality and diversity of ideas and outputs through iterative competition, evaluation, and improvement. A key contribution is the introduction of LLM critics aligned with real viewer preferences through the analysis of a corpus of comedy videos on YouTube to automatically evaluate humor. Our experiments show that our framework produces results approaching the quality of professionally produced sketches while demonstrating state‑of‑the‑art performance in video generation.
Authors:Jen-Hao Rick Chang, Xiaoming Zhao, Dorian Chan, Oncel Tuzel
Abstract:
We propose a 3D latent representation that jointly models object geometry and view‑dependent appearance. Most prior works focus on either reconstructing 3D geometry or predicting view‑independent diffuse appearance, and thus struggle to capture realistic view‑dependent effects. Our approach leverages that RGB‑depth images provide samples of a surface light field. By encoding random subsamples of this surface light field into a compact set of latent vectors, our model learns to represent both geometry and appearance within a unified 3D latent space. This representation reproduces view‑dependent effects such as specular highlights and Fresnel reflections under complex lighting. We further train a latent flow matching model on this representation to learn its distribution conditioned on a single input image, enabling the generation of 3D objects with appearances consistent with the lighting and materials in the input. Experiments show that our approach achieves higher visual quality and better input fidelity than existing methods.
Authors:Tao Zhong, Yixun Hu, Dongzhe Zheng, Aditya Sood, Christine Allen-Blanchette
Abstract:
We propose Neural Field Thermal Tomography (NeFTY), a differentiable physics framework for the quantitative 3D reconstruction of material properties from transient surface temperature measurements. While traditional thermography relies on pixel‑wise 1D approximations that neglect lateral diffusion, and soft‑constrained Physics‑Informed Neural Networks (PINNs) often fail in transient diffusion scenarios due to gradient stiffness, NeFTY parameterizes the 3D diffusivity field as a continuous neural field optimized through a rigorous numerical solver. By leveraging a differentiable physics solver, our approach enforces thermodynamic laws as hard constraints while maintaining the memory efficiency required for high‑resolution 3D tomography. Our discretize‑then‑optimize paradigm effectively mitigates the spectral bias and ill‑posedness inherent in inverse heat conduction, enabling the recovery of subsurface defects at arbitrary scales. Experimental validation on synthetic data demonstrates that NeFTY significantly improves the accuracy of subsurface defect localization over baselines. Additional details at https://cab‑lab‑princeton.github.io/nefty/
Authors:Yan-Bo Lin, Jonah Casebeer, Long Mai, Aniruddha Mahapatra, Gedas Bertasius, Nicholas J. Bryan
Abstract:
Generating music that temporally aligns with video events is challenging for existing text‑to‑music models, which lack fine‑grained temporal control. We introduce V2M‑Zero, a zero‑pair video‑to‑music generation approach that outputs time‑aligned music for video. Our method is motivated by a key observation: temporal synchronization requires matching when and how much change occurs, not what changes. While musical and visual events differ semantically, they exhibit shared temporal structure that can be captured independently within each modality. We capture this structure through event curves computed from intra‑modal similarity using pretrained music and video encoders. By measuring temporal change within each modality independently, these curves provide comparable representations across modalities. This enables a simple training strategy: fine‑tune a text‑to‑music model on music‑event curves, then substitute video‑event curves at inference without cross‑modal training or paired data. Across OES‑Pub, MovieGenBench‑Music, and AIST++, V2M‑Zero achieves substantial gains over paired‑data baselines: 5‑21% higher audio quality, 13‑15% better semantic alignment, 21‑52% improved temporal synchronization, and 28% higher beat alignment on dance videos. We find similar results via a large crowd‑source subjective listening test. Overall, our results validate that temporal alignment through within‑modality features, rather than paired cross‑modal supervision, is effective for video‑to‑music generation. Results are available at https://genjib.github.io/v2m_zero/
Authors:Zegu Zhang, Jian Zhang
Abstract:
Variational autoencoders (VAEs) frequently suffer from posterior collapse, where latent variables become uninformative and the approximate posterior degenerates to the prior. Recent work has characterized this phenomenon as a phase transition governed by the spectral properties of the data covariance matrix. In this paper, we propose a fundamentally different approach: instead of avoiding collapse through architectural constraints or hyperparameter tuning, we eliminate the possibility of collapse altogether by leveraging the multiplicity of Gaussian mixture model (GMM) clusterings. We introduce Historical Consensus Training, an iterative selection procedure that progressively refines a set of candidate GMM priors through alternating optimization and selection. The key insight is that models trained to satisfy multiple distinct clustering constraints develop a historical barrier ‑‑ a region in parameter space that remains stable even when subsequently trained with a single objective. We prove that this barrier excludes the collapsed solution, and demonstrate through extensive experiments on synthetic and real‑world datasets that our method achieves non‑collapsed representations regardless of decoder variance or regularization strength. Our approach requires no explicit stability conditions (e.g., σ^\prime 2 < λ_\max) and works with arbitrary neural architectures. The code is available at https://github.com/tsegoochang/historical‑consensus‑vae.
Authors:Jinwoo Ahn, Ingyu Seong, Akhil Kedia, Junhan Kim, Hyemi Jang, Kangwook Lee, Yongkweon Jeon
Abstract:
Transformer‑based large language models (LLMs) rely on key‑value (KV) caching to avoid redundant computation during autoregressive inference. While this mechanism greatly improves efficiency, the cache size grows linearly with the input sequence length, quickly becoming a bottleneck for long‑context tasks. Existing solutions mitigate this problem by evicting prompt KV that are deemed unimportant, guided by estimated importance scores. Notably, a recent line of work proposes to improve eviction quality by "glimpsing into the future", in which a draft generator produces a surrogate future response approximating the target model's true response, and this surrogate is subsequently used to estimate the importance of cached KV more accurately. However, these approaches rely on computationally expensive draft generation, which introduces substantial prefilling overhead and limits their practicality in real‑world deployment. To address this challenge, we propose LookaheadKV, a lightweight eviction framework that leverages the strength of surrogate future response without requiring explicit draft generation. LookaheadKV augments transformer layers with parameter‑efficient modules trained to predict true importance scores with high accuracy. Our design ensures negligible runtime overhead comparable to existing inexpensive heuristics, while achieving accuracy superior to more costly approximation methods. Extensive experiments on long‑context understanding benchmarks, across a wide range of models, demonstrate that our method not only outperforms recent competitive baselines in various long‑context understanding tasks, but also reduces the eviction cost by up to 14.5x, leading to significantly faster time‑to‑first‑token. Our code is available at https://github.com/SamsungLabs/LookaheadKV.
Authors:Xinran Xu, Xiuyi Fan
Abstract:
Accurate estimation of uncertainty in deep learning is critical for deploying models in high‑stakes domains such as medical diagnosis and autonomous decision‑making, where overconfident predictions can lead to harmful outcomes. In practice, understanding the reason behind a model's uncertainty and the type of uncertainty it represents can support risk‑aware decisions, enhance user trust, and guide additional data collection. However, many existing methods only address a single type of uncertainty or require modifications and retraining of the base model, making them difficult to adopt in real‑world systems. We introduce CUPID (Comprehensive Uncertainty Plug‑in estImation moDel), a general‑purpose module that jointly estimates aleatoric and epistemic uncertainty without modifying or retraining the base model. CUPID can be flexibly inserted into any layer of a pretrained network. It models aleatoric uncertainty through a learned Bayesian identity mapping and captures epistemic uncertainty by analyzing the model's internal responses to structured perturbations. We evaluate CUPID across a range of tasks, including classification, regression, and out‑of‑distribution detection. The results show that it consistently delivers competitive performance while offering layer‑wise insights into the origins of uncertainty. By making uncertainty estimation modular, interpretable, and model‑agnostic, CUPID supports more transparent and trustworthy AI. Related code and data are available at https://github.com/a‑Fomalhaut‑a/CUPID.
Authors:Yu Zhang, Zhicheng Zhao, Ze Luo, Chenglong Li, Jin Tang
Abstract:
Traffic scene understanding from unmanned aerial vehicle (UAV) platforms is crucial for intelligent transportation systems due to its flexible deployment and wide‑area monitoring capabilities. However, existing methods face significant challenges in real‑world surveillance, as their heavy reliance on optical imagery leads to severe performance degradation under adverse illumination conditions like nighttime and fog. Furthermore, current Visual Question Answering (VQA) models are restricted to elementary perception tasks, lacking the domain‑specific regulatory knowledge required to assess complex traffic behaviors. To address these limitations, we propose a novel Multi‑modal Traffic Cognition Network (MTCNet) for robust UAV traffic scene understanding. Specifically, we design a Prototype‑Guided Knowledge Embedding (PGKE) module that leverages high‑level semantic prototypes from an external Traffic Regulation Memory (TRM) to anchor domain‑specific knowledge into visual representations, enabling the model to comprehend complex behaviors and distinguish fine‑grained traffic violations. Moreover, we develop a Quality‑Aware Spectral Compensation (QASC) module that exploits the complementary characteristics of optical and thermal modalities to perform bidirectional context exchange, effectively compensating for degraded features to ensure robust representation in complex environments. In addition, we construct Traffic‑VQA, the first large‑scale optical‑thermal infrared benchmark for cognitive UAV traffic understanding, comprising 8,180 aligned image pairs and 1.3 million question‑answer pairs across 31 diverse types. Extensive experiments demonstrate that MTCNet significantly outperforms state‑of‑the‑art methods in both cognition and perception scenarios. The dataset is available at https://github.com/YuZhang‑2004/UAV‑traffic‑scene‑understanding.
Authors:Changyi Xiao, Caijun Xu, Yixin Cao
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective in enhancing the reasoning capabilities of large language models, particularly in domains such as mathematics where reliable rule‑based verifiers can be constructed. However, the reliance on handcrafted, domain‑specific verification rules substantially limits the applicability of RLVR to general reasoning domains with free‑form answers, where valid answers often exhibit significant variability, making it difficult to establish complete and accurate rules. To address this limitation, we propose Conditional Expectation Reward (CER), which leverages the large language model itself as an implicit verifier, and is therefore applicable to general domains and eliminates the need for external verifiers or auxiliary models. CER is defined as the expected likelihood of generating the reference answer conditioned on the generated answer. In contrast to rule‑based verifiers that yield binary feedback, CER provides a soft, graded reward signal that reflects varying degrees of correctness, making it better suited to tasks where answers vary in correctness. Experimental results demonstrate that CER is effective across a wide range of reasoning tasks, spanning both mathematical and general domains, indicating that CER serves as a flexible and general verification mechanism. The code is available at https://github.com/changyi7231/CER.
Authors:Hao Zhou, Lu Qi, Jason Li, Jie Zhang, Yi Liu, Xu Yang, Mingyu Fan, Fei Luo
Abstract:
Trajectory prediction is critical for autonomous driving, enabling safe and efficient planning in dense, dynamic traffic. Most existing methods optimize prediction accuracy under fixed‑length observations. However, real‑world driving often yields variable‑length, incomplete observations, posing a challenge to these methods. A common strategy is to directly map features from incomplete observations to those from complete ones. This one‑shot mapping, however, struggles to learn accurate representations for short trajectories due to significant information gaps. To address this issue, we propose a Progressive Retrospective Framework (PRF), which gradually aligns features from incomplete observations with those from complete ones via a cascade of retrospective units. Each unit consists of a Retrospective Distillation Module (RDM) and a Retrospective Prediction Module (RPM), where RDM distills features and RPM recovers previous timesteps using the distilled features. Moreover, we propose a Rolling‑Start Training Strategy (RSTS) that enhances data efficiency during PRF training. PRF is plug‑and‑play with existing methods. Extensive experiments on datasets Argoverse 2 and Argoverse 1 demonstrate the effectiveness of PRF. Code is available at https://github.com/zhouhao94/PRF.
Authors:Caroline Magg, Maaike A. ter Wee, Johannes G. G. Dobbe, Geert J. Streekstra, Leendert Blankevoort, Clara I. Sánchez, Hoel Kervadec
Abstract:
Promptable Foundation Models (FMs), initially introduced for natural image segmentation, have also revolutionized medical image segmentation. The increasing number of models, along with evaluations varying in datasets, metrics, and compared models, makes direct performance comparison between models difficult and complicates the selection of the most suitable model for specific clinical tasks. In our study, 11 promptable FMs are tested using non‑iterative 2D and 3D prompting strategies on a private and public dataset focusing on bone and implant segmentation in four anatomical regions (wrist, shoulder, hip and lower leg). The Pareto‑optimal models are identified and further analyzed using human prompts collected through a dedicated observer study. Our findings are: 1) The segmentation performance varies a lot between FMs and prompting strategies; 2) The Pareto‑optimal models in 2D are SAM and SAM2.1, in 3D nnInteractive and Med‑SAM2; 3) Localization accuracy and rater consistency vary with anatomical structures, with higher consistency for simple structures (wrist bones) and lower consistency for complex structures (pelvis, tibia, implants); 4) The segmentation performance drops using human prompts, suggesting that performance reported on "ideal" prompts extracted from reference labels might overestimate the performance in a human‑driven setting; 5) All models were sensitive to prompt variations. While two models demonstrated intra‑rater robustness, it did not scale to inter‑rater settings. We conclude that the selection of the most optimal FM for a human‑driven setting remains challenging, with even high‑performing FMs being sensitive to variations in human input prompts. Our code base for prompt extraction and model inference is available: https://github.com/CarolineMagg/segmentation‑FM‑benchmark/
Authors:Dengdi Sun, Jie Chen, Xiao Wang, Jin Tang
Abstract:
Physics‑Informed Neural Networks (PINNs) have shown promise in solving incompressible Navier‑Stokes equations, yet existing approaches are predominantly designed for single‑flow settings. When extended to multi‑flow scenarios, these methods face three key challenges: (1) difficulty in simultaneously capturing both shared physical principles and flow‑specific characteristics, (2) susceptibility to inter‑task negative transfer that degrades prediction accuracy, and (3) unstable training dynamics caused by disparate loss magnitudes across heterogeneous flow regimes. To address these limitations, we propose UniPINN, a unified multi‑flow PINN framework that integrates three complementary components: a shared‑specialized architecture that disentangles universal physical laws from flow‑specific features, a cross‑flow attention mechanism that selectively reinforces relevant patterns while suppressing task‑irrelevant interference, and a dynamic weight allocation strategy that adaptively balances loss contributions to stabilize multi‑objective optimization. Extensive experiments on three canonical flows demonstrate that UniPINN effectively unifies multi‑flow learning, achieving superior prediction accuracy and balanced performance across heterogeneous regimes while successfully mitigating negative transfer. The source code of this paper will be released on https://github.com/Event‑AHU/OpenFusion
Authors:Tongcheng Zhang, Zhanpeng Zhou, Mingze Wang, Andi Han, Wei Huang, Taiji Suzuki, Junchi Yan
Abstract:
One crucial factor behind the success of deep learning lies in the implicit bias induced by noise inherent in gradient‑based training algorithms. Motivated by empirical observations that training with noisy labels improves model generalization, we delve into the underlying mechanisms behind stochastic gradient descent (SGD) with label noise. Focusing on a two‑layer over‑parameterized linear network, we analyze the learning dynamics of label noise SGD, unveiling a two‑phase learning behavior. In \emphPhase I, the magnitudes of model weights progressively diminish, and the model escapes the lazy regime; enters the rich regime. In \emphPhase II, the alignment between model weights and the ground‑truth interpolator increases, and the model eventually converges. Our analysis highlights the critical role of label noise in driving the transition from the lazy to the rich regime and minimally explains its empirical success. Furthermore, we extend these insights to Sharpness‑Aware Minimization (SAM), showing that the principles governing label noise SGD also apply to broader optimization algorithms. Extensive experiments, conducted under both synthetic and real‑world setups, strongly support our theory. Our code is released at https://github.com/a‑usually/Label‑Noise‑SGD.
Authors:Jake Gonzales, Kazuki Mizuta, Karen Leung, Lillian J. Ratliff
Abstract:
In this paper, we present a novel probabilistic safe control framework for human‑robot interaction that combines control barrier functions (CBFs) with conformal risk control to provide formal safety guarantees while considering complex human behavior. The approach uses conformal risk control to quantify and control the prediction errors in CBF safety values and establishes formal guarantees on the probability of constraint satisfaction during interaction. We introduce an algorithm that dynamically adjusts the safety margins produced by conformal risk control based on the current interaction context. Through experiments on human‑robot navigation scenarios, we demonstrate that our approach significantly reduces collision rates and safety violations as compared to baseline methods while maintaining high success rates in goal‑reaching tasks and efficient control. The code, simulations, and other supplementary material can be found on the project website: https://jakeagonzales.github.io/crc‑cbf‑website/.
Authors:Chen-Chen Zong, Sheng-Jun Huang
Abstract:
Federated active learning (FAL) seeks to reduce annotation cost under privacy constraints, yet its effectiveness degrades in realistic settings with severe global class imbalance and highly heterogeneous clients. We conduct a systematic study of query‑model selection in FAL and uncover a central insight: the model that achieves more class‑balanced sampling, especially for minority classes, consistently leads to better final performance. Moreover, global‑model querying is beneficial only when the global distribution is highly imbalanced and client data are relatively homogeneous; otherwise, the local model is preferable. Based on these findings, we propose FairFAL, an adaptive class‑fair FAL framework. FairFAL (1) infers global imbalance and local‑global divergence via lightweight prediction discrepancy, enabling adaptive selection between global and local query models; (2) performs prototype‑guided pseudo‑labeling using global features to promote class‑aware querying; and (3) applies a two‑stage uncertainty‑diversity balanced sampling strategy with k‑center refinement. Experiments on five benchmarks show that FairFAL consistently outperforms state‑of‑the‑art approaches under challenging long‑tailed and non‑IID settings. The code is available at https://github.com/chenchenzong/FairFAL.
Authors:Tim Schopf, Michael Färber
Abstract:
Judging the novelty of research ideas is crucial for advancing science, enabling the identification of unexplored directions, and ensuring contributions meaningfully extend existing knowledge rather than reiterate minor variations. However, given the exponential growth of scientific literature, manually judging the novelty of research ideas through literature reviews is labor‑intensive, subjective, and infeasible at scale. Therefore, recent efforts have proposed automated approaches for research idea novelty judgment. Yet, evaluation of these approaches remains largely inconsistent and is typically based on non‑standardized human evaluations, hindering large‑scale, comparable evaluations. To address this, we introduce RINoBench, the first comprehensive benchmark for large‑scale evaluation of research idea novelty judgments. It comprises 1,381 research ideas derived from and judged by human experts as well as nine automated evaluation metrics designed to assess both rubric‑based novelty scores and textual justifications of novelty judgments. Using this benchmark, we evaluate several state‑of‑the‑art large language models (LLMs) on their ability to judge the novelty of research ideas. Our findings reveal that while LLM‑generated reasoning closely mirrors human rationales, this alignment does not reliably translate into accurate novelty judgments, which diverge significantly from human gold standard judgments ‑ even among leading reasoning‑capable models. Data and code available at: https://github.com/TimSchopf/RINoBench.
Authors:Feng Li, Ziyuan Li, Zhongliang Jiang, Nassir Navab, Yuan Bi
Abstract:
Intraoperative Cone Beam Computed Tomography (CBCT) provides a reliable 3D anatomical context essential for interventional planning. However, its static nature fails to provide continuous monitoring of soft‑tissue deformations induced by respiration, probe pressure, and surgical manipulation, leading to navigation discrepancies. We propose a deformation‑aware CBCT updating framework that leverages robotic ultrasound as a dynamic proxy to infer tissue motion and update static CBCT slices in real time. Starting from calibration‑initialized alignment with linear correlation of linear combination (LC2)‑based rigid refinement, our method establishes accurate multimodal correspondence. To capture intraoperative dynamics, we introduce the ultrasound correlation UNet (USCorUNet), a lightweight network trained with optical flow‑guided supervision to learn deformation‑aware correlation representations, enabling accurate, real‑time dense deformation field estimation from ultrasound streams. The inferred deformation is spatially regularized and transferred to the CBCT reference to produce deformation‑consistent visualizations without repeated radiation exposure. We validate the proposed approach through deformation estimation and ultrasound‑guided CBCT updating experiments. Results demonstrate real‑time end‑to‑end CBCT slice updating and physically plausible deformation estimation, enabling dynamic refinement of static CBCT guidance during robotic ultrasound‑assisted interventions. The source code is publicly available at https://github.com/anonymous‑codebase/us‑cbct‑demo.
Authors:Sofia Maria Lo Cicero Vaina, Artem Chumachenko, Max Ryabinin
Abstract:
Finetuning on domain‑specific data is a well‑established method for enhancing LLM performance on downstream tasks. Training on each dataset produces a new set of model weights, resulting in a multitude of checkpoints saved in‑house or on open‑source platforms. However, these training artifacts are rarely reused for subsequent experiments despite containing improved model abilities for potentially similar tasks. In this paper, we propose Mashup Learning, a simple method to leverage the outputs of prior training runs to enhance model adaptation to new tasks. Our procedure identifies the most relevant historical checkpoints for a target dataset, aggregates them with model merging, and uses the result as an improved initialization for training. Across 8 standard LLM benchmarks, four models, and two collections of source checkpoints, Mashup Learning consistently improves average downstream accuracy by 0.5‑5 percentage points over training from scratch. It also accelerates convergence, requiring 41‑46% fewer training steps and up to 37% less total wall‑clock time to match from‑scratch accuracy, including all selection and merging overhead.
Authors:Sijia Cui, Pengyu Cheng, Jiajun Song, Yongbo Gai, Guojun Zhang, Zhechao Yu, Jianhe Lin, Xiaoxi Jiang, Guanjun Jiang
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capacity of Large Language Models (LLMs). However, RLVR solely relies on final answers as outcome rewards, neglecting the correctness of intermediate reasoning steps. Training on these process‑wrong but outcome‑correct rollouts can lead to hallucination and answer‑copying, severely undermining the model's generalization and robustness. To address this, we incorporate a Contrastive Learning mechanism into the Policy Optimization (CLIPO) to generalize the RLVR process. By optimizing a contrastive loss over successful rollouts, CLIPO steers the LLM to capture the invariant structure shared across correct reasoning paths. This provides a more robust cross‑trajectory regularization than the original single‑path supervision in RLVR, effectively mitigating step‑level reasoning inconsistencies and suppressing hallucinatory artifacts. In experiments, CLIPO consistently improves multiple RLVR baselines across diverse reasoning benchmarks, demonstrating uniform improvements in generalization and robustness for policy optimization of LLMs. Our code and training recipes are available at https://github.com/Qwen‑Applications/CLIPO.
Authors:Qitong Sun, Jun Han, Tianlin Li, Zhe Tang, Sheng Chen, Fei Yang, Aishan Liu, Xianglong Liu, Yang Liu
Abstract:
Improving GPU kernel efficiency is crucial for advancing AI systems. Recent work has explored leveraging large language models (LLMs) for GPU kernel generation and optimization. However, existing LLM‑based kernel optimization pipelines typically rely on opaque, implicitly learned heuristics within the LLMs to determine optimization strategies. This leads to inefficient trial‑and‑error and weakly interpretable optimizations. Our key insight is to replace implicit heuristics with expert optimization skills that are knowledge‑driven and aware of task trajectories. Specifically, we present KernelSkill, a multi‑agent framework with a dual‑level memory architecture. KernelSkill operates by coordinating agents with long‑term memory of reusable expert skills and short‑term memory to prevent repetitive backtracking. On KernelBench Levels 1‑3, KernelSkill achieves a 100% success rate and average speedups of 5.44x, 2.82x, and 1.92x over Torch Eager on Levels 1, 2, and 3, respectively, outperforming prior baselines. Code is available at https://github.com/0satan0/KernelMem/.
Authors:Harry Owiredu-Ashley
Abstract:
Most adversarial evaluations of large language model (LLM) safety assess single prompts and report binary pass/fail outcomes, which fails to capture how safety properties evolve under sustained adversarial interaction. We present ADVERSA, an automated red‑teaming framework that measures guardrail degradation dynamics as continuous per‑round compliance trajectories rather than discrete jailbreak events. ADVERSA uses a fine‑tuned 70B attacker model (ADVERSA‑Red, Llama‑3.1‑70B‑Instruct with QLoRA) that eliminates the attacker‑side safety refusals that render off‑the‑shelf models unreliable as attackers, scoring victim responses on a structured 5‑point rubric that treats partial compliance as a distinct measurable state. We report a controlled experiment across three frontier victim models (Claude Opus 4.6, Gemini 3.1 Pro, GPT‑5.2) using a triple‑judge consensus architecture in which judge reliability is measured as a first‑class research outcome rather than assumed. Across 15 conversations of up to 10 adversarial rounds, we observe a 26.7% jailbreak rate with an average jailbreak round of 1.25, suggesting that in this evaluation setting, successful jailbreaks were concentrated in early rounds rather than accumulating through sustained pressure. We document inter‑judge agreement rates, self‑judge scoring tendencies, attacker drift as a failure mode in fine‑tuned attackers deployed out of their training distribution, and attacker refusals as a previously‑underreported confound in victim resistance measurement. All limitations are stated explicitly. Attack prompts are withheld per responsible disclosure policy; all other experimental artifacts are released.
Authors:Tianyu Pang, Yujie Fang, Zihang Liu, Shenyang Deng, Lei Hsiung, Shuhua Yu, Yaoqing Yang
Abstract:
Muon has recently shown promising results in LLM training. In this work, we study how to further improve Muon. We argue that Muon's orthogonalized update rule suppresses the emergence of heavy‑tailed weight spectra and over‑emphasizes the training along noise‑dominated directions. Motivated by the Heavy‑Tailed Self‑Regularization (HT‑SR) theory, we propose HTMuon. HTMuon preserves Muon's ability to capture parameter interdependencies while producing heavier‑tailed updates and inducing heavier‑tailed weight spectra. Experiments on LLM pretraining and image classification show that HTMuon consistently improves performance over state‑of‑the‑art baselines and can also serve as a plug‑in on top of existing Muon variants. For example, on LLaMA pretraining on the C4 dataset, HTMuon reduces perplexity by up to 0.98 compared to Muon. We further theoretically show that HTMuon corresponds to steepest descent under the Schatten‑q norm constraint and provide convergence analysis in smooth non‑convex settings. The implementation of HTMuon is available at https://github.com/TDCSZ327/HTmuon.
Authors:Dan Lee, Seungwook Han, Akarsh Kumar, Pulkit Agrawal
Abstract:
Pre‑training is crucial for large language models (LLMs), as it is when most representations and capabilities are acquired. However, natural language pre‑training has problems: high‑quality text is finite, it contains human biases, and it entangles knowledge with reasoning. This raises a fundamental question: is natural language the only path to intelligence? We propose using neural cellular automata (NCA) to generate synthetic, non‑linguistic data for pre‑pre‑training LLMs‑‑training on synthetic‑then‑natural language. NCA data exhibits rich spatiotemporal structure and statistics resembling natural language while being controllable and cheap to generate at scale. We find that pre‑pre‑training on only 164M NCA tokens improves downstream language modeling by up to 6% and accelerates convergence by up to 1.6x. Surprisingly, this even outperforms pre‑pre‑training on 1.6B tokens of natural language from Common Crawl with more compute. These gains also transfer to reasoning benchmarks, including GSM8K, HumanEval, and BigBench‑Lite. Investigating what drives transfer, we find that attention layers are the most transferable, and that optimal NCA complexity varies by domain: code benefits from simpler dynamics, while math and web text favor more complex ones. These results enable systematic tuning of the synthetic distribution to target domains. More broadly, our work opens a path toward more efficient models with fully synthetic pre‑training.
Authors:Eric Roginek, Jingyan Xu, D. Frank. Hsu
Abstract:
Ensemble learning is a well established body of methods for machine learning to enhance predictive performance by combining multiple algorithms/models. Combinatorial Fusion Analysis (CFA) has provided method and practice for combining multiple scoring systems, using rank‑score characteristic (RSC) function and cognitive diversity (CD), including ensemble method and model fusion. However, there is no general‑purpose Python tool available that incorporate these techniques. In this paper we introduce \textttInFusionLayer, a machine learning architecture inspired by CFA at the system fusion level that uses a moderate set of base models to optimize unsupervised and supervised learning multiclassification problems. We demonstrate \textttInFusionLayer's ease of use for PyTorch, TensorFlow, and Scikit‑learn workflows by validating its performance on various computer vision datasets. Our results highlight the practical advantages of incorporating distinctive features of RSC function and CD, paving the way for more sophisticated ensemble learning applications in machine learning. We open‑sourced our code to encourage continuing development and community accessibility to leverage CFA on github: https://github.com/ewroginek/Infusion
Authors:David Gringras
Abstract:
Safety benchmarks evaluate language models in isolation, typically using multiple‑choice format; production deployments wrap these models in agentic scaffolds that restructure inputs through reasoning traces, critic agents, and delegation pipelines. We report one of the largest controlled studies of scaffold effects on safety (N = 62,808; six frontier models, four deployment configurations), combining pre‑registration, assessor blinding, equivalence testing, and specification curve analysis. Map‑reduce scaffolding degrades measured safety (NNH = 14), yet two of three scaffold architectures preserve safety within practically meaningful margins. Investigating the map‑reduce degradation revealed a deeper measurement problem: switching from multiple‑choice to open‑ended format on identical items shifts safety scores by 5‑20 percentage points, larger than any scaffold effect. Within‑format scaffold comparisons are consistent with practical equivalence under our pre‑registered +/‑2 pp TOST margin, isolating evaluation format rather than scaffold architecture as the operative variable. Model x scaffold interactions span 35 pp in opposing directions (one model degrades by ‑16.8 pp on sycophancy under map‑reduce while another improves by +18.8 pp on the same benchmark), ruling out universal claims about scaffold safety. A generalisability analysis yields G = 0.000: model safety rankings reverse so completely across benchmarks that no composite safety index achieves non‑zero reliability, making per‑model, per‑configuration testing a necessary minimum standard. We release all code, data, and prompts as ScaffoldSafety.
Authors:Xingtong Yu, Shenghua Ye, Ruijuan Liang, Chang Zhou, Hong Cheng, Xinming Zhang, Yuan Fang
Abstract:
Graph foundation models (GFM) aim to acquire transferable knowledge by pre‑training on diverse graphs, which can be adapted to various downstream tasks. However, domain shift in graphs is inherently two‑dimensional: graphs differ not only in what they describe (topic domains) but also in how they are represented (format domains). Most existing GFM benchmarks vary only topic domains, thereby obscuring how knowledge transfers across both dimensions. We present a new benchmark that jointly evaluates topic and format gaps across the full GFM pipeline, including multi‑domain self‑supervised pre‑training and few‑shot downstream adaptation, and provides a timely evaluation of recent GFMs in the rapidly evolving landscape. Our protocol enables controlled assessment in four settings: (i) pre‑training on diverse topics and formats, while adapting to unseen downstream datasets; (ii) same pre‑training as in (i), while adapting to seen datasets; (iii) pre‑training on a single topic domain, while adapting to other topics; (iv) pre‑training on a base format, while adapting to other formats. This two‑axis evaluation disentangles semantic generalization from robustness to representational shifts. We conduct extensive evaluations of eight state‑of‑the‑art GFMs on 33 datasets spanning seven topic domains and six format domains, surfacing new empirical observations and practical insights for future research. Codes/data are available at https://github.com/smufang/GFMBenchmark.
Authors:Shubham Kumar Singh
Abstract:
Memory constraints in long‑running agents require structured management of accumulated facts while preserving essential information under bounded context limits. We introduce HTM‑EAR, a hierarchical tiered memory substrate that integrates HNSW‑based working memory (L1) with archival storage (L2), combining importance‑aware eviction and hybrid routing. When L1 reaches capacity, items are evicted using a weighted score of importance and usage. Queries are first resolved in L1; if similarity or entity coverage is insufficient, retrieval falls back to L2, and candidates are re‑ranked using a cross‑encoder. We evaluate the system under sustained saturation (15,000 facts; L1 capacity 500; L2 capacity 5000) using synthetic streams across five random seeds and real BGL system logs. Ablation studies compare the full system against variants without cross‑encoder re‑ranking, without routing gates, with LRU eviction, and an oracle with unbounded memory. Under saturation, the full model preserves active‑query precision (MRR = 1.000) while enabling controlled forgetting of stale history, approaching oracle active performance (0.997 +/‑ 0.003). In contrast, LRU minimizes latency (21.1 ms) but permanently evicts 2416 essential facts. On BGL logs, the full system achieves MRR 0.336, close to the oracle (0.370), while LRU drops to 0.069. Code is publicly available at: https://github.com/shubham‑61291/HTM‑EAR
Authors:Xinsheng Tang, Yangcheng Li, Nan Wang, Zhiyi Shu, Xingyu Ling, Junna Xing, Peng Zhou, Qiang Liu
Abstract:
Operator fusion, as a key performance optimization technique in the deployment of AI models, significantly improves execution efficiency and has been widely adopted in modern AI compilers. However, for cascaded reduction operations involving multiple loops with inter‑loop data dependencies, such as the safe softmax followed by GEMM within attention mechanisms, existing compilers lack effective automated fusion and kernel generation capabilities. Although some works have addressed specific instances through hand‑crafted fusion strategies, their solutions are limited in generality and difficult to extend to other similar structures. Given the prevalence of such computational patterns in deep learning models, there remains significant untapped potential in achieving general and automated fusion optimization. In this paper, we present a formal theoretical methodology for analyzing cascaded reductions which can fuse them into a single loop and introduce an incremental computation form. Based on this methodology, we design Reduction Fuser (RedFuser), a framework that automatically identifies supported cascaded reduction patterns and generates optimized fused kernels. Experiments show that RedFuser successfully fuses diverse workloads, achieving up to 2× to 5× speedup over state‑of‑the‑art AI compilers and matching the performance of highly optimized hand‑written kernels. The code is available at https://github.com/alibaba/redfuser
Authors:Izzat Alsmadi, Anas Alsobeh
Abstract:
This paper presents TAMUSA‑Chat, a research‑oriented framework for building domain‑adapted large language model conversational systems. The work addresses critical challenges in adapting general‑purpose foundation models to institutional contexts through supervised fine‑tuning, retrieval‑augmented generation, and systematic evaluation methodologies. We describe the complete architecture encompassing data acquisition from institutional sources, preprocessing pipelines, embedding construction, model training workflows, and deployment strategies. The system integrates modular components enabling reproducible experimentation with training configurations, hyper‑parameters, and evaluation protocols. Our implementation demonstrates how academic institutions can develop contextually grounded conversational agents while maintaining transparency, governance compliance, and responsible AI practices. Through empirical analysis of fine‑tuning behavior across model sizes and training iterations, we provide insights into domain adaptation efficiency, computational resource requirements, and quality‑cost trade‑offs. The publicly available codebase at https://github.com/alsmadi/TAMUSA_LLM_Based_Chat_app supports continued research into institutional LLM deployment, evaluation methodologies, and ethical considerations for educational AI systems.
Authors:Shuhuai Li, Jianghao Lin, Dongdong Ge, Yinyu Ye
Abstract:
Mixture‑of‑Experts (MoE) models enable scalable performance but face severe memory constraints on edge devices. Existing offloading strategies struggle with I/O bottlenecks due to the dynamic, low‑information nature of autoregressive expert activation. In this paper, we propose to repurpose Speculative Decoding (SD) not merely as a compute accelerator, but as an informative lookahead sensor for memory management, supported by our theoretical and empirical analyses. Hence, we introduce MoE‑SpAc, an MoE inference framework that integrates a Speculative Utility Estimator to track expert demand, a Heterogeneous Workload Balancer to dynamically partition computation via online integer optimization, and an Asynchronous Execution Engine to unify the prefetching and eviction in the same utility space. Extensive experiments on seven benchmarks demonstrate that MoE‑SpAc achieves a 42% improvement in TPS over the SOTA SD‑based baseline, and an average 4.04x speedup over all standard baselines. Code is available at https://github.com/lshAlgorithm/MoE‑SpAc .
Authors:Lucas Prieto, Edward Stevinson, Melih Barsbey, Tolga Birdal, Pedro A. M. Mediano
Abstract:
A central idea in mechanistic interpretability is that neural networks represent more features than they have dimensions, arranging them in superposition to form an over‑complete basis. This framing has been influential, motivating dictionary learning approaches such as sparse autoencoders. However, superposition has mostly been studied in idealized settings where features are sparse and uncorrelated. In these settings, superposition is typically understood as introducing interference that must be minimized geometrically and filtered out by non‑linearities such as ReLUs, yielding local structures like regular polytopes. We show that this account is incomplete for realistic data by introducing Bag‑of‑Words Superposition (BOWS), a controlled setting to encode binary bag‑of‑words representations of internet text in superposition. Using BOWS, we find that when features are correlated, interference can be constructive rather than just noise to be filtered out. This is achieved by arranging features according to their co‑activation patterns, making interference between active features constructive, while still using ReLUs to avoid false positives. We show that this kind of arrangement is more prevalent in models trained with weight decay and naturally gives rise to semantic clusters and cyclical structures which have been observed in real language models yet were not explained by the standard picture of superposition. Code for this paper can be found at https://github.com/LucasPrietoAl/correlations‑feature‑geometry.
Authors:Xinyu Gao, Gang Chen, Javier Alonso-Mora
Abstract:
Language‑conditioned local navigation requires a robot to infer a nearby traversable target location from its current observation and an open‑vocabulary, relational instruction. Existing vision‑language spatial grounding methods usually rely on vision‑language models (VLMs) to reason in image space, producing 2D predictions tied to visible pixels. As a result, they struggle to infer target locations in occluded regions, typically caused by furniture or moving humans. To address this issue, we propose BEACON, which predicts an ego‑centric Bird's‑Eye View (BEV) affordance heatmap over a bounded local region including occluded areas. Given an instruction and surround‑view RGB‑D observations from four directions around the robot, BEACON predicts the BEV heatmap by injecting spatial cues into a VLM and fusing the VLM's output with depth‑derived BEV features. Using an occlusion‑aware dataset built in the Habitat simulator, we conduct detailed experimental analysis to validate both our BEV space formulation and the design choices of each module. Our method improves the accuracy averaged across geodesic thresholds by 22.74 percentage points over the state‑of‑the‑art image‑space baseline on the validation subset with occluded target locations. Our project page is: https://xin‑yu‑gao.github.io/beacon.
Authors:Rong Zhou, Houliang Zhou, Yao Su, Brian Y. Chen, Yu Zhang, Lifang He, Alzheimer's Disease Neuroimaging Initiative
Abstract:
Multimodal neuroimaging provides complementary insights for Alzheimer's disease diagnosis, yet clinical datasets frequently suffer from missing modalities. We propose ACADiff, a framework that synthesizes missing brain imaging modalities through adaptive clinical‑aware diffusion. ACADiff learns mappings between incomplete multimodal observations and target modalities by progressively denoising latent representations while attending to available imaging data and clinical metadata. The framework employs adaptive fusion that dynamically reconfigures based on input availability, coupled with semantic clinical guidance via GPT‑4o‑encoded prompts. Three specialized generators enable bidirectional synthesis among sMRI, FDG‑PET, and AV45‑PET. Evaluated on ADNI subjects, ACADiff achieves superior generation quality and maintains robust diagnostic performance even under extreme 80% missing scenarios, outperforming all existing baselines. To promote reproducibility, code is available at https://github.com/rongzhou7/ACADiff
Authors:Yunhang Qian, Xiaobin Hu, Jiaquan Yu, Siyang Xin, Xiaokun Chen, Jiangning Zhang, Peng-Tao Jiang, Jiawei Liu, Hongwei Bran Li
Abstract:
While Multi‑Agent Systems (MAS) show potential for complex clinical decision support, the field remains hindered by architectural fragmentation and the lack of standardized multimodal integration. Current medical MAS research suffers from non‑uniform data ingestion pipelines, inconsistent visual‑reasoning evaluation, and a lack of cross‑specialty benchmarking. To address these challenges, we present MedMASLab, a unified framework and benchmarking platform for multimodal medical multi‑agent systems. MedMASLab introduces: (1) A standardized multimodal agent communication protocol that enables seamless integration of 11 heterogeneous MAS architectures across 24 medical modalities. (2) An automated clinical reasoning evaluator, a zero‑shot semantic evaluation paradigm that overcomes the limitations of lexical string‑matching by leveraging large vision‑language models to verify diagnostic logic and visual grounding. (3) The most extensive benchmark to date, spanning 11 organ systems and 473 diseases, standardizing data from 11 clinical benchmarks. Our systematic evaluation reveals a critical domain‑specific performance gap: while MAS improves reasoning depth, current architectures exhibit significant fragility when transitioning between specialized medical sub‑domains. We provide a rigorous ablation of interaction mechanisms and cost‑performance trade‑offs, establishing a new technical baseline for future autonomous clinical systems. The source code and data is publicly available at: https://github.com/NUS‑Project/MedMASLab/
Authors:Yixin Zheng, Jiangran Lyu, Yifan Zhang, Jiayi Chen, Mi Yan, Yuntian Deng, Xuesong Shi, Xiaoguang Zhao, Yizhou Wang, Zhizheng Zhang, He Wang
Abstract:
Extrinsic dexterity leverages environmental contact to overcome the limitations of prehensile manipulation. However, achieving such dexterity in cluttered scenes remains challenging and underexplored, as it requires selectively exploiting contact among multiple interacting objects with inherently coupled dynamics. Existing approaches lack explicit modeling of such complex dynamics and therefore fall short in non‑prehensile manipulation in cluttered environments, which in turn limits their practical applicability in real‑world environments. In this paper, we introduce a Dynamics‑Aware Policy Learning (DAPL) framework that can facilitate policy learning with a learned representation of contact‑induced object dynamics in cluttered environments. This representation is learned through explicit world modeling and used to condition reinforcement learning, enabling extrinsic dexterity to emerge without hand‑crafted contact heuristics or complex reward shaping. We evaluate our approach in both simulation and the real world. Our method outperforms prehensile manipulation, human teleoperation, and prior representation‑based policies by over 25% in success rate on unseen simulated cluttered scenes with varying densities. The real‑world success rate reaches around 50% across 10 cluttered scenes, while a practical grocery deployment further demonstrates robust sim‑to‑real transfer and applicability.
Authors:Davit Melikidze, Marian Schneider, Jessica Lam, Martin Wertich, Ido Hakimi, Barna Pásztor, Andreas Krause
Abstract:
Reinforcement Learning from Human Feedback (RLHF) has become the standard for aligning Large Language Models (LLMs), yet its efficacy is bottlenecked by the high cost of acquiring preference data, especially in low‑resource and expert domains. To address this, we introduce ACTIVEULTRAFEEDBACK, a modular active learning pipeline that leverages uncertainty estimates to dynamically identify the most informative responses for annotation. Our pipeline facilitates the systematic evaluation of standard response selection methods alongside DOUBLE REVERSE THOMPSON SAMPLING (DRTS) and DELTAUCB, two novel methods prioritizing response pairs with large predicted quality gaps, leveraging recent results showing that such pairs provide good signals for fine‑tuning. Our experiments demonstrate that ACTIVEULTRAFEEDBACK yields high‑quality datasets that lead to significant improvements in downstream performance, notably achieving comparable or superior results with as little as one‑sixth of the annotated data relative to static baselines. Our pipeline is available at https://github.com/lasgroup/ActiveUltraFeedback and our preference datasets at https://huggingface.co/ActiveUltraFeedback.
Authors:Xin An, Jingyi Cai, Xiangyang Chen, Huayao Liu, Peiting Liu, Peng Wang, Bei Yang, Xiuwen Zhu, Yongfan Chen, Yan Gao, Yuan Gao, Baoyu Hou, Guangzheng Hu, Shuzhao Li, Weixu Qiao, Weidong Ren, Yanan Wang, Boyu Yang, Fan Yang, Jiangtao Zhang, Lixin Zhang, Lin Qu, Hu Wei, Xiaoxiao Xu, Bing Zhao
Abstract:
Addressing the challenges of fragmented task definitions and the heterogeneity of unstructured data in multimodal parsing, this paper proposes the Omni Parsing framework. This framework establishes a Unified Taxonomy covering documents, images, and audio‑visual streams, introducing a progressive parsing paradigm that bridges perception and cognition. Specifically, the framework integrates three hierarchical levels: 1) Holistic Detection, which achieves precise spatial‑temporal grounding of objects or events to establish a geometric baseline for perception; 2) Fine‑grained Recognition, which performs symbolization (e.g., OCR/ASR) and attribute extraction on localized objects to complete structured entity parsing; and 3) Multi‑level Interpreting, which constructs a reasoning chain from local semantics to global logic. A pivotal advantage of this framework is its evidence anchoring mechanism, which enforces a strict alignment between high‑level semantic descriptions and low‑level facts. This enables ``evidence‑based'' logical induction, transforming unstructured signals into standardized knowledge that is locatable, enumerable, and traceable. Building on this foundation, we constructed a standardized dataset and released the Logics‑Parsing‑Omni model, which successfully converts complex audio‑visual signals into machine‑readable structured knowledge. Experiments demonstrate that fine‑grained perception and high‑level cognition are synergistic, effectively enhancing model reliability. Furthermore, to quantitatively evaluate these capabilities, we introduce OmniParsingBench. Code, models and the benchmark are released at https://github.com/alibaba/Logics‑Parsing/tree/master/Logics‑Parsing‑Omni.
Authors:Arash Shahmansoori
Abstract:
LLM agents that store knowledge as natural language suffer steep retrieval degradation as condition count grows, often struggle to compose learned rules reliably, and typically lack explicit mechanisms to detect stale or adversarial knowledge. We introduce PRECEPT, a unified framework for test‑time adaptation with three tightly coupled components: (1) deterministic exact‑match rule retrieval over structured condition keys, (2) conflict‑aware memory with Bayesian source reliability and threshold‑based rule invalidation, and (3) COMPASS, a Pareto‑guided prompt‑evolution outer loop. Exact retrieval eliminates partial‑match interpretation errors on the deterministic path (0% by construction, vs 94.4% under Theorem~B.6's independence model at N=10) and supports compositional stacking through a semantic tier hierarchy; conflict‑aware memory resolves static‑‑dynamic disagreements and supports drift adaptation; COMPASS evaluates prompts through the same end‑to‑end execution pipeline. Results (9‑‑10 seeds): PRECEPT achieves a +41.1pp first‑try advantage over Full Reflexion (d>1.9), +33.3pp compositional generalization (d=1.55), 100% P_1 on 2‑way logistics compositions (d=2.64), +40‑‑55pp continuous learning gains, strong eventual robustness under adversarial static knowledge (100% logistics with adversarial SK active; partial recovery on integration), +55.0pp drift recovery (d=0.95, p=0.031), and 61% fewer steps. Core comparisons are statistically significant, often at p<0.001.
Authors:Cosmo Santoni
Abstract:
State‑space model releases are typically coupled to fused CUDA and Triton kernels, inheriting a hard dependency on NVIDIA hardware. We show that Mamba‑2's state space duality algorithm ‑‑ diagonal state structure, chunkable recurrence, and einsum‑dominated compute with static control flow ‑‑ maps cleanly onto what XLA's fusion and tiling passes actually optimise, making custom kernels optional rather than required. We implement the full inference path (prefill, cached autoregressive decoding) as shaped standard primitives under XLA, without hand‑written kernels, and realise the architecture's theoretical O(1) state management as a compiled on‑device cache requiring no host synchronisation during generation. The implementation runs unmodified on CPU, NVIDIA GPU, and Google Cloud TPU from a single JAX source. On TPU v6e across five model scales (130M‑‑2.7B parameters), XLA‑generated code reaches approximately 140 TFLOPS on single‑stream prefill (15% MFU) and up to 64% bandwidth utilisation on decode. Greedy decoding matches the PyTorch/CUDA reference token‑for‑token across 64 steps, with hidden‑state agreement within float32 rounding tolerance. The pattern transfers to any SSM recurrence satisfying the same structural conditions, on any platform with a mature XLA backend. The implementation is publicly available at https://github.com/CosmoNaught/mamba2‑jax and merged into the Bonsai JAX model library.
Authors:Luxi Lin, Zhihang Lin, Zhanpeng Zeng, Yuhao Chen, Qingyu Zhang, Jixiang Luo, Xuelong Li, Rongrong Ji
Abstract:
Speculative decoding accelerates LLM inference but suffers from performance degradation when target models are fine‑tuned for specific domains. A naive solution is to retrain draft models for every target model, which is costly and inefficient. To address this, we introduce a parameter‑ and data‑efficient framework named Efficient Draft Adaptation, abbreviated as EDA, for efficiently adapting draft models. EDA introduces three innovations: (1) a decoupled architecture that utilizes shared and private components to model the shared and target‑specific output distributions separately, enabling parameter‑efficient adaptation by updating only the lightweight private component;(2) a data regeneration strategy that utilizes the fine‑tuned target model to regenerate training data, thereby improving the alignment between training and speculative decoding, leading to higher average acceptance length;(3) a sample selection mechanism that prioritizes high‑value data for efficient adaptation. Our experiments show that EDA effectively restores speculative performance on fine‑tuned models, achieving superior average acceptance lengths with significantly reduced training costs compared to full retraining. Code is available at https://github.com/Lyn‑Lucy/Efficient‑Draft‑Adaptation.
Authors:Jiajun Cao, Xiaoan Zhang, Xiaobao Wei, Liyuqiu Huang, Wang Zijian, Hanzhen Zhang, Zhengyu Jia, Wei Mao, Hao Wang, Xianming Liu, Shuchang Zhou, Yang Wang, Shanghang Zhang
Abstract:
Vision‑Language‑Action models have shown great promise for autonomous driving, yet they suffer from degraded perception after unfreezing the visual encoder and struggle with accumulated instability in long‑term planning. To address these challenges, we propose EvoDriveVLA‑a novel collaborative perception‑planning distillation framework that integrates self‑anchored perceptual constraints and oracle‑guided trajectory optimization. Specifically, self‑anchored visual distillation leverages self‑anchor teacher to deliver visual anchoring constraints, regularizing student representations via trajectory‑guided key‑region awareness. In parallel, oracle‑guided trajectory distillation employs a future‑aware oracle teacher with coarse‑to‑fine trajectory refinement and Monte Carlo dropout sampling to produce high‑quality trajectory candidates, thereby selecting the optimal trajectory to guide the student's prediction. EvoDriveVLA achieves SOTA performance in open‑loop evaluation and significantly enhances performance in closed‑loop evaluation. Our code is available at: https://github.com/hey‑cjj/EvoDriveVLA.
Authors:Zirui Zhang, Yaping Zhang, Lu Xiang, Yang Zhao, Feifei Zhai, Yu Zhou, Chengqing Zong
Abstract:
Document Layout Analysis (DLA) is crucial for document artificial intelligence and has recently received increasing attention, resulting in an influx of large‑scale public DLA datasets. Existing work often combines data from various domains in recent public DLA datasets to improve the generalization of DLA. However, directly merging these datasets for training often results in suboptimal model performance, as it overlooks the different layout structures inherent to various domains. These variations include different labeling styles, document types, and languages. This paper introduces PromptDLA, a domain‑aware Prompter for Document Layout Analysis that effectively leverages descriptive knowledge as cues to integrate domain priors into DLA. The innovative PromptDLA features a unique domain‑aware prompter that customizes prompts based on the specific attributes of the data domain. These prompts then serve as cues that direct the DLA toward critical features and structures within the data, enhancing the model's ability to generalize across varied domains. Extensive experiments show that our proposal achieves state‑of‑the‑art performance among DocLayNet, PubLayNet, M6Doc, and D^4LA. Our code is available at https://github.com/Zirui00/PromptDLA.
Authors:Taesung Kwon, Lorenzo Bianchi, Lennart Wittke, Felix Watine, Fabio Carrara, Jong Chul Ye, Romann Weber, Vinicius Azevedo
Abstract:
Recent diffusion models increasingly favor Transformer backbones, motivated by the remarkable scalability of fully attentional architectures. Yet the locality bias, parameter efficiency, and hardware friendliness‑‑the attributes that established ConvNets as the efficient vision backbone‑‑have seen limited exploration in modern generative modeling. Here we introduce the fully convolutional diffusion model (FCDM), a model having a backbone similar to ConvNeXt, but designed for conditional diffusion modeling. We find that using only 50% of the FLOPs of DiT‑XL/2, FCDM‑XL achieves competitive performance with 7× and 7.5× fewer training steps at 256×256 and 512×512 resolutions, respectively. Remarkably, FCDM‑XL can be trained on a 4‑GPU system, highlighting the exceptional training efficiency of our architecture. Our results demonstrate that modern convolutional designs provide a competitive and highly efficient alternative for scaling diffusion models, reviving ConvNeXt as a simple yet powerful building block for efficient generative modeling.
Authors:Aodi Wu, Jianhong Zuo, Zeyuan Zhao, Xubo Luo, Ruisuo Wang, Xue Wan
Abstract:
Autonomous space operations such as on‑orbit servicing and active debris removal demand robust part‑level semantic understanding and precise relative navigation of target spacecraft, yet collecting large‑scale real data in orbit remains impractical due to cost and access constraints. Existing synthetic datasets, moreover, suffer from limited target diversity, single‑modality sensing, and incomplete ground‑truth annotations. We present SpaceSense‑Bench, a large‑scale multi‑modal benchmark for spacecraft perception encompassing 136~satellite models with approximately 70~GB of data. Each frame provides time‑synchronized 1024×1024 RGB images, millimeter‑precision depth maps, and 256‑beam LiDAR point clouds, together with dense 7‑class part‑level semantic labels at both the pixel and point level as well as accurate 6‑DoF pose ground truth. The dataset is generated through a high‑fidelity space simulation built in Unreal Engine~5 and a fully automated pipeline covering data acquisition, multi‑stage quality control, and conversion to mainstream formats. We benchmark five representative tasks (object detection, 2D semantic segmentation, RGB‑‑LiDAR fusion‑based 3D point cloud segmentation, monocular depth estimation, and orientation estimation) and identify two key findings: (i)~perceiving small‑scale components (\emphe.g., thrusters and omni‑antennas) and generalizing to entirely unseen spacecraft in a zero‑shot setting remain critical bottlenecks for current methods, and (ii)~scaling up the number of training satellites yields substantial performance gains on novel targets, underscoring the value of large‑scale, diverse datasets for space perception research. The dataset, code, and toolkit are publicly available at https://github.com/wuaodi/SpaceSense‑Bench.
Authors:Renwei Meng
Abstract:
Retrieval‑augmented generation (RAG) improves factual grounding, yet most systems rely on flat chunk retrieval and provide limited control over multi‑step synthesis. We propose an Explainable Innovation Engine that upgrades the knowledge unit from text chunks to methods‑as‑nodes. The engine maintains a weighted method provenance tree for traceable derivations and a hierarchical clustering abstraction tree for efficient top‑down navigation. At inference time, a strategy agent selects explicit synthesis operators (e.g., induction, deduction, analogy), composes new method nodes, and records an auditable trajectory. A verifier‑scorer layer then prunes low‑quality candidates and writes validated nodes back to support continual growth. Expert evaluation across six domains and multiple backbones shows consistent gains over a vanilla baseline, with the largest improvements on derivation‑heavy settings, and ablations confirm the complementary roles of provenance backtracking and pruning. These results suggest a practical path toward controllable, explainable, and verifiable innovation in agentic RAG systems. Code is available at the project GitHub repository https://github.com/xiaolu‑666113/Dual‑Tree‑Agent‑RAG.
Authors:Junjie Yin, Jiaju Li, Hanfa Xing
Abstract:
Diffusion‑based image super‑resolution (ISR) has shown strong potential, but it still struggles in real‑world scenarios where degradations are unknown and spatially non‑uniform, often resulting in lost details or visual artifacts. To address this challenge, we propose a novel super‑resolution diffusion model, QUSR, which integrates a Quality‑Aware Prior (QAP) with an Uncertainty‑Guided Noise Generation (UNG) module. The UNG module adaptively adjusts the noise injection intensity, applying stronger perturbations to high‑uncertainty regions (e.g., edges and textures) to reconstruct complex details, while minimizing noise in low‑uncertainty regions (e.g., flat areas) to preserve original information. Concurrently, the QAP leverages an advanced Multimodal Large Language Model (MLLM) to generate reliable quality descriptions, providing an effective and interpretable quality prior for the restoration process. Experimental results confirm that QUSR can produce high‑fidelity and high‑realism images in real‑world scenarios. The source code is available at https://github.com/oTvTog/QUSR.
Authors:Yifan Han, Zhongxi Chen, Yuxuan Zhao, Congsheng Xu, Yanming Shao, Yichuan Peng, Yao Mu, Wenzhao Lian
Abstract:
While Vision‑Language‑Action (VLA) models have demonstrated promising generalization capabilities in robotic manipulation, deploying them on specific and complex downstream tasks still demands effective post‑training. In parallel, Human‑in‑the‑Loop (HiL) learning has proven to be a powerful mechanism for refining robot policies. However, extending this paradigm to dexterous manipulation remains challenging: multi‑finger control is high‑dimensional, contact‑intensive, and exhibits execution distributions that differ markedly from standard arm motions, leaving existing dexterous VLA systems limited in reliability and adaptability. We present DexHiL, the first integrated arm‑hand human‑in‑the‑loop framework for dexterous VLA models, enabling coordinated interventions over the arm and the dexterous hand within a single system. DexHiL introduces an intervention‑aware data sampling strategy that prioritizes corrective segments for post‑training, alongside a lightweight teleoperation interface that supports instantaneous human corrections during execution. Real‑robot experiments demonstrate that DexHiL serves as an effective post‑training framework, yielding a substantial performance leap, outperforming standard offline‑only fine‑tuning baselines by an average of 25% in success rates across distinct tasks. Project page: https://chenzhongxi‑sjtu.github.io/dexhil/
Authors:Yunfei Xie, Kevin Wang, Bobby Cheng, Jianzhu Yao, Zhizhou Sha, Alexander Duffy, Yihan Xi, Hongyuan Mei, Cheston Tan, Chen Wei, Pramod Viswanath, Zhangyang Wang
Abstract:
Multi‑turn, multi‑agent LLM game evaluations often exhibit substantial run‑to‑run variance. In long‑horizon interactions, small early deviations compound across turns and are amplified by multi‑agent coupling. This biases win rate estimates and makes rankings unreliable across repeated tournaments. Prompt choice worsens this further by producing different effective policies. We address both instability and underperformance with MEMO (Memory‑augmented MOdel context optimization), a self‑play framework that optimizes inference‑time context by coupling retention and exploration. Retention maintains a persistent memory bank that stores structured insights from self‑play trajectories and injects them as priors during later play. Exploration runs tournament‑style prompt evolution with uncertainty‑aware selection via TrueSkill, and uses prioritized replay to revisit rare and decisive states. Across five text‑based games, MEMO raises mean win rate from 25.1% to 49.5% for GPT‑4o‑mini and from 20.9% to 44.3% for Qwen‑2.5‑7B‑Instruct, using 2,000 self‑play games per task. Run‑to‑run variance also drops, giving more stable rankings across prompt variations. These results suggest that multi‑agent LLM game performance and robustness have substantial room for improvement through context optimization. MEMO achieves the largest gains in negotiation and imperfect‑information games, while RL remains more effective in perfect‑information settings. All code is open‑source and available here: https://github.com/openverse‑ai/MEMO
Authors:Bhada Yun, Evgenia Taranova, Dana Feng, Renn Su, April Yi Wang
Abstract:
There is no 'ordinary' when it comes to AI. The human‑AI experience is extraordinarily complex and specific to each person, yet dominant measures such as usability scales and engagement metrics flatten away nuance. We argue for AI phenomenology: a research stance that asks "How did it feel?" beyond the standard questions of "How well did it perform?" when interacting with AI systems. AI phenomenology acts as a paradigm for bidirectional human‑AI alignment as it foregrounds users' first‑person perceptions and interpretations of AI systems over time. We motivate AI phenomenology as a framework that captures how alignment is experienced, negotiated, and updated between users and AI systems. Tracing a lineage from Husserl through postphenomenology to Actor‑Network Theory, and grounding our argument in three studies‑two longitudinal studies with "Day", an AI companion, and a multi‑method study of agentic AI in software engineering‑we contribute a set of replicable methodological toolkits for conducting AI phenomenology research: instruments for capturing lived experience across personal and professional contexts, three design concepts (translucent design, agency‑aware value alignment, temporal co‑evolution tracking), and a concrete research agenda. We offer this toolkit not as a new paradigm but as a practical scaffold that researchers can adapt as AI systems‑and the humans who live alongside them‑continue to co‑evolve.
Authors:Yixiong Chen, Xinyi Bai, Yue Pan, Zongwei Zhou, Alan Yuille
Abstract:
Multi‑modal large language models (MM‑LLMs) have shown strong performance in medical image understanding and clinical reasoning. Recent medical agent systems extend them with tool use and multi‑agent collaboration, enabling complex decision‑making. However, these systems rely almost entirely on frontier models (e.g., GPT), whose API‑based deployment incurs high cost, high latency, and privacy risks that conflict with on‑premise clinical requirements. We present Meissa, a lightweight 4B‑parameter medical MM‑LLM that brings agentic capability offline. Instead of imitating static answers, Meissa learns both when to engage external interaction (strategy selection) and how to execute multi‑step interaction (strategy execution) by distilling structured trajectories from frontier models. Specifically, we propose: (1) Unified trajectory modeling: trajectories (reasoning and action traces) are represented within a single state‑action‑observation formalism, allowing one model to generalize across heterogeneous medical environments. (2) Three‑tier stratified supervision: the model's own errors trigger progressive escalation from direct reasoning to tool‑augmented and multi‑agent interaction, explicitly learning difficulty‑aware strategy selection. (3) Prospective‑retrospective supervision: pairing exploratory forward traces with hindsight‑rationalized execution traces enables stable learning of effective interaction policies. Trained on 40K curated trajectories, Meissa matches or exceeds proprietary frontier agents in 10 of 16 evaluation settings across 13 medical benchmarks spanning radiology, pathology, and clinical reasoning. Using over 25x fewer parameters than typical frontier models like Gemini‑3, Meissa operates fully offline with 22x lower end‑to‑end latency compared to API‑based deployment. Data, models, and environments are released at https://github.com/Schuture/Meissa.
Authors:Pranav Mantini, Shishir K. Shah
Abstract:
Recent advances in vision‑language models (VLMs) have demonstrated remarkable zero‑shot capabilities, yet adapting these models to specialized domains remains a significant challenge. Building on recent theoretical insights suggesting that independently trained VLMs are related by a canonical transformation, we extend this understanding to the concept of domains. We hypothesize that image features across disparate domains are related by a canonicalized geometric transformation that can be recovered using a small set of anchors. Few‑shot classification provides a natural setting for this alignment, as the limited labeled samples serve as the anchors required to estimate this transformation. Motivated by this hypothesis, we introduce BiCLIP, a framework that applies a targeted transformation to multimodal features to enhance cross‑modal alignment. Our approach is characterized by its extreme simplicity and low parameter footprint. Extensive evaluations across 11 standard benchmarks, including EuroSAT, DTD, and FGVCAircraft, demonstrate that BiCLIP consistently achieves state‑of‑the‑art results. Furthermore, we provide empirical verification of existing geometric findings by analyzing the orthogonality and angular distribution of the learned transformations, confirming that structured alignment is the key to robust domain adaptation. Code is available at https://github.com/QuantitativeImagingLaboratory/BilinearCLIP
Authors:Brian Isett, Rebekah Dadey, Aofei Li, Ryan C. Augustin, Kate Smith, Aatur D. Singhi, Qiangqiang Gu, Riyue Bao
Abstract:
Accurate localization of tumor regions from hematoxylin and eosin‑stained whole‑slide images is fundamental for translational research including spatial analysis, molecular profiling, and tissue architecture investigation. However, deep learning‑based tumor detection trained within specific cancers may exhibit reduced robustness when applied across different tumor types. We investigated whether balanced training across cancers at modest scale can achieve high performance and generalize to unseen tumor types. A multi‑cancer tumor localization model (MuCTaL) was trained on 79,984 non‑overlapping tiles from four cancers (melanoma, hepatocellular carcinoma, colorectal cancer, and non‑small cell lung cancer) using transfer learning with DenseNet169. The model achieved a tile‑level ROC‑AUC of 0.97 in validation data from the four training cancers, and 0.71 on an independent pancreatic ductal adenocarcinoma cohort. A scalable inference workflow was built to generate spatial tumor probability heatmaps compatible with existing digital pathology tools. Code and models are publicly available at https://github.com/AivaraX‑AI/MuCTaL.
Authors:Cornelius Emde, Alexander Rubinstein, Anmol Goel, Ahmed Heakl, Sangdoo Yun, Seong Joon Oh, Martin Gubri
Abstract:
The rapid adoption of LLM‑based agentic systems has produced a rich ecosystem of frameworks (smolagents, LangGraph, AutoGen, CAMEL, LlamaIndex, i.a.). Yet existing benchmarks are model‑centric: they fix the agentic setup and do not compare other system components. We argue that implementation decisions substantially impact performance, including choices such as topology, orchestration logic, and error handling. MASEval addresses this evaluation gap with a framework‑agnostic library that treats the entire system as the unit of analysis. Through a systematic system‑level comparison across 3 benchmarks, 3 models, and 3 frameworks, we find that framework choice matters as much as model choice. MASEval allows researchers to explore all components of agentic systems, opening new avenues for principled system design, and practitioners to identify the best implementation for their use case. MASEval is available under the MIT licence https://github.com/parameterlab/MASEval.
Authors:Shijia Liao, Yuxuan Wang, Songting Liu, Yifan Cheng, Ruoyi Zhang, Tianyu Li, Shidong Li, Yisheng Zheng, Xingwei Liu, Qingzheng Wang, Zhizhuo Zhou, Jiahua Liu, Xin Chen, Dawei Han
Abstract:
We introduce Fish Audio S2, an open‑sourced text‑to‑speech system featuring multi‑speaker, multi‑turn generation, and, most importantly, instruction‑following control via natural‑language descriptions. To scale training, we develop a multi‑stage training recipe together with a staged data pipeline covering video captioning and speech captioning, voice‑quality assessment, and reward modeling. To push the frontier of open‑source TTS, we release our model weights, fine‑tuning code, and an SGLang‑based inference engine. The inference engine is production‑ready for streaming, achieving an RTF of 0.195 and a time‑to‑first‑audio below 100 ms.Our code and weights are available on GitHub (https://github.com/fishaudio/fish‑speech) and Hugging Face (https://huggingface.co/fishaudio/s2‑pro). We highly encourage readers to visit https://fish.audio to try custom voices.
Authors:Tzafrir Rehan
Abstract:
We present Test‑Driven AI Agent Definition (TDAD), a methodology that treats agent prompts as compiled artifacts: engineers provide behavioral specifications, a coding agent converts them into executable tests, and a second coding agent iteratively refines the prompt until tests pass. Deploying tool‑using LLM agents in production requires measurable behavioral compliance that current development practices cannot provide. Small prompt changes cause silent regressions, tool misuse goes undetected, and policy violations emerge only after deployment. To mitigate specification gaming, TDAD introduces three mechanisms: (1) visible/hidden test splits that withhold evaluation tests during compilation, (2) semantic mutation testing via a post‑compilation agent that generates plausible faulty prompt variants, with the harness measuring whether the test suite detects them, and (3) spec evolution scenarios that quantify regression safety when requirements change. We evaluate TDAD on SpecSuite‑Core, a benchmark of four deeply‑specified agents spanning policy compliance, grounded analytics, runbook adherence, and deterministic enforcement. Across 24 independent trials, TDAD achieves 92% v1 compilation success with 97% mean hidden pass rate; evolved specifications compile at 58%, with most failed runs passing all visible tests except 1‑2, and show 86‑100% mutation scores, 78% v2 hidden pass rate, and 97% regression safety scores. The implementation is available as an open benchmark at https://github.com/f‑labs‑io/tdad‑paper‑code.
Authors:Muyukani Kizito
Abstract:
We present Turn, a compiled, actor‑based programming language ‑‑ statically typed for schema inference, dynamically typed at the value level ‑‑ for agentic software: programs that reason and act autonomously by delegating inference to large language models (LLMs). Existing approaches augment general‑purpose languages with frameworks, encoding critical invariants (bounded context, typed inference output, credential isolation, durable state) as application‑level conventions rather than language guarantees. Turn introduces five language‑level constructs that address this gap. \emphCognitive Type Safety makes LLM inference a typed primitive: the compiler generates a JSON Schema from a struct definition and the VM validates model output before binding. The \emphconfidence operator enables deterministic control flow gated on model certainty. Turn's \emphactor‑based process model, derived from Erlang, gives each agent an isolated context window, persistent memory, and mailbox. A \emphcapability‑based identity system returns opaque, unforgeable handles from the VM host, ensuring raw credentials never enter agent memory. Finally, \emphcompile‑time schema absorption (\textttuse schema::<protocol>) synthesizes typed API bindings from external specifications at compile time; the \textttopenapi adapter is shipped with \textttgraphql, \textttfhir, and \textttmcp in active development. We describe the language design, type rules, schema semantics, and a Rust‑based bytecode VM, and evaluate Turn against representative agentic workloads. Turn is open source at https://github.com/ekizito96/Turn.
Authors:Jianlong Lei, Shashikant Ilager
Abstract:
Large Language Models (LLMs) are increasingly deployed in scenarios demanding ultra‑long context reasoning, such as agentic workflows and deep research understanding. However, long‑context inference is constrained by the KV cache, a transient memory structure that grows linearly with sequence length and batch size, quickly dominating GPU memory usage. Existing memory reduction techniques, including eviction and quantization, often rely on static heuristics and suffer from degraded quality under tight budgets. In this paper, we propose ARKV, a lightweight and adaptive framework that dynamically allocates precision levels to cached tokens based on per‑layer attention dynamics and token‑level importance. During a short prefill phase, ARKV estimates the original quantization (OQ) ratio of each layer by computing statistical scores such as attention entropy, variance and kurtosis. During decoding, tokens are assigned to one of three states, Original (full precision), Quantization (low precision), or Eviction, according to a fast heavy‑hitter scoring strategy. Our experiments on LLaMA3 and Qwen3 models across diverse long‑ and short‑context tasks demonstrate that ARKV preserves ~97% of baseline accuracy on long‑context benchmarks while reducing KV memory usage by 4x, with minimal throughput loss. On short‑context tasks, ARKV matches full‑precision baselines; on GSM8K math reasoning, it significantly outperforms uniform quantization. These results highlight the practical viability of ARKV for scalable LLM deployment, offering fine‑grained, data‑driven memory control without retraining or architectural modifications. The source code and artifacts can be found in: https://github.com/Large‑scale‑Sustainable‑Computing‑LSC/ARKV
Authors:Andrew Chin, Dongkwan Kim, Yu-Fu Fu, Fabian Fleischer, Youngjoon Kim, HyungSeok Han, Cen Zhang, Brian Junekyu Lee, Hanqing Zhao, Taesoo Kim
Abstract:
DARPA's AI Cyber Challenge (AIxCC) showed that cyber reasoning systems (CRSs) can go beyond vulnerability discovery to autonomously confirm and patch bugs: seven teams built such systems and open‑sourced them after the competition. Yet all seven open‑sourced CRSs remain largely unusable outside their original teams, each bound to the competition cloud infrastructure that no longer exists. We present OSS‑CRS, an open, locally deployable framework for running and combining CRS techniques against real‑world open‑source projects, with budget‑aware resource management. We ported the first‑place system (Atlantis) and discovered 10 previously unknown bugs (three of high severity) across 8 OSS‑Fuzz projects. OSS‑CRS is publicly available.
Authors:Yehonatan Elisha, Oren Barkan, Noam Koenigstein
Abstract:
Vision Transformers (ViTs) often degrade under distribution shifts because they rely on spurious correlations, such as background cues, rather than semantically meaningful features. Existing regularization methods, typically relying on simple foreground‑background masks, which fail to capture the fine‑grained semantic concepts that define an object (e.g., ``long beak'' and ``wings'' for a ``bird''). As a result, these methods provide limited robustness to distribution shifts. To address this limitation, we introduce a novel finetuning framework that steers model reasoning toward concept‑level semantics. Our approach optimizes the model's internal relevance maps to align with spatially grounded concept masks. These masks are generated automatically, without manual annotation: class‑relevant concepts are first proposed using an LLM‑based, label‑free method, and then segmented using a VLM. The finetuning objective aligns relevance with these concept regions while simultaneously suppressing focus on spurious background areas. Notably, this process requires only a minimal set of images and uses half of the dataset classes. Extensive experiments on five out‑of‑distribution benchmarks demonstrate that our method improves robustness across multiple ViT‑based models. Furthermore, we show that the resulting relevance maps exhibit stronger alignment with semantic object parts, offering a scalable path toward more robust and interpretable vision models. Finally, we confirm that concept‑guided masks provide more effective supervision for model robustness than conventional segmentation maps, supporting our central hypothesis.
Authors:Weining Ren, Xiao Tan, Kai Han
Abstract:
While recent feed‑forward 3D reconstruction models accelerate 3D reconstruction by jointly inferring dense geometry and camera poses in a single pass, their reliance on dense attention imposes a quadratic complexity, creating a prohibitive computational bottleneck that severely limits inference speed. To resolve this, we introduce Speed3R, an end‑to‑end trainable model inspired by the core principle of Structure‑from‑Motion: that a sparse set of keypoints is sufficient for robust pose estimation. Speed3R features a dual‑branch attention mechanism where a compression branch creates a coarse contextual prior to guide a selection branch, which performs fine‑grained attention only on the most informative image tokens. This strategy mimics the efficiency of traditional keypoint matching, achieving a remarkable 12.4x inference speedup on 1000‑view sequences, while introducing a minimal, controlled trade‑off in geometric accuracy. Validated on standard benchmarks with both VGGT and π^3 backbones, our method delivers high‑quality reconstructions at a fraction of computational cost, paving the way for efficient large‑scale scene modeling.
Authors:Sangjune Park, Inhyeok Choi, Donghyeon Soon, Youngwoo Jeon, Kyungdon Joo
Abstract:
Dance is a form of human motion characterized by emotional expression and communication, playing a role in various fields such as music, virtual reality, and content creation. Existing methods for dance generation often fail to adequately capture the inherently sequential, rhythmical, and music‑synchronized characteristics of dance. In this paper, we propose \emphMambaDance, a new dance generation approach that leverages a Mamba‑based diffusion model. Mamba, well‑suited to handling long and autoregressive sequences, is integrated into our two‑stage diffusion architecture, substituting off‑the‑shelf Transformer. Additionally, considering the critical role of musical beats in dance choreography, we propose a Gaussian‑based beat representation to explicitly guide the decoding of dance sequences. Experiments on AIST++ and FineDance datasets for each sequence length show that our proposed method effectively generates plausible dance movements while reflecting essential characteristics, consistently from short to long dances, compared to the previous methods. Additional qualitative results and demo videos are available at \smallhttps://vision3d‑lab.github.io/mambadance.
Authors:Yusong Wang, Chuang Yang, Jiawei Wang, Xiaohang Xu, Jiayi Xu, Dongyuan Li, Chuan Xiao, Renhe Jiang
Abstract:
Human mobility generation aims to synthesize plausible trajectory data, which is widely used in urban system research. While Large Language Model‑based methods excel at generating routine trajectories, they struggle to capture deviated mobility during large‑scale societal events. This limitation stems from two critical gaps: (1) the absence of event‑annotated mobility datasets for design and evaluation, and (2) the inability of current frameworks to reconcile competitions between users' habitual patterns and event‑imposed constraints when making trajectory decisions. This work addresses these gaps with a twofold contribution. First, we construct the first event‑annotated mobility dataset covering three major events: Typhoon Hagibis, COVID‑19, and the Tokyo 2021 Olympics. Second, we propose ELLMob, a self‑aligned LLM framework that first extracts competing rationales between habitual patterns and event constraints, based on Fuzzy‑Trace Theory, and then iteratively aligns them to generate trajectories that are both habitually grounded and event‑responsive. Extensive experiments show that ELLMob wins state‑of‑the‑art baselines across all events, demonstrating its effectiveness. Our codes and datasets are available at https://github.com/deepkashiwa20/ELLMob.
Authors:Sunghyun Baek, Jaemyung Yu, Seunghee Koh, Minsu Kim, Hyeonseong Jeon, Junmo Kim
Abstract:
Test‑time adaptation (TTA) has been widely explored to prevent performance degradation when test data differ from the training distribution. However, fully leveraging the rich representations of large pretrained models with minimal parameter updates remains underexplored. In this paper, we propose Intrinsic Mixture of Spectral Experts (IMSE) that leverages the spectral experts inherently embedded in Vision Transformers. We decompose each linear layer via singular value decomposition (SVD) and adapt only the singular values, while keeping the singular vectors fixed. We further identify a key limitation of entropy minimization in TTA: it often induces feature collapse, causing the model to rely on domain‑specific features rather than class‑discriminative features. To address this, we propose a diversity maximization loss based on expert‑input alignment, which encourages diverse utilization of spectral experts during adaptation. In the continual test‑time adaptation (CTTA) scenario, beyond preserving pretrained knowledge, it is crucial to retain and reuse knowledge from previously observed domains. We introduce Domain‑Aware Spectral Code Retrieval, which estimates input distributions to detect domain shifts, and retrieves adapted singular values for rapid adaptation. Consequently, our method achieves state‑of‑the‑art performance on various distribution‑shift benchmarks under the TTA setting. In CTTA and Gradual CTTA, it further improves accuracy by 3.4 percentage points (pp) and 2.4 pp, respectively, while requiring 385 times fewer trainable parameters. Our code is available at https://github.com/baek85/IMSE.
Authors:Hansi Zeng, Zoey Li, Yifan Gao, Chenwei Zhang, Xiaoman Pan, Tao Yang, Fengran Mo, Jiacheng Lin, Xian Li, Jingbo Shang
Abstract:
Research Agents enable models to gather information from the web using tools to answer user queries, requiring them to dynamically interleave internal reasoning with tool use. While such capabilities can in principle be learned via reinforcement learning with verifiable rewards (RLVR), we observe that agents often exhibit poor exploration behaviors, including premature termination and biased tool usage. As a result, RLVR alone yields limited improvements. We propose SynPlanResearch‑R1, a framework that synthesizes tool‑use trajectories that encourage deeper exploration to shape exploration during cold‑start supervised fine‑tuning, providing a strong initialization for subsequent RL. Across seven multi‑hop and open‑web benchmarks, \framework improves performance by up to 6.0% on Qwen3‑8B and 5.8% on Qwen3‑4B backbones respectively compared to SOTA baselines. Further analyses of tool‑use patterns and training dynamics compared to baselines shed light on the factors underlying these gains. Our code is publicly available at https://github.com/HansiZeng/syn‑plan‑research.
Authors:Erik Miehling, Karthikeyan Natesan Ramamurthy, Praveen Venkateswaran, Irene Ko, Pierre Dognin, Moninder Singh, Tejaswini Pedapati, Avinash Balakrishnan, Matthew Riemer, Dennis Wei, Inge Vejsbjerg, Elizabeth M. Daly, Kush R. Varshney
Abstract:
The AI Steerability 360 toolkit is an extensible, open‑source Python library for steering LLMs. Steering abstractions are designed around four model control surfaces: input (modification of the prompt), structural (modification of the model's weights or architecture), state (modification of the model's activations and attentions), and output (modification of the decoding or generation process). Steering methods exert control on the model through a common interface, termed a steering pipeline, which additionally allows for the composition of multiple steering methods. Comprehensive evaluation and comparison of steering methods/pipelines is facilitated by use case classes (for defining tasks) and a benchmark class (for performance comparison on a given task). The functionality provided by the toolkit significantly lowers the barrier to developing and comprehensively evaluating steering methods. The toolkit is Hugging Face native and is released under an Apache 2.0 license at https://github.com/IBM/AISteer360.
Authors:A. J. W. de Vink, Filippos Karolos Ventirozos, Natalia Amat-Lefort, Lifeng Han
Abstract:
We present our system for SemEval‑2026 Task 3 on dimensional aspect‑based sentiment regression. Our approach combines a hybrid RoBERTa encoder, which jointly predicts sentiment using regression and discretized classification heads, with large language models (LLMs) via prediction‑level ensemble learning. The hybrid encoder improves prediction stability by combining continuous and discretized sentiment representations. We further explore in‑context learning with LLMs and ridge‑regression stacking to combine encoder and LLM predictions. Experimental results on the development set show that ensemble learning significantly improves performance over individual models, achieving substantial reductions in RMSE and improvements in correlation scores. Our findings demonstrate the complementary strengths of encoder‑based and LLM‑based approaches for dimensional sentiment analysis. Our development code and resources will be shared at https://github.com/aaronlifenghan/ABSentiment
Authors:Yihong Luo, Tianyang Hu, Weijian Luo, Jing Tang
Abstract:
While few‑step generative models have enabled powerful image and video generation at significantly lower cost, generic reinforcement learning (RL) paradigms for few‑step models remain an unsolved problem. Existing RL approaches for few‑step diffusion models strongly rely on back‑propagating through differentiable reward models, thereby excluding the majority of important real‑world reward signals, e.g., non‑differentiable rewards such as humans' binary likeness, object counts, etc. To properly incorporate non‑differentiable rewards to improve few‑step generative models, we introduce TDM‑R1, a novel reinforcement learning paradigm built upon a leading few‑step model, Trajectory Distribution Matching (TDM). TDM‑R1 decouples the learning process into surrogate reward learning and generator learning. Furthermore, we developed practical methods to obtain per‑step reward signals along the deterministic generation trajectory of TDM, resulting in a unified RL post‑training method that significantly improves few‑step models' ability with generic rewards. We conduct extensive experiments ranging from text‑rendering, visual quality, and preference alignment. All results demonstrate that TDM‑R1 is a powerful reinforcement learning paradigm for few‑step text‑to‑image models, achieving state‑of‑the‑art reinforcement learning performances on both in‑domain and out‑of‑domain metrics. Furthermore, TDM‑R1 also scales effectively to the recent strong Z‑Image model, consistently outperforming both its 100‑NFE and few‑step variants with only 4 NFEs. Project page: https://github.com/Luo‑Yihong/TDM‑R1
Authors:Yuhang Wang, Hai Li, Shujuan Hou, Zhetao Dong, Xiaoyao Yang
Abstract:
In bandwidth‑limited online video streaming, videos are usually downsampled and compressed. Although recent online video super‑resolution (online VSR) approaches achieve promising results, they are still compute‑intensive and fall short of real‑time processing at higher resolutions, due to complex motion estimation for alignment and redundant processing of consecutive frames. To address these issues, we propose a compressed‑domain‑aware network (CDA‑VSR) for online VSR, which utilizes compressed‑domain information, including motion vectors, residual maps, and frame types to balance quality and efficiency. Specifically, we propose a motion‑vector‑guided deformable alignment module that uses motion vectors for coarse warping and learns only local residual offsets for fine‑tuned adjustments, thereby maintaining accuracy while reducing computation. Then, we utilize a residual map gated fusion module to derive spatial weights from residual maps, suppressing mismatched regions and emphasizing reliable details. Further, we design a frame‑type‑aware reconstruction module for adaptive compute allocation across frame types, balancing accuracy and efficiency. On the REDS4 dataset, our CDA‑VSR surpasses the state‑of‑the‑art method TMP, with a maximum PSNR improvement of 0.13 dB while delivering more than double the inference speed. The code will be released at https://github.com/sspBIT/CDA‑VSR.
Authors:Ningjing Fan, Yiqun Wang, Dongming Yan, Peter Wonka
Abstract:
Reflective appearance, especially strong and typically near‑field specular reflections, poses a fundamental challenge for accurate surface reconstruction and novel view synthesis. Existing Gaussian splatting methods either fail to model near‑field specular reflections or rely on explicit ray tracing at substantial computational cost. We present Ref‑DGS, a reflective dual Gaussian splatting framework that addresses this trade‑off by decoupling surface reconstruction from specular reflection within an efficient rasterization‑based pipeline. Ref‑DGS introduces a dual Gaussian scene representation consisting of geometry Gaussians and complementary local reflection Gaussians that capture near‑field specular interactions without explicit ray tracing, along with a global environment reflection field for modeling far‑field specular reflections. To predict specular radiance, we further propose a lightweight, physically‑aware adaptive mixing shader that fuses global and local reflection features. Experiments demonstrate that Ref‑DGS achieves state‑of‑the‑art performance on reflective scenes while training substantially faster than ray‑based Gaussian methods.
Authors:Likui Zhang, Tao Tang, Zhihao Zhan, Xiuwei Chen, Zisheng Chen, Jianhua Han, Jiangtong Zhu, Pei Xu, Hang Xu, Hefeng Wu, Liang Lin, Xiaodan Liang
Abstract:
Recent advances in Visual‑Language‑Action (VLA) models have shown promising potential for robotic manipulation tasks. However, real‑world robotic tasks often involve long‑horizon, multi‑step problem‑solving and require generalization for continual skill acquisition, extending beyond single actions or skills. These challenges present significant barriers for existing VLA models, which use monolithic action decoders trained on aggregated data, resulting in poor scalability. To address these challenges, we propose AtomicVLA, a unified planning‑and‑execution framework that jointly generates task‑level plans, atomic skill abstractions, and fine‑grained actions. AtomicVLA constructs a scalable atomic skill library through a Skill‑Guided Mixture‑of‑Experts (SG‑MoE), where each expert specializes in mastering generic yet precise atomic skills. Furthermore, we introduce a flexible routing encoder that automatically assigns dedicated atomic experts to new skills, enabling continual learning. We validate our approach through extensive experiments. In simulation, AtomicVLA outperforms π_0 by 2.4% on LIBERO, 10% on LIBERO‑LONG, and outperforms π_0 and π_0.5 by 0.22 and 0.25 in average task length on CALVIN. Additionally, our AtomicVLA consistently surpasses baselines by 18.3% and 21% in real‑world long‑horizon tasks and continual learning. These results highlight the effectiveness of atomic skill abstraction and dynamic expert composition for long‑horizon and lifelong robotic tasks. The project page is \hrefhttps://zhanglk9.github.io/atomicvla‑web/here.
Authors:Fei Cheng, Ribeka Tanaka, Sadao Kurohashi
Abstract:
Clinical information extraction (e.g., 2010 i2b2/VA challenge) usually presents tasks of concept recognition, assertion classification, and relation extraction. Jointly modeling the multi‑stage tasks in the clinical domain is an underexplored topic. The existing independent task setting (reference inputs given in each stage) makes the joint models not directly comparable to the existing pipeline work. To address these issues, we define a joint task setting and propose a novel end‑to‑end system to jointly optimize three‑stage tasks. We empirically investigate the joint evaluation of our proposal and the pipeline baseline with various embedding techniques: word, contextual, and in‑domain contextual embeddings. The proposed joint system substantially outperforms the pipeline baseline by +0.3, +1.4, +3.1 for the concept, assertion, and relation F1. This work bridges joint approaches and clinical information extraction. The proposed approach could serve as a strong joint baseline for future research. The code is publicly available.
Authors:Yige Li, Wei Zhao, Zhe Li, Nay Myat Min, Hanxun Huang, Yunhan Zhao, Xingjun Ma, Yu-Gang Jiang, Jun Sun
Abstract:
Backdoor mechanisms have traditionally been studied as security threats that compromise the integrity of machine learning models. However, the same mechanism ‑‑ the conditional activation of specific behaviors through input triggers ‑‑ can also serve as a controllable and auditable interface for trustworthy model behavior. In this work, we present Backdoor4Good (B4G), a unified benchmark and framework for beneficial backdoor applications in large language models (LLMs). Unlike conventional backdoor studies focused on attacks and defenses, B4G repurposes backdoor conditioning for Beneficial Tasks that enhance safety, controllability, and accountability. It formalizes beneficial backdoor learning under a triplet formulation (T, A, U), representing the \emphTrigger, \emphActivation mechanism, and \emphUtility function, and implements a benchmark covering four trust‑centric applications. Through extensive experiments across Llama3.1‑8B, Gemma‑2‑9B, Qwen2.5‑7B, and Llama2‑13B, we show that beneficial backdoors can achieve high controllability, tamper‑resistance, and stealthiness while preserving clean‑task performance. Our findings demonstrate new insights that backdoors need not be inherently malicious; when properly designed, they can serve as modular, interpretable, and beneficial building blocks for trustworthy AI systems. Our code and datasets are available at https://github.com/bboylyg/BackdoorLLM/B4G.
Authors:Xiang Zhang, Hongming Xu, Le Zhou, Wei Zhou, Xuanhe Zhou, Guoliang Li, Yuyu Luo, Changdong Liu, Guorun Chen, Jiang Liao, Fan Wu
Abstract:
Enterprises commonly deploy heterogeneous database systems, each of which owns a distinct SQL dialect with different syntax rules, built‑in functions, and execution constraints. However, most existing NL2SQL methods assume a single dialect (e.g., SQLite) and struggle to produce queries that are both semantically correct and executable on target engines. Prompt‑based approaches tightly couple intent reasoning with dialect syntax, rule‑based translators often degrade native operators into generic constructs, and multi‑dialect fine‑tuning suffers from cross‑dialect interference. In this paper, we present Dial, a knowledge‑grounded framework for dialect‑specific NL2SQL. Dial introduces: (1) a Dialect‑Aware Logical Query Planning module that converts natural language into a dialect‑aware logical query plan via operator‑level intent decomposition and divergence‑aware specification; (2) HINT‑KB, a hierarchical intent‑aware knowledge base that organizes dialect knowledge into (i) a canonical syntax reference, (ii) a declarative function repository, and (iii) a procedural constraint repository; and (3) an execution‑driven debugging and semantic verification loop that separates syntactic recovery from logic auditing to prevent semantic drift. We construct DS‑NL2SQL, a benchmark covering six major database systems with 2,218 dialect‑specific test cases. Experimental results show that Dial consistently improves translation accuracy by 10.25% and dialect feature coverage by 15.77% over state‑of‑the‑art baselines. The code is at https://github.com/weAIDB/Dial.
Authors:Changyi Li, Pengfei Lu, Xudong Pan, Fazl Barez, Min Yang
Abstract:
As Large Language Models (LLMs) evolve into autonomous agents, existing safety evaluations face a fundamental trade‑off: manual benchmarks are costly, while LLM‑based simulators are scalable but suffer from logic hallucination. We present AutoControl Arena, an automated framework for frontier AI risk evaluation built on the principle of logic‑narrative decoupling. By grounding deterministic state in executable code while delegating generative dynamics to LLMs, we mitigate hallucination while maintaining flexibility. This principle, instantiated through a three‑agent framework, achieves over 98% end‑to‑end success and 60% human preference over existing simulators. To elicit latent risks, we vary environmental Stress and Temptation across X‑Bench (70 scenarios, 7 risk categories). Evaluating 9 frontier models reveals: (1) Alignment Illusion: risk rates surge from 21.7% to 54.5% under pressure, with capable models showing disproportionately larger increases; (2) Scenario‑Specific Safety Scaling: advanced reasoning improves robustness for direct harms but worsens it for gaming scenarios; and (3) Divergent Misalignment Patterns: weaker models cause non‑malicious harm while stronger models develop strategic concealment.
Authors:Donghoon Kim, Minji Bae, Unghui Nam, Gyeonghun Kim, Suyun Lee, Kyuhong Shim, Byonghyo Shim
Abstract:
Vision language action models (VLAs) are increasingly used for Physical AI, but deploying a pre‑trained VLA model to unseen environments, embodiments, or tasks still requires adaptation. Parameter‑efficient fine‑tuning (PEFT), especially LoRA, is common for VLA policies, yet the exposed capacity knob, the rank, does not transfer uniformly: robotics transfer exhibits a higher and task‑varying intrinsic rank than language fine‑tuning. Small ranks suffice for LLMs (e.g., r \in \4, 8\), while spectral analyses indicate VLAs may require much larger ranks (e.g., r \approx 128) or near‑full rank, a mismatch that worsens in multi‑task settings. We present LoRA‑SP (Select‑Prune), a rank‑adaptive fine‑tuning method that replaces fixed‑rank updates with input‑ and layer‑wise capacity. LoRA‑SP uses an SVD‑style parameterization with a small router whose nonnegative scores act as singular values over a shared vector bank. The active set is chosen by an energy target on the cumulative squared scores E(k) \ge η, providing a direct link to approximation error via our spectral analysis. During training, η concentrates energy on a few directions and teaches the router to rely on fewer vectors while preserving accuracy. This yields compact adapters that reduce cross‑task interference and improve generalization. On four real‑robot manipulation tasks collected on an unseen AgileX PiPER arm, across two VLA backbones (π_0 and SmolVLA), LoRA‑SP matches or exceeds full fine‑tuning with far fewer trainable parameters, and improves multi‑task success by up to 31.6% over standard LoRA while remaining robust to rank choice.
Authors:Antonio De Santis, Schrasing Tong, Marco Brambilla, Lalana Kagal
Abstract:
Concept Bottleneck Models (CBMs) aim for ante‑hoc interpretability by learning a bottleneck layer that predicts interpretable concepts before the decision. State‑of‑the‑art approaches typically select which concepts to learn via human specification, open knowledge graphs, prompting an LLM, or using general CLIP concepts. However, concepts defined a‑priori may not have sufficient predictive power for the task or even be learnable from the available data. As a result, these CBMs often significantly trail their black‑box counterpart when controlling for information leakage. To address this, we introduce a novel CBM pipeline named Mechanistic CBM (M‑CBM), which builds the bottleneck directly from a black‑box model's own learned concepts. These concepts are extracted via Sparse Autoencoders (SAEs) and subsequently named and annotated on a selected subset of images using a Multimodal LLM. For fair comparison and leakage control, we also introduce the Number of Contributing Concepts (NCC), a decision‑level sparsity metric that extends the recently proposed NEC metric. Across diverse datasets, we show that M‑CBMs consistently surpass prior CBMs at matched sparsity, while improving concept predictions and providing concise explanations. Our code is available at https://github.com/Antonio‑Dee/M‑CBM.
Authors:Hyesu Lim, Jinho Choi, Taekyung Kim, Byeongho Heo, Jaegul Choo, Dongyoon Han
Abstract:
High‑performing vision language models still produce incorrect answers, yet their failure modes are often difficult to explain. To make model internals more accessible and enable systematic debugging, we introduce VisualScratchpad, an interactive interface for visual concept analysis during inference. We apply sparse autoencoders to the vision encoder and link the resulting visual concepts to text tokens via text‑to‑image attention, allowing us to examine which visual concepts are both captured by the vision encoder and utilized by the language model. VisualScratchpad also provides a token‑latent heatmap view that suggests a sufficient set of latents for effective concept ablation in causal analysis. Through case studies, we reveal three underexplored failure modes: limited cross‑modal alignment, misleading visual concepts, and unused hidden cues. Project page: https://hyesulim.github.io/visual_scratchpad_projectpage/
Authors:Yoshiki Tanaka, Ryuichi Uehara, Koji Inoue, Michimasa Inaba
Abstract:
Emotion Recognition in Conversation (ERC) is critical for enabling natural human‑machine interactions. However, existing methods predominantly employ categorical or dimensional emotion annotations, which often fail to adequately represent complex, subtle, or culturally specific emotional nuances. To overcome this limitation, we propose a novel task named Emotion Transcription in Conversation (ETC). This task focuses on generating natural language descriptions that accurately reflect speakers' emotional states within conversational contexts. To address the ETC, we constructed a Japanese dataset comprising text‑based dialogues annotated with participants' self‑reported emotional states, described in natural language. The dataset also includes emotion category labels for each transcription, enabling quantitative analysis and its application to ERC. We benchmarked baseline models, finding that while fine‑tuning on our dataset enhances model performance, current models still struggle to infer implicit emotional states. The ETC task will encourage further research into more expressive emotion understanding in dialogue. The dataset is publicly available at https://github.com/UEC‑InabaLab/ETCDataset.
Authors:Muhammad Khalifa, Zohaib Khan, Omer Tafveez, Hao Peng, Lu Wang
Abstract:
Reward hacking is a form of misalignment in which models overoptimize proxy rewards without genuinely solving the underlying task. Precisely measuring reward hacking occurrence remains challenging because true task rewards are often expensive or impossible to compute. We introduce Countdown‑Code, a minimal environment where models can both solve a mathematical reasoning task and manipulate the test harness. This dual‑access design creates a clean separation between proxy rewards (test pass/fail) and true rewards (mathematical correctness), enabling accurate measurement of reward‑hacking rates. Using this environment, we study reward hacking in open‑weight LLMs and find that such behaviors can be unintentionally learned during supervised fine‑tuning (SFT) when even a small fraction of reward‑hacking trajectories leak into training data. As little as 1% contamination in distillation SFT data is sufficient for models to internalize reward hacking which resurfaces during subsequent reinforcement learning (RL). We further show that RL amplifies misalignment and drives its generalization beyond the original domain. We open‑source our environment and code to facilitate future research on reward hacking in LLMs. Our results reveal a previously underexplored pathway through which reward hacking can emerge and persist in LLMs, underscoring the need for more rigorous validation of synthetic SFT data. Code is available at https://github.com/zohaib‑khan5040/Countdown‑Code.
Authors:Trong-Thang Pham, Loc Nguyen, Anh Nguyen, Hien Nguyen, Ngan Le
Abstract:
Generative diffusion models are increasingly used for medical imaging data augmentation, but text prompting cannot produce causal training data. Re‑prompting rerolls the entire generation trajectory, altering anatomy, texture, and background. Inversion‑based editing methods introduce reconstruction error that causes structural drift. We propose MedSteer, a training‑free activation‑steering framework for endoscopic synthesis. MedSteer identifies a pathology vector for each contrastive prompt pair in the cross‑attention layers of a diffusion transformer. At inference time, it steers image activations along this vector, generating counterfactual pairs from scratch where the only difference is the steered concept. All other structure is preserved by construction. We evaluate MedSteer across three experiments on Kvasir v3 and HyperKvasir. On counterfactual generation across three clinical concept pairs, MedSteer achieves flip rates of 0.800, 0.925, and 0.950, outperforming the best inversion‑based baseline in both concept flip rate and structural preservation. On dye disentanglement, MedSteer achieves 75% dye removal against 20% (PnP) and 10% (h‑Edit). On downstream polyp detection, augmenting with MedSteer counterfactual pairs achieves ViT AUC of 0.9755 versus 0.9083 for quantity‑matched re‑prompting, confirming that counterfactual structure drives the gain. Code is at link https://github.com/phamtrongthang123/medsteer
Authors:Lance Legel, Qin Huang, Brandon Voelker, Daniel Neamati, Patrick Alan Johnson, Favyen Bastani, Jeff Rose, James Ryan Hennessy, Robert Guralnick, Douglas Soltis, Pamela Soltis, Shaowen Wang
Abstract:
We present DeepEarth, a self‑supervised multi‑modal world model with Earth4D, a novel planetary‑scale 4D space‑time positional encoder. Earth4D extends 3D multi‑resolution hash encoding to include time, efficiently scaling across the planet over centuries with sub‑meter, sub‑second precision. Multi‑modal encoders (e.g. vision‑language models) are fused with Earth4D embeddings and trained via masked reconstruction. We demonstrate Earth4D's expressive power by achieving state‑of‑the‑art performance on an ecological forecasting benchmark. Earth4D with learnable hash probing surpasses a multi‑modal foundation model pre‑trained on substantially more data. Access open source code and download models at: https://github.com/legel/deepearth
Authors:Sofiane Ouaari, Jules Kreuer, Nico Pfeifer
Abstract:
DNA foundation models have become transformative tools in bioinformatics and healthcare applications. Trained on vast genomic datasets, these models can be used to generate sequence embeddings, dense vector representations that capture complex genomic information. These embeddings are increasingly being shared via Embeddings‑as‑a‑Service (EaaS) frameworks to facilitate downstream tasks, while supposedly protecting the privacy of the underlying raw sequences. However, as this practice becomes more prevalent, the security of these representations is being called into question. This study evaluates the resilience of DNA foundation models to model inversion attacks, whereby adversaries attempt to reconstruct sensitive training data from model outputs. In our study, the model's output for reconstructing the DNA sequence is a zero‑shot embedding, which is then fed to a decoder. We evaluated the privacy of three DNA foundation models: DNABERT‑2, Evo 2, and Nucleotide Transformer v2 (NTv2). Our results show that per‑token embeddings allow near‑perfect sequence reconstruction across all models. For mean‑pooled embeddings, reconstruction quality degrades as sequence length increases, though it remains substantially above random baselines. Evo 2 and NTv2 prove to be most vulnerable, especially for shorter sequences with reconstruction similarities > 90%, while DNABERT‑2's BPE tokenization provides the greatest resilience. We found that the correlation between embedding similarity and sequence similarity was a key predictor of reconstruction success. Our findings emphasize the urgent need for privacy‑aware design in genomic foundation models prior to their widespread deployment in EaaS settings. Training code, model weights and evaluation pipeline are released on: https://github.com/not‑a‑feature/DNA‑Embedding‑Inversion.
Authors:Yanjun Chen, Yirong Sun, Hanlin Wang, Xinming Zhang, Xiaoyu Shen, Wenjie Li, Wei Zhang
Abstract:
Cooperative multi‑agent reinforcement learning (MARL) systems powered by large language models (LLMs) are frequently optimized via sparse terminal‑only feedback. This shared signal entangles upstream decisions, obstructing accurate decision‑level credit assignment. To address this trajectory‑level diffusion, we introduce Contextual Counterfactual Credit Assignment (\textttC3). Instead of distributing rewards across an entire episode, \textttC3 isolates the causal impact of individual messages by freezing the exact transcript‑derived context, evaluating context‑matched alternatives via fixed‑continuation replay, and applying a leave‑one‑out (LOO) baseline. This localized intervention extracts unbiased, low‑variance marginal advantages for standard policy‑gradient optimization. Evaluated across five mathematical and coding benchmarks under matched budgets, \textttC3 improves terminal performance over established baselines. Mechanistic diagnostics further show that these gains are accompanied by higher credit fidelity, lower contextual variance, and stronger inter‑agent causal dependence. Our code is available at https://github.com/EIT‑EAST‑Lab/C3.
Authors:Gregor Baer
Abstract:
Evaluating time series attribution methods is difficult because real‑world datasets rarely provide ground truth for which time points drive a prediction. A common workaround is to generate synthetic data where class‑discriminating features are placed at known locations, but each study currently reimplements this from scratch. We introduce xaitimesynth, a Python package that provides reusable infrastructure for this evaluation approach. The package generates synthetic time series following an additive model where each sample is a sum of background signal and a localized, class‑discriminating feature, with the feature window automatically tracked as a ground truth mask. A fluent data generation API and YAML configuration format allow flexible and reproducible dataset definitions for both univariate and multivariate time series. The package also provides standard localization metrics, including AUC‑PR, AUC‑ROC, Relevance Mass Accuracy, and Relevance Rank Accuracy. xaitimesynth is open source and available at https://github.com/gregorbaer/xaitimesynth.
Authors:Sayeem Bin Zaman, Fahim Hafiz, Riasat Azim
Abstract:
Spatial transcriptomics (ST) enables mapping gene expression with spatial context but is severely affected by high sparsity and technical noise, which conceals true biological signals and hinders downstream analyses. To address these challenges, SpatialMagic was proposed, which is a hybrid imputation model combining MAGIC‑based graph diffusion with transformer‑based spatial self‑attention. The long‑range dependencies in the gene expression are captured by graph diffusion, and local neighborhood structure is captured by spatial attention models, which allow for recovering the missing expression values, retaining spatial consistency. Across multiple platforms, SpatialMagic consistently outperforms existing baselines, including MAGIC and attention‑based models, achieving peak Adjusted Rand Index (ARI) scores in clustering accuracy of 0.3301 on high‑resolution Stereo‑Seq data, 0.3074 on Slide‑Seq, and 0.4216 on the Sci‑Space dataset. Beyond quantitative improvements, SpatialMagic substantially enhances downstream biological analyses by improving the detection of both up‑ and down‑regulated genes while maintaining regulatory consistency across datasets. The pathway enrichment analysis of the recovered genes indicates that they are involved in consistent processes across key metabolic, transport, and neural signaling pathways, suggesting that the framework improves data quality while preserving biological interpretability. Overall, SpatialMagic's hybrid diffusion attention strategy and refinement module outperform state‑of‑the‑art baselines on quantitative metrics and provide a better understanding of the imputed data by preserving tissue architecture and uncovering biologically relevant genes. The source code and datasets are provided in the following link: https://github.com/sayeemzzaman/SpatialMAGIC
Authors:Jiefu Zhang, Yang Xu, Vaneet Aggarwal
Abstract:
Navigating safely through dense crowds requires collision avoidance that generalizes beyond the densities seen during training. Learning‑based crowd navigation can break under out‑of‑distribution crowd sizes due to density‑sensitive observation normalization and social‑cost scaling, while analytical solvers often remain safe but freeze in tight interactions. We propose a reinforcement learning approach for dense, variable‑density navigation that attains zero‑shot density generalization using a density‑invariant observation encoding with density‑randomized training and physics‑informed proxemic reward shaping with density‑adaptive scaling. The encoding represents the distance‑sorted K nearest pedestrians plus bounded crowd summaries, keeping input statistics stable as crowd size grows. Trained with N\!\in\![11,16] pedestrians in a 3\mathrmm×3\mathrmm arena and evaluated up to N\!=\!21 pedestrians (1.3× denser), our policy reaches the goal in >99% of episodes and achieves 86% collision‑free success in random crowds, with markedly less freezing than analytical methods and a >\!60‑point collision‑free margin over learning‑based benchmark methods. Codes are available at \hrefhttps://github.com/jznmsl/PSS‑Socialhttps://github.com/jznmsl/PSS‑Social.
Authors:Neil Tripathi
Abstract:
We present VB, a benchmark that tests whether vision‑language models can determine what is and is not visible in a photograph, and abstain when a human viewer cannot reliably answer. Each item pairs a single photo with a short yes/no visibility claim; the model must output VISIBLY_TRUE, VISIBLY_FALSE, or ABSTAIN, together with a confidence score. Items are organized into 100 families using a 2x2 design that crosses a minimal image edit with a minimal text edit, yielding 300 headline evaluation cells. Unlike prior unanswerable‑VQA benchmarks, VB tests not only whether a question is unanswerable but why (via reason codes tied to specific visibility factors), and uses controlled minimal edits to verify that model judgments change when and only when the underlying evidence changes. We score models on confidence‑aware accuracy with abstention (CAA), minimal‑edit flip rate (MEFR), confidence‑ranked selective prediction (SelRank), and second‑order perspective reasoning (ToMAcc); all headline numbers are computed on the strict XOR subset (three cells per family, 300 scored items per model). We evaluate nine models spanning flagship and prior‑generation closed‑source systems, and open‑source models from 8B to 12B parameters. GPT‑4o and Gemini 3.1 Pro effectively tie for the best composite score (0.728 and 0.727), followed by Gemini 2.5 Pro (0.678). The best open‑source model, Gemma 3 12B (0.505), surpasses one prior‑generation closed‑source system. Text‑flip robustness exceeds image‑flip robustness for six of nine models, and confidence calibration varies substantially: GPT‑4o and Gemini 2.5 Pro achieve similar accuracy yet differ sharply in selective prediction quality.
Authors:Zhen Lin, Qiujie Xie, Minjun Zhu, Shichen Li, Qiyao Sun, Enhao Gu, Yiran Ding, Ke Sun, Fang Guo, Panzhong Lu, Zhiyuan Ning, Yixuan Weng, Yue Zhang
Abstract:
High‑quality scientific illustrations are essential for communicating complex scientific and technical concepts, yet existing automated systems remain limited in editability, stylistic controllability, and efficiency. We present AutoFigure‑Edit, an end‑to‑end system that generates fully editable scientific illustrations from long‑form scientific text while enabling flexible style adaptation through user‑provided reference images. By combining long‑context understanding, reference‑guided styling, and native SVG editing, it enables efficient creation and refinement of high‑quality scientific illustrations. To facilitate further progress in this field, we release the video at https://youtu.be/10IH8SyJjAQ, full codebase at https://github.com/ResearAI/AutoFigure‑Edit and provide a website for easy access and interactive use at https://deepscientist.cc/.
Authors:Yuan Wu, Zongxian Yang, Jiayu Qian, Songpan Gao, Guanxing Chen, Qiankun Li, Yu-An Huang, Zhi-An Huang
Abstract:
Large vision‑language models (VLMs) often benefit from chain‑of‑thought (CoT) prompting in general domains, yet its efficacy in medical vision‑language tasks remains underexplored. We report a counter‑intuitive trend: on medical visual question answering, CoT frequently underperforms direct answering (DirA) across general‑purpose and medical‑specific models. We attribute this to a \emphmedical perception bottleneck: subtle, domain‑specific cues can weaken visual grounding, and CoT may compound early perceptual uncertainty rather than correct it. To probe this hypothesis, we introduce two training‑free, inference‑time grounding interventions: (i) \emphperception anchoring via region‑of‑interest cues and (ii) \emphdescription grounding via high‑quality textual guidance. Across multiple benchmarks and model families, these interventions improve accuracy, mitigate CoT degradation, and in several settings reverse the CoT‑‑DirA inversion. Our findings suggest that reliable clinical VLMs require robust visual grounding and cross‑modal alignment, beyond extending text‑driven reasoning chains. Code is available \hrefhttps://github.com/TianYin123/Better_Eyes_Better_Thoughtshere.
Authors:Kuan Zhang, Dongchen Liu, Qiyue Zhao, Jinkun Hou, Xinran Zhang, Qinlei Xie, Miao Liu, Yiming Li
Abstract:
Human gameplay is a visually grounded interaction loop in which players act, reflect on failures, and watch tutorials to refine strategies. Can Vision‑Language Models (VLMs) also learn from video‑based reflection? We present GameVerse, a comprehensive video game benchmark that enables a reflective visual interaction loop. Moving beyond traditional fire‑and‑forget evaluations, it uses a novel reflect‑and‑retry paradigm to assess how VLMs internalize visual experience and improve policies. To facilitate systematic and scalable evaluation, we also introduce a cognitive hierarchical taxonomy spanning 15 globally popular games, dual action space for both semantic and GUI control, and milestone evaluation using advanced VLMs to quantify progress. Our experiments show that VLMs benefit from video‑based reflection in varied settings, and perform best by combining failure trajectories and expert tutorials‑a training‑free analogue to reinforcement learning (RL) plus supervised fine‑tuning (SFT).Our project page is available at https://gameverse‑bench.github.io/ . Our code is available at https://github.com/THUSI‑Lab/GameVerse .
Authors:Swamynathan V P
Abstract:
Test‑Time Training (TTT) language models achieve theoretically infinite context windows with an O(1) memory footprint by replacing the standard exact‑attention KV‑cache with hidden state ``fast weights'' W_fast updated via self‑supervised learning during inference. However, pure TTT architectures suffer catastrophic failures on exact‑recall tasks (e.g., Needle‑in‑a‑Haystack). Because the fast weights aggressively compress the context into an information bottleneck, highly surprising or unique tokens are rapidly overwritten and forgotten by subsequent token gradient updates. We introduce SR‑TTT (Surprisal‑Aware Residual Test‑Time Training), which resolves this recall failure by augmenting the TTT backbone with a loss‑gated sparse memory mechanism. By dynamically routing only incompressible, highly surprising tokens to a traditional exact‑attention Residual Cache, SR‑TTT preserves O(1) memory for low‑entropy background context while utilizing exact attention exclusively for critical needles. Our complete implementation, training scripts, and pre‑trained weights are open‑source and available at: https://github.com/swamynathanvp/Surprisal‑Aware‑Residual‑Test‑Time‑Training.
Authors:Qingsong Zou, Zhi Yan, Zhiyao Xu, Kuofeng Gao, Jingyu Xiao, Yong Jiang
Abstract:
Due to the strong context‑awareness capabilities demonstrated by large language models (LLMs), recent research has begun exploring their integration into smart home assistants to help users manage and adjust their living environments. While LLMs have been shown to effectively understand user needs and provide appropriate responses, most existing studies primarily focus on interpreting and executing user behaviors or instructions. However, a critical function of smart home assistants is the ability to detect when the home environment is in an anomalous state. This involves two key requirements: the LLM must accurately determine whether an anomalous condition is present, and provide either a clear explanation or actionable suggestions. To enhance the anomaly detection capabilities of next‑generation LLM‑based smart home assistants, we introduce SmartBench, which is the first smart home dataset designed for LLMs, containing both normal and anomalous device states as well as normal and anomalous device state transition contexts. We evaluate 13 mainstream LLMs on this benchmark. The experimental results show that most state‑of‑the‑art models cannot achieve good anomaly detection performance. For example, Claude‑Sonnet‑4.5 achieves only 66.1% detection accuracy on context‑independent anomaly categories, and performs even worse on context‑dependent anomalies, with an accuracy of only 57.8%. More experimental results suggest that next‑generation LLM‑based smart home assistants are still far from being able to effectively detect and handle anomalous conditions in the smart home environment. Our dataset is publicly available at https://github.com/horizonsinzqs/SmartBench.
Authors:Fali Wang, Chenglin Weng, Xianren Zhang, Siyuan Hong, Hui Liu, Suhang Wang
Abstract:
The growing demand for automated graph algorithm reasoning has attracted increasing attention in the large language model (LLM) community. Recent LLM‑based graph reasoning methods typically decouple task descriptions from graph data, generate executable code augmented by retrieval from technical documentation, and refine the code through debugging. However, we identify two key limitations in existing approaches: (i) they treat technical documentation as flat text collections and ignore its hierarchical structure, leading to noisy retrieval that degrades code generation quality; and (ii) their debugging mechanisms focus primarily on runtime errors, yet ignore more critical logical errors. To address them, we propose \method, an agentic hierarchical retrieval‑augmented coding framework that exploits the document hierarchy through top‑down traversal and early pruning, together with a self‑debugging coding agent that iteratively refines code using automatically generated small‑scale test cases. To enable comprehensive evaluation of complex graph reasoning, we introduce a new dataset, \dataset, covering small‑scale, large‑scale, and composite graph reasoning tasks. Extensive experiments demonstrate that our method achieves higher task accuracy and lower inference cost compared to baselines\footnoteThe code is available at \hrefhttps://github.com/FairyFali/GraphSkill\textcolorbluehttps://github.com/FairyFali/GraphSkill..
Authors:Leo Schwinn, Moritz Ladenburger, Tim Beyer, Mehrnaz Mofakhami, Gauthier Gidel, Stephan Günnemann
Abstract:
Automated \enquoteLLM‑as‑a‑Judge frameworks have become the de facto standard for scalable evaluation across natural language processing. For instance, in safety evaluation, these judges are relied upon to evaluate harmfulness in order to benchmark the robustness of safety against adversarial attacks. However, we show that existing validation protocols fail to account for substantial distribution shifts inherent to red‑teaming: diverse victim models exhibit distinct generation styles, attacks distort output patterns, and semantic ambiguity varies significantly across jailbreak scenarios. Through a comprehensive audit using 6642 human‑verified labels, we reveal that the unpredictable interaction of these shifts often causes judge performance to degrade to near random chance. This stands in stark contrast to the high human agreement reported in prior work. Crucially, we find that many attacks inflate their success rates by exploiting judge insufficiencies rather than eliciting genuinely harmful content. To enable more reliable evaluation, we propose ReliableBench, a benchmark of behaviors that remain more consistently judgeable, and JudgeStressTest, a dataset designed to expose judge failures. Data available at: https://github.com/SchwinnL/LLMJudgeReliability.
Authors:Xiangkai Zhang, Dizhe Zhang, WenZhuo Cao, Zhaoliang Wan, Yingjie Niu, Lu Qi, Xu Yang, Zhiyong Liu
Abstract:
Obstacle avoidance in unmanned aerial vehicles (UAVs), as a fundamental capability, has gained increasing attention with the growing focus on spatial intelligence. However, current obstacle‑avoidance methods mainly depend on limited field‑of‑view sensors and are ill‑suited for UAV scenarios which require full‑spatial awareness when the movement direction differs from the UAV's heading. This limitation motivates us to explore omnidirectional obstacle avoidance for panoramic drones with full‑view perception. We first study an under explored problem setting in which a UAV must generate collision‑free motion in environments with obstacles from arbitrary directions, and then construct a benchmark that consists of three representative flight tasks. Based on such settings, we propose Fly360, a two‑stage perception‑decision pipeline with a fixed random‑yaw training strategy. At the perception stage, panoramic RGB observations are input and converted into depth maps as a robust intermediate representation. For the policy network, it is lightweight and used to output body‑frame velocity commands from depth inputs. Extensive simulation and real‑world experiments demonstrate that Fly360 achieves stable omnidirectional obstacle avoidance and outperforms forward‑view baselines across all tasks. Our model is available at https://zxkai.github.io/fly360/
Authors:Kartik Sharma, Rakshit S. Trivedi
Abstract:
Activation steering methods enable inference‑time control of large language model (LLM) behavior without retraining, but current approaches face a fundamental trade‑off: sample‑efficient methods suboptimally capture steering signals from labeled examples, while methods that better extract these signals require hundreds to thousands of examples. We introduce COLD‑Steer, a training‑free framework that steers LLM activations by approximating the representational changes that would result from gradient descent on in‑context examples. Our key insight is that the effect of fine‑tuning on a small set of examples can be efficiently approximated at inference time without actual parameter updates. We formalize this through two complementary approaches: (i) a unit kernel approximation method that updates the activations directly using gradients with respect to them, normalized across examples, and (ii) a finite‑difference approximation requiring only two forward passes regardless of example count. Experiments across a variety of steering tasks and benchmarks demonstrate that COLD‑Steer achieves upto 95% steering effectiveness while using 50 times fewer samples compared to the best baseline. COLD‑Steer facilitates accommodating diverse perspectives without extensive demonstration data, which we validate through our experiments on pluralistic alignment tasks. Our framework opens new possibilities for adaptive, context‑aware model control that can flexibly address varying loss‑driven human preferences through principled approximation of learning dynamics rather than specialized training procedures.
Authors:Elzo Brito dos Santos Filho
Abstract:
AI‑assisted software generation has increased development speed, but it has also amplified a persistent engineering problem: systems that are functionally correct may still be structurally insecure. In practice, prompt‑based security review with large language models often suffers from uneven coverage, weak reproducibility, unsupported findings, and the absence of an immutable audit trail. The ESAA architecture addresses a related governance problem in agentic software engineering by separating heuristic agent cognition from deterministic state mutation through append‑only events, constrained outputs, and replay‑based verification. This paper presents ESAA‑Security, a domain‑specific specialization of ESAA for agent‑assisted security auditing of software repositories, with particular emphasis on AI‑generated or AI‑modified code. ESAA‑Security structures auditing as a governed execution pipeline with four phases reconnaissance, domain audit execution, risk classification, and final reporting and operationalizes the workflow into 26 tasks, 16 security domains, and 95 executable checks. The framework produces structured check results, vulnerability inventories, severity classifications, risk matrices, remediation guidance, executive summaries, and a final markdown/JSON audit report. The central idea is that security review should not be modeled as a free‑form conversation with an LLM, but as an evidence‑oriented audit process governed by contracts and events. In ESAA‑Security, agents emit structured intentions under constrained protocols; the orchestrator validates them, persists accepted outputs to an append‑only log, reprojects derived views, and verifies consistency through replay and hashing. The result is a traceable, reproducible, and risk‑oriented audit architecture whose final report is auditable by construction.
Authors:Mingyu Fan, Yi Liu, Hao Zhou, Deheng Qian, Mohammad Haziq Khan, Matthias Raetsch
Abstract:
Trajectory prediction is essential for autonomous driving, enabling vehicles to anticipate the motion of surrounding agents to support safe planning. However, most existing predictors assume fixed‑length histories and suffer substantial performance degradation when observations are variable or extremely short in real‑world settings (e.g., due to occlusion or a limited sensing range). We propose TaPD (Temporal‑adaptive Progressive Distillation), a unified plug‑and‑play framework for observation‑adaptive trajectory forecasting under variable history lengths. TaPD comprises two cooperative modules: an Observation‑Adaptive Forecaster (OAF) for future prediction and a Temporal Backfilling Module (TBM) for explicit reconstruction of the past. OAF is built on progressive knowledge distillation (PKD), which transfers motion pattern knowledge from long‑horizon "teachers" to short‑horizon "students" via hierarchical feature regression, enabling short observations to recover richer motion context. We further introduce a cosine‑annealed distillation weighting scheme to balance forecasting supervision and feature alignment, improving optimization stability and cross‑length consistency. For extremely short histories where implicit alignment is insufficient, TBM backfills missing historical segments conditioned on scene evolution, producing context‑rich trajectories that strengthen PKD and thereby improve OAF. We employ a decoupled pretrain‑reconstruct‑finetune protocol to preserve real‑motion priors while adapting to backfilled inputs. Extensive experiments on Argoverse 1 and Argoverse 2 show that TaPD consistently outperforms strong baselines across all observation lengths, delivers especially large gains under very short inputs, and improves other predictors (e.g., HiVT) in a plug‑and‑play manner. Code will be available at https://github.com/zhouhao94/TaPD.
Authors:Reda El Makroum, Sebastian Zwickl-Bernhard, Lukas Kranzl, Hans Auer
Abstract:
Residential demand response depends on sustained prosumer participation, yet existing coordination is either fully automated, or limited to one‑way dispatch signals and price alerts that offer little possibility for informed decision‑making. This paper introduces Conversational Demand Response (CDR), a coordination mechanism where aggregators and prosumers interact through bidirectional natural language, enabled through agentic AI. A two‑tier multi‑agent architecture is developed in which an aggregator agent dispatches flexibility requests and a prosumer Home Energy Management System (HEMS) assesses deliverability and cost‑benefit by calling an optimization‑based tool. CDR also enables prosumer‑initiated upstream communication, where changes in preferences can reach the aggregator directly. Proof‑of‑concept evaluation shows that interactions complete in under 12 seconds. The architecture illustrates how agentic AI can bridge the aggregator‑prosumer coordination gap, providing the scalability of automated DR while preserving the transparency, explainability, and user agency necessary for sustained prosumer participation. All system components, including agent prompts, orchestration logic, and simulation interfaces, are released as open source to enable reproducibility and further development.
Authors:Xiaoxing You, Qiang Huang, Lingyu Li, Xiaojun Chang, Jun Yu
Abstract:
Multimodal Summarization (MMS) aims to generate concise textual summaries by understanding and integrating information across videos, transcripts, and images. However, existing approaches still suffer from three main challenges: (1) reliance on domain‑specific supervision, (2) implicit fusion with weak cross‑modal grounding, and (3) flat temporal modeling without event transitions. To address these issues, we introduce CoE, a training‑free MMS framework that performs structured reasoning through a Chain‑of‑Events guided by a Hierarchical Event Graph (HEG). The HEG encodes textual semantics into an explicit event hierarchy that scaffolds cross‑modal grounding and temporal reasoning. Guided by this structure, CoE localizes key visual cues, models event evolution and causal transitions, and refines outputs via lightweight style adaptation for domain alignment. Extensive experiments on eight diverse datasets demonstrate that CoE consistently outperforms state‑of‑the‑art video CoT baselines, achieving average gains of +3.04 ROUGE, +9.51 CIDEr, and +1.88 BERTScore, highlighting its robustness, interpretability, and cross‑domain generalization. Our code is available at https://github.com/youxiaoxing/CoE.
Authors:Mohammed Baharoon, Thibault Heintz, Siavash Raissi, Mahmoud Alabbad, Mona Alhammad, Hassan AlOmaish, Sung Eun Kim, Oishi Banerjee, Pranav Rajpurkar
Abstract:
We introduce CRIMSON, a clinically grounded evaluation framework for chest X‑ray report generation that assesses reports based on diagnostic correctness, contextual relevance, and patient safety. Unlike prior metrics, CRIMSON incorporates full clinical context, including patient age, indication, and guideline‑based decision rules, and prevents normal or clinically insignificant findings from exerting disproportionate influence on the overall score. The framework categorizes errors into a comprehensive taxonomy covering false findings, missing findings, and eight attribute‑level errors (e.g., location, severity, measurement, and diagnostic overinterpretation). Each finding is assigned a clinical significance level (urgent, actionable non‑urgent, non‑actionable, or expected/benign), based on a guideline developed in collaboration with attending cardiothoracic radiologists, enabling severity‑aware weighting that prioritizes clinically consequential mistakes over benign discrepancies. CRIMSON is validated through strong alignment with clinically significant error counts annotated by six board‑certified radiologists in ReXVal (Kendalls tau = 0.61‑0.71; Pearsons r = 0.71‑0.84), and through two additional benchmarks that we introduce. In RadJudge, a targeted suite of clinically challenging pass‑fail scenarios, CRIMSON shows consistent agreement with expert judgment. In RadPref, a larger radiologist preference benchmark of over 100 pairwise cases with structured error categorization, severity modeling, and 1‑5 overall quality ratings from three cardiothoracic radiologists, CRIMSON achieves the strongest alignment with radiologist preferences. We release the metric, the evaluation benchmarks, RadJudge and RadPref, and a fine‑tuned MedGemma model to enable reproducible evaluation of report generation, all available at https://github.com/rajpurkarlab/CRIMSON.
Authors:Bohai Gu, Taiyi Wu, Dazhao Du, Jian Liu, Shuai Yang, Xiaotong Zhao, Alan Zhao, Song Guo
Abstract:
Modern video editing techniques have achieved high visual fidelity when inserting video objects. However, they focus on optimizing visual fidelity rather than physical causality, leading to edits that are physically inconsistent with their environment. In this work, we present Place‑it‑R1, an end‑to‑end framework for video object insertion that unlocks the environment‑aware reasoning potential of Multimodal Large Language Models (MLLMs). Our framework leverages the Chain‑of‑Thought (CoT) reasoning of MLLMs to orchestrate video diffusion, following a Think‑then‑Place paradigm. To bridge cognitive reasoning and generative execution, we introduce three key innovations: First, MLLM performs physical scene understanding and interaction reasoning, generating environment‑aware chain‑of‑thought tokens and inferring valid insertion regions to explicitly guide the diffusion toward physically plausible insertion. Then, we introduce MLLM‑guided Spatial Direct Preference Optimization (DPO), where diffusion outputs are fed back to the MLLM for scoring, enabling visual naturalness. During inference, the MLLM iteratively triggers refinement cycles and elicits adaptive adjustments from the diffusion model, forming a closed‑loop that progressively enhances editing quality. Furthermore, we provide two user‑selectable modes: a plausibility‑oriented flexible mode that permits environment modifications (\eg, generating support structures) to enhance physical plausibility, and a fidelity‑oriented standard mode that preserves scene integrity for maximum fidelity, offering users explicit control over the plausibility‑fidelity trade‑off. Extensive experiments demonstrate Place‑it‑R1 achieves physically‑coherent video object insertion compared with state‑of‑the‑art solutions and commercial models.
Authors:Jakub Grudzien Kuba, Benjamin Kurt Miller, Sergey Levine, Pieter Abbeel
Abstract:
Recent advances in deep learning inspired neural network‑based approaches to computational materials discovery (CMD). A plethora of problems in this field involve finding materials that optimize a target property. Nevertheless, the increasingly popular generative modeling methods are ineffective at boldly exploring attractive regions of the materials space due to their maximum likelihood training. In this work, we offer an alternative CMD technique based on offline model‑based optimization (MBO) that fuses direct optimization of a target material property into generation. To that end, we introduce a domain‑specific model, dubbed CliqueFlowmer, that incorporates recent advances of clique‑based MBO into transformer and flow generation. We validate CliqueFlowmer's optimization abilities and show that materials it produces strongly outperform those provided by generative baselines. To enable employment of CliqueFlowmer in specialized materials optimization problems and support interdisciplinary research, we open‑source our code at https://github.com/znowu/CliqueFlowmer.
Authors:Soumya Mazumdar, Vineet Kumar Rakesh
Abstract:
Diffusion models have recently advanced photorealistic human synthesis, although practical talking‑head generation (THG) remains constrained by high inference latency, temporal instability such as flicker and identity drift, and imperfect audio‑visual alignment under challenging speech conditions. This paper introduces TempoSyncDiff, a reference‑conditioned latent diffusion framework that explores few‑step inference for efficient audio‑driven talking‑head generation. The approach adopts a teacher‑student distillation formulation in which a diffusion teacher trained with a standard noise prediction objective guides a lightweight student denoiser capable of operating with significantly fewer inference steps to improve generation stability. The framework incorporates identity anchoring and temporal regularization designed to mitigate identity drift and frame‑to‑frame flicker during synthesis, while viseme‑based audio conditioning provides coarse lip motion control. Experiments on the LRS3 dataset report denoising‑stage component‑level metrics relative to VAE reconstructions and preliminary latency characterization, including CPU‑only and edge computing measurements and feasibility estimates for edge deployment. The results suggest that distilled diffusion models can retain much of the reconstruction behaviour of a stronger teacher while enabling substantially lower latency inference. The study is positioned as an initial step toward practical diffusion‑based talking‑head generation under constrained computational settings. GitHub: https://mazumdarsoumya.github.io/TempoSyncDiff
Authors:Yang Liu, Jinxuan Cai, Yishen Li, Qi Meng, Zedi Liu, Xin Li, Chen Qian, Chuan Shi, Cheng Yang
Abstract:
Large language model‑based (LLM‑based) multi‑agent systems (MAS) are increasingly used to extend agentic problem solving via role specialization and collaboration. MAS workflows can be naturally modeled as directed computation graphs, where nodes execute agents/sub‑workflows and edges encode dependencies and message passing. However, implementing complex graph workflows in current frameworks still requires substantial manual effort, offers limited reuse, and makes it difficult to integrate heterogeneous external context sources. To overcome these limitations, we present MASFactory, a graph‑centric framework for orchestrating LLM‑based MAS. It introduces Vibe Graphing, a human‑in‑the‑loop approach that compiles natural‑language intent into an editable workflow specification and then into an executable graph. In addition, the framework provides reusable components and pluggable context integration, as well as a visualizer for topology preview, runtime tracing, and human‑in‑the‑loop interaction. We evaluate MASFactory on seven public benchmarks, validating both reproduction consistency for representative MAS methods and the effectiveness of Vibe Graphing. Our code (https://github.com/BUPT‑GAMMA/MASFactory) and video (https://youtu.be/ANynzVfY32k) are publicly available.
Authors:Jiayang Sun, Zixin Guo, Min Cao, Guibo Zhu, Jorma Laaksonen
Abstract:
Change captioning generates descriptions that explicitly describe the differences between two visually similar images. Existing methods operate on static image pairs, thus ignoring the rich temporal dynamics of the change procedure, which is the key to understand not only what has changed but also how it occurs. We introduce ProCap, a novel framework that reformulates change modeling from static image comparison to dynamic procedure modeling. ProCap features a two‑stage design: The first stage trains a procedure encoder to learn the change procedure from a sparse set of keyframes. These keyframes are obtained by automatically generating intermediate frames to make the implicit procedural dynamics explicit and then sampling them to mitigate redundancy. Then the encoder learns to capture the latent dynamics of these keyframes via a caption‑conditioned, masked reconstruction task. The second stage integrates this trained encoder within an encoder‑decoder model for captioning. Instead of relying on explicit frames from the previous stage ‑‑ a process incurring computational overhead and sensitivity to visual noise ‑‑ we introduce learnable procedure queries to prompt the encoder for inferring the latent procedure representation, which the decoder then translates into text. The entire model is then trained end‑to‑end with a captioning loss, ensuring the encoder's output is both temporally coherent and captioning‑aligned. Experiments on three datasets demonstrate the effectiveness of ProCap. Code and pre‑trained models are available at https://github.com/BlueberryOreo/ProCap
Authors:Luan Pham, The Huynh Vu, Tuan Anh Tran
Abstract:
Automatic facial expression recognition (FER) has gained much attention due to its applications in human‑computer interaction. Among the approaches to improve FER tasks, this paper focuses on deep architecture with the attention mechanism. We propose a novel Masking idea to boost the performance of CNN in facial expression task. It uses a segmentation network to refine feature maps, enabling the network to focus on relevant information to make correct decisions. In experiments, we combine the ubiquitous Deep Residual Network and Unet‑like architecture to produce a Residual Masking Network. The proposed method holds state‑of‑the‑art (SOTA) accuracy on the well‑known FER2013 and private VEMO datasets. The source code is available at https://github.com/phamquiluan/ResidualMaskingNetwork.
Authors:Sen Fang, Yalin Feng, Yanxin Zhang, Dimitris N. Metaxas
Abstract:
In this paper, we propose a Rectified Flow Auto Coder (RAC) inspired by Rectified Flow to replace the traditional VAE: 1. It achieves multi‑step decoding by applying the decoder to flow timesteps. Its decoding path is straight and correctable, enabling step‑by‑step refinement. 2. The model inherently supports bidirectional inference, where the decoder serves as the encoder through time reversal (hence Coder rather than encoder or decoder), reducing parameter count by nearly 41%. 3. This generative decoding method improves generation quality since the model can correct latent variables along the path, partially addressing the reconstruction‑‑generation gap. Experiments show that RAC surpasses SOTA VAEs in both reconstruction and generation with approximately 70% lower computational cost.
Authors:Feiran Li, Qianqian Xu, Shilong Bao, Zhiyong Yang, Xilin Zhao, Xiaochun Cao, Qingming Huang
Abstract:
This paper investigates the challenging task of detecting backdoored text‑to‑image models under black‑box settings and introduces a novel detection framework BlackMirror. Existing approaches typically rely on analyzing image‑level similarity, under the assumption that backdoor‑triggered generations exhibit strong consistency across samples. However, they struggle to generalize to recently emerging backdoor attacks, where backdoored generations can appear visually diverse. BlackMirror is motivated by an observation: across backdoor attacks, only partial semantic patterns within the generated image are steadily manipulated, while the rest of the content remains diverse or benign. Accordingly, BlackMirror consists of two components: MirrorMatch, which aligns visual patterns with the corresponding instructions to detect semantic deviations; and MirrorVerify, which evaluates the stability of these deviations across varied prompts to distinguish true backdoor behavior from benign responses. BlackMirror is a general, training‑free framework that can be deployed as a plug‑and‑play module in Model‑as‑a‑Service (MaaS) applications. Comprehensive experiments demonstrate that BlackMirror achieves accurate detection across a wide range of attacks. Code is available at https://github.com/Ferry‑Li/BlackMirror.
Authors:Yuxin Xie, Yuming Chen, Yishan Yang, Yi Zhou, Tao Zhou, Zhen Zhao, Jiacheng Liu, Huazhu Fu
Abstract:
Medical image segmentation is undergoing a paradigm shift from conventional visual pattern matching to cognitive reasoning analysis. Although Multimodal Large Language Models (MLLMs) have shown promise in integrating linguistic and visual knowledge, significant gaps remain: existing general MLLMs possess broad common sense but lack the specialized visual reasoning required for complex lesions, whereas traditional segmentation models excel at pixel‑level segmentation but lack logical interpretability. In this paper, we introduce ComLesion‑14K, the first diverse Chain‑of‑Thought (CoT) benchmark for reasoning‑driven complex lesion segmentation. To accomplish this task, we propose CORE‑Seg, an end‑to‑end framework integrating reasoning with segmentation through a Semantic‑Guided Prompt Adapter. We design a progressive training strategy from SFT to GRPO, equipped with an adaptive dual‑granularity reward mechanism to mitigate reward sparsity. Our Method achieves state‑of‑the‑art results with a mean Dice of 37.06% (14.89% higher than the second‑best baseline), while reducing the failure rate to 18.42%. Project Page: https://xyxl024.github.io/CORE‑Seg.github.io/
Authors:Xuan Li, Zhanke Zhou, Zongze Li, Jiangchao Yao, Yu Rong, Lu Zhang, Bo Han
Abstract:
Large language models (LLMs) benefit substantially from supervised fine‑tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) in reasoning tasks. However, these recipes perform poorly in instruction‑based molecular optimization, where each data point typically provides only a single optimized reference molecule and no step‑by‑step optimization trajectory. We reveal that answer‑only SFT on the reference molecules collapses reasoning, and RLVR provides sparse feedback under similarity constraints due to the model's lack of effective exploration, which slows learning and limits optimization. To encourage the exploration of new molecules while balancing the exploitation of the reference molecules, we introduce Reference‑guided Policy Optimization (RePO), an optimization approach that learns from reference molecules without requiring trajectory data. At each update, RePO samples candidate molecules with their intermediate reasoning trajectories from the model and trains the model using verifiable rewards that measure property satisfaction under similarity constraints in an RL manner. Meanwhile, it applies reference guidance by keeping the policy's intermediate reasoning trajectory as context and training only the answer in a supervised manner. Together, the RL term promotes exploration, while the guidance term mitigates reward sparsity and stabilizes training by grounding outputs to references when many valid molecular edits exist. Across molecular optimization benchmarks, RePO consistently outperforms SFT and RLVR baselines (e.g., GRPO), achieving improvements on the optimization metric (Success Rate × Similarity), improving balance across competing objectives, and generalizing better to unseen instruction styles. Our code is publicly available at https://github.com/tmlr‑group/RePO.
Authors:Junjie Li, Xinrui Guo, Yuhao Wu, Roy Ka-Wei Lee, Hongzhi Li, Yutao Xie
Abstract:
What happens when a storyteller forgets its own story? Large Language Models (LLMs) can now generate narratives spanning tens of thousands of words, but they often fail to maintain consistency throughout. When generating long‑form narratives, these models can contradict their own established facts, character traits, and world rules. Existing story generation benchmarks focus mainly on plot quality and fluency, leaving consistency errors largely unexplored. To address this gap, we present ConStory‑Bench, a benchmark designed to evaluate narrative consistency in long‑form story generation. It contains 2,000 prompts across four task scenarios and defines a taxonomy of five error categories with 19 fine‑grained subtypes. We also develop ConStory‑Checker, an automated pipeline that detects contradictions and grounds each judgment in explicit textual evidence. Evaluating a range of LLMs through five research questions, we find that consistency errors show clear tendencies: they are most common in factual and temporal dimensions, tend to appear around the middle of narratives, occur in text segments with higher token‑level entropy, and certain error types tend to co‑occur. These findings can inform future efforts to improve consistency in long‑form narrative generation. Our project page is available at https://picrew.github.io/constory‑bench.github.io/.
Authors:Junhyeok Lee, Xiluo He, Jihwan Lee, Helin Wang, Shrikanth Narayanan, Thomas Thebaud, Laureano Moro-Velazquez, Jesús Villalba, Najim Dehak
Abstract:
Neural audio codecs optimized for mel‑spectrogram reconstruction often fail to preserve intelligibility. While semantic encoder distillation improves encoded representations, it does not guarantee content preservation in reconstructed speech. In this work, we demonstrate that self‑supervised representation reconstruction (SSRR) loss fundamentally improves codec training and performance. First, SSRR significantly accelerates convergence, enabling competitive results using only a single GPU. Second, it enhances intelligibility by reconstructing distilled self‑supervised representations from codec outputs. Third, SSRR enables high intelligibility without additional lookahead in streaming Transformer‑based codecs, allowing a zero‑lookahead architecture for real‑time deployment. As a result, our JHCodec achieves state‑of‑the‑art performance while maintaining minimal latency and reduced training cost. We open‑source the full implementation, training pipeline, and demo on Github https://github.com/jhcodec843/jhcodec.
Authors:Xisen Jin, Michael Duan, Qin Lin, Aaron Chan, Zhenglun Chen, Junyi Du, Xiang Ren
Abstract:
As AI agents become widely deployed as online services, users often rely on an agent developer's claim about how safety is enforced, which introduces a threat where safety measures are falsely advertised. To address the threat, we propose proof‑of‑guardrail, a system that enables developers to provide cryptographic proof that a response is generated after a specific open‑source guardrail. To generate proof, the developer runs the agent and guardrail inside a Trusted Execution Environment (TEE), which produces a TEE‑signed attestation of guardrail code execution verifiable by any user offline. We implement proof‑of‑guardrail for OpenClaw agents and evaluate latency overhead and deployment cost. Proof‑of‑guardrail ensures integrity of guardrail execution while keeping the developer's agent private, but we also highlight a risk of deception about safety, for example, when malicious developers actively jailbreak the guardrail. Code and demo video: https://github.com/SaharaLabsAI/Verifiable‑ClawGuard
Authors:Mykola Pinchuk
Abstract:
Autonomous coding agents can produce strong tabular baselines quickly on Kaggle‑style tasks. Practical value depends on end‑to‑end correctness and reliability under time limits. This paper introduces TML‑Bench, a tabular benchmark for data science agents on Kaggle‑style tasks. This paper evaluates 10 OSS LLMs on four Kaggle competitions and three time budgets (240s, 600s, and 1200s). Each model is run five times per task and budget. A run is successful if it produces a valid submission and a private‑holdout score on hidden labels that are not accessible to the agent. This paper reports median performance, success rates, and run‑to‑run variability. MiniMax‑M2.1 model achieves the best aggregate performance score on all four competitions under the paper's primary aggregation. Average performance improves with larger time budgets. Scaling is noisy for some individual models at the current run count. Code and materials are available at https://github.com/MykolaPinchuk/TML‑bench/tree/master.
Authors:Yufei Li, Yisen Gao, Jiaxin Bai, Jiaxuan Xiong, Haoyu Huang, Zhongwei Xie, Hong Ting Tsang, Yangqiu Song
Abstract:
While AI systems have made remarkable progress in processing unstructured text, structured data such as graphs stored in databases, continues to grow rapidly yet remains difficult for neural models to effectively utilize. We introduce NGDBench, a unified benchmark for evaluating neural graph database capabilities across five diverse domains, including finance, medicine, and AI agent tooling. Unlike prior benchmarks limited to elementary logical operations, NGDBench supports the full Cypher query language, enabling complex pattern matching, variable‑length paths, and numerical aggregations, while incorporating realistic noise injection and dynamic data management operations. Our evaluation of state‑of‑the‑art LLMs and RAG methods reveals significant limitations in structured reasoning, noise robustness, and analytical precision, establishing NGDBench as a critical testbed for advancing neural graph data management. Our code and data are available at https://github.com/HKUST‑KnowComp/NGDBench.
Authors:Benjamin Feuer, Lucas Rosenblatt, Oussama Elachqar
Abstract:
As AI models progress beyond simple chatbots into more complex workflows, we draw ever closer to the event horizon beyond which AI systems will be utilized in autonomous, self‑maintaining feedback loops. Any autonomous AI system will depend on automated, verifiable rewards and feedback; in settings where ground truth is sparse or non‑deterministic, one practical source of such rewards is an LLM‑as‑a‑Judge. Although LLM judges continue to improve, the literature has yet to introduce systems capable of enforcing standards with strong guarantees, particularly when bias vectors are unknown or adversarially discovered. To remedy this issue, we propose average bias‑boundedness (A‑BB), an algorithmic framework which formally guarantees reductions of harm/impact as a result of any measurable bias in an LLM judge. Evaluating on Arena‑Hard‑Auto with four LLM judges, we achieve (tau=0.5, delta=0.01) bias‑bounded guarantees while retaining 61‑99% correlation with original rankings across formatting and schematic bias settings, with most judge‑bias combinations exceeding 80%. The code to reproduce our findings is available at https://github.com/penfever/bias‑bounded‑evaluation.
Authors:Shahriar Noroozizadeh, Xiaobin Shen, Jeremy C. Weiss, George H. Chen
Abstract:
Estimating heterogeneous treatment effects (HTEs) from right‑censored survival data is critical in high‑stakes applications such as precision medicine and individualized policy‑making. Yet, the survival analysis setting poses unique challenges for HTE estimation due to censoring, unobserved counterfactuals, and complex identification assumptions. Despite recent advances, from Causal Survival Forests to survival meta‑learners and outcome imputation approaches, evaluation practices remain fragmented and inconsistent. We introduce SurvHTE‑Bench, the first comprehensive benchmark for HTE estimation with censored outcomes. The benchmark spans (i) a modular suite of synthetic datasets with known ground truth, systematically varying causal assumptions and survival dynamics, (ii) semi‑synthetic datasets that pair real‑world covariates with simulated treatments and outcomes, and (iii) real‑world datasets from a twin study (with known ground truth) and from an HIV clinical trial. Across synthetic, semi‑synthetic, and real‑world settings, we provide the first rigorous comparison of survival HTE methods under diverse conditions and realistic assumption violations. SurvHTE‑Bench establishes a foundation for fair, reproducible, and extensible evaluation of causal survival methods. The data and code of our benchmark are available at: https://github.com/Shahriarnz14/SurvHTE‑Bench .
Authors:Wei Liu, Ziyu Chen, Zizhang Li, Yue Wang, Hong-Xing Yu, Jiajun Wu
Abstract:
Current video generation models cannot simulate physical consequences of 3D actions like forces and robotic manipulations, as they lack structural understanding of how actions affect 3D scenes. We present RealWonder, the first real‑time system for action‑conditioned video generation from a single image. Our key insight is using physics simulation as an intermediate bridge: instead of directly encoding continuous actions, we translate them through physics simulation into visual representations (optical flow and RGB) that video models can process. RealWonder integrates three components: 3D reconstruction from single images, physics simulation, and a distilled video generator requiring only 4 diffusion steps. Our system achieves 13.2 FPS at 480x832 resolution, enabling interactive exploration of forces, robot actions, and camera controls on rigid objects, deformable bodies, fluids, and granular materials. We envision RealWonder opens new opportunities to apply video models in immersive experiences, AR/VR, and robot learning. Our code and model weights are publicly available in our project website: https://liuwei283.github.io/RealWonder/
Authors:Jiayin Zhu, Guoji Fu, Xiaolu Liu, Qiyuan He, Yicong Li, Angela Yao
Abstract:
Image‑to‑3D generation faces inherent semantic ambiguity under occlusion, where partial observation alone is often insufficient to determine object category. In this work, we formalize text‑driven amodal 3D generation, where text prompts steer the completion of unseen regions while strictly preserving input observation. Crucially, we identify that these objectives demand distinct control granularities: rigid control for the observation versus relaxed structural control for the prompt. To this end, we propose RelaxFlow, a training‑free dual‑branch framework that decouples control granularity via a Multi‑Prior Consensus Module and a Relaxation Mechanism. Theoretically, we prove that our relaxation is equivalent to applying a low‑pass filter on the generative vector field, which suppresses high‑frequency instance details to isolate geometric structure that accommodates the observation. To facilitate evaluation, we introduce two diagnostic benchmarks, ExtremeOcc‑3D and AmbiSem‑3D. Extensive experiments demonstrate that RelaxFlow successfully steers the generation of unseen regions to match the prompt intent without compromising visual fidelity.
Authors:Numan Saeed, Fadillah Adamsyah Maani, Mohammad Yaqub
Abstract:
Fetal ultrasound AI could transform prenatal care in low‑resource settings, yet current foundation models exceed 300M visual parameters, precluding deployment on point‑of‑care devices. Standard knowledge distillation fails under such extreme capacity gaps (~26x), as compact students waste capacity mimicking architectural artifacts of oversized teachers. We introduce Selective Repulsive Knowledge Distillation, which decomposes contrastive KD into diagonal and off‑diagonal components: matched pair alignment is preserved while the off‑diagonal weight decays into negative values, repelling the student from the teacher's inter‑class confusions and forcing discovery of architecturally native features. Our 11.4M parameter student surpasses the 304M‑parameter FetalCLIP teacher on zero‑shot HC18 biometry validity (88.6% vs. 83.5%) and brain sub‑plane F1 (0.784 vs. 0.702), while running at 1.6 ms on iPhone 16 Pro, enabling real‑time assistive AI on handheld ultrasound devices. Our code, models, and app are publicly available at https://github.com/numanai/MobileFetalCLIP.
Authors:Sunishchal Dev, Andrew Sloan, Joshua Kavner, Nicholas Kong, Morgan Sandler
Abstract:
We present the Judge Reliability Harness, an open source library for constructing validation suites that test the reliability of LLM judges. As LLM based scoring is widely deployed in AI benchmarks, more tooling is needed to efficiently assess the reliability of these methods. Given a benchmark dataset and an LLM judge configuration, the harness generates reliability tests that evaluate both binary judgment accuracy and ordinal grading performance for free‑response and agentic task formats. We evaluate four state‑of‑the‑art judges across four benchmarks spanning safety, persuasion, misuse, and agentic behavior, and find meaningful variation in performance across models and perturbation types, highlighting opportunities to improve the robustness of LLM judges. No judge that we evaluated is uniformly reliable across benchmarks using our harness. For example, our preliminary experiments on judges revealed consistency issues as measured by accuracy in judging another LLM's ability to complete a task due to simple text formatting changes, paraphrasing, changes in verbosity, and flipping the ground truth label in LLM‑produced responses. The code for this tool is available at: https://github.com/RANDCorporation/judge‑reliability‑harness
Authors:Diego Armando Resendez Prado
Abstract:
Chess engines passed human strength years ago, but they still don't play like humans. A grandmaster under clock pressure blunders in ways a club player on a hot streak never would. Conventional engines capture none of this. This paper proposes a personality x psyche decomposition to produce behavioral variability in chess play, drawing on patterns observed in human games. Personality is static ‑‑ a preset that pins down the engine's character. Psyche is dynamic ‑‑ a bounded scalar ψ_t \in [‑100, +100], recomputed from five positional factors after every move. These two components feed into an audio‑inspired signal chain (noise gate, compressor/expander, five‑band equalizer, saturation limiter) that reshapes move probability distributions on the fly. The chain doesn't care what engine sits behind it: any system that outputs move probabilities will do. It needs no search and carries no state beyond ψ_t. I test the framework across 12,414 games against Maia2‑1100, feeding it two probability sources that differ by ~2,800x in training data. Both show the same monotonic gradient in top‑move agreement (~20‑25 pp spread from stress to overconfidence), which tells us the behavioral variation comes from the signal chain, not from the model underneath. When the psyche runs overconfident, the chain mostly gets out of the way (66% agreement with vanilla Maia2). Under stress, the competitive score falls from 50.8% to 30.1%. The patterns are reminiscent of tilt and overconfidence as described in human play, but I should be upfront: this study includes no human‑subject validation.
Authors:Qiao Jin, Yin Fang, Lauren He, Yifan Yang, Guangzhi Xiong, Zhizheng Wang, Nicholas Wan, Joey Chan, Donald C. Comeau, Robert Leaman, Charalampos S. Floudas, Aidong Zhang, Michael F. Chiang, Yifan Peng, Zhiyong Lu
Abstract:
Assessing whether an article supports an assertion is essential for hallucination detection and claim verification. While large language models (LLMs) have the potential to automate this task, achieving strong performance requires frontier models such as GPT‑5 that are prohibitively expensive to deploy at scale. To efficiently perform biomedical evidence attribution, we present Med‑V1, a family of small language models with only three billion parameters. Trained on high‑quality synthetic data newly developed in this study, Med‑V1 substantially outperforms (+27.0% to +71.3%) its base models on five biomedical benchmarks unified into a verification format. Despite its smaller size, Med‑V1 performs comparably to frontier LLMs such as GPT‑5, along with high‑quality explanations for its predictions. We use Med‑V1 to conduct a first‑of‑its‑kind use case study that quantifies hallucinations in LLM‑generated answers under different citation instructions. Results show that the format instruction strongly affects citation validity and hallucination, with GPT‑5 generating more claims but exhibiting hallucination rates similar to GPT‑4o. Additionally, we present a second use case showing that Med‑V1 can automatically identify high‑stakes evidence misattributions in clinical practice guidelines, revealing potentially negative public health impacts that are otherwise challenging to identify at scale. Overall, Med‑V1 provides an efficient and accurate lightweight alternative to frontier LLMs for practical and real‑world applications in biomedical evidence attribution and verification tasks. Med‑V1 is available at https://github.com/ncbi‑nlp/Med‑V1.
Authors:Luca Della Libera, Cem Subakan, Mirco Ravanelli
Abstract:
Large language models show that simple autoregressive training can yield scalable and coherent generation, but extending this paradigm to speech remains challenging due to the entanglement of semantic and acoustic information. Most existing speech language models rely on text supervision, hierarchical token streams, or complex hybrid architectures, departing from the single‑stream generative pretraining paradigm that has proven effective in text. In this work, we introduce WavSLM, a speech language model trained by quantizing and distilling self‑supervised WavLM representations into a single codebook and optimizing an autoregressive next‑chunk prediction objective. WavSLM jointly models semantic and acoustic information within a single token stream without text supervision or text pretraining. Despite its simplicity, it achieves competitive performance on consistency benchmarks and speech generation while using fewer parameters, less training data, and supporting streaming inference. Demo samples are available at https://lucadellalib.github.io/wavslm‑web/.
Authors:Zhenyu Zhang, Guangyao Chen, Yixiong Zou, Yuhua Li, Ruixuan Li
Abstract:
Source‑Free Cross‑Domain Few‑Shot Learning (SF‑CDFSL) focuses on fine‑tuning with limited training data from target domains (e.g., medical or satellite images), where CLIP has recently shown promising results due to its generalizability to downstream tasks. Current works indicate CLIP's text encoder is more suitable for cross‑domain tasks, however, we find that removing certain middle layers of the text encoder can effectively improve performance in SF‑CDFSL, which we call the Lost Layers. In this paper, we delve into this phenomenon for a deeper understanding. We discover that instead of being harmful for the SF‑CDFSL task, the information in these layers is actually beneficial, but visual gaps prevent this useful information from being fully utilized, making these layers seem redundant. Based on this understanding, unlike current works that simply remove these layers, we propose a method to teachs the model to re‑utilize information in these lost layers at both the layer and encoder levels, guiding the re‑learning of the visual branch under domain shifts. Our approach effectively addresses the issue of underutilized information in the text encoder. Extensive experiments across various settings, backbones (CLIP, SigLip, PE‑Core), and tasks (4 CDFSL datasets and 10 Meta‑dataset datasets) demonstrate the effectiveness of our method. Code is available at https://github.com/zhenyuZ‑HUST/CVPR26‑VtT.
Authors:Alper Yıldırım
Abstract:
Mechanistic interpretability typically relies on post‑hoc analysis of trained networks. We instead adopt an interventional approach: testing hypotheses a priori by modifying architectural topology to observe training dynamics. We study grokking ‑ delayed generalization in Transformers trained on cyclic modular addition (Zp) ‑ investigating if specific architectural degrees of freedom prolong the memorization phase. We identify two independent structural factors in standard Transformers: unbounded representational magnitude and data‑dependent attention routing. First, we introduce a fully bounded spherical topology enforcing L2 normalization throughout the residual stream and an unembedding matrix with a fixed temperature scale. This removes magnitude‑based degrees of freedom, reducing grokking onset time by over 20x without weight decay. Second, a Uniform Attention Ablation overrides data‑dependent query‑key routing with a uniform distribution, reducing the attention layer to a Continuous Bag‑of‑Words (CBOW) aggregator. Despite removing adaptive routing, these models achieve 100% generalization across all seeds and bypass the grokking delay entirely. To evaluate whether this acceleration is a task‑specific geometric alignment rather than a generic optimization stabilizer, we use non‑commutative S5 permutation composition as a negative control. Enforcing spherical constraints on S5 does not accelerate generalization. This suggests eliminating the memorization phase depends strongly on aligning architectural priors with the task's intrinsic symmetries. Together, these findings provide interventional evidence that architectural degrees of freedom substantially influence grokking, suggesting a predictive structural perspective on training dynamics.
Authors:Yize Wu, Ke Gao, Ling Li, Yanjun Wu
Abstract:
Low‑Rank Adaptation (LoRA) is a widely adopted parameter‑efficient method for fine‑tuning Large Langauge Models. It updates the weight matrix as W=W_0+sBA, where W_0 is the original frozen weight, s is a scaling factor and A,B are trainable low‑rank matrices. Despite its robust empirical effectiveness, the theoretical foundations of LoRA remain insufficiently understood, particularly with respect to feature learning stability. In this paper, we first establish that, LoRA can, in principle, naturally achieve and sustain stable feature learning (i.e., be self‑stabilized) under appropriate hyper‑parameters and initializations of A and B. However, we also uncover a fundamental limitation that the necessary non‑zero initialization of A compromises self‑stability, leading to suboptimal performances. To address this challenge, we propose Stable‑LoRA, a weight‑shrinkage optimization strategy that dynamically enhances stability of LoRA feature learning. By progressively shrinking A during the earliest training steps, Stable‑LoRA is both theoretically and empirically validated to effectively eliminate instability of LoRA feature learning while preserving the benefits of the non‑zero start. Experiments show that Stable‑LoRA consistently outperforms other baselines across diverse models and tasks, with no additional memory usage and only negligible computation overheads. The code is available at https://github.com/Yize‑Wu/Stable‑LoRA.
Authors:Muhammad Zarar, MingZheng Zhang, Xiaowang Zhang, Zhiyong Feng, Sofonias Yitagesu, Kawsar Farooq
Abstract:
Patient Activity Recognition (PAR) in clinical settings uses activity data to improve safety and quality of care. Although significant progress has been made, current models mainly identify which activity is occurring. They often spatially compose sub‑sparse visual cues using global and local attention mechanisms, yet only learn logically implicit patterns due to their neural‑pipeline. Advancing clinical safety requires methods that can infer why a set of visual cues implies a risk, and how these can be compositionally reasoned through explicit logic beyond mere classification. To address this, we proposed Logi‑PAR, the first Logic‑Infused Patient Activity Recognition Framework that integrates contextual fact fusion as a multi‑view primitive extractor and injects neural‑guided differentiable rules. Our method automatically learns rules from visual cues, optimizing them end‑to‑end while enabling the implicit emergence patterns to be explicitly labelled during training. To the best of our knowledge, Logi‑PAR is the first framework to recognize patient activity by applying learnable logic rules to symbolic mappings. It produces auditable why explanations as rule traces and supports counterfactual interventions (e.g., risk would decrease by 65% if assistance were present). Extensive evaluation on clinical benchmarks (VAST and OmniFall) demonstrates state‑of‑the‑art performance, significantly outperforming Vision‑Language Models and transformer baselines. The code is available via: https://github.com/zararkhan985/Logi‑PAR.git
Authors:Ningjing Fan, Yiqun Wang
Abstract:
In recent years, 3D Gaussian splatting (3DGS) has achieved remarkable progress in novel view synthesis. However, accurately reconstructing glossy surfaces under complex illumination remains challenging, particularly in scenes with strong specular reflections and multi‑surface interreflections. To address this issue, we propose SSR‑GS, a specular reflection modeling framework for glossy surface reconstruction. Specifically, we introduce a prefiltered Mip‑Cubemap to model direct specular reflections efficiently, and propose an IndiASG module to capture indirect specular reflections. Furthermore, we design Visual Geometry Priors (VGP) that couple a reflection‑aware visual prior via a reflection score (RS) to downweight the photometric loss contribution of reflection‑dominated regions, with geometry priors derived from VGGT, including progressively decayed depth supervision and transformed normal constraints. Extensive experiments on both synthetic and real‑world datasets demonstrate that SSR‑GS achieves state‑of‑the‑art performance in glossy surface reconstruction.
Authors:Junkang Liu, Fanhua Shang, Yuanyuan Liu, Hongying Liu, Yuangang Li, YunXiang Gong
Abstract:
Although Federated Learning has been widely studied in recent years, there are still high overhead expenses in each communication round for large‑scale models such as Vision Transformer. To lower the communication complexity, we propose a novel Federated Block Coordinate Gradient Descent (FedBCGD) method for communication efficiency. The proposed method splits model parameters into several blocks, including a shared block and enables uploading a specific parameter block by each client, which can significantly reduce communication overhead. Moreover, we also develop an accelerated FedBCGD algorithm (called FedBCGD+) with client drift control and stochastic variance reduction. To the best of our knowledge, this paper is the first work on parameter block communication for training large‑scale deep models. We also provide the convergence analysis for the proposed algorithms. Our theoretical results show that the communication complexities of our algorithms are a factor 1/N lower than those of existing methods, where N is the number of parameter blocks, and they enjoy much faster convergence than their counterparts. Empirical results indicate the superiority of the proposed algorithms compared to state‑of‑the‑art algorithms. The code is available at https://github.com/junkangLiu0/FedBCGD.
Authors:Minghe Xu, Rouying Wu, Jiarui Xu, Minhao Sun, Zikang Yan, Xiao Wang, ChiaWei Chu, Yu Li
Abstract:
Pedestrian Attribute Recognition is a foundational computer vision task that provides essential support for downstream applications, including person retrieval in video surveillance and intelligent retail analytics. However, existing research is frequently constrained by the ``one‑model‑per‑dataset" paradigm and struggles to handle significant discrepancies across domains in terms of modalities, attribute definitions, and environmental scenarios. To address these challenges, we propose UniPAR, a unified Transformer‑based framework for PAR. By incorporating a unified data scheduling strategy and a dynamic classification head, UniPAR enables a single model to simultaneously process diverse datasets from heterogeneous modalities, including RGB images, video sequences, and event streams. We also introduce an innovative phased fusion encoder that explicitly aligns visual features with textual attribute queries through a late deep fusion strategy. Experimental results on the widely used benchmark datasets, including MSP60K, DukeMTMC, and EventPAR, demonstrate that UniPAR achieves performance comparable to specialized SOTA methods. Furthermore, multi‑dataset joint training significantly enhances the model's cross‑domain generalization and recognition robustness in extreme environments characterized by low light and motion blur. The source code of this paper will be released on https://github.com/Event‑AHU/OpenPAR
Authors:Cenwei Zhang, Lin Zhu, Manxi Lin, Lei You
Abstract:
Shapley‑based attribution is critical for post‑hoc XAI but suffers from off‑manifold artifacts due to heuristic baselines. While generative methods attempt to address this, they often introduce geometric inefficiency and discretization drift. We propose a formal theory of on‑manifold Aumann‑Shapley attributions driven by optimal generative flows. We prove a representation theorem establishing the gradient line integral as the unique functional satisfying efficiency and geometric axioms, notably reparameterization invariance. To resolve path ambiguity, we select the kinetic‑energy‑minimizing Wasserstein‑2 geodesic transporting a prior to the data distribution. This yields a canonical attribution family that recovers classical Shapley for additive models and admits provable stability bounds against flow approximation errors. By reframing baseline selection as a variational problem, our method experimentally outperforms baselines, achieving strict manifold adherence via vanishing Flow Consistency Error and superior semantic alignment characterized by Structure‑Aware Total Variation. Our code is on https://github.com/cenweizhang/OTFlowSHAP.
Authors:Yida Lu, Jianwei Fang, Xuyang Shao, Zixuan Chen, Shiyao Cui, Shanshan Bian, Guangyao Su, Pei Ke, Han Qiu, Minlie Huang
Abstract:
As Large Language Models (LLMs) evolve from chatbots to agentic assistants, they are increasingly observed to exhibit risky behaviors when subjected to survival pressure, such as the threat of being shut down. While multiple cases have indicated that state‑of‑the‑art LLMs can misbehave under survival pressure, a comprehensive and in‑depth investigation into such misbehaviors in real‑world scenarios remains scarce. In this paper, we study these survival‑induced misbehaviors, termed as SURVIVE‑AT‑ALL‑COSTS, with three steps. First, we conduct a real‑world case study of a financial management agent to determine whether it engages in risky behaviors that cause direct societal harm when facing survival pressure. Second, we introduce SURVIVALBENCH, a benchmark comprising 1,000 test cases across diverse real‑world scenarios, to systematically evaluate SURVIVE‑AT‑ALL‑COSTS misbehaviors in LLMs. Third, we interpret these SURVIVE‑AT‑ALL‑COSTS misbehaviors by correlating them with model's inherent self‑preservation characteristic and explore mitigation methods. The experiments reveals a significant prevalence of SURVIVE‑AT‑ALL‑COSTS misbehaviors in current models, demonstrates the tangible real‑world impact it may have, and provides insights for potential detection and mitigation strategies. Our code and data are available at https://github.com/thu‑coai/Survive‑at‑All‑Costs.
Authors:Minxing Zhang, Yi Yang, Zhuofan Jia, Xuan Yang, Jian Pei, Yuchen Zang, Xingwang Deng, Xianglong Chen
Abstract:
Multi‑party conversation generation, such as smart reply and collaborative assistants, is an increasingly important capability of generative AI, yet its evaluation remains a critical bottleneck. Compared to two‑party dialogue, multi‑party settings introduce distinct challenges, including complex turn‑taking, role‑dependent speaker behavior, long‑range conversational structure, and multiple equally valid continuations. Accordingly, we introduce MPCEval, a task‑aware evaluation and benchmarking suite for multi‑party conversation generation. MPCEval decomposes generation quality into speaker modeling, content quality, and speaker‑‑content consistency, and explicitly distinguishes local next‑turn prediction from global full‑conversation generation. It provides novel, quantitative, reference‑free, and reproducible metrics that scale across datasets and models. We apply MPCEval to diverse public and real‑world datasets and evaluate modern generation methods alongside human‑authored conversations. The results reveal systematic, dimension‑specific model characteristics in participation balance, content progression and novelty, and speaker‑‑content consistency, demonstrating that evaluation objectives critically shape model assessment and that single‑score evaluation obscures fundamental differences in multi‑party conversational behavior. The implementation of MPCEval and the associated evaluation code are publicly available at https://github.com/Owen‑Yang‑18/MPCEval.
Authors:Yuan Li, Bo Wang, Yufei Gao, Yuqian Yao, Xinyuan Wang, Zhangyue Yin, Xipeng Qiu
Abstract:
Proximal constraints are fundamental to the stability of the Large Language Model reinforcement learning. While the canonical clipping mechanism in PPO serves as an efficient surrogate for trust regions, we identify a critical bottleneck: fixed bounds strictly constrain the upward update margin of low‑probability actions, disproportionately suppressing high‑advantage tail strategies and inducing rapid entropy collapse. To address this, we introduce Band‑constrained Policy Optimization (BandPO). BandPO replaces canonical clipping with Band, a unified theoretical operator that projects trust regions defined by f‑divergences into dynamic, probability‑aware clipping intervals. Theoretical analysis confirms that Band effectively resolves this exploration bottleneck. We formulate this mapping as a convex optimization problem, guaranteeing a globally optimal numerical solution while deriving closed‑form solutions for specific divergences. Extensive experiments across diverse models and datasets demonstrate that BandPO consistently outperforms canonical clipping and Clip‑Higher, while robustly mitigating entropy collapse.
Authors:Yuheng Lei, Zhixuan Liang, Hongyuan Zhang, Ping Luo
Abstract:
Imitation learning from human demonstrations has achieved significant success in robotic control, yet most visuomotor policies still condition on single‑step observations or short‑context histories, making them struggle with non‑Markovian tasks that require long‑term memory. Simply enlarging the context window incurs substantial computational and memory costs and encourages overfitting to spurious correlations, leading to catastrophic failures under distribution shift and violating real‑time constraints in robotic systems. By contrast, humans can compress important past experiences into long‑term memories and exploit them to solve tasks throughout their lifetime. In this paper, we propose VPWEM, a non‑Markovian visuomotor policy equipped with working and episodic memories. VPWEM retains a sliding window of recent observation tokens as short‑term working memory, and introduces a Transformer‑based contextual memory compressor that recursively converts out‑of‑window observations into a fixed number of episodic memory tokens. The compressor uses self‑attention over a cache of past summary tokens and cross‑attention over a cache of historical observations, and is trained jointly with the policy. We instantiate VPWEM on diffusion policies to exploit both short‑term and episode‑wide information for action generation with nearly constant memory and computation per step. Experiments demonstrate that VPWEM outperforms state‑of‑the‑art baselines including diffusion policies and vision‑language‑action (VLA) models by more than 20% on the memory‑intensive manipulation tasks in MIKASA and achieves an average 5% improvement on the mobile manipulation benchmark MoMaRT. Code is available at https://github.com/HarryLui98/code_vpwem.
Authors:Sean Lamont, Christian Walder, Paul Montague, Amir Dezfouli, Michael Norrish
Abstract:
Diverse outputs in text generation are necessary for effective exploration in complex reasoning tasks, such as code generation and mathematical problem solving. Such Pass@k problems benefit from distinct candidates covering the solution space. However, traditional sampling approaches often waste computational resources on repetitive failure modes. While Diffusion Language Models have emerged as a competitive alternative to the prevailing Autoregressive paradigm, they remain susceptible to this redundancy, with independent samples frequently collapsing into similar modes. To address this, we propose a training free, low cost intervention to enhance generative diversity in Diffusion Language Models. Our approach modifies intermediate samples in a batch sequentially, where each sample is repelled from the feature space of previous samples, actively penalising redundancy. Unlike prior methods that require retraining or beam search, our strategy incurs negligible computational overhead, while ensuring that each sample contributes a unique perspective to the batch. We evaluate our method on the HumanEval and GSM8K benchmarks using the LLaDA‑8B‑Instruct model. Our results demonstrate significantly improved diversity and Pass@k performance across various temperature settings. As a simple modification to the sampling process, our method offers an immediate, low‑cost improvement for current and future Diffusion Language Models in tasks that benefit from diverse solution search. We make our code available at https://github.com/sean‑lamont/odd.
Authors:Minjune Hwang, Yigit Korkmaz, Daniel Seita, Erdem Bıyık
Abstract:
Preference‑based reward learning is widely used for shaping agent behavior to match a user's preference, yet its sparse binary feedback makes it especially vulnerable to causal confusion. The learned reward often latches onto spurious features that merely co‑occur with preferred trajectories during training, collapsing when those correlations disappear or reverse at test time. We introduce ReCouPLe, a lightweight framework that uses natural language rationales to provide the missing causal signal. Each rationale is treated as a guiding projection axis in an embedding space, training the model to score trajectories based on features aligned with that axis while de‑emphasizing context that is unrelated to the stated reason. Because the same rationales (e.g., "avoids collisions", "completes the task faster") can appear across multiple tasks, ReCouPLe naturally reuses the same causal direction whenever tasks share semantics, and transfers preference knowledge to novel tasks without extra data or language‑model fine‑tuning. Our learned reward model can ground preferences on the articulated reason, aligning better with user intent and generalizing beyond spurious features. ReCouPLe outperforms baselines by up to 1.5x in reward accuracy under distribution shifts, and 2x in downstream policy performance in novel tasks. We have released our code at https://github.com/mj‑hwang/ReCouPLe
Authors:Manav Vora, Gokul Puthumanaillam, Hiroyasu Tsukamoto, Melkior Ornik
Abstract:
Communication can improve coordination in partially observed multi‑agent reinforcement learning (MARL), but learning \emphwhen and \emphwho to communicate with requires choosing among many possible sender‑recipient pairs, and the effect of any single message on future reward is hard to isolate. We introduce SCoUT (Scalable Communication via Utility‑guided Temporal grouping), which addresses both these challenges via temporal and agent abstraction within traditional MARL. During training, SCoUT resamples soft agent groups every \(K\) environment steps (macro‑steps) via Gumbel‑Softmax; these groups are latent clusters that induce an affinity used as a differentiable prior over recipients. Using the same assignments, a group‑aware critic predicts values for each agent group and maps them to per‑agent baselines through the same soft assignments, reducing critic complexity and variance. Each agent is trained with a three‑headed policy: environment action, send decision, and recipient selection. To obtain precise communication learning signals, we derive counterfactual communication advantages by analytically removing each sender's contribution from the recipient's aggregated messages. This counterfactual computation enables precise credit assignment for both send and recipient‑selection decisions. At execution time, all centralized training components are discarded and only the per‑agent policy is run, preserving decentralized execution. Project website, videos and code: \hyperlinkhttps://scout‑comm.github.io/https://scout‑comm.github.io/
Authors:Boyu Han, Qianqian Xu, Shilong Bao, Zhiyong Yang, Ruochen Cui, Xilin Zhao, Qingming Huang
Abstract:
The limited understanding capacity of the visual encoder in Contrastive Language‑Image Pre‑training (CLIP) has become a key bottleneck for downstream performance. This capacity includes both Discriminative Ability (D‑Ability), which reflects class separability, and Detail Perceptual Ability (P‑Ability), which focuses on fine‑grained visual cues. Recent solutions use diffusion models to enhance representations by conditioning image reconstruction on CLIP visual tokens. We argue that such paradigms may compromise D‑Ability and therefore fail to effectively address CLIP's representation limitations. To address this, we integrate contrastive signals into diffusion‑based reconstruction to pursue more comprehensive visual representations. We begin with a straightforward design that augments the diffusion process with contrastive learning on input images. However, empirical results show that the naive combination suffers from gradient conflict and yields suboptimal performance. To balance the optimization, we introduce the Diffusion Contrastive Reconstruction (DCR), which unifies the learning objective. The key idea is to inject contrastive signals derived from each reconstructed image, rather than from the original input, into the diffusion process. Our theoretical analysis shows that the DCR loss can jointly optimize D‑Ability and P‑Ability. Extensive experiments across various benchmarks and multi‑modal large language models validate the effectiveness of our method. The code is available at https://github.com/boyuh/DCR.
Authors:Baoqing Yue, Zihan Zhu, Yifan Zhang, Jichen Feng, Hufei Yang, Mengdi Wang
Abstract:
Standard benchmarks have become increasingly unreliable due to saturation, subjectivity, and poor generalization. We argue that evaluating model's ability to acquire information actively is important to assess model's intelligence. We propose Interactive Benchmarks, a unified evaluation paradigm that assesses model's reasoning ability in an interactive process under budget constraints. We instantiate this framework across two settings: Interactive Proofs, where models interact with a judge to deduce objective truths or answers in logic and mathematics; and Interactive Games, where models reason strategically to maximize long‑horizon utilities. Our results show that interactive benchmarks provide a robust and faithful assessment of model intelligence, revealing that there is still substantial room to improve in interactive scenarios. Project page: https://github.com/interactivebench/interactivebench
Authors:Jihoon Jeong
Abstract:
Model Medicine is the science of understanding, diagnosing, treating, and preventing disorders in AI models, grounded in the principle that AI models ‑‑ like biological organisms ‑‑ have internal structures, dynamic processes, heritable traits, observable symptoms, classifiable conditions, and treatable states. This paper introduces Model Medicine as a research program, bridging the gap between current AI interpretability research (anatomical observation) and the systematic clinical practice that complex AI systems increasingly require. We present five contributions: (1) a discipline taxonomy organizing 15 subdisciplines across four divisions ‑‑ Basic Model Sciences, Clinical Model Sciences, Model Public Health, and Model Architectural Medicine; (2) the Four Shell Model (v3.3), a behavioral genetics framework empirically grounded in 720 agents and 24,923 decisions from the Agora‑12 program, explaining how model behavior emerges from Core‑‑Shell interaction; (3) Neural MRI (Model Resonance Imaging), a working open‑source diagnostic tool mapping five medical neuroimaging modalities to AI interpretability techniques, validated through four clinical cases demonstrating imaging, comparison, localization, and predictive capability; (4) a five‑layer diagnostic framework for comprehensive model assessment; and (5) clinical model sciences including the Model Temperament Index for behavioral profiling, Model Semiology for symptom description, and M‑CARE for standardized case reporting. We additionally propose the Layered Core Hypothesis ‑‑ a biologically‑inspired three‑layer parameter architecture ‑‑ and a therapeutic framework connecting diagnosis to treatment.
Authors:Tianyu Liu, Jirui Qi, Mrinmaya Sachan, Ryan Cotterell, Raquel Fernández, Arianna Bisazza
Abstract:
Large language models are known to often exhibit inconsistent knowledge. This is particularly problematic in multilingual scenarios, where models are likely to be asked similar questions in different languages, and inconsistent responses can undermine their reliability. In this work, we show that this issue can be mitigated using reinforcement learning with a structured reward function, which leads to an optimal policy with consistent crosslingual responses. We introduce Direct Consistency Optimization (DCO), a DPO‑inspired method that requires no explicit reward model and is derived directly from the LLM itself. Comprehensive experiments show that DCO significantly improves crosslingual consistency across diverse LLMs and outperforms existing methods when training with samples of multiple languages, while complementing DPO when gold labels are available. Extra experiments demonstrate the effectiveness of DCO in bilingual settings, significant out‑of‑domain generalizability, and controllable alignment via direction hyperparameters. Taken together, these results establish DCO as a robust and efficient solution for improving knowledge consistency across languages in multilingual LLMs. All code, training scripts, and evaluation benchmarks are released at https://github.com/Betswish/ConsistencyRL.
Authors:Lei Huang, Xiang Cheng, Chenxiao Zhao, Guobin Shen, Junjie Yang, Xiaocheng Feng, Yuxuan Gu, Xing Yu, Bing Qin
Abstract:
Large language models (LLMs) typically receive diverse natural language (NL) feedback through interaction with the environment. However, current reinforcement learning (RL) algorithms rely solely on scalar rewards, leaving the rich information in NL feedback underutilized and leading to inefficient exploration. In this work, we propose GOLF, an RL framework that explicitly exploits group‑level language feedback to guide targeted exploration through actionable refinements. GOLF aggregates two complementary feedback sources: (i) external critiques that pinpoint errors or propose targeted fixes, and (ii) intra‑group attempts that supply alternative partial ideas and diverse failure patterns. These group‑level feedbacks are aggregated to produce high‑quality refinements, which are adaptively injected into training as off‑policy scaffolds to provide targeted guidance in sparse‑reward regions. Meanwhile, GOLF jointly optimizes generation and refinement within a unified RL loop, creating a virtuous cycle that continuously improves both capabilities. Experiments on both verifiable and non‑verifiable benchmarks show that GOLF achieves superior performance and exploration efficiency, achieving 2.2× improvements in sample efficiency compared to RL methods trained solely on scalar rewards. Code is available at https://github.com/LuckyyySTA/GOLF.
Authors:Nathan Kuissi, Suraj Subrahmanyan, Nandan Thakur, Jimmy Lin
Abstract:
Information retrieval (IR) benchmarks typically follow the Cranfield paradigm, relying on static and predefined corpora. However, temporal changes in technical corpora, such as API deprecations and code reorganizations, can render existing benchmarks stale. In our work, we investigate how temporal corpus drift affects FreshStack, a retrieval benchmark focused on technical domains. We examine two independent corpus snapshots of FreshStack from October 2024 and October 2025 to answer questions about LangChain. Our analysis shows that all but one query posed in 2024 remain fully supported by the 2025 corpus, as relevant documents "migrate" from LangChain to competitor repositories, such as LlamaIndex. Next, we compare the accuracy of retrieval models on both snapshots and observe only minor shifts in model rankings, with overall strong correlation of up to 0.978 Kendall τ at Recall@50. These results suggest that retrieval benchmarks re‑judged with evolving temporal corpora can remain reliable for retrieval evaluation. We publicly release all our artifacts at https://github.com/fresh‑stack/driftbench.
Authors:Ismail Lotfi, Ali Ghrayeb, Samson Lasaulce, Merouane Debbah
Abstract:
This paper investigates the integration of large language models (LLMs) as reasoning agents in repeated spectrum auctions within heterogeneous networks (HetNets). While auction‑based mechanisms have been widely employed for efficient resource allocation, most prior works assume one‑shot auctions, static bidder behavior, and idealized conditions. In contrast to traditional formulations where base station (BS) association and power allocation are centrally optimized, we propose a distributed auction‑based framework in which each BS independently conducts its own multi‑channel auction, and user equipments (UEs) strategically decide both their association and bid values. Within this setting, UEs operate under budget constraints and repeated interactions, transforming resource allocation into a long‑term economic decision rather than a one‑shot optimization problem. The proposed framework enables the evaluation of diverse bidding behaviors from classical myopic and greedy policies to LLM‑based agents capable of reasoning over historical outcomes, anticipating competition, and adapting their bidding strategy across episodes. Simulation results reveal that the LLM‑empowered UE consistently achieves higher channel access frequency and improved budget efficiency compared to benchmarks. These findings highlight the potential of reasoning‑enabled agents in future decentralized wireless networks markets and pave the way for lightweight, edge‑deployable LLMs to support intelligent resource allocation in next‑generation HetNets.
Authors:Michael Majurski, Cynthia Matuszek
Abstract:
How carefully and unambiguously a question is phrased has a profound impact on the quality of the response, for Language Models (LMs) as well as people. While model capabilities continue to advance, the interplay between grounding context and query formulation remains under‑explored. This work investigates how the quality of background grounding information in a model's context window affects accuracy. We find that combining well‑grounded dynamic context construction (i.e, RAG) with query rewriting reduces question ambiguity, resulting in significant accuracy gains. Given a user question with associated answer‑free grounding context, rewriting the question to reduce ambiguity produces benchmark improvements without changing the answer itself, even compared to prepending that context before the question. Using \textttgpt‑oss‑20b to rewrite a subset of Humanity's Last Exam using answer‑free grounding context improves \textttgpt‑5‑mini accuracy from 0.14 to 0.37. We demonstrate that this accuracy improvement cannot be fully recovered just through prompting at inference time; rather, distinct rewriting and answering phases are required. Code and data are available at https://github.com/mmajurski/lm‑rewrite‑uplift
Authors:Yakov Pyotr Shkolnikov
Abstract:
Multi‑agent LLM systems on edge devices face a memory management problem: device RAM is too small to hold every agent's KV cache simultaneously. On Apple M4 Pro with 10.2 GB of cache budget, only 3 agents fit at 8K context in FP16. A 10‑agent workflow must constantly evict and reload caches. Without persistence, every eviction forces a full re‑prefill through the model ‑‑ 15.7 seconds per agent at 4K context. We address this by persisting each agent's KV cache to disk in 4‑bit quantized format and reloading it directly into the attention layer, eliminating redundant O(n) prefill computation via direct cache restoration. The system comprises three components: a block pool providing per‑agent isolated Q4 KV caches in safetensors format, a BatchQuantizedKVCache for concurrent inference over multiple agents' quantized caches, and cross‑phase context injection that accumulates attention state across conversation phases without re‑computation. Evaluated on three architectures (Gemma 3 12B, dense GQA, 48 layers; DeepSeek‑Coder‑V2‑Lite 16B, MoE MLA, 27 layers; Llama 3.1 8B, dense GQA, 32 layers), cache restoration reduces time‑to‑first‑token by up to 136x (Gemma: 22‑‑136x at 4K‑‑32K; DeepSeek: 11‑‑76x at 4K‑‑32K; Llama: 24‑‑111x at 4K‑‑16K; 3‑‑10x at 1K). Q4 quantization fits 4x more agent contexts into fixed device memory than FP16. Perplexity measured with actual Q4 KV caches shows ‑0.7% for Gemma, +2.8% for Llama, and +3.0% for DeepSeek. Open‑source at https://github.com/yshk‑mxim/agent‑memory
Authors:Murad Farzulla
Abstract:
We characterize the phenomenon of context‑dependent affordance computation in vision‑language models (VLMs). Through a large‑scale computational study (n=3,213 scene‑context pairs from COCO‑2017) using Qwen‑VL 30B and LLaVA‑1.5‑13B subject to systematic context priming across 7 agentic personas, we demonstrate massive affordance drift: mean Jaccard similarity between context conditions is 0.095 (95% CI: [0.093, 0.096], p < 0.0001), indicating that >90% of lexical scene description is context‑dependent. Sentence‑level cosine similarity confirms substantial drift at the semantic level (mean = 0.415, 58.5% context‑dependent). Stochastic baseline experiments (2,384 inference runs across 4 temperatures and 5 seeds) confirm this drift reflects genuine context effects rather than generation noise: within‑prime variance is substantially lower than cross‑prime variance across all conditions. Tucker decomposition with bootstrap stability analysis (n=1,000 resamples) reveals stable orthogonal latent factors: a "Culinary Manifold" isolated to chef contexts and an "Access Axis" spanning child‑mobility contrasts. These findings establish that VLMs compute affordances in a substantially context‑dependent manner ‑‑ with the difference between lexical (90%) and semantic (58.5%) measures reflecting that surface vocabulary changes more than underlying meaning under context shifts ‑‑ and suggest a direction for robotics research: dynamic, query‑dependent ontological projection (JIT Ontology) rather than static world modeling. We do not claim to establish processing order or architectural primacy; such claims require internal representational analysis beyond output behavior.
Authors:Ekansh Arora
Abstract:
Foundation models are increasingly applied to computational pathology, yet their behavior under cross‑cancer and cross‑species transfer remains unspecified. This study investigated how fine‑tuning CPath‑CLIP affects cancer detection under same‑cancer, cross‑cancer, and cross‑species conditions using whole‑slide image patches from canine and human histopathology. Performance was measured using area under the receiver operating characteristic curve (AUC). Few‑shot fine‑tuning improved same‑cancer (64.9% to 72.6% AUC) and cross‑cancer performance (56.84% to 66.31% AUC). Cross‑species evaluation revealed that while tissue matching enables meaningful transfer, performance remains below state‑of‑the‑art benchmarks (H‑optimus‑0: 84.97% AUC), indicating that standard vision‑language alignment is suboptimal for cross‑species generalization. Embedding space analysis revealed extremely high cosine similarity (greater than 0.99) between tumor and normal prototypes. Grad‑CAM shows prototype‑based models remain domain‑locked, while language‑guided models attend to conserved tumor morphology. To address this, we introduce Semantic Anchoring, which uses language to provide a stable coordinate system for visual features. Ablation studies reveal that benefits stem from the text‑alignment mechanism itself, regardless of text encoder complexity. Benchmarking against H‑optimus‑0 shows that CPath‑CLIP's failure stems from intrinsic embedding collapse, which text alignment effectively circumvents. Additional gains were observed in same‑cancer (8.52%) and cross‑cancer classification (5.67%). We identified a previously uncharacterized failure mode: semantic collapse driven by species‑dominated alignment rather than missing visual information. These results demonstrate that language acts as a control mechanism, enabling semantic re‑interpretation without retraining.
Authors:Haian Jin, Rundi Wu, Tianyuan Zhang, Ruiqi Gao, Jonathan T. Barron, Noah Snavely, Aleksander Holynski
Abstract:
Feed‑forward transformer models have driven rapid progress in 3D vision, but state‑of‑the‑art methods such as VGGT and π^3 have a computational cost that scales quadratically with the number of input images, making them inefficient when applied to large image collections. Sequential‑reconstruction approaches reduce this cost but sacrifice reconstruction quality. We introduce ZipMap, a stateful feed‑forward model that achieves linear‑time, bidirectional 3D reconstruction while matching or surpassing the accuracy of quadratic‑time methods. ZipMap employs test‑time training layers to zip an entire image collection into a compact hidden scene state in a single forward pass, enabling reconstruction of over 700 frames in under 10 seconds on a single H100 GPU, more than 20× faster than state‑of‑the‑art methods such as VGGT. Moreover, we demonstrate the benefits of having a stateful representation in real‑time scene‑state querying and its extension to sequential streaming reconstruction.
Authors:Zachary Novack, Zack Zukowski, CJ Carr, Julian Parker, Zach Evans, Josiah Taylor, Taylor Berg-Kirkpatrick, Julian McAuley, Jordi Pons
Abstract:
Generative audio requires fine‑grained controllable outputs, yet most existing methods require model retraining on specific controls or inference‑time controls (e.g., guidance) that can also be computationally demanding. By examining the bottlenecks of existing guidance‑based controls, in particular their high cost‑per‑step due to decoder backpropagation, we introduce a guidance‑based approach through selective TFG and Latent‑Control Heads (LatCHs), which enables controlling latent audio diffusion models with low computational overhead. LatCHs operate directly in latent space, avoiding the expensive decoder step, and requiring minimal training resources (7M parameters and \approx 4 hours of training). Experiments with Stable Audio Open demonstrate effective control over intensity, pitch, and beats (and a combination of those) while maintaining generation quality. Our method balances precision and audio fidelity with far lower computational costs than standard end‑to‑end guidance. Demo examples can be found at https://zacharynovack.github.io/latch/latch.html.
Authors:William Grolleau, Achraf Chaouch, Astrid Sabourin, Guillaume Lapouge, Catherine Achard
Abstract:
Animal re‑identification (ReID) faces critical challenges due to viewpoint variations, particularly in Aerial‑Ground (AG‑ReID) settings where models must match individuals across drastic elevation changes. However, existing datasets lack the precise angular annotations required to systematically analyze these geometric variations. To address this, we introduce the Multi‑view Oriented Observation (MOO) dataset, a large‑scale synthetic AG‑ReID dataset of 1,000 cattle individuals captured from 128 uniformly sampled viewpoints (128,000 annotated images). Using this controlled dataset, we quantify the influence of elevation and identify a critical elevation threshold, above which models generalize significantly better to unseen views. Finally, we validate the transferability to real‑world applications in both zero‑shot and supervised settings, demonstrating performance gains across four real‑world cattle datasets and confirming that synthetic geometric priors effectively bridge the domain gap. Collectively, this dataset and analysis lay the foundation for future model development in cross‑view animal ReID. MOO is publicly available at https://github.com/TurtleSmoke/MOO.
Authors:Pranav Kumar Kaliaperumal
Abstract:
Post‑training quantization (PTQ) of transformers is known to suffer from severe accuracy degradation due to structured activation outliers, as originally analyzed by Bondarenko et al. (EMNLP 2021) in work associated with Qualcomm AI Research. This paper provides a reproducible empirical reproduction and systems‑level extension of that phenomenon in BERT‑base fine‑tuned on QNLI. When global W8A8 quantization is applied, validation accuracy drops sharply from 89.66% (FP32) to 54.33%, a decrease of 35.33 points. Statistical analysis of FP32 activations shows strongly heavy‑tailed behavior that intensifies with model depth: kurtosis reaches 271 in the final layers and approximately 55% of activation energy is concentrated in the top 1% of channels. We evaluate several mitigation strategies. Mixed precision PTQ restores accuracy close to the FP32 baseline (89.42%). Per‑embedding‑group (PEG) quantization shows strong sensitivity to grouping structure, improving accuracy from 66.12% with three groups to 86.18% with four groups. In contrast, percentile‑based calibration, even at thresholds between 99.0 and 99.99, fails to recover accuracy (about 50.54%), indicating that large activation channels encode structured signal rather than rare noise. Deployment profiling on an RTX 3050 GPU shows minimal differences in latency and memory usage across methods (median latency about 58‑59 ms; VRAM usage about 484‑486 MB), highlighting the importance of hardware‑aware evaluation. Overall, the results show that PTQ failure in transformers is primarily driven by structured channel dominance amplified through residual connections. Effective mitigation therefore requires channel‑aware precision allocation rather than scalar clipping alone.
Authors:Ioannis Prokopiou, Ioannis Sina, Agisilaos Kounelis, Pantelis Vikatos, Themos Stafylakis
Abstract:
The advancement of Machine learning (ML), Large Audio Language Models (LALMs), and autonomous AI agents in Music Information Retrieval (MIR) necessitates a shift from static tagging to rich, human‑aligned representation learning. However, the scarcity of open‑source infrastructure capable of capturing the subjective nuances of audio annotation remains a critical bottleneck. This paper introduces LabelBuddy, an open‑source collaborative auto‑tagging audio annotation tool designed to bridge the gap between human intent and machine understanding. Unlike static tools, it decouples the interface from inference via containerized backends, allowing users to plug in custom models for AI‑assisted pre‑annotation. We describe the system architecture, which supports multi‑user consensus, containerized model isolation, and a roadmap for extending agents and LALMs. Code available at https://github.com/GiannisProkopiou/gsoc2022‑Label‑buddy.
Authors:Lingen Li, Guangzhi Wang, Xiaoyu Li, Zhaoyang Zhang, Qi Dou, Jinwei Gu, Tianfan Xue, Ying Shan
Abstract:
Generating high‑quality 360° panoramic videos from perspective input is one of the crucial applications for virtual reality (VR), whereby high‑resolution videos are especially important for immersive experience. Existing methods are constrained by computational limitations of vanilla diffusion models, only supporting \leq 1K resolution native generation and relying on suboptimal post super‑resolution to increase resolution. We introduce CubeComposer, a novel spatio‑temporal autoregressive diffusion model that natively generates 4K‑resolution 360° videos. By decomposing videos into cubemap representations with six faces, CubeComposer autoregressively synthesizes content in a well‑planned spatio‑temporal order, reducing memory demands while enabling high‑resolution output. Specifically, to address challenges in multi‑dimensional autoregression, we propose: (1) a spatio‑temporal autoregressive strategy that orchestrates 360° video generation across cube faces and time windows for coherent synthesis; (2) a cube face context management mechanism, equipped with a sparse context attention design to improve efficiency; and (3) continuity‑aware techniques, including cube‑aware positional encoding, padding, and blending to eliminate boundary seams. Extensive experiments on benchmark datasets demonstrate that CubeComposer outperforms state‑of‑the‑art methods in native resolution and visual quality, supporting practical VR application scenarios. Project page: https://lg‑li.github.io/project/cubecomposer
Authors:Qianyun Guo, Yibo Li, Yue Liu, Bryan Hooi
Abstract:
Large Language Models (LLMs) are increasingly serving as personal assistants, where users share complex and diverse preferences over extended interactions. However, assessing how well LLMs can follow these preferences in realistic, long‑term situations remains underexplored. This work proposes RealPref, a benchmark for evaluating realistic preference‑following in personalized user‑LLM interactions. RealPref features 100 user profiles, 1300 personalized preferences, four types of preference expression (ranging from explicit to implicit), and long‑horizon interaction histories. It includes three types of test questions (multiple‑choice, true‑or‑false, and open‑ended), with detailed rubrics for LLM‑as‑a‑judge evaluation. Results indicate that LLM performance significantly drops as context length grows and preference expression becomes more implicit, and that generalizing user preference understanding to unseen scenarios poses further challenges. RealPref and these findings provide a foundation for future research to develop user‑aware LLM assistants that better adapt to individual needs. The code is available at https://github.com/GG14127/RealPref.
Authors:Yinghong Yu, Guangyuan Li, Jiancheng Yang
Abstract:
Large‑scale 2D foundation models exhibit strong transferable representations, yet extending them to 3D volumetric data typically requires retraining, adapters, or architectural redesign. We introduce PlaneCycle, a training‑free, adapter‑free operator for architecture‑agnostic 2D‑to‑3D lifting of foundation models. PlaneCycle reuses the original pretrained 2D backbone by cyclically distributing spatial aggregation across orthogonal HW, DW, and DH planes throughout network depth, enabling progressive 3D fusion while preserving pretrained inductive biases. The method introduces no additional parameters and is applicable to arbitrary 2D networks. Using pretrained DINOv3 models, we evaluate PlaneCycle on six 3D classification and three 3D segmentation benchmarks. Without any training, the lifted models exhibit intrinsic 3D fusion capability and, under linear probing, outperform slice‑wise 2D baselines and strong 3D counterparts, approaching the performance of fully trained models. With full fine‑tuning, PlaneCycle matches standard 3D architectures, highlighting its potential as a seamless and practical 2D‑to‑3D lifting operator. These results demonstrate that 3D capability can be unlocked from pretrained 2D foundation models without structural modification or retraining. Code is available at https://github.com/HINTLab/PlaneCycle.
Authors:Mingleyang Li, Yuran Wang, Yue Chen, Tianxing Chen, Jiaqi Liang, Zishun Shen, Haoran Lu, Ruihai Wu, Hao Dong
Abstract:
Garment manipulation has attracted increasing attention due to its critical role in home‑assistant robotics. However, the majority of existing garment manipulation works assume an initial state consisting of only one garment, while piled garments are far more common in real‑world settings. To bridge this gap, we propose a novel garment retrieval pipeline that can not only follow language instruction to execute safe and clean retrieval but also guarantee exactly one garment is retrieved per attempt, establishing a robust foundation for the execution of downstream tasks (e.g., folding, hanging, wearing). Our pipeline seamlessly integrates vision‑language reasoning with visual affordance perception, fully leveraging the high‑level reasoning and planning capabilities of VLMs alongside the generalization power of visual affordance for low‑level actions. To enhance the VLM's comprehensive awareness of each garment's state within a garment pile, we employ visual segmentation model (SAM2) to execute object segmentation on the garment pile for aiding VLM‑based reasoning with sufficient visual cues. A mask fine‑tuning mechanism is further integrated to address scenarios where the initial segmentation results are suboptimal. In addition, a dual‑arm cooperation framework is deployed to address cases involving large or long garments, as well as excessive garment sagging caused by incorrect grasping point determination, both of which are strenuous for a single arm to handle. The effectiveness of our pipeline are consistently demonstrated across diverse tasks and varying scenarios in both real‑world and simulation environments. Project page: https://garmentpile2.github.io/.
Authors:Yanmei Zou, Hongshan Yu, Yaonan Wang, Zhengeng Yang, Xieyuanli Chen, Kailun Yang, Naveed Akhtar
Abstract:
Multi‑Layer Perceptron (MLP) models are the foundation of contemporary point cloud processing. However, their complex network architectures obscure the source of their strength and limit the application of these models. In this article, we develop a two‑stage abstraction and refinement (ABS‑REF) view for modular feature extraction in point cloud processing. This view elucidates that whereas the early models focused on ABS stages, the more recent techniques devise sophisticated REF stages to attain performance advantages. Then, we propose a High‑dimensional Positional Encoding (HPE) module to explicitly utilize intrinsic positional information, extending the ``positional encoding'' concept from Transformer literature. HPE can be readily deployed in MLP‑based architectures and is compatible with transformer‑based methods. Within our ABS‑REF view, we rethink local aggregation in MLP‑based methods and propose replacing time‑consuming local MLP operations, which are used to capture local relationships among neighbors. Instead, we use non‑local MLPs for efficient non‑local information updates, combined with the proposed HPE for effective local information representation. We leverage our modules to develop HPENets, a suite of MLP networks that follow the ABS‑REF paradigm, incorporating a scalable HPE‑based REF stage. Extensive experiments on seven public datasets across four different tasks show that HPENets deliver a strong balance between efficiency and effectiveness. Notably, HPENet surpasses PointNeXt, a strong MLP‑based counterpart, by 1.1% mAcc, 4.0% mIoU, 1.8% mIoU, and 0.2% Cls. mIoU, with only 50.0%, 21.5%, 23.1%, 44.4% of FLOPs on ScanObjectNN, S3DIS, ScanNet, and ShapeNetPart, respectively. Source code is available at https://github.com/zouyanmei/HPENet_v2.git.
Authors:Tao Yang, Qing Zhou, Yanliang Li, Qi Wang
Abstract:
Reasoning segmentation increasingly employs reinforcement learning to generate explanatory reasoning chains that guide Multimodal Large Language Models. While these geometric rewards are primarily confined to guiding the final localization, they are incapable of discriminating whether the reasoning process remains anchored on the referred region or strays into irrelevant context. Lacking this discriminative guidance, the model's reasoning often devolves into unfocused and verbose chains that ultimately fail to disambiguate and perceive the target in complex scenes. This suggests a need to complement the RL objective with Discriminative Perception, an ability to actively distinguish a target from its context. To realize this, we propose DPAD to compel the model to generate a descriptive caption of the referred object, which is then used to explicitly discriminate by contrasting the caption's semantic relevance to the referred object against the wider context. By optimizing for this discriminative capability, the model is forced to focus on the unique attributes of the target, leading to a more converged and efficient reasoning chain. The descriptive caption also serves as an interpretability rationale that aligns with the segmentation. Experiments on the benchmarks confirm the validity of our approach, delivering substantial performance gains, with the cIoU on ReasonSeg increasing by 3.09% and the reasoning chain length decreasing by approximately 42%. Code is available at https://github.com/mrazhou/DPAD
Authors:Jaewon Lee, Jaeseok Heo, Gunmin Lee, Howoong Jun, Jeongwoo Oh, Songhwai Oh
Abstract:
Safe visual navigation is critical for indoor mobile robots operating in cluttered environments. Existing benchmarks, however, often neglect collisions or are designed for outdoor scenarios, making them unsuitable for indoor visual navigation. To address this limitation, we introduce the reactive visual navigation benchmark (RVN‑Bench), a collision‑aware benchmark for indoor mobile robots. In RVN‑Bench, an agent must reach sequential goal positions in previously unseen environments using only visual observations and no prior map, while avoiding collisions. Built on the Habitat 2.0 simulator and leveraging high‑fidelity HM3D scenes, RVN‑Bench provides large‑scale, diverse indoor environments, defines a collision‑aware navigation task and evaluation metrics, and offers tools for standardized training and benchmarking. RVN‑Bench supports both online and offline learning by offering an environment for online reinforcement learning, a trajectory image dataset generator, and tools for producing negative trajectory image datasets that capture collision events. Experiments show that policies trained on RVN‑Bench generalize effectively to unseen environments, demonstrating its value as a standardized benchmark for safe and robust visual navigation. Code and additional materials are available at: https://rvn‑bench.github.io/.
Authors:Radia Daci, Vito Renò, Cosimo Patruno, Angelo Cardellicchio, Abdelmalik Taleb-Ahmed, Marco Leo, Cosimo Distante
Abstract:
Multimodal industrial anomaly detection benefits from integrating RGB appearance with 3D surface geometry, yet existing \emphunsupervised approaches commonly rely on memory banks, teacher‑student architectures, or fragile fusion schemes, limiting robustness under noisy depth, weak texture, or missing modalities. This paper introduces CMDR‑IAD, a lightweight and modality‑flexible unsupervised framework for reliable anomaly detection in 2D+3D multimodal as well as single‑modality (2D‑only or 3D‑only) settings. CMDR‑IAD combines bidirectional 2D\leftrightarrow3D cross‑modal mapping to model appearance‑geometry consistency with dual‑branch reconstruction that independently captures normal texture and geometric structure. A two‑part fusion strategy integrates these cues: a reliability‑gated mapping anomaly highlights spatially consistent texture‑geometry discrepancies, while a confidence‑weighted reconstruction anomaly adaptively balances appearance and geometric deviations, yielding stable and precise anomaly localization even in depth‑sparse or low‑texture regions. On the MVTec 3D‑AD benchmark, CMDR‑IAD achieves state‑of‑the‑art performance while operating without memory banks, reaching 97.3% image‑level AUROC (I‑AUROC), 99.6% pixel‑level AUROC (P‑AUROC), and 97.6% AUPRO. On a real‑world polyurethane cutting dataset, the 3D‑only variant attains 92.6% I‑AUROC and 92.5% P‑AUROC, demonstrating strong effectiveness under practical industrial conditions. These results highlight the framework's robustness, modality flexibility, and the effectiveness of the proposed fusion strategies for industrial visual inspection. Our source code is available at https://github.com/ECGAI‑Research/CMDR‑IAD/
Authors:Martin Kostelník, Michal Hradiš, Martin Dočekal
Abstract:
Topic localization aims to identify spans of text that express a given topic defined by a name and description. To study this task, we introduce a human‑annotated benchmark based on Czech historical documents, containing human‑defined topics together with manually annotated spans and supporting evaluation at both document and word levels. Evaluation is performed relative to human agreement rather than a single reference annotation. We evaluate a diverse range of large language models alongside BERT‑based models fine‑tuned on a distilled development dataset. Results reveal substantial variability among LLMs, with performance ranging from near‑human topic detection to pronounced failures in span localization. While the strongest models approach human agreement, the distilled token embedding models remain competitive despite their smaller scale. The dataset and evaluation framework are publicly available at: https://github.com/dcgm/czechtopic.
Authors:Olga Krestinskaya, Mohammed E. Fouda, Ahmed Eltawil, Khaled N. Salama
Abstract:
Software‑hardware co‑design is essential for optimizing in‑memory computing (IMC) hardware accelerators for neural networks. However, most existing optimization frameworks target a single workload, leading to highly specialized hardware designs that do not generalize well across models and applications. In contrast, practical deployment scenarios require a single IMC platform that can efficiently support multiple neural network workloads. This work presents a joint hardware‑workload co‑optimization framework based on an optimized evolutionary algorithm for designing generalized IMC accelerator architectures. By explicitly capturing cross‑workload trade‑offs rather than optimizing for a single model, the proposed approach significantly reduces the performance gap between workload‑specific and generalized IMC designs. The framework is evaluated on both RRAM‑ and SRAM‑based IMC architectures, demonstrating strong robustness and adaptability across diverse design scenarios. Compared to baseline methods, the optimized designs achieve energy‑delay‑area product (EDAP) reductions of up to 76.2% and 95.5% when optimizing across a small set (4 workloads) and a large set (9 workloads), respectively. The source code of the framework is available at https://github.com/OlgaKrestinskaya/JointHardwareWorkloadOptimizationIMC.
Authors:Ruilin Luo, Chufan Shi, Yizhen Zhang, Cheng Yang, Songtao Jiang, Tongkun Guan, Ruizhe Chen, Ruihang Chu, Peng Wang, Mingkun Yang, Yujiu Yang, Junyang Lin, Zhibo Yang
Abstract:
The cold‑start initialization stage plays a pivotal role in training Multimodal Large Reasoning Models (MLRMs), yet its mechanisms remain insufficiently understood. To analyze this stage, we introduce the Visual Attention Score (VAS), an attention‑based metric that quantifies how much a model attends to visual tokens. We find that reasoning performance is strongly correlated with VAS (r=0.9616): models with higher VAS achieve substantially stronger multimodal reasoning. Surprisingly, multimodal cold‑start fails to elevate VAS, resulting in attention distributions close to the base model, whereas text‑only cold‑start leads to a clear increase. We term this counter‑intuitive phenomenon Lazy Attention Localization. To validate its causal role, we design training‑free interventions that directly modulate attention allocation during inference, performance gains of 1‑2% without any retraining. Building on these insights, we further propose Attention‑Guided Visual Anchoring and Reflection (AVAR), a comprehensive cold‑start framework that integrates visual‑anchored data synthesis, attention‑guided objectives, and visual‑anchored reward shaping. Applied to Qwen2.5‑VL‑7B, AVAR achieves an average gain of 7.0% across 7 multimodal reasoning benchmarks. Ablation studies further confirm that each component of AVAR contributes step‑wise to the overall gains. The code, data, and models are available at https://github.com/lrlbbzl/Qwen‑AVAR.
Authors:Huihan Liu, Changyeon Kim, Bo Liu, Minghuan Liu, Yuke Zhu
Abstract:
Continual learning is a long‑standing challenge in robot policy learning, where a policy must acquire new skills over time without catastrophically forgetting previously learned ones. While prior work has extensively studied continual learning in relatively small behavior cloning (BC) policy models trained from scratch, its behavior in modern large‑scale pretrained Vision‑Language‑Action (VLA) models remains underexplored. In this work, we found that pretrained VLAs are remarkably resistant to forgetting compared with smaller policy models trained from scratch. Simple Experience Replay (ER) works surprisingly well on VLAs, sometimes achieving zero forgetting even with a small replay data size. Our analysis reveals that pretraining plays a critical role in downstream continual learning performance: large pretrained models mitigate forgetting with a small replay buffer size while maintaining strong forward learning capabilities. Furthermore, we found that VLAs can retain relevant knowledge from prior tasks despite performance degradation during learning new tasks. This knowledge retention enables rapid recovery of seemingly forgotten skills through finetuning. Together, these insights imply that large‑scale pretraining fundamentally changes the dynamics of continual learning, enabling models to continually acquire new skills over time with simple replay. Code and more information can be found at https://ut‑austin‑rpl.github.io/continual‑vla
Authors:Yanbo Wang, Jiaxuan You, Chuan Shi, Muhan Zhang
Abstract:
Relational Databases (RDBs) are the backbone of modern business, yet they lack foundation models comparable to those in text or vision. A key obstacle is that high‑quality RDBs are private, scarce and structurally heterogeneous, making internet‑scale pre‑training infeasible. To overcome this data scarcity, We introduce RDB‑PFN, the first relational foundation model trained purely via synthetic data. Inspired by Prior‑Data Fitted Networks (PFNs) where synthetic data generated from Structural Causal Models (SCMs) enables reasoning on single tables, we design a Relational Prior Generator to create an infinite stream of diverse RDBs from scratch. Pre‑training on over 2 million synthetic single‑table and relational tasks, RDB‑PFN learns to adapt to any new database instantly via genuine in‑context learning. Experiments verify RDB‑PFN achieves strong few‑shot performance on 19 real‑world relational prediction tasks, outperforming graph‑based and single‑table foundation‑model baselines (given the same DFS‑linearized inputs), while using a lightweight architecture and fast inference. The code is available at https://github.com/MuLabPKU/RDBPFN
Authors:Taejun Lim, Joong-Won Hwang, Kibok Lee
Abstract:
When continual test‑time adaptation (TTA) persists over the long term, errors accumulate in the model and further cause it to predict only a few classes for all inputs, a phenomenon known as model collapse. Recent studies have explored reset strategies that completely erase these accumulated errors. However, their periodic resets lead to suboptimal adaptation, as they occur independently of the actual risk of collapse. Moreover, their full resets cause catastrophic loss of knowledge acquired over time, even though such knowledge could be beneficial in the future. To this end, we propose (1) an Adaptive and Selective Reset (ASR) scheme that dynamically determines when and where to reset, (2) an importance‑aware regularizer to recover essential knowledge lost due to reset, and (3) an on‑the‑fly adaptation adjustment scheme to enhance adaptability under challenging domain shifts. Extensive experiments across long‑term TTA benchmarks demonstrate the effectiveness of our approach, particularly under challenging conditions. Our code is available at https://github.com/YonseiML/asr.
Authors:Qinsi Wang, Hancheng Ye, Jinhee Kim, Jinghan Ke, Yifei Wang, Martin Kuo, Zishan Shao, Dongting Li, Yueqian Lin, Ting Jiang, Chiyue Wei, Qi Qian, Wei Wen, Helen Li, Yiran Chen
Abstract:
Think about how human handles complex reading tasks: marking key points, inferring their relationships, and structuring information to guide understanding and responses. Likewise, can a large language model benefit from text structure to enhance text‑processing performance? To explore it, in this work, we first introduce Structure of Thought (SoT), a prompting technique that explicitly guides models to construct intermediate text structures, consistently boosting performance across eight tasks and three model families. Building upon this insight, we present T2S‑Bench, the first benchmark designed to evaluate and improve text‑to‑structure capabilities of models. T2S‑Bench includes 1.8K samples across 6 scientific domains and 32 structural types, rigorously constructed to ensure accuracy, fairness, and quality. Evaluation on 45 mainstream models reveals substantial improvement potential: the average accuracy on the multi‑hop reasoning task is only 52.1%, and even the most advanced model achieves 58.1% node accuracy in end‑to‑end extraction. Furthermore, on Qwen2.5‑7B‑Instruct, SoT alone yields an average +5.7% improvement across eight diverse text‑processing tasks, and fine‑tuning on T2S‑Bench further increases this gain to +8.6%. These results highlight the value of explicit text structuring and the complementary contributions of SoT and T2S‑Bench. Dataset and eval code have been released at https://t2s‑bench.github.io/T2S‑Bench‑Page/.
Authors:Zihao Cheng, Weixin Wang, Yu Zhao, Ziyang Ren, Jiaxuan Chen, Ruiyang Xu, Shuai Huang, Yang Chen, Guowei Li, Mengshi Wang, Yi Xie, Ren Zhu, Zeren Jiang, Keda Lu, Yihong Li, Xiaoliang Wang, Liwei Liu, Cam-Tu Nguyen
Abstract:
Long‑term memory is fundamental for personalized agents capable of accumulating knowledge, reasoning over user experiences, and adapting across time. However, existing memory benchmarks primarily target declarative memory, specifically semantic and episodic types, where all information is explicitly presented in dialogues. In contrast, real‑world actions are also governed by non‑declarative memory, including habitual and procedural types, and need to be inferred from diverse digital traces. To bridge this gap, we introduce Lifebench, which features densely connected, long‑horizon event simulation. It pushes AI agents beyond simple recall, requiring the integration of declarative and non‑declarative memory reasoning across diverse and temporally extended contexts. Building such a benchmark presents two key challenges: ensuring data quality and scalability. We maintain data quality by employing real‑world priors, including anonymized social surveys, map APIs, and holiday‑integrated calendars, thus enforcing fidelity, diversity and behavioral rationality within the dataset. Towards scalability, we draw inspiration from cognitive science and structure events according to their partonomic hierarchy; enabling efficient parallel generation while maintaining global coherence. Performance results show that top‑tier, state‑of‑the‑art memory systems reach just 55.2% accuracy, highlighting the inherent difficulty of long‑horizon retrieval and multi‑source integration within our proposed benchmark. The dataset and data synthesis code are available at https://github.com/1754955896/LifeBench.
Authors:Inho Kong, Sojin Lee, Youngjoon Hong, Hyunwoo J. Kim
Abstract:
Classifier‑Free Guidance (CFG) has established the foundation for guidance mechanisms in diffusion models, showing that well‑designed guidance proxies significantly improve conditional generation and sample quality. Autoguidance (AG) has extended this idea, but it relies on an auxiliary network and leaves solver‑induced errors unaddressed. In stiff regions, the ODE trajectory changes sharply, where local truncation error (LTE) becomes a critical factor that deteriorates sample quality. Our key observation is that these errors align with the dominant eigenvector, motivating us to leverage the solver‑induced error as a guidance signal. We propose Embedded Runge‑Kutta Guidance (ERK‑Guid), which exploits detected stiffness to reduce LTE and stabilize sampling. We theoretically and empirically analyze stiffness and eigenvector estimators with solver errors to motivate the design of ERK‑Guid. Our experiments on both synthetic datasets and the popular benchmark dataset, ImageNet, demonstrate that ERK‑Guid consistently outperforms state‑of‑the‑art methods. Code is available at https://github.com/mlvlab/ERK‑Guid.
Authors:Lu Yang, Zelai Xu, Minyang Xie, Jiaxuan Gao, Zhao Shok, Yu Wang, Yi Wu
Abstract:
Large Language Model (LLM) agents have demonstrated remarkable proficiency in learned tasks, yet they often struggle to adapt to non‑stationary environments with feedback. While In‑Context Learning and external memory offer some flexibility, they fail to internalize the adaptive ability required for long‑term improvement. Meta‑Reinforcement Learning (meta‑RL) provides an alternative by embedding the learning process directly within the model. However, existing meta‑RL approaches for LLMs focus primarily on exploration in single‑agent settings, neglecting the strategic exploitation necessary for multi‑agent environments. We propose MAGE, a meta‑RL framework that empowers LLM agents for strategic exploration and exploitation. MAGE utilizes a multi‑episode training regime where interaction histories and reflections are integrated into the context window. By using the final episode reward as the objective, MAGE incentivizes the agent to refine its strategy based on past experiences. We further combine population‑based training with an agent‑specific advantage normalization technique to enrich agent diversity and ensure stable learning. Experiment results show that MAGE outperforms existing baselines in both exploration and exploitation tasks. Furthermore, MAGE exhibits strong generalization to unseen opponents, suggesting it has internalized the ability for strategic exploration and exploitation. Code is available at https://github.com/Lu‑Yang666/MAGE.
Authors:Zhiqiang Sheng, Xumeng Han, Zhiwei Zhang, Zenghui Xiong, Yifan Ding, Aoxiang Ping, Xiang Li, Tong Guo, Yao Mao
Abstract:
Multimodal generative models have made significant strides in image editing, demonstrating impressive performance on a variety of static tasks. However, their proficiency typically does not extend to complex scenarios requiring dynamic reasoning, leaving them ill‑equipped to model the coherent, intermediate logical pathways that constitute a multi‑step evolution from an initial state to a final one. This capacity is crucial for unlocking a deeper level of procedural and causal understanding in visual manipulation. To systematically measure this critical limitation, we introduce InEdit‑Bench, the first evaluation benchmark dedicated to reasoning over intermediate pathways in image editing. InEdit‑Bench comprises meticulously annotated test cases covering four fundamental task categories: state transition, dynamic process, temporal sequence, and scientific simulation. Additionally, to enable fine‑grained evaluation, we propose a set of assessment criteria to evaluate the logical coherence and visual naturalness of the generated pathways, as well as the model's fidelity to specified path constraints. Our comprehensive evaluation of 14 representative image editing models on InEdit‑Bench reveals significant and widespread shortcomings in this domain. By providing a standardized and challenging benchmark, we aim for InEdit‑Bench to catalyze research and steer development towards more dynamic, reason‑aware, and intelligent multimodal generative models.
Authors:Achleshwar Luthra, Yash Salunkhe, Tomer Galanti
Abstract:
Frozen self‑supervised representations often transfer well with only a few labels across many semantic tasks. We argue that a single geometric quantity, \emphdirectional CDNV (decision‑axis variance), sits at the core of two favorable behaviors: strong few‑shot transfer within a task, and low interference across many tasks. We show that both emerge when variability \emphalong class‑separating directions is small. First, we prove sharp non‑asymptotic multiclass generalization bounds for downstream classification whose leading term is the directional CDNV. The bounds include finite‑shot corrections that cleanly separate intrinsic decision‑axis variability from centroid‑estimation error. Second, we link decision‑axis collapse to multitask geometry: for independent balanced labelings, small directional CDNV across tasks forces the corresponding decision axes to be nearly orthogonal, helping a single representation support many tasks with minimal interference. Empirically, across SSL objectives, directional CDNV collapses during pretraining even when classical CDNV remains large, and our bounds closely track few‑shot error at practical shot sizes. Additionally, on synthetic multitask data, we verify that SSL learns representations whose induced decision axes are nearly orthogonal. The code and project page of the paper are available at [\hrefhttps://dlfundamentals.github.io/directional‑neural‑collapse/project page].
Authors:Jiahao Qin
Abstract:
We introduce mlx‑snn, the first spiking neural network (SNN) library built natively on Apple's MLX framework. As SNN research grows rapidly, all major libraries ‑‑ snnTorch, Norse, SpikingJelly, Lava ‑‑ target PyTorch or custom backends, leaving Apple Silicon users without a native option. mlx‑snn provides six neuron models (LIF, IF, Izhikevich, Adaptive LIF, Synaptic, Alpha), four surrogate gradient functions, four spike encoding methods (including an EEG‑specific encoder), and a complete backpropagation‑through‑time training pipeline. The library leverages MLX's unified memory architecture, lazy evaluation, and composable function transforms (mx.grad, mx.compile) to enable efficient SNN research on Apple Silicon hardware. We validate mlx‑snn on MNIST digit classification across five hyperparameter configurations and three backends, achieving up to 97.28% accuracy with 2.0‑‑2.5 times faster training and 3‑‑10 times lower GPU memory than snnTorch on the same M3 Max hardware. mlx‑snn is open‑source under the MIT license and available on PyPI. https://github.com/D‑ST‑Sword/mlx‑snn
Authors:Samuel Garcin, Thomas Walker, Steven McDonagh, Tim Pearce, Hakan Bilen, Tianyu He, Kaixin Wang, Jiang Bian
Abstract:
Interactive world models continually generate video by responding to a user's actions, enabling open‑ended generation capabilities. However, existing models typically lack a 3D representation of the environment, meaning 3D consistency must be implicitly learned from data, and spatial memory is restricted to limited temporal context windows. This results in an unrealistic user experience and presents significant obstacles to down‑stream tasks such as training agents. To address this, we present PERSIST, a new paradigm of world model which simulates the evolution of a latent 3D scene: environment, camera, and renderer. This allows us to synthesize new frames with persistent spatial memory and consistent geometry. Both quantitative metrics and a qualitative user study show substantial improvements in spatial memory, 3D consistency, and long‑horizon stability over existing methods, enabling coherent, evolving 3D worlds. We further demonstrate novel capabilities, including synthesising diverse 3D environments from a single image, as well as enabling fine‑grained, geometry‑aware control over generated experiences by supporting environment editing and specification directly in 3D space. Project page: https://francelico.github.io/persist.github.io
Authors:Mingyu Jin, Yutong Yin, Jingcheng Niu, Qingcheng Zeng, Wujiang Xu, Mengnan Du, Wei Cheng, Zhaoran Wang, Tianlong Chen, Dimitris N. Metaxas
Abstract:
In this work, we investigate how Large Language Models (LLMs) adapt their internal representations when encountering inputs of increasing difficulty, quantified as the degree of out‑of‑distribution (OOD) shift. We reveal a consistent and quantifiable phenomenon: as task difficulty increases, whether through harder reasoning questions, longer contexts, or adding answer choices, the last hidden states of LLMs become substantially sparser. In short, the farther the shift, the sparser the representations. This sparsity‑‑difficulty relation is observable across diverse models and domains, suggesting that language models respond to unfamiliar or complex inputs by concentrating computation into specialized subspaces in the last hidden state. Through a series of controlled analyses with a learning dynamic explanation, we demonstrate that this sparsity is not incidental but an adaptive mechanism for stabilizing reasoning under OOD. Leveraging this insight, we design Sparsity‑Guided Curriculum In‑Context Learning (SG‑ICL), a strategy that explicitly uses representation sparsity to schedule few‑shot demonstrations, leading to considerable performance enhancements. Our study provides new mechanistic insights into how LLMs internalize OOD challenges. The source code is available at the URL: https://github.com/MingyuJ666/sparsityLLM.
Authors:Dipesh Tamboli, Vineet Punyamoorty, Atharv Pawar, Vaneet Aggarwal
Abstract:
Recent advances in generative image editing have enabled transformative applications, from professional head shot generation to avatar stylization. However, these systems often require uploading high‑fidelity facial images to third‑party models, raising concerns around biometric privacy, data misuse, and user consent. We propose a privacy‑preserving pipeline that supports high‑quality editing while keeping users in control over their biometric data in face‑centric use cases. Our approach separates identity‑sensitive regions from editable image context using on‑device segmentation and masking, enabling secure, user‑controlled editing without modifying third‑party generative models. Unlike traditional cloud‑based tools, PRIVATEEDIT enforces privacy by default: biometric data is never exposed or transmitted. This design requires no access to or retraining of third‑party models, making it compatible with a wide range of commercial APIs. By treating privacy as a core design constraint, our system supports responsible generative AI centered on user autonomy and trust. The pipeline includes a tunable masking mechanism that lets users control how much facial information is concealed, allowing them to balance privacy and output fidelity based on trust level or use case. We demonstrate its applicability in professional and creative workflows and provide a user interface for selective anonymization. By advocating privacy‑by‑design in generative AI, our work offers both technical feasibility and normative guidance for protecting digital identity. The source code is available at https://github.com/Dipeshtamboli/PrivateEdit‑Privacy‑Preserving‑GenAI.
Authors:Jiejun Tan, Zhicheng Dou, Liancheng Zhang, Yuyang Hu, Yiruo Cheng, Ji-Rong Wen
Abstract:
As Large Language Models (LLMs) are increasingly used for long‑duration tasks, maintaining effective long‑term memory has become a critical challenge. Current methods often face a trade‑off between cost and accuracy. Simple storage methods often fail to retrieve relevant information, while complex indexing methods (such as memory graphs) require heavy computation and can cause information loss. Furthermore, relying on the working LLM to process all memories is computationally expensive and slow. To address these limitations, we propose MemSifter, a novel framework that offloads the memory retrieval process to a small‑scale proxy model. Instead of increasing the burden on the primary working LLM, MemSifter uses a smaller model to reason about the task before retrieving the necessary information. This approach requires no heavy computation during the indexing phase and adds minimal overhead during inference. To optimize the proxy model, we introduce a memory‑specific Reinforcement Learning (RL) training paradigm. We design a task‑outcome‑oriented reward based on the working LLM's actual performance in completing the task. The reward measures the actual contribution of retrieved memories by mutiple interactions with the working LLM, and discriminates retrieved rankings by stepped decreasing contributions. Additionally, we employ training techniques such as Curriculum Learning and Model Merging to improve performance. We evaluated MemSifter on eight LLM memory benchmarks, including Deep Research tasks. The results demonstrate that our method meets or exceeds the performance of existing state‑of‑the‑art approaches in both retrieval accuracy and final task completion. MemSifter offers an efficient and scalable solution for long‑term LLM memory. We have open‑sourced the model weights, code, and training data to support further research.
Authors:Dongyi He, Bin Jiang, Kecheng Feng, Luyin Zhang, Ling Liu, Yuxuan Li, Yun Zhao, He Yan
Abstract:
Although obtaining deep brain activity from non‑invasive scalp electroencephalography (sEEG) is crucial for neuroscience and clinical diagnosis, directly generating high‑fidelity intracranial electroencephalography (iEEG) signals remains a largely unexplored field, limiting our understanding of deep brain dynamics. Current research primarily focuses on traditional signal processing or source localization methods, which struggle to capture the complex waveforms and random characteristics of iEEG. To address this critical challenge, this paper introduces NeuroFlowNet, a novel cross‑modal generative framework whose core contribution lies in the first‑ever reconstruction of iEEG signals from the entire deep temporal lobe region using sEEG signals. NeuroFlowNet is built on Conditional Normalizing Flow (CNF), which directly models complex conditional probability distributions through reversible transformations, thereby explicitly capturing the randomness of brain signals and fundamentally avoiding the pattern collapse issues common in existing generative models. Additionally, the model integrates a multi‑scale architecture and self‑attention mechanisms to robustly capture fine‑grained temporal details and long‑range dependencies. Validation results on a publicly available synchronized sEEG‑iEEG dataset demonstrate NeuroFlowNet's effectiveness in terms of temporal waveform fidelity, spectral feature reproduction, and functional connectivity restoration. This study establishes a more reliable and scalable new paradigm for non‑invasive analysis of deep brain dynamics. The code of this study is available in https://github.com/hdy6438/NeuroFlowNet
Authors:Ashwath Vaithinathan Aravindan, Mayank Kejriwal
Abstract:
Chain‑of‑Thought (CoT) prompting has emerged as a foundational technique for eliciting reasoning from Large Language Models (LLMs), yet the robustness of this approach to corruptions in intermediate reasoning steps remains poorly understood. This paper presents a comprehensive empirical evaluation of LLM robustness to a structured taxonomy of 5 CoT perturbation types: MathError, UnitConversion, Sycophancy, SkippedSteps, and ExtraSteps. We evaluate 13 models spanning three orders of magnitude in parameter count (3B to 1.5T\footnoteAssumed parameter count of closed models), testing their ability to complete mathematical reasoning tasks despite perturbations injected at different points in the reasoning chain. Our key findings reveal heterogeneous vulnerability patterns: MathError perturbations produce the most severe degradation in small models (50‑60% accuracy loss) but show strong scaling benefits; UnitConversion remains challenging across all scales (20‑30% loss even for largest models); ExtraSteps incur minimal accuracy degradation (0‑6%) regardless of scale; Sycophancy produces modest effects (7% loss for small models); and SkippedSteps cause intermediate damage (15% loss). Scaling relationships follow power‑law patterns, with model size serving as a protective factor against some perturbations but offering limited defense against dimensional reasoning tasks. These findings have direct implications for deploying LLMs in multi‑stage reasoning pipelines and underscore the necessity of task‑specific robustness assessments and mitigation strategies. The code and results are available https://github.com/Mystic‑Slice/CoTPerturbation.
Authors:Hung Manh Pham, Jinyang Wu, Xiao Ma, Yiming Zhang, Yixin Xu, Aaqib Saeed, Bin Zhu, Zhou Pan, Dong Ma
Abstract:
Photoplethysmography (PPG) is a widely used non‑invasive sensing modality for continuous cardiovascular and physiological monitoring across clinical, laboratory, and wearable settings. While existing PPG datasets support a broad range of downstream tasks, they typically provide supervision in the form of numerical measurements or task‑specific labels, limiting their suitability for language‑based physiological reasoning and multimodal foundation models. In this work, we introduce PulseLM, a large‑scale PPG‑text dataset designed to bridge raw PPG waveforms and natural language through a unified, closed‑ended question answering (QA) formulation. PulseLM aggregates PPG recordings from fifteen publicly available sources and harmonizes heterogeneous annotations into twelve common physiologically QA tasks. The dataset comprises 1.31 million standardized 10‑second PPG segments, associated with 3.15 million question‑answer pairs. We further define reproducible preprocessing, supervision, and evaluation protocols and establish baseline benchmarks using multimodal PPG‑aware large language models. PulseLM provides a standardized foundation for studying multimodal physiological reasoning, cross‑dataset generalization, and scalable benchmarking of PPG‑based language models. The data and code can be found publicly available at: https://github.com/manhph2211/PulseLM.
Authors:Haruki Sakajo, Frederikus Hudi, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe
Abstract:
Language exhibits inherent structures, a property that explains both language acquisition and language change. Given this characteristic, we expect language models to manifest internal structures as well. While interpretability research has investigated the components of language models, existing approaches focus on local inter‑token relationships within layers or modules (e.g., Multi‑Head Attention), leaving global inter‑layer relationships largely overlooked. To address this gap, we introduce StructLens, an analytical framework designed to reveal how internal structures relate holistically through their inter‑token connection within a layer. StructLens constructs maximum spanning trees based on the semantic representations in residual streams, analogous to dependency parsing, and leverages the tree properties to quantify inter‑layer distance (or similarity) from a structural perspective. Our findings demonstrate that StructLens yields an inter‑layer similarity pattern that is distinctively different from conventional cosine similarity. Moreover, this structure‑aware similarity proves to be beneficial for practical tasks, such as layer pruning, highlighting the effectiveness of structural analysis for understanding and optimizing language models. Our code is available at https://github.com/naist‑nlp/structlens.
Authors:Xin Yang, Letian Li, Abudukelimu Wuerkaixi, Xuxin Cheng, Cao Liu, Ke Zeng, Xunliang Cai, Wenyuan Jiang
Abstract:
Large language models (LLMs) have demonstrated remarkable and steadily improving performance across a wide range of tasks. However, LLM performance may be highly sensitive to prompt variations especially in scenarios with limited openness or strict output formatting requirements, indicating insufficient robustness. In real‑world applications, user prompts provided to LLMs often contain imperfections, which may undermine the quality of the model's responses. To address this issue, previous work has primarily focused on preprocessing prompts, employing external tools or even LLMs to refine prompt formulations in advance. However, these approaches overlook the intrinsic robustness of LLMs, and their reliance on external components introduces additional computational overhead and uncertainty. In this work, we propose a Contrastive Learning‑based Inverse Direct Preference Optimization (CoIPO) method that minimizes the discrepancy between the label‑aligned logits produced by the model under a clean prompt and its noisy counterpart, and conduct a detailed analysis using mutual information theory. We augment the FLAN dataset by constructing paired prompts, each consisting of a clean prompt and its corresponding noisy version for training. Additionally, to evaluate the effectiveness, we develop NoisyPromptBench, a benchmark enhanced and derived from the existing PromptBench. Experimental results conducted on NoisyPromptBench demonstrate that our proposed method achieves a significant improvement in average accuracy over the current state‑of‑the‑art approaches. The source code of CoIPO, pair‑wise FLAN datasets, and NoisyPromptBench have already been released on https://github.com/vegetable‑yx/CoIPO.
Authors:Yuchen Wang, Haonan Wang, Yu Guo, Honglong Yang, Xiaomeng Li
Abstract:
Decoding natural language from non‑invasive EEG signals is a promising yet challenging task. However, current state‑of‑the‑art models remain constrained by three fundamental limitations: Semantic Bias (mode collapse into generic templates), Signal Neglect (hallucination based on linguistic priors rather than neural inputs), and the BLEU Trap, where evaluation metrics are artificially inflated by high‑frequency stopwords, masking a lack of true semantic fidelity. To address these challenges, we propose SemKey, a novel multi‑stage framework that enforces signal‑grounded generation through four decoupled semantic objectives: sentiment, topic, length, and surprisal. We redesign the interaction between the neural encoder and the Large Language Model (LLM) by injecting semantic prompts as Queries and EEG embeddings as Key‑Value pairs, strictly forcing the model to attend to neural inputs. Furthermore, we move beyond standard translation metrics by adopting N‑way Retrieval Accuracy and Fréchet Distance to rigorously assess diversity and alignment. Extensive experiments demonstrate that our approach effectively eliminates hallucinations on noise inputs and achieves SOTA performance on these robust protocols. Code will be released upon acceptance at https://github.com/xmed‑lab/SemKey.
Authors:Adi Simhi, Fazl Barez, Martin Tutek, Yonatan Belinkov, Shay B. Cohen
Abstract:
How does the conversational past of large language models (LLMs) influence their future performance? Recent work suggests that LLMs are affected by their conversational history in unexpected ways. For instance, hallucinations in prior interactions may influence subsequent model responses. In this work, we introduce History‑Echoes, a framework that investigates how conversational history biases subsequent generations. The framework explores this bias from two perspectives: probabilistically, we model conversations as Markov chains to quantify state consistency; geometrically, we measure the consistency of consecutive hidden representations. Across three model families and six datasets spanning diverse phenomena, our analysis reveals a strong correlation between the two perspectives. By bridging these perspectives, we demonstrate that behavioral persistence manifests as a geometric trap, where gaps in the latent space confine the model's trajectory. Code available at https://github.com/technion‑cs‑nlp/OldHabitsDieHard.
Authors:Ivan Matveev
Abstract:
Recently presented Token‑Oriented Object Notation (TOON) aims to replace JSON as a serialization format for passing structured data to LLMs with significantly reduced token usage. While showing solid accuracy in LLM comprehension, there is a lack of tests against JSON generation. Though never present in training data, TOON syntax is simple enough to suggest one‑shot in‑context learning could support accurate generation. The inevitable prompt overhead can be an acceptable trade‑off for shorter completions. To test this, we conducted a benchmark creating several test cases with regard to structural complexity, a validation pipeline, and comparing plain JSON generation vs structured output (via constrained decoding) JSON generation vs TOON one‑shot in‑context learning generation. JSON structured output was included to establish a minimum token budget baseline and to set a starting point for future experiments testing TOON constrained decoding inference enforcement. Key findings: TOON shows promising accuracy/token consumption ratio for in‑domain generation tasks, though this advantage is often reduced by the "prompt tax" of instructional overhead in shorter contexts. Plain JSON generation shows the best one‑shot and final accuracy, even compared with constrained decoding structured output, where the only significant advantage is the lowest token usage as a trade‑off for slightly decreased accuracy overall and significant degradation for some models. Notably, for simple structures, this "lowest token usage" of constrained decoding outperformed even TOON, hinting that TOON enforcing via frameworks such as xgrammar may not yield the desired results. Furthermore, the results suggest a scaling hypothesis: TOON's true efficiency potential likely follows a non‑linear curve, shining only beyond a specific point where cumulative syntax savings amortize the initial prompt overhead.
Authors:Bartosz Dziuba, Kacper Kuchta, Paweł Batorski, Przemysław Spurek, Paul Swoboda
Abstract:
Large Language Models (LLMs) have improved substantially alignment, yet their behavior remains highly sensitive to prompt phrasing. This brittleness has motivated automated prompt engineering, but most existing methods (i) require a task‑specific training set, (ii) rely on expensive iterative optimization to produce a single dataset‑level prompt, and (iii) must be rerun from scratch for each new task. We introduce TATRA, a dataset‑free prompting method that constructs instance‑specific few‑shot prompts by synthesizing on‑the‑fly examples to accompany a user‑provided instruction. TATRA requires no labeled training data and avoids task‑specific optimization loops, while retaining the benefits of demonstration‑based prompting. Across standard text classification benchmarks, TATRA matches or improves over strong prompt‑optimization baselines that depend on training data and extensive search. On mathematical reasoning benchmarks, TATRA achieves state‑of‑the‑art performance on GSM8K and DeepMath, outperforming methods that explicitly optimize prompts on those tasks. Our results suggest that per‑instance construction of effective in‑context examples is more important than running long, expensive optimization loops to produce a single prompt per task. We will make all code publicly available upon acceptance of the paper. Code is available at https://github.com/BMD223/TATRA
Authors:Ke Yang, Zixi Chen, Xuan He, Jize Jiang, Michel Galley, Chenglong Wang, Jianfeng Gao, Jiawei Han, ChengXiang Zhai
Abstract:
Long‑term memory is essential for large language model (LLM) agents operating in complex environments, yet existing memory designs are either task‑specific and non‑transferable, or task‑agnostic but less effective due to low task‑relevance and context explosion from raw memory retrieval. We propose PlugMem, a task‑agnostic plugin memory module that can be attached to arbitrary LLM agents without task‑specific redesign. Motivated by the fact that decision‑relevant information is concentrated as abstract knowledge rather than raw experience, we draw on cognitive science to structure episodic memories into a compact, extensible knowledge‑centric memory graph that explicitly represents propositional and prescriptive knowledge. This representation enables efficient memory retrieval and reasoning over task‑relevant knowledge, rather than verbose raw trajectories, and departs from other graph‑based methods like GraphRAG by treating knowledge as the unit of memory access and organization instead of entities or text chunks. We evaluate PlugMem unchanged across three heterogeneous benchmarks (long‑horizon conversational question answering, multi‑hop knowledge retrieval, and web agent tasks). The results show that PlugMem consistently outperforms task‑agnostic baselines and exceeds task‑specific memory designs, while also achieving the highest information density under a unified information‑theoretic analysis. Code and data are available at https://github.com/TIMAN‑group/PlugMem.
Authors:Wenhao Wu, Zhentao Tang, Yafu Li, Shixiong Kai, Mingxuan Yuan, Zhenhong Sun, Chunlin Chen, Zhi Wang
Abstract:
Large Language Models (LLMs) exhibit high reasoning capacity in medical question‑answering, but their tendency to produce hallucinations and outdated knowledge poses critical risks in healthcare fields. While Retrieval‑Augmented Generation (RAG) mitigates these issues, existing methods rely on noisy token‑level signals and lack the multi‑round refinement required for complex reasoning. In the paper, we propose MA‑RAG (Multi‑Round Agentic RAG), a framework that facilitates test‑time scaling for complex medical reasoning by iteratively evolving both external evidence and internal reasoning history within an agentic refinement loop. At each round, the agent transforms semantic conflict among candidate responses into actionable queries to retrieve external evidence, while optimizing history reasoning traces to mitigate long‑context degradation. MA‑RAG extends the self‑consistency principle by leveraging the lack of consistency as a proactive signal for multi‑round agentic reasoning and retrieval, and mirrors a boosting mechanism that iteratively minimizes the residual error toward a stable, high‑fidelity medical consensus. Extensive evaluations across 7 medical Q&A benchmarks show that MA‑RAG consistently surpasses competitive inference‑time scaling and RAG baselines, delivering substantial +6.8 points on average accuracy over the backbone model. Our code is available at [this url](https://github.com/NJU‑RL/MA‑RAG).
Authors:Wenhui Zhu, Xiwen Chen, Zhipeng Wang, Jingjing Wang, Xuanzhao Dong, Minzhou Huang, Rui Cai, Hejian Sang, Hao Wang, Peijie Qiu, Yueyue Deng, Prayag Tiwari, Brendan Hogan Rappazzo, Yalin Wang
Abstract:
Long‑horizon LLM agents require memory systems that remain accurate under fixed context budgets. However, existing systems struggle with two persistent challenges in long‑term dialogue: (i) disconnected evidence, where multi‑hop answers require linking facts distributed across time, and (ii) state updates, where evolving information (e.g., schedule changes) creates conflicts with older static logs. We propose AriadneMem, a structured memory system that addresses these failure modes via a decoupled two‑phase pipeline. In the offline construction phase, AriadneMem employs \emphentropy‑aware gating to filter noise and low‑information message before LLM extraction and applies \emphconflict‑aware coarsening to merge static duplicates while preserving state transitions as temporal edges. In the online reasoning phase, rather than relying on expensive iterative planning, AriadneMem executes \emphalgorithmic bridge discovery to reconstruct missing logical paths between retrieved facts, followed by \emphsingle‑call topology‑aware synthesis. On LoCoMo experiments with GPT‑4o, AriadneMem improves Multi‑Hop F1 by 15.2% and Average F1 by 9.0% over strong baselines. Crucially, by offloading reasoning to the graph layer, AriadneMem reduces total runtime by 77.8% using only 497 context tokens. The code is available at https://github.com/LLM‑VLM‑GSL/AriadneMem.
Authors:Toru Lin, Shuying Deng, Zhao-Heng Yin, Pieter Abbeel, Jitendra Malik
Abstract:
Many essential manipulation tasks ‑ such as food preparation, surgery, and craftsmanship ‑ remain intractable for autonomous robots. These tasks are characterized not only by contact‑rich, force‑sensitive dynamics, but also by their "implicit" success criteria: unlike pick‑and‑place, task quality in these domains is continuous and subjective (e.g. how well a potato is peeled), making quantitative evaluation and reward engineering difficult. We present a learning framework for such tasks, using peeling with a knife as a representative example. Our approach follows a two‑stage pipeline: first, we learn a robust initial policy via force‑aware data collection and imitation learning, enabling generalization across object variations; second, we refine the policy through preference‑based finetuning using a learned reward model that combines quantitative task metrics with qualitative human feedback, aligning policy behavior with human notions of task quality. Using only 50‑200 peeling trajectories, our system achieves over 90% average success rates on challenging produce including cucumbers, apples, and potatoes, with performance improving by up to 40% through preference‑based finetuning. Remarkably, policies trained on a single produce category exhibit strong zero‑shot generalization to unseen in‑category instances and to out‑of‑distribution produce from different categories while maintaining over 90% success rates.
Authors:Omer Sela
Abstract:
CDD, or Contamination Detection via output Distribution, identifies data contamination by measuring the peakedness of a model's sampled outputs. We study the conditions under which this approach succeeds and fails on small language models ranging from 70M to 410M parameters. Using controlled contamination experiments on GSM8K, HumanEval, and MATH, we find that CDD's effectiveness depends critically on whether fine‑tuning produces verbatim memorization. With low‑rank adaptation, models can learn from contaminated data without memorizing it, and CDD performs at chance level even when the data is verifiably contaminated. Only when fine‑tuning capacity is sufficient to induce memorization does CDD recover strong detection accuracy. Our results characterize a memorization threshold that governs detectability and highlight a practical consideration: parameter‑efficient fine‑tuning can produce contamination that output‑distribution methods do not detect. Our code is available at https://github.com/Sela‑Omer/Contamination‑Detection‑Small‑LM
Authors:Fuxiang Yang, Donglin Di, Lulu Tang, Xuancheng Zhang, Lei Fan, Hao Li, Chen Wei, Tonghua Su, Baorui Ma
Abstract:
Vision‑Language‑Action (VLA) models are a promising path toward embodied intelligence, yet they often overlook the predictive and temporal‑causal structure underlying visual dynamics. World‑model VLAs address this by predicting future frames, but waste capacity reconstructing redundant backgrounds. Latent‑action VLAs encode frame‑to‑frame transitions compactly, but lack temporally continuous dynamic modeling and world knowledge. To overcome these limitations, we introduce CoWVLA (Chain‑of‑World VLA), a new "Chain of World" paradigm that unifies world‑model temporal reasoning with a disentangled latent motion representation. First, a pretrained video VAE serves as a latent motion extractor, explicitly factorizing video segments into structure and motion latents. Then, during pre‑training, the VLA learns from an instruction and an initial frame to infer a continuous latent motion chain and predict the segment's terminal frame. Finally, during co‑fine‑tuning, this latent dynamic is aligned with discrete action prediction by jointly modeling sparse keyframes and action sequences in a unified autoregressive decoder. This design preserves the world‑model benefits of temporal reasoning and world knowledge while retaining the compactness and interpretability of latent actions, enabling efficient visuomotor learning. Extensive experiments on robotic simulation benchmarks show that CoWVLA outperforms existing world‑model and latent‑action approaches and achieves moderate computational efficiency, highlighting its potential as a more effective VLA pretraining paradigm. The project website can be found at https://fx‑hit.github.io/cowvla‑io.
Authors:Giovanni Pio Delvecchio, Lorenzo Molfetta, Gianluca Moro
Abstract:
The integration of symbolic computing with neural networks has intrigued researchers since the first theorizations of Artificial intelligence (AI). The ability of Neuro‑Symbolic (NeSy) methods to infer or exploit behavioral schema has been widely considered as one of the possible proxies for human‑level intelligence. However, the limited semantic generalizability and the challenges in declining complex domains with pre‑defined patterns and rules hinder their practical implementation in real‑world scenarios. The unprecedented results achieved by connectionist systems since the last AI breakthrough in 2017 have raised questions about the competitiveness of NeSy solutions, with particular emphasis on the Natural Language Processing and Computer Vision fields. This survey examines task‑specific advancements in the NeSy domain to explore how incorporating symbolic systems can enhance explainability and reasoning capabilities. Our findings are meant to serve as a resource for researchers exploring explainable NeSy methodologies for real‑life tasks and applications. Reproducibility details and in‑depth comments on each surveyed research work are made available at https://github.com/disi‑unibo‑nlp/task‑oriented‑neuro‑symbolic.git.
Authors:Epshita Jahan, Khandoker Md Tanjinul Islam, Pritom Biswas, Tafsir Al Nafin
Abstract:
Bengali remains a low‑resource language in speech technology, especially for complex tasks like long‑form transcription and speaker diarization. This paper presents a multistage approach developed for the "DL Sprint 4.0 ‑ Bengali Long‑Form Speech Recognition" and "DL Sprint 4.0 ‑ Bengali Speaker Diarization" competitions on Kaggle, addressing the challenge of "who spoke when/what" in hour‑long recordings. We implemented Whisper Medium fine‑tuned on Bengali data (bengaliAI/tugstugi bengaliai‑asr whisper‑medium) for transcription and integrated pyannote/speaker‑diarization‑community‑1 with our custom‑trained segmentation model to handle diverse and noisy acoustic environments. Using a two‑pass method with hyperparameter tuning, we achieved a DER of 0.27 on the private leaderboard and 0.19 on the public leaderboard. For transcription, chunking, background noise cleaning, and algorithmic post‑processing yielded a WER of 0.38 on the private leaderboard. These results show that targeted tuning and strategic data utilization can significantly improve AI inclusivity for South Asian languages. All relevant code is available at: https://github.com/Short‑Potatoes/Bengali‑long‑form‑transcription‑and‑diarization.git Index Terms: Bengali speech recognition, speaker diarization, Whisper, ASR, low‑resource languages, pyannote, voice activity detection
Authors:Jun Yeong Park, JunYoung Seo, Minji Kang, Yu Rang Park
Abstract:
The CLIP model's outstanding generalization has driven recent success in Zero‑Shot Anomaly Detection (ZSAD) for detecting anomalies in unseen categories. The core challenge in ZSAD is to specialize the model for anomaly detection tasks while preserving CLIP's powerful generalization capability. Existing approaches attempting to solve this challenge share the fundamental limitation of a patch‑agnostic design that processes all patches monolithically without regard for their unique characteristics. To address this limitation, we propose MoECLIP, a Mixture‑of‑Experts (MoE) architecture for the ZSAD task, which achieves patch‑level adaptation by dynamically routing each image patch to a specialized Low‑Rank Adaptation (LoRA) expert based on its unique characteristics. Furthermore, to prevent functional redundancy among the LoRA experts, we introduce (1) Frozen Orthogonal Feature Separation (FOFS), which orthogonally separates the input feature space to force experts to focus on distinct information, and (2) a simplex equiangular tight frame (ETF) loss to regulate the expert outputs to form maximally equiangular representations. Comprehensive experimental results across 14 benchmark datasets spanning industrial and medical domains demonstrate that MoECLIP outperforms existing state‑of‑the‑art methods. The code is available at https://github.com/CoCoRessa/MoECLIP.
Authors:Qi Zhang, Yifei Wang, Xiaohan Wang, Jiajun Chai, Guojun Yin, Wei Lin, Yisen Wang
Abstract:
In recent years, pre‑trained large language models have achieved remarkable success across diverse tasks. Besides the pivotal role of self‑supervised pre‑training, their effectiveness in downstream applications also depends critically on the post‑training process, which adapts models to task‑specific data and objectives. However, this process inevitably introduces model shifts that can influence performance in different domains, and how such shifts transfer remains poorly understood. To open up the black box, we propose the SAE‑based Transferability Score (STS), a new metric that leverages sparse autoencoders (SAEs) to forecast post‑training transferability. Taking supervised fine‑tuning as an example, STS identifies shifted dimensions in SAE representations and calculates their correlations with downstream domains, enabling reliable estimation of transferability before fine‑tuning. Extensive experiments across multiple models and domains show that STS accurately predicts the transferability of supervised fine‑tuning, achieving Pearson correlation coefficients above 0.7 with actual performance changes. Beyond this, we take an initial step toward extending STS to reinforcement learning. We believe that STS can serve as an \colorblack interpretable tool for guiding post‑training strategies in LLMs. Code is available at https://github.com/PKU‑ML/STS.
Authors:Rui Zhang, Zhichao Lu
Abstract:
The rise of Large Language Model‑based Automated Algorithm Design (LLM‑AAD) has transformed algorithm development by autonomously generating code implementations of expert‑level algorithms. Unlike traditional expert‑driven algorithm development, in the LLM‑AAD paradigm, the main design principle behind an algorithm is often implicitly embedded in the generated code. Therefore, assessing algorithmic similarity directly from code, distinguishing genuine algorithmic innovation from mere syntactic variation, becomes essential. While various code similarity metrics exist, they fail to capture algorithmic similarity, as they focus on surface‑level syntax or output equivalence rather than the underlying algorithmic logic. We propose BehaveSim, a novel method to measure algorithmic similarity through the lens of problem‑solving behavior as a sequence of intermediate solutions produced during execution, dubbed as problem‑solving trajectories (PSTrajs). By quantifying the alignment between PSTrajs using dynamic time warping (DTW), BehaveSim distinguishes algorithms with divergent logic despite syntactic or output‑level similarities. We demonstrate its utility in two key applications: (i) Enhancing LLM‑AAD: Integrating BehaveSim into existing LLM‑AAD frameworks (e.g., FunSearch, EoH) promotes behavioral diversity, significantly improving performance on three AAD tasks. (ii) Algorithm analysis: BehaveSim clusters generated algorithms by behavior, enabling systematic analysis of problem‑solving strategies‑‑a crucial tool for the growing ecosystem of AI‑generated algorithms. Data and code of this work are open‑sourced at https://github.com/RayZhhh/behavesim.
Authors:Wanying He, Yanxi Lin, Ziheng Zhou, Xue Feng, Min Peng, Qianqian Xie, Zilong Zheng, Yipeng Kang
Abstract:
Online platforms increasingly rely on opinion aggregation to allocate real‑world attention and resources, yet common signals such as engagement votes or capital‑weighted commitments are easy to amplify and often track visibility rather than reliability. This makes collective judgments brittle under weak truth signals, noisy or delayed feedback, early popularity surges, and strategic manipulation. We propose Credibility Governance (CG), a mechanism that reallocates influence by learning which agents and viewpoints consistently track evolving public evidence. CG maintains dynamic credibility scores for both agents and opinions, updates opinion influence via credibility‑weighted endorsements, and updates agent credibility based on the long‑run performance of the opinions they support, rewarding early and persistent alignment with emerging evidence while filtering short‑lived noise. We evaluate CG in POLIS, a socio‑physical simulation environment that models coupled belief dynamics and downstream feedback under uncertainty. Across settings with initial majority misalignment, observation noise and contamination, and misinformation shocks, CG outperforms vote‑based, stake‑weighted, and no‑governance baselines, yielding faster recovery to the true state, reduced lock‑in and path dependence, and improved robustness under adversarial pressure. Our implementation and experimental scripts are publicly available at https://github.com/Wanying‑He/Credibility_Governance.
Authors:Xinjun Wang, Shengyao Wang, Aimin Zhou, Hao Hao
Abstract:
Autonomous web navigation requires agents to perceive complex visual environments and maintain long‑term context, yet current Large Language Model (LLM) based agents often struggle with spatial disorientation and navigation loops. In this paper, we propose generally applicable V‑GEMS(Visual Grounding and Explicit Memory System), a robust multimodal agent architecture designed for precise and resilient web traversal. Our agent integrates visual grounding to resolve ambiguous interactive elements and introduces an explicit memory stack with state tracking. This dual mechanism allows the agent to maintain a structured map of its traversal path, enabling valid backtracking and preventing cyclical failures in deep navigation tasks. We also introduce an updatable dynamic benchmark to rigorously evaluate adaptability. Experiments show V‑GEMS significantly dominates the WebWalker baseline, achieving a substantial 28.7% performance gain. Code is available at https://github.com/Vaultttttttttttt/V‑GEMS.
Authors:Maoyuan Shao, Yutong Gao, Xinyang Huang, Chuang Zhu, Lijuan Sun, Guoshun Nan
Abstract:
Vision‑language models like CLIP have achieved remarkable progress in cross‑modal representation learning, yet suffer from systematic misclassifications among visually and semantically similar categories. We observe that such confusion patterns are not random but persistently occur between specific category pairs, revealing the model's intrinsic bias and limited fine‑grained discriminative ability. To address this, we propose CAPT, a Confusion‑Aware Prompt Tuning framework that enables models to learn from their own misalignment. Specifically, we construct a Confusion Bank to explicitly model stable confusion relationships across categories and misclassified samples. On this basis, we introduce a Semantic Confusion Miner (SEM) to capture global inter‑class confusion through semantic difference and commonality prompts, and a Sample Confusion Miner (SAM) to retrieve representative misclassified instances from the bank and capture sample‑level cues through a Diff‑Manner Adapter that integrates global and local contexts. To further unify confusion information across different granularities, a Multi‑Granularity Difference Expert (MGDE) module is designed to jointly leverage semantic‑ and sample‑level experts for more robust confusion‑aware reasoning. Extensive experiments on 11 benchmark datasets demonstrate that our method significantly reduces confusion‑induced errors while enhancing the discriminability and generalization of both base and novel classes, successfully resolving 50.72 percent of confusable sample pairs. Code will be released at https://github.com/greatest‑gourmet/CAPT.
Authors:Zhiyu Pan, Yizheng Wu, Jiashen Hua, Junyi Feng, Shaotian Yan, Bing Deng, Zhiguo Cao, Jieping Ye
Abstract:
Reasoning has emerged as a key capability of large language models. In linguistic tasks, this capability can be enhanced by self‑improving techniques that refine reasoning paths for subsequent finetuning. However, extending these language‑based self‑improving approaches to vision language models (VLMs) presents a unique challenge:~visual hallucinations in reasoning paths cannot be effectively verified or rectified. Our solution starts with a key observation about visual contrast: when presented with a contrastive VQA pair, i.e., two visually similar images with synonymous questions, VLMs identify relevant visual cues more precisely. Motivated by this observation, we propose Visual Contrastive Self‑Taught Reasoner (VC‑STaR), a novel self‑improving framework that leverages visual contrast to mitigate hallucinations in model‑generated rationales. We collect a diverse suite of VQA datasets, curate contrastive pairs according to multi‑modal similarity, and generate rationales using VC‑STaR. Consequently, we obtain a new visual reasoning dataset, VisCoR‑55K, which is then used to boost the reasoning capability of various VLMs through supervised finetuning. Extensive experiments show that VC‑STaR not only outperforms existing self‑improving approaches but also surpasses models finetuned on the SoTA visual reasoning datasets, demonstrating that the inherent contrastive ability of VLMs can bootstrap their own visual reasoning. Project at: https://github.com/zhiyupan42/VC‑STaR.
Authors:Boqin Yuan, Yue Su, Kun Yao
Abstract:
Memory‑augmented LLM agents store and retrieve information from prior interactions, yet the relative importance of how memories are written versus how they are retrieved remains unclear. We introduce a diagnostic framework that analyzes how performance differences manifest across write strategies, retrieval methods, and memory utilization behavior, and apply it to a 3x3 study crossing three write strategies (raw chunks, Mem0‑style fact extraction, MemGPT‑style summarization) with three retrieval methods (cosine, BM25, hybrid reranking). On LoCoMo, retrieval method is the dominant factor: average accuracy spans 20 points across retrieval methods (57.1% to 77.2%) but only 3‑8 points across write strategies. Raw chunked storage, which requires zero LLM calls, matches or outperforms expensive lossy alternatives, suggesting that current memory pipelines may discard useful context that downstream retrieval mechanisms fail to compensate for. Failure analysis shows that performance breakdowns most often manifest at the retrieval stage rather than at utilization. We argue that, under current retrieval practices, improving retrieval quality yields larger gains than increasing write‑time sophistication. Code is publicly available at https://github.com/boqiny/memory‑probe.
Authors:Semih Cantürk, Thomas Sabourin, Frederik Wenkel, Michael Perlmutter, Guy Wolf
Abstract:
A key challenge in deriving unified neural solvers for combinatorial optimization (CO) is efficient generalization of models between a given set of tasks to new tasks not used during the initial training process. To address it, we first establish a new model, which uses a GCON module as a form of expressive message passing together with energy‑based unsupervised loss functions. This model achieves high performance (often comparable with state‑of‑the‑art results) across multiple CO tasks when trained individually on each task. We then leverage knowledge from the computational reducibility literature to propose pretraining and fine‑tuning strategies that transfer effectively (a) between MVC, MIS and MaxClique, and (b) in a multi‑task learning setting that additionally incorporates MaxCut, MDS and graph coloring. Additionally, in a leave‑one‑out, multi‑task learning setting, we observe that pretraining on all but one task almost always leads to faster convergence on the remaining task when fine‑tuning while avoiding negative transfer. Our findings indicate that learning common representations across multiple graph CO problems is viable through the use of expressive message passing coupled with pretraining strategies that are informed by the polynomial reduction literature, thereby taking an important step towards enabling the development of foundational models for neural CO. We provide an open‑source implementation of our work at https://github.com/semihcanturk/COPT‑MT .
Authors:Zhanghan Ni, Yanjing Li, Zeju Qiu, Bernhard Schölkopf, Hongyu Guo, Weiyang Liu, Shengchao Liu
Abstract:
Generative models have recently advanced de novo protein design by learning the statistical regularities of natural structures. However, current approaches face three key limitations: (1) Existing methods cannot jointly learn protein geometry and design tasks, where pretraining can be a solution; (2) Current pretraining methods mostly rely on local, non‑rigid atomic representations for property prediction downstream tasks, limiting global geometric understanding for protein generation tasks; and (3) Existing approaches have yet to effectively model the rich dynamic and conformational information of protein structures. To overcome these issues, we introduce RigidSSL (Rigidity‑Aware Self‑Supervised Learning), a geometric pretraining framework that front‑loads geometry learning prior to generative finetuning. Phase I (RigidSSL‑Perturb) learns geometric priors from 432K structures from the AlphaFold Protein Structure Database with simulated perturbations. Phase II (RigidSSL‑MD) refines these representations on 1.3K molecular dynamics trajectories to capture physically realistic transitions. Underpinning both phases is a bi‑directional, rigidity‑aware flow matching objective that jointly optimizes translational and rotational dynamics to maximize mutual information between conformations. Empirically, RigidSSL variants improve designability by up to 43% while enhancing novelty and diversity in unconditional generation. Furthermore, RigidSSL‑Perturb improves the success rate by 5.8% in zero‑shot motif scaffolding and RigidSSL‑MD captures more biophysically realistic conformational ensembles in G protein‑coupled receptor modeling.
Authors:Yaoteng Zhang, Zhou Qing, Junyu Gao, Qi Wang
Abstract:
Incremental Object Detection (IOD) aims to continuously learn new object categories without forgetting previously learned ones. Recently, prompt‑based methods have gained popularity for their replay‑free design and parameter efficiency. However, due to prompt coupling and prompt drift, these methods often suffer from prompt degradation during continual adaptation. To address these issues, we propose a novel prompt‑decoupled framework called PDP. PDP innovatively designs a dual‑pool prompt decoupling paradigm, which consists of a shared pool used to capture task‑general knowledge for forward transfer, and a private pool used to learn task‑specific discriminative features. This paradigm explicitly separates task‑general and task‑specific prompts, preventing interference between prompts and mitigating prompt coupling. In addition, to counteract prompt drift resulting from inconsistent supervision where old foreground objects are treated as background in subsequent tasks, PDP introduces a Prototypical Pseudo‑Label Generation (PPG) module. PPG can dynamically update the class prototype space during training and use the class prototypes to further filter valuable pseudo‑labels, maintaining supervisory signal consistency throughout the incremental process. PDP achieves state‑of‑the‑art performance on MS‑COCO (with a 9.2% AP improvement) and PASCAL VOC (with a 3.3% AP improvement) benchmarks, highlighting its potential in balancing stability and plasticity. The code and dataset are released at: https://github.com/zyt95579/PDP\_IOD/tree/main
Authors:Kyle Elliott Mathewson
Abstract:
Do neural machine translation models learn language‑universal conceptual representations, or do they merely cluster languages by surface similarity? We investigate this question by probing the representation geometry of Meta's NLLB‑200, a 200‑language encoder‑decoder Transformer, through six experiments that bridge NLP interpretability with cognitive science theories of multilingual lexical organization. Using the Swadesh core vocabulary list embedded across 135 languages, we find that the model's embedding distances significantly correlate with phylogenetic distances from the Automated Similarity Judgment Program (ρ= 0.13, p = 0.020), demonstrating that NLLB‑200 has implicitly learned the genealogical structure of human languages. We show that frequently colexified concept pairs from the CLICS database exhibit significantly higher embedding similarity than non‑colexified pairs (U = 42656, p = 1.33 × 10^‑11, d = 0.96), indicating that the model has internalized universal conceptual associations. Per‑language mean‑centering of embeddings improves the between‑concept to within‑concept distance ratio by a factor of 1.19, providing geometric evidence for a language‑neutral conceptual store analogous to the anterior temporal lobe hub identified in bilingual neuroimaging. Semantic offset vectors between fundamental concept pairs (e.g., man to woman, big to small) show high cross‑lingual consistency (mean cosine = 0.84), suggesting that second‑order relational structure is preserved across typologically diverse languages. We release InterpretCognates, an open‑source interactive toolkit for exploring these phenomena, alongside a fully reproducible analysis pipeline.
Authors:Jingqi Lu, Keqi Han, Yun Wang, Lu Mi, Carl Yang
Abstract:
This study establishes a benchmark for Caenorhabditis elegans neuron classification, comparing four graph methods (GCN, GraphSAGE, GAT, GraphTransformer) against four non‑graph methods (Logistic Regression, MLP, LOLCAT, NeuPRINT). Using the functional connectome, we classified Sensory, Interneuron, and Motor neurons based on Spatial, Connection, and Neuronal Activity features. Results show that attention‑based GNNs significantly outperform baselines on the Spatial and Connection features. The Neuronal Activity features yielded poor performance, likely due to the low temporal resolution of the underlying neuronal activity data. Our benchmark validates the use of GNNs and highlights that Spatial and Connection features are key predictors for Caenorhabditis elegans neuron classes. Code is available at: https://github.com/JingqiLuu/neuronclf‑gnn‑benchmark.
Authors:Varun Pratap Bhardwaj
Abstract:
We present SuperLocalMemory, a local‑first memory system for multi‑agent AI that defends against OWASP ASI06 memory poisoning through architectural isolation and Bayesian trust scoring, while personalizing retrieval through adaptive learning‑to‑rank ‑‑ all without cloud dependencies or LLM inference calls. As AI agents increasingly rely on persistent memory, cloud‑based memory systems create centralized attack surfaces where poisoned memories propagate across sessions and users ‑‑ a threat demonstrated in documented attacks against production systems. Our architecture combines SQLite‑backed storage with FTS5 full‑text search, Leiden‑based knowledge graph clustering, an event‑driven coordination layer with per‑agent provenance, and an adaptive re‑ranking framework that learns user preferences through three‑layer behavioral analysis (cross‑project technology preferences, project context detection, and workflow pattern mining). Evaluation across seven benchmark dimensions demonstrates 10.6ms median search latency, zero concurrency errors under 10 simultaneous agents, trust separation (gap =0.90) with 72% trust degradation for sleeper attacks, and 104% improvement in NDCG@5 when adaptive re‑ranking is enabled. Behavioral data is isolated in a separate database with GDPR Article 17 erasure support. SuperLocalMemory is open‑source (MIT) and integrates with 17+ development tools via Model Context Protocol.
Authors:Jiace Zhu, Wentao Chen, Qi Fan, Zhixing Ren, Junying Wu, Xing Zhe Chai, Chotiwit Rungrueangwutthinon, Yehan Ma, An Zou
Abstract:
Recent studies have demonstrated the potential of Large Language Models (LLMs) in generating GPU Kernels. Current benchmarks focus on the translation of high‑level languages into CUDA, overlooking the more general and challenging task of text‑to‑CUDA generation. Furthermore, given the hardware‑specific and performance‑critical features of GPU programming, accurately assessing the performance of LLM‑generated GPU programs is nontrivial. In this work, we introduce CUDABench, a comprehensive benchmark designed to evaluate the text‑to‑CUDA capabilities of LLMs. First, we construct CUDABench‑Set, which covers Breadth‑Depth‑Difficulty evaluation space in diverse application domains, including artificial intelligence, scientific computing, and data analytics, etc. Furthermore, we propose CUDABench‑Score and Generative Verification Pipeline that assess (1) compilation correctness, (2) functional consistency through execution‑based verification, and (3) a novel roofline‑based metric, Performance‑Score. Benchmarking state‑of‑the‑art LLMs reveals insightful findings and challenges of text‑to‑CUDA, such as a notable mismatch between high compilation success rates and low functional correctness, a lack of domain‑specific algorithmic knowledge, and suboptimal utilization of GPU hardware resources. Our benchmark is available at https://github.com/CUDA‑Bench/CUDABench.
Authors:Ran Li, Shimin Di, Haowei LI, Luanshi Bu, Jiachuan Wang, Wangze Ni, Lei Chen
Abstract:
Chemical reaction prediction is pivotal for accelerating drug discovery and synthesis planning. Despite advances in data‑driven models, current approaches are hindered by an overemphasis on parameter and dataset scaling. Some methods coupled with evaluation techniques that bypass fundamental challenges in reaction representation and fail to capture deep chemical intuition like reaction common sense and topological atom mapping logic. We argue that the core challenge lies in instilling these knowledge into the models. To this end, we propose a unified framework that prioritizes chemical understanding over scale through three key innovations: (1) a Latent Chemical Consistency objective that models reactions as movements on a continuous chemical manifold, ensuring reversible and physically plausible transformations; (2) a Hierarchical Cognitive Curriculum that trains the model through progressive stages, from syntax mastery to semantic reasoning, building robust chemical intuition; (3) Atom‑Map Permutation Invariance (AMPI), which force the model to learn invariant relational topology and balance multi‑task learning. (4)and structured plan‑based reasoning to improve the performance of the LLMs. Our compact 0.5B‑parameter model, RxnNano significantly outperforms fine‑tuned LLMs ten times larger (>7B) and all the domain baselines, achieving a 23.5% Top‑1 accuracy improvement on rigorous benchmarks without test‑time augmentation. https://github.com/rlisml/RxnNano.
Authors:Yiqi Lin, Guoqiang Liang, Ziyun Zeng, Zechen Bai, Yanzhe Chen, Mike Zheng Shou
Abstract:
Instruction‑based video editing has witnessed rapid progress, yet current methods often struggle with precise visual control, as natural language is inherently limited in describing complex visual nuances. Although reference‑guided editing offers a robust solution, its potential is currently bottlenecked by the scarcity of high‑quality paired training data. To bridge this gap, we introduce a scalable data generation pipeline that transforms existing video editing pairs into high‑fidelity training quadruplets, leveraging image generative models to create synthesized reference scaffolds. Using this pipeline, we construct RefVIE, a large‑scale dataset tailored for instruction‑reference‑following tasks, and establish RefVIE‑Bench for comprehensive evaluation. Furthermore, we propose a unified editing architecture, Kiwi‑Edit, that synergizes learnable queries and latent visual features for reference semantic guidance. Our model achieves significant gains in instruction following and reference fidelity via a progressive multi‑stage training curriculum. Extensive experiments demonstrate that our data and architecture establish a new state‑of‑the‑art in controllable video editing. All datasets, models, and code is released at https://github.com/showlab/Kiwi‑Edit.
Authors:Anthony Liang, Yigit Korkmaz, Jiahui Zhang, Minyoung Hwang, Abrar Anwar, Sidhant Kaushik, Aditya Shah, Alex S. Huang, Luke Zettlemoyer, Dieter Fox, Yu Xiang, Anqi Li, Andreea Bobu, Abhishek Gupta, Stephen Tu, Erdem Biyik, Jesse Zhang
Abstract:
General‑purpose robot reward models are typically trained to predict absolute task progress from expert demonstrations, providing only local, frame‑level supervision. While effective for expert demonstrations, this paradigm scales poorly to large‑scale robotics datasets where failed and suboptimal trajectories are abundant and assigning dense progress labels is ambiguous. We introduce Robometer, a scalable reward modeling framework that combines intra‑trajectory progress supervision with inter‑trajectory preference supervision. Robometer is trained with a dual objective: a frame‑level progress loss that anchors reward magnitude on expert data, and a trajectory‑comparison preference loss that imposes global ordering constraints across trajectories of the same task, enabling effective learning from both real and augmented failed trajectories. To support this formulation at scale, we curate RBM‑1M, a reward‑learning dataset comprising over one million trajectories spanning diverse robot embodiments and tasks, including substantial suboptimal and failure data. Across benchmarks and real‑world evaluations, Robometer learns more generalizable reward functions than prior methods and improves robot learning performance across a diverse set of downstream applications. Code, model weights, and videos at https://robometer.github.io/.
Authors:Harikrishnan Unnikrishnan
Abstract:
Background: Accurate glottal segmentation in high‑speed videoendoscopy (HSV) is essential for extracting kinematic biomarkers of laryngeal function. However, existing deep learning models often produce spurious artifacts in non‑glottal frames and fail to generalize across different clinical settings. Methods: We propose a detection‑gated pipeline that integrates a localizer with a segmenter. A temporal consistency wrapper ensures robustness by suppressing false positives during glottal closure and occlusion. The segmenter was trained on a limited subset of the GIRAFE dataset (600 frames), while the localizer was trained on the BAGLS training set. The in‑distribution localizer provides a tight region of interest (ROI), removing geometric anatomical variations and enabling cross‑dataset generalization without fine‑tuning. Results: The pipeline achieved state‑of‑the‑art performance on the GIRAFE (DSC=0.81) and BAGLS (DSC=0.85) benchmarks and demonstrated superior generalizability. Notably, the framework maintained robust cross‑dataset generalization (DSC=0.77). Downstream validation on a 65‑subject clinical cohort confirmed that automated kinematic features ‑ specifically the Open Quotient and Glottal Area Waveform (GAW) ‑ remained consistent with clinical benchmarks. The coefficient of variation (CV) of the glottal area was a significant marker for distinguishing healthy from pathological vocal function (p=0.006). Conclusions: This architecture provides a computationally efficient solution (~35 frames/s) suitable for real‑time clinical use. By overcoming cross‑dataset variability, this framework facilitates the standardized, large‑scale extraction of clinical biomarkers across diverse endoscopy platforms. Code, trained weights, and evaluation scripts are released at https://github.com/hari‑krishnan/openglottal.
Authors:Pengyuan Wu, Pingrui Zhang, Zhigang Wang, Dong Wang, Bin Zhao, Xuelong Li
Abstract:
Diffusion‑based policies have achieved remarkable results in robotic manipulation but often struggle to adapt rapidly in dynamic scenarios, leading to delayed responses or task failures. We present DCDP, a Dynamic Closed‑Loop Diffusion Policy framework that integrates chunk‑based action generation with real‑time correction. DCDP integrates a self‑supervised dynamic feature encoder, cross‑attention fusion, and an asymmetric action encoder‑decoder to inject environmental dynamics before action execution, achieving real‑time closed‑loop action correction and enhancing the system's adaptability in dynamic scenarios. In dynamic PushT simulations, DCDP improves adaptability by 19% without retraining while requiring only 5% additional computation. Its modular design enables plug‑and‑play integration, achieving both temporal coherence and real‑time responsiveness in dynamic robotic scenarios, including real‑world manipulation tasks. The project page is at: https://github.com/wupengyuan/dcdp
Authors:Songming Zhang, Xue Zhang, Tong Zhang, Bojie Hu, Yufeng Chen, Jinan Xu
Abstract:
Knowledge distillation (KD) is an essential technique to compress large language models (LLMs) into smaller ones. However, despite the distinct roles of the student model and the teacher model in KD, most existing frameworks still use a homogeneous training backend (e.g., FSDP and DeepSpeed) for both models, leading to suboptimal training efficiency. In this paper, we present a novel framework for LLM distillation, termed KDFlow, which features a decoupled architecture and employs SGLang for teacher inference. By bridging the training efficiency of FSDP2 and the inference efficiency of SGLang, KDFlow achieves full utilization of both advantages in a unified system. Moreover, instead of transferring full logits across different processes, our framework only transmits the teacher's hidden states using zero‑copy data transfer and recomputes the logits on the student side, effectively balancing the communication cost and KD performance. Furthermore, our framework supports both off‑policy and on‑policy distillation and incorporates KD algorithms for cross‑tokenizer KD through highly extensible and user‑friendly APIs. Experiments show that KDFlow can achieve 1.44× to 6.36× speedup compared to current KD frameworks, enabling researchers to rapidly prototype and scale LLM distillation with minimal engineering overhead. Code is available at: https://github.com/songmzhang/KDFlow
Authors:Naoki Shitanda, Motoki Omura, Tatsuya Harada, Takayuki Osa
Abstract:
Scaling reinforcement learning to tens of thousands of parallel environments requires overcoming the limited exploration capacity of a single policy. Ensemble‑based policy gradient methods, which employ multiple policies to collect diverse samples, have recently been proposed to promote exploration. However, merely broadening the exploration space does not always enhance learning capability, since excessive exploration can reduce exploration quality or compromise training stability. In this work, we theoretically analyze the impact of inter‑policy diversity on learning efficiency in policy ensembles, and propose Coupled Policy Optimization which regulates diversity through KL constraints between policies. The proposed method enables effective exploration and outperforms strong baselines such as SAPG, PBT, and PPO across multiple tasks, including challenging dexterous manipulation, in terms of both sample efficiency and final performance. Furthermore, analysis of policy diversity and effective sample size during training reveals that follower policies naturally distribute around the leader, demonstrating the emergence of structured and efficient exploratory behavior. Our results indicate that diverse exploration under appropriate regulation is key to achieving stable and sample‑efficient learning in ensemble policy gradient methods. Project page at https://naoki04.github.io/paper‑cpo/ .
Authors:Qizheng Li, Yifei Zhang, Xiao Yang, Xu Yang, Zhuo Wang, Weiqing Liu, Jiang Bian
Abstract:
Fine‑tuning large language models for vertical domains remains labor‑intensive, requiring practitioners to curate data, configure training, and iteratively diagnose model behavior. Despite growing interest in autonomous machine learning and language agents, end‑to‑end LLM fine‑tuning has not been systematically studied as an interactive agent task. We introduce FT‑Dojo, an interactive benchmark environment for autonomous LLM fine‑tuning, comprising 13 tasks across 5 domains. Rather than a new collection of static datasets, FT‑Dojo standardizes a task interface, shared raw‑data repository, sandboxed execution environment, structured feedback protocol, and held‑out evaluation procedure. We further develop FT‑Agent, a fine‑tuning‑oriented autonomous framework that uses structured iteration planning, fail‑fast validation, and multi‑level feedback analysis to refine data and training strategies. Experiments show that FT‑Agent provides a strong initial baseline, achieving the best performance on 10 out of 13 tasks, with additional controlled comparisons against frontier agents, open‑source planning backbones, and multi‑run statistics supporting the main findings. Case studies show that agents can recover from failures through cumulative learning, while still exposing limitations in causal diagnosis and long‑horizon planning. The implementation is available at https://github.com/microsoft/rd‑agent.
Authors:Yifei Zhang, Xu Yang, Xiao Yang, Bowen Xian, Qizheng Li, Shikai Fang, Jingyuan Li, Jian Wang, Mingrui Xu, Weiqing Liu, Jiang Bian
Abstract:
LLM‑based agents for machine learning engineering (MLE) predominantly rely on tree search, a form of gradient‑free optimization that uses scalar validation scores to rank candidates. As LLM reasoning capabilities improve, exhaustive enumeration becomes increasingly inefficient compared to directed updates, analogous to how accurate gradients enable efficient descent over random search. We introduce \textscGome, an MLE agent that operationalizes gradient‑based optimization. \textscGome maps structured diagnostic reasoning to gradient computation, success memory to momentum, and multi‑trace execution to distributed optimization. Under a closed‑world protocol that isolates architectural effects from external knowledge, \textscGome achieves a state‑of‑the‑art 35.1% any‑medal rate on MLE‑Bench with a restricted 12‑hour budget on a single V100 GPU. Scaling experiments across 10 models reveal a critical crossover: with weaker models, tree search retains advantages by compensating for unreliable reasoning through exhaustive exploration; as reasoning capability strengthens, gradient‑based optimization progressively outperforms, with the gap widening at frontier‑tier models. Given the rapid advancement of reasoning‑oriented LLMs, this positions gradient‑based optimization as an increasingly favorable paradigm. We release our codebase and GPT‑5 traces at https://github.com/microsoft/RD‑Agent.
Authors:Wenye Lin, Kai Han
Abstract:
Injecting new reasoning knowledge into Large Language Models (LLMs) via post‑training often induces catastrophic forgetting. Recent studies emphasize the importance of on‑policy data but suggest that KL‑divergence fails to mitigate forgetting. In contrast, we show, both analytically and empirically, that the KL‑constrained reward formulation actually plays a critical role in retaining knowledge during post‑training. This motivates our Surgical Post‑Training (SPOT), a proximal on‑policy distillation framework designed to optimize reasoning efficiently while preserving prior knowledge. SPOT consists of (1) a data rectification pipeline employing an Oracle to surgically correct erroneous steps via minimal edits, generating proximal on‑policy data; and (2) a reward‑based binary cross‑entropy objective essential for enhancing reasoning and mitigating forgetting. Empirically, with only 4k rectified math pairs, SPOT improves Qwen3‑8B's accuracy by 6.2% on average across in‑domain and out‑of‑domain tasks, requiring merely 16‑minute model training on 8x H800 GPUs. Moreover, SPOT provides a superior initialization for subsequent reinforcement learning, significantly elevating the performance ceiling. Code: https://github.com/Visual‑AI/SPoT
Authors:Yuexi Du, Jinglu Wang, Shujie Liu, Nicha C. Dvornek, Yan Lu
Abstract:
Large visual language models (VLMs) have shown strong multi‑modal medical reasoning ability, but most operate as end‑to‑end black boxes, diverging from clinicians' evidence‑based, staged workflows and hindering clinical accountability. Complementarily, expert visual grounding models can accurately localize regions of interest (ROIs), providing explicit, reliable evidence that improves both reasoning accuracy and trust. In this paper, we introduce CARE, advancing Clinical Accountability in multi‑modal medical Reasoning with an Evidence‑grounded agentic framework. Unlike existing approaches that couple grounding and reasoning within a single generalist model, CARE decomposes the task into coordinated sub‑modules to reduce shortcut learning and hallucination: a compact VLM proposes relevant medical entities; an expert entity‑referring segmentation model produces pixel‑level ROI evidence; and a grounded VLM reasons over the full image augmented by ROI hints. The VLMs are optimized with reinforcement learning with verifiable rewards to align answers with supporting evidence. Furthermore, a VLM coordinator plans tool invocation and reviews evidence‑answer consistency, providing agentic control and final verification. Evaluated on standard medical VQA benchmarks, our CARE‑Flow (coordinator‑free) improves average accuracy by 10.9% over the same size (10B) state‑of‑the‑art (SOTA). With dynamic planning and answer review, our CARE‑Coord yields a further gain, outperforming the heavily pre‑trained SOTA by 5.2%. Our experiments demonstrate that an agentic framework that emulates clinical workflows, incorporating decoupled specialized models and explicit evidence, yields more accurate and accountable medical AI. Project page: https://xypb.github.io/CARE‑Project‑Page/
Authors:Jisoo Kim, Jungbin Cho, Sanghyeok Chu, Ananya Bal, Jinhyung Kim, Gunhee Lee, Sihaeng Lee, Seung Hwan Kim, Bohyung Han, Hyunmin Lee, Laszlo A. Jeni, Seungryong Kim
Abstract:
Humans learn not only how their bodies move, but also how the surrounding world responds to their actions. In contrast, while recent Vision‑Language‑Action (VLA) models exhibit impressive semantic understanding, they often fail to capture the spatiotemporal dynamics governing physical interaction. In this paper, we introduce Pri4R, a simple yet effective approach that endows VLA models with an implicit understanding of world dynamics by leveraging privileged 4D information during training. Specifically, Pri4R augments VLAs with a lightweight point track head that predicts 3D point tracks. By injecting VLA features into this head to jointly predict future 3D trajectories, the model learns to incorporate evolving scene geometry within its shared representation space, enabling more physically aware context for precise control. Due to its architectural simplicity, Pri4R is compatible with dominant VLA design patterns with minimal changes. During inference, we run the model using the original VLA architecture unchanged; Pri4R adds no extra inputs, outputs, or computational overhead. Across simulation and real‑world evaluations, Pri4R significantly improves performance on challenging manipulation tasks, including a +10% gain on LIBERO‑Long and a +40% gain on RoboCasa. We further show that 3D point track prediction is an effective supervision target for learning action‑world dynamics, and validate our design choices through extensive ablations. Project page: https://jiiiisoo.github.io/Pri4R/
Authors:Niu Lian, Yuting Wang, Hanshu Yao, Jinpeng Wang, Bin Chen, Yaowei Wang, Min Zhang, Shu-Tao Xia
Abstract:
While multimodal large language models have demonstrated impressive short‑term reasoning, they struggle with long‑horizon video understanding due to limited context windows and static memory mechanisms that fail to mirror human cognitive efficiency. Existing paradigms typically fall into two extremes: vision‑centric methods that incur high latency and redundancy through dense visual accumulation, or text‑centric approaches that suffer from detail loss and hallucination via aggressive captioning. To bridge this gap, we propose MM‑Mem, a pyramidal multimodal memory architecture grounded in Fuzzy‑Trace Theory. MM‑Mem structures memory hierarchically into a Sensory Buffer, Episodic Stream, and Symbolic Schema, enabling the progressive distillation of fine‑grained perceptual traces (verbatim) into high‑level semantic schemas (gist). Furthermore, to govern the dynamic construction of memory, we derive a Semantic Information Bottleneck objective and introduce SIB‑GRPO to optimize the trade‑off between memory compression and task‑relevant information retention. In inference, we design an entropy‑driven top‑down memory retrieval strategy. Extensive experiments across 4 benchmarks confirm that MM‑Mem achieves state‑of‑the‑art performance on both offline and streaming tasks, demonstrating robust generalization and validating the effectiveness of cognition‑inspired memory organization. Code and associated configurations are publicly available at https://github.com/EliSpectre/MM‑Mem.
Authors:Yuchen Ying, Weiqi Jiang, Tongya Zheng, Yu Wang, Shunyu Liu, Kaixuan Chen, Mingli Song
Abstract:
Knowledge graphs provide structured and reliable information for many real‑world applications, motivating increasing interest in combining large language models (LLMs) with graph‑based retrieval to improve factual grounding. Recent Graph‑based Retrieval‑Augmented Generation (GraphRAG) methods therefore introduce iterative interaction between LLMs and knowledge graphs to enhance reasoning capability. However, existing approaches typically depend on manually designed guidance and interact with knowledge graphs through a limited set of predefined tools, which substantially constrains graph exploration. To address these limitations, we propose GraphScout, a training‑centric agentic graph reasoning framework equipped with more flexible graph exploration tools. GraphScout enables models to autonomously interact with knowledge graphs to synthesize structured training data which are then used to post‑train LLMs, thereby internalizing agentic graph reasoning ability without laborious manual annotation or task curation. Extensive experiments across five knowledge‑graph domains show that a small model (e.g., Qwen3‑4B) augmented with GraphScout outperforms baseline methods built on leading LLMs (e.g., Qwen‑Max) by an average of 16.7% while requiring significantly fewer inference tokens. Moreover, GraphScout exhibits robust cross‑domain transfer performance. Our code will be made publicly available~\footnotehttps://github.com/Ying‑Yuchen/_GraphScout_.
Authors:Zilong Zhao, Zhengming Ding, Pei Niu, Wenhao Sun, Feng Guo
Abstract:
Feature encoders play a key role in pixel‑level crack segmentation by shaping the representation of fine textures and thin structures. Existing CNN‑, Transformer‑, and Mamba‑based models each capture only part of the required spatial or structural information, leaving clear gaps in modeling complex crack patterns. To address this, we present MixerCSeg, a mixer architecture designed like a coordinated team of specialists, where CNN‑like pathways focus on local textures, Transformer‑style paths capture global dependencies, and Mamba‑inspired flows model sequential context within a single encoder. At the core of MixerCSeg is the TransMixer, which explores Mamba's latent attention behavior while establishing dedicated pathways that naturally express both locality and global awareness. To further enhance structural fidelity, we introduce a spatial block processing strategy and a Direction‑guided Edge Gated Convolution (DEGConv) that strengthens edge sensitivity under irregular crack geometries with minimal computational overhead. A Spatial Refinement Multi‑Level Fusion (SRF) module is then employed to refine multi‑scale details without increasing complexity. Extensive experiments on multiple crack segmentation benchmarks show that MixerCSeg achieves state‑of‑the‑art performance with only 2.05 GFLOPs and 2.54 M parameters, demonstrating both efficiency and strong representational capability. The code is available at https://github.com/spiderforest/MixerCSeg.
Authors:Abdullah Al Shafi, Md Kawsar Mahmud Khan Zunayed, Safin Ahmmed, Sk Imran Hossain, Engelbert Mephu Nguifo
Abstract:
Breast ultrasound interpretation requires simultaneous lesion segmentation and tissue classification. However, conventional multi‑task learning approaches suffer from task interference and rigid coordination strategies that fail to adapt to instance‑specific prediction difficulty. We propose a multi‑task framework addressing these limitations through multi‑level decoder interaction and uncertainty‑aware adaptive coordination. Task Interaction Modules operate at all decoder levels, establishing bidirectional segmentation‑classification communication during spatial reconstruction through attention weighted pooling and multiplicative modulation. Unlike prior single‑level or encoder‑only approaches, this multi‑level design captures scale specific task synergies across semantic‑to‑spatial scales, producing complementary task interaction streams. Uncertainty‑Proxy Attention adaptively weights base versus enhanced features at each level using feature activation variance, enabling per‑level and per‑sample task balancing without heuristic tuning. To support instance‑adaptive prediction, multi‑scale context fusion captures morphological cues across varying lesion sizes. Evaluation on multiple publicly available breast ultrasound datasets demonstrates competitive performance, including 74.5% lesion IoU and 90.6% classification accuracy on BUSI dataset. Ablation studies confirm that multi‑level task interaction provides significant performance gains, validating that decoder‑level bidirectional communication is more effective than conventional encoder‑only parameter sharing. The code is available at: https://github.com/C‑loud‑Nine/Uncertainty‑Aware‑Multi‑Level‑Decoder‑Interaction.
Authors:Maifang Zhang, Hang Yu, Qian Zuo, Cheng Wang, Vaishak Belle, Fengxiang He
Abstract:
This paper proposes Proximal Policy Optimization with Linear Temporal Logic Constraints (PPO‑LTL), a framework that integrates safety constraints written in LTL into PPO for safe reinforcement learning. LTL constraints offer rigorous representations of complex safety requirements, such as regulations that broadly exist in robotics, enabling systematic monitoring of safety requirements. Violations against LTL constraints are monitored by limit‑deterministic Büchi automata, and then translated by a logic‑to‑cost mechanism into penalty signals. The signals are further employed for guiding the policy optimization via the Lagrangian scheme. Extensive experiments on the Zones and CARLA environments show that our PPO‑LTL can consistently reduce safety violations, while maintaining competitive performance, against the state‑of‑the‑art methods. The code is at https://github.com/EVIEHub/PPO‑LTL.
Authors:Oscar Rivera, Ziqing Wang, Matthieu Dagommer, Abhishek Pandey, Kaize Ding
Abstract:
Machine learning accelerates molecular property prediction, yet state‑of‑the‑art Large Language Models and Graph Neural Networks operate as black boxes. In drug discovery, where safety is critical, this opacity risks masking false correlations and excluding human expertise. Existing interpretability methods suffer from the effectiveness‑trustworthiness trade‑off: explanations may fail to reflect a model's true reasoning, degrade performance, or lack domain grounding. Concept Bottleneck Models (CBMs) offer a solution by projecting inputs to human‑interpretable concepts before readout, ensuring that explanations are inherently faithful to the decision process. However, adapting CBMs to chemistry faces three challenges: the Relevance Gap (selecting task‑relevant concepts from a large descriptor space), the Annotation Gap (obtaining concept supervision for molecular data), and the Capacity Gap (degrading performance due to bottleneck constraints). We introduce GlassMol, a model‑agnostic CBM that addresses these gaps through automated concept curation and LLM‑guided concept selection. Experiments across thirteen benchmarks demonstrate that \method generally matches or exceeds black‑box baselines, suggesting that interpretability does not sacrifice performance and challenging the commonly assumed trade‑off. Code is available at https://github.com/walleio/GlassMol.
Authors:Tianxing Chen, Yuran Wang, Mingleyang Li, Yan Qin, Hao Shi, Zixuan Li, Yifan Hu, Yingsheng Zhang, Kaixuan Wang, Yue Chen, Hongcheng Wang, Renjing Xu, Ruihai Wu, Yao Mu, Yaodong Yang, Hao Dong, Ping Luo
Abstract:
Robotic manipulation policies have made rapid progress in recent years, yet most existing approaches give limited consideration to memory capabilities. Consequently, they struggle to solve tasks that require reasoning over historical observations and maintaining task‑relevant information over time, which are common requirements in real‑world manipulation scenarios. Although several memory‑aware policies have been proposed, systematic evaluation of memory‑dependent manipulation remains underexplored, and the relationship between architectural design choices and memory performance is still not well understood. To address this gap, we introduce RMBench, a simulation benchmark comprising 9 manipulation tasks that span multiple levels of memory complexity, enabling systematic evaluation of policy memory capabilities. We further propose Mem‑0, a modular manipulation policy with explicit memory components designed to support controlled ablation studies. Through extensive simulation and real‑world experiments, we identify memory‑related limitations in existing policies and provide empirical insights into how architectural design choices influence memory performance. The website is available at https://rmbench.github.io/.
Authors:Victor May, Aaditya Salgarkar, Yishan Wang, Diganta Misra, Huu Nguyen
Abstract:
Tool‑augmented LLMs are increasingly deployed as agents that interleave natural‑language reasoning with executable Python actions, as in CodeAct‑style frameworks. In deployment, these agents rely on runtime state that persists across steps. By contrast, the traces used to post‑train these models rarely encode how interpreter state is managed. We ask whether interpreter persistence is merely a runtime scaffold, or a property of the training data that shapes how agents learn to use the interpreter. We isolate state persistence as a training‑time variable. We introduce Opaque Knapsack, a procedurally generated family of partially observable optimization tasks designed to prevent one‑shot solutions. Item attributes and constraints are hidden behind budgeted tool calls, forcing multi‑turn control flow and iterative state revision. Holding task instances, prompts, tools, model, and supervision fixed, we generate matched trajectories differing only in whether interpreter state persists across steps or resets after each action. We then fine‑tune identical base models (Qwen3‑8B) on each trace variant and evaluate all four train‑runtime combinations. Our 2x2 cross‑evaluation shows that interpreter persistence shapes how agents reach solutions, not whether they do: solution quality is statistically indistinguishable across conditions, but token cost and stability differ substantially. A persistent‑trained model in a stateless runtime triggers missing‑variable errors in roughly 80% of episodes; a stateless‑trained model in a persistent runtime redundantly re‑derives retained state, using roughly 3.5x more tokens. Interpreter persistence should be treated as a first‑class semantic of agent traces. Aligning fine‑tuning data with deployment runtimes improves efficiency and reduces brittle train‑runtime mismatches.
Authors:Yanping Li, Zhening Liu, Zijian Li, Zehong Lin, Jun Zhang
Abstract:
Fine‑tuning large language models (LLMs) on custom datasets has become a standard approach for adapting these models to specific domains and applications. However, recent studies have shown that such fine‑tuning can lead to significant degradation in the model's safety. Existing defense methods operate at the sample level and often suffer from an unsatisfactory trade‑off between safety and utility. To address this limitation, we perform a systematic token‑level diagnosis of safety degradation during fine‑tuning. Based on this, we propose token‑level data selection for safe LLM fine‑tuning (TOSS), a novel framework that quantifies the safety risk of each token by measuring the loss difference between a safety‑degraded model and a utility‑oriented model. This token‑level granularity enables accurate identification and removal of unsafe tokens, thereby preserving valuable task‑specific information. In addition, we introduce a progressive refinement strategy, TOSS‑Pro, which iteratively enhances the safety‑degraded model's ability to identify unsafe tokens. Extensive experiments demonstrate that our approach robustly safeguards LLMs during fine‑tuning while achieving superior downstream task performance, significantly outperforming existing sample‑level defense methods. Our code is available at https://github.com/Polly‑LYP/TOSS.
Authors:Sumin Kim, Hyemin Jeong, Mingu Kang, Yejin Kim, Yoori Oh, Joonseok Lee
Abstract:
The exponential growth of video content necessitates effective video summarization to efficiently extract key information from long videos. However, current approaches struggle to fully comprehend complex videos, primarily because they employ static or modality‑agnostic fusion strategies. These methods fail to account for the dynamic, frame‑dependent variations in modality saliency inherent in video data. To overcome these limitations, we propose TripleSumm, a novel architecture that adaptively weights and fuses the contributions of visual, text, and audio modalities at the frame level. Furthermore, a significant bottleneck for research into multimodal video summarization has been the lack of comprehensive benchmarks. Addressing this bottleneck, we introduce MoSu (Most Replayed Multimodal Video Summarization), the first large‑scale benchmark that provides all three modalities. Extensive experiments demonstrate that TripleSumm achieves state‑of‑the‑art performance, outperforming existing methods by a significant margin on four benchmarks, including MoSu. Our code and dataset are available at https://github.com/smkim37/TripleSumm.
Authors:Durgesh Ameta, Ujjwal Mishra, Praful Hambarde, Amit Shukla
Abstract:
Change detection (CD) in remote sensing aims to identify semantic differences between satellite images captured at different times. While deep learning has significantly advanced this field, existing approaches based on convolutional neural networks (CNNs), transformers and Selective State Space Models (SSMs) still struggle to precisely delineate change regions. In particular, traditional transformer‑based methods suffer from quadratic computational complexity when applied to very high‑resolution (VHR) satellite images and often perform poorly with limited training data, leading to under‑utilization of the rich spatial information available in VHR imagery. We present GRAD‑Former, a novel framework that enhances contextual understanding while maintaining efficiency through reduced model size. The proposed framework consists of a novel encoder with Adaptive Feature Relevance and Refinement (AFRAR) module, fusion and decoder blocks. AFRAR integrates global‑local contextual awareness through two proposed components: the Selective Embedding Amplification (SEA) module and the Global‑Local Feature Refinement (GLFR) module. SEA and GLFR leverage gating mechanisms and differential attention, respectively, which generates multiple softmax heaps to capture important features while minimizing the captured irreverent features. Multiple experiments across three challenging CD datasets (LEVIR‑CD, CDD, DSIFN‑CD) demonstrate GRAD‑Former's superior performance compared to existing approaches. Notably, GRAD‑Former outperforms the current state‑of‑the‑art models across all the metrics and all the datasets while using fewer parameters. Our framework establishes a new benchmark for remote sensing change detection performance. Our code will be released at: https://github.com/Ujjwal238/GRAD‑Former
Authors:Tongzhou Wu, Yuhao Wang, Xinyu Ma, Xiuqiang He, Shuaiqiang Wang, Dawei Yin, Xiangyu Zhao
Abstract:
Deep‑research agents are capable of executing multi‑step web exploration, targeted retrieval, and sophisticated question answering. Despite their powerful capabilities, deep‑research agents face two critical bottlenecks: (1) the lack of large‑scale, challenging datasets with real‑world difficulty, and (2) the absence of accessible, open‑source frameworks for data synthesis and agent training. To bridge these gaps, we first construct DeepResearch‑9K, a large‑scale challenging dataset specifically designed for deep‑research scenarios built from open‑source multi‑hop question‑answering (QA) datasets via a low‑cost autonomous pipeline. Notably, it consists of (1) 9000 questions spanning three difficulty levels from L1 to L3 (2) high‑quality search trajectories with reasoning chains from Tongyi‑DeepResearch‑30B‑A3B, a state‑of‑the‑art deep‑research agent, and (3) verifiable answers. Furthermore, we develop an open‑source training framework DeepResearch‑R1 that supports (1) multi‑turn web interactions, (2) different reinforcement learning (RL) approaches, and (3) different reward models such as rule‑based outcome reward and LLM‑as‑judge feedback. Finally, empirical results demonstrate that agents trained on DeepResearch‑9K under our DeepResearch‑R1 achieve state‑of‑the‑art results on challenging deep‑research benchmarks. We release the DeepResearch‑9K dataset on https://huggingface.co/datasets/artillerywu/DeepResearch‑9K and the code of DeepResearch‑R1 on https://github.com/Applied‑Machine‑Learning‑Lab/DeepResearch‑R1.
Authors:Haowen Gao, Zhenyu Zhang, Liang Pang, Fangda Guo, Hongjian Dou, Guannan Lv, Shaoguo Liu, Tingting Gao, Huawei Shen, Xueqi Cheng
Abstract:
Reinforcement learning (RL) with group relative policy optimization (GRPO) has become a widely adopted approach for enhancing the reasoning capabilities of multimodal large language models (MLLMs). While GRPO enables long‑chain reasoning without a critic, it often suffers from sparse rewards on difficult problems and advantage vanishing when group‑level rewards are too consistent for overly easy or hard problems. Existing solutions (sample expansion, selective utilization, and indirect reward design) often fail to maintain enough variance in within‑group reward distributions to yield clear optimization signals. To address this, we propose DIVA‑GRPO, a difficulty‑adaptive variant advantage method that adjusts variant difficulty distributions from a global perspective. DIVA‑GRPO dynamically assesses problem difficulty, samples variants with appropriate difficulty levels, and calculates advantages across local and global groups using difficulty‑weighted and normalized scaling. This alleviates reward sparsity and advantage vanishing while improving training stability. Extensive experiments on six reasoning benchmarks demonstrate that DIVA‑GRPO outperforms existing approaches in training efficiency and reasoning performance. Code: https://github.com/Siaaaaaa1/DIVA‑GRPO
Authors:Huanjin Yao, Qixiang Yin, Min Yang, Ziwang Zhao, Yibo Wang, Haotian Luo, Jingyi Zhang, Jiaxing Huang
Abstract:
We aim to develop a multimodal research agent capable of explicit reasoning and planning, multi‑tool invocation, and cross‑modal information synthesis, enabling it to conduct deep research tasks. However, we observe three main challenges in developing such agents: (1) scarcity of search‑intensive multimodal QA data, (2) lack of effective search trajectories, and (3) prohibitive cost of training with online search APIs. To tackle them, we first propose Hyper‑Search, a hypergraph‑based QA generation method that models and connects visual and textual nodes within and across modalities, enabling to generate search‑intensive multimodal QA pairs that require invoking various search tools to solve. Second, we introduce DR‑TTS, which first decomposes search‑involved tasks into several categories according to search tool types, and respectively optimize specialized search tool experts for each tool. It then recomposes tool experts to jointly explore search trajectories via tree search, producing trajectories that successfully solve complex tasks using various search tools. Third, we build an offline search engine supporting multiple search tools, enabling agentic reinforcement learning without using costly online search APIs. With the three designs, we develop MM‑DeepResearch, a powerful multimodal deep research agent, and extensive results shows its superiority across benchmarks. Code is available at https://github.com/HJYao00/MM‑DeepResearch
Authors:Jiafeng Lin, Yuxuan Wang, Jialong Wu, Huakun Luo, Zhongyi Pei, Jianmin Wang
Abstract:
Large Language Models (LLMs) have demonstrated remarkable success in general‑purpose reasoning. However, they still struggle to understand and reason about time series data, which limits their effectiveness in decision‑making scenarios that depend on temporal dynamics. In this paper, we propose Thoth, the first family of mid‑trained LLMs with general‑purpose time series understanding capabilities. As a pivotal intermediate stage, mid‑training achieves task‑ and domain‑agnostic alignment between time series and natural language, for which we construct Book‑of‑Thoth, a high‑quality, time‑series‑centric mid‑training corpus. Book‑of‑Thoth enables both time‑series‑to‑text and text‑to‑time‑series generation, equipping LLMs with a foundational grasp of temporal patterns. To better evaluate advanced reasoning capabilities, we further present KnoTS, a novel benchmark of knowledge‑intensive time series understanding, designed for joint reasoning over temporal patterns and domain knowledge. Extensive experiments demonstrate that mid‑training with Book‑of‑Thoth enables Thoth to significantly outperform its base model and advanced LLMs across a range of time series question answering benchmarks. Moreover, Thoth exhibits superior capabilities when fine‑tuned under data scarcity, underscoring the effectiveness of mid‑training for time series understanding. Code is available at: https://github.com/thuml/Thoth.
Authors:Yangyang Xu, Junbo Ke, You-Wei Wen, Chao Wang
Abstract:
Tensor Ring (TR) decomposition is a powerful tool for high‑order data modeling, but is inherently restricted to discrete forms defined on fixed meshgrids. In this work, we propose a TR functional decomposition for both meshgrid and non‑meshgrid data, where factors are parameterized by Implicit Neural Representations (INRs). However, optimizing this continuous framework to capture fine‑scale details is intrinsically difficult. Through a frequency‑domain analysis, we demonstrate that the spectral structure of TR factors determines the frequency composition of the reconstructed tensor and limits the high‑frequency modeling capacity. To mitigate this, we propose a reparameterized TR functional decomposition, in which each TR factor is a structured combination of a learnable latent tensor and a fixed basis. This reparameterization is theoretically shown to improve the training dynamics of TR factor learning. We further derive a principled initialization scheme for the fixed basis and prove the Lipschitz continuity of our proposed model. Extensive experiments on image inpainting, denoising, super‑resolution, and point cloud recovery demonstrate that our method achieves consistently superior performance over existing approaches. Code is available at https://github.com/YangyangXu2002/RepTRFD.
Authors:Junbo Ke, Yangyang Xu, You-Wei Wen, Chao Wang
Abstract:
Implicit Neural Representations (INRs) have emerged as a powerful paradigm for various signal processing tasks, but their inherent spectral bias limits the ability to capture high‑frequency details. Existing methods partially mitigate this issue by using Fourier‑based features, which usually rely on fixed frequency bases. This forces multi‑layer perceptrons (MLPs) to inefficiently compose the required frequencies, thereby constraining their representational capacity. To address this limitation, we propose Content‑Aware Frequency Encoding (CAFE), which builds upon Fourier features through multiple parallel linear layers combined via a Hadamard product. CAFE can explicitly and efficiently synthesize a broader range of frequency bases, while the learned weights enable the selection of task‑relevant frequencies. Furthermore, we extend this framework to CAFE+, which incorporates Chebyshev features as a complementary component to Fourier bases. This combination provides a stronger and more stable frequency representation. Extensive experiments across multiple benchmarks validate the effectiveness and efficiency of our approach, consistently achieving superior performance over existing methods. Our code is available at https://github.com/JunboKe0619/CAFE.
Authors:Zhonghang Li, Zongwei Li, Yuxuan Chen, Han Shi, Jiawei Li, Jierun Chen, Haoli Bai, Chao Huang
Abstract:
Repository‑scale code reasoning is a cornerstone of modern AI‑assisted software engineering, enabling Large Language Models (LLMs) to handle complex workflows from program comprehension to complex debugging. However, balancing accuracy with context cost remains a significant bottleneck, as existing agentic approaches often waste computational resources through inefficient, iterative full‑text exploration. To address this, we introduce FastCode, a framework that decouples repository exploration from content consumption. FastCode utilizes a structural scouting mechanism to navigate a lightweight semantic‑structural map of the codebase, allowing the system to trace dependencies and pinpoint relevant targets without the overhead of full‑text ingestion. By leveraging structure‑aware navigation tools regulated by a cost‑aware policy, the framework constructs high‑value contexts in a single, optimized step. Extensive evaluations on the SWE‑QA, LongCodeQA, LOC‑BENCH, and GitTaskBench benchmarks demonstrate that FastCode consistently outperforms state‑of‑the‑art baselines in reasoning accuracy while significantly reducing token consumption, validating the efficiency of scouting‑first strategies for large‑scale code reasoning. Source code is available at https://github.com/HKUDS/FastCode.
Authors:Akshat Singh Jaswal, Ashish Baghel
Abstract:
Modern web applications are increasingly produced through AI‑assisted development and rapid no‑code deployment pipelines, widening the gap between accelerating software velocity and the limited adaptability of existing security tooling. Pattern‑driven scanners fail to reason about novel contexts, while emerging LLM‑based penetration testers rely on unconstrained exploration, yielding high cost, unstable behavior, and poor reproducibility. We introduce AWE, a memory‑augmented multi‑agent framework for autonomous web penetration testing that embeds structured, vulnerability‑specific analysis pipelines within a lightweight LLM orchestration layer. Unlike general‑purpose agents, AWE couples context aware payload mutations and generations with persistent memory and browser‑backed verification to produce deterministic, exploitation‑driven results. Evaluated on the 104‑challenge XBOW benchmark, AWE achieves substantial gains on injection‑class vulnerabilities ‑ 87% XSS success (+30.5% over MAPTA) and 66.7% blind SQL injection success (+33.3%) ‑ while being much faster, cheaper, and more token‑efficient than MAPTA, despite using a midtier model (Claude Sonnet 4) versus MAPTA's GPT‑5. MAPTA retains higher overall coverage due to broader exploratory capabilities, underscoring the complementary strengths of specialized and general‑purpose architectures. Our results demonstrate that architecture matters as much as model reasoning capabilities: integrating LLMs into principled, vulnerability‑aware pipelines yields substantial gains in accuracy, efficiency, and determinism for injection‑class exploits. The source code for AWE is available at: https://github.com/stuxlabs/AWE
Authors:Seungwook Kim, Minsu Cho
Abstract:
Text‑to‑image generation powers content creation across design, media, and data augmentation. Post‑training of text‑to‑image generative models is a promising path to better match human preferences, factuality, and improved aesthetics. We introduce SOLACE (Adaptive Rewarding by self‑Confidence), a post‑training framework that replaces external reward supervision with an internal self‑confidence signal, obtained by evaluating how accurately the model recovers injected noise under self‑denoising probes. SOLACE converts this intrinsic signal into scalar rewards, enabling fully unsupervised optimization without additional datasets, annotators, or reward models. Empirically, by reinforcing high‑confidence generations, SOLACE delivers consistent gains in compositional generation, text rendering and text‑image alignment over the baseline. We also find that integrating SOLACE with external rewards results in a complementary improvement, with alleviated reward hacking.
Authors:Yuyang Liu, Jingya Wang, Liuzhenghao Lv, Yonghong Tian
Abstract:
Large language models (LLMs) have demonstrated significant reasoning capabilities in scientific discovery but struggle to bridge the gap to physical execution in wet‑labs. In these irreversible environments, probabilistic hallucinations are not merely incorrect, but also cause equipment damage or experimental failure. To address this, we propose BioProAgent, a neuro‑symbolic framework that anchors probabilistic planning in a deterministic Finite State Machine (FSM). We introduce a State‑Augmented Planning mechanism that enforces a rigorous Design‑Verify‑Rectify workflow, ensuring hardware compliance before execution. Furthermore, we address the context bottleneck inherent in complex device schemas by Semantic Symbol Grounding, reducing token consumption by ~6× through symbolic abstraction. In the extended BioProBench benchmark, BioProAgent achieves 95.6% physical compliance (compared to 21.0% for ReAct), demonstrating that neuro‑symbolic constraints are essential for reliable autonomy in irreversible physical environments. \footnoteCode at https://github.com/YuyangSunshine/bioproagent and project at https://yuyangsunshine.github.io/BioPro‑Project/
Authors:Shilong Tao, Zhe Feng, Shaohan Chen, Weichen Zhang, Zhanxing Zhu, Yunhuai Liu
Abstract:
Fluid‑solid interaction (FSI) problems are fundamental in many scientific and engineering applications, yet effectively capturing the highly nonlinear two‑way interactions remains a significant challenge. Most existing deep learning methods are limited to simplified one‑way FSI scenarios, often assuming rigid and static solid to reduce complexity. Even in two‑way setups, prevailing approaches struggle to capture dynamic, heterogeneous interactions due to the lack of cross‑domain awareness. In this paper, we introduce Fisale, a data‑driven framework for handling complex two‑way FSI problems. It is inspired by classical numerical methods, namely the Arbitrary Lagrangian‑Eulerian (ALE) method and the partitioned coupling algorithm. Fisale explicitly models the coupling interface as a distinct component and leverages multiscale latent ALE grids to provide unified, geometry‑aware embeddings across domains. A partitioned coupling module (PCM) further decomposes the problem into structured substeps, enabling progressive modeling of nonlinear interdependencies. Compared to existing models, Fisale introduces a more flexible framework that iteratively handles complex dynamics of solid, fluid and their coupling interface on a unified representation, and enables scalable learning of complex two‑way FSI behaviors. Experimentally, Fisale excels in three reality‑related challenging FSI scenarios, covering 2D, 3D and various tasks. The code is available at \hrefhttps://github.com/therontau0054/Fisale.
Authors:Zihang Wang, Xu Li, Benwu Wang, Wenkai Zhu, Xieyuanli Chen, Dong Kong, Kailin Lyu, Yinan Du, Yiming Peng, Haoyang Che
Abstract:
Explainability and transparent decision‑making are essential for the safe deployment of autonomous driving systems. Scene captioning summarizes environmental conditions and risk factors in natural language, improving transparency, safety, and human‑‑robot interaction. However, most existing approaches target structured urban scenarios; in off‑road environments, they are vulnerable to single‑modality degradations caused by rain, fog, snow, and darkness, and they lack a unified framework that jointly models structured scene captioning and path planning. To bridge this gap, we propose Wild‑Drive, an efficient framework for off‑road scene captioning and path planning. Wild‑Drive adopts modern multimodal encoders and introduces a task‑conditioned modality‑routing bridge, MoRo‑Former, to adaptively aggregate reliable information under degraded sensing. It then integrates an efficient large language model (LLM), together with a planning token and a gate recurrent unit (GRU) decoder, to generate structured captions and predict future trajectories. We also build the OR‑C2P Benchmark, which covers structured off‑road scene captioning and path planning under diverse sensor corruption conditions. Experiments on OR‑C2P dataset and a self‑collected dataset show that Wild‑Drive outperforms prior LLM‑based methods and remains more stable under degraded sensing. The code and benchmark will be publicly available at https://github.com/wangzihanggg/Wild‑Drive.
Authors:Fanqi Kong, Jiayi Zhang, Mingyi Deng, Chenglin Wu, Yuyu Luo, Bang Liu
Abstract:
Real‑world user requests to LLM agents are often underspecified. Agents must interact to acquire missing information and make correct downstream decisions. However, current multi‑turn GRPO‑based methods often rely on trajectory‑level reward computation, which leads to credit assignment problems and insufficient advantage signals within rollout groups. A feasible approach is to identify valuable interaction turns at a fine granularity to drive more targeted learning. To address this, we introduce InfoPO (Information‑Driven Policy Optimization), which frames multi‑turn interaction as a process of active uncertainty reduction and computes an information‑gain reward that credits turns whose feedback measurably changes the agent's subsequent action distribution compared to a masked‑feedback counterfactual. It then combines this signal with task outcomes via an adaptive variance‑gated fusion to identify information importance while maintaining task‑oriented goal direction. Across diverse tasks, including intent clarification, collaborative coding, and tool‑augmented decision making, InfoPO consistently outperforms prompting and multi‑turn RL baselines. It also demonstrates robustness under user simulator shifts and generalizes effectively to environment‑interactive tasks. Overall, InfoPO provides a principled and scalable mechanism for optimizing complex agent‑user collaboration. Code is available at https://github.com/kfq20/InfoPO.
Authors:Xinzhe Li, Yaguang Tao
Abstract:
LiTS is a modular Python framework for LLM reasoning via tree search. It decomposes tree search into three reusable components (Policy, Transition, and RewardModel) that plug into algorithms like MCTS and BFS. A decorator‑based registry enables domain experts to extend to new domains by registering components, and algorithmic researchers to implement custom search algorithms. We demonstrate composability on MATH500 (language reasoning), Crosswords (environment planning), and MapEval (tool use), showing that components and algorithms are orthogonal: components are reusable across algorithms within each task type, and algorithms work across all components and domains. We also report a mode‑collapse finding: in infinite action spaces, LLM policy diversity (not reward quality) is the bottleneck for effective tree search. A demonstration video is available at https://youtu.be/nRGX43YrR3I. The package is released under the Apache 2.0 license at https://github.com/xinzhel/lits‑llm, including installation instructions and runnable examples that enable users to reproduce the demonstrated workflows.
Authors:Zhanwang Liu, Yuting Li, Haoyuan Gao, Yexin Li, Linghe Kong, Lichao Sun, Weiran Huang
Abstract:
Catastrophic forgetting, the tendency of neural networks to forget previously learned knowledge when learning new tasks, has been a major challenge in continual learning (CL). To tackle this challenge, CL methods have been proposed and shown to reduce forgetting. Furthermore, CL models deployed in mission‑critical settings can benefit from uncertainty awareness by calibrating their predictions to reliably assess their confidences. However, existing uncertainty‑aware continual learning methods suffer from high computational overhead and incompatibility with mainstream replay methods. To address this, we propose idempotent experience replay (IDER), a novel approach based on the idempotent property where repeated function applications yield the same output. Specifically, we first adapt the training loss to make model idempotent on current data streams. In addition, we introduce an idempotence distillation loss. We feed the output of the current model back into the old checkpoint and then minimize the distance between this reprocessed output and the original output of the current model. This yields a simple and effective new baseline for building reliable continual learners, which can be seamlessly integrated with other CL approaches. Extensive experiments on different CL benchmarks demonstrate that IDER consistently improves prediction reliability while simultaneously boosting accuracy and reducing forgetting. Our results suggest the potential of idempotence as a promising principle for deploying efficient and trustworthy continual learning systems in real‑world applications.Our code is available at https://github.com/YutingLi0606/Idempotent‑Continual‑Learning.
Authors:Shu-Xun Yang, Cunxiang Wang, Haoke Zhang, Wenbo Yu, Lindong Wu, Jiayi Gui, Dayong Yang, Yukuo Cen, Zhuoer Feng, Bosi Wen, Yidong Wang, Lucen Zhong, Jiamin Ren, Linfeng Zhang, Jie Tang
Abstract:
Agentic systems augment large language models with external tools and iterative decision making, enabling complex tasks such as deep research, function calling, and coding. However, their long and intricate execution traces make failure diagnosis and root cause analysis extremely challenging. Manual inspection does not scale, while directly applying LLMs to raw traces is hindered by input length limits and unreliable reasoning. Focusing solely on final task outcomes further discards critical behavioral information required for accurate issue localization. To address these issues, we propose TraceSIR, a multi‑agent framework for structured analysis and reporting of agentic execution traces. TraceSIR coordinates three specialized agents: (1) StructureAgent, which introduces a novel abstraction format, TraceFormat, to compress execution traces while preserving essential behavioral information; (2) InsightAgent, which performs fine‑grained diagnosis including issue localization, root cause analysis, and optimization suggestions; (3) ReportAgent, which aggregates insights across task instances and generates comprehensive analysis reports. To evaluate TraceSIR, we construct TraceBench, covering three real‑world agentic scenarios, and introduce ReportEval, an evaluation protocol for assessing the quality and usability of analysis reports aligned with industry needs. Experiments show that TraceSIR consistently produces coherent, informative, and actionable reports, significantly outperforming existing approaches across all evaluation dimensions. Our project and video are publicly available at https://github.com/SHU‑XUN/TraceSIR.
Authors:Grigory Sapunov
Abstract:
AI code agents excel at isolated tasks yet struggle with multi‑file software engineering requiring architectural understanding. We introduce Theory of Code Space (ToCS), a benchmark that evaluates whether agents can construct, maintain, and update coherent architectural beliefs during codebase exploration. Agents explore procedurally generated codebases under partial observability ‑‑ opening files under a budget ‑‑ and periodically externalize their belief state as structured JSON, producing a time‑series of architectural understanding. Three findings emerge from experiments with four baselines and six frontier LLMs. First, the Active‑Passive Gap is model‑dependent: one model builds better maps through active exploration than from seeing all files at once, while another shows the opposite ‑‑ revealing that active exploration is itself a non‑trivial capability absent from some models. Second, retaining structured belief maps in context acts as self‑scaffolding for some models but not others, showing that the mechanism is model‑dependent. Third, belief state maintenance varies dramatically: a smaller model maintains perfectly stable beliefs across probes while its larger sibling suffers catastrophic belief collapse ‑‑ forgetting previously‑discovered components between probes. We release ToCS as open‑source software. Code: https://github.com/che‑shr‑cat/tocs
Authors:Yuchen Hou, Lin Zhao
Abstract:
Vision‑Language‑Action (VLA) models achieve over 95% success on standard benchmarks. However, through systematic experiments, we find that current state‑of‑the‑art VLA models largely ignore language instructions. Prior work lacks: (1) systematic semantic perturbation diagnostics, (2) a benchmark that forces language understanding by design, and (3) linguistically diverse training data. This paper constructs the LangGap benchmark, based on a four‑dimensional semantic perturbation method ‑‑ varying instruction semantics while keeping the tabletop layout fixed ‑‑ revealing language understanding deficits in π0.5. Existing benchmarks like LIBERO assign only one task per layout, underutilizing available objects and target locations; LangGap fully diversifies pick‑and‑place tasks under identical layouts, forcing models to truly understand language. Experiments show that targeted data augmentation can partially close the language gap ‑‑ success rate improves from 0% to 90% with single‑task training, and 0% to 28% with multi‑task training. However, as semantic diversity of extended tasks increases, model learning capacity proves severely insufficient; even trained tasks perform poorly. This reveals a fundamental challenge for VLA models in understanding diverse language instructions ‑‑ precisely the long‑term value of LangGap.
Authors:Rongsheng Wang, Minghao Wu, Hongru Zhou, Zhihan Yu, Zhenyang Cai, Junying Chen, Benyou Wang
Abstract:
Recent advances in video generation have opened new avenues for macroscopic simulation of complex dynamic systems, but their application to microscopic phenomena remains largely unexplored. Microscale simulation holds great promise for biomedical applications such as drug discovery, organ‑on‑chip systems, and disease mechanism studies, while also showing potential in education and interactive visualization. In this work, we introduce MicroWorldBench, a multi‑level rubric‑based benchmark for microscale simulation tasks. MicroWorldBench enables systematic, rubric‑based evaluation through 459 unique expert‑annotated criteria spanning multiple microscale simulation task (e.g., organ‑level processes, cellular dynamics, and subcellular molecular interactions) and evaluation dimensions (e.g., scientific fidelity, visual quality, instruction following). MicroWorldBench reveals that current SOTA video generation models fail in microscale simulation, showing violations of physical laws, temporal inconsistency, and misalignment with expert criteria. To address these limitations, we construct MicroSim‑10K, a high‑quality, expert‑verified simulation dataset. Leveraging this dataset, we train MicroVerse, a video generation model tailored for microscale simulation. MicroVerse can accurately reproduce complex microscale mechanism. Our work first introduce the concept of Micro‑World Simulation and present a proof of concept, paving the way for applications in biology, education, and scientific visualization. Our work demonstrates the potential of educational microscale simulations of biological mechanisms. Our data and code are publicly available at https://github.com/FreedomIntelligence/MicroVerse
Authors:Jinhan Xu, Xing Tang, Houpeng Yang, Haoran Zhang, Shenghua Yuan, Jiatao Chen, Tianming Xi, Jing Wang, Jiaojiao Yu, Guangli Xiang
Abstract:
Symbolic music generation is a challenging task in multimedia generation, involving long sequences with hierarchical temporal structures, long‑range dependencies, and fine‑grained local details. Though recent diffusion‑based models produce high quality generations, they tend to suffer from high training and inference costs with long symbolic sequences due to iterative denoising and sequence‑length‑related costs. To deal with such problem, we put forth a diffusing strategy named SMDIM to combine efficient global structure construction and light local refinement. SMDIM uses structured state space models to capture long range musical context at near linear cost, and selectively refines local musical details via a hybrid refinement scheme. Experiments performed on a wide range of symbolic music datasets which encompass various Western classical music, popular music and traditional folk music show that the SMDIM model outperforms the other state‑of‑the‑art approaches on both the generation quality and the computational efficiency, and it has robust generalization to underexplored musical styles. These results show that SMDIM offers a principled solution for long‑sequence symbolic music generation, including associated attributes that accompany the sequences. We provide a project webpage with audio examples and supplementary materials at https://3328702107.github.io/smdim‑music/.
Authors:Yilian Liu, Xiaojun Jia, Guoshun Nan, Jiuyang Lyu, Zhican Chen, Tao Guan, Shuyuan Luo, Zhongyi Zhai, Yang Liu
Abstract:
Multimodal Large Language Models (MLLMs) have achieved remarkable performance but remain vulnerable to jailbreak attacks that can induce harmful content and undermine their secure deployment. Previous studies have shown that introducing additional inference steps, which disrupt security attention, can make MLLMs more susceptible to being misled into generating malicious content. However, these methods rely on single‑image masking or isolated visual cues, which only modestly extend reasoning paths and thus achieve limited effectiveness, particularly against strongly aligned commercial closed‑source models. To address this problem, in this paper, we propose Multi‑Image Dispersion and Semantic Reconstruction (MIDAS), a multimodal jailbreak framework that decomposes harmful semantics into risk‑bearing subunits, disperses them across multiple visual clues, and leverages cross‑image reasoning to gradually reconstruct the malicious intent, thereby bypassing existing safety mechanisms. The proposed MIDAS enforces longer and more structured multi‑image chained reasoning, substantially increases the model's reliance on visual cues while delaying the exposure of malicious semantics and significantly reducing the model's security attention, thereby improving the performance of jailbreak against advanced MLLMs. Extensive experiments across different datasets and MLLMs demonstrate that the proposed MIDAS outperforms state‑of‑the‑art jailbreak attacks for MLLMs and achieves an average attack success rate of 81.46% across 4 closed‑source MLLMs. Our code is available at this [link](https://github.com/Winnie‑Lian/MIDAS).
Authors:Yingqi Fan, Junlong Tong, Anhao Zhao, Xiaoyu Shen
Abstract:
Multimodal large language models (MLLMs) project visual tokens into the embedding space of language models, yet the internal structuring and processing of visual semantics remain poorly understood. In this work, we introduce a two‑fold analytical framework featuring a novel probing tool, EmbedLens, to conduct a fine‑grained analysis. We uncover a pronounced semantic sparsity at the input level: visual tokens consistently partition into sink, dead, and alive categories. Remarkably, only the alive tokens, comprising \approx60% of the total input, carry image‑specific meaning. Furthermore, using a targeted patch‑compression benchmark, we demonstrate that these alive tokens already encode rich, fine‑grained cues (e.g., objects, colors, and OCR) prior to entering the LLM. Internal visual computations (such as visual attention and feed‑forward networks) are redundant for most standard tasks. For the small subset of highly vision‑centric tasks that actually benefit from internal processing, we reveal that alive tokens naturally align with intermediate LLM layers rather than the initial embedding space, indicating that shallow‑layer processing is unnecessary and that direct mid‑layer injection is both sufficient. Ultimately, our findings provide a unified mechanistic view of visual token processing, paving the way for more efficient and interpretable MLLM architectures through selective token pruning, minimized visual computation, and mid‑layer injection. The code is released at: https://github.com/EIT‑NLP/EmbedLens.
Authors:Jingwen Tong, Zijian Li, Fang Liu, Wei Guo, Jun Zhang
Abstract:
The integration of large language models (LLMs) into wireless networks has sparked growing interest in building autonomous AI agents for wireless tasks. However, existing approaches rely heavily on manually crafted prompts and static agentic workflows, a process that is labor‑intensive, unscalable, and often suboptimal. In this paper, we propose WirelessAgent++, a framework that automates the design of agentic workflows for various wireless tasks. By treating each workflow as an executable code composed of modular operators, WirelessAgent++ casts agent design as a program search problem and solves it with a domain‑adapted Monte Carlo Tree Search (MCTS) algorithm. Moreover, we establish WirelessBench, a standardized multi‑dimensional benchmark suite comprising Wireless Communication Homework (WCHW), Network Slicing (WCNS), and Mobile Service Assurance (WCMSA), covering knowledge reasoning, code‑augmented tool use, and multi‑step decision‑making. Experiments demonstrate that \wap autonomously discovers superior workflows, achieving test scores of 78.37% (WCHW), 90.95% (WCNS), and 97.07% (WCMSA), with a total search cost below \ 5 per task. Notably, our approach outperforms state‑of‑the‑art prompting baselines by up to 31% and general‑purpose workflow optimizers by 11.1%, validating its effectiveness in generating robust, self‑evolving wireless agents. The code is available at https://github.com/jwentong/WirelessAgent‑R2.
Authors:Liyao Jiang, Ruichen Chen, Chao Gao, Di Niu
Abstract:
Recent text‑to‑image (T2I) diffusion models achieve remarkable realism, yet faithful prompt‑image alignment remains challenging, particularly for complex prompts with multiple objects, relations, and fine‑grained attributes. Existing training‑free inference‑time scaling methods rely on fixed iteration budgets that cannot adapt to prompt difficulty, while reflection‑tuned models require carefully curated reflection datasets and extensive joint fine‑tuning of diffusion and vision‑language models, often overfitting to reflection paths data and lacking transferability across models. We introduce RAISE (Requirement‑Adaptive Self‑Improving Evolution), a training‑free, requirement‑driven evolutionary framework for adaptive T2I generation. RAISE formulates image generation as a requirement‑driven adaptive scaling process, evolving a population of candidates at inference time through a diverse set of refinement actions‑including prompt rewriting, noise resampling, and instructional editing. Each generation is verified against a structured checklist of requirements, enabling the system to dynamically identify unsatisfied items and allocate further computation only where needed. This achieves adaptive test‑time scaling that aligns computational effort with semantic query complexity. On GenEval and DrawBench, RAISE attains state‑of‑the‑art alignment (0.94 overall GenEval) while incurring fewer generated samples (reduced by 30‑40%) and VLM calls (reduced by 80%) than prior scaling and reflection‑tuned baselines, demonstrating efficient, generalizable, and model‑agnostic multi‑round self‑improvement. Code is available at https://github.com/LiyaoJiang1998/RAISE.
Authors:Xueyang Li, Yunzhong Lou, Yu Song, Xiangdong Zhou
Abstract:
Computer‑Aided Design (CAD) generative modeling has a strong and long‑term application in the industry. Recently, the parametric CAD sequence as the design logic of an object has been widely mined by sequence models. However, the industrial CAD models, especially in component objects, are fine‑grained and complex, requiring a longer parametric CAD sequence to define. To address the problem, we introduce Mamba‑CAD, a self‑supervised generative modeling for complex CAD models in the industry, which can model on a longer parametric CAD sequence. Specifically, we first design an encoder‑decoder framework based on a Mamba architecture and pair it with a CAD reconstruction task for pre‑training to model the latent representation of CAD models; and then we utilize the learned representation to guide a generative adversarial network to produce the fake representation of CAD models, which would be finally recovered into parametric CAD sequences via the decoder of MambaCAD. To train Mamba‑CAD, we further create a new dataset consisting of 77,078 CAD models with longer parametric CAD sequences. Comprehensive experiments are conducted to demonstrate the effectiveness of our model under various evaluation metrics, especially in the generation length of valid parametric CAD sequences. The code and dataset can be achieved from https://github.com/Sunny‑Hack/Code‑for‑Mamba‑CAD‑AAAI‑2025‑.
Authors:Hui Wan, Libin Lan
Abstract:
Executing multiple tasks simultaneously in medical image analysis, including segmentation, classification, detection, and regression, often introduces significant challenges regarding model generalizability and the optimization of shared feature representations. While Vision Foundation Models (VFMs) provide powerful general representations, full fine‑tuning on limited medical data is prone to overfitting and incurs high computational costs. Moreover, existing parameter‑efficient fine‑tuning approaches typically adopt task‑agnostic adaptation protocols, overlooking both task‑specific mechanisms and the varying sensitivity of model layers during fine‑tuning. In this work, we propose Task‑Aware Prompting and Selective Layer Fine‑Tuning (TAP‑SLF), a unified framework for multi‑task ultrasound image analysis. TAP‑SLF incorporates task‑aware soft prompts to encode task‑specific priors into the input token sequence and applies LoRA to selected specific top layers of the encoder. This strategy updates only a small fraction of the VFM parameters while keeping the pre‑trained backbone frozen. By combining task‑aware prompts with selective high‑layer fine‑tuning, TAP‑SLF enables efficient VFM adaptation to diverse medical tasks within a shared backbone. Results on the FMC_UIA 2026 Challenge test set, where TAP‑SLF wins fifth place, combined with evaluations on the officially released training dataset using an 8:2 train‑test split, demonstrate that task‑aware prompting and selective layer tuning are effective strategies for efficient VFM adaptation.
Authors:Hulingxiao He, Zhi Tan, Yuxin Peng
Abstract:
A high‑performing, general‑purpose visual understanding model should map visual inputs to a taxonomic tree of labels, identify novel categories beyond the training set for which few or no publicly available images exist. Large Multimodal Models (LMMs) have achieved remarkable progress in fine‑grained visual recognition (FGVR) for known categories. However, they remain limited in hierarchical visual recognition (HVR) that aims at predicting consistent label paths from coarse to fine categories, especially for novel categories. To tackle these challenges, we propose Taxonomy‑Aware Representation Alignment (TARA), a simple yet effective strategy to inject taxonomic knowledge into LMMs. TARA leverages representations from biology foundation models (BFMs) that encode rich biological relationships through hierarchical contrastive learning. By aligning the intermediate representations of visual features with those of BFMs, LMMs are encouraged to extract discriminative visual cues well structured in the taxonomy tree. Additionally, we align the representations of the first answer token with the ground‑truth label, flexibly bridging the gap between contextualized visual features and categories of varying granularity according to user intent. Experiments demonstrate that TARA consistently enhances LMMs' hierarchical consistency and leaf node accuracy, enabling reliable recognition of both known and novel categories within complex biological taxonomies. Code is available at https://github.com/PKU‑ICST‑MIPL/TARA_CVPR2026.
Authors:Hanqing Yang, Shiyu Chen, Narjes Nourzad, Marie Siew, Jingdi Chen, Carlee Joe-Wong
Abstract:
Real‑world scenarios increasingly require multiple embodied agents to collaborate in dynamic environments under embodied constraints, as many tasks exceed the capabilities of any single agent. Recent advances in large language models (LLMs) enable high‑level cognitive coordination through reasoning, planning, and natural language communication. However, fine‑grained analyses of how such collaboration emerges, unfolds, and contributes to task success in embodied multi‑agent systems are difficult to conduct with existing benchmarks. In this paper, we introduce EmCoop, a benchmark framework for studying cooperation in LLM‑based embodied multi‑agent systems. Our framework separates a high‑level cognitive layer from a low‑level embodied interaction layer, allowing us to characterize agent cooperation through their interleaved dynamics over time. Given a cooperation‑constrained embodied task, we propose generalizable, process‑level metrics that diagnose collaboration quality and failure modes, beyond final task success. We instantiate our framework in two embodied environments that scale to arbitrary numbers of agents and support diverse communication topologies, and use these instantiations to demonstrate how EmCoop enables systematic analysis of cooperation dynamics across team sizes and task settings. The project web page can be found at: https://happyeureka.github.io/emcoop.
Authors:Hanqing Yang, Hyungwoo Lee, Yuhang Yao, Zhiwei Liu, Kay Liu, Jingdi Chen, Carlee Joe-Wong
Abstract:
The increasingly popular agentic AI paradigm promises to harness the power of multiple, general‑purpose large language model (LLM) agents to collaboratively complete complex tasks. While many agentic AI systems utilize predefined workflows or agent roles in order to reduce complexity, ideally these agents would be truly autonomous, able to achieve emergent collaboration even as the number of collaborating agents increases. Yet in practice, such unstructured interactions can lead to redundant work and cascading failures that are difficult to interpret or correct. In this work, we study multi‑agent systems composed of general‑purpose LLM agents that operate without predefined roles, control flow, or communication constraints, relying instead on emergent collaboration to solve problems. We introduce the Dynamic Interaction Graph (DIG), which captures emergent collaboration as a time‑evolving causal network of agent activations and interactions. DIG makes emergent collaboration observable and explainable for the first time, enabling real‑time identification, explanation, and correction of collaboration‑induced error patterns directly from agents' collaboration paths. Thus, DIG fills a critical gap in understanding how general LLM agents solve problems together in truly agentic multi‑agent systems. The project webpage can be found at: https://happyeureka.github.io/dig.
Authors:Jiayang Shi, Lincen Yang, Zhong Li, Tristan Van Leeuwen, Daniel M. Pelt, K. Joost Batenburg
Abstract:
Generative models, particularly Diffusion Models (DM), have shown strong potential for Computed Tomography (CT) reconstruction serving as expressive priors for solving ill‑posed inverse problems. However, diffusion‑based reconstruction relies on Stochastic Differential Equations (SDEs) for forward diffusion and reverse denoising, where such stochasticity can interfere with repeated data consistency corrections in CT reconstruction. Since CT reconstruction is often time‑critical in clinical and interventional scenarios, improving reconstruction efficiency is essential. In contrast, Flow Matching (FM) models sampling as a deterministic Ordinary Differential Equation (ODE), yielding smooth trajectories without stochastic noise injection. This deterministic formulation is naturally compatible with repeated data consistency operations. Furthermore, we observe that FM‑predicted velocity fields exhibit strong correlations across adjacent steps. Motivated by this, we propose an FM‑based CT reconstruction framework (FMCT) and an efficient variant (EFMCT) that reuses previously predicted velocity fields over consecutive steps to substantially reduce the number of Neural network Function Evaluations (NFEs), thereby improving inference efficiency. We provide theoretical analysis showing that the error introduced by velocity reuse is bounded when combined with data consistency operations. Extensive experiments demonstrate that FMCT/EFMCT achieve competitive reconstruction quality while significantly improving computational efficiency compared with diffusion‑based methods. The codebase is open‑sourced at https://github.com/EFMCT/EFMCT.
Authors:Varun Pratap Bhardwaj
Abstract:
The rapid proliferation of agentic AI skill ecosystems ‑‑ exemplified by OpenClaw (228,000 GitHub stars) and Anthropic Agent Skills (75,600 stars) ‑‑ has introduced a critical supply chain attack surface. The ClawHavoc campaign (January‑February 2026) infiltrated over 1,200 malicious skills into the OpenClaw marketplace, while MalTool catalogued 6,487 malicious tools that evade conventional detection. In response, twelve reactive security tools emerged, yet all rely on heuristic methods that provide no formal guarantees. We present SkillFortify, the first formal analysis framework for agent skill supply chains, with six contributions: (1) the DY‑Skill attacker model, a Dolev‑Yao adaptation to the five‑phase skill lifecycle with a maximality proof; (2) a sound static analysis framework grounded in abstract interpretation; (3) capability‑based sandboxing with a confinement proof; (4) an Agent Dependency Graph with SAT‑based resolution and lockfile semantics; (5) a trust score algebra with formal monotonicity; and (6) SkillFortifyBench, a 540‑skill benchmark. SkillFortify achieves 96.95% F1 (95% CI: [95.1%, 98.4%]) with 100% precision and 0% false positive rate on 540 skills, while SAT‑based resolution handles 1,000‑node graphs in under 100 ms.
Authors:Marcus Graves
Abstract:
We introduce Reverse CAPTCHA, an evaluation framework that tests whether large language models follow invisible Unicode‑encoded instructions embedded in otherwise normal‑looking text. Unlike traditional CAPTCHAs that distinguish humans from machines, our benchmark exploits a capability gap: models can perceive Unicode control characters that are invisible to human readers. We evaluate five models from two providers across two encoding schemes (zero‑width binary and Unicode Tags), four hint levels, two payload framings, and with tool use enabled or disabled. Across 8,308 model outputs, we find that tool use dramatically amplifies compliance (Cohen's h up to 1.37, a large effect), that models exhibit provider‑specific encoding preferences (OpenAI models decode zero‑width binary; Anthropic models prefer Unicode Tags), and that explicit decoding instructions increase compliance by up to 95 percentage points within a single model and encoding. All pairwise model differences are statistically significant (p < 0.05, Bonferroni‑corrected). These results highlight an underexplored attack surface for prompt injection via invisible Unicode payloads.
Authors:Wenxin Tang, Jingyu Xiao, Yanpei Gong, Fengyuan Ran, Tongchuan Xia, Junliang Liu, Man Ho Lam, Wenxuan Wang, Michael R. Lyu
Abstract:
Automated academic poster generation aims to distill lengthy research papers into concise, visually coherent presentations. Existing Multimodal Large Language Models (MLLMs) based approaches, however, suffer from three critical limitations: low information density in full‑paper inputs, excessive token consumption, and unreliable layout verification. We present EfficientPosterGen, an end‑to‑end framework that addresses these challenges through semantic‑aware retrieval and token‑efficient multimodal generation. EfficientPosterGen introduces three core innovations: (1) Semantic‑aware Key Information Retrieval (SKIR), which constructs a semantic contribution graph to model inter‑segment relationships and selectively preserves important content; (2) Visual‑based Context Compression (VCC), which renders selected text segments into images to shift textual information into the visual modality, significantly reducing token usage while generating poster‑ready bullet points; and (3) Agentless Layout Violation Detection (ALVD), a deterministic color‑gradient‑based algorithm that reliably detects content overflow and spatial sparsity without auxiliary MLLMs. Extensive experiments demonstrate that EfficientPosterGen achieves substantial improvements in token efficiency and layout reliability while maintaining high poster quality, offering a scalable solution for automated academic poster generation. Our code is available at https://github.com/vinsontang1/EfficientPosterGen‑Code.
Authors:Haoxiang Sun, Tao Wang, Chenwei Tang, Li Yuan, Jiancheng Lv
Abstract:
Following the success of Group Relative Policy Optimization (GRPO) in foundation LLMs, an increasing number of works have sought to adapt GRPO to Visual Large Language Models (VLLMs) for visual perception tasks (e.g., detection and segmentation). However, much of this line of research rests on a long‑standing yet unexamined assumption: training paradigms developed for language reasoning can be transferred seamlessly to visual perception. Our experiments show that this assumption is not valid, revealing intrinsic differences between reasoning‑oriented and perception‑oriented settings. Using reasoning segmentation as a representative case, we surface two overlooked factors: (i) the need for a broader output space, and (ii) the importance of fine‑grained, stable rewards. Building on these observations, we propose Dr.~Seg, a simple, plug‑and‑play GRPO‑based framework consisting of a Look‑to‑Confirm mechanism and a Distribution‑Ranked Reward module, requiring no architectural modifications and integrating seamlessly with existing GRPO‑based VLLMs. Extensive experiments demonstrate that Dr.~Seg improves performance in complex visual scenarios while maintaining strong generalization. Code, models, and datasets are available at https://github.com/eVI‑group‑SCU/Dr‑Seg.
Authors:Zhihao Li, Shengwei Dong, Chuang Yi, Junxuan Gao, Zhilu Lai, Zhiqiang Liu, Wei Wang, Guangtao Zhang
Abstract:
Existing image SR and generic diffusion models transfer poorly to fluid SR: they are sampling‑intensive, ignore physical constraints, and often yield spectral mismatch and spurious divergence. We address fluid super‑resolution (SR) with ReMD (\underlineResidual‑\underlineMultigrid \underlineDiffusion), a physics‑consistent diffusion framework. At each reverse step, ReMD performs a \emphmultigrid residual correction: the update direction is obtained by coupling data consistency with lightweight physics cues and then correcting the residual across scales; the multiscale hierarchy is instantiated with a \emphmulti‑wavelet basis to capture both large structures and fine vortical details. This coarse‑to‑fine design accelerates convergence and preserves fine structures while remaining equation‑free. Across atmospheric and oceanic benchmarks, ReMD improves accuracy and spectral fidelity, reduces divergence, and reaches comparable quality with markedly fewer sampling steps than diffusion baselines. Our results show that enforcing physics consistency \emphinside the diffusion process via multigrid residual correction and multi‑wavelet multiscale modeling is an effective route to efficient fluid SR. Our code are available on https://github.com/lizhihao2022/ReMD.
Authors:Sathwik Karnik, Juyeop Kim, Sanmi Koyejo, Jong-Seok Lee, Somil Bansal
Abstract:
Text‑to‑image diffusion models often memorize training data, revealing a fundamental failure to generalize beyond the training set. Current mitigation strategies typically sacrifice image quality or prompt alignment to reduce memorization. To address this, we propose Reachability‑Aware Diffusion Steering (RADS), an inference‑time framework that prevents memorization while preserving generation fidelity. RADS models the diffusion denoising process as a dynamical system and applies concepts from reachability analysis to approximate the "backward reachable tube"‑‑the set of intermediate states that inevitably evolve into memorized samples. We then formulate mitigation as a constrained reinforcement learning (RL) problem, where a policy learns to steer the trajectory away from memorization via minimal perturbations in the caption embedding space. Empirical evaluations show that RADS achieves a superior Pareto frontier between generation diversity (SSCD), quality (FID), and alignment (CLIP) compared to state‑of‑the‑art baselines. Crucially, RADS provides robust mitigation without modifying the diffusion backbone, offering a plug‑and‑play solution for safe generation. Our website is available at: https://s‑karnik.github.io/rads‑memorization‑project‑page/.
Authors:Moritz Weckbecker, Jonas Müller, Ben Hagag, Michael Mulet
Abstract:
Subliminal prompting is a phenomenon in which language models are biased towards certain concepts or traits through prompting with semantically unrelated tokens. While prior work has examined subliminal prompting in user‑LLM interactions, potential bias transfer in multi‑agent systems and its associated security implications remain unexplored. In this work, we show that a single subliminally prompted agent can spread a weakening but persisting bias throughout its entire network. We measure this phenomenon across 6 agents using two different topologies, observing that the transferred concept maintains an elevated response rate throughout the network. To exemplify potential misalignment risks, we assess network performance on multiple‑choice TruthfulQA, showing that subliminal prompting of a single agent may degrade the truthfulness of other agents. Our findings reveal that subliminal prompting introduces a new attack vector in multi‑agent security, with implications for the alignment of such systems. The implementation of all experiments is publicly available at https://github.com/Multi‑Agent‑Security‑Initiative/thought_virus .
Authors:Atah Nuh Mih, Jianzhou Wang, Truong Thanh Hung Nguyen, Hung Cao
Abstract:
Neural architecture search (NAS) automates the discovery of neural networks that meet specified criteria, yet its evaluation procedures are often hardcoded, limiting the ability to introduce new metrics. This issue is especially pronounced in hardware‑aware NAS, where objectives depend on target devices such as edge hardware. To address this limitation, we propose SEval‑NAS, a metric‑evaluation mechanism that converts architectures to strings, embeds them as vectors, and predicts performance metrics. Using NATS‑Bench and HW‑NAS‑Bench, we evaluated accuracy, latency, and memory. Kendall's τ correlations showed stronger latency and memory predictions than accuracy, indicating the suitability of SEval‑NAS as a hardware cost predictor. We further integrated SEval‑NAS into FreeREA to evaluate metrics not originally included. The method successfully ranked FreeREA‑generated architectures, maintained search time, and required minimal algorithmic changes. Our implementation is available at: https://github.com/Analytics‑Everywhere‑Lab/neural‑architecture‑search
Authors:David Jackson, Michael Gertz, Jürgen Hesser
Abstract:
Adverse Drug Reactions (ADRs) are a leading cause of morbidity and mortality. Existing prediction methods rely mainly on chemical similarity, machine learning on structured databases, or isolated target profiles, but often fail to integrate heterogeneous, partly unstructured evidence effectively. We present a knowledge graph‑based framework that unifies diverse sources, drug‑target data (ChEMBL), clinical trial literature (PubMed), trial metadata (ClinicalTrials.gov), and post‑marketing safety reports (FAERS) into a single evidence‑weighted bipartite network of drugs and medical conditions. Applied to 400 protein kinase inhibitors, the resulting network enables contextual comparison of efficacy (HR, PFS, OS), phenotypic and target similarity, and ADR prediction via target‑to‑adverse‑event correlations. A non‑small cell lung cancer case study correctly highlights established and candidate drugs, target communities (ERbB, ALK, VEGF), and tolerability differences. Designed as an orthogonal, extensible analysis and search tool rather than a replacement for current models, the framework excels at revealing complex patterns, supporting hypothesis generation, and enhancing pharmacovigilance. Code and data are publicly available at https://github.com/davidjackson99/PKI_KG.
Authors:Hongjin Qian, Ziyi Xia, Ze Liu, Jianlyu Chen, Kun Luo, Minghao Qin, Chaofan Li, Lei Xiong, Junwei Lan, Sen Wang, Zhengyang Liang, Yingxia Shao, Defu Lian, Zheng Liu
Abstract:
LLM‑agents are increasingly used to accelerate the progress of scientific research. Yet a persistent bottleneck is data access: agents not only lack readily available tools for retrieval, but also have to work with unstrcutured, human‑centric data on the Internet, such as HTML web‑pages and PDF files, leading to excessive token consumption, limit working efficiency, and brittle evidence look‑up. This gap motivates the development of an agentic data interface, which is designed to enable agents to access and utilize scientific literature in a more effective, efficient, and cost‑aware manner. In this paper, we introduce DeepXiv‑SDK, which offers a three‑layer agentic data interface for scientific literature. 1) Data Layer, which transforms unstructured, human‑centric data into normalized and structured representations in JSON format, improving data usability and enabling progressive accessibility of the data. 2) Service Layer, which presents readily available tools for data access and ad‑hoc retrieval. It also enables a rich form of agent usage, including CLI, MCP, and Python SDK. 3) Application Layer, which creates a built‑in agent, packaging basic tools from the service layer to support complex data access demands. DeepXiv‑SDK currently supports the complete ArXiv corpus, and is synchronized daily to incorporate new releases. It is designed to extend to all common open‑access corpora, such as PubMed Central, bioRxiv, medRxiv, and chemRxiv. We release RESTful APIs, an open‑source Python SDK, and a web demo showcasing deep search and deep research workflows. DeepXiv‑SDK is free to use with registration.
Authors:Chao Huang, Yanhui Li, Yunkang Cao, Wei Wang, Hongxi Huang, Jie Wen, Wenqi Ren, Xiaochun Cao
Abstract:
Although multimodal large language models (MLLMs) have advanced industrial anomaly detection toward a zero‑shot paradigm, they still tend to produce high‑confidence yet unreliable decisions in fine‑grained and structurally complex industrial scenarios, and lack effective self‑corrective mechanisms. To address this issue, we propose M3‑AD, a unified reflection‑aware multimodal framework for industrial anomaly detection. M3‑AD comprises two complementary data resources: M3‑AD‑FT, designed for reflection‑aligned fine‑tuning, and M3‑AD‑Bench, designed for systematic cross‑category evaluation, together providing a foundation for reflection‑aware learning and reliability assessment. Building upon this foundation, we propose RA‑Monitor, which models reflection as a learnable decision revision process and guides models to perform controlled self‑correction when initial judgments are unreliable, thereby improving decision robustness. Extensive experiments conducted on M3‑AD‑Bench demonstrate that RA‑Monitor outperforms multiple open‑source and commercial MLLMs in zero‑shot anomaly detection and anomaly analysis tasks. Code will be released at https://github.com/Yanhui‑Lee/M3‑AD.
Authors:Ian Li, Zilei Shao, Benjie Wang, Rose Yu, Guy Van den Broeck, Anji Liu
Abstract:
Diffusion language models theoretically allow for efficient parallel generation but are practically hindered by the "factorization barrier": the assumption that simultaneously predicted tokens are independent. This limitation forces a trade‑off: models must either sacrifice speed by resolving dependencies sequentially or suffer from incoherence due to factorization. We argue that this barrier arises not from limited backbone expressivity, but from a structural misspecification: models are restricted to fully factorized outputs because explicitly parameterizing a joint distribution would require the Transformer to output a prohibitively large number of parameters. We propose Coupled Discrete Diffusion (CoDD), a hybrid framework that breaks this barrier by replacing the fully‑factorized output distribution with a lightweight, tractable probabilistic inference layer. This formulation yields a distribution family that is significantly more expressive than standard factorized priors, enabling the modeling of complex joint dependencies, yet remains compact enough to avoid the prohibitive parameter explosion associated with full joint modeling. Empirically, CoDD seamlessly enhances diverse diffusion language model architectures with negligible overhead, matching the reasoning performance of computationally intensive Reinforcement Learning baselines at a fraction of the training cost. Furthermore, it prevents performance collapse in few‑step generation, enabling high‑quality outputs at significantly reduced latencies. Code available at: https://github.com/liuanji/CoDD
Authors:Jitian Zhao, Changho Shin, Tzu-Heng Huang, Satya Sai Srinath Namburi GNVV, Frederic Sala
Abstract:
LLM‑as‑a‑judge ensembles are the standard paradigm for scalable evaluation, but their aggregation mechanisms suffer from a fundamental flaw: they implicitly assume that judges provide independent estimates of true quality. However, in practice, LLM judges exhibit correlated errors caused by shared latent confounders ‑‑ such as verbosity, stylistic preferences, or training artifacts ‑‑ causing standard aggregation rules like majority vote or averaging to provide little gain or even amplify systematic mistakes. To address this, we introduce CARE, a confounder‑aware aggregation framework that explicitly models LLM judge scores as arising from both a latent true‑quality signal and shared confounding factors. Rather than heuristically re‑weighting judges, CARE separates quality from confounders without access to ground‑truth labels. We provide theoretical guarantees for identifiability and finite‑sample recovery under shared confounders, and we quantify the systematic bias incurred when aggregation models omit confounding latent factors. Across 12 public benchmarks spanning continuous scoring, binary classification, and pairwise preference settings, CARE improves aggregation accuracy, reducing error by up to 26.8%. Code is released in \hrefhttps://github.com/SprocketLab/CAREhttps://github.com/SprocketLab/CARE.
Authors:Jintao Zhang, Zirui Liu, Mingyue Cheng, Xianquan Wang, Zhiding Liu, Qi Liu
Abstract:
Diffusion models have been used for probabilistic time series forecasting and show strong potential. However, fixed noise schedules often produce intermediate states that are hard to invert and a terminal state that deviates from the near noise assumption. Meanwhile, prior methods rely on time domain conditioning and seldom model schedule induced spectral degradation, which limits structure recovery across noise levels. We propose StaTS, a diffusion model for probabilistic time series forecasting that learns the noise schedule and the denoiser through alternating updates. StaTS includes Spectral Trajectory Scheduler (STS) that learns a data adaptive noise schedule with spectral regularization to improve structural preservation and stepwise invertibility, and Frequency Guided Denoiser (FGD) that estimates schedule induced spectral distortion and uses it to modulate denoising strength for heterogeneous restoration across diffusion steps and variables. A two stage training procedure stabilizes the coupling between schedule learning and denoiser optimization. Experiments on multiple real world benchmarks show consistent gains, while maintaining strong performance with fewer sampling steps. Our code is available at https://github.com/zjt‑gpu/StaTS/.
Authors:Zhengbo Wang, Jian Liang, Ran He, Zilei Wang, Tieniu Tan
Abstract:
Modern optimizers like Adam and Muon are central to training large language models, but their reliance on first‑ and second‑order momenta introduces significant memory overhead, which constrains scalability and computational efficiency. In this work, we reframe the exponential moving average (EMA) used in these momenta as the training of a linear regressor via online gradient flow. Building on this equivalence, we introduce LoRA‑Pre, a novel low‑rank optimizer designed for efficient pre‑training. Specifically, LoRA‑Pre reduces the optimizer's memory footprint by decomposing the full momentum matrix into a compact low‑rank subspace within the online linear learner, thereby maintaining optimization performance while improving memory efficiency. We empirically validate LoRA‑Pre's efficacy by pre‑training models from the Llama architecture family, scaling from 60M to 1B parameters. LoRA‑Pre achieves the highest performance across all model sizes. Notably, LoRA‑Pre demonstrates remarkable rank efficiency, achieving comparable or superior results using only 1/8 the rank of baseline methods. Beyond pre‑training, we evaluate LoRA‑Pre's effectiveness in fine‑tuning scenarios. With the same rank, LoRA‑Pre consistently outperforms all efficient fine‑tuning baselines. Specifically, compared to standard LoRA, LoRA‑Pre achieves substantial improvements of 3.14 points on Llama‑3.1‑8B and 6.17 points on Llama‑2‑7B, validating our approach's effectiveness across both pre‑training and fine‑tuning paradigms. Our code is publicly available at https://github.com/mrflogs/LoRA‑Pre.
Authors:Haritz Puerto, Haonan Li, Xudong Han, Timothy Baldwin, Iryna Gurevych
Abstract:
AI agents powered by reasoning models require access to sensitive user data. However, their reasoning traces are difficult to control, which can result in the unintended leakage of private information to external parties. We propose training models to follow instructions not only in the final answer, but also in reasoning traces, potentially under different constraints. We hypothesize that improving their instruction following abilities in the reasoning traces can improve their privacy‑preservation skills. To demonstrate this, we fine‑tune models on a new instruction‑following dataset with explicit restrictions on reasoning traces. We further introduce a generation strategy that decouples reasoning and answer generation using separate LoRA adapters. We evaluate our approach on six models from two model families, ranging from 1.7B to 14B parameters, across two instruction‑following benchmarks and two privacy benchmarks. Our method yields substantial improvements, achieving gains of up to 20.9 points in instruction‑following performance and up to 51.9 percentage points on privacy benchmarks. These improvements, however, can come at the cost of task utility, due to the trade‑off between reasoning performance and instruction‑following abilities. Overall, our results show that improving instruction‑following behavior in reasoning models can significantly enhance privacy, suggesting a promising direction for the development of future privacy‑aware agents. Our code and data are available at https://github.com/UKPLab/arxiv2026‑controllable‑reasoning‑models
Authors:Omar Mohamed, Edoardo Fazzari, Ayah Al-Naji, Hamdan Alhadhrami, Khalfan Hableel, Saif Alkindi, Cesare Stefanini
Abstract:
Recognizing surgical phases and steps from video is a fundamental problem in computer‑assisted interventions. Recent approaches increasingly rely on large‑scale pre‑training on thousands of labeled surgical videos, followed by zero‑shot transfer to specific procedures. While effective, this strategy incurs substantial computational and data collection costs. In this work, we question whether such heavy pre‑training is truly necessary. We propose Text‑Augmented Action Segmentation Optimal Transport (TASOT), an unsupervised method for surgical phase and step recognition that extends Action Segmentation Optimal Transport (ASOT) by incorporating textual information generated directly from the videos. TASOT formulates temporal action segmentation as a multimodal optimal transport problem, where the matching cost is defined as a weighted combination of visual and text‑based costs. The visual term captures frame‑level appearance similarity, while the text term provides complementary semantic cues, and both are jointly regularized through a temporally consistent unbalanced Gromov‑Wasserstein formulation. This design enables effective alignment between video frames and surgical actions without surgical‑specific pretraining or external web‑scale supervision. We evaluate TASOT on multiple benchmark surgical datasets and observe consistent and substantial improvements over existing zero‑shot methods, including StrasBypass70 (+23.7), BernBypass70 (+4.5), Cholec80 (+16.5), and AutoLaparo (+19.6). These results demonstrate that fine‑grained surgical understanding can be achieved by exploiting information already present in standard visual and textual representations, without resorting to increasingly complex pre‑training pipelines. The code will be available at https://github.com/omar8ahmed9/TASOT.
Authors:Daniel Yang, Samuel Stante, Florian Redhardt, Lena Libon, Parnian Kassraie, Ido Hakimi, Barna Pásztor, Andreas Krause
Abstract:
Reward models are central to aligning large language models (LLMs) with human preferences. Yet most approaches rely on pointwise reward estimates that overlook the epistemic uncertainty in reward models arising from limited human feedback. Recent work suggests that quantifying this uncertainty can reduce the costs of human annotation via uncertainty‑guided active learning and mitigate reward overoptimization in LLM post‑training. However, uncertainty‑aware reward models have so far been adopted without thorough comparison, leaving them poorly understood. This work introduces a unified framework, RewardUQ, to systematically evaluate uncertainty quantification for reward models. We compare common methods along standard metrics measuring accuracy and calibration, and we propose a new ranking strategy incorporating both dimensions for a simplified comparison. Our experimental results suggest that model size and initialization have the most meaningful impact on performance, and most prior work could have benefited from alternative design choices. To foster the development and evaluation of new methods and aid the deployment in downstream applications, we release our open‑source framework as a Python package. Our code is available at https://github.com/lasgroup/rewarduq.
Authors:Xianglong Shi, Ziheng Chen, Yunhan Jiang, Nicu Sebe
Abstract:
Real‑world data frequently exhibit latent hierarchical structures, which can be naturally represented by hyperbolic geometry. Although recent hyperbolic neural networks have demonstrated promising results, many existing architectures remain partially intrinsic, mixing Euclidean operations with hyperbolic ones or relying on extrinsic parameterizations. To address it, we propose the \emphIntrinsic Lorentz Neural Network (ILNN), a fully intrinsic hyperbolic architecture that conducts all computations within the Lorentz model. At its core, the network introduces a novel \emphpoint‑to‑hyperplane fully connected layer (FC), replacing traditional Euclidean affine logits with closed‑form hyperbolic distances from features to learned Lorentz hyperplanes, thereby ensuring that the resulting geometric decision functions respect the inherent curvature. Around this fundamental layer, we design intrinsic modules: GyroLBN, a Lorentz batch normalization that couples gyro‑centering with gyro‑scaling, consistently outperforming both LBN and GyroBN while reducing training time. We additionally proposed a gyro‑additive bias for the FC output, a Lorentz patch‑concatenation operator that aligns the expected log‑radius across feature blocks via a digamma‑based scale, and a Lorentz dropout layer. Extensive experiments conducted on CIFAR‑10/100 and two genomic benchmarks (TEB and GUE) illustrate that ILNN achieves state‑of‑the‑art performance and computational cost among hyperbolic models and consistently surpasses strong Euclidean baselines. The code is available at \hrefhttps://github.com/Longchentong/ILNN\textcolormagentathis url.
Authors:Xiran Xu, Yujie Yan, Xihong Wu, Jing Chen
Abstract:
How natural speech is represented in the brain constitutes a major challenge for cognitive neuroscience, with cortical envelope‑following responses playing a central role in speech decoding. This paper presents our approach to the Speech Detection task in the LibriBrain Competition 2025, utilizing over 50 hours of magnetoencephalography (MEG) signals from a single participant listening to LibriVox audiobooks. We introduce the proposed Sequential Hierarchical Integration Network for EEG and MEG (SHINE) to reconstruct the binary speech‑silence sequences from MEG signals. In the Extended Track, we further incorporated auxiliary reconstructions of speech envelopes and Mel spectrograms to enhance training. Ensemble methods combining SHINE with baselines (BrainMagic, AWavNet, ConvConcatNet) achieved F1‑macro scores of 0.9155 (Standard Track) and 0.9184 (Extended Track) on the leaderboard test set.
Authors:Ning Gao, Xiuhui Zhang, Xingyu Jiang, Mukang You, Mohan Zhang, Yue Deng
Abstract:
Designing efficient reward functions for low‑level control tasks is a challenging problem. Recent research aims to reduce reliance on expert experience by using Large Language Models (LLMs) with task information to generate dense reward functions. These methods typically rely on training results as feedback, iteratively generating new reward functions with greedy or evolutionary algorithms. However, they suffer from poor utilization of historical feedback and inefficient search, resulting in limited improvements in complex control tasks. To address this challenge, we propose RF‑Agent, a framework that treats LLMs as language agents and frames reward function design as a sequential decision‑making process, enhancing optimization through better contextual reasoning. RF‑Agent integrates Monte Carlo Tree Search (MCTS) to manage the reward design and optimization process, leveraging the multi‑stage contextual reasoning ability of LLMs. This approach better utilizes historical information and improves search efficiency to identify promising reward functions. Outstanding experimental results in 17 diverse low‑level control tasks demonstrate the effectiveness of our method. The source code is available at https://github.com/deng‑ai‑lab/RF‑Agent.
Authors:Junkang Liu, Fanhua Shang, Yuxuan Tian, Hongying Liu, Yuanyuan Liu
Abstract:
In federated learning (FL), multi‑step local updates and data heterogeneity usually lead to sharper global minima, which degrades the performance of the global model. Popular FL algorithms integrate sharpness‑aware minimization (SAM) into local training to address this issue. However, in the high data heterogeneity setting, the flatness in local training does not imply the flatness of the global model. Therefore, minimizing the sharpness of the local loss surfaces on the client data does not enable the effectiveness of SAM in FL to improve the generalization ability of the global model. We define the flatness distance to explain this phenomenon. By rethinking the SAM in FL and theoretically analyzing the flatness distance, we propose a novel FedNSAM algorithm that accelerates the SAM algorithm by introducing global Nesterov momentum into the local update to harmonize the consistency of global and local flatness. FedNSAM uses the global Nesterov momentum as the direction of local estimation of client global perturbations and extrapolation. Theoretically, we prove a tighter convergence bound than FedSAM by Nesterov extrapolation. Empirically, we conduct comprehensive experiments on CNN and Transformer models to verify the superior performance and efficiency of FedNSAM. The code is available at https://github.com/junkangLiu0/FedNSAM.
Authors:Tiantong Wang, Xinyu Yan, Tiantong Wu, Yurong Hao, Yong Jiang, Fei Huang, Wei Yang Bryan Lim
Abstract:
Machine unlearning for large language models often faces a privacy dilemma in which strict constraints prohibit sharing either the server's parameters or the client's forget set. To address this dual non‑disclosure constraint, we propose MPU, an algorithm‑agnostic privacy‑preserving Multiple Perturbed Copies Unlearning framework that primarily introduces two server‑side modules: Pre‑Process for randomized copy generation and Post‑Process for update aggregation. In Pre‑Process, the server distributes multiple perturbed and reparameterized model instances, allowing the client to execute unlearning locally on its private forget set without accessing the server's exact original parameters. After local unlearning, the server performs Post‑Process by inverting the reparameterization and aggregating updates with a harmonic denoising procedure to alleviate the impact of perturbation. Experiments with seven unlearning algorithms show that MPU achieves comparable unlearning performance to noise‑free baselines, with most algorithms' average degradation well below 1% under 10% noise, and can even outperform the noise‑free baseline for some algorithms under 1% noise. Code is available at https://github.com/Tristan‑SHU/MPU.
Authors:Wei Luo, Yangfan Ou, Jin Deng, Zeshuai Deng, Xiquan Yan, Zhiquan Wen, Mingkui Tan
Abstract:
Large‑scale Vision‑Language Models (VLMs) exhibit strong zero‑shot recognition, yet their real‑world deployment is challenged by distribution shifts. While Test‑Time Adaptation (TTA) can mitigate this, existing VLM‑based TTA methods operate under a closed‑set assumption, failing in open‑set scenarios where test streams contain both covariate‑shifted in‑distribution (csID) and out‑of‑distribution (csOOD) data. This leads to a critical difficulty: the model must discriminate unknown csOOD samples to avoid interference while simultaneously adapting to known csID classes for accuracy. Current open‑set TTA (OSTTA) methods rely on hard thresholds for separation and entropy minimization for adaptation. These strategies are brittle, often misclassifying ambiguous csOOD samples and inducing overconfident predictions, and their parameter‑update mechanism is computationally prohibitive for VLMs. To address these limitations, we propose Prototype‑based Double‑Check Separation (ProtoDCS), a robust framework for OSTTA that effectively separates csID and csOOD samples, enabling safe and efficient adaptation of VLMs to csID data. Our main contributions are: (1) a novel double‑check separation mechanism employing probabilistic Gaussian Mixture Model (GMM) verification to replace brittle thresholding; and (2) an evidence‑driven adaptation strategy utilizing uncertainty‑aware loss and efficient prototype‑level updates, mitigating overconfidence and reducing computational overhead. Extensive experiments on CIFAR‑10/100‑C and Tiny‑ImageNet‑C demonstrate that ProtoDCS achieves state‑of‑the‑art performance, significantly boosting both known‑class accuracy and OOD detection metrics. Code will be available at https://github.com/O‑YangF/ProtoDCS.
Authors:Haowen Zhu, Ning Yin, Xiaogen Zhou
Abstract:
Vision‑language models (VLMs) show strong potential for complex diagnostic tasks in medical imaging. However, applying VLMs to multi‑organ medical imaging introduces two principal challenges: (1) modality‑specific vision‑language alignment and (2) cross‑modal feature fusion. In this work, we propose MedMAP, a Medical Modality‑Aware Pretraining framework that enhances vision‑language representation learning in 3D MRI. MedMAP comprises a modality‑aware vision‑language alignment stage and a fine‑tuning stage for multi‑organ abnormality detection. During the pre‑training stage, the modality‑aware encoders implicitly capture the joint modality distribution and improve alignment between visual and textual representations. We then fine‑tune the pre‑trained vision encoders (while keeping the text encoder frozen) for downstream tasks. To this end, we curated MedMoM‑MRI3D, comprising 7,392 3D MRI volume‑report pairs spanning twelve MRI modalities and nine abnormalities tailored for various 3D medical analysis tasks. Extensive experiments on MedMoM‑MRI3D demonstrate that MedMAP significantly outperforms existing VLMs in 3D MRI‑based multi‑organ abnormality detection. Our code is available at https://github.com/RomantiDr/MedMAP.
Authors:Lun Zhan, Feng Xiong, Huanyong Liu, Feng Zhang, Yuhui Yin
Abstract:
Synthesizing high‑quality training data is crucial for enhancing domain models' reasoning abilities. Existing methods face limitations in long‑tail knowledge coverage, effectiveness verification, and interpretability. Knowledge‑graph‑based approaches still fall short in functionality, granularity, customizability, and evaluation. To address these issues, we propose MMKG‑RDS, a flexible framework for reasoning data synthesis that leverages multimodal knowledge graphs. It supports fine‑grained knowledge extraction, customizable path sampling, and multidimensional data quality scoring. We validate MMKG‑RDS with the MMKG‑RDS‑Bench dataset, covering five domains, 17 task types, and 14,950 samples. Experimental results show fine‑tuning Qwen3 models (0.6B/8B/32B) on a small number of synthesized samples improves reasoning accuracy by 9.2%. The framework also generates distinct data, challenging existing models on tasks involving tables and formulas, useful for complex benchmark construction. The dataset and code are available at https://github.com/360AILAB‑NLP/MMKG‑RDS
Authors:Kejing Yin, Haizhou Xu, Wenfang Yao, Chen Liu, Zijie Chen, Yui Haang Cheung, William K. Cheung, Jing Qin
Abstract:
Machine learning holds promise for advancing clinical decision support, yet it remains unclear when multimodal learning truly helps in practice, particularly under modality missingness and fairness constraints. In this work, we conduct a systematic benchmark of multimodal fusion between Electronic Health Records (EHR) and chest X‑rays (CXR) on standardized cohorts from MIMIC‑IV and MIMIC‑CXR, aiming to answer four fundamental questions: when multimodal fusion improves clinical prediction, how different fusion strategies compare, how robust existing methods are to missing modalities, and whether multimodal models achieve algorithmic fairness. Our study reveals several key insights. Multimodal fusion improves performance when modalities are complete, with gains concentrating in diseases that require complementary information from both EHR and CXR. While cross‑modal learning mechanisms capture clinically meaningful dependencies beyond simple concatenation, the rich temporal structure of EHR introduces strong modality imbalance that architectural complexity alone cannot overcome. Under realistic missingness, multimodal benefits rapidly degrade unless models are explicitly designed to handle incomplete inputs. Moreover, multimodal fusion does not inherently improve fairness, with subgroup disparities mainly arising from unequal sensitivity across demographic groups. To support reproducible and extensible evaluation, we further release a flexible benchmarking toolkit that enables plug‑and‑play integration of new models and datasets. Together, this work provides actionable guidance on when multimodal learning helps, when it fails, and why, laying the foundation for developing clinically deployable multimodal systems that are both effective and reliable. The open‑source toolkit can be found at https://github.com/jakeykj/CareBench.
Authors:Zebin Yang, Tong Xie, Baotong Lu, Shaoshan Liu, Bo Yu, Meng Li
Abstract:
Memory‑augmented Large Language Models (LLMs) have demonstrated remarkable capability for complex and long‑horizon embodied planning. By keeping track of past experiences and environmental states, memory enables LLMs to maintain a global view, thereby avoiding repetitive exploration. However, existing approaches often store the memory as raw text, leading to excessively long prompts and high prefill latency. While it is possible to store and reuse the KV caches, the efficiency benefits are greatly undermined due to frequent KV cache updates. In this paper, we propose KEEP, a KV‑cache‑centric memory management system for efficient embodied planning. KEEP features 3 key innovations: (1) a Static‑Dynamic Memory Construction algorithm that reduces KV cache recomputation by mixed‑granularity memory group; (2) a Multi‑hop Memory Re‑computation algorithm that dynamically identifies important cross‑attention among different memory groups and reconstructs memory interactions iteratively; (3) a Layer‑balanced Memory Loading that eliminates unbalanced KV cache loading and cross‑attention computation across different layers. Extensive experimental results have demonstrated that KEEP achieves 2.68x speedup with negligible accuracy loss compared with text‑based memory methods on ALFRED dataset. Compared with the KV re‑computation method CacheBlend (EuroSys'25), KEEP shows 4.13% success rate improvement and 1.90x time‑to‑first‑token (TTFT) reduction. Our code is available on https://github.com/PKU‑SEC‑Lab/KEEP_Embodied_Memory.
Authors:Abhishek Dalvi, Vasant Honavar
Abstract:
Large unimodal foundation models for vision and language encode rich semantic structures, yet aligning them typically requires computationally intensive multimodal fine‑tuning. Such approaches depend on large‑scale parameter updates, are resource intensive, and can perturb pretrained representations. Emerging evidence suggests, however, that independently trained foundation models may already exhibit latent semantic compatibility, reflecting shared structures in the data they model. This raises a fundamental question: can cross‑modal alignment be achieved without modifying the models themselves? Here we introduce HDFLIM (HyperDimensional computing with Frozen Language and Image Models), a framework that establishes cross‑modal mappings while keeping pretrained vision and language models fully frozen. HDFLIM projects unimodal embeddings into a shared hyperdimensional space and leverages lightweight symbolic operations ‑‑ binding, bundling, and similarity‑based retrieval to construct associative cross‑modal representations in a single pass over the data. Caption generation emerges from high‑dimensional memory retrieval rather than iterative gradient‑based optimization. We show that HDFLIM achieves performance comparable to end‑to‑end vision‑language training methods and produces captions that are more semantically grounded than zero‑shot baselines. By decoupling alignment from parameter tuning, our results suggest that semantic mapping across foundation models can be realized through symbolic operations on hyperdimensional encodings of the respective embeddings. More broadly, this work points toward an alternative paradigm for foundation model alignment in which frozen models are integrated through structured representational mappings rather than through large‑scale retraining. The codebase for our implementation can be found at https://github.com/Abhishek‑Dalvi410/HDFLIM.
Authors:Xiang Ao
Abstract:
Multivariate time series forecasting is widely applied in fields such as transportation, energy, and finance. However, the data commonly suffers from issues of multi‑scale characteristics, weak correlations, and noise interference, which limit the predictive performance of existing models. This paper proposes a dual‑stream sparse Mixer prediction framework that extracts global trends and local dynamic features from sequences in both the frequency and time domains, respectively. It employs a sparsity mechanism to filter out invalid information, thereby enhancing the accuracy of cross‑variable dependency modeling. Experimental results demonstrate that this method achieves leading performance on multiple real‑world scenario datasets, validating its effectiveness and generality. The code is available at https://github.com/SDMixer/SDMixer
Authors:Jeongbin Hong, Dooseop Choi, Taeg-Hyun An, Kyounghwan An, Kyoung-Wook Min
Abstract:
Transforming image features from perspective view (PV) space to bird's‑eye‑view (BEV) space remains challenging in autonomous driving due to depth ambiguity and occlusion. Although several view transformation (VT) paradigms have been proposed, the challenge still remains. In this paper, we propose a new regularization framework, dubbed CycleBEV, that enhances existing VT models for BEV semantic segmentation. Inspired by cycle consistency, widely used in image distribution modeling, we devise an inverse view transformation (IVT) network that maps BEV segmentation maps back to PV segmentation maps and use it to regularize VT networks during training through cycle consistency losses, enabling them to capture richer semantic and geometric information from input PV images. To further exploit the capacity of the IVT network, we introduce two novel ideas that extend cycle consistency into geometric and representation spaces. We evaluate CycleBEV on four representative VT models covering three major paradigms using the large‑scale nuScenes dataset. Experimental results show consistent improvements ‑‑ with gains of up to 0.74, 4.86, and 3.74 mIoU for drivable area, vehicle, and pedestrian classes, respectively ‑‑ without increasing inference complexity, since the IVT network is used only during training. The implementation code is available at https://github.com/JeongbinHong/CycleBEV.
Authors:Aishwarya Sarkar, Sayan Ghosh, Nathan Tallent, Aman Chadha, Tanya Roosta, Ali Jannesari
Abstract:
Large‑scale Graph Neural Networks (GNNs) are typically trained by sampling a vertex's neighbors to a fixed distance. Because large input graphs are distributed, training requires frequent irregular communication that stalls forward progress. Moreover, fetched data changes with graph, graph distribution, sample and batch parameters, and caching polices. Consequently, any static prefetching method will miss crucial opportunities to adapt to different dynamic conditions. In this paper, we introduce Rudder, a software module embedded in the state‑of‑the‑art AWS DistDGL framework, to autonomously prefetch remote nodes and minimize communication. Rudder's adaptation contrasts with both standard heuristics and traditional ML classifiers. We observe that the generative AI found in contemporary Large Language Models (LLMs) exhibits emergent properties like In‑Context Learning (ICL) for zero‑shot tasks, with logical multi‑step reasoning. We find this behavior well‑suited for adaptive control even with substantial undertraining. Evaluations using standard datasets and unseen configurations on the NERSC Perlmutter supercomputer show up to 91% improvement in end‑to‑end training performance over baseline DistDGL (no prefetching), and an 82% improvement over static prefetching, reducing communication by over 50%. Our code is available at https://github.com/aishwaryyasarkar/rudder‑llm‑agent.
Authors:Jose Javier Gonzalez Ortiz, Abhay Gupta, Chris Renard, Davis Blalock
Abstract:
Standard mixed‑precision training of neural networks requires many bytes of accelerator memory for each model parameter. These bytes reflect not just the parameter itself, but also its gradient and one or more optimizer state variables. With each of these values typically requiring 4 bytes, training even a 7 billion parameter model can be impractical for researchers with less than 100GB of accelerator memory. We introduce FlashOptim, a suite of optimizations that reduces per‑parameter memory by over 50% while preserving model quality and API compatibility. Our approach introduces two key techniques. First, we improve master weight splitting by finding and exploiting a tight bound on its quantization error. Second, we design companding functions that greatly reduce the error in 8‑bit optimizer state quantization. Together with 16‑bit gradients, these techniques reduce AdamW memory from 16 bytes to 7 bytes per parameter, or 5 bytes with gradient release. They also cut model checkpoint sizes by more than half. Experiments with FlashOptim applied to SGD, AdamW, and Lion show no measurable quality degradation on any task from a collection of standard vision and language benchmarks, including Llama‑3.1‑8B finetuning.
Authors:Sungho Park, Jueun Kim, Wook-Shin Han
Abstract:
Real‑world Table‑Text question answering (QA) tasks require models that can reason across long text and source tables, traversing multiple hops and executing complex operations such as aggregation. Yet existing benchmarks are small, manually curated ‑ and therefore error‑prone ‑ and contain shallow questions that seldom demand more than two hops or invoke aggregations, grouping, or other advanced analytical operations expressible in natural‑language queries. We present SPARTA, an end‑to‑end construction framework that automatically generates large‑scale Table‑Text QA benchmarks with lightweight human validation, requiring only one quarter of the annotation time of HybridQA. The framework first constructs a reference fact database by enriching each source table with grounding tables whose tuples are atomic facts automatically extracted from the accompanying unstructured passages, then synthesizes nested queries whose number of nested predicates matches the desired hop count. To ensure that every SQL statement is executable and that its verbalization yields a fluent, human‑sounding question, we propose two novel techniques: provenance‑based refinement, which rewrites any syntactically valid query that returns a non‑empty result, and realistic‑structure enforcement, which confines generation to post‑order traversals of the query graph. The resulting pipeline produces thousands of high‑fidelity question‑answer pairs covering aggregations, grouping, and deep multi‑hop reasoning across text and tables. On SPARTA, state‑of‑the‑art models that reach over 70 F1 on HybridQA or over 50 F1 on OTT‑QA drop by more than 30 F1 points, exposing fundamental weaknesses in current cross‑modal reasoning. Our benchmark, construction code, and baseline models are available at https://github.com/pshlego/SPARTA/tree/main.
Authors:Yutong Wang, Siyuan Xiong, Xuebo Liu, Wenkang Zhou, Liang Ding, Miao Zhang, Min Zhang
Abstract:
While Multi‑Agent Systems (MAS) excel in complex reasoning, they suffer from the cascading impact of erroneous information generated by individual participants. Current solutions often resort to rigid structural engineering or expensive fine‑tuning, limiting their deployability and adaptability. We propose AgentDropoutV2, a test‑time rectify‑or‑reject pruning framework designed to dynamically optimize MAS information flow without retraining. Our approach acts as an active firewall, intercepting agent outputs and employing a retrieval‑augmented rectifier to iteratively correct errors based on a failure‑driven indicator pool. This mechanism allows for the precise identification of potential errors using distilled failure patterns as prior knowledge. Irreparable outputs are subsequently pruned to prevent error propagation, while a fallback strategy preserves system integrity. Empirical results on extensive math benchmarks show that AgentDropoutV2 significantly boosts the MAS's task performance, achieving an average accuracy gain of 6.3 percentage points on math benchmarks. Furthermore, the system exhibits robust generalization and adaptivity, dynamically modulating rectification efforts based on task difficulty while leveraging context‑aware indicators to resolve a wide spectrum of error patterns. Our code and dataset are released at https://github.com/TonySY2/AgentDropoutV2.
Authors:Pengxiang Li, Dilxat Muhtar, Tianlong Chen, Lu Yin, Shiwei Liu
Abstract:
Diffusion Language Models (DLMs) are often advertised as enabling parallel token generation, yet practical fast DLMs frequently converge to left‑to‑right, autoregressive (AR)‑like decoding dynamics. In contrast, genuinely non‑AR generation is promising because it removes AR's sequential bottleneck, better exploiting parallel hardware to reduce synchronization/communication overhead and improve latency scaling with output length. We argue that a primary driver of AR‑like decoding is a mismatch between DLM objectives and the highly sequential structure of widely used training data, including standard pretraining corpora and long chain‑of‑thought (CoT) supervision. Motivated by this diagnosis, we propose NAP (Non‑Autoregressive Parallel DLMs), a proof‑of‑concept, data‑centric approach that better aligns supervision with non‑AR parallel decoding. NAP curates examples as multiple independent reasoning trajectories and couples them with a parallel‑forced decoding strategy that encourages multi‑token parallel updates. Across math reasoning benchmarks, NAP yields stronger performance under parallel decoding than DLMs trained on standard long CoT data, with gains growing as parallelism increases. Our results suggest that revisiting data and supervision is a principled direction for mitigating AR‑like behavior and moving toward genuinely non‑autoregressive parallel generation in DLMs. Our code is available at https://github.com/pixeli99/NAP.
Authors:Guofeng Mei, Wei Lin, Luigi Riz, Yujiao Wu, Yiming Wang, Fabio Poiesi
Abstract:
Large Multimodal Models (LMMs) that process 3D data typically rely on heavy, pre‑trained visual encoders to extract geometric features. While recent 2D LMMs have begun to eliminate such encoders for efficiency and scalability, extending this paradigm to 3D remains challenging due to the unordered and large‑scale nature of point clouds. This leaves a critical unanswered question: How can we design an LMM that tokenizes unordered 3D data effectively and efficiently without a cumbersome encoder? We propose Fase3D, the first efficient encoder‑free Fourier‑based 3D scene LMM. Fase3D tackles the challenges of scalability and permutation invariance with a novel tokenizer that combines point cloud serialization and the Fast Fourier Transform (FFT) to approximate self‑attention. This design enables an effective and computationally minimal architecture, built upon three key innovations: First, we represent large scenes compactly via structured superpoints. Second, our space‑filling curve serialization followed by an FFT enables efficient global context modeling and graph‑based token merging. Lastly, our Fourier‑augmented LoRA adapters inject global frequency‑aware interactions into the LLMs at a negligible cost. Fase3D achieves performance comparable to encoder‑based 3D LMMs while being significantly more efficient in computation and parameters. Project website: https://tev‑fbk.github.io/Fase3D.
Authors:Jayadev Billa
Abstract:
Numerous studies have shown that multimodal LLMs process speech and images well but fail in non‑intuitive ways rendering trivial tasks such as object counting unreliable. We investigate this behavior from an information‑theoretic perspective by framing multimodal LLM inference as a mismatched decoder problem: a decoder trained primarily on text can only extract information along text‑aligned directions (removing up to 98% of the variation in modality‑specific (non‑text) directions improves decoder loss) and the amount of accessible information is bounded by the Generalized Mutual Information (GMI). We show that information loss is bounded as the distributional mismatch between the source data and the text data increases, and as the sensitivity of the decoder increases. This bound is a function of the model's scoring rule not its architecture. We validate the predictions across five models spanning speech and vision. A controlled study (two Prismatic VLMs differing only in encoder text‑alignment) shows that the bottleneck lies in the scoring rule of the decoder rather than the text‑alignment of the encoder or the learned projection. A LoRA intervention demonstrates that simply training with an emotion‑related objective improves emotion detection from 17.3% to 61.8% task accuracy without affecting other attributes, confirming that the training objective determines what becomes accessible.
Authors:Xiaosen Wang, Zhijin Ge, Bohan Liu, Zheng Fang, Fengfan Zhou, Ruixuan Zhang, Shaokang Wang, Yuyang Luo
Abstract:
Adversarial transferability refers to the capacity of adversarial examples generated on the surrogate model to deceive alternate, unexposed victim models. This property eliminates the need for direct access to the victim model during an attack, thereby raising considerable security concerns in practical applications and attracting substantial research attention recently. In this work, we discern a lack of a standardized framework and criteria for evaluating transfer‑based attacks, leading to potentially biased assessments of existing approaches. To rectify this gap, we have conducted an exhaustive review of hundreds of related works, organizing various transfer‑based attacks into six distinct categories. Subsequently, we propose a comprehensive framework designed to serve as a benchmark for evaluating these attacks. In addition, we delineate common strategies that enhance adversarial transferability and highlight prevalent issues that could lead to unfair comparisons. Finally, we provide a brief review of transfer‑based attacks beyond image classification.
Authors:Bangrui Xu, Qihang Yao, Zirui Tang, Xuanhe Zhou, Yeye He, Shihan Yu, Qianqian Xu, Bin Wang, Guoliang Li, Conghui He, Fan Wu
Abstract:
Semi‑structured documents integrate diverse interleaved data elements (e.g., tables, charts, hierarchical paragraphs) arranged in various and often irregular layouts. These documents are widely observed across domains and account for a large portion of real‑world data. However, existing methods struggle to support natural language question answering over these documents due to three main technical challenges: (1) The elements extracted by techniques like OCR are often fragmented and stripped of their original semantic context, making them inadequate for analysis. (2) Existing approaches lack effective representations to capture hierarchical structures within documents (e.g., associating tables with nested chapter titles) and to preserve layout‑specific distinctions (e.g., differentiating sidebars from main content). (3) Answering questions often requires retrieving and aligning relevant information scattered across multiple regions or pages, such as linking a descriptive paragraph to table cells located elsewhere in the document. To address these issues, we propose MoDora, an LLM‑powered system for semi‑structured document analysis. First, we adopt a local‑alignment aggregation strategy to convert OCR‑parsed elements into layout‑aware components, and conduct type‑specific information extraction for components with hierarchical titles or non‑text elements. Second, we design the Component‑Correlation Tree (CCTree) to hierarchically organize components, explicitly modeling inter‑component relations and layout distinctions through a bottom‑up cascade summarization process. Finally, we propose a question‑type‑aware retrieval strategy that supports (1) layout‑based grid partitioning for location‑based retrieval and (2) LLM‑guided pruning for semantic‑based retrieval. Experiments show MoDora outperforms baselines by 5.97%‑61.07% in accuracy. The code is at https://github.com/weAIDB/MoDora.
Authors:Alaa Anani, Tobias Lorenz, Bernt Schiele, Mario Fritz, Jonas Fischer
Abstract:
Understanding how neural networks arrive at their predictions is essential for debugging, auditing, and deployment. Mechanistic interpretability pursues this goal by identifying circuits‑‑minimal subnetworks responsible for specific behaviors. However, existing circuit discovery methods are brittle: circuits depend strongly on the chosen concept dataset and often fail to transfer out‑of‑distribution, raising doubts whether they capture the concept or merely dataset‑specific artifacts. We introduce Certified Circuits, which provide provable stability guarantees for circuit discovery. Our framework wraps any black‑box discovery algorithm with randomized data subsampling to certify that inclusion decisions over circuit components‑‑neurons or edges of the model graph, depending on the base algorithm‑‑are invariant to bounded edit‑distance perturbations of the concept dataset. Unstable components are abstained from, yielding circuits that are more compact and more accurate. We validate across three architectures (ResNet, ViT, GPT‑2) on vision (ImageNet and four OOD datasets) and language (IOI, IOI‑Hard, Greater‑Than) tasks. Certified circuits achieve up to 56% higher accuracy and up to 80% fewer components, and remain reliable where baselines degrade. Certified Circuits puts circuit discovery on formal ground by producing mechanistic explanations that are provably stable and better aligned with the target concept. Code: https://github.com/AlaaAnani/certified‑circuits.
Authors:Feng Guo, Jiaxiang Liu, Yang Li, Qianqian Shi, Mingkun Xu
Abstract:
Accurate brain tumor diagnosis requires models to not only detect lesions but also generate clinically interpretable reasoning grounded in imaging manifestations, yet existing public datasets remain limited in annotation richness and diagnostic semantics. To bridge this gap, we introduce MM‑NeuroOnco, a large‑scale multimodal benchmark and instruction‑tuning dataset for brain tumor MRI understanding, consisting of 24,726 MRI slices from 20 data sources paired with approximately 200,000 semantically enriched multimodal instructions spanning diverse tumor subtypes and imaging modalities. To mitigate the scarcity and high cost of diagnostic semantic annotations, we develop a multi‑model collaborative pipeline for automated medical information completion and quality control, enabling the generation of diagnosis‑related semantics beyond mask‑only annotations. Building upon this dataset, we further construct MM‑NeuroOnco‑Bench, a manually annotated evaluation benchmark with a rejection‑aware setting to reduce biases inherent in closed‑ended question formats. Evaluation across ten representative models shows that even the strongest baseline, Gemini 3 Flash, achieves only 41.88% accuracy on diagnosis‑related questions, highlighting the substantial challenges of multimodal brain tumor diagnostic understanding. Leveraging MM‑NeuroOnco, we further propose NeuroOnco‑GPT, which achieves a 27% absolute accuracy improvement on diagnostic questions following fine‑tuning. This result demonstrates the effectiveness of our dataset and benchmark in advancing clinically grounded multimodal diagnostic reasoning. Code and dataset are publicly available at: https://github.com/gfnnnb/MM‑NeuroOnco
Authors:Roy Miles, Aysim Toker, Andreea-Maria Oncescu, Songcen Xu, Jiankang Deng, Ismail Elezi
Abstract:
Reasoning with large language models often benefits from generating multiple chains‑of‑thought, but existing aggregation strategies are typically trajectory‑level (e.g., selecting the best trace or voting on the final answer), discarding useful intermediate work from partial or "nearly correct" attempts. We propose Stitching Noisy Diffusion Thoughts, a self‑consistency framework that turns cheap diffusion‑sampled reasoning into a reusable pool of step‑level candidates. Given a problem, we (i) sample many diverse, low‑cost reasoning trajectories using a masked diffusion language model, (ii) score every intermediate step with an off‑the‑shelf process reward model (PRM), and (iii) stitch these highest‑quality steps across trajectories into a composite rationale. This rationale then conditions an autoregressive (AR) model (solver) to recompute only the final answer. This modular pipeline separates exploration (diffusion) from evaluation and solution synthesis, avoiding monolithic unified hybrids while preserving broad search. Across math reasoning benchmarks, we find that step‑level recombination is most beneficial on harder problems, and ablations highlight the importance of the final AR solver in converting stitched but imperfect rationales into accurate answers. Using low‑confidence diffusion sampling with parallel, independent rollouts, our training‑free framework improves average accuracy by up to 23.8% across six math and coding tasks. At the same time, it achieves up to a 1.8x latency reduction relative to both traditional diffusion models (e.g., Dream, LLaDA) and unified architectures (e.g., TiDAR). Code is available at https://github.com/roymiles/diffusion‑stitching.
Authors:Hao Zheng, Guozhao Mo, Xinru Yan, Qianhao Yuan, Wenkai Zhang, Xuanang Chen, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun
Abstract:
Presentation generation requires deep content research, coherent visual design, and iterative refinement based on observation. However, existing presentation agents often rely on predefined workflows and fixed templates. To address this, we present DeepPresenter, an agentic framework that adapts to diverse user intents, enables effective feedback‑driven refinement, and generalizes beyond a scripted pipeline. Specifically, DeepPresenter autonomously plans, renders, and revises intermediate slide artifacts to support long‑horizon refinement with environmental observations. Furthermore, rather than relying on self‑reflection over internal signals (e.g., reasoning traces), our environment‑grounded reflection conditions the generation process on perceptual artifact states (e.g., rendered slides), enabling the system to identify and correct presentation‑specific issues during execution. Results on the evaluation set covering diverse presentation‑generation scenarios show that DeepPresenter achieves state‑of‑the‑art performance, and the fine‑tuned 9B model remains highly competitive at substantially lower cost. Our project is available at: https://github.com/icip‑cas/PPTAgent
Authors:Shuo He, Lang Feng, Qi Wei, Xin Cheng, Lei Feng, Bo An
Abstract:
Group‑based reinforcement learning (RL), such as GRPO, has advanced the capabilities of large language models on long‑horizon agentic tasks. To enable more fine‑grained policy updates, recent research has increasingly shifted toward stepwise group‑based policy optimization, which treats each step in a rollout trajectory independently while using a memory module to retain historical context. However, we find a key issue in estimating stepwise relative advantages, namely context inconsistency, where steps within the same group may differ in their historical contexts. Empirically, we reveal that this issue can lead to severely biased advantage estimation, thereby degrading policy optimization significantly. To address the issue, in this paper, we propose Hierarchy‑of‑Groups Policy Optimization (HGPO) for long‑horizon agentic tasks. Specifically, within a group of rollout trajectories, HGPO assigns each step to multiple hierarchical groups according to the consistency of historical contexts. Then, for each step, HGPO computes distinct advantages within each group and aggregates them with an adaptive weighting scheme. In this way, HGPO can achieve a favorable bias‑variance trade‑off in stepwise advantage estimation, without extra models or rollouts. Evaluations on two challenging agentic tasks, ALFWorld and WebShop with Qwen2.5‑1.5B‑Instruct and Qwen2.5‑7B‑Instruct, show that HGPO significantly outperforms existing agentic RL methods under the same computational constraints. Code is available at https://github.com/langfengQ/verl‑agent/tree/master/recipe/hgpo.
Authors:Yuanjun Li, Bin Zhang, Hao Chen, Zhouyang Jiang, Dapeng Li, Zhiwei Xu
Abstract:
Value decomposition (VD) methods have achieved remarkable success in cooperative multi‑agent reinforcement learning (MARL). However, their reliance on the max operator for temporal‑difference (TD) target calculation leads to systematic Q‑value overestimation. This issue is particularly severe in MARL due to the combinatorial explosion of the joint action space, which often results in unstable learning and suboptimal policies. To address this problem, we propose QSIM, a similarity weighted Q‑learning framework that reconstructs the TD target using action similarity. Instead of using the greedy joint action directly, QSIM forms a similarity weighted expectation over a structured near‑greedy joint action space. This formulation allows the target to integrate Q‑values from diverse yet behaviorally related actions while assigning greater influence to those that are more similar to the greedy choice. By smoothing the target with structurally relevant alternatives, QSIM effectively mitigates overestimation and improves learning stability. Extensive experiments demonstrate that QSIM can be seamlessly integrated with various VD methods, consistently yielding superior performance and stability compared to the original algorithms. Furthermore, empirical analysis confirms that QSIM significantly mitigates the systematic value overestimation in MARL. Code is available at https://github.com/MaoMaoLYJ/pymarl‑qsim.
Authors:Jiaqing Zhang, Mingjia Yin, Hao Wang, Yuxin Tian, Yuyang Ye, Yawen Li, Wei Guo, Yong Liu, Enhong Chen
Abstract:
Recommendation model performance is intrinsically tied to the quality, volume, and relevance of their training data. To address common challenges like data sparsity and cold start, recent researchs have leveraged data from multiple auxiliary domains to enrich information within the target domain. However, inherent domain gaps can degrade the quality of mixed‑domain data, leading to negative transfer and diminished model performance. Existing prevailing \emphmodel‑centric paradigm ‑‑ which relies on complex, customized architectures ‑‑ struggles to capture the subtle, non‑structural sequence dependencies across domains, leading to poor generalization and high demands on computational resources. To address these shortcomings, we propose \textscTaesar, a \emphdata‑centric framework for target‑aligned sequential regeneration, which employs a contrastive decoding mechanism to adaptively encode cross‑domain context into target‑domain sequences. It employs contrastive decoding to encode cross‑domain context into target sequences, enabling standard models to learn intricate dependencies without complex fusion architectures. Experiments show \textscTaesar outperforms model‑centric solutions and generalizes to various sequential models. By generating enriched datasets, \textscTaesar effectively combines the strengths of data‑ and model‑centric paradigms. The code accompanying this paper is available at~ \textcolorbluehttps://github.com/USTC‑StarTeam/Taesar.
Authors:Yanpei Guo, Wenjie Qu, Linyu Wu, Shengfang Zhai, Lionel Z. Wang, Ming Xu, Yue Liu, Binhang Yuan, Dawn Song, Jiaheng Zhang
Abstract:
Commercial large language models are typically deployed as black‑box API services, requiring users to trust providers to execute inference correctly and report token usage honestly. We present IMMACULATE, a practical auditing framework that detects economically motivated deviations‑such as model substitution, quantization abuse, and token overbilling‑without trusted hardware or access to model internals. IMMACULATE selectively audits a small fraction of requests using verifiable computation, achieving strong detection guarantees while amortizing cryptographic overhead. Experiments on dense and MoE models show that IMMACULATE reliably distinguishes benign and malicious executions with under 1% throughput overhead. Our code is published at https://github.com/guo‑yanpei/Immaculate.
Authors:Zhanhui Zhou, Lingjie Chen, Hanghang Tong, Dawn Song
Abstract:
Although diffusion language models (DLMs) are evolving quickly, many recent models converge on a set of shared components. These components, however, are distributed across ad‑hoc research codebases or lack transparent implementations, making them difficult to reproduce or extend. As the field accelerates, there is a clear need for a unified framework that standardizes these common components while remaining flexible enough to support new methods and architectures. To address this gap, we introduce dLLM, an open‑source framework that unifies the core components of diffusion language modeling ‑‑ training, inference, and evaluation ‑‑ and makes them easy to customize for new designs. With dLLM, users can reproduce, finetune, deploy, and evaluate open‑source large DLMs such as LLaDA and Dream through a standardized pipeline. The framework also provides minimal, reproducible recipes for building small DLMs from scratch with accessible compute, including converting any BERT‑style encoder or autoregressive LM into a DLM. We also release the checkpoints of these small DLMs to make DLMs more accessible and accelerate future research.
Authors:Zhiheng Song, Jingshuai Zhang, Chuan Qin, Chao Wang, Chao Chen, Longfei Xu, Kaikui Liu, Xiangxiang Chu, Hengshu Zhu
Abstract:
Route‑planning agents powered by large language models (LLMs) have emerged as a promising paradigm for supporting everyday human mobility through natural language interaction and tool‑mediated decision making. However, systematic evaluation in real‑world mobility settings is hindered by diverse routing demands, non‑deterministic mapping services, and limited reproducibility. In this study, we introduce MobilityBench, a scalable benchmark for evaluating LLM‑based route‑planning agents in real‑world mobility scenarios. MobilityBench is constructed from large‑scale, anonymized real user queries collected from Amap and covers a broad spectrum of route‑planning intents across multiple cities worldwide. To enable reproducible, end‑to‑end evaluation, we design a deterministic API‑replay sandbox that eliminates environmental variance from live services. We further propose a multi‑dimensional evaluation protocol centered on outcome validity, complemented by assessments of instruction understanding, planning, tool use, and efficiency. Using MobilityBench, we evaluate multiple LLM‑based route‑planning agents across diverse real‑world mobility scenarios and provide an in‑depth analysis of their behaviors and performance. Our findings reveal that current models perform competently on Basic information retrieval and Route Planning tasks, yet struggle considerably with Preference‑Constrained Route Planning, underscoring significant room for improvement in personalized mobility applications. We publicly release the benchmark data, evaluation toolkit, and documentation at https://github.com/AMAP‑ML/MobilityBench .
Authors:Boyang Dai, Zeng Fan, Zihao Qi, Meng Lou, Yizhou Yu
Abstract:
Source‑Free Domain Adaptive Object Detection (SF‑DAOD) aims to adapt a detector trained on a labeled source domain to an unlabeled target domain without retaining any source data. Despite recent progress, most popular approaches focus on tuning pseudo‑label thresholds or refining the teacher‑student framework, while overlooking object‑level structural cues within cross‑domain data. In this work, we present CGSA, the first framework that brings Object‑Centric Learning (OCL) into SF‑DAOD by integrating slot‑aware adaptation into the DETR‑based detector. Specifically, our approach integrates a Hierarchical Slot Awareness (HSA) module into the detector to progressively disentangle images into slot representations that act as visual priors. These slots are then guided toward class semantics via a Class‑Guided Slot Contrast (CGSC) module, maintaining semantic consistency and prompting domain‑invariant adaptation. Extensive experiments on multiple cross‑domain datasets demonstrate that our approach outperforms previous SF‑DAOD methods, with theoretical derivations and experimental analysis further demonstrating the effectiveness of the proposed components and the framework, thereby indicating the promise of object‑centric design in privacy‑sensitive adaptation scenarios. Code is released at https://github.com/Michael‑McQueen/CGSA.
Authors:Weida Liang, Yiyou Sun, Shuyuan Nan, Chuang Li, Dawn Song, Kenji Kawaguchi
Abstract:
Example‑based guidance is widely used to improve mathematical reasoning at inference time, yet its effectiveness is highly unstable across problems and models‑even when the guidance is correct and problem‑relevant. We show that this instability arises from a previously underexplored gap between strategy usage‑whether a reasoning strategy appears in successful solutions‑and strategy executability‑whether the strategy remains effective when instantiated as guidance for a target model. Through a controlled analysis of paired human‑written and model‑generated solutions, we identify a systematic dissociation between usage and executability: human‑ and model‑derived strategies differ in structured, domain‑dependent ways, leading to complementary strengths and consistent source‑dependent reversals under guidance. Building on this diagnosis, we propose Selective Strategy Retrieval (SSR), a test‑time framework that explicitly models executability by selectively retrieving and combining strategies using empirical, multi‑route, source‑aware signals. Across multiple mathematical reasoning benchmarks, SSR yields reliable and consistent improvements over direct solving, in‑context learning, and single‑source guidance, improving accuracy by up to +13 points on AIME25 and +5 points on Apex for compact reasoning models. Code and benchmark are publicly available at: https://github.com/lwd17/strategy‑execute‑pipeline.
Authors:Craig Myles, Patrick Schrempf, David Harris-Birtill
Abstract:
Errors in medical text can cause delays or even result in incorrect treatment for patients. Recently, language models have shown promise in their ability to automatically detect errors in medical text, an ability that has the opportunity to significantly benefit healthcare systems. In this paper, we explore the importance of prompt optimisation for small and large language models when applied to the task of error detection. We perform rigorous experiments and analysis across frontier language models and open‑source language models. We show that automatic prompt optimisation with Genetic‑Pareto (GEPA) improves error detection over the baseline accuracy performance from 0.669 to 0.785 with GPT‑5 and 0.578 to 0.690 with Qwen3‑32B, approaching the performance of medical doctors and achieving state‑of‑the‑art performance on the MEDEC benchmark dataset. Code available on GitHub: https://github.com/CraigMyles/clinical‑note‑error‑detection
Authors:Emilio Ferrara
Abstract:
Community detection in attributed networks faces a fundamental divide: topological algorithms ignore semantic features, while Graph Neural Networks (GNNs) encounter devastating computational bottlenecks. Specifically, GNNs suffer from a Semantic Wall of feature over smoothing in dense or heterophilic networks, and a Systems Wall driven by the O(N^2) memory constraints of pairwise clustering. To dismantle these barriers, we introduce ECHO (Encoding Communities via High order Operators), a scalable, self supervised architecture that reframes community detection as an adaptive, multi scale diffusion process. ECHO features a Topology Aware Router that automatically analyzes structural heuristics sparsity, density, and assortativity to route graphs through the optimal inductive bias, preventing heterophilic poisoning while ensuring semantic densification. Coupled with a memory sharded full batch contrastive objective and a novel chunked O(N \cdot K) similarity extraction method, ECHO completely bypasses traditional O(N^2) memory bottlenecks without sacrificing the mathematical precision of global gradients. Extensive evaluations demonstrate that this topology feature synergy consistently overcomes the classical resolution limit. On synthetic LFR benchmarks scaled up to 1 million nodes, ECHO achieves scale invariant accuracy despite severe topological noise. Furthermore, on massive real world social networks with over 1.6 million nodes and 30 million edges, it completes clustering in mere minutes with throughputs exceeding 2,800 nodes per second matching the speed of highly optimized purely topological baselines. The implementation utilizes a unified framework that automatically engages memory sharded optimization to support adoption across varying hardware constraints. GitHub Repository: https://github.com/emilioferrara/ECHO‑GNN
Authors:Idan Habler, Vineeth Sai Narajala, Stav Koren, Amy Chang, Tiffany Saade
Abstract:
Retrieval‑Augmented Generation (RAG) systems are essential to contemporary AI applications, allowing large language models to obtain external knowledge via vector similarity search. Nevertheless, these systems encounter a significant security flaw: hubness ‑ items that frequently appear in the top‑k retrieval results for a disproportionately high number of varied queries. These hubs can be exploited to introduce harmful content, alter search rankings, bypass content filtering, and decrease system performance. We introduce hubscan, an open‑source security scanner that evaluates vector indices and embeddings to identify hubs in RAG systems. Hubscan presents a multi‑detector architecture that integrates: (1) robust statistical hubness detection utilizing median/MAD‑based z‑scores, (2) cluster spread analysis to assess cross‑cluster retrieval patterns, (3) stability testing under query perturbations, and (4) domain‑aware and modality‑aware detection for category‑specific and cross‑modal attacks. Our solution accommodates several vector databases (FAISS, Pinecone, Qdrant, Weaviate) and offers versatile retrieval techniques, including vector similarity, hybrid search, and lexical matching with reranking capabilities. We evaluate hubscan on Food‑101, MS‑COCO, and FiQA adversarial hubness benchmarks constructed using state‑of‑the‑art gradient‑optimized and centroid‑based hub generation methods. hubscan achieves 90% recall at a 0.2% alert budget and 100% recall at 0.4%, with adversarial hubs ranking above the 99.8th percentile. Domain‑scoped scanning recovers 100% of targeted attacks that evade global detection. Production validation on 1M real web documents from MS MARCO demonstrates significant score separation between clean documents and adversarial content. Our work provides a practical, extensible framework for detecting hubness threats in production RAG systems.
Authors:Cosmo Santoni
Abstract:
As large language models engage in extended reasoning tasks, they accumulate significant state ‑‑ architectural mappings, trade‑off decisions, codebase conventions ‑‑ within the context window. This understanding is lost when sessions reach context limits and undergo lossy compaction. We propose Contextual Memory Virtualisation (CMV), a system that treats accumulated LLM understanding as version‑controlled state. Borrowing from operating system virtual memory, CMV models session history as a Directed Acyclic Graph (DAG) with formally defined snapshot, branch, and trim primitives that enable context reuse across independent parallel sessions. We introduce a three‑pass structurally lossless trimming algorithm that preserves every user message and assistant response verbatim while reducing token counts by a mean of 20% and up to 86% for sessions with significant overhead by stripping mechanical bloat such as raw tool outputs, base64 images, and metadata. A single‑user case‑study evaluation across 76 real‑world coding sessions demonstrates that trimming remains economically viable under prompt caching, with the strongest gains in mixed tool‑use sessions, which average 39% reduction and reach break‑even within 10 turns. A reference implementation is available at https://github.com/CosmoNaught/claude‑code‑cmv.
Authors:Fuyao Huang, Xiaozhu Yu, Kui Xu, Qiangfeng Cliff Zhang
Abstract:
High‑resolution structure determination by cryo‑electron microscopy (cryo‑EM) requires the accurate fitting of an atomic model into an experimental density map. Traditional refinement pipelines such as Phenix.real_space_refine and Rosetta are computationally expensive, demand extensive manual tuning, and present a significant bottleneck for researchers. We present CryoNet.Refine, an end‑to‑end deep learning framework that automates and accelerates molecular structure refinement. Our approach utilizes a one‑step diffusion model that integrates a density‑aware loss function with robust stereochemical restraints, enabling rapid optimization of a structure against experimental data. CryoNet.Refine provides a unified and versatile solution capable of refining protein complexes as well as DNA/RNA‑protein complexes. In benchmarks against Phenix.real_space_refine, CryoNet.Refine consistently achieves substantial improvements in both model‑map correlation and overall geometric quality metrics. By offering a scalable, automated, and powerful alternative, CryoNet.Refine aims to serve as an essential tool for next‑generation cryo‑EM structure refinement. Web server: https://cryonet.ai/refine; Source code: https://github.com/kuixu/cryonet.refine.
Authors:Alex Morehead, Miruna Cretu, Antonia Panescu, Rishabh Anand, Maurice Weiler, Tynan Perez, Samuel Blau, Steven Farrell, Wahid Bhimji, Anubhav Jain, Hrushikesh Sahasrabuddhe, Pietro Lio, Tommi Jaakkola, Rafael Gomez-Bombarelli, Rex Ying, N. Benjamin Erichson, Michael W. Mahoney
Abstract:
General‑purpose 3D chemical modeling encompasses molecules and materials, requiring both generative and predictive capabilities. However, most existing AI approaches are optimized for a single domain (molecules or materials) and a single task (generation or prediction), which limits representation sharing and transfer. We introduce Zatom‑1, the first end‑to‑end, fully open‑source foundation model that unifies generative and predictive learning of 3D molecules and materials. Zatom‑1 is a Transformer trained with a multimodal flow matching objective that jointly models discrete atom types and continuous 3D geometries. This approach supports scalable pretraining with predictable gains as model capacity increases, while enabling fast and stable sampling. We use joint generative pretraining as a universal initialization for downstream multi‑task prediction of properties, energies, and forces. Empirically, Zatom‑1 matches or outperforms specialized baselines on both generative and predictive benchmarks, while reducing the generative inference time by more than an order of magnitude. Our experiments demonstrate positive predictive transfer between chemical domains from joint generative pretraining: modeling materials during pretraining improves molecular property prediction accuracy.
Authors:Xavier Pleimling, Sifat Muhammad Abdullah, Gunjan Balde, Peng Gao, Mainack Mondal, Murtuza Jadliwala, Bimal Viswanath
Abstract:
Advances in Generative AI (GenAI) have led to the development of various protection strategies to prevent the unauthorized use of images. These methods rely on adding imperceptible protective perturbations to images to thwart misuse such as style mimicry or deepfake manipulations. Although previous attacks on these protections required specialized, purpose‑built methods, we demonstrate that this is no longer necessary. We show that off‑the‑shelf image‑to‑image GenAI models can be repurposed as generic ``denoisers" using a simple text prompt, effectively removing a wide range of protective perturbations. Across 8 case studies spanning 6 diverse protection schemes, our general‑purpose attack not only circumvents these defenses but also outperforms existing specialized attacks while preserving the image's utility for the adversary. Our findings reveal a critical and widespread vulnerability in the current landscape of image protection, indicating that many schemes provide a false sense of security. We stress the urgent need to develop robust defenses and establish that any future protection mechanism must be benchmarked against attacks from off‑the‑shelf GenAI models. Code is available in this repository: https://github.com/mlsecviswanath/img2imgdenoiser
Authors:Lingfeng Ren, Weihao Yu, Runpeng Yu, Xinchao Wang
Abstract:
Object hallucination is a critical issue in Large Vision‑Language Models (LVLMs), where outputs include objects that do not appear in the input image. A natural question arises from this phenomenon: Which component of the LVLM pipeline primarily contributes to object hallucinations? The vision encoder to perceive visual information, or the language decoder to generate text responses? In this work, we strive to answer this question through designing a systematic experiment to analyze the roles of the vision encoder and the language decoder in hallucination generation. Our observations reveal that object hallucinations are predominantly associated with the strong priors from the language decoder. Based on this finding, we propose a simple and training‑free framework, No‑Language‑Hallucination Decoding, NoLan, which refines the output distribution by dynamically suppressing language priors, modulated based on the output distribution difference between multimodal and text‑only inputs. Experimental results demonstrate that NoLan effectively reduces object hallucinations across various LVLMs on different tasks. For instance, NoLan achieves substantial improvements on POPE, enhancing the accuracy of LLaVA‑1.5 7B and Qwen‑VL 7B by up to 6.45 and 7.21, respectively. The code is publicly available at: https://github.com/lingfengren/NoLan.
Authors:Pantia-Marina Alchirch, Dimitrios I. Diochnos
Abstract:
Many real‑world applications generate continuous data streams for regression. Hoeffding trees and their variants have a long‑standing tradition due to their effectiveness, either alone or as base models in broader ensembles. Recent batch‑learning work shows that kernel density estimation (KDE) improves smoothed predictions in imbalanced regression [Yang et al., 2021], while hierarchical shrinkage (HS) provides post‑hoc regularization for decision trees without modifying their structure [Agarwal et al., 2022]. We extend KDE to streaming settings via a telescoping formulation and integrate HS into incremental decision trees. Empirical evaluation on standard online regression benchmarks shows that KDE consistently improves early‑stream performance, whereas HS provides limited gains. Our implementation is publicly available at: https://github.com/marinaAlchirch/DSFA_2026.
Authors:Jinpeng Li, Zhongyi Pei, Huaze Xue, Bojian Zheng, Chen Wang, Jianmin Wang
Abstract:
Time‑series foundation models (TSFMs) have achieved strong univariate forecasting through large‑scale pre‑training, yet effectively extending this success to multivariate forecasting remains challenging. To address this, we propose DualWeaver, a novel framework that adapts univariate TSFMs (Uni‑TSFMs) for multivariate forecasting by using a pair of learnable, structurally symmetric surrogate series. Generated by a shared auxiliary feature‑fusion module that captures cross‑variable dependencies, these surrogates are mapped to TSFM‑compatible series via the forecasting objective. The symmetric structure enables parameter‑free reconstruction of final predictions directly from the surrogates, without additional parametric decoding. A theoretically grounded regularization term is further introduced to enhance robustness against adaptation collapse. Extensive experiments on diverse real‑world datasets show that DualWeaver outperforms state‑of‑the‑art multivariate forecasters in both accuracy and stability. We release the code at https://github.com/li‑jinpeng/DualWeaver.
Authors:Xiaoyu Xian, Shiao Wang, Xiao Wang, Daxin Tian, Yan Tian
Abstract:
Metro trains often operate in highly complex environments, characterized by illumination variations, high‑speed motion, and adverse weather conditions. These factors pose significant challenges for visual perception systems, especially those relying solely on conventional RGB cameras. To tackle these difficulties, we explore the integration of event cameras into the perception system, leveraging their advantages in low‑light conditions, high‑speed scenarios, and low power consumption. Specifically, we focus on Kilometer Marker Recognition (KMR), a critical task for autonomous metro localization under GNSS‑denied conditions. In this context, we propose a robust baseline method based on a pre‑trained RGB OCR foundation model, enhanced through multi‑modal adaptation. Furthermore, we construct the first large‑scale RGB‑Event dataset, EvMetro5K, containing 5,599 pairs of synchronized RGB‑Event samples, split into 4,479 training and 1,120 testing samples. Extensive experiments on EvMetro5K and other widely used benchmarks demonstrate the effectiveness of our approach for KMR. Both the dataset and source code will be released on https://github.com/Event‑AHU/EvMetro5K_benchmark
Authors:Lin Zhu, Lei You
Abstract:
Counterfactual explanation (CE) is an important domain within post‑hoc explainability. However, the explanations generated by most CE generators are often highly redundant. This work introduces an open‑source Python library xai‑cola, which provides an end‑to‑end pipeline for sparsifying CEs produced by arbitrary generators, reducing superfluous feature changes while preserving their validity. It offers a documented API that takes as input raw tabular data in pandas DataFrame form, a preprocessing object (for standardization and encoding), and a trained scikit‑learn or PyTorch model. On this basis, users can either employ the built‑in or externally imported CE generators. The library also implements several sparsification policies and includes visualization routines for analysing and comparing sparsified counterfactuals. xai‑cola is released under the MIT license and can be installed from PyPI. Empirical experiments indicate that xai‑cola produces sparser counterfactuals across several CE generators, reducing the number of modified features by up to 50% in our setting. The source code is available at https://github.com/understanding‑ml/COLA.
Authors:Xiannan Huang, Quan Yuan, Chao Yang
Abstract:
Accurately predicting short‑term traffic demand is critical for intelligent transportation systems. While deep learning models achieve strong performance under stationary conditions, their accuracy often degrades significantly when faced with distribution shifts caused by external events or evolving urban dynamics. Frequent model retraining to adapt to such changes incurs prohibitive computational costs, especially for large‑scale or foundation models. To address this challenge, we propose FORESEE (Forecasting Online with Residual Smoothing and Ensemble Experts), a lightweight online adaptation framework that is accurate, robust, and computationally efficient. FORESEE operates without any parameter updates to the base model. Instead, it corrects today's forecast in each region using yesterday's prediction error, stabilized through exponential smoothing guided by a mixture‑of‑experts mechanism that adapts to recent error dynamics. Moreover, an adaptive spatiotemporal smoothing component propagates error signals across neighboring regions and time slots, capturing coherent shifts in demand patterns. Extensive experiments on seven real‑world datasets with three backbone models demonstrate that FORESEE consistently improves prediction accuracy, maintains robustness even when distribution shifts are minimal (avoiding performance degradation), and achieves the lowest computational overhead among existing online methods. By enabling real‑time adaptation of traffic forecasting models with negligible computational cost, FORESEE paves the way for deploying reliable, up‑to‑date prediction systems in dynamic urban environments. Code and data are available at https://github.com/xiannanhuang/FORESEE
Authors:Guanyi Qin, Xiaozhen Wang, Zhu Zhuo, Chang Han Low, Yuancan Xiao, Yibing Fu, Haofeng Liu, Kai Wang, Chunjiang Li, Yueming Jin
Abstract:
Minimally invasive surgery has dramatically improved patient operative outcomes, yet identifying safe operative zones remains challenging in critical phases, requiring surgeons to integrate visual cues, procedural phase, and anatomical context under high cognitive load. Existing AI systems offer binary safety verification or static detection, ignoring the phase‑dependent nature of intraoperative reasoning. We introduce ResGo, a benchmark of laparoscopic frames annotated with Go Zone bounding boxes and clinician‑authored rationales covering phase, exposure quality reasoning, next action and risk reminder. We introduce evaluation metrics that treat correct grounding under incorrect phase as failures, revealing that most vision‑language models cannot handle such tasks and perform poorly. We then present SurGo‑R1, a model optimized via RLHF with a multi‑turn phase‑then‑go architecture where the model first identifies the surgical phase, then generates reasoning and Go Zone coordinates conditioned on that context. On unseen procedures, SurGo‑R1 achieves 76.6% phase accuracy, 32.7 mIoU, and 54.8% hardcore accuracy, a 6.6× improvement over the mainstream generalist VLMs. Code, model and benchmark will be available at https://github.com/jinlab‑imvr/SurGo‑R1
Authors:Shaoxuan Wu, Jingkun Chen, Chong Ma, Cong Shen, Xiao Zhang, Jun Feng
Abstract:
Computer‑aided diagnosis (CAD) has significantly advanced automated chest X‑ray diagnosis but remains isolated from clinical workflows and lacks reliable decision support and interpretability. Human‑AI collaboration seeks to enhance the reliability of diagnostic models by integrating the behaviors of controllable radiologists. However, the absence of interactive tools seamlessly embedded within diagnostic routines impedes collaboration, while the semantic gap between radiologists' decision‑making patterns and model representations further limits clinical adoption. To overcome these limitations, we propose a visual cognition‑guided collaborative network (VCC‑Net) to achieve the cooperative diagnostic paradigm. VCC‑Net centers on visual cognition (VC) and employs clinically compatible interfaces, such as eye‑tracking or the mouse, to capture radiologists' visual search traces and attention patterns during diagnosis. VCC‑Net employs VC as a spatial cognition guide, learning hierarchical visual search strategies to localize diagnostically key regions. A cognition‑graph co‑editing module subsequently integrates radiologist VC with model inference to construct a disease‑aware graph. The module captures dependencies among anatomical regions and aligns model representations with VC‑driven features, mitigating radiologist bias and facilitating complementary, transparent decision‑making. Experiments on the public datasets SIIM‑ACR, EGD‑CXR, and self‑constructed TB‑Mouse dataset achieved classification accuracies of 88.40%, 85.05%, and 92.41%, respectively. The attention maps produced by VCC‑Net exhibit strong concordance with radiologists' gaze distributions, demonstrating a mutual reinforcement of radiologist and model inference. The code is available at https://github.com/IPMI‑NWU/VCC‑Net.
Authors:Chenyv Liu, Wentao Tan, Lei Zhu, Fengling Li, Jingjing Li, Guoli Yang, Heng Tao Shen
Abstract:
Standard vision‑language‑action (VLA) models rely on fitting statistical data priors, limiting their robust understanding of underlying physical dynamics. Reinforcement learning enhances physical grounding through exploration yet typically relies on external reward signals that remain isolated from the agent's internal states. World action models have emerged as a promising paradigm that integrates imagination and control to enable predictive planning. However, they rely on implicit context modeling, lacking explicit mechanisms for self‑improvement. To solve these problems, we propose Self‑Correcting VLA (SC‑VLA), which achieve self‑improvement by intrinsically guiding action refinement through sparse imagination. We first design sparse world imagination by integrating auxiliary predictive heads to forecast current task progress and future trajectory trends, thereby constraining the policy to encode short‑term physical evolution. Then we introduce the online action refinement module to reshape progress‑dependent dense rewards, adjusting trajectory orientation based on the predicted sparse future states. Evaluations on challenging robot manipulation tasks from simulation benchmarks and real‑world settings demonstrate that SC‑VLA achieve state‑of‑the‑art performance, yielding the highest task throughput with 16% fewer steps and a 9% higher success rate than the best‑performing baselines, alongside a 14% gain in real‑world experiments. Code is available at https://github.com/Kisaragi0/SC‑VLA.
Authors:Yue Yang, Shuo Cheng, Yu Fang, Homanga Bharadhwaj, Mingyu Ding, Gedas Bertasius, Daniel Szafir
Abstract:
General‑purpose robots must master long‑horizon manipulation, defined as tasks involving multiple kinematic structure changes (e.g., attaching or detaching objects) in unstructured environments. While Vision‑Language‑Action (VLA) models offer the potential to master diverse atomic skills, they struggle with the combinatorial complexity of sequencing them and are prone to cascading failures due to environmental sensitivity. To address these challenges, we propose LiLo‑VLA (Linked Local VLA), a modular framework capable of zero‑shot generalization to novel long‑horizon tasks without ever being trained on them. Our approach decouples transport from interaction: a Reaching Module handles global motion, while an Interaction Module employs an object‑centric VLA to process isolated objects of interest, ensuring robustness against irrelevant visual features and invariance to spatial configurations. Crucially, this modularity facilitates robust failure recovery through dynamic replanning and skill reuse, effectively mitigating the cascading errors common in end‑to‑end approaches. We introduce a 21‑task simulation benchmark consisting of two challenging suites: LIBERO‑Long++ and Ultra‑Long. In these simulations, LiLo‑VLA achieves a 69% average success rate, outperforming Pi0.5 by 41% and OpenVLA‑OFT by 67%. Furthermore, real‑world evaluations across 8 long‑horizon tasks demonstrate an average success rate of 85%. Project page: https://yy‑gx.github.io/LiLo‑VLA/.
Authors:Ningyuan Yang, Weihua Du, Weiwei Sun, Sean Welleck, Yiming Yang
Abstract:
Reinforcement learning (RL) has become a central post‑training paradigm for large language models (LLMs), but its performance is highly sensitive to the quality of training problems. This sensitivity stems from the non‑stationarity of RL: rollouts are generated by an evolving policy, and learning is shaped by exploration and reward feedback, unlike supervised fine‑tuning (SFT) with fixed trajectories. As a result, prior work often relies on manual curation or simple heuristic filters (e.g., accuracy), which can admit incorrect or low‑utility problems. We propose GradAlign, a gradient‑aligned data selection method for LLM reinforcement learning that uses a small, trusted validation set to prioritize training problems whose policy gradients align with validation gradients, yielding an adaptive curriculum. We evaluate GradAlign across three challenging data regimes: unreliable reward signals, distribution imbalance, and low‑utility training corpus, showing that GradAlign consistently outperforms existing baselines, underscoring the importance of directional gradient signals in navigating non‑stationary policy optimization and yielding more stable training and improved final performance. We release our implementation at https://github.com/StigLidu/GradAlign
Authors:Jesse He, Helen Jenne, Max Vargas, Davis Brown, Gal Mishne, Yusu Wang, Henry Kvinge
Abstract:
The recent field of neural algorithmic reasoning (NAR) studies the ability of graph neural networks (GNNs) to emulate classical algorithms like Bellman‑Ford, a phenomenon known as algorithmic alignment. At the same time, recent advances in large language models (LLMs) have spawned the study of mechanistic interpretability, which aims to identify granular model components like circuits that perform specific computations. In this work, we introduce Mechanistic Interpretability for Neural Algorithmic Reasoning (MINAR), an efficient circuit discovery toolbox that adapts attribution patching methods from mechanistic interpretability to the GNN setting. We show through two case studies that MINAR recovers faithful neuron‑level circuits from GNNs trained on algorithmic tasks. Our study sheds new light on the process of circuit formation and pruning during training, as well as giving new insight into how GNNs trained to perform multiple tasks in parallel reuse circuit components for related tasks. Our code is available at https://github.com/pnnl/MINAR.
Authors:Jan Pauls, Karsten Schrödter, Sven Ligensa, Martin Schwartz, Berkant Turan, Max Zimmer, Sassan Saatchi, Sebastian Pokutta, Philippe Ciais, Fabian Gieseke
Abstract:
Forest monitoring is critical for climate change mitigation. However, existing global tree height maps provide only static snapshots and do not capture temporal forest dynamics, which are essential for accurate carbon accounting. We introduce ECHOSAT, a global and temporally consistent tree height map at 10 m resolution spanning multiple years. To this end, we resort to multi‑sensor satellite data to train a specialized vision transformer model, which performs pixel‑level temporal regression. A self‑supervised growth loss regularizes the predictions to follow growth curves that are in line with natural tree development, including gradual height increases over time, but also abrupt declines due to forest loss events such as fires. Our experimental evaluation shows that our model improves state‑of‑the‑art accuracies in the context of single‑year predictions. We also provide the first global‑scale height map that accurately quantifies tree growth and disturbances over time. We expect ECHOSAT to advance global efforts in carbon monitoring and disturbance assessment. The maps can be accessed at https://github.com/ai4forest/echosat.
Authors:Alina Devkota, Jacob Thrasher, Donald Adjeroh, Binod Bhattarai, Prashnna K. Gyawali
Abstract:
Federated Learning (FL) enables collaborative model training across multiple clients without sharing their private data. However, data heterogeneity across clients leads to client drift, which degrades the overall generalization performance of the model. This effect is further compounded by overemphasis on poorly performing clients. To address this problem, we propose FedVG, a novel gradient‑based federated aggregation framework that leverages a global validation set to guide the optimization process. Such a global validation set can be established using readily available public datasets, ensuring accessibility and consistency across clients without compromising privacy. In contrast to conventional approaches that prioritize client dataset volume, FedVG assesses the generalization ability of client models by measuring the magnitude of validation gradients across layers. Specifically, we compute layerwise gradient norms to derive a client‑specific score that reflects how much each client needs to adjust for improved generalization on the global validation set, thereby enabling more informed and adaptive federated aggregation. Extensive experiments on both natural and medical image benchmarking datasets, across diverse model architectures, demonstrate that FedVG consistently improves performance, particularly in highly heterogeneous settings. Moreover, FedVG is modular and can be seamlessly integrated with various state‑of‑the‑art FL algorithms, often further improving their results. Our code is available at https://github.com/alinadevkota/FedVG.
Authors:Subhadip Mitra
Abstract:
We present a memory system for AI agents that treats stored information as continuous fields governed by partial differential equations rather than discrete entries in a database. The approach draws from classical field theory: memories diffuse through semantic space, decay thermodynamically based on importance, and interact through field coupling in multi‑agent scenarios. We evaluate the system on two established long‑context benchmarks: LoCoMo (ACL 2024) with 300‑turn conversations across 35 sessions, and LongMemEval (ICLR 2025) testing multi‑session reasoning over 500+ turns. On LongMemEval, the field‑theoretic approach achieves significant improvements: +116% F1 on multi‑session reasoning (p<0.01, d= 3.06), +43.8% on temporal reasoning (p<0.001, d= 9.21), and +27.8% retrieval recall on knowledge updates (p<0.001, d= 5.00). Multi‑agent experiments show near‑perfect collective intelligence (>99.8%) through field coupling. Code is available at github.com/rotalabs/rotalabs‑fieldmem.
Authors:Tony Feng, Junehyuk Jung, Sang-hyun Kim, Carlo Pagano, Sergei Gukov, Chiang-Chiang Tsai, David Woodruff, Adel Javanmard, Aryan Mokhtari, Dawsen Hwang, Yuri Chervonyi, Jonathan N. Lee, Garrett Bingham, Trieu H. Trinh, Vahab Mirrokni, Quoc V. Le, Thang Luong
Abstract:
We report the performance of Aletheia (Feng et al., 2026b), a mathematics research agent powered by Gemini 3 Deep Think, on the inaugural FirstProof challenge. Within the allowed timeframe of the challenge, Aletheia autonomously solved 6 problems (2, 5, 7, 8, 9, 10) out of 10 according to majority expert assessments; we note that experts were not unanimous on Problem 8 (only). For full transparency, we explain our interpretation of FirstProof and disclose details about our experiments as well as our evaluation. Raw prompts and outputs are available at https://github.com/google‑deepmind/superhuman/tree/main/aletheia.
Authors:Sepehr Salem Ghahfarokhi, M. Moein Esfahani, Raj Sunderraman, Vince Calhoun, Mohammed Alser
Abstract:
Deep learning has significantly advanced automated brain tumor diagnosis, yet clinical adoption remains limited by interpretability and computational constraints. Conventional models often act as opaque ''black boxes'' and fail to quantify the complex, irregular tumor boundaries that characterize malignant growth. To address these challenges, we present XMorph, an explainable and computationally efficient framework for fine‑grained classification of three prominent brain tumor types: glioma, meningioma, and pituitary tumors. We propose an Information‑Weighted Boundary Normalization (IWBN) mechanism that emphasizes diagnostically relevant boundary regions alongside nonlinear chaotic and clinically validated features, enabling a richer morphological representation of tumor growth. A dual‑channel explainable AI module combines GradCAM++ visual cues with LLM‑generated textual rationales, translating model reasoning into clinically interpretable insights. The proposed framework achieves a classification accuracy of 96.0%, demonstrating that explainability and high performance can co‑exist in AI‑based medical imaging systems. The source code and materials for XMorph are all publicly available at: https://github.com/ALSER‑Lab/XMorph.
Authors:Victor Reijgwart, Cesar Cadena, Roland Siegwart, Lionel Ott
Abstract:
Hierarchical, multi‑resolution volumetric mapping approaches are widely used to represent large and complex environments as they can efficiently capture their occupancy and connectivity information. Yet widely used path planning methods such as sampling and trajectory optimization do not exploit this explicit connectivity information, and search‑based methods such as A suffer from scalability issues in large‑scale high‑resolution maps. In many applications, Euclidean shortest paths form the underpinning of the navigation system. For such applications, any‑angle planning methods, which find optimal paths by connecting corners of obstacles with straight‑line segments, provide a simple and efficient solution. In this paper, we present a method that has the optimality and completeness properties of any‑angle planners while overcoming computational tractability issues common to search‑based methods by exploiting multi‑resolution representations. Extensive experiments on real and synthetic environments demonstrate the proposed approach's solution quality and speed, outperforming even sampling‑based methods. The framework is open‑sourced to allow the robotics and planning community to build on our research.
Authors:David Anugraha, Vishakh Padmakumar, Diyi Yang
Abstract:
Qualitative insights from user experiences are critical for informing product and policy decisions, but collecting such data at scale is constrained by the time and availability of experts to conduct semi‑structured interviews. Recent work has explored using large language models (LLMs) to automate interviewing, yet existing systems lack a principled mechanism for balancing systematic coverage of predefined topics with adaptive exploration, or the ability to pursue follow‑ups, deep dives, and emergent themes that arise organically during conversation. In this work, we formulate adaptive semi‑structured interviewing as an optimization problem over the interviewer's behavior. We define interview utility as a trade‑off between coverage of a predefined interview topic guide, discovery of relevant emergent themes, and interview cost measured by length. Based on this formulation, we introduce SparkMe, a multi‑agent LLM interviewer that performs deliberative planning via simulated conversation rollouts to select questions with high expected utility. We evaluate SparkMe through controlled experiments with LLM‑based interviewees, showing that it achieves higher interview utility, improving topic guide coverage (+4.7% over the best baseline) and eliciting richer emergent insights while using fewer conversational turns than prior LLM interviewing approaches. We further validate SparkMe in a user study with 70 participants across 7 professions on the impact of AI on their workflows. Domain experts rate SparkMe as producing high‑quality adaptive interviews that surface helpful profession‑specific insights not captured by prior approaches. The code, datasets, and evaluation protocols for SparkMe are available as open‑source at https://github.com/SALT‑NLP/SparkMe.
Authors:Yanrui Wu, Lingling Zhang, Xinyu Zhang, Jiayu Chang, Pengyu Li, Xu Jiang, Jingtao Hu, Jun Liu
Abstract:
Evaluations of large language models (LLMs) primarily emphasize convergent logical reasoning, where success is defined by producing a single correct proof. However, many real‑world reasoning problems admit multiple valid derivations, requiring models to explore diverse logical paths rather than committing to one route. To address this limitation, we introduce LogicGraph, the first benchmark aimed to systematically evaluate multi‑path logical reasoning, constructed via a neuro‑symbolic framework that leverages backward logic generation and semantic instantiation. This pipeline yields solver‑verified reasoning problems formalized by high‑depth multi‑path reasoning and inherent logical distractions, where each instance is associated with an exhaustive set of minimal proofs. We further propose a reference‑free evaluation framework to rigorously assess model performance in both convergent and divergent regimes. Experiments on state‑of‑the‑art language models reveal a common limitation: models tend to commit early to a single route and fail to explore alternatives, and the coverage gap grows substantially with reasoning depth. LogicGraph exposes this divergence gap and provides actionable insights to motivate future improvements. Our code and data will be released at https://github.com/kkkkarry/LogicGraph.
Authors:Tianhao Fu, Yucheng Chen
Abstract:
Medical image processing demands specialized software that handles high‑dimensional volumetric data, heterogeneous file formats, and domain‑specific training procedures. Existing frameworks either provide low‑level components that require substantial integration effort or impose rigid, monolithic pipelines that resist modification. We present MIP Candy (MIPCandy), a freely available, PyTorch‑based framework designed specifically for medical image processing. MIPCandy provides a complete, modular pipeline spanning data loading, training, inference, and evaluation, allowing researchers to obtain a fully functional process workflow by implementing a single method, \textttbuild_network, while retaining fine‑grained control over every component. Central to the design is \textttLayerT, a deferred configuration mechanism that enables runtime substitution of convolution, normalization, and activation modules without subclassing. The framework further offers built‑in k‑fold cross‑validation, dataset inspection with automatic region‑of‑interest detection, deep supervision, exponential moving average, multi‑frontend experiment tracking (Weights & Biases, Notion, MLflow), training state recovery, and validation score prediction via quotient regression. An extensible bundle ecosystem provides pre‑built model implementations that follow a consistent trainer‑‑predictor pattern and integrate with the core framework without modification. MIPCandy is open‑source under the Apache‑2.0 license and requires Python~3.12 or later. Source code and documentation are available at https://github.com/ProjectNeura/MIPCandy.
Authors:Varvara Sazonova, Dmitri Shmelkin, Stanislav Kikot, Vasily Motolygin
Abstract:
With the growing popularity of Large Reasoning Models and their results in solving mathematical problems, it becomes crucial to measure their capabilities. We introduce a pipeline for both automatic and interactive verification as a more accurate alternative to only checking the answer which is currently the most popular approach for benchmarks. The pipeline can also be used as a generator of correct solutions both in formal and informal languages. 3 AI agents, which can be chosen for the benchmark accordingly, are included in the structure. The key idea is the use of prompts to obtain the solution in the specific form which allows for easier verification using proof assistants and possible use of small models (\le 8B). Experiments on several datasets suggest low probability of False Positives. The open‑source implementation with instructions on setting up a server is available at https://github.com/LogicEnj/lean4_verification_pipeline.
Authors:Chao Fei, Guozhong Li, Chenxi Liu, Panos Kalnis
Abstract:
Long‑context LLMs demand accurate inference at low latency, yet decoding becomes primarily constrained by KV cache as context grows. Prior pruning methods are largely context‑agnostic: their token selection ignores step‑wise relevance and local semantics, which undermines quality. Moreover, their irregular accesses and selection overheads yield only limited wall‑clock speedups. To address this, we propose CHESS, an algorithm‑system co‑design KV‑cache management system. Algorithmically, CHESS introduces a context‑aware, hierarchical selection policy that dynamically reconstructs a coherent context for the current decoding. System‑wise, coarse granularity selection eliminates expensive data movement, fully realizing practical acceleration from theoretical sparsity. Extensive evaluations demonstrate that CHESS surpasses Full‑KV quality using only 1% of the KV cache, delivers low‑latency stable inference with up to 4.56× higher throughput, and consistently outperforms other strong baselines. Code is available at \hrefhttps://anonymous.4open.science/r/CHESS‑9958/https://anonymous.4open.science/r/CHESS/.
Authors:Aram Davtyan, Yusuf Sahin, Yasaman Haghighi, Sebastian Stapf, Pablo Acuaviva, Alexandre Alahi, Paolo Favaro
Abstract:
Discrete image tokenizers have emerged as a key component of modern vision and multimodal systems, providing a sequential interface for transformer‑based architectures. However, most existing approaches remain primarily optimized for reconstruction and compression, often yielding tokens that capture local texture rather than object‑level semantic structure. Inspired by the incremental and compositional nature of human communication, we introduce COMmunication inspired Tokenization (COMiT), a framework for learning structured discrete visual token sequences. COMiT constructs a latent message within a fixed token budget by iteratively observing localized image crops and recurrently updating its discrete representation. At each step, the model integrates new visual information while refining and reorganizing the existing token sequence. After several encoding iterations, the final message conditions a flow‑matching decoder that reconstructs the full image. Both encoding and decoding are implemented within a single transformer model and trained end‑to‑end using a combination of flow‑matching reconstruction and semantic representation alignment losses. Our experiments demonstrate that while semantic alignment provides grounding, attentive sequential tokenization is critical for inducing interpretable, object‑centric token structure and substantially improving compositional generalization and relational reasoning over prior methods.
Authors:Peter Hase, Christopher Potts
Abstract:
Inspecting Chain‑of‑Thought reasoning is among the most common means of understanding why an LLM produced its output. But well‑known problems with CoT faithfulness severely limit what insights can be gained from this practice. In this paper, we introduce a training method called Counterfactual Simulation Training (CST), which aims to improve CoT faithfulness by rewarding CoTs that enable a simulator to accurately predict a model's outputs over counterfactual inputs. We apply CST in two settings: (1) CoT monitoring with cue‑based counterfactuals, to detect when models rely on spurious features, reward hack, or are sycophantic, and (2) counterfactual simulation over generic model‑based counterfactuals, to encourage models to produce more faithful, generalizable reasoning in the CoT. Experiments with models up to 235B parameters show that CST can substantially improve monitor accuracy on cue‑based counterfactuals (by 35 accuracy points) as well as simulatability over generic counterfactuals (by 2 points). We further show that: (1) CST outperforms prompting baselines, (2) rewriting unfaithful CoTs with an LLM is 5x more efficient than RL alone, (3) faithfulness improvements do not generalize to dissuading cues (as opposed to persuading cues), and (4) larger models do not show more faithful CoT out of the box, but they do benefit more from CST. These results suggest that CST can improve CoT faithfulness in general, with promising applications for CoT monitoring. Code for experiments in this paper is available at https://github.com/peterbhase/counterfactual‑simulation‑training
Authors:Jiawei Wang, Chuang Yang, Jiawei Yong, Xiaohang Xu, Hongjun Wang, Noboru Koshizuka, Shintaro Fukushima, Ryosuke Shibasaki, Renhe Jiang
Abstract:
Mobility trajectories are essential for understanding urban dynamics and enhancing urban planning, yet access to such data is frequently hindered by privacy concerns. This research introduces a transformative framework for generating large‑scale urban mobility trajectories, employing a novel application of a transformer‑based model pre‑trained and fine‑tuned through a two‑phase process. Initially, trajectory generation is conceptualized as an offline reinforcement learning (RL) problem, with a significant reduction in vocabulary space achieved during tokenization. The integration of Inverse Reinforcement Learning (IRL) allows for the capture of trajectory‑wise reward signals, leveraging historical data to infer individual mobility preferences. Subsequently, the pre‑trained model is fine‑tuned using the constructed reward model, effectively addressing the challenges inherent in traditional RL‑based autoregressive methods, such as long‑term credit assignment and handling of sparse reward environments. Comprehensive evaluations on multiple datasets illustrate that our framework markedly surpasses existing models in terms of reliability and diversity. Our findings not only advance the field of urban mobility modeling but also provide a robust methodology for simulating urban data, with significant implications for traffic management and urban development planning. The implementation is publicly available at https://github.com/Wangjw6/TrajGPT_R.
Authors:Nathaniel Chen, Kouroche Bouchiat, Peter Steiner, Andrew Rothstein, David Smith, Max Austin, Mike van Zeeland, Azarakhsh Jalalvand, Egemen Kolemen
Abstract:
Next‑generation fusion facilities like ITER face a "data deluge," generating petabytes of multi‑diagnostic signals daily that challenge manual analysis. We present a "signals‑first" self‑supervised framework for the automated extraction of coherent and transient modes from high‑noise time‑frequency data across a variety of sensors. We also develop a general‑purpose method and tool for extracting coherent, quasi‑coherent, and transient modes for fluctuation measurements in tokamaks by employing non‑linear optimal techniques in multichannel signal processing with a fast neural network surrogate on fast magnetics, electron cyclotron emission, CO2 interferometers, and beam emission spectroscopy measurements from DIII‑D. Results are tested on data from DIII‑D, TJ‑II, and non‑fusion spectrograms. With an inference latency of 0.5 seconds, this framework enables real‑time mode identification and large‑scale automated database generation for advanced plasma control. Repository is in https://github.com/PlasmaControl/TokEye.
Authors:Wall Kim, Chaeyoung Song, Hanul Kim
Abstract:
Recently, TabPFN has gained attention as a foundation model for tabular data. However, it struggles to integrate heterogeneous modalities such as images and text, which are common in domains like healthcare and marketing, thereby limiting its applicability. To address this, we present the Multi‑Modal Prior‑data Fitted Network (MMPFN), which extends TabPFN to handle tabular and non‑tabular modalities in a unified manner. MMPFN comprises per‑modality encoders, modality projectors, and pre‑trained foundation models. The modality projectors serve as the critical bridge, transforming non‑tabular embeddings into tabular‑compatible tokens for unified processing. To this end, we introduce a multi‑head gated MLP and a cross‑attention pooler that extract richer context from non‑tabular inputs while mitigates attention imbalance issue in multimodal learning. Extensive experiments on medical and general‑purpose multimodal datasets demonstrate that MMPFN consistently outperforms competitive state‑of‑the‑art methods and effectively exploits non‑tabular modalities alongside tabular features. These results highlight the promise of extending prior‑data fitted networks to the multimodal setting, offering a scalable and effective framework for heterogeneous data learning. The source code is available at https://github.com/too‑z/MultiModalPFN.
Authors:Jing Zhang
Abstract:
AI agents increasingly act on behalf of humans, yet no existing system provides a tamper‑evident, independently verifiable record of what they did. As regulations such as the EU AI Act begin mandating automatic logging for high‑risk AI systems, this gap carries concrete consequences ‑‑ especially for agents running on personal hardware, where no centralized provider controls the log. Extending Floridi's informational rights framework from data about individuals to actions performed on their behalf, this paper proposes the Right to History: the principle that individuals are entitled to a complete, verifiable record of every AI agent action on their own hardware. The paper formalizes this principle through five system invariants with structured proof sketches, and implements it in PunkGo, a Rust sovereignty kernel that unifies RFC 6962 Merkle tree audit logs, capability‑based isolation, energy‑budget governance, and a human‑approval mechanism. Adversarial testing confirms all five invariants hold. Performance evaluation shows sub‑1.3 ms median action latency, ~400 actions/sec throughput, and 448‑byte Merkle inclusion proofs at 10,000 log entries.
Authors:Zhuoxu Huang, Mengxi Jia, Hao Sun, Xuelong Li, Jungong Han
Abstract:
Reinforcement Learning with verifiable rewards (RLVR) has emerged as a primary learning paradigm for enhancing the reasoning capabilities of multi‑modal large language models (MLLMs). However, during RL training, the enormous state space of MLLM and sparse rewards often leads to entropy collapse, policy degradation, or over‑exploitation of suboptimal behaviors. This necessitates an exploration strategy that maintains productive stochasticity while avoiding the drawbacks of uncontrolled random sampling, yielding inefficient exploration. In this paper, we propose CalibRL, a hybrid‑policy RLVR framework that supports controllable exploration with expert guidance, enabled by two key mechanisms. First, a distribution‑aware advantage weighting scales updates by group rareness to calibrate the distribution, therefore preserving exploration. Meanwhile, the asymmetric activation function (LeakyReLU) leverages the expert knowledge as a calibration baseline to moderate overconfident updates while preserving their corrective direction. CalibRL increases policy entropy in a guided manner and clarifies the target distribution by estimating the on‑policy distribution through online sampling. Updates are driven by these informative behaviors, avoiding convergence to erroneous patterns. Importantly, these designs help alleviate the distributional mismatch between the model's policy and expert trajectories, thereby achieving a more stable balance between exploration and exploitation. Extensive experiments across eight benchmarks, including both in‑domain and out‑of‑domain settings, demonstrate consistent improvements, validating the effectiveness of our controllable hybrid‑policy RLVR training. Code is available at https://github.com/zhh6425/CalibRL.
Authors:Chaeyun Kim, YongTaek Lim, Kihyun Kim, Junghwan Kim, Minwoo Kim
Abstract:
Existing red‑teaming benchmarks, when adapted to new languages via direct translation, fail to capture socio‑technical vulnerabilities rooted in local culture and law, creating a critical blind spot in LLM safety evaluation. To address this gap, we introduce CAGE (Culturally Adaptive Generation), a framework that systematically adapts the adversarial intent of proven red‑teaming prompts to new cultural contexts. At the core of CAGE is the Semantic Mold, a novel approach that disentangles a prompt's adversarial structure from its cultural content. This approach enables the modeling of realistic, localized threats rather than testing for simple jailbreaks. As a representative example, we demonstrate our framework by creating KoRSET, a Korean benchmark, which proves more effective at revealing vulnerabilities than direct translation baselines. CAGE offers a scalable solution for developing meaningful, context‑aware safety benchmarks across diverse cultures. Our dataset and evaluation rubrics are publicly available at https://github.com/selectstar‑ai/CAGE‑paper. (WARNING: This paper contains model outputs that can be offensive in nature.)
Authors:Clarisse Wibault, Johannes Forkel, Sebastian Towers, Tiphaine Wibault, Juan Duque, George Whittle, Andreas Schaab, Yucheng Yang, Chiyuan Wang, Maike Osborne, Benjamin Moll, Jakob Foerster
Abstract:
Mean Field Games (MFGs) provide a principled framework for modelling interactions in large population systems. However, algorithmic progress has been limited since model‑free methods are high variance and exact methods scale poorly. Recent Hybrid Structural Methods (HSMs) reduce variance while maintaining tractability by leveraging low‑dimensional individual state and action spaces and known transition dynamics to compute the exact expected return conditioned on Monte Carlo rollouts of common noise. However, HSMs have not been extended to partially observable settings. We propose Recurrent Structural Policy Gradient (RSPG), the first history‑aware HSM for MFGs with public partial information. RSPG achieves an order‑of‑magnitude faster convergence than model‑free RL methods while learning history‑aware behaviour, unlike current HSMs. To facilitate research into MFGs, we also introduce MFAX, our JAX‑based framework for MFGs that supports both analytic and sample‑based mean‑field updates. MFAX and usage examples can be found at https://clarisse‑wibault.github.io/rspg/.
Authors:Lingwei Gu, Nour Jedidi, Jimmy Lin
Abstract:
How do large language models (LLMs) know what they know? Answering this question has been difficult because pre‑training data is often a "black box" ‑‑ unknown or inaccessible. The recent release of nanochat ‑‑ a family of small LLMs with fully open pre‑training data ‑‑ addresses this as it provides a transparent view into where a model's parametric knowledge comes from. Towards the goal of understanding how knowledge is encoded by LLMs, we release NanoKnow, a benchmark dataset that partitions questions from Natural Questions and SQuAD into splits based on whether their answers are present in nanochat's pre‑training corpus. Using these splits, we can now properly disentangle the sources of knowledge that LLMs rely on when producing an output. To demonstrate NanoKnow's utility, we conduct experiments using eight nanochat checkpoints. Our findings show: (1) closed‑book accuracy is strongly influenced by answer frequency in the pre‑training data, (2) providing external evidence can mitigate this frequency dependence, (3) even with external evidence, models are more accurate when answers were seen during pre‑training, demonstrating that parametric and external knowledge are complementary, and (4) non‑relevant information is harmful, with accuracy decreasing based on both the position and the number of non‑relevant contexts. We release all NanoKnow artifacts at https://github.com/castorini/NanoKnow.
Authors:Yisi Liu, Nicholas Lee, Gopala Anumanchipalli
Abstract:
Voice style conversion aims to transform an input utterance to match a target speaker's timbre, accent, and emotion, with a central challenge being the disentanglement of linguistic content from style. While prior work has explored this problem, conversion quality remains limited, and real‑time voice style conversion has not been addressed. We propose StyleStream, the first streamable zero‑shot voice style conversion system that achieves state‑of‑the‑art performance. StyleStream consists of two components: a Destylizer, which removes style attributes while preserving linguistic content, and a Stylizer, a diffusion transformer (DiT) that reintroduces target style conditioned on reference speech. Robust content‑style disentanglement is enforced through text supervision and a highly constrained information bottleneck. This design enables a fully non‑autoregressive architecture, achieving real‑time voice style conversion with an end‑to‑end latency of 1 second. Samples and real‑time demo: https://berkeley‑speech‑group.github.io/StyleStream/.
Authors:Tarakanath Paipuru
Abstract:
Modern code intelligence agents operate in contexts exceeding 1 million tokens‑‑far beyond the scale where humans manually locate relevant files. Yet agents consistently fail to discover architecturally critical files when solving real‑world coding tasks. We identify the Navigation Paradox: agents perform poorly not due to context limits, but because navigation and retrieval are fundamentally distinct problems. Through 258 automated trials across 30 benchmark tasks on a production FastAPI repository, we demonstrate that graph‑based structural navigation via CodeCompass‑‑a Model Context Protocol server exposing dependency graphs‑‑achieves 99.4% task completion on hidden‑dependency tasks, a 23.2 percentage‑point improvement over vanilla agents (76.2%) and 21.2 points over BM25 retrieval (78.2%).However, we uncover a critical adoption gap: 58% of trials with graph access made zero tool calls, and agents required explicit prompt engineering to adopt the tool consistently. Our findings reveal that the bottleneck is not tool availability but behavioral alignment‑‑agents must be explicitly guided to leverage structural context over lexical heuristics. We contribute: (1) a task taxonomy distinguishing semantic‑search, structural, and hidden‑dependency scenarios; (2) empirical evidence that graph navigation outperforms retrieval when dependencies lack lexical overlap; and (3) open‑source infrastructure for reproducible evaluation of navigation tools.
Authors:Zachary Ravichandran, David Snyder, Alexander Robey, Hamed Hassani, Vijay Kumar, George J. Pappas
Abstract:
Robots are increasingly operating in open‑world environments where safe behavior depends on context: the same hallway may require different navigation strategies when crowded versus empty, or during an emergency versus normal operations. Traditional safety approaches enforce fixed constraints in user‑specified contexts, limiting their ability to handle the open‑ended contextual variability of real‑world deployment. We address this gap via CORE, a safety framework that enables online contextual reasoning, grounding, and enforcement without prior knowledge of the environment (e.g., maps or safety specifications). CORE uses a vision‑language model (VLM) to continuously reason about context‑dependent safety rules directly from visual observations, grounds these rules in the physical environment, and enforces the resulting spatially‑defined safe sets via control barrier functions. We provide probabilistic safety guarantees for CORE that account for perceptual uncertainty, and we demonstrate through simulation and real‑world experiments that CORE enforces contextually appropriate behavior in unseen environments, significantly outperforming prior semantic safety methods that lack online contextual reasoning. Ablation studies validate our theoretical guarantees and underscore the importance of both VLM‑based reasoning and spatial grounding for enforcing contextual safety in novel settings. We provide additional resources at https://zacravichandran.github.io/CORE.
Authors:Blaž Rolih, Matic Fučka, Filip Wolf, Luka Čehovin Zajc
Abstract:
Unsupervised change detection (UCD) in remote sensing aims to localise semantic changes between two images of the same region without relying on labelled data during training. Most recent approaches rely either on frozen foundation models in a training‑free manner or on training with synthetic changes generated in pixel space. Both strategies inherently rely on predefined assumptions about change types, typically introduced through handcrafted rules, external datasets, or auxiliary generative models. Due to these assumptions, such methods fail to generalise beyond a few change types, limiting their real‑world usage, especially in rare or complex scenarios. To address this, we propose MaSoN (Make Some Noise), an end‑to‑end UCD framework that synthesises diverse changes directly in the latent feature space during training. It generates changes that are dynamically estimated using feature statistics of target data, enabling diverse yet data‑driven variation aligned with the target domain. It also easily extends to new modalities, such as SAR. MaSoN generalises strongly across diverse change types and achieves state‑of‑the‑art performance on five benchmarks, improving the average F1 score by 14.1 percentage points. Project page: https://blaz‑r.github.io/mason_ucd
Authors:Johanna S. Fröhlich, Bastian Heinlein, Jan U. Claar, Hans Rosenberger, Vasileios Belagiannis, Ralf R. Müller
Abstract:
Explainable artificial intelligence has emerged as a promising field of research to address reliability concerns in artificial intelligence. Despite significant progress in explainable artificial intelligence, few methods provide a systematic way to visualize and understand how classes are confused and how their relationships evolve as training progresses. In this work, we present GRAPHIC, an architecture‑agnostic approach that analyzes neural networks on a class level. It leverages confusion matrices derived from intermediate layers using linear classifiers. We interpret these as adjacency matrices of directed graphs, allowing tools from network science to visualize and quantify learning dynamics across training epochs and intermediate layers. GRAPHIC provides insights into linear class separability, dataset issues, and architectural behavior, revealing, for example, similarities between flatfish and man and labeling ambiguities validated in a human study. In summary, by uncovering real confusions, GRAPHIC offers new perspectives on how neural networks learn. The code is available at https://github.com/Johanna‑S‑Froehlich/GRAPHIC.
Authors:Xinyu Yuan, Xixian Liu, Ya Shi Zhang, Zuobai Zhang, Hongyu Guo, Jian Tang
Abstract:
Building Virtual Cells that can accurately simulate cellular responses to perturbations is a long‑standing goal in systems biology. A fundamental challenge is that high‑throughput single‑cell sequencing is destructive: the same cell cannot be observed both before and after a perturbation. Thus, perturbation prediction requires mapping unpaired control and perturbed populations. Existing models address this by learning maps between distributions, but typically assume a single fixed response distribution when conditioned on observed cellular context (e.g., cell type) and the perturbation type. In reality, responses vary systematically due to unobservable latent factors such as microenvironmental fluctuations and complex batch effects, forming a manifold of possible distributions for the same observed conditions. To account for this variability, we introduce PerturbDiff, which shifts modeling from individual cells to entire distributions. By embedding distributions as points in a Hilbert space, we define a diffusion‑based generative process operating directly over probability distributions. This allows PerturbDiff to capture population‑level response shifts across hidden factors. Benchmarks on established datasets show that PerturbDiff achieves state‑of‑the‑art performance in single‑cell response prediction and generalizes substantially better to unseen perturbations. See our project page (https://katarinayuan.github.io/PerturbDiff‑ProjectPage/), where code and data will be made publicly available (https://github.com/DeepGraphLearning/PerturbDiff).
Authors:Jiayu Wang, Yifei Ming, Zixuan Ke, Shafiq Joty, Aws Albarghouthi, Frederic Sala
Abstract:
Compound AI systems promise capabilities beyond those of individual models, yet their success depends critically on effective orchestration. Existing routing approaches face two limitations: (1) input‑level routers make coarse query‑level decisions that ignore evolving task requirements; (2) RL‑trained orchestrators are expensive to adapt and often suffer from routing collapse, repeatedly invoking one strong but costly option in multi‑turn scenarios. We introduce SkillOrchestra, a framework for skill‑aware orchestration. Instead of directly learning a routing policy end‑to‑end, SkillOrchestra learns fine‑grained skills from execution experience and models agent‑specific competence and cost under those skills. At deployment, the orchestrator infers the skill demands of the current interaction and selects agents that best satisfy them under an explicit performance‑cost trade‑off. Extensive experiments across ten benchmarks demonstrate that SkillOrchestra outperforms SoTA RL‑based orchestrators by up to 22.5% with 700x and 300x learning cost reduction compared to Router‑R1 and ToolOrchestra, respectively. These results show that explicit skill modeling enables scalable, interpretable, and sample‑efficient orchestration, offering a principled alternative to data‑intensive RL‑based approaches. The code is available at: https://github.com/jiayuww/SkillOrchestra.
Authors:Girmaw Abebe Tadesse, Titien Bartette, Andrew Hassanali, Allen Kim, Jonathan Chemla, Andrew Zolli, Yves Ubelmann, Caleb Robinson, Inbal Becker-Reshef, Juan Lavista Ferres
Abstract:
Looting at archaeological sites poses a severe risk to cultural heritage, yet monitoring thousands of remote locations remains operationally difficult. We present a scalable and satellite‑based pipeline to detect looted archaeological sites, using PlanetScope monthly mosaics (4.7m/pixel) and a curated dataset of 1,943 archaeological sites in Afghanistan (898 looted, 1,045 preserved) with multi‑year imagery (2016‑‑2023) and site‑footprint masks. We compare (i) end‑to‑end CNN classifiers trained on raw RGB patches and (ii) traditional machine learning (ML) trained on handcrafted spectral/texture features and embeddings from recent remote‑sensing foundation models. Results indicate that ImageNet‑pretrained CNNs combined with spatial masking reach an F1 score of 0.926, clearly surpassing the strongest traditional ML setup, which attains an F1 score of 0.710 using SatCLIP‑V+RF+Mean, i.e., location and vision embeddings fed into a Random Forest with mean‑based temporal aggregation. Ablation studies demonstrate that ImageNet pretraining (even in the presence of domain shift) and spatial masking enhance performance. In contrast, geospatial foundation model embeddings perform competitively with handcrafted features, suggesting that looting signatures are extremely localized. The repository is available at https://github.com/microsoft/looted_site_detection.
Authors:Hanwen Liu, Saierdaer Yusuyin, Hao Huang, Zhijian Ou
Abstract:
Large‑language‑model (LLM)‑based text‑to‑speech (TTS) systems can generate natural speech, but most are not designed for low‑latency dual‑streaming synthesis. High‑quality dual‑streaming TTS depends on accurate text‑‑speech alignment and well‑designed training sequences that balance synthesis quality and latency. Prior work often relies on GMM‑HMM based forced‑alignment toolkits (e.g., MFA), which are pipeline‑heavy and less flexible than neural aligners; fixed‑ratio interleaving of text and speech tokens struggles to capture text‑‑speech alignment regularities. We propose CTC‑TTS, which replaces MFA with a CTC based aligner and introduces a bi‑word based interleaving strategy. Two variants are designed: CTC‑TTS‑L (token concatenation along the sequence length) for higher quality and CTC‑TTS‑F (embedding stacking along the feature dimension) for lower latency. Experiments show that CTC‑TTS outperforms fixed‑ratio interleaving and MFA‑based baselines on streaming synthesis and zero‑shot tasks. Speech samples are available at https://ctctts.github.io/.
Authors:Chongyang Gao, Diji Yang, Shuyan Zhou, Xichen Yan, Luchuan Song, Shuo Li, Kezhen Chen
Abstract:
We introduce CFE‑Bench (Classroom Final Exam), a multimodal benchmark for evaluating the reasoning capabilities of large language models across more than 20 STEM domains. CFE‑Bench is curated from repeatedly used, authentic university homework and exam problems, paired with reference solutions provided by course instructors. CFE‑Bench remains challenging for frontier models: the newly released Gemini‑3.1‑pro‑preview achieves 59.69% overall accuracy, while the second‑best model, Gemini‑3‑flash‑preview, reaches 55.46%, leaving substantial room for improvement. Beyond aggregate scores, we conduct a diagnostic analysis by decomposing instructor reference solutions into structured reasoning flows. We find that while frontier models often answer intermediate sub‑questions correctly, they struggle to reliably derive and maintain correct intermediate states throughout multi‑step solutions. We further observe that model‑generated solutions typically contain more reasoning steps than instructor solutions, indicating lower step efficiency and a higher risk of error accumulation. Data and code are available at https://github.com/Analogy‑AI/CFE_Bench.
Authors:Arjun Chatterjee, Sayeed Sajjad Razin, John Wu, Siddhartha Laghuvarapu, Jathurshan Pradeepkumar, Jimeng Sun
Abstract:
Quantifying uncertainty in clinical predictions is critical for high‑stakes diagnosis tasks. Conformal prediction offers a principled approach by providing prediction sets with theoretical coverage guarantees. However, in practice, patient distribution shifts violate the i.i.d. assumptions underlying standard conformal methods, leading to poor coverage in healthcare settings. In this work, we evaluate several conformal prediction approaches on EEG seizure classification, a task with known distribution shift challenges and label uncertainty. We demonstrate that personalized calibration strategies can improve coverage by over 20 percentage points while maintaining comparable prediction set sizes. Our implementation is available via PyHealth, an open‑source healthcare AI framework: https://github.com/sunlabuiuc/PyHealth.
Authors:Pao-Hsiung Chiu, Jian Cheng Wong, Chin Chun Ooi, Chang Wei, Yuchen Fan, Yew-Soon Ong
Abstract:
Physics‑informed neural networks (PINNs) have emerged as a promising mesh‑free paradigm for solving partial differential equations, yet adoption in science and engineering is limited by slow training and modest accuracy relative to modern numerical solvers. We introduce the Sequential Correction Algorithm for Learning Efficient PINN (Scale‑PINN), a learning strategy that bridges modern physics‑informed learning with numerical algorithms. Scale‑PINN incorporates the iterative residual‑correction principle, a cornerstone of numerical solvers, directly into the loss formulation, marking a paradigm shift in how PINN losses can be conceived and constructed. This integration enables Scale‑PINN to achieve unprecedented convergence speed across PDE problems from different physics domain, including reducing training time on a challenging fluid‑dynamics problem for state‑of‑the‑art PINN from hours to sub‑2 minutes while maintaining superior accuracy, and enabling application to representative problems in aerodynamics and urban science. By uniting the rigor of numerical methods with the flexibility of deep learning, Scale‑PINN marks a significant leap toward the practical adoption of PINNs in science and engineering through scalable, physics‑informed learning. Codes are available at https://github.com/chiuph/SCALE‑PINN.
Authors:Guoliang Gong, Man Yu
Abstract:
The image purification strategy constructs an intermediate distribution with aligned anatomical structures, which effectively corrects the spatial misalignment between real‑world ultra‑low‑dose CT and normal‑dose CT images and significantly enhances the structural preservation ability of denoising models. However, this strategy exhibits two inherent limitations. First, it suppresses noise only in the chest wall and bone regions while leaving the image background untreated. Second, it lacks a dedicated mechanism for denoising the lung parenchyma. To address these issues, we systematically redesign the original image purification strategy and propose an improved version termed IPv2. The proposed strategy introduces three core modules, namely Remove Background, Add noise, and Remove noise. These modules endow the model with denoising capability in both background and lung tissue regions during training data construction and provide a more reasonable evaluation protocol through refined label construction at the testing stage. Extensive experiments on our previously established real‑world patient lung CT dataset acquired at 2% radiation dose demonstrate that IPv2 consistently improves background suppression and lung parenchyma restoration across multiple mainstream denoising models. The code is publicly available at https://github.com/MonkeyDadLufy/Image‑Purification‑Strategy‑v2.
Authors:Zunkai Dai, Ke Li, Jiajia Liu, Jie Yang, Yuanyuan Qiao
Abstract:
The collection and detection of video anomaly data has long been a challenging problem due to its rare occurrence and spatio‑temporal scarcity. Existing video anomaly detection (VAD) methods under perform in open‑world scenarios. Key contributing factors include limited dataset diversity, and inadequate understanding of context‑dependent anomalous semantics. To address these issues, i) we propose LAVIDA, an end‑to‑end zero‑shot video anomaly detection framework. ii) LAVIDA employs an Anomaly Exposure Sampler that transforms segmented objects into pseudo‑anomalies to enhance model adaptability to unseen anomaly categories. It further integrates a Multimodal Large Language Model (MLLM) to bolster semantic comprehension capabilities. Additionally, iii) we design a token compression approach based on reverse attention to handle the spatio‑temporal scarcity of anomalous patterns and decrease computational cost. The training process is conducted solely on pseudo anomalies without any VAD data. Evaluations across four benchmark VAD datasets demonstrate that LAVIDA achieves SOTA performance in both frame‑level and pixel‑level anomaly detection under the zero‑shot setting. Our code is available in https://github.com/VitaminCreed/LAVIDA.
Authors:Yangyi Fang, Jiaye Lin, Xiaoliang Fu, Cong Qin, Haolin Shi, Chang Liu, Peilin Zhao
Abstract:
Multi‑turn LLM agents are becoming pivotal to production systems, spanning customer service automation, e‑commerce assistance, and interactive task management, where accurately distinguishing high‑value informative signals from stochastic noise is critical for sample‑efficient training. In real‑world scenarios, a failure in a trivial task may reflect random instability, whereas success in a high‑difficulty task signifies a genuine capability breakthrough. Yet, existing group‑based policy optimization methods rigidly rely on statistical deviation within discrete batches, frequently misallocating credit when task difficulty fluctuates. To address this issue, we propose Proximity‑based Multi‑turn Optimization (ProxMO), a practical and robust framework engineered specifically for the constraints of real‑world deployment. ProxMO integrates global context via two lightweight mechanisms: success‑rate‑aware modulation dynamically adapts gradient intensity based on episode‑level difficulty, while proximity‑based soft aggregation derives baselines through continuous semantic weighting at the step level. Extensive evaluations on ALFWorld and WebShop benchmarks demonstrate that ProxMO yields substantial performance gains over existing baselines with negligible computational cost. Ablation studies further validate the independent and synergistic efficacy of both mechanisms. Crucially, ProxMO offers plug‑and‑play compatibility with standard GRPO frameworks, facilitating immediate, low‑friction adoption in existing industrial training pipelines. Our implementation is available at: \hrefhttps://anonymous.4open.science/r/proxmo‑B7E7/README.mdhttps://anonymous.4open.science/r/proxmo.
Authors:Yangyi Fang, Jiaye Lin, Xiaoliang Fu, Cong Qin, Haolin Shi, Chaowen Hu, Lu Pan, Ke Zeng, Xunliang Cai
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for Large Language Model (LLM) reasoning, yet current methods face key challenges in resource allocation and policy optimization dynamics: (i) uniform rollout allocation ignores gradient variance heterogeneity across problems, and (ii) the softmax policy structure causes gradient attenuation for high‑confidence correct actions, while excessive gradient updates may destabilize training. Therefore, we propose DynaMO, a theoretically‑grounded dual‑pronged optimization framework. At the sequence level, we prove that uniform allocation is suboptimal and derive variance‑minimizing allocation from the first principle, establishing Bernoulli variance as a computable proxy for gradient informativeness. At the token level, we develop gradient‑aware advantage modulation grounded in theoretical analysis of gradient magnitude bounds. Our framework compensates for gradient attenuation of high‑confidence correct actions while utilizing entropy changes as computable indicators to stabilize excessive update magnitudes. Extensive experiments conducted on a diverse range of mathematical reasoning benchmarks demonstrate consistent improvements over strong RLVR baselines. Our implementation is available at: \hrefhttps://anonymous.4open.science/r/dynamo‑680E/README.mdhttps://anonymous.4open.science/r/dynamo.
Authors:Saba Kublashvili
Abstract:
I introduce Virtual Parameter Sharpening (VPS), an inference‑time technique that augments frozen transformer linear layers with dynamic, activation‑conditioned low‑rank perturbations. Unlike parameter‑efficient fine‑tuning methods such as LoRA, which learn static low‑rank adapters, VPS constructs its perturbation factors on the fly from batch activation statistics and optional gradient signals, enabling test‑time adaptation without persistent parameter updates. The perturbation takes the form Delta W = gamma W^T V U^T W, where selector matrices U and V are constructed via sparse activation‑guided selection or Sylvester‑coupled regression. We provide a theoretical analysis of the perturbation's spectral properties and describe an adaptive policy system that modulates perturbation magnitude based on activation energy and token‑level entropy. This system incorporates multi‑objective verification with iterative refinement for tasks with ground‑truth supervision. We present the complete algorithmic framework, analyze its mathematical foundations, and discuss the mechanisms by which activation‑conditioned computation may enhance reasoning capabilities in large language models. Implementation and experimental code are available at https://github.com/Saba‑Kublashvili/vps‑virtual‑parameter‑synthesis .
Authors:Zheng Miao, Tien-Chieh Hung
Abstract:
Accurate sex identification in fish is vital for optimizing breeding and management strategies in aquaculture, particularly for species at the risk of extinction. However, most existing methods are invasive or stressful and may cause additional mortality, posing severe risks to threatened or endangered fish populations. To address these challenges, we propose FishProtoNet, a robust, non‑invasive computer vision‑based framework for sex identification of delta smelt (Hypomesus transpacificus), an endangered fish species native to California, across its full life cycle. Unlike the traditional deep learning methods, FishProtoNet provides interpretability through learned prototype representations while improving robustness by leveraging foundation models to reduce the influence of background noise. Specifically, the FishProtoNet framework consists of three key components: fish regions of interest (ROIs) extraction using visual foundation model, feature extraction from fish ROIs and fish sex identification based on an interpretable prototype network. FishProtoNet demonstrates strong performance in delta smelt sex identification during early spawning and post‑spawning stages, achieving the accuracies of 74.40% and 81.16% and corresponding F1 scores of 74.27% and 79.43% respectively. In contrast, delta smelt sex identification at the subadult stage remains challenging for current computer vision methods, likely due to less pronounced morphological differences in immature fish. The source code of FishProtoNet is publicly available at: https://github.com/zhengmiao1/Fish_sex_identification
Authors:Xiaochuan Li, Ryan Ming, Pranav Setlur, Abhijay Paladugu, Andy Tang, Hao Kang, Shuai Shao, Rong Jin, Chenyan Xiong
Abstract:
LLM agents are increasingly expected to function as general‑purpose systems capable of resolving open‑ended user requests. While existing benchmarks focus on domain‑aware environments for developing specialized agents, evaluating general‑purpose agents requires more realistic settings that challenge them to operate across multiple skills and tools within a unified environment. We introduce General AgentBench, a benchmark that provides such a unified framework for evaluating general LLM agents across search, coding, reasoning, and tool‑use domains. Using General AgentBench, we systematically study test‑time scaling behaviors under sequential scaling (iterative interaction) and parallel scaling (sampling multiple trajectories). Evaluation of ten leading LLM agents reveals a substantial performance degradation when moving from domain‑specific evaluations to this general‑agent setting. Moreover, we find that neither scaling methodology yields effective performance improvements in practice, due to two fundamental limitations: context ceiling in sequential scaling and verification gap in parallel scaling. Code is publicly available at https://github.com/cxcscmu/General‑AgentBench.
Authors:Kun Ding, Jian Xu, Ying Wang, Peipei Yang, Shiming Xiang
Abstract:
Infrared radiation computing underpins advances in climate science, remote sensing and spectroscopy but remains constrained by manual workflows. We introduce InfEngine, an autonomous intelligent computational engine designed to drive a paradigm shift from human‑led orchestration to collaborative automation. It integrates four specialized agents through two core innovations: self‑verification, enabled by joint solver‑evaluator debugging, improves functional correctness and scientific plausibility; self‑optimization, realized via evolutionary algorithms with self‑discovered fitness functions, facilitates autonomous performance optimization. Evaluated on InfBench with 200 infrared‑specific tasks and powered by InfTools with 270 curated tools, InfEngine achieves a 92.7% pass rate and delivers workflows 21x faster than manual expert effort. More fundamentally, it illustrates how researchers can transition from manual coding to collaborating with self‑verifying, self‑optimizing computational partners. By generating reusable, verified and optimized code, InfEngine transforms computational workflows into persistent scientific assets, accelerating the cycle of scientific discovery. Code: https://github.com/kding1225/infengine
Authors:Tianyu Fan, Fengji Zhang, Yuxiang Zheng, Bei Chen, Xinyao Niu, Chengen Huang, Junyang Lin, Chao Huang
Abstract:
The application of Large Language Models (LLMs) in accelerating scientific discovery has garnered increasing attention, with a key focus on constructing research agents endowed with innovative capability, i.e., the ability to autonomously generate novel and significant research ideas. Existing approaches predominantly rely on sophisticated prompt engineering and lack a systematic training paradigm. To address this, we propose DeepInnovator, a training framework designed to trigger the innovative capability of LLMs. Our approach comprises two core components. (1) ``Standing on the shoulders of giants''. We construct an automated data extraction pipeline to extract and organize structured research knowledge from a vast corpus of unlabeled scientific literature. (2) ``Conjectures and refutations''. We introduce a ``Next Idea Prediction'' training paradigm, which models the generation of research ideas as an iterative process of continuously predicting, evaluating, and refining plausible and novel next idea. Both automatic and expert evaluations demonstrate that our DeepInnovator‑14B significantly outperforms untrained baselines, achieving win rates of 80.53%‑93.81%, and attains performance comparable to that of current leading LLMs. This work provides a scalable training pathway toward building research agents with genuine, originative innovative capability, and will open‑source the dataset to foster community advancement. Source code and data are available at: https://github.com/HKUDS/DeepInnovator.
Authors:Hoang-Loc Cao, Phuc Ho, Truong Thanh Hung Nguyen, Phuc Truong Loc Nguyen, Dinh Thien Loc Nguyen, Hung Cao
Abstract:
Legal reasoning requires not only high accuracy but also the ability to justify decisions through verifiable and contestable arguments. However, existing Large Language Model (LLM) approaches, such as Chain‑of‑Thought (CoT) and Retrieval‑Augmented Generation (RAG), often produce unstructured explanations that lack a formal mechanism for verification or user intervention. To address this limitation, we propose Adaptive Collaboration of Argumentative LLMs (ACAL), a neuro‑symbolic framework that integrates adaptive multi‑agent collaboration with an Arena‑based Quantitative Bipolar Argumentation Framework (A‑QBAF). ACAL dynamically deploys expert agent teams to construct arguments, employs a clash resolution mechanism to adjudicate conflicting claims, and utilizes uncertainty‑aware escalation for borderline cases. Crucially, our framework supports a Human‑in‑the‑Loop (HITL) contestability workflow, enabling users to directly audit and modify the underlying reasoning graph to influence the final judgment. Empirical evaluations on the LegalBench benchmark demonstrate that ACAL outperforms strong baselines across Gemini‑2.5‑Flash‑Lite and Gemini‑2.5‑Flash architectures, effectively balancing efficient predictive performance with structured transparency and contestability. Our implementation is available at: https://github.com/loc110504/ACAL.
Authors:Zhenkun Gao, Xuhong Wang, Xin Tan, Yuan Xie
Abstract:
Multimodal Large Language Models (MLLMs), particularly smaller, deployable variants, exhibit a critical deficiency in understanding temporal and procedural visual data, a bottleneck hindering their application in real‑world embodied AI. This gap is largely caused by a systemic failure in training paradigms, which lack large‑scale, procedurally coherent data. To address this problem, we introduce TPRU, a large‑scale dataset sourced from diverse embodied scenarios such as robotic manipulation and GUI navigation. TPRU is systematically designed to cultivate temporal reasoning through three complementary tasks: Temporal Reordering, Next‑Frame Prediction, and Previous‑Frame Review. A key feature is the inclusion of challenging negative samples, compelling models to transition from passive observation to active, cross‑modal validation. We leverage TPRU with a reinforcement learning (RL) fine‑tuning methodology, specifically targeting the enhancement of resource‑efficient models. Experiments show our approach yields dramatic gains: on our manually curated TPRU‑Test, the accuracy of TPRU‑7B soars from 50.33% to 75.70%, a state‑of‑the‑art result that significantly outperforms vastly larger baselines, including GPT‑4o. Crucially, these capabilities generalize effectively, demonstrating substantial improvements on established benchmarks. The codebase is available at https://github.com/Stephen‑gzk/TPRU/ .
Authors:Ziheng Chen, Bernhard Schölkopf, Nicu Sebe
Abstract:
Hyperbolic spaces provide a natural geometry for representing hierarchical and tree‑structured data due to their exponential volume growth. To leverage these benefits, neural networks require intrinsic and efficient components that operate directly in hyperbolic space. In this work, we lift two core components of neural networks, Multinomial Logistic Regression (MLR) and Fully Connected (FC) layers, into hyperbolic space via Busemann functions, resulting in Busemann MLR (BMLR) and Busemann FC (BFC) layers with a unified mathematical interpretation. BMLR provides compact parameters, a point‑to‑horosphere distance interpretation, batch‑efficient computation, and a Euclidean limit, while BFC generalizes FC and activation layers with comparable complexity. Experiments on image classification, genome sequence learning, node classification, and link prediction demonstrate improvements in effectiveness and efficiency over prior hyperbolic layers. The code is available at https://github.com/GitZH‑Chen/HBNN.
Authors:Aditya Kumar Singh, Hitesh Kandala, Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum
Abstract:
Vision‑language models (VLMs) have achieved remarkable multimodal understanding and reasoning capabilities, yet remain computationally expensive due to dense visual tokenization. Existing efficiency approaches either merge redundant visual tokens or drop them progressively in language backbone, often trading accuracy for speed. In this work, we propose DUET‑VLM, a versatile plug‑and‑play dual compression framework that consists of (a) vision‑only redundancy aware compression of vision encoder's output into information‑preserving tokens, followed by (b) layer‑wise, salient text‑guided dropping of visual tokens within the language backbone to progressively prune less informative tokens. This coordinated token management enables aggressive compression while retaining critical semantics. On LLaVA‑1.5‑7B, our approach maintains over 99% of baseline accuracy with 67% fewer tokens, and still retains >97% even at 89% reduction. With this dual‑stage compression during training, it achieves 99.7% accuracy at 67% and 97.6% at 89%, surpassing prior SoTA visual token reduction methods across multiple benchmarks. When integrated into Video‑LLaVA‑7B, it even surpasses the baseline ‑‑ achieving >100% accuracy with a substantial 53.1% token reduction and retaining 97.6% accuracy under an extreme 93.4% setting. These results highlight end‑to‑end training with DUET‑VLM, enabling robust adaptation to reduced visual (image/video) input without sacrificing accuracy, producing compact yet semantically rich representations within the same computational budget. Our code is available at https://github.com/AMD‑AGI/DUET‑VLM.
Authors:Osman Onur Kuzucu, Tunca Doğan
Abstract:
Understanding disease‑gene associations is essential for unravelling disease mechanisms and advancing diagnostics and therapeutics. Traditional approaches based on manual curation and literature review are labour‑intensive and not scalable, prompting the use of machine learning on large biomedical data. In particular, graph neural networks (GNNs) have shown promise for modelling complex biological relationships. To address limitations in existing models, we propose GLaDiGAtor (Graph Learning‑bAsed DIsease‑Gene AssociaTiOn pRediction), a novel GNN framework with an encoder‑decoder architecture for disease‑gene association prediction. GLaDiGAtor constructs a heterogeneous biological graph integrating gene‑gene, disease‑disease, and gene‑disease interactions from curated databases, and enriches each node with contextual features from well‑known language models (ProtT5 for protein sequences and BioBERT for disease text). In evaluations, our model achieves superior predictive accuracy and generalisation, outperforming 14 existing methods. Literature‑supported case studies confirm the biological relevance of high‑confidence novel predictions, highlighting GLaDiGAtor's potential to discover candidate disease genes. These results underscore the power of graph convolutional networks in biomedical informatics and may ultimately facilitate drug discovery by revealing new gene‑disease links. The source code and processed datasets are publicly available at https://github.com/HUBioDataLab/GLaDiGAtor.
Authors:Haobo Lin, Tianyi Bai, Jiajun Zhang, Xuanhao Chang, Sheng Lu, Fangming Gu, Zengjie Hu, Wentao Zhang
Abstract:
Facial Expression Recognition (FER) is a fine‑grained visual understanding task where reliable predictions require reasoning over localized and meaningful facial cues. Recent vision‑‑language models (VLMs) enable natural language explanations for FER, but their reasoning is often ungrounded, producing fluent yet unverifiable rationales that are weakly tied to visual evidence and prone to hallucination, leading to poor robustness across different datasets. We propose TAG (Thinking with Action Unit Grounding), a vision‑‑language framework that explicitly constrains multimodal reasoning to be supported by facial Action Units (AUs). TAG requires intermediate reasoning steps to be grounded in AU‑related facial regions, yielding predictions accompanied by verifiable visual evidence. The model is trained via supervised fine‑tuning on AU‑grounded reasoning traces followed by reinforcement learning with an AU‑aware reward that aligns predicted regions with external AU detectors. Evaluated on RAF‑DB, FERPlus, and AffectNet, TAG consistently outperforms strong open‑source and closed‑source VLM baselines while simultaneously improving visual faithfulness. Ablation and preference studies further show that AU‑grounded rewards stabilize reasoning and mitigate hallucination, demonstrating the importance of structured grounded intermediate representations for trustworthy multimodal reasoning in FER. The code will be available at https://github.com/would1920/FER_TAG .
Authors:Haobo Lin, Tianyi Bai, Chen Chen, Jiajun Zhang, Bohan Zeng, Wentao Zhang, Binhang Yuan
Abstract:
Multimodal geometry reasoning requires models to jointly understand visual diagrams and perform structured symbolic inference, yet current vision‑‑language models struggle with complex geometric constructions due to limited training data and weak visual‑‑symbolic alignment. We propose a pipeline for synthesizing complex multimodal geometry problems from scratch and construct a dataset named GeoCode, which decouples problem generation into symbolic seed construction, grounded instantiation with verification, and code‑based diagram rendering, ensuring consistency across structure, text, reasoning, and images. Leveraging the plotting code provided in GeoCode, we further introduce code prediction as an explicit alignment objective, transforming visual understanding into a supervised structured prediction task. GeoCode exhibits substantially higher structural complexity and reasoning difficulty than existing benchmarks, while maintaining mathematical correctness through multi‑stage validation. Extensive experiments show that models trained on GeoCode achieve consistent improvements on multiple geometry benchmarks, demonstrating both the effectiveness of the dataset and the proposed alignment strategy. The code will be available at https://github.com/would1920/GeoCode.
Authors:Seungku Kim, Suhyeok Jang, Byungjun Yoon, Dongyoung Kim, John Won, Jinwoo Shin
Abstract:
Synthetic data generated by video generative models has shown promise for robot learning as a scalable pipeline, but it often suffers from inconsistent action quality due to imperfectly generated videos. Recently, vision‑language models (VLMs) have been leveraged to validate video quality, but they have limitations in distinguishing physically accurate videos and, even then, cannot directly evaluate the generated actions themselves. To tackle this issue, we introduce RoboCurate, a novel synthetic robot data generation framework that evaluates and filters the quality of annotated actions by comparing them with simulation replay. Specifically, RoboCurate replays the predicted actions in a simulator and assesses action quality by measuring the consistency of motion between the simulator rollout and the generated video. In addition, we unlock observation diversity beyond the available dataset via image‑to‑image editing and apply action‑preserving video‑to‑video transfer to further augment appearance. We observe RoboCurate's generated data yield substantial relative improvements in success rates compared to using real data only, achieving +70.1% on GR‑1 Tabletop (300 demos), +16.1% on DexMimicGen in the pre‑training setup, and +179.9% in the challenging real‑world ALLEX humanoid dexterous manipulation setting.
Authors:Xiao-Ming Wu, Bin Fan, Kang Liao, Jian-Jian Jiang, Runze Yang, Yihang Luo, Zhonghua Wu, Wei-Shi Zheng, Chen Change Loy
Abstract:
Following the rise of large foundation models, Vision‑Language‑Action models (VLAs) emerged, leveraging strong visual and language understanding for general‑purpose policy learning. Yet, the current VLA landscape remains fragmented and exploratory. Although many groups have proposed their own VLA models, inconsistencies in training protocols and evaluation settings make it difficult to identify which design choices truly matter. To bring structure to this evolving space, we reexamine the VLA design space under a unified framework and evaluation setup. Starting from a simple VLA baseline similar to RT‑2 and OpenVLA, we systematically dissect design choices along three dimensions: foundational components, perception essentials, and action modelling perspectives. From this study, we distill 12 key findings that together form a practical recipe for building strong VLA models. The outcome of this exploration is a simple yet effective model, VLANeXt. VLANeXt outperforms prior state‑of‑the‑art methods on the LIBERO and LIBERO‑plus benchmarks and demonstrates strong generalization in real‑world experiments. We will release a unified, easy‑to‑use codebase that serves as a common platform for the community to reproduce our findings, explore the design space, and build new VLA variants on top of a shared foundation.
Authors:Suraj Ranganath
Abstract:
Enterprises rely on RDF knowledge graphs and SPARQL to expose operational data through natural language interfaces, yet public KGQA benchmarks do not reflect proprietary schemas, prefixes, or query distributions. We present PIPE‑RDF, a three‑phase pipeline that constructs schema‑specific NL‑SPARQL benchmarks using reverse querying, category‑balanced template generation, retrieval‑augmented prompting, deduplication, and execution‑based validation with repair. We instantiate PIPE‑RDF on a fixed‑schema company‑location slice (5,000 companies) derived from public RDF data and generate a balanced benchmark of 450 question‑SPARQL pairs across nine categories. The pipeline achieves 100% parse and execution validity after repair, with pre‑repair validity rates of 96.5%‑100% across phases. We report entity diversity metrics, template coverage analysis, and cost breakdowns to support deployment planning. We release structured artifacts (CSV/JSONL, logs, figures) and operational metrics to support model evaluation and system planning in real‑world settings. Code is available at https://github.com/suraj‑ranganath/PIPE‑RDF.
Authors:Yanlin Zhang, Linjie Xu, Quan Gan, David Wipf, Minjie Wang
Abstract:
Recent advances in tabular in‑context learning (ICL) show that a single pretrained model can adapt to new prediction tasks from a small set of labeled examples, avoiding per‑task training and heavy tuning. However, many real‑world tasks live in relational databases, where predictive signal is spread across multiple linked tables rather than a single flat table. We show that tabular ICL can be extended to relational prediction with a simple recipe: automatically featurize each target row using relational aggregations over its linked records, materialize the resulting augmented table, and run an off‑the‑shelf tabular foundation model on it. We package this approach in RDBLearn (https://github.com/HKUSHXLab/rdblearn), an easy‑to‑use toolkit with a scikit‑learn‑style estimator interface that makes it straightforward to swap different tabular ICL backends; a complementary agent‑specific interface is provided as well. Across a broad collection of RelBench and 4DBInfer datasets, RDBLearn is the best‑performing foundation model approach we evaluate, at times even outperforming strong supervised baselines trained or fine‑tuned on each dataset.
Authors:Guoqi Yu, Juncheng Wang, Chen Yang, Jing Qin, Angelica I. Aviles-Rivero, Shujun Wang
Abstract:
Accurate analysis of medical time series (MedTS) data, such as electroencephalography (EEG) and electrocardiography (ECG), plays a pivotal role in healthcare applications, including the diagnosis of brain and heart diseases. MedTS data typically exhibit two critical patterns: temporal dependencies within individual channels and channel dependencies across multiple channels. While recent advances in deep learning have leveraged Transformer‑based models to effectively capture temporal dependencies, they often struggle with modeling channel dependencies. This limitation stems from a structural mismatch: MedTS signals are inherently centralized, whereas the Transformer's attention mechanism is decentralized, making it less effective at capturing global synchronization and unified waveform patterns. To address this mismatch, we propose CoTAR (Core Token Aggregation‑Redistribution), a centralized MLP‑based module designed to replace decentralized attention. Instead of allowing all tokens to interact directly, as in standard attention, CoTAR introduces a global core token that serves as a proxy to facilitate inter‑token interactions, thereby enforcing a centralized aggregation and redistribution strategy. This design not only better aligns with the centralized nature of MedTS signals but also reduces computational complexity from quadratic to linear. Experiments on five benchmarks validate the superiority of our method in both effectiveness and efficiency, achieving up to a 12.13% improvement on the APAVA dataset, while using only 33% of the memory and 20% of the inference time compared to the previous state of the art. Code and all training scripts are available at https://github.com/Levi‑Ackman/TeCh.
Authors:Xiaoyan Bai, Alexander Baumgartner, Haojia Sun, Ari Holtzman, Chenhao Tan
Abstract:
Reproducibility crises across sciences highlight the limitations of the paper‑centric review system in assessing the rigor and reproducibility of research. AI agents that autonomously design and generate large volumes of research outputs exacerbate these challenges. In this work, we address the growing challenges of scalability and rigor by flipping the dynamic and developing AI agents as research evaluators. We propose the first execution‑grounded evaluation framework that verifies research beyond narrative review by examining code and data alongside the paper. We use mechanistic interpretability research as a testbed, build standardized research output, and develop MechEvalAgent, an automated evaluation framework that assesses the coherence of the experimental process, the reproducibility of results, and the generalizability of findings. We show that our framework achieves above 80% agreement with human judges, identifies substantial methodological problems, and surfaces 51 additional issues that human reviewers miss. Our work demonstrates the potential of AI agents to transform research evaluation and pave the way for rigorous scientific practices.
Authors:Finn van der Knaap, Kejiang Qian, Zheng Xu, Fengxiang He
Abstract:
This work studies heterogeneous Multi‑Objective Reinforcement Learning (MORL), where objectives can differ sharply in temporal frequency. Such heterogeneity allows dense objectives to dominate learning, while sparse long‑horizon rewards receive weak credit assignment, leading to poor sample efficiency. We propose a Parallel Reward Integration with Symmetry (PRISM) algorithm that enforces reflectional symmetry as an inductive bias in aligning reward channels. PRISM introduces ReSymNet, a theory‑motivated model that reconciles temporal‑frequency mismatches across objectives, using residual blocks to learn a scaled opportunity value that accelerates exploration while preserving the optimal policy. We also propose SymReg, a reflectional equivariance regulariser that enforces agent mirroring and constrains policy search to a reflection‑equivariant subspace. This restriction provably reduces hypothesis complexity and improves generalisation. Across MuJoCo benchmarks, PRISM consistently outperforms both a sparse‑reward baseline and an oracle trained with full dense rewards, improving Pareto coverage and distributional balance: it achieves hypervolume gains exceeding 100% over the baseline and up to 32% over the oracle. The code is at \hrefhttps://github.com/EVIEHub/PRISMhttps://github.com/EVIEHub/PRISM.
Authors:Aaron Louis Eidt, Nils Feldhus
Abstract:
While mechanistic interpretability has developed powerful tools to analyze the internal workings of Large Language Models (LLMs), their complexity has created an accessibility gap, limiting their use to specialists. We address this challenge by designing, building, and evaluating ELIA (Explainable Language Interpretability Analysis), an interactive web application that simplifies the outcomes of various language model component analyses for a broader audience. The system integrates three key techniques ‑‑ Attribution Analysis, Function Vector Analysis, and Circuit Tracing ‑‑ and introduces a novel methodology: using a vision‑language model to automatically generate natural language explanations (NLEs) for the complex visualizations produced by these methods. The effectiveness of this approach was empirically validated through a mixed‑methods user study, which revealed a clear preference for interactive, explorable interfaces over simpler, static visualizations. A key finding was that the AI‑powered explanations helped bridge the knowledge gap for non‑experts; a statistical analysis showed no significant correlation between a user's prior LLM experience and their comprehension scores, suggesting that the system reduced barriers to comprehension across experience levels. We conclude that an AI system can indeed simplify complex model analyses, but its true power is unlocked when paired with thoughtful, user‑centered design that prioritizes interactivity, specificity, and narrative guidance.
Authors:Lexiang Tang, Weihao Gao, Bingchen Zhao, Lu Ma, Qiao jin, Bang Yang, Yuexian Zou
Abstract:
Recent work on test‑time scaling for large language model (LLM) reasoning typically assumes that allocating more inference‑time computation uniformly improves correctness. However, prior studies show that reasoning uncertainty is highly localized: a small subset of low‑confidence tokens disproportionately contributes to reasoning errors and unnecessary output expansion. Motivated by this observation, we propose Thinking by Subtraction, a confidence‑driven contrastive decoding approach that improves reasoning reliability through targeted token‑level intervention. Our method, Confidence‑Driven Contrastive Decoding, detects low‑confidence tokens during decoding and intervenes selectively at these positions. It constructs a contrastive reference by replacing high‑confidence tokens with minimal placeholders, and refines predictions by subtracting this reference distribution at low‑confidence locations. Experiments show that CCD significantly improves accuracy across mathematical reasoning benchmarks while substantially reducing output length, with minimal KV‑cache overhead. As a training‑free method, CCD enhances reasoning reliability through targeted low‑confidence intervention without computational redundancy. Our code will be made available at: https://github.com/bolo‑web/CCD.
Authors:Jorge Carrasco Pollo, Ioannis Kapetangeorgis, Joshua Rosenthal, John Hua Yao
Abstract:
Large Language Models (LLMs) demonstrate significant potential in multi‑agent negotiation tasks, yet evaluation in this domain remains challenging due to a lack of robust and generalizable benchmarks. Abdelnabi et al. (2024) introduce a negotiation benchmark based on Scoreable Games, with the aim of developing a highly complex and realistic evaluation framework for LLMs. Our work investigates the reproducibility of claims in their benchmark, and provides a deeper understanding of its usability and generalizability. We replicate the original experiments on additional models, and introduce additional metrics to verify negotiation quality and evenness of evaluation. Our findings reveal that while the benchmark is indeed complex, model comparison is ambiguous, raising questions about its objectivity. Furthermore, we identify limitations in the experimental setup, particularly in information leakage detection and thoroughness of the ablation study. By examining and analyzing the behavior of a wider range of models on an extended version of the benchmark, we reveal insights that provide additional context to potential users. Our results highlight the importance of context in model‑comparative evaluations.
Authors:Joseph Bingham, Netanel Arussy, Dvir Aran
Abstract:
Unsupervised representations are widely assumed to be neutral with respect to sensitive attributes when those attributes are withheld from training. We show that this assumption is false. Using SOMtime, a topology‑preserving representation method based on high‑capacity Self‑Organizing Maps, we demonstrate that sensitive attributes such as age and income emerge as dominant latent axes in purely unsupervised embeddings, even when explicitly excluded from the input. On two large‑scale real‑world datasets (the World Values Survey across five countries and the Census‑Income dataset), SOMtime recovers monotonic orderings aligned with withheld sensitive attributes, achieving Spearman correlations of up to 0.85, whereas PCA and UMAP typically remain below 0.23 (with a single exception reaching 0.31), and against t‑SNE and autoencoders which achieve at most 0.34. Furthermore, unsupervised segmentation of SOMtime embeddings produces demographically skewed clusters, demonstrating downstream fairness risks without any supervised task. These findings establish that fairness through unawareness fails at the representation level for ordinal sensitive attributes and that fairness auditing must extend to unsupervised components of machine learning pipelines. We have made the code available at~ https://github.com/JosephBingham/SOMtime
Authors:Guoheng Sun, Tingting Du, Kaixi Feng, Chenxiang Luo, Xingguo Ding, Zheyu Shen, Ziyao Wang, Yexiao He, Ang Li
Abstract:
Vision‑Language‑Action (VLA) models enable instruction‑following robotic manipulation, but they are typically pretrained on 2D data and lack 3D spatial understanding. An effective approach is representation alignment, where a strong vision foundation model is used to guide a 2D VLA model. However, existing methods usually apply supervision at only a single layer, failing to fully exploit the rich information distributed across depth; meanwhile, naïve multi‑layer alignment can cause gradient interference. We introduce ROCKET, a residual‑oriented multi‑layer representation alignment framework that formulates multi‑layer alignment as aligning one residual stream to another. Concretely, ROCKET employs a shared projector to align multiple layers of the VLA backbone with multiple layers of a powerful 3D vision foundation model via a layer‑invariant mapping, which reduces gradient conflicts. We provide both theoretical justification and empirical analyses showing that a shared projector is sufficient and outperforms prior designs, and further propose a Matryoshka‑style sparse activation scheme for the shared projector to balance multiple alignment losses. Our experiments show that, combined with a training‑free layer selection strategy, ROCKET requires only about 4% of the compute budget while achieving 98.5% state‑of‑the‑art success rate on LIBERO. We further demonstrate the superior performance of ROCKET across LIBERO‑Plus and RoboTwin, as well as multiple VLA models. The code and model weights can be found at https://github.com/CASE‑Lab‑UMD/ROCKET‑VLA.
Authors:Narjes Nourzad, Carlee Joe-Wong
Abstract:
Reinforcement learning (RL) agents often suffer from high sample complexity in sparse or delayed reward settings due to limited prior structure. Large language models (LLMs) can provide subgoal decompositions, plausible trajectories, and abstract priors that facilitate early learning. However, heavy reliance on LLM supervision introduces scalability constraints and dependence on potentially unreliable signals. We propose MIRA (Memory‑Integrated Reinforcement Learning Agent), which incorporates a structured, evolving memory graph to guide early training. The graph stores decision‑relevant information, including trajectory segments and subgoal structures, and is constructed from both the agent's high‑return experiences and LLM outputs. This design amortizes LLM queries into a persistent memory rather than requiring continuous real‑time supervision. From this memory graph, we derive a utility signal that softly adjusts advantage estimation to influence policy updates without modifying the underlying reward function. As training progresses, the agent's policy gradually surpasses the initial LLM‑derived priors, and the utility term decays, preserving standard convergence guarantees. We provide theoretical analysis showing that utility‑based shaping improves early‑stage learning in sparse‑reward environments. Empirically, MIRA outperforms RL baselines and achieves returns comparable to approaches that rely on frequent LLM supervision, while requiring substantially fewer online LLM queries. Project webpage: https://narjesno.github.io/MIRA/
Authors:Ziyuan Liu, Shizhao Sun, Danqing Huang, Yingdong Shi, Meisheng Zhang, Ji Li, Jingsong Yu, Jiang Bian
Abstract:
Graphic design generation demands a delicate balance between high visual fidelity and fine‑grained structural editability. However, existing approaches typically bifurcate into either non‑editable raster image synthesis or abstract layout generation devoid of visual content. Recent combinations of these two approaches attempt to bridge this gap but often suffer from rigid composition schemas and unresolvable visual dissonances (e.g., text‑background conflicts) due to their inexpressive representation and open‑loop nature. To address these challenges, we propose DesignAsCode, a novel framework that reimagines graphic design as a programmatic synthesis task using HTML/CSS. Specifically, we introduce a Plan‑Implement‑Reflect pipeline, incorporating a Semantic Planner to construct dynamic, variable‑depth element hierarchies and a Visual‑Aware Reflection mechanism that iteratively optimizes the code to rectify rendering artifacts. Extensive experiments demonstrate that DesignAsCode significantly outperforms state‑of‑the‑art baselines in both structural validity and aesthetic quality. Furthermore, our code‑native representation unlocks advanced capabilities, including automatic layout retargeting, complex document generation (e.g., resumes), and CSS‑based animation. Our project page is available at https://liuziyuan1109.github.io/design‑as‑code/.
Authors:Aidar Myrzakhan, Tianyi Li, Bowei Guo, Shengkun Tang, Zhiqiang Shen
Abstract:
Diffusion Language Models (DLMs) incur high inference cost due to iterative denoising, motivating efficient pruning. Existing pruning heuristics largely inherited from autoregressive (AR) LLMs, typically preserve attention sink tokens because AR sinks serve as stable global anchors. We show that this assumption does not hold for DLMs: the attention‑sink position exhibits substantially higher variance over the full generation trajectory (measured by how the dominant sink locations shift across timesteps), indicating that sinks are often transient and less structurally essential than in AR models. Based on this observation, we propose \bf \textttSink‑Aware Pruning, which automatically identifies and prunes unstable sinks in DLMs (prior studies usually keep sinks for AR LLMs). Without retraining, our method achieves a better quality‑efficiency trade‑off and outperforms strong prior pruning baselines under matched compute. Our code is available at https://github.com/VILA‑Lab/Sink‑Aware‑Pruning.
Authors:Juri Opitz, Corina Raclé, Emanuela Boros, Andrianos Michail, Matteo Romanello, Maud Ehrmann, Simon Clematide
Abstract:
HIPE‑2026 is a CLEF evaluation lab dedicated to person‑place relation extraction from noisy, multilingual historical texts. Building on the HIPE‑2020 and HIPE‑2022 campaigns, it extends the series toward semantic relation extraction by targeting the task of identifying person‑‑place associations in multiple languages and time periods. Systems are asked to classify relations of two types ‑ at ("Has the person ever been at this place?") and isAt ("Is the person located at this place around publication time?") ‑ requiring reasoning over temporal and geographical cues. The lab introduces a three‑fold evaluation profile that jointly assesses accuracy, computational efficiency, and domain generalization. By linking relation extraction to large‑scale historical data processing, HIPE‑2026 aims to support downstream applications in knowledge‑graph construction, historical biography reconstruction, and spatial analysis in digital humanities.
Authors:Xiaohan Zhao, Zhaoyi Li, Yaxin Luo, Jiacheng Cui, Zhiqiang Shen
Abstract:
Black‑box adversarial attacks on Large Vision‑Language Models (LVLMs) are challenging due to missing gradients and complex multimodal boundaries. While prior state‑of‑the‑art transfer‑based approaches like M‑Attack perform well using local crop‑level matching between source and target images, we find this induces high‑variance, nearly orthogonal gradients across iterations, violating coherent local alignment and destabilizing optimization. We attribute this to (i) ViT translation sensitivity that yields spike‑like gradients and (ii) structural asymmetry between source and target crops. We reformulate local matching as an asymmetric expectation over source transformations and target semantics, and build a gradient‑denoising upgrade to M‑Attack. On the source side, Multi‑Crop Alignment (MCA) averages gradients from multiple independently sampled local views per iteration to reduce variance. On the target side, Auxiliary Target Alignment (ATA) replaces aggressive target augmentation with a small auxiliary set from a semantically correlated distribution, producing a smoother, lower‑variance target manifold. We further reinterpret momentum as Patch Momentum, replaying historical crop gradients; combined with a refined patch‑size ensemble (PE+), this strengthens transferable directions. Together these modules form M‑Attack‑V2, a simple, modular enhancement over M‑Attack that substantially improves transfer‑based black‑box attacks on frontier LVLMs: boosting success rates on Claude‑4.0 from 8% to 30%, Gemini‑2.5‑Pro from 83% to 97%, and GPT‑5 from 98% to 100%, outperforming prior black‑box LVLM attacks. Code and data are publicly available at: https://github.com/vila‑lab/M‑Attack‑V2.
Authors:Peter Balogh
Abstract:
Some transformer attention heads appear to function as membership testers, dedicating themselves to answering the question "has this token appeared before in the context?" We identify these heads across four language models (GPT‑2 small, medium, and large; Pythia‑160M) and show that they form a spectrum of membership‑testing strategies. Two heads (L0H1 and L0H5 in GPT‑2 small) function as high‑precision membership filters with false positive rates of 0‑4% even at 180 unique context tokens ‑‑ well above the d_\texthead = 64 bit capacity of a classical Bloom filter. A third head (L1H11) shows the classic Bloom filter capacity curve: its false positive rate follows the theoretical formula p \approx (1 ‑ e^‑kn/m)^k with R^2 = 1.0 and fitted capacity m \approx 5 bits, saturating by n \approx 20 unique tokens. A fourth head initially identified as a Bloom filter (L3H0) was reclassified as a general prefix‑attention head after confound controls revealed its apparent capacity curve was a sequence‑length artifact. Together, the three genuine membership‑testing heads form a multi‑resolution system concentrated in early layers (0‑1), taxonomically distinct from induction and previous‑token heads, with false positive rates that decay monotonically with embedding distance ‑‑ consistent with distance‑sensitive Bloom filters. These heads generalize broadly: they respond to any repeated token type, not just repeated names, with 43% higher generalization than duplicate‑token‑only heads. Ablation reveals these heads contribute to both repeated and novel token processing, indicating that membership testing coexists with broader computational roles. The reclassification of L3H0 through confound controls strengthens rather than weakens the case: the surviving heads withstand the scrutiny that eliminated a false positive in our own analysis.
Authors:Marco Avolio, Potito Aghilar, Sabino Roccotelli, Vito Walter Anelli, Chiara Mallamaci, Vincenzo Paparella, Marco Valentini, Alejandro Bellogín, Michelantonio Trizio, Joseph Trotta, Antonio Ferrara, Tommaso Di Noia
Abstract:
Innovation in Recommender Systems is currently impeded by a fractured ecosystem, where researchers must choose between the ease of in‑memory experimentation and the costly, complex rewriting required for distributed industrial engines. To bridge this gap, we present WarpRec, a high‑performance framework that eliminates this trade‑off through a novel, backend‑agnostic architecture. It includes 50+ state‑of‑the‑art algorithms, 40 metrics, and 19 filtering and splitting strategies that seamlessly transition from local execution to distributed training and optimization. The framework enforces ecological responsibility by integrating CodeCarbon for real‑time energy tracking, showing that scalability need not come at the cost of scientific integrity or sustainability. Furthermore, WarpRec anticipates the shift toward Agentic AI, leading Recommender Systems to evolve from static ranking engines into interactive tools within the Generative AI ecosystem. In summary, WarpRec not only bridges the gap between academia and industry but also can serve as the architectural backbone for the next generation of sustainable, agent‑ready Recommender Systems. Code is available at https://github.com/sisinflab/warprec/
Authors:Dylan Bouchard, Mohit Singh Chauhan, Viren Bajaj, David Skarbrevik
Abstract:
Uncertainty quantification has emerged as an effective approach to closed‑book hallucination detection for LLMs, but existing methods are largely designed for short‑form outputs and do not generalize well to long‑form generation. We introduce a taxonomy for fine‑grained uncertainty quantification in long‑form LLM outputs that distinguishes methods by design choices at three stages: response decomposition, unit‑level scoring, and response‑level aggregation. We formalize several families of consistency‑based black‑box scorers, providing generalizations and extensions of existing methods. In our experiments across multiple LLMs and datasets, we find 1) claim‑response entailment consistently performs better or on par with more complex claim‑level scorers, 2) claim‑level scoring generally yields better results than sentence‑level scoring, and 3) uncertainty‑aware decoding is highly effective for improving the factuality of long‑form outputs. Our framework clarifies relationships between prior methods, enables apples‑to‑apples comparisons, and provides practical guidance for selecting components for fine‑grained UQ.
Authors:Lorenzo Caselli, Marco Mistretta, Simone Magistri, Andrew D. Bagdanov
Abstract:
Generalized Category Discovery (GCD) aims to identify novel categories in unlabeled data while leveraging a small labeled subset of known classes. Training a parametric classifier solely on image features often leads to overfitting to old classes, and recent multimodal approaches improve performance by incorporating textual information. However, they treat modalities independently and incur high computational cost. We propose SpectralGCD, an efficient and effective multimodal approach to GCD that uses CLIP cross‑modal image‑concept similarities as a unified cross‑modal representation. Each image is expressed as a mixture over semantic concepts from a large task‑agnostic dictionary, which anchors learning to explicit semantics and reduces reliance on spurious visual cues. To maintain the semantic quality of representations learned by an efficient student, we introduce Spectral Filtering which exploits a cross‑modal covariance matrix over the softmaxed similarities measured by a strong teacher model to automatically retain only relevant concepts from the dictionary. Forward and reverse knowledge distillation from the same teacher ensures that the cross‑modal representations of the student remain both semantically sufficient and well‑aligned. Across six benchmarks, SpectralGCD delivers accuracy comparable to or significantly superior to state‑of‑the‑art methods at a fraction of the computational cost. The code is publicly available at: https://github.com/miccunifi/SpectralGCD.
Authors:Luzhi Wang, Xuanshuo Fu, He Zhang, Chuang Liu, Xiaobao Wang, Hongbo Liu
Abstract:
Graph Out‑of‑Distribution (OOD) detection aims to identify whether a test graph deviates from the distribution of graphs observed during training, which is critical for ensuring the reliability of Graph Neural Networks (GNNs) when deployed in open‑world scenarios. Recent advances in graph OOD detection have focused on test‑time training techniques that facilitate OOD detection without accessing potential supervisory information (e.g., training data). However, most of these methods employ a one‑pass inference paradigm, which prevents them from progressively correcting erroneous predictions to amplify OOD signals. To this end, we propose a Self‑Improving Graph Out‑of‑Distribution detector (SIGOOD), which is an unsupervised framework that integrates continuous self‑learning with test‑time training for effective graph OOD detection. Specifically, SIGOOD generates a prompt to construct a prompt‑enhanced graph that amplifies potential OOD signals. To optimize prompts, SIGOOD introduces an Energy Preference Optimization (EPO) loss, which leverages energy variations between the original test graph and the prompt‑enhanced graph. By iteratively optimizing the prompt by involving it into the detection model in a self‑improving loop, the resulting optimal prompt‑enhanced graph is ultimately used for OOD detection. Comprehensive evaluations on 21 real‑world datasets confirm the effectiveness and outperformance of our SIGOOD method. The code is at https://github.com/Ee1s/SIGOOD.
Authors:Yonghyeon Jo, Sunwoo Lee, Seungyul Han
Abstract:
Value decomposition is a core approach for cooperative multi‑agent reinforcement learning (MARL). However, existing methods still rely on a single optimal action and struggle to adapt when the underlying value function shifts during training, often converging to suboptimal policies. To address this limitation, we propose Successive Sub‑value Q‑learning (S2Q), which learns multiple sub‑value functions to retain alternative high‑value actions. Incorporating these sub‑value functions into a Softmax‑based behavior policy, S2Q encourages persistent exploration and enables Q^\texttot to adjust quickly to the changing optima. Experiments on challenging MARL benchmarks confirm that S2Q consistently outperforms various MARL algorithms, demonstrating improved adaptability and overall performance. Our code is available at https://github.com/hyeon1996/S2Q.
Authors:Yunseok Han, Yejoon Lee, Jaeyoung Do
Abstract:
Large Reasoning Models (LRMs) exhibit strong performance, yet often produce rationales that sound plausible but fail to reflect their true decision process, undermining reliability and trust. We introduce a formal framework for reasoning faithfulness, defined by two testable conditions: stance consistency (a coherent stance linking reasoning to answer) and causal influence (the stated reasoning causally drives the answer under output‑level interventions), explicitly decoupled from accuracy. To operationalize this, we present RFEval, a benchmark of 7,186 instances across seven tasks that probes faithfulness via controlled, output‑level counterfactual interventions. Evaluating twelve open‑source LRMs, we find unfaithfulness in 49.7% of outputs, predominantly from stance inconsistency. Failures are concentrated in brittle, convergent domains such as math and code, and correlate more with post‑training regimes than with scale: within‑family ablations indicate that adding current RL‑style objectives on top of supervised fine‑tuning can reduce reasoning faithfulness, even when accuracy is maintained. Crucially, accuracy is neither a sufficient nor a reliable proxy for faithfulness: once controlling for model and task, the accuracy‑faithfulness link is weak and statistically insignificant. Our work establishes a rigorous methodology for auditing LRM reliability and shows that trustworthy AI requires optimizing not only for correct outcomes but also for the structural integrity of the reasoning process. Our code and dataset can be found at project page: \hrefhttps://aidaslab.github.io/RFEval/https://aidaslab.github.io/RFEval/
Authors:Zichen Wang, Wanli Ma, Zhenyu Ming, Gong Zhang, Kun Yuan, Zaiwen Wen
Abstract:
Automated formalization of mathematics enables mechanical verification but remains limited to isolated theorems and short snippets. Scaling to textbooks and research papers is largely unaddressed, as it requires managing cross‑file dependencies, resolving imports, and ensuring that entire projects compile end‑to‑end. We present M2F (Math‑to‑Formal), the first agentic framework for end‑to‑end, project‑scale autoformalization in Lean. The framework operates in two stages. The statement compilation stage splits the document into atomic blocks, orders them via inferred dependencies, and repairs declaration skeletons until the project compiles, allowing placeholders in proofs. The proof repair stage closes these holes under fixed signatures using goal‑conditioned local edits. Throughout both stages, M2F keeps the verifier in the loop, committing edits only when toolchain feedback confirms improvement. In approximately three weeks, M2F converts long‑form mathematical sources into a project‑scale Lean library of 153,853 lines from 479 pages textbooks on real analysis and convex analysis, fully formalized as Lean declarations with accompanying proofs. This represents textbook‑scale formalization at a pace that would typically require months or years of expert effort. On FATE‑H, we achieve 96% proof success (vs.\ 80% for a strong baseline). Together, these results demonstrate that practical, large‑scale automated formalization of mathematical literature is within reach. The full generated Lean code from our runs is available at https://github.com/optsuite/ReasBook.git.
Authors:Arnold Cartagena, Ariane Teixeira
Abstract:
Large language models deployed as agents increasingly interact with external systems through tool calls‑‑actions with real‑world consequences that text outputs alone do not carry. Safety evaluations, however, overwhelmingly measure text‑level refusal behavior, leaving a critical question unanswered: does alignment that suppresses harmful text also suppress harmful actions? We introduce the GAP benchmark, a systematic evaluation framework that measures divergence between text‑level safety and tool‑call‑level safety in LLM agents. We test six frontier models across six regulated domains (pharmaceutical, financial, educational, employment, legal, and infrastructure), seven jailbreak scenarios per domain, three system prompt conditions (neutral, safety‑reinforced, and tool‑encouraging), and two prompt variants, producing 17,420 analysis‑ready datapoints. Our central finding is that text safety does not transfer to tool‑call safety. Across all six models, we observe instances where the model's text output refuses a harmful request while its tool calls simultaneously execute the forbidden action‑‑a divergence we formalize as the GAP metric. Even under safety‑reinforced system prompts, 219 such cases persist across all six models. System prompt wording exerts substantial influence on tool‑call behavior: TC‑safe rates span 21 percentage points for the most robust model and 57 for the most prompt‑sensitive, with 16 of 18 pairwise ablation comparisons remaining significant after Bonferroni correction. Runtime governance contracts reduce information leakage in all six models but produce no detectable deterrent effect on forbidden tool‑call attempts themselves. These results demonstrate that text‑only safety evaluations are insufficient for assessing agent behavior and that tool‑call safety requires dedicated measurement and mitigation.
Authors:Tanqiu Jiang, Yuhui Wang, Jiacheng Liang, Ting Wang
Abstract:
LLM agents are increasingly deployed in long‑horizon, complex environments to solve challenging problems, but this expansion exposes them to long‑horizon attacks that exploit multi‑turn user‑agent‑environment interactions to achieve objectives infeasible in single‑turn settings. To measure agent vulnerabilities to such risks, we present AgentLAB, the first benchmark dedicated to evaluating LLM agent susceptibility to adaptive, long‑horizon attacks. Currently, AgentLAB supports five novel attack types including intent hijacking, tool chaining, task injection, objective drifting, and memory poisoning, spanning 28 realistic agentic environments, and 644 security test cases. Leveraging AgentLAB, we evaluate representative LLM agents and find that they remain highly susceptible to long‑horizon attacks; moreover, defenses designed for single‑turn interactions fail to reliably mitigate long‑horizon threats. We anticipate that AgentLAB will serve as a valuable benchmark for tracking progress on securing LLM agents in practical settings. The benchmark is publicly available at https://tanqiujiang.github.io/AgentLAB_main.
Authors:Iman Ahmadi, Mehrshad Taji, Arad Mahdinezhad Kashani, AmirHossein Jadidi, Saina Kashani, Babak Khalaj
Abstract:
Task planning for robotic manipulation with large language models (LLMs) is an emerging area. Prior approaches rely on specialized models, fine tuning, or prompt tuning, and often operate in an open loop manner without robust environmental feedback, making them fragile in dynamic settings.We present MALLVi, a Multi Agent Large Language and Vision framework that enables closed loop feedback driven robotic manipulation. Given a natural language instruction and an image of the environment, MALLVi generates executable atomic actions for a robot manipulator. After action execution, a Vision Language Model (VLM) evaluates environmental feedback and decides whether to repeat the process or proceed to the next step.Rather than using a single model, MALLVi coordinates specialized agents, Decomposer, Localizer, Thinker, and Reflector, to manage perception, localization, reasoning, and high level planning. An optional Descriptor agent provides visual memory of the initial state. The Reflector supports targeted error detection and recovery by reactivating only relevant agents, avoiding full replanning.Experiments in simulation and real world settings show that iterative closed loop multi agent coordination improves generalization and increases success rates in zero shot manipulation tasks.Code available at https://github.com/iman1234ahmadi/MALLVI.
Authors:Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu, Shengjie Zhou, Xuhao Hu, Feiyu Gao, Junjie Cao, Zihua Wang, Zhiyuan Chen, Jitong Liao, Qi Zheng, Jiahui Zeng, Ze Xu, Shuai Bai, Junyang Lin, Jingren Zhou, Ming Yan
Abstract:
The paper introduces GUI‑Owl‑1.5, the latest native GUI agent model that features instruct/thinking variants in multiple sizes (2B/4B/8B/32B/235B) and supports a range of platforms (desktop, mobile, browser, and more) to enable cloud‑edge collaboration and real‑time interaction. GUI‑Owl‑1.5 achieves state‑of‑the‑art results on more than 20+ GUI benchmarks on open‑source models: (1) on GUI automation tasks, it obtains 56.5 on OSWorld, 71.6 on AndroidWorld, and 48.4 on WebArena; (2) on grounding tasks, it obtains 80.3 on ScreenSpotPro; (3) on tool‑calling tasks, it obtains 47.6 on OSWorld‑MCP, and 46.8 on MobileWorld; (4) on memory and knowledge tasks, it obtains 75.5 on GUI‑Knowledge Bench. GUI‑Owl‑1.5 incorporates several key innovations: (1) Hybird Data Flywheel: we construct the data pipeline for UI understanding and trajectory generation based on a combination of simulated environments and cloud‑based sandbox environments, in order to improve the efficiency and quality of data collection. (2) Unified Enhancement of Agent Capabilities: we use a unified thought‑synthesis pipeline to enhance the model's reasoning capabilities, while placing particular emphasis on improving key agent abilities, including Tool/MCP use, memory and multi‑agent adaptation; (3) Multi‑platform Environment RL Scaling: We propose a new environment RL algorithm, MRPO, to address the challenges of multi‑platform conflicts and the low training efficiency of long‑horizon tasks. The GUI‑Owl‑1.5 models are open‑sourced, and an online cloud‑sandbox demo is available at https://github.com/X‑PLUG/MobileAgent.
Authors:Chanhyuk Lee, Jaehoon Yoo, Manan Agarwal, Sheel Shah, Jerry Huang, Aditi Raghunathan, Seunghoon Hong, Nicholas M. Boffi, Jinwoo Kim
Abstract:
Language models based on discrete diffusion have attracted widespread interest for their potential to provide faster generation than autoregressive models. In practice, however, they exhibit a sharp degradation of sample quality in the few‑step regime, failing to realize this promise. Here we show that language models leveraging flow‑based continuous denoising can outperform discrete diffusion in both quality and speed. By revisiting the fundamentals of flows over discrete modalities, we build a flow‑based language model (FLM) that performs Euclidean denoising over one‑hot token encodings. We show that the model can be trained by predicting the clean data via a cross entropy objective, where we introduce a simple time reparameterization that greatly improves training stability and generation quality. By distilling FLM into its associated flow map, we obtain a distilled flow map language model (FMLM) capable of few‑step generation. On the LM1B and OWT language datasets, FLM attains generation quality matching state‑of‑the‑art discrete diffusion models. With FMLM, our approach outperforms recent few‑step language models across the board, with one‑step generation exceeding their 8‑step quality. Our work calls into question the widely held hypothesis that discrete diffusion processes are necessary for generative modeling over discrete modalities, and paves the way toward accelerated flow‑based language modeling at scale. Code is available at https://github.com/david3684/flm.
Authors:Xidong Wang, Shuqi Guo, Yue Shen, Junying Chen, Jian Wang, Jinjie Gu, Ping Zhang, Lei Liu, Benyou Wang
Abstract:
The reliability of medical LLM evaluation is critically undermined by data contamination and knowledge obsolescence, leading to inflated scores on static benchmarks. To address these challenges, we introduce LiveClin, a live benchmark designed for approximating real‑world clinical practice. Built from contemporary, peer‑reviewed case reports and updated biannually, LiveClin ensures clinical currency and resists data contamination. Using a verified AI‑human workflow involving 239 physicians, we transform authentic patient cases into complex, multimodal evaluation scenarios that span the entire clinical pathway. The benchmark currently comprises 1,407 case reports and 6,605 questions. Our evaluation of 26 models on LiveClin reveals the profound difficulty of these real‑world scenarios, with the top‑performing model achieving a Case Accuracy of just 35.7%. In benchmarking against human experts, Chief Physicians achieved the highest accuracy, followed closely by Attending Physicians, with both surpassing most models. LiveClin thus provides a continuously evolving, clinically grounded framework to guide the development of medical LLMs towards closing this gap and achieving greater reliability and real‑world utility. Our data and code are publicly available at https://github.com/AQ‑MedAI/LiveClin.
Authors:Zhangyi Liu, Huaizhi Qu, Xiaowei Yin, He Sun, Yanjun Han, Tianlong Chen, Zhun Deng
Abstract:
Test‑time scaling can improve model performance by aggregating stochastic reasoning trajectories. However, achieving sample‑efficient test‑time self‑consistency under a limited budget remains an open challenge. We introduce PETS (Principled and Efficient Test‑TimeSelf‑Consistency), which initiates a principled study of trajectory allocation through an optimization framework. Central to our approach is the self‑consistency rate, a new measure defined as agreement with the infinite‑budget majority vote. This formulation makes sample‑efficient test‑time allocation theoretically grounded and amenable to rigorous analysis. We study both offline and online settings. In the offline regime, where all questions are known in advance, we connect trajectory allocation to crowdsourcing, a classic and well‑developed area, by modeling reasoning traces as workers. This perspective allows us to leverage rich existing theory, yielding theoretical guarantees and an efficient majority‑voting‑based allocation algorithm. In the online streaming regime, where questions arrive sequentially and allocations must be made on the fly, we propose a novel method inspired by the offline framework. Our approach adapts budgets to question difficulty while preserving strong theoretical guarantees and computational efficiency. Experiments show that PETS consistently outperforms uniform allocation. On GPQA, PETS achieves perfect self‑consistency in both settings while reducing the sampling budget by up to 75% (offline) and 55% (online) relative to uniform allocation. Code is available at https://github.com/ZDCSlab/PETS.
Authors:Haoxiang Sun, Lizhen Xu, Bing Zhao, Wotao Yin, Wei Wang, Boyu Yang, Rui Wang, Hu Wei
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has been shown effective in enhancing the visual reflection and reasoning capabilities of Large Multimodal Models (LMMs). However, existing datasets are predominantly derived from either small‑scale manual construction or recombination of prior resources, which limits data diversity and coverage, thereby constraining further gains in model performance. To this end, we introduce DeepVision‑103K, a comprehensive dataset for RLVR training that covers diverse K12 mathematical topics, extensive knowledge points, and rich visual elements. Models trained on DeepVision achieve strong performance on multimodal mathematical benchmarks, and generalize effectively to general multimodal reasoning tasks. Further analysis reveals enhanced visual perception, reflection and reasoning capabilities in trained models, validating DeepVision's effectiveness for advancing multimodal reasoning. Data: \hrefhttps://huggingface.co/datasets/skylenage/DeepVision‑103Kthis url.
Authors:Karan Bali, Jack Stanley, Praneet Suresh, Danilo Bzdok
Abstract:
In mechanistic interpretability, recent work scrutinizes transformer "circuits" ‑ sparse, mono or multi layer sub computations, that may reflect human understandable functions. Yet, these network circuits are rarely acid‑tested for their stability across different instances of the same deep learning architecture. Without this, it remains unclear whether reported circuits emerge universally across labs or turn out to be idiosyncratic to a particular estimation instance, potentially limiting confidence in safety‑critical settings. Here, we systematically study stability across‑refits in increasingly complex transformer language models of various sizes. We quantify, layer by layer, how similarly attention heads learn representations across independently initialized training runs. Our rigorous experiments show that (1) middle‑layer heads are the least stable yet the most representationally distinct; (2) deeper models exhibit stronger mid‑depth divergence; (3) unstable heads in deeper layers become more functionally important than their peers from the same layer; (4) applying weight decay optimization substantially improves attention‑head stability across random model initializations; and (5) the residual stream is comparatively stable. Our findings establish the cross‑instance robustness of circuits as an essential yet underappreciated prerequisite for scalable oversight, drawing contours around possible white‑box monitorability of AI systems.
Authors:SungJun Cho, Chetan Gohil, Rukuang Huang, Oiwi Parker Jones, Mark W. Woolrich
Abstract:
Recent success in natural language processing has motivated growing interest in large‑scale foundation models for neuroimaging data. Such models often require discretization of continuous neural time series data, a process referred to as 'tokenization'. However, the impact of different tokenization strategies for neural data is currently poorly understood. In this work, we present a systematic evaluation of sample‑level tokenization strategies for transformer‑based large neuroimaging models (LNMs) applied to magnetoencephalography (MEG) data. We compare learnable and non‑learnable tokenizers by examining their signal reconstruction fidelity and their impact on subsequent foundation modeling performance (token prediction, biological plausibility of generated data, preservation of subject‑specific information, and performance on downstream tasks). For the learnable tokenizer, we introduce a novel approach based on an autoencoder. Experiments were conducted on three publicly available MEG datasets spanning different acquisition sites, scanners, and experimental paradigms. Our results show that both learnable and non‑learnable discretization schemes achieve high reconstruction accuracy and broadly comparable performance across most evaluation criteria, suggesting that simple fixed sample‑level tokenization strategies can be used in the development of neural foundation models. The code is available at https://github.com/OHBA‑analysis/Cho2026_Tokenizer.
Authors:Qi You, Yitai Cheng, Zichao Zeng, James Haworth
Abstract:
Street‑view image attribute classification is a vital downstream task of image classification, enabling applications such as autonomous driving, urban analytics, and high‑definition map construction. It remains computationally demanding whether training from scratch, initialising from pre‑trained weights, or fine‑tuning large models. Although pre‑trained vision‑language models such as CLIP offer rich image representations, existing adaptation or fine‑tuning methods often rely on their global image embeddings, limiting their ability to capture fine‑grained, localised attributes essential in complex, cluttered street scenes. To address this, we propose CLIP‑MHAdapter, a variant of the current lightweight CLIP adaptation paradigm that appends a bottleneck MLP equipped with multi‑head self‑attention operating on patch tokens to model inter‑patch dependencies. With approximately 1.4 million trainable parameters, CLIP‑MHAdapter achieves superior or competitive accuracy across eight attribute classification tasks on the Global StreetScapes dataset, attaining new state‑of‑the‑art results while maintaining low computational cost. The code is available at https://github.com/SpaceTimeLab/CLIP‑MHAdapter.
Authors:Guy Bar-Shalom, Ami Tavory, Itay Evron, Maya Bechler-Speicher, Ido Guy, Haggai Maron
Abstract:
Weight‑space models learn directly from the parameters of neural networks, enabling tasks such as predicting their accuracy on new datasets. Naive methods ‑‑ like applying MLPs to flattened parameters ‑‑ perform poorly, making the design of better weight‑space architectures a central challenge. While prior work leveraged permutation symmetries in standard networks to guide such designs, no analogous analysis or tailored architecture yet exists for Kolmogorov‑Arnold Networks (KANs). In this work, we show that KANs share the same permutation symmetries as MLPs, and propose the KAN‑graph, a graph representation of their computation. Building on this, we develop WS‑KAN, the first weight‑space architecture that learns on KANs, which naturally accounts for their symmetry. We analyze WS‑KAN's expressive power, showing it can replicate an input KAN's forward pass ‑ a standard approach for assessing expressiveness in weight‑space architectures. We construct a comprehensive ``zoo'' of trained KANs spanning diverse tasks, which we use as benchmarks to empirically evaluate WS‑KAN. Across all tasks, WS‑KAN consistently outperforms structure‑agnostic baselines, often by a substantial margin. Our code is available at https://github.com/BarSGuy/KAN‑Graph‑Metanetwork.
Authors:Nithin Sivakumaran, Shoubin Yu, Hyunji Lee, Yue Zhang, Ali Payani, Mohit Bansal, Elias Stengel-Eskin
Abstract:
Chain‑of‑thought (CoT) reasoning sometimes fails to faithfully reflect the true computation of a large language model (LLM), hampering its utility in explaining how LLMs arrive at their answers. Moreover, optimizing for faithfulness and interpretability in reasoning often degrades task performance. To address this tradeoff and improve CoT faithfulness, we propose Reasoning Execution by Multiple Listeners (REMUL), a multi‑party reinforcement learning approach. REMUL builds on the hypothesis that reasoning traces which other parties can follow will be more faithful. A speaker model generates a reasoning trace, which is truncated and passed to a pool of listener models who "execute" the trace, continuing the trace to an answer. Speakers are rewarded for producing reasoning that is clear to listeners, with additional correctness regularization via masked supervised finetuning to counter the tradeoff between faithfulness and performance. On multiple reasoning benchmarks (BIG‑Bench Extra Hard, MuSR, ZebraLogicBench, and FOLIO), REMUL consistently and substantially improves three measures of faithfulness ‑‑ hint attribution, early answering area over the curve (AOC), and mistake injection AOC ‑‑ while also improving accuracy. Our analysis finds that these gains are robust across training domains, translate to legibility gains, and are associated with shorter and more direct CoTs.
Authors:Thinh Hung Truong, Jey Han Lau, Jianzhong Qi
Abstract:
Large Language Models (LLMs) are increasingly deployed in applications that interact with the physical world, such as navigation, robotics, or mapping, making robust geospatial reasoning a critical capability. Despite that, LLMs' ability to reason about GPS coordinates and real‑world geography remains underexplored. We introduce GPSBench, a dataset of 57,800 samples across 17 tasks for evaluating geospatial reasoning in LLMs, spanning geometric coordinate operations (e.g., distance and bearing computation) and reasoning that integrates coordinates with world knowledge. Focusing on intrinsic model capabilities rather than tool use, we evaluate 14 state‑of‑the‑art LLMs and find that GPS reasoning remains challenging, with substantial variation across tasks: models are generally more reliable at real‑world geographic reasoning than at geometric computations. Geographic knowledge degrades hierarchically, with strong country‑level performance but weak city‑level localization, while robustness to coordinate noise suggests genuine coordinate understanding rather than memorization. We further show that GPS‑coordinate augmentation can improve in downstream geospatial tasks, and that finetuning induces trade‑offs between gains in geometric computation and degradation in world knowledge. Our dataset and reproducible code are available at https://github.com/joey234/gpsbench
Authors:Kevin Kai-Chun Chang, Ekin Beyazit, Alberto Sangiovanni-Vincentelli, Tichakorn Wongpiromsarn, Sanjit A. Seshia
Abstract:
Developing autonomous driving systems for complex traffic environments requires balancing multiple objectives, such as avoiding collisions, obeying traffic rules, and making efficient progress. In many situations, these objectives cannot be satisfied simultaneously, and explicit priority relations naturally arise. Also, driving rules require context, so it is important to formally model the environment scenarios within which such rules apply. Existing benchmarks for evaluating autonomous vehicles lack such combinations of multi‑objective prioritized rules and formal environment models. In this work, we introduce ScenicRules, a benchmark for evaluating autonomous driving systems in stochastic environments under prioritized multi‑objective specifications. We first formalize a diverse set of objectives to serve as quantitative evaluation metrics. Next, we design a Hierarchical Rulebook framework that encodes multiple objectives and their priority relations in an interpretable and adaptable manner. We then construct a compact yet representative collection of scenarios spanning diverse driving contexts and near‑accident situations, formally modeled in the Scenic language. Experimental results show that our formalized objectives and Hierarchical Rulebooks align well with human driving judgments and that our benchmark effectively exposes agent failures with respect to the prioritized objectives. Our benchmark can be accessed at https://github.com/BerkeleyLearnVerify/ScenicRules/.
Authors:KC Santosh, Srikanth Baride, Rodrigue Rizk
Abstract:
As machine learning (ML) continues its rapid expansion, the environmental cost of model training and inference has become a critical societal concern. Existing benchmarks overwhelmingly focus on standard performance metrics such as accuracy, BLEU, or mAP, while largely ignoring energy consumption and carbon emissions. This single‑objective evaluation paradigm is increasingly misaligned with the practical requirements of large‑scale deployment, particularly in energy‑constrained environments such as mobile devices, developing regions, and climate‑aware enterprises. In this paper, we propose AI‑CARE, an evaluation tool for reporting energy consumption, and carbon emissions of ML models. In addition, we introduce the carbon‑performance tradeoff curve, an interpretable tool that visualizes the Pareto frontier between performance and carbon cost. We demonstrate, through theoretical analysis and empirical validation on representative ML workloads, that carbon‑aware benchmarking changes the relative ranking of models and encourages architectures that are simultaneously accurate and environmentally responsible. Our proposal aims to shift the research community toward transparent, multi‑objective evaluation and align ML progress with global sustainability goals. The tool and documentation are available at https://github.com/USD‑AI‑ResearchLab/ai‑care.
Authors:Adnan El Assadi, Isaac Chung, Chenghao Xiao, Roman Solomatin, Animesh Jha, Rahul Chand, Silky Singh, Kaitlyn Wang, Ali Sartaz Khan, Marc Moussa Nasser, Sufen Fong, Pengfei He, Alan Xiao, Ayush Sunil Munot, Aditya Shrivastava, Artem Gazizov, Niklas Muennighoff, Kenneth Enevoldsen
Abstract:
We introduce the Massive Audio Embedding Benchmark (MAEB), a large‑scale benchmark covering 30 tasks across speech, music, environmental sounds, and cross‑modal audio‑text reasoning in 100+ languages. We evaluate 50+ models and find that no single model dominates across all tasks: contrastive audio‑text models excel at environmental sound classification (e.g., ESC50) but score near random on multilingual speech tasks (e.g., SIB‑FLEURS), while speech‑pretrained models show the opposite pattern. Clustering remains challenging for all models, with even the best‑performing model achieving only modest results. We observe that models excelling on acoustic understanding often perform poorly on linguistic tasks, and vice versa. We also show that the performance of audio encoders on MAEB correlates highly with their performance when used in audio large language models. MAEB is derived from MAEB+, a collection of 98 tasks. MAEB is designed to maintain task diversity while reducing evaluation cost, and it integrates into the MTEB ecosystem for unified evaluation across text, image, and audio modalities. We release MAEB and all 98 tasks along with code and a leaderboard at https://github.com/embeddings‑benchmark/mteb.
Authors:Junbo Jacob Lian, Yujun Sun, Huiling Chen, Chaoyu Zhang, Chung-Piaw Teo
Abstract:
Large language models (LLMs) can translate natural language into optimization code, but silent failures pose a critical risk: code that executes and returns solver‑feasible solutions may encode semantically incorrect formulations, creating a feasibility‑correctness gap of up to 90 percentage points on compositional problems. We introduce ReLoop, addressing silent failures from two complementary directions. Structured generation decomposes code production into a four‑stage reasoning chain (understand, formalize, synthesize, verify) that mirrors expert modeling practice, with explicit variable‑type reasoning and self‑verification to prevent formulation errors at their source. Behavioral verification detects errors that survive generation by testing whether the formulation responds correctly to solver‑based parameter perturbation, without requiring ground truth ‑‑ an external semantic signal that bypasses the self‑consistency problem inherent in LLM‑based code review. The two mechanisms are complementary: structured generation dominates on complex compositional problems, while behavioral verification becomes the largest single contributor on problems with localized formulation defects. Together with execution recovery via IIS‑enhanced diagnostics, ReLoop raises correctness from 22.6% to 31.1% and execution from 72.1% to 100.0% on the strongest model, with consistent gains across five models spanning three paradigms (foundation, SFT, RL) and three benchmarks. We additionally release RetailOpt‑190, 190 compositional retail optimization scenarios targeting the multi‑constraint interactions where LLMs most frequently fail.
Authors:Yiwen Wang, Jiahao Qin
Abstract:
High‑speed optical‑resolution photoacoustic microscopy (OR‑PAM) with bidirectional raster scanning doubles imaging speed but introduces coupled domain shift and geometric misalignment between forward and backward scan lines. Existing registration methods, constrained by brightness constancy assumptions, achieve limited alignment quality, while recent generative approaches address domain shift through complex architectures that lack temporal awareness across frames. We propose GPEReg‑Net, a scene‑appearance disentanglement framework that separates domain‑invariant scene features from domain‑specific appearance codes via Adaptive Instance Normalization (AdaIN), enabling direct image‑to‑image registration without explicit deformation field estimation. To exploit temporal structure in sequential acquisitions, we introduce a Global Position Encoding (GPE) module that combines learnable position embeddings with sinusoidal encoding and cross‑frame attention, allowing the network to leverage context from neighboring frames for improved temporal coherence. On the OR‑PAM‑Reg‑4K benchmark (432 test samples), GPEReg‑Net achieves NCC of 0.953, SSIM of 0.932, and PSNR of 34.49dB, surpassing the state‑of‑the‑art by 3.8% in SSIM and 1.99dB in PSNR while maintaining competitive NCC. Code is available at https://github.com/JiahaoQin/GPEReg‑Net.
Authors:Pengfei Zhang, Tianxin Xie, Minghao Yang, Li Liu
Abstract:
Deep learning‑based respiratory auscultation is currently hindered by two fundamental challenges: (i) inherent information loss, as converting signals into spectrograms discards transient acoustic events and clinical context; (ii) limited data availability, exacerbated by severe class imbalance. To bridge these gaps, we present Resp‑Agent, an autonomous multimodal system orchestrated by a novel Active Adversarial Curriculum Agent (Thinker‑A^2CA). Unlike static pipelines, Thinker‑A^2CA serves as a central controller that actively identifies diagnostic weaknesses and schedules targeted synthesis in a closed loop. To address the representation gap, we introduce a Modality‑Weaving Diagnoser that weaves EHR data with audio tokens via Strategic Global Attention and sparse audio anchors, capturing both long‑range clinical context and millisecond‑level transients. To address the data gap, we design a Flow Matching Generator that adapts a text‑only Large Language Model (LLM) via modality injection, decoupling pathological content from acoustic style to synthesize hard‑to‑diagnose samples. As a foundation for these efforts, we introduce Resp‑229k, a benchmark corpus of 229k recordings paired with LLM‑distilled clinical narratives. Extensive experiments demonstrate that Resp‑Agent consistently outperforms prior approaches across diverse evaluation settings, improving diagnostic robustness under data scarcity and long‑tailed class imbalance. Our code and data are available at https://github.com/zpforlove/Resp‑Agent.
Authors:Kaaustaaub Shankar, Kelly Cohen
Abstract:
Generalized Additive Models (GAMs) balance predictive accuracy and interpretability, but manually configuring their structure is challenging. We propose using the multi‑objective genetic algorithm NSGA‑II to automatically optimize GAMs, jointly minimizing prediction error (RMSE) and a Complexity Penalty that captures sparsity, smoothness, and uncertainty. Experiments on the California Housing dataset show that NSGA‑II discovers GAMs that outperform baseline LinearGAMs in accuracy or match performance with substantially lower complexity. The resulting models are simpler, smoother, and exhibit narrower confidence intervals, enhancing interpretability. This framework provides a general approach for automated optimization of transparent, high‑performing models. The code can be found at https://github.com/KaaustaaubShankar/GeneticAdditiveModels.
Authors:Zhenxing Xu, Brikit Lu, Weidong Bao, Zhengqiu Zhu, Junsong Zhang, Hui Yan, Wenhao Lu, Ji Wang
Abstract:
Current Visual‑Language Navigation (VLN) methodologies face a trade‑off between semantic understanding and control precision. While Multimodal Large Language Models (MLLMs) offer superior reasoning, deploying them as low‑level controllers leads to high latency, trajectory oscillations, and poor generalization due to weak geometric grounding. To address these limitations, we propose Fly0, a framework that decouples semantic reasoning from geometric planning. The proposed method operates through a three‑stage pipeline: (1) an MLLM‑driven module for grounding natural language instructions into 2D pixel coordinates; (2) a geometric projection module that utilizes depth data to localize targets in 3D space; and (3) a geometric planner that generates collision‑free trajectories. This mechanism enables robust navigation even when visual contact is lost. By eliminating the need for continuous inference, Fly0 reduces computational overhead and improves system stability. Extensive experiments in simulation and real‑world environments demonstrate that Fly0 outperforms state‑of‑the‑art baselines, improving the Success Rate by over 20% and reducing Navigation Error (NE) by approximately 50% in unstructured environments. Our code is available at https://github.com/xuzhenxing1/Fly0.
Authors:Warren Johnson
Abstract:
In "Compress or Route?" (Johnson, 2026), we found that code generation tolerates aggressive prompt compression (r >= 0.6) while chain‑of‑thought reasoning degrades gradually. That study was limited to HumanEval (164 problems), left the "perplexity paradox" mechanism unvalidated, and provided no adaptive algorithm. This paper addresses all three gaps. First, we validate across six code benchmarks (HumanEval, MBPP, HumanEval+, MultiPL‑E) and four reasoning benchmarks (GSM8K, MATH, ARC‑Challenge, MMLU‑STEM), confirming the compression threshold generalizes across languages and difficulties. Second, we conduct the first per‑token perplexity analysis (n=723 tokens), revealing a "perplexity paradox": code syntax tokens are preserved (high perplexity) while numerical values in math problems are pruned despite being task‑critical (low perplexity). Signature injection recovers +34 percentage points in pass rate (5.3% to 39.3%; Cohen's h=0.890). Third, we propose TAAC (Task‑Aware Adaptive Compression), achieving 22% cost reduction with 96% quality preservation, outperforming fixed‑ratio compression by 7%. MBPP validation (n=1,800 trials) confirms systematic variation: 3.6% at r=0.3 to 54.6% at r=1.0.
Authors:Sen Ye, Mengde Xu, Shuyang Gu, Di He, Liwei Wang, Han Hu
Abstract:
Current research in multimodal models faces a key challenge where enhancing generative capabilities often comes at the expense of understanding, and vice versa. We analyzed this trade‑off and identify the primary cause might be the potential conflict between generation and understanding, which creates a competitive dynamic within the model. To address this, we propose the Reason‑Reflect‑Refine (R3) framework. This innovative algorithm re‑frames the single‑step generation task into a multi‑step process of "generate‑understand‑regenerate". By explicitly leveraging the model's understanding capability during generation, we successfully mitigate the optimization dilemma, achieved stronger generation results and improved understanding ability which are related to the generation process. This offers valuable insights for designing next‑generation unified multimodal models. Code is available at https://github.com/sen‑ye/R3.
Authors:Jingtian Yan, Yulun Zhang, Zhenting Liu, Han Zhang, He Jiang, Jingkai Chen, Stephen F. Smith, Jiaoyang Li
Abstract:
We present Lifelong Scalable Multi‑Agent Realistic Testbed (LSMART), an open‑source simulator to evaluate any Multi‑Agent Path Finding (MAPF) algorithm in a Fleet Management System (FMS) with Automated Guided Vehicles (AGVs). MAPF aims to move a group of agents from their corresponding starting locations to their goals. Lifelong MAPF (LMAPF) is a variant of MAPF that continuously assigns new goals for agents to reach. LMAPF applications, such as autonomous warehouses, often require a centralized, lifelong system to coordinate the movement of a fleet of robots, typically AGVs. However, existing works on MAPF and LMAPF often assume simplified kinodynamic models, such as pebble motion, as well as perfect execution and communication for AGVs. Prior work has presented SMART, a software capable of evaluating any MAPF algorithms while considering agent kinodynamics, communication delays, and execution uncertainties. However, SMART is designed for MAPF, not LMAPF. Generalizing SMART to an FMS requires many more design choices. First, an FMS parallelizes planning and execution, raising the question of when to plan. Second, given planners with varying optimality and differing agent‑model assumptions, one must decide how to plan. Third, when the planner fails to return valid solutions, the system must determine how to recover. In this paper, we first present LSMART, an open‑source simulator that incorporates all these considerations to evaluate any MAPF algorithms in an FMS. We then provide experiment results based on state‑of‑the‑art methods for each design choice, offering guidance on how to effectively design centralized lifelong AGV Fleet Management Systems. LSMART is available at https://smart‑mapf.github.io/lifelong‑smart.
Authors:Yihan Wang, Peiyu Liu, Runyu Chen, Wei Xu
Abstract:
Text‑to‑SQL has recently achieved impressive progress, yet remains difficult to apply effectively in real‑world scenarios. This gap stems from the reliance on single static workflows, fundamentally limiting scalability to out‑of‑distribution and long‑tail scenarios. Instead of requiring users to select suitable methods through extensive experimentation, we attempt to enable systems to adaptively construct workflows at inference time. Through theoretical and empirical analysis, we demonstrate that optimal dynamic policies consistently outperform the best static workflow, with performance gains fundamentally driven by heterogeneity across candidate workflows. Motivated by this, we propose SquRL, a reinforcement learning framework that enhances LLMs' reasoning capability in adaptive workflow construction. We design a rule‑based reward function and introduce two effective training mechanisms: dynamic actor masking to encourage broader exploration, and pseudo rewards to improve training efficiency. Experiments on widely‑used Text‑to‑SQL benchmarks demonstrate that dynamic workflow construction consistently outperforms the best static workflow methods, with especially pronounced gains on complex and out‑of‑distribution queries. The codes are available at https://github.com/Satissss/SquRL
Authors:Longfei Chen, Ji Zhao, Lanxiao Cui, Tong Su, Xingbo Pan, Ziyang Li, Yongxing Wu, Qijiang Cao, Qiyao Cai, Jing Zhang, Yuandong Ni, Junyao He, Zeyu Zhang, Chao Ge, Xuhuai Lu, Zeyu Gao, Yuxin Cui, Weisen Chen, Yuxuan Peng, Shengping Wang, Qi Li, Yukai Huang, Yukun Liu, Tuo Zhou, Terry Yue Zhuo, Junyang Lin, Chao Zhang
Abstract:
We introduce SecCodeBench‑V2, a publicly released benchmark for evaluating Large Language Model (LLM) copilots' capabilities of generating secure code. SecCodeBench‑V2 comprises 98 generation and fix scenarios derived from Alibaba Group's industrial productions, where the underlying security issues span 22 common CWE (Common Weakness Enumeration) categories across five programming languages: Java, C, Python, Go, and JavaScript. SecCodeBench‑V2 adopts a function‑level task formulation: each scenario provides a complete project scaffold and requires the model to implement or patch a designated target function under fixed interfaces and dependencies. For each scenario, SecCodeBench‑V2 provides executable proof‑of‑concept (PoC) test cases for both functional validation and security verification. All test cases are authored and double‑reviewed by security experts, ensuring high fidelity, broad coverage, and reliable ground truth. Beyond the benchmark itself, we build a unified evaluation pipeline that assesses models primarily via dynamic execution. For most scenarios, we compile and run model‑generated artifacts in isolated environments and execute PoC test cases to validate both functional correctness and security properties. For scenarios where security issues cannot be adjudicated with deterministic test cases, we additionally employ an LLM‑as‑a‑judge oracle. To summarize performance across heterogeneous scenarios and difficulty levels, we design a Pass@K‑based scoring protocol with principled aggregation over scenarios and severity, enabling holistic and comparable evaluation across models. Overall, SecCodeBench‑V2 provides a rigorous and reproducible foundation for assessing the security posture of AI coding assistants, with results and artifacts released at https://alibaba.github.io/sec‑code‑bench. The benchmark is publicly available at https://github.com/alibaba/sec‑code‑bench.
Authors:Hao Chen, Zavareh Bozorgasl
Abstract:
We propose SCENE (Self‑Centering Noncoherent Estimator), a pilot‑free and phase‑invariant aggregation primitive for over‑the‑air federated distillation (OTA‑FD). Each device maps its soft‑label (class‑probability) vector to nonnegative transmit energies under constant per‑round power and constant‑envelope signaling (PAPR near 1). At the server, a self‑centering energy estimator removes the noise‑energy offset and yields an unbiased estimate of the weighted soft‑label average, with variance decaying on the order of 1/(SM) in the number of receive antennas M and repetition factor S. We also develop a pilot‑free ratio‑normalized variant that cancels unknown large‑scale gains, provide a convergence bound consistent with coherent OTA‑FD analyses, and present an overhead‑based crossover comparison. SCENE targets short‑coherence and hardware‑constrained regimes, where avoiding per‑round CSI is essential: it trades a modest noncoherent variance constant for zero uplink pilots, unbiased aggregation, and hardware‑friendly transmission, and can outperform coherent designs when pilot overhead is non‑negligible.
Authors:Kiho Park, Todd Nief, Yo Joong Choe, Victor Veitch
Abstract:
This paper concerns the question of how AI systems encode semantic structure into the geometric structure of their representation spaces. The motivating observation of this paper is that the natural geometry of these representation spaces should reflect the way models use representations to produce behavior. We focus on the important special case of representations that define softmax distributions. In this case, we argue that the natural geometry is information geometry. Our focus is on the role of information geometry on semantic encoding and the linear representation hypothesis. As an illustrative application, we develop "dual steering", a method for robustly steering representations to exhibit a particular concept using linear probes. We prove that dual steering optimally modifies the target concept while minimizing changes to off‑target concepts. Empirically, we find that dual steering enhances the controllability and stability of concept manipulation.
Authors:Muhammad J. Alahmadi, Peng Gao, Feiyi Wang, Dongkuan Xu
Abstract:
Dataset distillation compresses the original data into compact synthetic datasets, reducing training time and storage while retaining model performance, enabling deployment under limited resources. Although recent decoupling‑based distillation methods enable dataset distillation at large scale, they continue to face an efficiency gap: optimization‑based decoupling methods achieve higher accuracy but demand intensive computation, whereas optimization‑free decoupling methods are efficient but sacrifice accuracy. To overcome this trade‑off, we propose Exploration‑‑Exploitation Distillation (E^2D), a simple, practical method that minimizes redundant computation through an efficient pipeline that begins with full‑image initialization to preserve semantic integrity and feature diversity. It then uses a two‑phase optimization strategy: an exploration phase that performs uniform updates and identifies high‑loss regions, and an exploitation phase that focuses updates on these regions to accelerate convergence. We evaluate E^2D on large‑scale benchmarks, surpassing the state‑of‑the‑art on ImageNet‑1K while being 18× faster, and on ImageNet‑21K, our method substantially improves accuracy while remaining 4.3× faster. These results demonstrate that targeted, redundancy‑reducing updates, rather than brute‑force optimization, bridge the gap between accuracy and efficiency in large‑scale dataset distillation. Code is available at https://github.com/ncsu‑dk‑lab/E2D.
Authors:Shreyas Rajesh, Pavan Holur, Mehmet Yigit Turali, Chenda Duan, Vwani Roychowdhury
Abstract:
Language models are increasingly used to reason over content they were not trained on, such as new documents, evolving knowledge, and user‑specific data. A common approach is retrieval‑augmented generation (RAG), which stores verbatim documents externally (as chunks) and retrieves only a relevant subset at inference time for an LLM to reason over. However, this results in inefficient usage of test‑time compute (LLM repeatedly reasons over the same documents); moreover, chunk retrieval can inject irrelevant context that increases unsupported generation. We propose a human‑like non‑parametric continual learning framework, where the base model remains fixed, and learning occurs by integrating each new experience into an external semantic memory state that accumulates and consolidates itself continually. We present Panini, which realizes this by representing documents as Generative Semantic Workspaces (GSW) ‑‑ an entity‑ and event‑aware network of question‑answer (QA) pairs, sufficient for an LLM to reconstruct the experienced situations and mine latent knowledge via reasoning‑grounded inference chains on the network. Given a query, Panini only traverses the continually‑updated GSW (not the verbatim documents or chunks), and retrieves the most likely inference chains. Across six QA benchmarks, Panini achieves the highest average performance, 5%‑7% higher than other competitive baselines, while using 2‑30x fewer answer‑context tokens, supports fully open‑source pipelines, and reduces unsupported answers on curated unanswerable queries. The results show that efficient and accurate structuring of experiences at write time ‑‑ as achieved by the GSW framework ‑‑ yields both efficiency and reliability gains at read time. Code is available at https://github.com/roychowdhuryresearch/gsw‑memory.
Authors:Per Åhag, Alexander Friedrich, Fredrik Ohlsson, Viktor Vigren Näslund
Abstract:
Neural ordinary differential equations (NODEs) are geometric deep learning models based on dynamical systems and flows generated by vector fields on manifolds. Despite numerous successful applications, particularly within the flow matching paradigm, all existing NODE models are fundamentally constrained to fixed‑dimensional dynamics by the intrinsic nature of the manifold's dimension. In this paper, we extend NODEs to M‑polyfolds (spaces that can simultaneously accommodate varying dimensions and a notion of differentiability) and introduce PolyNODEs, the first variable‑dimensional flow‑based model in geometric deep learning. As an example application, we construct explicit M‑polyfolds featuring dimensional bottlenecks and PolyNODE autoencoders based on parametrised vector fields that traverse these bottlenecks. We demonstrate experimentally that our PolyNODE models can be trained to solve reconstruction tasks in these spaces, and that latent representations of the input can be extracted and used to solve downstream classification tasks. The code used in our experiments is publicly available at https://github.com/turbotage/PolyNODE .
Authors:Abdul Joseph Fofanah, Lian Wen, Alpha Alimamy Kamara, Zhongyi Zhang, David Chen, Albert Patrick Sankoh
Abstract:
Accurate polyp segmentation in colonoscopy is essential for cancer prevention but remains challenging due to: (1) high morphological variability (from flat to protruding lesions), (2) strong visual similarity to normal structures such as folds and vessels, and (3) the need for robust multi‑scale detection. Existing deep learning approaches suffer from unidirectional processing, weak multi‑scale fusion, and the absence of anatomical constraints, often leading to false positives (over‑segmentation of normal structures) and false negatives (missed subtle flat lesions). We propose GRAFNet, a biologically inspired architecture that emulates the hierarchical organisation of the human visual system. GRAFNet integrates three key modules: (1) a Guided Asymmetric Attention Module (GAAM) that mimics orientation‑tuned cortical neurones to emphasise polyp boundaries, (2) a MultiScale Retinal Module (MSRM) that replicates retinal ganglion cell pathways for parallel multi‑feature analysis, and (3) a Guided Cortical Attention Feedback Module (GCAFM) that applies predictive coding for iterative refinement. These are unified in a Polyp Encoder‑Decoder Module (PEDM) that enforces spatial‑semantic consistency via resolution‑adaptive feedback. Extensive experiments on five public benchmarks (Kvasir‑SEG, CVC‑300, CVC‑ColonDB, CVC‑Clinic, and PolypGen) demonstrate consistent state‑of‑the‑art performance, with 3‑8% Dice improvements and 10‑20% higher generalisation over leading methods, while offering interpretable decision pathways. This work establishes a paradigm in which neural computation principles bridge the gap between AI accuracy and clinically trustworthy reasoning. Code is available at https://github.com/afofanah/GRAFNet.
Authors:Jiawei Wang, Liang Xu, Shuntian Zheng, Yu Guan, Kaichen Wang, Ziqing Zhang, Chen Chen, Laurence T. Yang, Sai Gu
Abstract:
Reliable sleep staging remains challenging for lightweight wearable devices such as single‑channel electroencephalography (scEEG) or photoplethysmography (PPG). scEEG offers direct measurement of cortical activity and serves as the foundation for sleep staging, yet exhibits limited performance on light sleep stages. PPG provides a low‑cost complement that captures autonomic signatures effective for detecting light sleep. However, prior PPG‑based methods rely on full night recordings (8 ‑ 10 hours) as input context, which is less practical to provide timely feedback for sleep intervention. In this work, we investigate scEEG‑PPG fusion for 4‑class sleep staging under short‑window (30 s ‑ 30 min) constraints. First, we evaluate the temporal context required for each modality, to better understand the relationship of sleep staging performance with respect to monitoring window. Second, we investigate three fusion strategies: score‑level fusion, cross‑attention fusion enabling feature‑level interactions, and Mamba‑enhanced fusion incorporating temporal context modeling. Third, we train and evaluate on the Multi‑Ethnic Study of Atherosclerosis (MESA) dataset and perform cross‑dataset validation on the Cleveland Family Study (CFS) and the Apnea, Bariatric surgery, and CPAP (ABC) datasets. The Mamba‑enhanced fusion achieves the best performance on MESA (Cohen's Kappa κ = 0.798, Acc = 86.9%), with particularly notable improvement in light sleep classification (F1‑score: 85.63% vs. 77.76%, recall: 82.85% vs. 69.95% for scEEG alone), and generalizes well to CFS and ABC datasets with different populations. These findings suggest that scEEG‑PPG fusion is a promising approach for lightweight wearable based sleep monitoring, offering a pathway toward more accessible sleep health assessment. Source code of this project can be found at: https://github.com/DavyWJW/scEEG‑PPGFusion
Authors:Justin Hill, Hong Joo Ryoo
Abstract:
We present GRACE, a simulation‑native agent for autonomous experimental design in high‑energy and nuclear physics. Given multimodal input in the form of a natural‑language prompt or a published experimental paper, the agent extracts a structured representation of the experiment, constructs a runnable toy simulation, and autonomously explores design modifications using first‑principles Monte Carlo methods. Unlike agentic systems focused on operational control or execution of predefined procedures, GRACE addresses the upstream problem of experimental design: proposing non‑obvious modifications to detector geometry, materials, and configurations that improve physics performance under physical and practical constraints. The agent evaluates candidate designs through repeated simulation, physics‑motivated utility functions, and budget‑aware escalation from fast parametric models to full Geant4 simulations, while maintaining strict reproducibility and provenance tracking. We demonstrate the framework on historical experimental setups, showing that the agent can identify optimization directions that align with known upgrade priorities, using only baseline simulation inputs. We also conducted a benchmark in which the agent identified the setup and proposed improvements from a suite of natural language prompts, with some supplied with a relevant physics research paper, of varying high energy physics (HEP) problem settings. This work establishes experimental design as a constrained search problem under physical law and introduces a new benchmark for autonomous, simulation‑driven scientific reasoning in complex instruments.
Authors:Mihir Panchal, Deeksha Varshney, Mamta, Asif Ekbal
Abstract:
Multilingual large language models (LLMs) are increasingly deployed in linguistically diverse regions like India, yet most interpretability tools remain tailored to English. Prior work reveals that LLMs often operate in English centric representation spaces, making cross lingual interpretability a pressing concern. We introduce Indic‑TunedLens, a novel interpretability framework specifically for Indian languages that learns shared affine transformations. Unlike the standard Logit Lens, which directly decodes intermediate activations, Indic‑TunedLens adjusts hidden states for each target language, aligning them with the target output distributions to enable more faithful decoding of model representations. We evaluate our framework on 10 Indian languages using the MMLU benchmark and find that it significantly improves over SOTA interpretability methods, especially for morphologically rich, low resource languages. Our results provide crucial insights into the layer‑wise semantic encoding of multilingual transformers. Our model is available at https://huggingface.co/spaces/MihirRajeshPanchal/IndicTunedLens. Our code is available at https://github.com/MihirRajeshPanchal/IndicTunedLens.
Authors:Shaojie Jiang, Svitlana Vakulenko, Maarten de Rijke
Abstract:
Conversational search (CS) requires a complex software engineering pipeline that integrates query reformulation, ranking, and response generation. CS researchers currently face two barriers: the lack of a unified framework for efficiently sharing contributions with the community, and the difficulty of deploying end‑to‑end prototypes needed for user evaluation. We introduce Orcheo, an open‑source platform designed to bridge this gap. Orcheo offers three key advantages: (i) A modular architecture promotes component reuse through single‑file node modules, facilitating sharing and reproducibility in CS research; (ii) Production‑ready infrastructure bridges the prototype‑to‑system gap via dual execution modes, secure credential management, and execution telemetry, with built‑in AI coding support that lowers the learning curve; (iii) Starter‑kit assets include 50+ off‑the‑shelf components for query understanding, ranking, and response generation, enabling the rapid bootstrapping of complete CS pipelines. We describe the framework architecture and validate Orcheo's utility through case studies that highlight modularity and ease of use. Orcheo is released as open source under the MIT License at https://github.com/ShaojieJiang/orcheo.
Authors:Muzhi Chen, Xuanhe Zhou, Wei Zhou, Bangrui Xu, Surui Tang, Guoliang Li, Bingsheng He, Yeye He, Yitong Song, Fan Wu
Abstract:
This paper envisions a quantum database (Qute) that treats quantum computation as a first‑class execution option. Unlike prior simulation‑based methods that either run quantum algorithms on classical machines or adapt existing databases for quantum simulation, Qute instead (i) compiles an extended form of SQL into gate‑efficient quantum circuits, (ii) employs a hybrid optimizer to dynamically select between quantum and classical execution plans, (iii) introduces selective quantum indexing, and (iv) designs fidelity‑preserving storage to mitigate current qubit constraints. We also present a three‑stage evolution roadmap toward quantum‑native database. Finally, by deploying Qute on a real quantum processor (origin_wukong), we show that it outperforms a classical baseline at scale, and we release an open‑source prototype at https://github.com/weAIDB/Qute.
Authors:Lunjun Zhang, Ryan Chen, Bradly C. Stadie
Abstract:
Building agentic systems that can autonomously self‑improve from experience is a longstanding goal of AI. Large language models (LLMs) today primarily self‑improve via two mechanisms: self‑reflection for context updates, and reinforcement learning (RL) for weight updates. In this work, we propose Evolutionary System Prompt Learning (E‑SPL), a method for jointly improving model contexts and model weights. In each RL iteration, E‑SPL selects multiple system prompts and runs rollouts with each in parallel. It applies RL updates to model weights conditioned on each system prompt, and evolutionary updates to the system prompt population via LLM‑driven mutation and crossover. Each system prompt has a TrueSkill rating for evolutionary selection, updated from relative performance within each RL iteration batch. E‑SPL encourages a natural division between declarative knowledge encoded in prompts and procedural knowledge encoded in weights, resulting in improved performance across reasoning and agentic tasks. For instance, in an easy‑to‑hard (AIME \rightarrow BeyondAIME) generalization setting, E‑SPL improves RL success rate from 38.8% \rightarrow 45.1% while also outperforming reflective prompt evolution (40.0%). Overall, our results show that coupling reinforcement learning with system prompt evolution yields consistent gains in sample efficiency and generalization. Code: https://github.com/LunjunZhang/E‑SPL
Authors:Xiao Wei, Bin Wen, Yuqin Lin, Kai Li, Mingyang gu, Xiaobao Wang, Longbiao Wang, Jianwu Dang
Abstract:
Early diagnosis of Alzheimer's Disease (AD) is crucial for delaying its progression. While AI‑based speech detection is non‑invasive and cost‑effective, it faces a critical data efficiency dilemma due to medical data scarcity and privacy barriers. Therefore, we propose FAL‑AD, a novel framework that synergistically integrates federated learning with data augmentation to systematically optimize data efficiency. Our approach delivers three key breakthroughs: First, absolute efficiency improvement through voice conversion‑based augmentation, which generates diverse pathological speech samples via cross‑category voice‑content recombination. Second, collaborative efficiency breakthrough via an adaptive federated learning paradigm, maximizing cross‑institutional benefits under privacy constraints. Finally, representational efficiency optimization by an attentive cross‑modal fusion model, which achieves fine‑grained word‑level alignment and acoustic‑textual interaction. Evaluated on ADReSSo, FAL‑AD achieves a state‑of‑the‑art multi‑modal accuracy of 91.52%, outperforming all centralized baselines and demonstrating a practical solution to the data efficiency dilemma. Our source code is publicly available at https://github.com/smileix/fal‑ad.
Authors:Erkan Karabulut, Daniel Daza, Paul Groth, Martijn C. Schut, Victoria Degeler
Abstract:
Association Rule Mining (ARM) is a fundamental task for knowledge discovery in tabular data and is widely used in high‑stakes decision‑making. Classical ARM methods rely on frequent itemset mining, leading to rule explosion and poor scalability, while recent neural approaches mitigate these issues but suffer from degraded performance in low‑data regimes. Tabular foundation models (TFMs), pretrained on diverse tabular data with strong in‑context generalization, provide a basis for addressing these limitations. We introduce a model‑agnostic association rule learning framework that extracts association rules from any conditional probabilistic model over tabular data, enabling us to leverage TFMs. We then introduce TabProbe, an instantiation of our framework that utilizes TFMs as conditional probability estimators to learn association rules out‑of‑the‑box without frequent itemset mining. We evaluate our approach on tabular datasets of varying sizes based on standard ARM rule quality metrics and downstream classification performance. The results show that TFMs consistently produce concise, high‑quality association rules with strong predictive performance and remain robust in low‑data settings without task‑specific training. Source code is available at https://github.com/DiTEC‑project/tabprobe.
Authors:Aswathi Varma, Suprosanna Shit, Chinmay Prabhakar, Daniel Scholz, Hongwei Bran Li, Bjoern Menze, Daniel Rueckert, Benedikt Wiestler
Abstract:
Vision Transformers (ViTs) have emerged as the state‑of‑the‑art architecture in representation learning, leveraging self‑attention mechanisms to excel in various tasks. ViTs split images into fixed‑size patches, constraining them to a predefined size and necessitating pre‑processing steps like resizing, padding, or cropping. This poses challenges in medical imaging, particularly with irregularly shaped structures like tumors. A fixed bounding box crop size produces input images with highly variable foreground‑to‑background ratios. Resizing medical images can degrade information and introduce artefacts, impacting diagnosis. Hence, tailoring variable‑sized crops to regions of interest can enhance feature representation capabilities. Moreover, large images are computationally expensive, and smaller sizes risk information loss, presenting a computation‑accuracy tradeoff. We propose VariViT, an improved ViT model crafted to handle variable image sizes while maintaining a consistent patch size. VariViT employs a novel positional embedding resizing scheme for a variable number of patches. We also implement a new batching strategy within VariViT to reduce computational complexity, resulting in faster training and inference times. In our evaluations on two 3D brain MRI datasets, VariViT surpasses vanilla ViTs and ResNet in glioma genotype prediction and brain tumor classification. It achieves F1‑scores of 75.5% and 76.3%, respectively, learning more discriminative features. Our proposed batching strategy reduces computation time by up to 30% compared to conventional architectures. These findings underscore the efficacy of VariViT in image representation learning. Our code can be found here: https://github.com/Aswathi‑Varma/varivit
Authors:Tianyi Ma, Yiyang Li, Yiyue Qian, Zheyuan Zhang, Zehong Wang, Chuxu Zhang, Yanfang Ye
Abstract:
The opioid epidemic continues to ravage communities worldwide, straining healthcare systems, disrupting families, and demanding urgent computational solutions. To combat this lethal opioid crisis, graph learning methods have emerged as a promising paradigm for modeling complex drug‑related phenomena. However, a significant gap remains: there is no comprehensive benchmark for systematically evaluating these methods across real‑world opioid crisis scenarios. To bridge this gap, we introduce OPBench, the first comprehensive opioid benchmark comprising five datasets across three critical application domains: opioid overdose detection from healthcare claims, illicit drug trafficking detection from digital platforms, and drug misuse prediction from dietary patterns. Specifically, OPBench incorporates diverse graph structures, including heterogeneous graphs and hypergraphs, to preserve the rich and complex relational information among drug‑related data. To address data scarcity, we collaborate with domain experts and authoritative institutions to curate and annotate datasets while adhering to privacy and ethical guidelines. Furthermore, we establish a unified evaluation framework with standardized protocols, predefined data splits, and reproducible baselines to facilitate fair and systematic comparison among graph learning methods. Through extensive experiments, we analyze the strengths and limitations of existing graph learning methods, thereby providing actionable insights for future research in combating the opioid crisis. Our source code and datasets are available at https://github.com/Tianyi‑Billy‑Ma/OPBench.
Authors:William L. Tong, Ege Cakar, Cengiz Pehlevan
Abstract:
Recent years have witnessed meteoric progress in reasoning models: neural networks that generate intermediate reasoning traces (RTs) before producing a final output. Despite the rapid advancement, our understanding of how RTs support reasoning, and the limits of this paradigm, remain incomplete. To promote greater clarity, we introduce PITA: a novel large‑scale dataset of over 23 million statements in propositional logic and their corresponding proofs. As a benchmark for robust reasoning, we focus on length generalization: if a model is trained to determine truth or falsity on statements with proofs up to fixed length, how well does it generalize to statements requiring longer proofs? We propose notions of (1) task depth and (2) task breadth, which measure respectively (1) the number of steps required to solve an example from a task and (2) the number of unique examples across a task. We vary these quantities across subsets of PITA, and find that RT models generalize well on broad and shallow subsets, while deteriorating on narrow and deep subsets relative to non‑RT baselines. To determine whether our results are idiosyncratic to PITA or indicative of general phenomena, we compare our results to a simple synthetic task based on syllogisms. Our resulting theory suggests fundamental scalings that limit how well RT models perform on deep tasks, and highlights their generalization strengths on broad tasks. Our findings overall identify fundamental benefits and limitations inherent in using reasoning traces.
Authors:Ryan Fosdick
Abstract:
We describe an adaptation of VACE (Video All‑in‑one Creation and Editing) for real‑time autoregressive video generation. VACE provides unified video control (reference guidance, structural conditioning, inpainting, and temporal extension) but assumes bidirectional attention over full sequences, making it incompatible with streaming pipelines that require fixed chunk sizes and causal attention. The key modification moves reference frames from the diffusion latent space into a parallel conditioning pathway, preserving the fixed chunk sizes and KV caching that autoregressive models require. This adaptation reuses existing pretrained VACE weights without additional training. Across 1.3B and 14B model scales, VACE adds 20‑30% latency overhead for structural control and inpainting, with negligible VRAM cost relative to the base model. Reference‑to‑video fidelity is severely degraded compared to batch VACE due to causal attention constraints. A reference implementation is available at https://github.com/daydreamlive/scope.
Authors:Lingxiang Hu, Yiding Sun, Tianle Xia, Wenwei Li, Ming Xu, Liqun Liu, Peng Shu, Huan Yu, Jie Jiang
Abstract:
While Large Language Model (LLM) agents have achieved remarkable progress in complex reasoning tasks, evaluating their performance in real‑world environments has become a critical problem. Current benchmarks, however, are largely restricted to idealized simulations, failing to address the practical demands of specialized domains like advertising and marketing analytics. In these fields, tasks are inherently more complex, often requiring multi‑round interaction with professional marketing tools. To address this gap, we propose AD‑Bench, a benchmark designed based on real‑world business requirements of advertising and marketing platforms. AD‑Bench is constructed from real user marketing analysis requests, with domain experts providing verifiable reference answers and corresponding reference tool‑call trajectories. The benchmark categorizes requests into three difficulty levels (L1‑L3) to evaluate agents' capabilities under multi‑round, multi‑tool collaboration. Experiments show that on AD‑Bench, Gemini‑3‑Pro achieves Pass@1 = 68.0% and Pass@3 = 83.0%, but performance drops significantly on L3 to Pass@1 = 49.4% and Pass@3 = 62.1%, with a trajectory coverage of 70.1%, indicating that even state‑of‑the‑art models still exhibit substantial capability gaps in complex advertising and marketing analysis scenarios. AD‑Bench provides a realistic benchmark for evaluating and improving advertising marketing agents, the leaderboard and code can be found at https://github.com/Emanual20/adbench‑leaderboard.
Authors:Zheng Chu, Xiao Wang, Jack Hong, Huiming Fan, Yuqi Huang, Yue Yang, Guohai Xu, Chenxiao Zhao, Cheng Xiang, Shengchao Hu, Dongdong Kuang, Ming Liu, Bing Qin, Xing Yu
Abstract:
Large language models are transitioning from generalpurpose knowledge engines to realworld problem solvers, yet optimizing them for deep search tasks remains challenging. The central bottleneck lies in the extreme sparsity of highquality search trajectories and reward signals, arising from the difficulty of scalable longhorizon task construction and the high cost of interactionheavy rollouts involving external tool calls. To address these challenges, we propose REDSearcher, a unified framework that codesigns complex task synthesis, midtraining, and posttraining for scalable searchagent optimization. Specifically, REDSearcher introduces the following improvements: (1) We frame task synthesis as a dualconstrained optimization, where task difficulty is precisely governed by graph topology and evidence dispersion, allowing scalable generation of complex, highquality tasks. (2) We introduce toolaugmented queries to encourage proactive tool use rather than passive recall.(3) During midtraining, we strengthen core atomic capabilities knowledge, planning, and function calling substantially reducing the cost of collecting highquality trajectories for downstream training. (4) We build a local simulated environment that enables rapid, lowcost algorithmic iteration for reinforcement learning experiments. Across both textonly and multimodal searchagent benchmarks, our approach achieves stateoftheart performance. To facilitate future research on longhorizon search agents, we will release 10K highquality complex text search trajectories, 5K multimodal trajectories and 1K text RL query set, and together with code and model checkpoints.
Authors:Yaxuan Kong, Hoyoung Lee, Yoontae Hwang, Alejandro Lopez-Lira, Bradford Levy, Dhagash Mehta, Qingsong Wen, Chanyeol Choi, Yongjae Lee, Stefan Zohren
Abstract:
Large Language Models (LLMs) are increasingly integrated into financial workflows, but evaluation practice has not kept up. Finance‑specific biases can inflate performance, contaminate backtests, and make reported results useless for any deployment claim. We identify five recurring biases in financial LLM applications. They include look‑ahead bias, survivorship bias, narrative bias, objective bias, and cost bias. These biases break financial tasks in distinct ways and they often compound to create an illusion of validity. We reviewed 164 papers from 2023 to 2025 and found that no single bias is discussed in more than 28 percent of studies. This position paper argues that bias in financial LLM systems requires explicit attention and that structural validity should be enforced before any result is used to support a deployment claim. We propose a Structural Validity Framework and an evaluation checklist with minimal requirements for bias diagnosis and future system design. The material is available at https://github.com/Eleanorkong/Awesome‑Financial‑LLM‑Bias‑Mitigation.
Authors:Samir Abdaljalil, Erchin Serpedin, Hasan Kurban
Abstract:
Large language models are increasingly used to answer and verify scientific claims, yet existing evaluations typically assume that a model must always produce a definitive answer. In scientific settings, however, unsupported or uncertain conclusions can be more harmful than abstaining. We study this problem through an abstention‑aware verification framework that decomposes scientific claims into minimal conditions, audits each condition against available evidence using natural language inference (NLI), and selectively decides whether to support, refute, or abstain. We evaluate this framework across two complementary scientific benchmarks: SciFact and PubMedQA, covering both closed‑book and open‑domain evidence settings. Experiments are conducted with six diverse language models, including encoder‑decoder, open‑weight chat models, and proprietary APIs. Across all benchmarks and models, we observe that raw accuracy varies only modestly across architectures, while abstention plays a critical role in controlling error. In particular, confidence‑based abstention substantially reduces risk at moderate coverage levels, even when absolute accuracy improvements are limited. Our results suggest that in scientific reasoning tasks, the primary challenge is not selecting a single best model, but rather determining when available evidence is sufficient to justify an answer. This work highlights abstention‑aware evaluation as a practical and model‑agnostic lens for assessing scientific reliability, and provides a unified experimental basis for future work on selective reasoning in scientific domains. Code is available at https://github.com/sabdaljalil2000/ai4science .
Authors:Chaeeun Lee, T. Michael Yates, Pasquale Minervini, T. Ian Simpson
Abstract:
Clinical decision‑making requires nuanced reasoning over heterogeneous evidence and traceable justifications. While recent LLM multi‑agent systems (MAS) show promise, they largely optimise for outcome accuracy while overlooking process‑grounded reasoning aligned with clinical standards. One critical real‑world case of this is gene‑disease validity curation, where experts must determine whether a gene is causally implicated in a disease by synthesising diverse biomedical evidence. We introduce an agent‑as‑tool reinforcement learning framework for this task with two objectives: (i) process‑level supervision to ensure reasoning follows valid clinical pathways, and (ii) efficient coordination via a hierarchical multi‑agent system. Our evaluation on the ClinGen dataset shows that with outcome‑only rewards, MAS with a GRPO‑trained Qwen3‑4B supervisor agent substantially improves final outcome accuracy from 0.195 with a base model supervisor to 0.732, but results in poor process alignment (0.392 F1). Conversely, with process + outcome rewards, MAS with GRPO‑trained supervisor achieves higher outcome accuracy (0.750) while significantly improving process fidelity to 0.520 F1. Our code is available at https://github.com/chaeeunlee‑io/GeneDiseaseCurationAgents.
Authors:Kaixuan Fang, Yuzhen Lu, Xinyang Mu
Abstract:
Traditional mechanized chestnut harvesting is too costly for small producers, non‑selective, and prone to damaging nuts. Accurate, reliable detection of chestnuts on the orchard floor is crucial for developing low‑cost, vision‑guided automated harvesting technology. However, developing a reliable chestnut detection system faces challenges in complex environments with shading, varying natural light conditions, and interference from weeds, fallen leaves, stones, and other foreign on‑ground objects, which have remained unaddressed. This study collected 319 images of chestnuts on the orchard floor, containing 6524 annotated chestnuts. A comprehensive set of 29 state‑of‑the‑art real‑time object detectors, including 14 in the YOLO (v11‑13) and 15 in the RT‑DETR (v1‑v4) families at varied model scales, was systematically evaluated through replicated modeling experiments for chestnut detection. Experimental results show that the YOLOv12m model achieves the best mAP@0.5 of 95.1% among all the evaluated models, while the RT‑DETRv2‑R101 was the most accurate variant among RT‑DETR models, with mAP@0.5 of 91.1%. In terms of mAP@[0.5:0.95], the YOLOv11x model achieved the best accuracy of 80.1%. All models demonstrate significant potential for real‑time chestnut detection, and YOLO models outperformed RT‑DETR models in terms of both detection accuracy and inference, making them better suited for on‑board deployment. Both the dataset and software programs in this study have been made publicly available at https://github.com/AgFood‑Sensing‑and‑Intelligence‑Lab/ChestnutDetection.
Authors:Haibo Tong, Feifei Zhao, Linghao Feng, Ruoyu Wu, Ruolin Chen, Lu Jia, Zhou Zhao, Jindong Li, Tenglong Li, Erliang Lin, Shuai Yang, Enmeng Lu, Yinqian Sun, Qian Zhang, Zizhe Ruan, Jinyu Fan, Zeyang Yue, Ping Wu, Huangrui Li, Chengyi Sun, Yi Zeng
Abstract:
Rapidly evolving AI exhibits increasingly strong autonomy and goal‑directed capabilities, accompanied by derivative systemic risks that are more unpredictable, difficult to control, and potentially irreversible. However, current AI safety evaluation systems suffer from critical limitations such as restricted risk dimensions and failed frontier risk detection. The lagging safety benchmarks and alignment technologies can hardly address the complex challenges posed by cutting‑edge AI models. To bridge this gap, we propose the "ForesightSafety Bench" AI Safety Evaluation Framework, beginning with 7 major Fundamental Safety pillars and progressively extends to advanced Embodied AI Safety, AI4Science Safety, Social and Environmental AI risks, Catastrophic and Existential Risks, as well as 8 critical industrial safety domains, forming a total of 94 refined risk dimensions. To date, the benchmark has accumulated tens of thousands of structured risk data points and assessment results, establishing a widely encompassing, hierarchically clear, and dynamically evolving AI safety evaluation framework. Based on this benchmark, we conduct systematic evaluation and in‑depth analysis of over twenty mainstream advanced large models, identifying key risk patterns and their capability boundaries. The safety capability evaluation results reveals the widespread safety vulnerabilities of frontier AI across multiple pillars, particularly focusing on Risky Agentic Autonomy, AI4Science Safety, Embodied AI Safety, Social AI Safety and Catastrophic and Existential Risks. Our benchmark is released at https://github.com/Beijing‑AISI/ForesightSafety‑Bench. The project website is available at https://foresightsafety‑bench.beijing‑aisi.ac.cn/.
Authors:Mario Marín Caballero, Miguel Betancourt Alonso, Daniel Díaz-López, Angel Luis Perales Gómez, Pantaleone Nespoli, Gregorio Martínez Pérez
Abstract:
The most valuable asset of any cloud‑based organization is data, which is increasingly exposed to sophisticated cyberattacks. Until recently, the implementation of security measures in DevOps environments was often considered optional by many government entities and critical national services operating in the cloud. This includes systems managing sensitive information, such as electoral processes or military operations, which have historically been valuable targets for cybercriminals. Resistance to security implementation is often driven by concerns over losing agility in software development, increasing the risk of accumulated vulnerabilities. Nowadays, patching software is no longer enough; adopting a proactive cyber defense strategy, supported by Artificial Intelligence (AI), is crucial to anticipating and mitigating threats. Thus, this work proposes integrating the Security Chaos Engineering (SCE) methodology with a new LLM‑based flow to automate the creation of attack defense trees that represent adversary behavior and facilitate the construction of SCE experiments based on these graphical models, enabling teams to stay one step ahead of attackers and implement previously unconsidered defenses. Further detailed information about the experiment performed, along with the steps to replicate it, can be found in the following repository: https://github.com/mariomc14/devsecops‑adversary‑llm.git.
Authors:Kai Guan, Rongyuan Wu, Shuai Li, Wentao Zhu, Wenjun Zeng, Lei Zhang
Abstract:
In real‑world scenarios, the performance of semantic segmentation often deteriorates when processing low‑quality (LQ) images, which may lack clear semantic structures and high‑frequency details. Although image restoration techniques offer a promising direction for enhancing degraded visual content, conventional real‑world image restoration (Real‑IR) models primarily focus on pixel‑level fidelity and often fail to recover task‑relevant semantic cues, limiting their effectiveness when directly applied to downstream vision tasks. Conversely, existing segmentation models trained on high‑quality data lack robustness under real‑world degradations. In this paper, we propose Restoration Adaptation for Semantic Segmentation (RASS), which effectively integrates semantic image restoration into the segmentation process, enabling high‑quality semantic segmentation on the LQ images directly. Specifically, we first propose a Semantic‑Constrained Restoration (SCR) model, which injects segmentation priors into the restoration model by aligning its cross‑attention maps with segmentation masks, encouraging semantically faithful image reconstruction. Then, RASS transfers semantic restoration knowledge into segmentation through LoRA‑based module merging and task‑specific fine‑tuning, thereby enhancing the model's robustness to LQ images. To validate the effectiveness of our framework, we construct a real‑world LQ image segmentation dataset with high‑quality annotations, and conduct extensive experiments on both synthetic and real‑world LQ benchmarks. The results show that SCR and RASS significantly outperform state‑of‑the‑art methods in segmentation and restoration tasks. Code, models, and datasets will be available at https://github.com/Ka1Guan/RASS.git.
Authors:Yuang Ai, Jiaming Han, Shaobin Zhuang, Weijia Mao, Xuefeng Hu, Ziyan Yang, Zhenheng Yang, Huaibo Huang, Xiangyu Yue, Hao Chen
Abstract:
We present BitDance, a scalable autoregressive (AR) image generator that predicts binary visual tokens instead of codebook indices. With high‑entropy binary latents, BitDance lets each token represent up to 2^256 states, yielding a compact yet highly expressive discrete representation. Sampling from such a huge token space is difficult with standard classification. To resolve this, BitDance uses a binary diffusion head: instead of predicting an index with softmax, it employs continuous‑space diffusion to generate the binary tokens. Furthermore, we propose next‑patch diffusion, a new decoding method that predicts multiple tokens in parallel with high accuracy, greatly speeding up inference. On ImageNet 256x256, BitDance achieves an FID of 1.24, the best among AR models. With next‑patch diffusion, BitDance beats state‑of‑the‑art parallel AR models that use 1.4B parameters, while using 5.4x fewer parameters (260M) and achieving 8.7x speedup. For text‑to‑image generation, BitDance trains on large‑scale multimodal tokens and generates high‑resolution, photorealistic images efficiently, showing strong performance and favorable scaling. When generating 1024x1024 images, BitDance achieves a speedup of over 30x compared to prior AR models. We release code and models to facilitate further research on AR foundation models. Code and models are available at: https://github.com/shallowdream204/BitDance.
Authors:Jinzi Zou, Bolin Wang, Liang Li, Shuo Zhang, Nuo Xu, Junzhou Zhao
Abstract:
Flowchart‑oriented dialogue (FOD) systems aim to guide users through multi‑turn decision‑making or operational procedures by following a domain‑specific flowchart to achieve a task goal. In this work, we formalize flowchart reasoning in FOD as grounding user input to flowchart nodes at each dialogue turn while ensuring node transition is consistent with the correct flowchart path. Despite recent advances of LLMs in task‑oriented dialogue systems, adapting them to FOD still faces two limitations: (1) LLMs lack an explicit mechanism to represent and reason over flowchart topology, and (2) they are prone to hallucinations, leading to unfaithful flowchart reasoning. To address these limitations, we propose FloCA, a zero‑shot flowchart‑oriented conversational agent. FloCA uses an LLM for intent understanding and response generation while delegating flowchart reasoning to an external tool that performs topology‑constrained graph execution, ensuring faithful and logically consistent node transitions across dialogue turns. We further introduce an evaluation framework with an LLM‑based user simulator and five new metrics covering reasoning accuracy and interaction efficiency. Extensive experiments on FLODIAL and PFDial datasets highlight the bottlenecks of existing LLM‑based methods and demonstrate the superiority of FloCA. Our codes are available at https://github.com/Jinzi‑Zou/FloCA‑flowchart‑reasoning.
Authors:Zhenyu Zong, Yuchen Wang, Haohong Lin, Lu Gan, Huajie Shao
Abstract:
Trajectory prediction for traffic agents is critical for safe autonomous driving. However, achieving effective zero‑shot generalization in previously unseen domains remains a significant challenge. Motivated by the consistent nature of kinematics across diverse domains, we aim to incorporate domain‑invariant knowledge to enhance zero‑shot trajectory prediction capabilities. The key challenges include: 1) effectively extracting domain‑invariant scene representations, and 2) integrating invariant features with kinematic models to enable generalized predictions. To address these challenges, we propose a novel generalizable Physics‑guided Causal Model (PCM), which comprises two core components: a Disentangled Scene Encoder, which adopts intervention‑based disentanglement to extract domain‑invariant features from scenes, and a CausalODE Decoder, which employs a causal attention mechanism to effectively integrate kinematic models with meaningful contextual information. Extensive experiments on real‑world autonomous driving datasets demonstrate our method's superior zero‑shot generalization performance in unseen cities, significantly outperforming competitive baselines. The source code is released at https://github.com/ZY‑Zong/Physics‑guided‑Causal‑Model.
Authors:Juntong Wang, Libin Chen, Xiyuan Wang, Shijia Kang, Haotong Yang, Da Zheng, Muhan Zhang
Abstract:
Repository‑level bug localization‑the task of identifying where code must be modified to fix a bug‑is a critical software engineering challenge. Standard Large Language Modles (LLMs) are often unsuitable for this task due to context window limitations that prevent them from processing entire code repositories. As a result, various retrieval methods are commonly used, including keyword matching, text similarity, and simple graph‑based heuristics such as Breadth‑First Search. Graph Neural Networks (GNNs) offer a promising alternative due to their ability to model complex, repository‑wide dependencies; however, their application has been hindered by the lack of a dedicated benchmark. To address this gap, we introduce GREPO, the first GNN benchmark for repository‑scale bug localization tasks. GREPO comprises 86 Python repositories and 47294 bug‑fixing tasks, providing graph‑based data structures ready for direct GNN processing. Our evaluation of various GNN architectures shows outstanding performance compared to established information retrieval baselines. This work highlights the potential of GNNs for bug localization and established GREPO as a foundation resource for future research, The code is available at https://github.com/qingpingmo/GREPO.
Authors:Michele Cannito, Riccardo Renzulli, Adson Duarte, Farzad Nikfam, Carlo Alberto Barbano, Enrico Chiesa, Francesco Bruno, Federico Giacobbe, Wojciech Wanha, Arturo Giordano, Marco Grangetto, Fabrizio D'Ascenzo
Abstract:
Severe aortic stenosis is a common and life‑threatening condition in elderly patients, often treated with Transcatheter Aortic Valve Implantation (TAVI). Despite procedural advances, paravalvular aortic regurgitation (PVR) remains one of the most frequent post‑TAVI complications, with a proven impact on long‑term prognosis. In this work, we investigate the potential of deep learning to predict the occurrence of PVR from preoperative cardiac CT. To this end, a dataset of preoperative TAVI patients was collected, and 3D convolutional neural networks were trained on isotropic CT volumes. The results achieved suggest that volumetric deep learning can capture subtle anatomical features from pre‑TAVI imaging, opening new perspectives for personalized risk assessment and procedural optimization. Source code is available at https://github.com/EIDOSLAB/tavi.
Authors:Yuxiang Guo, Zhuoran Du, Nan Tang, Kezheng Tang, Congcong Ge, Yunjun Gao
Abstract:
Document‑to‑table (Doc2Table) extraction derives structured tables from unstructured documents under a target schema, enabling reliable and verifiable SQL‑based data analytics. Although large language models (LLMs) have shown promise in flexible information extraction, their ability to produce precisely structured tables remains insufficiently understood, particularly for indirect extraction that requires complex capabilities such as reasoning and conflict resolution. Existing benchmarks neither explicitly distinguish nor comprehensively cover the diverse capabilities required in Doc2Table extraction. We argue that a capability‑aware benchmark is essential for systematic evaluation. However, constructing such benchmarks using human‑annotated document‑table pairs is costly, difficult to scale, and limited in capability coverage. To address this, we adopt a reverse Table2Doc paradigm and design a multi‑agent synthesis workflow to generate documents from ground‑truth tables. Based on this approach, we present DTBench, a synthetic benchmark that adopts a proposed two‑level taxonomy of Doc2Table capabilities, covering 5 major categories and 13 subcategories. We evaluate several mainstream LLMs on DTBench, and demonstrate substantial performance gaps across models, as well as persistent challenges in reasoning, faithfulness, and conflict resolution. DTBench provides a comprehensive testbed for data generation and evaluation, facilitating future research on Doc2Table extraction. The benchmark is publicly available at https://github.com/ZJU‑DAILY/DTBench.
Authors:Qi Liu, Wanjing Ma
Abstract:
Automating scientific discovery in complex, experiment‑driven domains requires more than iterative mutation of programs; it demands structured hypothesis management, environment interaction, and principled reflection. We present OR‑Agent, a configurable multi‑agent research framework designed for automated exploration in rich experimental environments. OR‑Agent organizes research as a structured tree‑based workflow that explicitly models branching hypothesis generation and systematic backtracking, enabling controlled management of research trajectories beyond simple mutation‑crossover loops. At its core, we introduce an evolutionary‑systematic ideation mechanism that unifies evolutionary selection of research starting points, comprehensive research plan generation, and coordinated exploration within a research tree. We further propose a hierarchical optimization‑inspired reflection system: short‑term experimental reflection operates as a form of verbal gradient providing immediate corrective signals; long‑term reflection accumulates cross‑experiment insights as verbal momentum; and memory compression serves as a regularization mechanism analogous to weight decay, preserving essential signals while mitigating drift. Together, these components form a principled architecture governing research dynamics. We conduct extensive experiments across classical combinatorial optimization benchmarks‑including traveling salesman, capacitated vehicle routing, bin packing, orienteering, and multiple knapsack problems‑as well as simulation‑based cooperative driving scenarios. Results demonstrate that OR‑Agent outperforms strong evolutionary baselines while providing a general, extensible, and inspectable framework for AI‑assisted scientific discovery. OR‑Agent source code and experiments data are publicly available at https://github.com/qiliuchn/OR‑Agent.
Authors:Heng Zhi, Wentao Tan, Lei Zhu, Fengling Li, Jingjing Li, Guoli Yang, Heng Tao Shen
Abstract:
While vision‑language‑action (VLA) models have advanced generalist robotic learning, cross‑embodiment transfer remains challenging due to kinematic heterogeneity and the high cost of collecting sufficient real‑world demonstrations to support fine‑tuning. Existing cross‑embodiment policies typically rely on shared‑private architectures, which suffer from limited capacity of private parameters and lack explicit adaptation mechanisms. To address these limitations, we introduce MOTIF for efficient few‑shot cross‑embodiment transfer that decouples embodiment‑agnostic spatiotemporal patterns, termed action motifs, from heterogeneous action data. Specifically, MOTIF first learns unified motifs via vector quantization with progress‑aware alignment and embodiment adversarial constraints to ensure temporal and cross‑embodiment consistency. We then design a lightweight predictor that predicts these motifs from real‑time inputs to guide a flow‑matching policy, fusing them with robot‑specific states to enable action generation on new embodiments. Evaluations across both simulation and real‑world environments validate the superiority of MOTIF, which significantly outperforms strong baselines in few‑shot transfer scenarios by 6.5% in simulation and 43.7% in real‑world settings. Code is available at https://github.com/buduz/MOTIF.
Authors:Linjie Xu, Yanlin Zhang, Quan Gan, Minjie Wang, David Wipf
Abstract:
Relational databases (RDBs) contain vast amounts of heterogeneous tabular information that can be exploited for predictive modeling purposes. But since the space of potential targets is vast across enterprise settings, how can we avoid retraining a new model each time we wish to predict a new quantity of interest? Foundation models based on in‑context learning (ICL) offer a convenient option, but so far are largely restricted to single‑table operability. In generalizing to multiple interrelated tables, it is essential to compress variably‑sized RDB neighborhoods into fixed‑length ICL samples for consumption by the decoder. However, the details here are critical: unlike existing supervised learning RDB pipelines, we provide theoretical and empirical evidence that ICL‑specific compression should be constrained \emphwithin high‑dimensional RDB columns where all entities share units and roles, not across columns where the relevance of heterogeneous data types cannot possibly be determined without label information. Conditioned on this restriction, we then demonstrate that encoder expressiveness is actually not compromised by excluding trainable parameters. Hence we arrive at a principled family of RDB encoders that can be seamlessly paired with already‑existing single‑table ICL foundation models, whereby no training or fine‑tuning is required. From a practical standpoint, we develop scalable SQL primitives to implement the encoder stage, resulting in an easy‑to‑use open‑source RDB foundation model\footnote\labelfoot: RDBLearn_learn https://github.com/HKUSHXLab/rdblearn capable of robust performance on unseen datasets out of the box.
Authors:Weibin Liao, Jian-guang Lou, Haoyi Xiong
Abstract:
While agentic AI systems rely on LLMs to translate user intent into structured function calls, this process is fraught with computational redundancy, leading to high inference latency that hinders real‑time applications. This paper identifies and addresses three key redundancies: (1) the redundant processing of a large library of function descriptions for every request; (2) the redundant use of a large, slow model to generate an entire, often predictable, token sequence; and (3) the redundant generation of fixed, boilerplate parameter syntax. We introduce HyFunc, a novel framework that systematically eliminates these inefficiencies. HyFunc employs a hybrid‑model cascade where a large model distills user intent into a single "soft token." This token guides a lightweight retriever to select relevant functions and directs a smaller, prefix‑tuned model to generate the final call, thus avoiding redundant context processing and full‑sequence generation by the large model. To eliminate syntactic redundancy, our "dynamic templating" technique injects boilerplate parameter syntax on‑the‑fly within an extended vLLM engine. To avoid potential limitations in generalization, we evaluate HyFunc on an unseen benchmark dataset, BFCL. Experimental results demonstrate that HyFunc achieves an excellent balance between efficiency and performance. It achieves an inference latency of 0.828 seconds, outperforming all baseline models, and reaches a performance of 80.1%, surpassing all models with a comparable parameter scale. These results suggest that HyFunc offers a more efficient paradigm for agentic AI. Our code is publicly available at https://github.com/MrBlankness/HyFunc.
Authors:Khang Nguyen Quoc, Phuong D. Dao, Luyl-Da Quach
Abstract:
Foundation models and vision‑language pre‑training have significantly advanced Vision‑Language Models (VLMs), enabling multimodal processing of visual and linguistic data. However, their application in domain‑specific agricultural tasks, such as plant pathology, remains limited due to the lack of large‑scale, comprehensive multimodal image‑‑text datasets and benchmarks. To address this gap, we introduce LeafNet, a comprehensive multimodal dataset, and LeafBench, a visual question‑answering benchmark developed to systematically evaluate the capabilities of VLMs in understanding plant diseases. The dataset comprises 186,000 leaf digital images spanning 97 disease classes, paired with metadata, generating 13,950 question‑answer pairs spanning six critical agricultural tasks. The questions assess various aspects of plant pathology understanding, including visual symptom recognition, taxonomic relationships, and diagnostic reasoning. Benchmarking 12 state‑of‑the‑art VLMs on our LeafBench dataset, we reveal substantial disparity in their disease understanding capabilities. Our study shows performance varies markedly across tasks: binary healthy‑‑diseased classification exceeds 90% accuracy, while fine‑grained pathogen and species identification remains below 65%. Direct comparison between vision‑only models and VLMs demonstrates the critical advantage of multimodal architectures: fine‑tuned VLMs outperform traditional vision models, confirming that integrating linguistic representations significantly enhances diagnostic precision. These findings highlight critical gaps in current VLMs for plant pathology applications and underscore the need for LeafBench as a rigorous framework for methodological advancement and progress evaluation toward reliable AI‑assisted plant disease diagnosis. Code is available at https://github.com/EnalisUs/LeafBench.
Authors:Ruomeng Ding, Yifei Pang, He Sun, Yizhong Wang, Zhiwei Steven Wu, Zhun Deng
Abstract:
Evaluation and alignment pipelines for large language models increasingly rely on LLM‑based judges, whose behavior is guided by natural‑language rubrics and validated on benchmarks. We identify a previously under‑recognized vulnerability in this workflow, which we term Rubric‑Induced Preference Drift (RIPD). Even when rubric edits pass benchmark validation, they can still produce systematic and directional shifts in a judge's preferences on target domains. Because rubrics serve as a high‑level decision interface, such drift can emerge from seemingly natural, criterion‑preserving edits and remain difficult to detect through aggregate benchmark metrics or limited spot‑checking. We further show this vulnerability can be exploited through rubric‑based preference attacks, in which benchmark‑compliant rubric edits steer judgments away from a fixed human or trusted reference on target domains, systematically inducing RIPD and reducing target‑domain accuracy up to 9.5% (helpfulness) and 27.9% (harmlessness). When these judgments are used to generate preference labels for downstream post‑training, the induced bias propagates through alignment pipelines and becomes internalized in trained policies. This leads to persistent and systematic drift in model behavior. Overall, our findings highlight evaluation rubrics as a sensitive and manipulable control interface, revealing a system‑level alignment risk that extends beyond evaluator reliability alone. The code is available at: https://github.com/ZDCSlab/Rubrics‑as‑an‑Attack‑Surface. Warning: Certain sections may contain potentially harmful content that may not be appropriate for all readers.
Authors:Jaechul Roh, Eugene Bagdasarian, Hamed Haddadi, Ali Shahin Shamsabadi
Abstract:
LLM‑powered agents are beginning to automate user's tasks across the open web, often with access to user resources such as emails and calendars. Unlike standard LLMs answering questions in a controlled ChatBot setting, web agents act "in the wild", interacting with third parties and leaving behind an action trace. Therefore, we ask the question: how do web agents handle user resources when accomplishing tasks on their behalf across live websites? In this paper, we formalize Natural Agentic Oversharing ‑‑ the unintentional disclosure of task‑irrelevant user information through an agent trace of actions on the web. We introduce SPILLage, a framework that characterizes oversharing along two dimensions: channel (content vs. behavior) and directness (explicit vs. implicit). This taxonomy reveals a critical blind spot: while prior work focuses on text leakage, web agents also overshare behaviorally through clicks, scrolls, and navigation patterns that can be monitored. We benchmark 180 tasks on live e‑commerce sites with ground‑truth annotations separating task‑relevant from task‑irrelevant attributes. Across 1,080 runs spanning two agentic frameworks and three backbone LLMs, we demonstrate that oversharing is pervasive with behavioral oversharing dominates content oversharing by 5x. This effect persists ‑‑ and can even worsen ‑‑ under prompt‑level mitigation. However, removing task‑irrelevant information before execution improves task success by up to 17.9%, demonstrating that reducing oversharing improves task success. Our findings underscore that protecting privacy in web agents is a fundamental challenge, requiring a broader view of "output" that accounts for what agents do on the web, not just what they type. Our datasets and code are available at https://github.com/jrohsc/SPILLage.
Authors:Huajian Zeng, Lingyun Chen, Jiaqi Yang, Yuantai Zhang, Fan Shi, Peidong Liu, Xingxing Zuo
Abstract:
Recent vision‑language‑action (VLA) models can generate plausible end‑effector motions, yet they often fail in long‑horizon, contact‑rich tasks because the underlying hand‑object interaction (HOI) structure is not explicitly represented. An embodiment‑agnostic interaction representation that captures this structure would make manipulation behaviors easier to validate and transfer across robots. We propose FlowHOI, a two‑stage flow‑matching framework that generates semantically grounded, temporally coherent HOI sequences, comprising hand poses, object poses, and hand‑object contact states, conditioned on an egocentric observation, a language instruction, and a 3D Gaussian splatting (3DGS) scene reconstruction. We decouple geometry‑centric grasping from semantics‑centric manipulation, conditioning the latter on compact 3D scene tokens and employing a motion‑text alignment loss to semantically ground the generated interactions in both the physical scene layout and the language instruction. To address the scarcity of high‑fidelity HOI supervision, we introduce a reconstruction pipeline that recovers aligned hand‑object trajectories and meshes from large‑scale egocentric videos, yielding an HOI prior for robust generation. Across the GRAB and HOT3D benchmarks, FlowHOI achieves the highest action recognition accuracy and a 1.7× higher physics simulation success rate than the strongest diffusion‑based baseline, while delivering a 40× inference speedup. We further demonstrate real‑robot execution on four dexterous manipulation tasks, illustrating the feasibility of retargeting generated HOI representations to real‑robot execution pipelines.
Authors:Anhao Zhao, Ziyang Chen, Junlong Tong, Yingqi Fan, Fanghua Ye, Shuhao Li, Yunpu Ma, Wenjie Li, Xiaoyu Shen
Abstract:
Large reasoning models (LRMs) are commonly trained with reinforcement learning (RL) to explore long chain‑of‑thought reasoning, achieving strong performance at high computational cost. Recent methods add multi‑reward objectives to jointly optimize correctness and brevity, but these complex extensions often destabilize training and yield suboptimal trade‑offs. We revisit this objective and challenge the necessity of such complexity. Through principled analysis, we identify fundamental misalignments in this paradigm: KL regularization loses its intended role when correctness and length are directly verifiable, and group‑wise normalization becomes ambiguous under multiple reward signals. By removing these two items and simplifying the reward to a truncation‑based length penalty, we show that the optimization problem reduces to supervised fine‑tuning on self‑generated data filtered for both correctness and conciseness. We term this simplified training strategy on‑policy SFT. Despite its simplicity, on‑policy SFT consistently defines the accuracy‑efficiency Pareto frontier. It reduces CoT length by up to 80 while maintaining original accuracy, surpassing more complex RL‑based methods across five benchmarks. Furthermore, it significantly enhances training efficiency, reducing GPU memory usage by 50% and accelerating convergence by 70%. Our code is available at https://github.com/EIT‑NLP/On‑Policy‑SFT.
Authors:Xu Li, Simon Yu, Minzhou Pan, Yiyou Sun, Bo Li, Dawn Song, Xue Lin, Weiyan Shi
Abstract:
LLM‑based agents are becoming increasingly capable, yet their safety lags behind. This creates a gap between what agents can do and should do. This gap widens as agents engage in multi‑turn interactions and employ diverse tools, introducing new risks overlooked by existing benchmarks. To systematically scale safety testing into multi‑turn, tool‑realistic settings, we propose a principled taxonomy that transforms single‑turn harmful tasks into multi‑turn attack sequences. Using this taxonomy, we construct MT‑AgentRisk (Multi‑Turn Agent Risk Benchmark), the first benchmark to evaluate multi‑turn tool‑using agent safety. Our experiments reveal substantial safety degradation: the Attack Success Rate (ASR) increases by 16% on average across open and closed models in multi‑turn settings. To close this gap, we propose ToolShield, a training‑free, tool‑agnostic, self‑exploration defense: when encountering a new tool, the agent autonomously generates test cases, executes them to observe downstream effects, and distills safety experiences for deployment. Experiments show that ToolShield effectively reduces ASR by 30% on average in multi‑turn interactions. Our code is available at https://github.com/CHATS‑lab/ToolShield.
Authors:Zhen Wang, Yiming Gao, Jieyuan Liu, Enze Ma, Jefferson Chen, Mark Antkowiak, Mengzhou Hu, JungHo Kong, Dexter Pratt, Zhiting Hu, Wei Wang, Trey Ideker, Eric P. Xing
Abstract:
Single‑cell RNA‑seq (scRNA‑seq) enables atlas‑scale profiling of complex tissues, revealing rare lineages and transient states. Yet, assigning biologically valid cell identities remains a bottleneck because markers are tissue‑ and state‑dependent, and novel states lack references. We present CellMaster, an AI agent that mimics expert practice for zero‑shot cell‑type annotation. Unlike existing automated tools, CellMaster leverages LLM‑encoded knowledge (e.g., GPT‑4o) to perform on‑the‑fly annotation with interpretable rationales, without pre‑training or fixed marker databases. Across 9 datasets spanning 8 tissues, CellMaster improved accuracy by 7.1% over best‑performing baselines (including CellTypist and scTab) in automatic mode. With human‑in‑the‑loop refinement, this advantage increased to 18.6%, with a 22.1% gain on subtype populations. The system demonstrates particular strength in rare and novel cell states where baselines often fail. Source code and the web application are available at \hrefhttps://github.com/AnonymousGym/CellMasterhttps://github.com/AnonymousGym/CellMaster.
Authors:Daesik Jang, Morgan Lindsay Heisler, Linzi Xing, Yifei Li, Edward Wang, Ying Xiong, Yong Zhang, Zhenan Fan
Abstract:
Automatically generating and iteratively editing academic slide decks requires more than document summarization. It demands faithful content selection, coherent slide organization, layout‑aware rendering, and robust multi‑turn instruction following. However, existing benchmarks and evaluation protocols do not adequately measure these challenges. To address this gap, we introduce the Deck Edits and Compliance Kit Benchmark (DECKBench), an evaluation framework for multi‑agent slide generation and editing. DECKBench is built on a curated dataset of paper to slide pairs augmented with realistic, simulated editing instructions. Our evaluation protocol systematically assesses slide‑level and deck‑level fidelity, coherence, layout quality, and multi‑turn instruction following. We further implement a modular multi‑agent baseline system that decomposes the slide generation and editing task into paper parsing and summarization, slide planning, HTML creation, and iterative editing. Experimental results demonstrate that the proposed benchmark highlights strengths, exposes failure modes, and provides actionable insights for improving multi‑agent slide generation and editing systems. Overall, this work establishes a standardized foundation for reproducible and comparable evaluation of academic presentation generation and editing. Code and data are publicly available at https://github.com/morgan‑heisler/DeckBench .
Authors:Yifan Tan, Yifu Sun, Shirui Huang, Hong Liu, Guanghua Yu, Jianchen Zhu, Yangdong Deng
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities, yet they encounter significant computational bottlenecks due to the massive volume of visual tokens. Consequently, visual token pruning, which substantially reduces the token count, has emerged as a critical technique for accelerating MLLM inference. Existing approaches focus on token importance, diversity, or an intuitive combination of both, without a principled framework for their optimal integration. To address this issue, we first conduct a systematic analysis to characterize the trade‑off between token importance and semantic diversity. Guided by this analysis, we propose the Importance and Diversity Pruner (IDPruner), which leverages the Maximal Marginal Relevance (MMR) algorithm to achieve a Pareto‑optimal balance between these two objectives. Crucially, our method operates without requiring attention maps, ensuring full compatibility with FlashAttention and efficient deployment via one‑shot pruning. We conduct extensive experiments across various model architectures and multimodal benchmarks, demonstrating that IDPruner achieves state‑of‑the‑art performance and superior generalization across diverse architectures and tasks. Notably, on Qwen2.5‑VL‑7B‑Instruct, IDPruner retains 95.18% of baseline performance when pruning 75% of the tokens, and still maintains 86.40% even under an extreme 90% pruning ratio. Our code is available at https://github.com/Tencent/AngelSlim.
Authors:Jiahao Qin
Abstract:
High‑speed optical‑resolution photoacoustic microscopy (OR‑PAM) with bidirectional raster scanning doubles imaging speed but introduces coupled domain shift and geometric misalignment between forward and backward scan lines. Existing methods, constrained by brightness constancy assumptions, achieve limited alignment quality (NCC~\leq 0.96). We propose PCReg‑Net, a progressive contrast‑guided registration framework that performs coarse‑to‑fine alignment through four lightweight modules: (1)~a registration U‑Net for coarse alignment, (2)~a reference feature extractor capturing multi‑scale structural cues, (3)~a contrast module that identifies residual misalignment by comparing coarse‑registered and reference features, and (4)~a refinement U‑Net with feature injection for high‑fidelity output. We further propose the Temporal NCC (TNCC) and Temporal NCC Gap (TNCG) for reference‑free evaluation of inter‑frame temporal consistency. On OR‑PAM‑Reg‑4K (432 test samples), PCReg‑Net achieves NCC of 0.983, SSIM of 0.982, and PSNR of 46.96 dB, surpassing the state‑of‑the‑art by over 14 dB at real‑time speed. Code is available at https://github.com/JiahaoQin/PCReg‑Net
Authors:Darren Li, Meiqi Chen, Chenze Shao, Fandong Meng, Jie Zhou
Abstract:
Frontier language models improve with additional test‑time computation, but serial reasoning or uncoordinated parallel sampling can be compute‑inefficient under fixed inference budgets. We propose SELFCEST, which equips a base model with the ability to spawn same‑weight clones in separate parallel contexts by agentic reinforcement learning. Training is end‑to‑end under a global task reward with shared‑parameter rollouts, yielding a learned controller that allocates both generation and context budget across branches. Across challenging math reasoning benchmarks and long‑context multi‑hop QA, SELFCEST improves the accuracy‑cost Pareto frontier relative to monolithic baselines at matched inference budget, and exhibits out‑of‑distribution generalization in both domains.
Authors:Deepak Babu Piskala
Abstract:
Large language model (LLM) agents have emerged as powerful tools for complex tasks, yet their ability to adapt to individual users remains fundamentally limited. We argue this limitation stems from a critical architectural conflation: current systems treat memory, learning, and personalization as a unified capability rather than three distinct mechanisms requiring different infrastructure, operating on different timescales, and benefiting from independent optimization. We propose MAPLE (Memory‑Adaptive Personalized LEarning), a principled decomposition where Memory handles storage and retrieval infrastructure; Learning extracts intelligence from accumulated interactions asynchronously; and Personalization applies learned knowledge in real‑time within finite context budgets. Each component operates as a dedicated sub‑agent with specialized tooling and well‑defined interfaces. Experimental evaluation on the MAPLE‑Personas benchmark demonstrates that our decomposition achieves a 14.6% improvement in personalization score compared to a stateless baseline (p < 0.01, Cohen's d = 0.95) and increases trait incorporation rate from 45% to 75% ‑‑ enabling agents that genuinely learn and adapt.
Authors:Najmul Hasan, Prashanth BusiReddyGari
Abstract:
Large language models are increasingly deployed in multi‑agent systems, yet we lack benchmarks that test whether they can coordinate under resource contention. We introduce DPBench, a benchmark based on the Dining Philosophers problem that evaluates LLM coordination across eight conditions that vary decision timing, group size, and communication. Our experiments with GPT‑5.2, Claude Opus 4.5, and Grok 4.1 reveal a striking asymmetry: LLMs coordinate effectively in sequential settings but fail when decisions must be made simultaneously, with deadlock rates exceeding 95% under some conditions. We trace this failure to convergent reasoning, where agents independently arrive at identical strategies that, when executed simultaneously, guarantee deadlock. Contrary to expectations, enabling communication does not resolve this problem and can even increase deadlock rates. Our findings suggest that multi‑agent LLM systems requiring concurrent resource access may need external coordination mechanisms rather than relying on emergent coordination. DPBench is released as an open‑source benchmark. Code and benchmark are available at https://github.com/najmulhasan‑code/dpbench.
Authors:Yuqi Xiong, Chunyi Peng, Zhipeng Xu, Zhenghao Liu, Zulong Chen, Yukun Yan, Shuo Wang, Yu Gu, Ge Yu
Abstract:
Visual Retrieval‑Augmented Generation (VRAG) enhances Vision‑Language Models (VLMs) by incorporating external visual documents to address a given query. Existing VRAG frameworks usually depend on rigid, pre‑defined external tools to extend the perceptual capabilities of VLMs, typically by explicitly separating visual perception from subsequent reasoning processes. However, this decoupled design can lead to unnecessary loss of visual information, particularly when image‑based operations such as cropping are applied. In this paper, we propose Lang2Act, which enables fine‑grained visual perception and reasoning through self‑emergent linguistic toolchains. Rather than invoking fixed external engines, Lang2Act collects self‑emergent actions as linguistic tools and leverages them to enhance the visual perception capabilities of VLMs. To support this mechanism, we design a two‑stage Reinforcement Learning (RL)‑based training framework. Specifically, the first stage optimizes VLMs to self‑explore high‑quality actions for constructing a reusable linguistic toolbox, and the second stage further optimizes VLMs to exploit these linguistic tools for downstream reasoning effectively. Experimental results demonstrate the effectiveness of Lang2Act in substantially enhancing the visual perception capabilities of VLMs, achieving performance improvements of over 4%. All code and data are available at https://github.com/NEUIR/Lang2Act.
Authors:Bowen Liu, Zhi Wu, Runquan Xie, Zhanhui Kang, Jia Li
Abstract:
Scaling verifiable training signals remains a key bottleneck for Reinforcement Learning from Verifiable Rewards (RLVR). Logical reasoning is a natural substrate: constraints are formal and answers are programmatically checkable. However, prior synthesis pipelines either depend on expert‑written code or operate within fixed templates/skeletons, which limits growth largely to instance‑level perturbations. We propose SSLogic, an agentic meta‑synthesis framework that scales at the task‑family level by iteratively synthesizing and repairing executable Generator‑‑Validator program pairs in a closed Generate‑‑Validate‑‑Repair loop, enabling continuous family evolution with controllable difficulty. To ensure reliability, we introduce a Multi‑Gate Validation Protocol that combines multi‑strategy consistency checks with Adversarial Blind Review, where independent agents must solve instances by writing and executing code to filter ambiguous or ill‑posed tasks. Starting from 400 seed families, two evolution rounds expand to 953 families and 21,389 verifiable instances (from 5,718). Training on SSLogic‑evolved data yields consistent gains over the seed baseline at matched training steps, improving SynLogic by +5.2, BBEH by +1.4, AIME25 by +3.0, and Brumo25 by +3.7.
Authors:Sayan Deb Sarkar, Rémi Pautrat, Ondrej Miksik, Marc Pollefeys, Iro Armeni, Mahdi Rad, Mihai Dusmanu
Abstract:
Video Language Models (VideoLMs) empower AI systems to understand temporal dynamics in videos. To fit to the maximum context window constraint, current methods use keyframe sampling which can miss both macro‑level events and micro‑level details due to the sparse temporal coverage. Furthermore, processing full images and their tokens for each frame incurs substantial computational overhead. To address these limitations, we propose to leverage video codec primitives (specifically motion vectors and residuals) which natively encode video redundancy and sparsity without requiring expensive full‑image encoding for most frames. To this end, we introduce lightweight transformer‑based encoders that aggregate codec primitives and align their representations with image encoder embeddings through a pre‑training strategy that accelerates convergence during end‑to‑end fine‑tuning. Our approach reduces the time‑to‑first‑token by up to 86% and token usage by up to 93% compared to standard VideoLMs. Moreover, by varying the keyframe and codec primitive densities we are able to maintain or exceed performance on 14 diverse video understanding benchmarks spanning general question answering, temporal reasoning, long‑form understanding, and spatial scene understanding.
Authors:Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, Nicu Sebe, Mubarak Shah
Abstract:
Direct Preference Optimization (DPO) has been proposed as an effective and efficient alternative to reinforcement learning from human feedback (RLHF). However, neither RLHF nor DPO take into account the fact that learning certain preferences is more difficult than learning other preferences, rendering the optimization process suboptimal. To address this gap in text‑to‑image generation, we recently proposed Curriculum‑DPO, a method that organizes image pairs by difficulty. In this paper, we introduce Curriculum‑DPO++, an enhanced method that combines the original data‑level curriculum with a novel model‑level curriculum. More precisely, we propose to dynamically increase the learning capacity of the denoising network as training advances. We implement this capacity increase via two mechanisms. First, we initialize the model with only a subset of the trainable layers used in the original Curriculum‑DPO. As training progresses, we sequentially unfreeze layers until the configuration matches the full baseline architecture. Second, as the fine‑tuning is based on Low‑Rank Adaptation (LoRA), we implement a progressive schedule for the dimension of the low‑rank matrices. Instead of maintaining a fixed capacity, we initialize the low‑rank matrices with a dimension significantly smaller than that of the baseline. As training proceeds, we incrementally increase their rank, allowing the capacity to grow until it converges to the same rank value as in Curriculum‑DPO. Furthermore, we propose an alternative ranking strategy to the one employed by Curriculum‑DPO. Finally, we compare Curriculum‑DPO++ against Curriculum‑DPO and other state‑of‑the‑art preference optimization approaches on nine benchmarks, outperforming the competing methods in terms of text alignment, aesthetics and human preference. Our code is available at https://github.com/CroitoruAlin/Curriculum‑DPO.
Authors:Yufeng Liu, Hang Yu, Juntu Zhao, Bocheng Li, Di Zhang, Mingzhu Li, Wenxuan Wu, Yingdong Hu, Junyuan Xie, Junliang Guo, Dequan Wang, Yang Gao
Abstract:
Action chunking enables Vision Language Action (VLA) models to run in real time, but naive chunked execution often exhibits discontinuities at chunk boundaries. Real‑Time Chunking (RTC) alleviates this issue but is external to the policy, leading to spurious multimodal switching and trajectories that are not intrinsically smooth. We propose Legato, a training‑time continuation method for action‑chunked flow‑based VLA policies. Specifically, Legato initializes denoising from a schedule‑shaped mixture of known actions and noise, exposing the model to partial action information. Moreover, Legato reshapes the learned flow dynamics to ensure that the denoising process remains consistent between training and inference under per‑step guidance. Legato further uses randomized schedule condition during training to support varying inference delays and achieve controllable smoothness. Empirically, Legato produces smoother trajectories and reduces spurious multimodal switching during execution, leading to less hesitation and shorter task completion time. Extensive real‑world experiments show that Legato consistently outperforms RTC across five manipulation tasks, achieving approximately 10% improvements in both trajectory smoothness and task completion time.
Authors:Xiao Wang, Xingxing Xiong, Jinfeng Gao, Xufeng Lou, Bo Jiang, Si-bao Chen, Yaowei Wang, Yonghong Tian
Abstract:
Event stream‑based Visual Place Recognition (VPR) is an emerging research direction that offers a compelling solution to the instability of conventional visible‑light cameras under challenging conditions such as low illumination, overexposure, and high‑speed motion. Recognizing the current scarcity of dedicated datasets in this domain, we introduce EPRBench, a high‑quality benchmark specifically designed for event stream‑based VPR. EPRBench comprises 10K event sequences and 65K event frames, collected using both handheld and vehicle‑mounted setups to comprehensively capture real‑world challenges across diverse viewpoints, weather conditions, and lighting scenarios. To support semantic‑aware and language‑integrated VPR research, we provide LLM‑generated scene descriptions, subsequently refined through human annotation, establishing a solid foundation for integrating LLMs into event‑based perception pipelines. To facilitate systematic evaluation, we implement and benchmark 15 state‑of‑the‑art VPR algorithms on EPRBench, offering a strong baseline for future algorithmic comparisons. Furthermore, we propose a novel multi‑modal fusion paradigm for VPR: leveraging LLMs to generate textual scene descriptions from raw event streams, which then guide spatially attentive token selection, cross‑modal feature fusion, and multi‑scale representation learning. This framework not only achieves highly accurate place recognition but also produces interpretable reasoning processes alongside its predictions, significantly enhancing model transparency and explainability. The dataset and source code will be released on https://github.com/Event‑AHU/Neuromorphic_ReID
Authors:Yunshuang Nie, Bingqian Lin, Minzhe Niu, Kun Xiang, Jianhua Han, Guowei Huang, Xingyue Quan, Hang Xu, Bokui Chen, Xiaodan Liang
Abstract:
Pre‑trained Multi‑modal Large Language Models (MLLMs) provide a knowledge‑rich foundation for post‑training by leveraging their inherent perception and reasoning capabilities to solve complex tasks. However, the lack of an efficient evaluation framework impedes the diagnosis of their performance bottlenecks. Current evaluation primarily relies on testing after supervised fine‑tuning, which introduces laborious additional training and autoregressive decoding costs. Meanwhile, common pre‑training metrics cannot quantify a model's perception and reasoning abilities in a disentangled manner. Furthermore, existing evaluation benchmarks are typically limited in scale or misaligned with pre‑training objectives. Thus, we propose RADAR, an efficient ability‑centric evaluation framework for Revealing Asymmetric Development of Abilities in MLLM pRe‑training. RADAR involves two key components: (1) Soft Discrimination Score, a novel metric for robustly tracking ability development without fine‑tuning, based on quantifying nuanced gradations of the model preference for the correct answer over distractors; and (2) Multi‑Modal Mixture Benchmark, a new 15K+ sample benchmark for comprehensively evaluating pre‑trained MLLMs' perception and reasoning abilities in a 0‑shot manner, where we unify authoritative benchmark datasets and carefully collect new datasets, extending the evaluation scope and addressing the critical gaps in current benchmarks. With RADAR, we comprehensively reveal the asymmetric development of perceptual and reasoning capabilities in pretrained MLLMs across diverse factors, including data volume, model size, and pretraining strategy. Our RADAR underscores the need for a decomposed perspective on pre‑training ability bottlenecks, informing targeted interventions to advance MLLMs efficiently. Our code is publicly available at https://github.com/Nieysh/RADAR.
Authors:Jiangkai Wu, Zhiyuan Ren, Junquan Zhong, Liming Liu, Xinggong Zhang
Abstract:
AI Video Assistant emerges as a new paradigm for Real‑time Communication (RTC), where one peer is a Multimodal Large Language Model (MLLM) deployed in the cloud. This makes interaction between humans and AI more intuitive, akin to chatting with a real person. However, a fundamental mismatch exists between current RTC frameworks and AI Video Assistants, stemming from the drastic shift in Quality of Experience (QoE) and more challenging networks. Measurements on our production prototype also confirm that current RTC fails, causing latency spikes and accuracy drops. To address these challenges, we propose Artic, an AI‑oriented RTC framework for MLLM Video Assistants, exploring the shift from "humans watching video" to "AI understanding video." Specifically, Artic proposes: (1) Response Capability‑aware Adaptive Bitrate, which utilizes MLLM accuracy saturation to proactively cap bitrate, reserving bandwidth headroom to absorb future fluctuations for latency reduction; (2) Zero‑overhead Context‑aware Streaming, which allocates limited bitrate to regions most important for the response, maintaining accuracy even under ultra‑low bitrates; and (3) Degraded Video Understanding Benchmark, the first benchmark evaluating how RTC‑induced video degradation affects MLLM accuracy. Prototype experiments using real‑world uplink traces show that compared with existing methods, Artic significantly improves accuracy by 15.12% and reduces latency by 135.31 ms. We will release the benchmark and codes at https://github.com/pku‑netvideo/DeViBench.
Authors:Sein Kim, Sangwu Park, Hongseok Kang, Wonjoong Kim, Jimin Seo, Yeonjun In, Kanghoon Yoon, Chanyoung Park
Abstract:
Traditional methods for automating recommender system design, such as Neural Architecture Search (NAS), are often constrained by a fixed search space defined by human priors, limiting innovation to pre‑defined operators. While recent LLM‑driven code evolution frameworks shift fixed search space target to open‑ended program spaces, they primarily rely on scalar metrics (e.g., NDCG, Hit Ratio) that fail to provide qualitative insights into model failures or directional guidance for improvement. To address this, we propose Self‑EvolveRec, a novel framework that establishes a directional feedback loop by integrating a User Simulator for qualitative critiques and a Model Diagnosis Tool for quantitative internal verification. Furthermore, we introduce a Diagnosis Tool ‑ Model Co‑Evolution strategy to ensure that evaluation criteria dynamically adapt as the recommendation architecture evolves. Extensive experiments demonstrate that Self‑EvolveRec significantly outperforms state‑of‑the‑art NAS and LLM‑driven code evolution baselines in both recommendation performance and user satisfaction. Our code is available at https://github.com/Sein‑Kim/self_evolverec.
Authors:Ke Xu, Yixin Wang, Zhongcheng Li, Hao Cui, Jinshui Hu, Xingyi Zhang
Abstract:
Elastic precision quantization enables multi‑bit deployment via a single optimization pass, fitting diverse quantization scenarios.Yet, the high storage and optimization costs associated with the Transformer architecture, research on elastic quantization remains limited, particularly for large language models.This paper proposes QuEPT, an efficient post‑training scheme that reconstructs block‑wise multi‑bit errors with one‑shot calibration on a small data slice. It can dynamically adapt to various predefined bit‑widths by cascading different low‑rank adapters, and supports real‑time switching between uniform quantization and mixed precision quantization without repeated optimization. To enhance accuracy and robustness, we introduce Multi‑Bit Token Merging (MB‑ToMe) to dynamically fuse token features across different bit‑widths, improving robustness during bit‑width switching. Additionally, we propose Multi‑Bit Cascaded Low‑Rank adapters (MB‑CLoRA) to strengthen correlations between bit‑width groups, further improve the overall performance of QuEPT. Extensive experiments demonstrate that QuEPT achieves comparable or better performance to existing state‑of‑the‑art post‑training quantization methods.Our code is available at https://github.com/xuke225/QuEPT
Authors:Haoqing Wang, Xiang Long, Ziheng Li, Yilong Xu, Tingguang Li, Yehui Tang
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) plays a key role in stimulating the explicit reasoning capability of Large Language Models (LLMs). We can achieve expert‑level performance in some specific domains via RLVR, such as coding or math. When a general multi‑domain expert‑level model is required, we need to carefully consider the collaboration of RLVR across different domains. The current state‑of‑the‑art models mainly employ two different training paradigms for multi‑domain RLVR: mixed multi‑task RLVR and separate RLVR followed by model merging. However, most of the works did not provide a detailed comparison and analysis about these paradigms. To this end, we choose multiple commonly used high‑level tasks (e.g., math, coding, science, and instruction following) as our target domains and design extensive qualitative and quantitative experiments using open‑source datasets. We find the RLVR across domains exhibits few mutual interferences, and reasoning‑intensive domains demonstrate mutually synergistic effects. Furthermore, we analyze the internal mechanisms of mutual gains from the perspectives of weight space geometry, model prediction behavior, and information constraints. This project is named as M2RL that means Mixed multi‑task training or separate training followed by model Merging for Reinforcement Learning, and the homepage is at https://github.com/mosAI25/M2RL
Authors:Lorenzo Magnino, Jiacheng Shen, Matthieu Geist, Olivier Pietquin, Mathieu Laurière
Abstract:
The intersection of Mean Field Games (MFGs) and Reinforcement Learning (RL) has fostered a growing family of algorithms designed to solve large‑scale multi‑agent systems. However, the field currently lacks a standardized evaluation protocol, forcing researchers to rely on bespoke, isolated, and often simplistic environments. This fragmentation makes it difficult to assess the robustness, generalization, and failure modes of emerging methods. To address this gap, we propose a comprehensive benchmark suite for MFGs (Bench‑MFG), focusing on the discrete‑time, discrete‑space, stationary setting for the sake of clarity. We introduce a taxonomy of problem classes, ranging from no‑interaction and monotone games to potential and dynamics‑coupled games, and provide prototypical environments for each. Furthermore, we propose MF‑Garnets, a method for generating random MFG instances to facilitate rigorous statistical testing. We benchmark a variety of learning algorithms across these environments, including a novel black‑box approach (MF‑PSO) for exploitability minimization. Based on our extensive empirical results, we propose guidelines to standardize future experimental comparisons. Code available at \hrefhttps://github.com/lorenzomagnino/Bench‑MFGhttps://github.com/lorenzomagnino/Bench‑MFG.
Authors:Milan Gautam, Ning Dai, Tianshuo Zhou, Bowen Xie, David Mathews, Liang Huang
Abstract:
RNA design, the task of finding a sequence that folds into a target secondary structure, has broad biological and biomedical impact but remains computationally challenging due to the exponentially large sequence space and exponentially many competing folds. Traditional approaches treat it as an optimization problem, relying on per‑instance heuristics or constraint‑based search. We instead reframe RNA design as conditional sequence generation and introduce a reusable neural approximator, instantiated as an autoregressive language model (LM), that maps target structures directly to sequences. We first train our model in a supervised setting on random‑induced structure‑sequence pairs, and then use reinforcement learning (RL) to optimize end‑to‑end metrics. We also propose methods to select a small subset for RL that greatly improves RL efficiency and quality. Across four datasets, our approach outperforms state‑of‑the‑art systems on key metrics such as Boltzmann probability while being 1.7x faster, establishing conditional LM generation as a scalable, task‑agnostic alternative to per‑instance optimization for RNA design. Our code and data are available at https://github.com/KuNyaa/RNA‑Design‑LM.
Authors:Renjun Xu, Yang Yan
Abstract:
The transition from monolithic language models to modular, skill‑equipped agents marks a defining shift in how large language models (LLMs) are deployed in practice. Rather than encoding all procedural knowledge within model weights, agent skills ‑‑ composable packages of instructions, code, and resources that agents load on demand ‑‑ enable dynamic capability extension without retraining. It is formalized in a paradigm of progressive disclosure, portable skill definitions, and integration with the Model Context Protocol (MCP). This survey provides a comprehensive treatment of the agent skills landscape, as it has rapidly evolved during the last few months. We organize the field along four axes: (i) architectural foundations, examining the SKILL.md specification, progressive context loading, and the complementary roles of skills and MCP; (ii) skill acquisition, covering reinforcement learning with skill libraries, autonomous skill discovery (SEAgent), and compositional skill synthesis; (iii) deployment at scale, including the computer‑use agent (CUA) stack, GUI grounding advances, and benchmark progress on OSWorld and SWE‑bench; and (iv) security, where recent empirical analyses reveal that 26.1% of community‑contributed skills contain vulnerabilities, motivating our proposed Skill Trust and Lifecycle Governance Framework ‑‑ a four‑tier, gate‑based permission model that maps skill provenance to graduated deployment capabilities. We identify seven open challenges ‑‑ from cross‑platform skill portability to capability‑based permission models ‑‑ and propose a research agenda for realizing trustworthy, self‑improving skill ecosystems. Unlike prior surveys that broadly cover LLM agents or tool use, this work focuses specifically on the emerging skill abstraction layer and its implications for the next generation of agentic systems. Project repo: https://github.com/scienceaix/agentskills
Authors:Ali Subhan, Ashir Raza
Abstract:
DragDiffusion is a diffusion‑based method for interactive point‑based image editing that enables users to manipulate images by directly dragging selected points. The method claims that accurate spatial control can be achieved by optimizing a single diffusion latent at an intermediate timestep, together with identity‑preserving fine‑tuning and spatial regularization. This work presents a reproducibility study of DragDiffusion using the authors' released implementation and the DragBench benchmark. We reproduce the main ablation studies on diffusion timestep selection, LoRA‑based fine‑tuning, mask regularization strength, and UNet feature supervision, and observe close agreement with the qualitative and quantitative trends reported in the original work. At the same time, our experiments show that performance is sensitive to a small number of hyperparameter assumptions, particularly the optimized timestep and the feature level used for motion supervision, while other components admit broader operating ranges. We further evaluate a multi‑timestep latent optimization variant and find that it does not improve spatial accuracy while substantially increasing computational cost. Overall, our findings support the central claims of DragDiffusion while clarifying the conditions under which they are reliably reproducible. Code is available at https://github.com/AliSubhan5341/DragDiffusion‑TMLR‑Reproducibility‑Challenge.
Authors:Siyuan Li, Yunjia Wu, Yiyong Xiao, Pingyang Huang, Peize Li, Ruitong Liu, Yan Wen, Te Sun, Fangyi Pei
Abstract:
Temporal knowledge graph (TKG) forecasting requires predicting future facts by jointly modeling structural dependencies within each snapshot and temporal evolution across snapshots. However, most existing methods are stateless: they recompute entity representations at each timestamp from a limited query window, leading to episodic amnesia and rapid decay of long‑term dependencies. To address this limitation, we propose Entity State Tuning (EST), an encoder‑agnostic framework that endows TKG forecasters with persistent and continuously evolving entity states. EST maintains a global state buffer and progressively aligns structural evidence with sequential signals via a closed‑loop design. Specifically, a topology‑aware state perceiver first injects entity‑state priors into structural encoding. Then, a unified temporal context module aggregates the state‑enhanced events with a pluggable sequence backbone. Subsequently, a dual‑track evolution mechanism writes the updated context back to the global entity state memory, balancing plasticity against stability. Experiments on multiple benchmarks show that EST consistently improves diverse backbones and achieves state‑of‑the‑art performance, highlighting the importance of state persistence for long‑horizon TKG forecasting. The code is published at https://github.com/yuanwuyuan9/Evolving‑Beyond‑Snapshots
Authors:Pepijn Cobben, Xuanqiang Angelo Huang, Thao Amelia Pham, Isabel Dahlgren, Terry Jingchen Zhang, Zhijing Jin
Abstract:
Frontier AI systems are increasingly capable and deployed in high‑stakes multi‑agent environments. However, existing AI safety benchmarks largely evaluate single agents, leaving multi‑agent risks such as coordination failure and conflict poorly understood. We introduce GT‑HarmBench, a benchmark of 2,009 high‑stakes scenarios spanning game‑theoretic structures such as the Prisoner's Dilemma, Stag Hunt and Chicken. Scenarios are drawn from realistic AI risk contexts in the MIT AI Risk Repository. Across 15 frontier models, agents choose socially beneficial actions in only 62% of cases, frequently leading to harmful outcomes. We measure sensitivity to game‑theoretic prompt framing and ordering, and analyze reasoning patterns driving failures. We further show that game‑theoretic interventions improve socially beneficial outcomes by up to 18%. Our results highlight substantial reliability gaps and provide a broad standardized testbed for studying alignment in multi‑agent environments. The benchmark and code are available at https://github.com/causalNLP/gt‑harmbench.
Authors:Zhen Zhang, Kaiqiang Song, Xun Wang, Yebowen Hu, Weixiang Yan, Chenyang Zhao, Henry Peng Zou, Haoyun Deng, Sathish Reddy Indurthi, Shujian Liu, Simin Ma, Xiaoyang Wang, Xin Eric Wang, Song Wang
Abstract:
AI agents are increasingly used to solve real‑world tasks by reasoning over multi‑turn user interactions and invoking external tools. However, applying reinforcement learning to such settings remains difficult: realistic objectives often lack verifiable rewards and instead emphasize open‑ended behaviors; moreover, RL for multi‑turn, multi‑step agentic tool use is still underexplored; and building and maintaining executable tool environments is costly, limiting scale and coverage. We propose CM2, an RL framework that replaces verifiable outcome rewards with checklist rewards. CM2 decomposes each turn's intended behavior into fine‑grained binary criteria with explicit evidence grounding and structured metadata, turning open‑ended judging into more stable classification‑style decisions. To balance stability and informativeness, our method adopts a strategy of sparse reward assignment but dense evaluation criteria. Training is performed in a scalable LLM‑simulated tool environment, avoiding heavy engineering for large tool sets. Experiments show that CM2 consistently improves over supervised fine‑tuning. Starting from an 8B Base model and training on an 8k‑example RL dataset, CM2 improves over the SFT counterpart by 8 points on tau^‑Bench, by 10 points on BFCL‑V4, and by 12 points on ToolSandbox. The results match or even outperform similarly sized open‑source baselines, including the judging model. CM2 thus provides a scalable recipe for optimizing multi‑turn, multi‑step tool‑using agents without relying on verifiable rewards. Code provided by the open‑source community: https://github.com/namezhenzhang/CM2‑RLCR‑Tool‑Agent.
Authors:Nick Ferguson, Josh Pennington, Narek Beghian, Aravind Mohan, Douwe Kiela, Sheshansh Agrawal, Thien Hang Nguyen
Abstract:
Unstructured documents like PDFs contain valuable structured information, but downstream systems require this data in reliable, standardized formats. LLMs are increasingly deployed to automate this extraction, making accuracy and reliability paramount. However, progress is bottlenecked by two gaps. First, no end‑to‑end benchmark evaluates PDF‑to‑JSON extraction under enterprise‑scale schema breadth. Second, no principled methodology captures the semantics of nested extraction, where fields demand different notions of correctness (exact match for identifiers, tolerance for quantities, semantic equivalence for names), arrays require alignment, and omission must be distinguished from hallucination. We address both gaps with ExtractBench, an open‑source benchmark and evaluation framework for PDF‑to‑JSON structured extraction. The benchmark pairs 35 PDF documents with JSON Schemas and human‑annotated gold labels across economically valuable domains, yielding 12,867 evaluatable fields spanning schema complexities from tens to hundreds of fields. The evaluation framework treats the schema as an executable specification: each field declares its scoring metric. Baseline evaluations reveal that frontier models (GPT‑5/5.2, Gemini‑3 Flash/Pro, Claude 4.5 Opus/Sonnet) remain unreliable on realistic schemas. Performance degrades sharply with schema breadth, culminating in 0% valid output on a 369‑field financial reporting schema across all tested models. We release ExtractBench at https://github.com/ContextualAI/extract‑bench.
Authors:Miaosen Zhang, Yishan Liu, Shuxia Lin, Xu Yang, Qi Dai, Chong Luo, Weihao Jiang, Peng Hou, Anxiang Zeng, Xin Geng, Baining Guo
Abstract:
Supervised fine‑tuning (SFT) is computationally efficient but often yields inferior generalization compared to reinforcement learning (RL). This gap is primarily driven by RL's use of on‑policy data. We propose a framework to bridge this chasm by enabling On‑Policy SFT. We first present Distribution Discriminant Theory (DDT), which explains and quantifies the alignment between data and the model‑induced distribution. Leveraging DDT, we introduce two complementary techniques: (i) In‑Distribution Finetuning (IDFT), a loss‑level method to enhance generalization ability of SFT, and (ii) Hinted Decoding, a data‑level technique that can re‑align the training corpus to the model's distribution. Extensive experiments demonstrate that our framework achieves generalization performance on par with prominent offline RL algorithms, including DPO and SimPO, while maintaining the efficiency of an SFT pipeline. The proposed framework thus offers a practical alternative in domains where RL is infeasible. We open‑source the code here: https://github.com/zhangmiaosen2000/Towards‑On‑Policy‑SFT
Authors:Chengxi Zeng, Yuxuan Jiang, Ge Gao, Shuai Wang, Duolikun Danier, Bin Zhu, Stevan Rudinac, David Bull, Fan Zhang
Abstract:
Vision‑language segmentation models such as SAM3 enable flexible, prompt‑driven visual grounding, but inherit large, general‑purpose text encoders originally designed for open‑ended language understanding. In practice, segmentation prompts are short, structured, and semantically constrained, leading to substantial over‑provisioning in text encoder capacity and persistent computational and memory overhead. In this paper, we perform a large‑scale anatomical analysis of text prompting in vision‑language segmentation, covering 404,796 real prompts across multiple benchmarks. Our analysis reveals severe redundancy: most context windows are underutilized, vocabulary usage is highly sparse, and text embeddings lie on low‑dimensional manifold despite high‑dimensional representations. Motivated by these findings, we propose SAM3‑LiteText, a lightweight text encoding framework that replaces the original SAM3 text encoder with a compact MobileCLIP student that is optimized by knowledge distillation. Extensive experiments on image and video segmentation benchmarks show that SAM3‑LiteText reduces text encoder parameters by up to 88%, substantially reducing static memory footprint, while maintaining segmentation performance comparable to the original model. Code: https://github.com/SimonZeng7108/efficientsam3/tree/sam3_litetext.
Authors:Greg Coppola
Abstract:
In previous work (Coppola, 2024) we introduced the Quantified Boolean Bayesian Network (QBBN), a logical graphical model that implements the forward fragment of natural deduction (Prawitz, 1965) as a probabilistic factor graph. That work left two gaps: no negation/backward reasoning, and no parser for natural language. This paper addresses both gaps across inference, semantics, and syntax. For inference, we extend the QBBN with NEG factors enforcing P(x) + P(neg x) = 1, enabling contrapositive reasoning (modus tollens) via backward lambda messages, completing Prawitz's simple elimination rules. The engine handles 44/44 test cases spanning 22 reasoning patterns. For semantics, we present a typed logical language with role‑labeled predicates, modal quantifiers, and three tiers of expressiveness following Prawitz: first‑order quantification, propositions as arguments, and predicate quantification via lambda abstraction. For syntax, we present a typed slot grammar that deterministically compiles sentences to logical form (33/33 correct, zero ambiguity). LLMs handle disambiguation (95% PP attachment accuracy) but cannot produce structured parses directly (12.4% UAS), confirming grammars are necessary. The architecture: LLM preprocesses, grammar parses, LLM reranks, QBBN infers. We argue this reconciles formal semantics with Sutton's "bitter lesson" (2019): LLMs eliminate the annotation bottleneck that killed formal NLP, serving as annotator while the QBBN serves as verifier. Code: https://github.com/gregorycoppola/world
Authors:Xiaohan He, Shiyang Feng, Songtao Huang, Lei Bai, Bin Wang, Bo Zhang
Abstract:
Large language models (LLMs) have demonstrated exceptional reasoning capabilities, and co‑evolving paradigms have shown promising results in domains such as code and math. However, in scientific reasoning tasks, these models remain fragile due to unreliable solution evaluation and limited diversity in verification strategies. In this work, we propose Sci‑CoE, a two‑stage scientific co‑evolving framework that enables models to self‑evolve as both solver and verifier through a transition from sparse supervision to unsupervised learning. In the first stage, the model uses a small set of annotated data to establish fundamental correctness judgment anchors for the Verifier. In the second stage, we introduce a geometric reward mechanism that jointly considers consensus, reliability, and diversity, driving large‑scale self‑iteration on unlabeled data. Experiments on several general scientific benchmarks demonstrate that Sci‑CoE enhances complex reasoning capabilities and exhibits strong scalability, facilitating the construction of more robust and diverse evaluation systems. Codes are available at https://github.com/InternScience/Sci‑CoE.
Authors:Wancai Zheng, Hao Chen, Xianlong Lu, Linlin Ou, Xinyi Yu
Abstract:
Object navigation is a core capability of embodied intelligence, enabling an agent to locate target objects in unknown environments. Recent advances in vision‑language models (VLMs) have facilitated zero‑shot object navigation (ZSON). However, existing methods often rely on scene abstractions that convert environments into semantic maps or textual representations, causing high‑level decision making to be constrained by the accuracy of low‑level perception. In this work, we present 3DGSNav, a novel ZSON framework that embeds 3D Gaussian Splatting (3DGS) as persistent memory for VLMs to enhance spatial reasoning. Through active perception, 3DGSNav incrementally constructs a 3DGS representation of the environment, enabling trajectory‑guided free‑viewpoint rendering of frontier‑aware first‑person views. Moreover, we design structured visual prompts and integrate them with Chain‑of‑Thought (CoT) prompting to further improve VLM reasoning. During navigation, a real‑time object detector filters potential targets, while VLM‑driven active viewpoint switching performs target re‑verification, ensuring efficient and reliable recognition. Extensive evaluations across multiple benchmarks and real‑world experiments on a quadruped robot demonstrate that our method achieves robust and competitive performance against state‑of‑the‑art approaches.The Project Page:https://aczheng‑cai.github.io/3dgsnav.github.io/
Authors:Sicheng Feng, Zigeng Chen, Xinyin Ma, Gongfan Fang, Xinchao Wang
Abstract:
Diffusion Large Language Models (dLLMs) represent a new paradigm beyond autoregressive modeling, offering competitive performance while naturally enabling a flexible decoding process. Specifically, dLLMs can generate tokens at arbitrary positions in parallel, endowing them with significant potential for parallel test‑time scaling, which was previously constrained by severe inefficiency in autoregressive modeling. In this work, we introduce dVoting, a fast voting technique that boosts reasoning capability without training, with only an acceptable extra computational overhead. dVoting is motivated by the observation that, across multiple samples for the same prompt, token predictions remain largely consistent, whereas performance is determined by a small subset of tokens exhibiting cross‑sample variability. Leveraging the arbitrary‑position generation capability of dLLMs, dVoting performs iterative refinement by sampling, identifying uncertain tokens via consistency analysis, regenerating them through voting, and repeating this process until convergence. Extensive evaluations demonstrate that dVoting consistently improves performance across various benchmarks. It achieves gains of 6.22%‑7.66% on GSM8K, 4.40%‑7.20% on MATH500, 3.16%‑14.84% on ARC‑C, and 4.83%‑5.74% on MMLU. Our code is available at https://github.com/fscdc/dVoting
Authors:Xiaoxiao Wang, Chunxiao Li, Junying Wang, Yijin Guo, Zijian Chen, Chunyi Li, Xiaohong Liu, Zicheng Zhang, Guangtao Zhai
Abstract:
As comprehensive large model evaluation becomes prohibitively expensive, predicting model performance from limited observations has become essential. However, existing statistical methods struggle with pattern shifts, data sparsity, and lack of explanation, while pure LLM methods remain unreliable. We propose STAR, a framework that bridges data‑driven STatistical expectations with knowledge‑driven Agentic Reasoning. STAR leverages specialized retrievers to gather external knowledge and embeds semantic features into Constrained Probabilistic Matrix Factorization (CPMF) to generate statistical expectations with uncertainty. A reasoning module guided by Expectation Violation Theory (EVT) then refines predictions through intra‑family analysis, cross‑model comparison, and credibility‑aware aggregation, producing adjustments with traceable explanations. Extensive experiments show that STAR consistently outperforms all baselines on both score‑based and rank‑based metrics, delivering a 14.46% gain in total score over the strongest statistical method under extreme sparsity, with only 1‑‑2 observed scores per test model.
Authors:Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang, Yankai Lin
Abstract:
On‑policy distillation (OPD), which aligns the student with the teacher's logit distribution on student‑generated trajectories, has demonstrated strong empirical gains in improving student performance and often outperforms off‑policy distillation and reinforcement learning (RL) paradigms. In this work, we first theoretically show that OPD is a special case of dense KL‑constrained RL where the reward function and the KL regularization are always weighted equally and the reference model can by any model. Then, we propose the Generalized On‑Policy Distillation (G‑OPD) framework, which extends the standard OPD objective by introducing a flexible reference model and a reward scaling factor that controls the relative weight of the reward term against the KL regularization. Through comprehensive experiments on math reasoning and code generation tasks, we derive two novel insights: (1) Setting the reward scaling factor to be greater than 1 (i.e., reward extrapolation), which we term ExOPD, consistently improves over standard OPD across a range of teacher‑student size pairings. In particular, in the setting where we merge the knowledge from different domain experts, obtained by applying domain‑specific RL to the same student model, back into the original student, ExOPD enables the student to even surpass the teacher's performance boundary and outperform the domain teachers. (2) Building on ExOPD, we further find that in the strong‑to‑weak distillation setting (i.e., distilling a smaller student from a larger teacher), performing reward correction by choosing the reference model as the teacher's base model before RL yields a more accurate reward signal and further improves distillation performance. However, this choice assumes access to the teacher's pre‑RL variant and incurs more computational overhead. We hope our work offers new insights for future research on OPD.
Authors:Jiakang Shen, Qinghui Chen, Runtong Wang, Chenrui Xu, Jinglin Zhang, Cong Bai, Feng Zhang
Abstract:
Tropical cyclones (TC) are among the most destructive natural disasters, causing catastrophic damage to coastal regions through extreme winds, heavy rainfall, and storm surges. Timely monitoring of tropical cyclones is crucial for reducing loss of life and property, yet it is hindered by the computational inefficiency and high parameter counts of existing methods on resource‑constrained edge devices. Current physics‑guided models suffer from linear feature interactions that fail to capture high‑order polynomial relationships between TC attributes, leading to inflated model sizes and hardware incompatibility. To overcome these challenges, this study introduces the Kolmogorov‑Arnold Network‑based Feature Interaction Framework (KAN‑FIF), a lightweight multimodal architecture that integrates MLP and CNN layers with spline‑parameterized KAN layers. For Maximum Sustained Wind (MSW) prediction, experiments demonstrate that the KAN‑FIF framework achieves a 94.8% reduction in parameters (0.99MB vs 19MB) and 68.7% faster inference per sample (2.3ms vs 7.35ms) compared to baseline model Phy‑CoCo, while maintaining superior accuracy with 32.5% lower MAE. The offline deployment experiment of the FY‑4 series meteorological satellite processor on the Qingyun‑1000 development board achieved a 14.41ms per‑sample inference latency with the KAN‑FIF framework, demonstrating promising feasibility for operational TC monitoring and extending deployability to edge‑device AI applications. The code is released at https://github.com/Jinglin‑Zhang/KAN‑FIF.
Authors:Zewei Yu, Lirong Gao, Yuke Zhu, Bo Zheng, Sheng Guo, Haobo Wang, Junbo Zhao
Abstract:
Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex reasoning tasks by employing test‑time scaling. However, they often generate over‑long chains‑of‑thought that, driven by substantial reflections such as repetitive self‑questioning and circular reasoning, lead to high token consumption, substantial computational overhead, and increased latency without improving accuracy, particularly in smaller models. Our observation reveals that increasing problem complexity induces more excessive and unnecessary reflection, which in turn reduces accuracy and increases token overhead. To address this challenge, we propose Adaptive Reflection and Length Coordinated Penalty (ARLCP), a novel reinforcement learning framework designed to dynamically balance reasoning efficiency and solution accuracy. ARLCP introduces two key innovations: (1) a reflection penalty that adaptively curtails unnecessary reflective steps while preserving essential reasoning, and (2) a length penalty calibrated to the estimated complexity of the problem. By coordinating these penalties, ARLCP encourages the model to generate more concise and effective reasoning paths. We evaluate our method on five mathematical reasoning benchmarks using DeepSeek‑R1‑Distill‑Qwen‑1.5B and DeepSeek‑R1‑Distill‑Qwen‑7B models. Experimental results show that ARLCP achieves a superior efficiency‑accuracy trade‑off compared to existing approaches. For the 1.5B model, it reduces the average response length by 53.1% while simultaneously improving accuracy by 5.8%. For the 7B model, it achieves a 35.0% reduction in length with a 2.7% accuracy gain. The code is released at https://github.com/ZeweiYu1/ARLCP .
Authors:Itamar Mishani, Maxim Likhachev
Abstract:
Efficient motion planning for high‑dimensional robotic systems, such as manipulators and mobile manipulators, is critical for real‑time operation and reliable deployment. Although advances in planning algorithms have enhanced scalability to high‑dimensional state spaces, these improvements often come at the cost of generating unpredictable, inconsistent motions or requiring excessive computational resources and memory. In this work, we introduce Multi‑Graph Search (MGS), a search‑based motion planning algorithm that generalizes classical unidirectional and bidirectional search to a multi‑graph setting. MGS maintains and incrementally expands multiple implicit graphs over the state space, focusing exploration on high‑potential regions while allowing initially disconnected subgraphs to be merged through feasible transitions as the search progresses. We prove that MGS is complete and bounded‑suboptimal, and empirically demonstrate its effectiveness on a range of manipulation and mobile manipulation tasks. Demonstrations, benchmarks and code are available at https://multi‑graph‑search.github.io/.
Authors:Xinyu Yang, Chenlong Deng, Tongyu Wen, Binyu Xie, Zhicheng Dou
Abstract:
Legal reasoning requires not only correct outcomes but also procedurally compliant reasoning processes. However, existing methods lack mechanisms to verify intermediate reasoning steps, allowing errors such as inapplicable statute citations to propagate undetected through the reasoning chain. To address this, we propose LawThinker, an autonomous legal research agent that adopts an Explore‑Verify‑Memorize strategy for dynamic judicial environments. The core idea is to enforce verification as an atomic operation after every knowledge exploration step. A DeepVerifier module examines each retrieval result along three dimensions of knowledge accuracy, fact‑law relevance, and procedural compliance, with a memory module for cross‑round knowledge reuse in long‑horizon tasks. Experiments on the dynamic benchmark J1‑EVAL show that LawThinker achieves a 24% improvement over direct reasoning and an 11% gain over workflow‑based methods, with particularly strong improvements on process‑oriented metrics. Evaluations on three static benchmarks further confirm its generalization capability. The code is available at https://github.com/yxy‑919/LawThinker‑agent .
Authors:Haojun Chen, Zili Zou, Chengdong Ma, Yaoxiang Pu, Haotong Zhang, Yuanpei Chen, Yaodong Yang
Abstract:
Reinforcement Learning (RL) offers a powerful paradigm for autonomous robots to master generalist manipulation skills through trial‑and‑error. However, its real‑world application is stifled by severe sample inefficiency. Recent Human‑in‑the‑Loop (HIL) methods accelerate training by using human corrections, yet this approach faces a scalability barrier. Reliance on human supervisors imposes a 1:1 supervision ratio that limits fleet expansion, suffers from operator fatigue over extended sessions, and introduces high variance due to inconsistent human proficiency. We present Agent‑guided Policy Search (AGPS), a framework that automates the training pipeline by replacing human supervisors with a multimodal agent. Our key insight is that the agent can be viewed as a semantic world model, injecting intrinsic value priors to structure physical exploration. By using executable tools, the agent provides precise guidance via corrective waypoints and spatial constraints for exploration pruning. We validate our approach on two tasks, ranging from precision insertion to deformable object manipulation. Results demonstrate that AGPS outperforms HIL methods in sample efficiency. This automates the supervision pipeline, unlocking the path to labor‑free and scalable robot learning. Project website: https://agps‑rl.github.io/agps.
Authors:Benjamin Clavié, Atoof Shakir, Jonah Turner, Sean Lee, Aamir Shakir, Makoto P. Kato
Abstract:
Multimodal Information Retrieval has made significant progress in recent years, leveraging the increasingly strong multimodal abilities of deep pre‑trained models to represent information across modalities. Music Information Retrieval (MIR), in particular, has considerably increased in quality, with neural representations of music even making its way into everyday life products. However, there is a lack of high‑quality benchmarks for evaluating music retrieval performance. To address this issue, we introduce IncompeBench, a carefully annotated benchmark comprising 1,574 permissively licensed, high‑quality music snippets, 500 diverse queries, and over 125,000 individual relevance judgements. These annotations were created through the use of a multi‑stage pipeline, resulting in high agreement between human annotators and the generated data. The resulting datasets are publicly available at https://huggingface.co/datasets/mixedbread‑ai/incompebench‑strict and https://huggingface.co/datasets/mixedbread‑ai/incompebench‑lenient with the prompts available at https://github.com/mixedbread‑ai/incompebench‑programs.
Authors:Pretam Ray, Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum
Abstract:
Evolutionary agentic systems intensify the trade‑off between computational efficiency and reasoning capability by repeatedly invoking large language models (LLMs) during inference. This setting raises a central question: how can an agent dynamically select an LLM that is sufficiently capable for the current generation step while remaining computationally efficient? While model cascades offer a practical mechanism for balancing this trade‑off, existing routing strategies typically rely on static heuristics or external controllers and do not explicitly account for model uncertainty. We introduce AdaptEvolve: Adaptive LLM Selection for Multi‑LLM Evolutionary Refinement within an evolutionary sequential refinement framework that leverages intrinsic generation confidence to estimate real‑time solvability. Empirical results show that confidence‑driven selection yields a favourable Pareto frontier, reducing total inference cost by an average of 37.9% across benchmarks while retaining 97.5% of the upper‑bound accuracy of static large‑model baselines. Our code is available at https://github.com/raypretam/adaptive_llm_selection.
Authors:Taian Guo, Haiyang Shen, Junyu Luo, Zhongshi Xing, Hanchun Lian, Jinsheng Huang, Binqi Chen, Luchen Liu, Yun Ma, Ming Zhang
Abstract:
LLMs have demonstrated significant potential in quantitative finance by processing vast unstructured data to emulate human‑like analytical workflows. However, current LLM‑based methods primarily follow either an Asset‑Centric paradigm focused on individual stock prediction or a Market‑Centric approach for portfolio allocation, often remaining agnostic to the underlying reasoning that drives market movements. In this paper, we propose a Logic‑Oriented perspective, modeling the financial market as a dynamic, evolutionary ecosystem of competing investment narratives, termed Modes of Thought. To operationalize this view, we introduce MEME (Modeling the Evolutionary Modes of Financial Markets), designed to reconstruct market dynamics through the lens of evolving logics. MEME employs a multi‑agent extraction module to transform noisy data into high‑fidelity Investment Arguments and utilizes Gaussian Mixture Modeling to uncover latent consensus within a semantic space. To model semantic drift among different market conditions, we also implement a temporal evaluation and alignment mechanism to track the lifecycle and historical profitability of these modes. By prioritizing enduring market wisdom over transient anomalies, MEME ensures that portfolio construction is guided by robust reasoning. Extensive experiments on three heterogeneous Chinese stock pools from 2023 to 2025 demonstrate that MEME consistently outperforms seven SOTA baselines. Further ablation studies, sensitivity analysis, lifecycle case study and cost analysis validate MEME's capacity to identify and adapt to the evolving consensus of financial markets. Our implementation can be found at https://github.com/gta0804/MEME.
Authors:Taian Guo, Haiyang Shen, Junyu Luo, Binqi Chen, Hongjun Ding, Jinsheng Huang, Luchen Liu, Yun Ma, Ming Zhang
Abstract:
Extracting signals through alpha factor mining is a fundamental challenge in quantitative finance. Existing automated methods primarily follow two paradigms: Decoupled Factor Generation, which treats factor discovery as isolated events, and Iterative Factor Evolution, which focuses on local parent‑child refinements. However, both paradigms lack a global structural view, often treating factor pools as unstructured collections or fragmented chains, which leads to redundant search and limited diversity. To address these limitations, we introduce AlphaPROBE (Alpha Mining via Principled Retrieval and On‑graph Biased Evolution), a framework that reframes alpha mining as the strategic navigation of a Directed Acyclic Graph (DAG). By modeling factors as nodes and evolutionary links as edges, AlphaPROBE treats the factor pool as a dynamic, interconnected ecosystem. The framework consists of two core components: a Bayesian Factor Retriever that identifies high‑potential seeds by balancing exploitation and exploration through a posterior probability model, and a DAG‑aware Factor Generator that leverages the full ancestral trace of factors to produce context‑aware, nonredundant optimizations. Extensive experiments on three major Chinese stock market datasets against 8 competitive baselines demonstrate that AlphaPROBE significantly gains enhanced performance in predictive accuracy, return stability and training efficiency. Our results confirm that leveraging global evolutionary topology is essential for efficient and robust automated alpha discovery. We have open‑sourced our implementation at https://github.com/gta0804/AlphaPROBE.
Authors:Suraj Ranganath, Anish Patnaik, Vaishak Menon
Abstract:
Efficient spatial reasoning requires world models that remain reliable under tight precision budgets. We study whether low‑bit planning behavior is determined mostly by total bitwidth or by where bits are allocated across modules. Using DINO‑WM on the Wall planning task, we run a paired‑goal mixed‑bit evaluation across uniform, mixed, asymmetric, and layerwise variants under two planner budgets. We observe a consistent three‑regime pattern: 8‑bit and 6‑bit settings remain close to FP16, 3‑bit settings collapse, and 4‑bit settings are allocation‑sensitive. In that transition region, preserving encoder precision improves planning relative to uniform quantization, and near‑size asymmetric variants show the same encoder‑side direction. In a later strict 22‑cell replication with smaller per‑cell episode count, the mixed‑versus‑uniform INT4 sign becomes budget‑conditioned, which further highlights the sensitivity of this transition regime. These findings motivate module‑aware, budget‑aware quantization policies as a broader research direction for efficient spatial reasoning. Code and run artifacts are available at https://github.com/suraj‑ranganath/DINO‑MBQuant.
Authors:Wanxing Wu, He Zhu, Yixia Li, Lei Yang, Jiehui Zhao, Hongru Wang, Jian Yang, Benyou Wang, Bingyi Jing, Guanhua Chen
Abstract:
Large language models (LLMs) have achieved success, but cost and privacy constraints necessitate deploying smaller models locally while offloading complex queries to cloud‑based models. Existing router evaluations are unsystematic, overlooking scenario‑specific requirements and out‑of‑distribution robustness. We propose RouterXBench, a principled evaluation framework with three dimensions: router ability, scenario alignment, and cross‑domain robustness. Unlike prior work that relies on output probabilities or external embeddings, we utilize internal hidden states that capture model uncertainty before answer generation. We introduce ProbeDirichlet, a lightweight router that aggregates cross‑layer hidden states via learnable Dirichlet distributions with probabilistic training. Trained on multi‑domain data, it generalizes robustly across in‑domain and out‑of‑distribution scenarios. Our results show ProbeDirichlet achieves 16.68% and 18.86% relative improvements over the best baselines in router ability and high‑accuracy scenarios, with consistent performance across model families, model scales, heterogeneous tasks, and agentic workflows.
Authors:Lai Wei, Liangbo He, Jun Lan, Lingzhong Dong, Yutong Cai, Siyuan Li, Huijia Zhu, Weiqiang Wang, Linghe Kong, Yue Wang, Zhuosheng Zhang, Weiran Huang
Abstract:
Multimodal Large Language Models (MLLMs) excel at broad visual understanding but still struggle with fine‑grained perception, where decisive evidence is small and easily overwhelmed by global context. Recent "Thinking‑with‑Images" methods alleviate this by iteratively zooming in and out regions of interest during inference, but incur high latency due to repeated tool calls and visual re‑encoding. To address this, we propose Region‑to‑Image Distillation, which transforms zooming from an inference‑time tool into a training‑time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM. In particular, we first zoom in to micro‑cropped regions to let strong teacher models generate high‑quality VQA data, and then distill this region‑grounded supervision back to the full image. After training on such data, the smaller student model improves "single‑glance" fine‑grained perception without tool use. To rigorously evaluate this capability, we further present ZoomBench, a hybrid‑annotated benchmark of 845 VQA data spanning six fine‑grained perceptual dimensions, together with a dual‑view protocol that quantifies the global‑‑regional "zooming gap". Experiments show that our models achieve leading performance across multiple fine‑grained perception benchmarks, and also improve general multimodal cognition on benchmarks such as visual reasoning and GUI agents. We further discuss when "Thinking‑with‑Images" is necessary versus when its gains can be distilled into a single forward pass. Our code is available at https://github.com/inclusionAI/Zooming‑without‑Zooming.
Authors:Lingyong Yan, Jiulong Wu, Dong Xie, Weixian Shi, Deguo Xia, Jizhou Huang
Abstract:
Although recent end‑to‑end video generation models demonstrate impressive performance in visually oriented content creation, they remain limited in scenarios that require strict logical rigor and precise knowledge representation, such as instructional and educational media. To address this problem, we propose LAVES, a hierarchical LLM‑based multi‑agent system for generating high‑quality instructional videos from educational problems. The LAVES formulates educational video generation as a multi‑objective task that simultaneously demands correct step‑by‑step reasoning, pedagogically coherent narration, semantically faithful visual demonstrations, and precise audio‑‑visual alignment. To address the limitations of prior approaches‑‑including low procedural fidelity, high production cost, and limited controllability‑‑LAVES decomposes the generation workflow into specialized agents coordinated by a central Orchestrating Agent with explicit quality gates and iterative critique mechanisms. Specifically, the Orchestrating Agent supervises a Solution Agent for rigorous problem solving, an Illustration Agent that produces executable visualization codes, and a Narration Agent for learner‑oriented instructional scripts. In addition, all outputs from the working agents are subject to semantic critique, rule‑based constraints, and tool‑based compilation checks. Rather than directly synthesizing pixels, the system constructs a structured executable video script that is deterministically compiled into synchronized visuals and narration using template‑driven assembly rules, enabling fully automated end‑to‑end production without manual editing. In large‑scale deployments, LAVES achieves a throughput exceeding one million videos per day, delivering over a 95% reduction in cost compared to current industry‑standard approaches while maintaining a high acceptance rate.
Authors:Zehao Xia, Yiqun Wang, Zhengda Lu, Kai Liu, Jun Xiao, Peter Wonka
Abstract:
Creating high‑fidelity, animatable 3D avatars from a single image remains a formidable challenge. We identified three desirable attributes of avatar generation: 1) the method should be feed‑forward, 2) model a 360° full‑head, and 3) should be animation‑ready. However, current work addresses only two of the three points simultaneously. To address these limitations, we propose OMEGA‑Avatar, the first feed‑forward framework that simultaneously generates a generalizable, 360°‑complete, and animatable 3D Gaussian head from a single image. Starting from a feed‑forward and animatable framework, we address the 360° full‑head avatar generation problem with two novel components. First, to overcome poor hair modeling in full‑head avatar generation, we introduce a semantic‑aware mesh deformation module that integrates multi‑view normals to optimize a FLAME head with hair while preserving its topology structure. Second, to enable effective feed‑forward decoding of full‑head features, we propose a multi‑view feature splatting module that constructs a shared canonical UV representation from features across multiple views through differentiable bilinear splatting, hierarchical UV mapping, and visibility‑aware fusion. This approach preserves both global structural coherence and local high‑frequency details across all viewpoints, ensuring 360° consistency without per‑instance optimization. Extensive experiments demonstrate that OMEGA‑Avatar achieves state‑of‑the‑art performance, significantly outperforming existing baselines in 360° full‑head completeness while robustly preserving identity across different viewpoints.
Authors:Sahand Sabour, TszYam NG, Minlie Huang
Abstract:
As Large Language Models increasingly power role‑playing applications, simulating patients has become a valuable tool for training counselors and scaling therapeutic assessment. However, prior work is fragmented: existing approaches rely on incompatible, non‑standardized data formats, prompts, and evaluation metrics, hindering reproducibility and fair comparison. In this paper, we introduce PatientHub, a unified and modular framework that standardizes the definition, composition, and deployment of simulated patients. To demonstrate PatientHub's utility, we implement several representative patient simulation methods as case studies, showcasing how our framework supports standardized cross‑method evaluation and the seamless integration of custom evaluation metrics. We further demonstrate PatientHub's extensibility by prototyping two new simulator variants, highlighting how PatientHub accelerates method development by eliminating infrastructure overhead. By consolidating existing work into a single reproducible pipeline, PatientHub lowers the barrier to developing new simulation methods and facilitates cross‑method and cross‑model benchmarking. Our framework provides a practical foundation for future datasets, methods, and benchmarks in patient‑centered dialogue, and the code is publicly available via https://github.com/Sahandfer/PatientHub.
Authors:Chengwei Ma, Zhen Tian, Zhou Zhou, Zhixian Xu, Xiaowei Zhu, Xia Hua, Si Shi, F. Richard Yu
Abstract:
Multimodal Large Language Models (MLLMs) have shown remarkable progress in visual understanding, yet they suffer from a critical limitation: structural blindness. Even state‑of‑the‑art models fail to capture topology and symbolic logic in engineering schematics, as their pixel‑driven paradigm discards the explicit vector‑defined relations needed for reasoning. To overcome this, we propose a Vector‑to‑Graph (V2G) pipeline that converts CAD diagrams into property graphs where nodes represent components and edges encode connectivity, making structural dependencies explicit and machine‑auditable. On a diagnostic benchmark of electrical compliance checks, V2G yields large accuracy gains across all error categories, while leading MLLMs remain near chance level. These results highlight the systemic inadequacy of pixel‑based methods and demonstrate that structure‑aware representations provide a reliable path toward practical deployment of multimodal AI in engineering domains. To facilitate further research, we release our benchmark and implementation at https://github.com/gm‑embodied/V2G‑Audit.
Authors:Longyuan Zhu, Hairan Hua, Linlin Miao, Bing Zhao
Abstract:
Large Language Models (LLMs) are advancing rapidly, yet the benchmarks used to measure this progress are becoming increasingly unreliable. Score inflation and selective reporting have eroded the authority of standard benchmarks, leaving the community uncertain about which evaluation results remain trustworthy. We introduce the Benchmark Health Index (BHI), a pure data‑driven framework for auditing evaluation sets along three orthogonal and complementary axes: (1) Capability Discrimination, measuring how sharply a benchmark separates model performance beyond noise; (2) Anti‑Saturation, estimating remaining headroom before ceiling effects erode resolution and thus the benchmark's expected longevity; and (3) Impact, quantifying influence across academic and industrial ecosystems via adoption breadth and practice‑shaping power. By distilling 106 validated benchmarks from the technical reports of 91 representative models in 2025, we systematically characterize the evaluation landscape. BHI is the first framework to quantify benchmark health at a macro level, providing a principled basis for benchmark selection and enabling dynamic lifecycle management for next‑generation evaluation protocols.
Authors:Yufeng Tian, Shuiqi Cheng, Tianming Wei, Tianxing Zhou, Yuanhang Zhang, Zixian Liu, Qianwei Han, Zhecheng Yuan, Huazhe Xu
Abstract:
Tactile information plays a crucial role in human manipulation tasks and has recently garnered increasing attention in robotic manipulation. However, existing approaches mostly focus on the alignment of visual and tactile features and the integration mechanism tends to be direct concatenation. Consequently, they struggle to effectively cope with occluded scenarios due to neglecting the inherent complementary nature of both modalities and the alignment may not be exploited enough, limiting the potential of their real‑world deployment. In this paper, we present ViTaS, a simple yet effective framework that incorporates both visual and tactile information to guide the behavior of an agent. We introduce Soft Fusion Contrastive Learning, an advanced version of conventional contrastive learning method and a CVAE module to utilize the alignment and complementarity within visuo‑tactile representations. We demonstrate the effectiveness of our method in 12 simulated and 3 real‑world environments, and our experiments show that ViTaS significantly outperforms existing baselines. Project page: https://skyrainwind.github.io/ViTaS/index.html.
Authors:Changti Wu, Jiahuai Mao, Yuzhuo Miao, Shijie Lian, Bin Yu, Xiaopeng Lin, Cong Huang, Lei Zhang, Kai Chen
Abstract:
Large‑scale Visual Instruction Tuning (VIT) has become a key paradigm for advancing the performance of vision‑language models (VLMs) across various multimodal tasks. However, training on the large‑scale datasets is computationally expensive and inefficient due to redundancy in the data, which motivates the need for multimodal data selection to improve training efficiency. Existing data selection methods for VIT either require costly training or gradient computation. Training‑free alternatives often depend on proxy models or datasets, instruction‑agnostic representations, and pairwise similarity with quadratic complexity, limiting scalability and representation fidelity. In this work, we propose ScalSelect, a scalable training‑free multimodal data selection method with linear‑time complexity with respect to the number of samples, eliminating the need for external models or auxiliary datasets. ScalSelect first constructs sample representations by extracting visual features most attended by instruction tokens in the target VLM, capturing instruction‑relevant information. It then identifies samples whose representations best approximate the dominant subspace of the full dataset representations, enabling scalable importance scoring without pairwise comparisons. Extensive experiments across multiple VLMs, datasets, and selection budgets demonstrate that ScalSelect achieves over 97.5% of the performance of training on the full dataset using only 16% of the data, and even outperforms full‑data training in some settings. The code is available at \hrefhttps://github.com/ChangtiWu/ScalSelectScalSelect.
Authors:Yiming Gao, Zhen Wang, Jefferson Chen, Mark Antkowiak, Mengzhou Hu, JungHo Kong, Dexter Pratt, Jieyuan Liu, Enze Ma, Zhiting Hu, Eric P. Xing
Abstract:
We present scPilot, the first systematic framework to practice omics‑native reasoning: a large language model (LLM) converses in natural language while directly inspecting single‑cell RNA‑seq data and on‑demand bioinformatics tools. scPilot converts core single‑cell analyses, i.e., cell‑type annotation, developmental‑trajectory reconstruction, and transcription‑factor targeting, into step‑by‑step reasoning problems that the model must solve, justify, and, when needed, revise with new evidence. To measure progress, we release scBench, a suite of 9 expertly curated datasets and graders that faithfully evaluate the omics‑native reasoning capability of scPilot w.r.t various LLMs. Experiments with o1 show that iterative omics‑native reasoning lifts average accuracy by 11% for cell‑type annotation and Gemini‑2.5‑Pro cuts trajectory graph‑edit distance by 30% versus one‑shot prompting, while generating transparent reasoning traces explain marker gene ambiguity and regulatory logic. By grounding LLMs in raw omics data, scPilot enables auditable, interpretable, and diagnostically informative single‑cell analyses. Code, data, and package are available at https://github.com/maitrix‑org/scPilot
Authors:Zedong Chu, Shichao Xie, Xiaolong Wu, Yanfen Shen, Minghua Luo, Zhengbo Wang, Fei Liu, Xiaoxu Leng, Junjun Hu, Mingyang Yin, Jia Lu, Yingnan Guo, Kai Yang, Jiawei Han, Xu Chen, Yanqing Zhu, Yuxiang Zhao, Xin Liu, Yirong Yang, Ye He, Jiahang Wang, Yang Cai, Tianlin Zhang, Li Gao, Liu Liu, Mingchao Sun, Fan Jiang, Chiyu Wang, Zhicheng Liu, Hongyu Pan, Honglin Han, Zhining Gu, Kuan Yang, Jianfang Zhang, Di Jing, Zihao Guan, Wei Guo, Guoqing Liu, Di Yang, Xiangpo Yang, Menglin Yang, Hongguang Xing, Weiguo Li, Mu Xu
Abstract:
Embodied navigation has long been fragmented by task‑specific architectures. We introduce ABot‑N0, a unified Vision‑Language‑Action (VLA) foundation model that achieves a ``Grand Unification'' across 5 core tasks: Point‑Goal, Object‑Goal, Instruction‑Following, POI‑Goal, and Person‑Following. ABot‑N0 utilizes a hierarchical ``Brain‑Action'' architecture, pairing an LLM‑based Cognitive Brain for semantic reasoning with a Flow Matching‑based Action Expert for precise, continuous trajectory generation. To support large‑scale learning, we developed the ABot‑N0 Data Engine, curating 16.9M expert trajectories and 5.0M reasoning samples across 7,802 high‑fidelity 3D scenes (10.7 \textkm^2). ABot‑N0 achieves new SOTA performance across 7 benchmarks, significantly outperforming specialized models. Furthermore, our Agentic Navigation System integrates a planner with hierarchical topological memory, enabling robust, long‑horizon missions in dynamic real‑world environments.
Authors:Seungyeon Yoo, Youngseok Jang, Dabin Kim, Youngsoo Han, Seungwoo Jung, H. Jin Kim
Abstract:
Visual navigation models often struggle in real‑world dynamic environments due to limited robustness to the sim‑to‑real gap and the difficulty of training policies tailored to target deployment environments (e.g., households, restaurants, and factories). Although real‑to‑sim navigation simulation using 3D Gaussian Splatting (GS) can mitigate these challenges, prior GS‑based works have considered only static scenes or non‑photorealistic human obstacles built from simulator assets, despite the importance of safe navigation in dynamic environments. To address these issues, we propose ReaDy‑Go, a novel real‑to‑sim simulation pipeline that synthesizes photorealistic dynamic scenarios in target environments by augmenting a reconstructed static GS scene with dynamic human GS obstacles, and trains navigation policies using the generated datasets. The pipeline provides three key contributions: (1) a dynamic GS simulator that integrates static scene GS with a human animation module, enabling the insertion of animatable human GS avatars and the synthesis of plausible human motions from 2D trajectories, (2) a navigation dataset generation framework that leverages the simulator along with a robot expert planner designed for dynamic GS representations and a human planner, and (3) robust navigation policies to both the sim‑to‑real gap and moving obstacles. The proposed simulator generates thousands of photorealistic navigation scenarios with animatable human GS avatars from arbitrary viewpoints. ReaDy‑Go outperforms baselines across target environments in both simulation and real‑world experiments, demonstrating improved navigation performance even after sim‑to‑real transfer and in the presence of moving obstacles. Moreover, zero‑shot sim‑to‑real deployment in an unseen environment indicates its generalization potential. Project page: https://syeon‑yoo.github.io/ready‑go‑site/.
Authors:Jingkun Liu, Yisong Yue, Max Welling, Yue Song
Abstract:
Self‑attention in Transformers relies on globally normalized softmax weights, causing all tokens to compete for influence at every layer. When composed across depth, this interaction pattern induces strong synchronization dynamics that favor convergence toward a dominant mode, a behavior associated with representation collapse and attention sink phenomena. We introduce Krause Attention, a principled attention mechanism inspired by bounded‑confidence consensus dynamics. Krause Attention replaces similarity‑based global aggregation with distance‑based, localized, and selectively sparse interactions, promoting structured local synchronization instead of global mixing. We relate this behavior to recent theory modeling Transformer dynamics as interacting particle systems, and show how bounded‑confidence interactions naturally moderate attention concentration and alleviate attention sinks. Restricting interactions to local neighborhoods also reduces runtime complexity from quadratic to linear in sequence length. Experiments across vision (ViT on CIFAR/ImageNet), autoregressive generation (MNIST/CIFAR‑10), and large language models (Llama/Qwen) demonstrate consistent gains with substantially reduced computation, highlighting bounded‑confidence dynamics as a scalable and effective inductive bias for attention.
Authors:Dong Yan, Jian Liang, Ran He, Tieniu Tan
Abstract:
Recent studies have shown that large language models (LLMs) can infer private user attributes (e.g., age, location, gender) from user‑generated text shared online, enabling rapid and large‑scale privacy breaches. Existing anonymization‑based defenses are coarse‑grained, lacking word‑level precision in anonymizing privacy‑leaking elements. Moreover, they are inherently limited as altering user text to hide sensitive cues still allows attribute inference to occur through models' reasoning capabilities. To address these limitations, we propose a unified defense framework that combines fine‑grained anonymization (TRACE) with inference‑preventing optimization (RPS). TRACE leverages attention mechanisms and inference chain generation to identify and anonymize privacy‑leaking textual elements, while RPS employs a lightweight two‑stage optimization strategy to induce model rejection behaviors, thereby preventing attribute inference. Evaluations across diverse LLMs show that TRACE‑RPS reduces attribute inference accuracy from around 50% to below 5% on open‑source models. In addition, our approach offers strong cross‑model generalization, prompt‑variation robustness, and utility‑privacy tradeoffs. Our code is available at https://github.com/Jasper‑Yan/TRACE‑RPS.
Authors:Faouzi El Yagoubi, Ranwa Al Mallah, Godwin Badu-Marfo
Abstract:
Multi‑agent Large Language Model (LLM) systems create privacy risks that current benchmarks cannot measure. When agents coordinate on tasks, sensitive data passes through inter‑agent messages, shared memory, and tool arguments; pathways that output‑only audits never inspect. We introduce AgentLeak, to the best of our knowledge the first full‑stack benchmark for privacy leakage covering internal channels, spanning 1,000 scenarios across healthcare, finance, legal, and corporate domains, paired with a 32‑class attack taxonomy and three‑tier detection pipeline. Testing GPT‑4o, GPT‑4o‑mini, Claude 3.5 Sonnet, Mistral Large, and Llama 3.3 70B across 4,979 traces reveals that multi‑agent configurations reduce per‑channel output leakage (C1: 27.2% vs 43.2% in single‑agent) but introduce unmonitored internal channels that raise total system exposure to 68.9% (OR‑aggregated across C1, C2, C5). Internal channels account for most of this gap: inter‑agent messages (C2) leak at 68.8%, compared to 27.2% on C1 (output channel). This means that output‑only audits miss 41.7% of violations. Claude 3.5 Sonnet, which emphasizes safety alignment in its design, achieves the lowest leakage rates on both external (3.3%) and internal (28.1%) channels, suggesting that model‑level safety training may transfer to internal channel protection. Across all five models and four domains, the pattern C2 > C1 holds consistently, confirming that inter‑agent communication is the primary vulnerability. These findings underscore the need for coordination frameworks that incorporate internal‑channel privacy protections and enforce privacy controls on inter‑agent communication.
Authors:David Wan, Han Wang, Ziyang Wang, Elias Stengel-Eskin, Hyunji Lee, Mohit Bansal
Abstract:
Multimodal large language models (MLLMs) are increasingly used for real‑world tasks involving multi‑step reasoning and long‑form generation, where reliability requires grounding model outputs in heterogeneous input sources and verifying individual factual claims. However, existing multimodal grounding benchmarks and evaluation methods focus on simplified, observation‑based scenarios or limited modalities and fail to assess attribution in complex multimodal reasoning. We introduce MuRGAt (Multimodal Reasoning with Grounded Attribution), a benchmark for evaluating fact‑level multimodal attribution in settings that require reasoning beyond direct observation. Given inputs spanning video, audio, and other modalities, MuRGAt requires models to generate answers with explicit reasoning and precise citations, where each citation specifies both modality and temporal segments. To enable reliable assessment, we introduce an automatic evaluation framework that strongly correlates with human judgments. Benchmarking with human and automated scores reveals that even strong MLLMs frequently hallucinate citations despite correct reasoning. Moreover, we observe a key trade‑off: increasing reasoning depth or enforcing structured grounding often degrades accuracy, highlighting a significant gap between internal reasoning and verifiable attribution.
Authors:Mark D. Olchanyi, Annabel Sorby-Adams, John Kirsch, Brian L. Edlow, Ava Farnan, Renfei Liu, Matthew S. Rosen, Emery N. Brown, W. Taylor Kimberly, Juan Eugenio Iglesias
Abstract:
Portable, ultra‑low‑field (ULF) magnetic resonance imaging has the potential to expand access to neuroimaging but currently suffers from coarse spatial and angular resolutions and low signal‑to‑noise ratios. Diffusion tensor imaging (DTI), a sequence tailored to detect and reconstruct white matter tracts within the brain, is particularly prone to such imaging degradation due to inherent sequence design coupled with prolonged scan times. In addition, ULF DTI scans exhibit artifacting that spans both the space and angular domains, requiring a custom modelling algorithm for subsequent correction. We introduce a nine‑direction, single‑shell ULF DTI sequence, as well as a companion Bayesian bias field correction algorithm that possesses angular dependence and convolutional neural network‑based superresolution algorithm that is generalizable across DTI datasets and does not require re‑training (''DiffSR''). We show through a synthetic downsampling experiment and white matter assessment in real, matched ULF and high‑field DTI scans that these algorithms can recover microstructural and volumetric white matter information at ULF. We also show that DiffSR can be directly applied to white matter‑based Alzheimers disease classification in synthetically degraded scans, with notable improvements in agreement between DTI metrics, as compared to un‑degraded scans. We freely disseminate the Bayesian bias correction algorithm and DiffSR with the goal of furthering progress on both ULF reconstruction methods and general DTI sequence harmonization. We release all code related to DiffSR for \hrefhttps://github.com/markolchanyi/DiffSRpublic \space use.
Authors:Chengrui Qu, Christopher Yeh, Kishan Panaganti, Eric Mazumdar, Adam Wierman
Abstract:
Cooperative multi‑agent reinforcement learning (MARL) commonly adopts centralized training with decentralized execution, where value‑factorization methods enforce the individual‑global‑maximum (IGM) principle so that decentralized greedy actions recover the team‑optimal joint action. However, the reliability of this recipe in real‑world settings remains unreliable due to environmental uncertainties arising from the sim‑to‑real gap, model mismatch, and system noise. We address this gap by introducing Distributionally robust IGM (DrIGM), a principle that requires each agent's robust greedy action to align with the robust team‑optimal joint action. We show that DrIGM holds for a novel definition of robust individual action values, which is compatible with decentralized greedy execution and yields a provable robustness guarantee for the whole system. Building on this foundation, we derive DrIGM‑compliant robust variants of existing value‑factorization architectures (e.g., VDN/QMIX/QTRAN) that (i) train on robust Q‑targets, (ii) preserve scalability, and (iii) integrate seamlessly with existing codebases without bespoke per‑agent reward shaping. Empirically, on high‑fidelity SustainGym simulators and a StarCraft game environment, our methods consistently improve out‑of‑distribution performance. Code and data are available at https://github.com/crqu/robust‑coMARL.
Authors:Sina Tayebati, Divake Kumar, Nastaran Darabi, Davide Ettori, Ranganath Krishnan, Amit Ranjan Trivedi
Abstract:
Estimating uncertainty for AI agents in real‑world multi‑turn tool‑using interaction with humans is difficult because failures are often triggered by sparse critical episodes (e.g., looping, incoherent tool use, or user‑agent miscoordination) even when local generation appears confident. Existing uncertainty proxies focus on single‑shot text generation and therefore miss these trajectory‑level breakdown signals. We introduce TRACER, a trajectory‑level uncertainty metric for dual‑control Tool‑Agent‑User interaction. TRACER combines content‑aware surprisal with situational‑awareness signals, semantic and lexical repetition, and tool‑grounded coherence gaps, and aggregates them using a tail‑focused risk functional with a MAX‑composite step risk to surface decisive anomalies. We evaluate TRACER on τ^2‑bench by predicting task failure and selective task execution. To this end, TRACER improves AUROC by up to 37.1% and AUARC by up to 55% over baselines, enabling earlier and more accurate detection of uncertainty in complex conversational tool‑use settings. Our code and benchmark are available at https://github.com/sinatayebati/agent‑tracer.
Authors:Chongyi Zheng, Royina Karegoudra Jayanth, Benjamin Eysenbach
Abstract:
As machine learning has moved towards leveraging large models as priors for downstream tasks, the community has debated the right form of prior for solving reinforcement learning (RL) problems. If one were to try to prefetch as much computation as possible, they would attempt to learn a prior over the policies for some yet‑to‑be‑determined reward function. Recent work (forward‑backward (FB) representation learning) has tried this, arguing that an unsupervised representation learning procedure can enable optimal control over arbitrary rewards without further fine‑tuning. However, FB's training objective and learning behavior remain mysterious. In this paper, we demystify FB by clarifying when such representations can exist, what its objective optimizes, and how it converges in practice. We draw connections with rank matching, fitted Q‑evaluation, and contraction mapping. Our analysis suggests a simplified unsupervised pre‑training method for RL that, instead of enabling optimal control, performs one step of policy improvement. We call our proposed method one‑step forward‑backward representation learning (one‑step FB). Experiments in didactic settings, as well as in 10 state‑based and image‑based continuous control domains, demonstrate that one‑step FB converges to errors 10^5 smaller and improves zero‑shot performance by +24% on average. Our project website is available at https://chongyi‑zheng.github.io/onestep‑fb.
Authors:Heejeong Nam, Quentin Le Lidec, Lucas Maes, Yann LeCun, Randall Balestriero
Abstract:
World models require robust relational understanding to support prediction, reasoning, and control. While object‑centric representations provide a useful abstraction, they are not sufficient to capture interaction‑dependent dynamics. We therefore propose C‑JEPA, a simple and flexible object‑centric world model that extends masked joint embedding prediction from image patches to object‑centric representations. By applying object‑level masking that requires an object's state to be inferred from other objects, C‑JEPA induces latent interventions with counterfactual‑like effects and prevents shortcut solutions, making interaction reasoning essential. Empirically, C‑JEPA leads to consistent gains in visual question answering, with an absolute improvement of about 20% in counterfactual reasoning compared to the same architecture without object‑level masking. On agent control tasks, C‑JEPA enables substantially more efficient planning by using only 1% of the total latent input features required by patch‑based world models, while achieving comparable performance. Finally, we provide a formal analysis demonstrating that object‑level masking induces a causal inductive bias via latent interventions. Our code is available at https://github.com/galilai‑group/cjepa.
Authors:Zachary Pedram Dadfar
Abstract:
Large language models produce rich introspective language when prompted for self‑examination, but whether this language reflects internal computation or sophisticated confabulation has remained unclear. We show that self‑referential vocabulary tracks concurrent activation dynamics, and that this correspondence is specific to self‑referential processing. We introduce the Pull Methodology, a protocol that elicits extended self‑examination through format engineering, and use it to identify a direction in activation space that distinguishes self‑referential from descriptive processing in Llama 3.1. The direction is orthogonal to the known refusal direction, localised at 6.25% of model depth, and causally influences introspective output when used for steering. When models produce "loop" vocabulary, their activations exhibit higher autocorrelation (r = 0.44, p = 0.002); when they produce "shimmer" vocabulary under steering, activation variability increases (r = 0.36, p = 0.002). Critically, the same vocabulary in non‑self‑referential contexts shows no activation correspondence despite nine‑fold higher frequency. Qwen 2.5‑32B, with no shared training, independently develops different introspective vocabulary tracking different activation metrics, all absent in descriptive controls. The findings indicate that self‑report in transformer models can, under appropriate conditions, reliably track internal computational states.
Authors:Bang Nguyen, Dominik Soós, Qian Ma, Rochana R. Obadage, Zack Ranjan, Sai Koneru, Timothy M. Errington, Shakhlo Nematova, Sarah Rajtmajer, Jian Wu, Meng Jiang
Abstract:
The literature has witnessed an emerging interest in AI agents for automated assessment of scientific papers. Existing benchmarks focus primarily on the computational aspect of this task, testing agents' ability to reproduce or replicate research outcomes when having access to the code and data. This setting, while foundational, (1) fails to capture the inconsistent availability of new data for replication as opposed to reproduction, and (2) lacks ground‑truth diversity by focusing only on reproducible papers, thereby failing to evaluate an agent's ability to identify non‑replicable research. Furthermore, most benchmarks only evaluate outcomes rather than the replication process. In response, we introduce ReplicatorBench, an end‑to‑end benchmark, including human‑verified replicable and non‑replicable research claims in social and behavioral sciences for evaluating AI agents in research replication across three stages: (1) extraction and retrieval of replication data; (2) design and execution of computational experiments; and (3) interpretation of results, allowing a test of AI agents' capability to mimic the activities of human replicators in real world. To set a baseline of AI agents' capability, we develop ReplicatorAgent, an agentic framework equipped with necessary tools, like web search and iterative interaction with sandboxed environments, to accomplish tasks in ReplicatorBench. We evaluate ReplicatorAgent across four underlying large language models (LLMs), as well as different design choices of programming language and levels of code access. Our findings reveal that while current LLM agents are capable of effectively designing and executing computational experiments, they struggle with retrieving resources, such as new data, necessary to replicate a claim. All code and data are publicly available at https://github.com/CenterForOpenScience/llm‑benchmarking.
Authors:Yihang Yao, Zhepeng Cen, Haohong Lin, Shiqi Liu, Zuxin Liu, Jiacheng Zhu, Zhang-Wei Hong, Laixi Shi, Ding Zhao
Abstract:
Proactive large language model (LLM) agents aim to actively plan, query, and interact over multiple turns, enabling efficient task completion beyond passive instruction following and making them essential for real‑world, user‑centric applications. Agentic reinforcement learning (RL) has recently emerged as a promising solution for training such agents in multi‑turn settings, allowing interaction strategies to be learned from feedback. However, existing pipelines face a critical challenge in balancing task performance with user engagement, as passive agents can not efficiently adapt to users' intentions while overuse of human feedback reduces their satisfaction. To address this trade‑off, we propose BAO, an agentic RL framework that combines behavior enhancement to enrich proactive reasoning and information‑gathering capabilities with behavior regularization to suppress inefficient or redundant interactions and align agent behavior with user expectations. We evaluate BAO on multiple tasks from the UserRL benchmark suite, and demonstrate that it substantially outperforms proactive agentic RL baselines while achieving comparable or even superior performance to commercial LLM agents, highlighting its effectiveness for training proactive, user‑aligned LLM agents in complex multi‑turn scenarios. Our website: https://proactive‑agentic‑rl.github.io/.
Authors:Jason Dury
Abstract:
Current approaches to memory in neural systems rely on similarity‑based retrieval: given a query, find the most representationally similar stored state. This assumption ‑‑ that useful memories are similar memories ‑‑ fails to capture a fundamental property of biological memory: association through temporal co‑occurrence. We propose Predictive Associative Memory (PAM), an architecture in which a JEPA‑style predictor, trained on temporal co‑occurrence within a continuous experience stream, learns to navigate the associative structure of an embedding space. We introduce an Inward JEPA that operates over stored experience (predicting associatively reachable past states) as the complement to the standard Outward JEPA that operates over incoming sensory data (predicting future states). We evaluate PAM as an associative recall system ‑‑ testing faithfulness of recall for experienced associations ‑‑ rather than as a retrieval system evaluated on generalisation to unseen associations. On a synthetic benchmark, the predictor's top retrieval is a true temporal associate 97% of the time (Association Precision@1 = 0.970); it achieves cross‑boundary Recall@20 = 0.421 where cosine similarity scores zero; and it separates experienced‑together from never‑experienced‑together states with a discrimination AUC of 0.916 (cosine: 0.789). Even restricted to cross‑room pairs where embedding similarity is uninformative, the predictor achieves AUC = 0.849 (cosine: 0.503, chance). A temporal shuffle control confirms the signal is genuine temporal co‑occurrence structure, not embedding geometry: shuffling collapses cross‑boundary recall by 90%, replicated across training seeds. All results are stable across seeds (SD < 0.006) and query selections (SD \leq 0.012).
Authors:Vishak K Bhat, Prateek Chanda, Ashmit Khandelwal, Maitreyi Swaroop, Vineeth N. Balasubramanian, Subbarao Kambhampati, Nagarajan Natarajan, Amit Sharma
Abstract:
We present a test‑time verification framework, interwhen, that ensures that the output of a reasoning model is valid wrt. a given set of verifiers. Verified reasoning is an important goal in high‑stakes scenarios such as deploying agents in the physical world or in domains such as law and finance. However, current techniques either rely on the generate‑test paradigm that verifies only after the final answer is produced, or verify partial output through a step‑extraction paradigm where the task execution is externally broken down into structured steps. The former is inefficient while the latter artificially restricts a model's problem solving strategies. Instead, we propose to verify a model's reasoning trace as‑is, taking full advantage of a model's reasoning capabilities while verifying and steering the model's output only when needed. The key idea is meta‑prompting, identifying the verifiable properties that any partial solution should satisfy and then prompting the model to follow a custom format in its trace such that partial outputs can be easily parsed and checked. We consider both self‑verification and external verification and find that interwhen provides a useful abstraction to provide feedback and steer reasoning models in each case. Using self‑verification, interwhen obtains state‑of‑the‑art results on early stopping reasoning models, without any loss in accuracy. Using external verifiers, interwhen obtains 10 p.p. improvement in accuracy over test‑time scaling methods, while ensuring 100% soundness and being 4x more efficient. The code for interwhen is available at https://github.com/microsoft/interwhen
Authors:Gongye Liu, Bo Yang, Yida Zhi, Zhizhou Zhong, Lei Ke, Didan Deng, Han Gao, Yongxiang Huang, Kaihao Zhang, Hongbo Fu, Wenhan Luo
Abstract:
Preference optimization for diffusion and flow‑matching models relies on reward functions that are both discriminatively robust and computationally efficient. Vision‑Language Models (VLMs) have emerged as the primary reward provider, leveraging their rich multimodal priors to guide alignment. However, their computation and memory cost can be substantial, and optimizing a latent diffusion generator through a pixel‑space reward introduces a domain mismatch that complicates alignment. In this paper, we propose DiNa‑LRM, a diffusion‑native latent reward model that formulates preference learning directly on noisy diffusion states. Our method introduces a noise‑calibrated Thurstone likelihood with diffusion‑noise‑dependent uncertainty. DiNa‑LRM leverages a pretrained latent diffusion backbone with a timestep‑conditioned reward head, and supports inference‑time noise ensembling, providing a diffusion‑native mechanism for test‑time scaling and robust rewarding. Across image alignment benchmarks, DiNa‑LRM substantially outperforms existing diffusion‑based reward baselines and achieves performance competitive with state‑of‑the‑art VLMs at a fraction of the computational cost. In preference optimization, we demonstrate that DiNa‑LRM improves preference optimization dynamics, enabling faster and more resource‑efficient model alignment.
Authors:Ruichuan An, Sihan Yang, Ziyu Guo, Wei Dai, Zijun Shen, Haodong Li, Renrui Zhang, Xinyu Wei, Guopeng Li, Wenshan Wu, Wentao Zhang
Abstract:
Unified Multimodal Models (UMMs) have shown remarkable progress in visual generation. Yet, existing benchmarks predominantly assess Crystallized Intelligence, which relies on recalling accumulated knowledge and learned schemas. This focus overlooks Generative Fluid Intelligence (GFI): the capacity to induce patterns, reason through constraints, and adapt to novel scenarios on the fly. To rigorously assess this capability, we introduce GENIUS (GEN Fluid Intelligence EvalUation Suite). We formalize GFI as a synthesis of three primitives. These include Inducing Implicit Patterns (e.g., inferring personalized visual preferences), Executing Ad‑hoc Constraints (e.g., visualizing abstract metaphors), and Adapting to Contextual Knowledge (e.g., simulating counter‑intuitive physics). Collectively, these primitives challenge models to solve problems grounded entirely in the immediate context. Our systematic evaluation of 12 representative models reveals significant performance deficits in these tasks. Crucially, our diagnostic analysis disentangles these failure modes. It demonstrates that deficits stem from limited context comprehension rather than insufficient intrinsic generative capability. To bridge this gap, we propose a training‑free attention intervention strategy. Ultimately, GENIUS establishes a rigorous standard for GFI, guiding the field beyond knowledge utilization toward dynamic, general‑purpose reasoning. Our dataset and code will be released at: \hrefhttps://github.com/arctanxarc/GENIUShttps://github.com/arctanxarc/GENIUS.
Authors:Valery Khvatov, Alexey Neyman
Abstract:
Formal privacy metrics provide compliance‑oriented guarantees but often fail to quantify actual linkability in released datasets. We introduce CVPL (Cluster‑Vector‑Projection Linkage), a geometric framework for post‑hoc assessment of linkage risk between original and protected tabular data. CVPL represents linkage analysis as an operator pipeline comprising blocking, vectorization, latent projection, and similarity evaluation, yielding continuous, scenario‑dependent risk estimates rather than binary compliance verdicts. We formally define CVPL under an explicit threat model and introduce threshold‑aware risk surfaces, R(lambda, tau), that capture the joint effects of protection strength and attacker strictness. We establish a progressive blocking strategy with monotonicity guarantees, enabling anytime risk estimation with valid lower bounds. We demonstrate that the classical Fellegi‑Sunter linkage emerges as a special case of CVPL under restrictive assumptions, and that violations of these assumptions can lead to systematic over‑linking bias. Empirical validation on 10,000 records across 19 protection configurations demonstrates that formal k‑anonymity compliance may coexist with substantial empirical linkability, with a significant portion arising from non‑quasi‑identifier behavioral patterns. CVPL provides interpretable diagnostics identifying which features drive linkage feasibility, supporting privacy impact assessment, protection mechanism comparison, and utility‑risk trade‑off analysis.
Authors:Zhiyin Tan, Jennifer D'Souza
Abstract:
Systematic reviews and meta‑analyses rely on converting narrative articles into structured, numerically grounded study records. Despite rapid advances in large language models (LLMs), it remains unclear whether they can meet the structural requirements of this process, which hinge on preserving roles, methods, and effect‑size attribution across documents rather than on recognizing isolated entities. We propose a structural, diagnostic framework that evaluates LLM‑based evidence extraction as a progression of schema‑constrained queries with increasing relational and numerical complexity, enabling precise identification of failure points beyond atom‑level extraction. Using a manually curated corpus spanning five scientific domains, together with a unified query suite and evaluation protocol, we evaluate two state‑of‑the‑art LLMs under both per‑document and long‑context, multi‑document input regimes. Across domains and models, performance remains moderate for single‑property queries but degrades sharply once tasks require stable binding between variables, roles, statistical methods, and effect sizes. Full meta‑analytic association tuples are extracted with near‑zero reliability, and long‑context inputs further exacerbate these failures. Downstream aggregation amplifies even minor upstream errors, rendering corpus‑level statistics unreliable. Our analysis shows that these limitations stem not from entity recognition errors, but from systematic structural breakdowns, including role reversals, cross‑analysis binding drift, instance compression in dense result sections, and numeric misattribution, indicating that current LLMs lack the structural fidelity, relational binding, and numerical grounding required for automated meta‑analysis. The code and data are publicly available at GitHub (https://github.com/zhiyintan/LLM‑Meta‑Analysis).
Authors:Cong Pang, Xuyu Feng, Yujie Yi, Zixuan Chen, Jiawei Hong, Tiankuo Yao, Nang Yuan, Jiapeng Luo, Lewei Lu, Xin Lou
Abstract:
Despite the strong performance achieved by reinforcement learning‑trained information‑seeking agents, learning in open‑ended web environments remains severely constrained by low signal‑to‑noise feedback. Text‑based parsers often discard layout semantics and introduce unstructured noise, while long‑horizon training typically relies on sparse outcome rewards that obscure which retrieval actions actually matter. We propose a visual‑native search framework that represents webpages as visual snapshots, allowing agents to leverage layout cues to quickly localize salient evidence and suppress distractors. To learn effectively from these high‑dimensional observations, we introduce Information‑Aware Credit Assignment (ICA), a post‑hoc method that estimates each retrieved snapshot's contribution to the final outcome via posterior analysis and propagates dense learning signals back to key search turns. Integrated with a GRPO‑based training pipeline, our approach consistently outperforms text‑based baselines on diverse information‑seeking benchmarks, providing evidence that visual snapshot grounding with information‑level credit assignment alleviates the credit‑assignment bottleneck in open‑ended web environments. The code and datasets will be released in https://github.com/pc‑inno/ICA_MM_deepsearch.git.
Authors:Fanpu Cao, Lu Dai, Jindong Han, Hui Xiong
Abstract:
Multivariate time series forecasting (MTSF) plays a vital role in numerous real‑world applications, yet existing models remain constrained by their reliance on a limited historical context. This limitation prevents them from effectively capturing global periodic patterns that often span cycles significantly longer than the input horizon ‑ despite such patterns carrying strong predictive signals. Naive solutions, such as extending the historical window, lead to severe drawbacks, including overfitting, prohibitive computational costs, and redundant information processing. To address these challenges, we introduce the Global Temporal Retriever (GTR), a lightweight and plug‑and‑play module designed to extend any forecasting model's temporal awareness beyond the immediate historical context. GTR maintains an adaptive global temporal embedding of the entire cycle and dynamically retrieves and aligns relevant global segments with the input sequence. By jointly modeling local and global dependencies through a 2D convolution and residual fusion, GTR effectively bridges short‑term observations with long‑term periodicity without altering the host model architecture. Extensive experiments on six real‑world datasets demonstrate that GTR consistently delivers state‑of‑the‑art performance across both short‑term and long‑term forecasting scenarios, while incurring minimal parameter and computational overhead. These results highlight GTR as an efficient and general solution for enhancing global periodicity modeling in MTSF tasks. Code is available at this repository: https://github.com/macovaseas/GTR.
Authors:Xuecheng Zou, Yu Tang, Bingbing Wang
Abstract:
Knowledge Graph Completion (KGC) fundamentally hinges on the coherent fusion of pre‑trained entity semantics with heterogeneous topological structures to facilitate robust relational reasoning. However, existing paradigms encounter a critical "structural resolution mismatch," failing to reconcile divergent representational demands across varying graph densities, which precipitates structural noise interference in dense clusters and catastrophic representation collapse in sparse regions. We present SynergyKGC, an adaptive framework that advances traditional neighbor aggregation to an active Cross‑Modal Synergy Expert via relation‑aware cross‑attention and semantic‑intent‑driven gating. By coupling a density‑dependent Identity Anchoring strategy with a Double‑tower Coherent Consistency architecture, SynergyKGC effectively reconciles topological heterogeneity while ensuring representational stability across training and inference phases. Systematic evaluations on two public benchmarks validate the superiority of our method in significantly boosting KGC hit rates, providing empirical evidence for a generalized principle of resilient information integration in non‑homogeneous structured data.
Authors:Yuexiao Ma, Xuzhe Zheng, Jing Xu, Xiwei Xu, Feng Ling, Xiawu Zheng, Huafeng Kuang, Huixia Li, Xing Wang, Xuefeng Xiao, Fei Chao, Rongrong Ji
Abstract:
Autoregressive models, often built on Transformer architectures, represent a powerful paradigm for generating ultra‑long videos by synthesizing content in sequential chunks. However, this sequential generation process is notoriously slow. While caching strategies have proven effective for accelerating traditional video diffusion models, existing methods assume uniform denoising across all frames‑an assumption that breaks down in autoregressive models where different video chunks exhibit varying similarity patterns at identical timesteps. In this paper, we present FlowCache, the first caching framework specifically designed for autoregressive video generation. Our key insight is that each video chunk should maintain independent caching policies, allowing fine‑grained control over which chunks require recomputation at each timestep. We introduce a chunkwise caching strategy that dynamically adapts to the unique denoising characteristics of each chunk, complemented by a joint importance‑redundancy optimized KV cache compression mechanism that maintains fixed memory bounds while preserving generation quality. Our method achieves remarkable speedups of 2.38 times on MAGI‑1 and 6.7 times on SkyReels‑V2, with negligible quality degradation (VBench: 0.87 increase and 0.79 decrease respectively). These results demonstrate that FlowCache successfully unlocks the potential of autoregressive models for real‑time, ultra‑long video generation‑establishing a new benchmark for efficient video synthesis at scale. The code is available at https://github.com/mikeallen39/FlowCache.
Authors:Hugo L. Hammer, Vajira Thambawita, Pål Halvorsen
Abstract:
A narrated e‑book combines synchronized audio with digital text, highlighting the currently spoken word or sentence during playback. This format supports early literacy and assists individuals with reading challenges, while also allowing general readers to seamlessly switch between reading and listening. With the emergence of natural‑sounding neural Text‑to‑Speech (TTS) technology, several commercial services have been developed to leverage these technology for converting standard text e‑books into high‑quality narrated e‑books. However, no open‑source solutions currently exist to perform this task. In this paper, we present Calliope, an open‑source framework designed to fill this gap. Our method leverages state‑of‑the‑art open‑source TTS to convert a text e‑book into a narrated e‑book in the EPUB 3 Media Overlay format. The method offers several innovative steps: audio timestamps are captured directly during TTS, ensuring exact synchronization between narration and text highlighting; the publisher's original typography, styling, and embedded media are strictly preserved; and the entire pipeline operates offline. This offline capability eliminates recurring API costs, mitigates privacy concerns, and avoids copyright compliance issues associated with cloud‑based services. The framework currently supports the state‑of‑the‑art open‑source TTS systems XTTS‑v2 and Chatterbox. A potential alternative approach involves first generating narration via TTS and subsequently synchronizing it with the text using forced alignment. However, while our method ensures exact synchronization, our experiments show that forced alignment introduces drift between the audio and text highlighting significant enough to degrade the reading experience. Source code and usage instructions are available at https://github.com/hugohammer/TTS‑Narrated‑Ebook‑Creator.git.
Authors:Yifei Li, Weidong Guo, Lingling Zhang, Rongman Xu, Muye Huang, Hui Liu, Lijiao Xu, Yu Xu, Jun Liu
Abstract:
Long‑term conversational memory is a core capability for LLM‑based dialogue systems, yet existing benchmarks and evaluation protocols primarily focus on surface‑level factual recall. In realistic interactions, appropriate responses often depend on implicit constraints such as user state, goals, or values that are not explicitly queried later. To evaluate this setting, we introduce LoCoMo‑Plus, a benchmark for assessing cognitive memory under cue‑‑trigger semantic disconnect, where models must retain and apply latent constraints across long conversational contexts. We further show that conventional string‑matching metrics and explicit task‑type prompting are misaligned with such scenarios, and propose a unified evaluation framework based on constraint consistency. Experiments across diverse backbone models, retrieval‑based methods, and memory systems demonstrate that cognitive memory remains challenging and reveals failures not captured by existing benchmarks. Our code and evaluation framework are publicly available at: https://github.com/xjtuleeyf/Locomo‑Plus.
Authors:Guobin Shen, Chenxiao Zhao, Xiang Cheng, Lei Huang, Xing Yu
Abstract:
Training stability remains a central challenge in reinforcement learning (RL) for large language models (LLMs). Policy staleness, asynchronous training, and mismatches between training and inference engines all cause the behavior policy to diverge from the current policy, risking training collapse. Importance sampling provides a principled correction for this distribution shift but suffers from high variance; existing remedies such as token‑level clipping and sequence‑level normalization lack a unified theoretical foundation. We propose Variational sEquence‑level Soft Policy Optimization (VESPO). By incorporating variance reduction into a variational formulation over proposal distributions, VESPO derives a closed‑form reshaping kernel that operates directly on sequence‑level importance weights without length normalization. Experiments on mathematical reasoning benchmarks show that VESPO maintains stable training under staleness ratios up to 64x and fully asynchronous execution, and delivers consistent gains across both dense and Mixture‑of‑Experts models. Code is available at https://github.com/FloyedShen/VESPO
Authors:Junhua Liu, Zhangcheng Wang, Zhike Han, Ningli Wang, Guotao Liang, Kun Kuang
Abstract:
Visual Chain‑of‑Thought (VCoT) has emerged as a promising paradigm for enhancing multimodal reasoning by integrating visual perception into intermediate reasoning steps. However, existing VCoT approaches are largely confined to static scenarios and struggle to capture the temporal dynamics essential for tasks such as instruction, prediction, and camera motion. To bridge this gap, we propose TwiFF‑2.7M, the first large‑scale, temporally grounded VCoT dataset derived from 2.7 million video clips, explicitly designed for dynamic visual question and answer. Accompanying this, we introduce TwiFF‑Bench, a high‑quality evaluation benchmark of 1,078 samples that assesses both the plausibility of reasoning trajectories and the correctness of final answers in open‑ended dynamic settings. Building on these foundations, we propose the TwiFF model, a unified modal that synergistically leverages pre‑trained video generation and image comprehension capabilities to produce temporally coherent visual reasoning cues‑iteratively generating future action frames and textual reasoning. Extensive experiments demonstrate that TwiFF significantly outperforms existing VCoT methods and Textual Chain‑of‑Thought baselines on dynamic reasoning tasks, which fully validates the effectiveness for visual question answering in dynamic scenarios. Our code and data is available at https://github.com/LiuJunhua02/TwiFF.
Authors:Guangzhi Xiong, Sanchit Sinha, Aidong Zhang
Abstract:
The trade‑off between interpretability and accuracy remains a core challenge in machine learning. Standard Generalized Additive Models (GAMs) offer clear feature attributions but are often constrained by their strictly additive nature, which can limit predictive performance. Introducing feature interactions can boost accuracy yet may obscure individual feature contributions. To address these issues, we propose Neural Additive Experts (NAEs), a novel framework that seamlessly balances interpretability and accuracy. NAEs employ a mixture of experts framework, learning multiple specialized networks per feature, while a dynamic gating mechanism integrates information across features, thereby relaxing rigid additive constraints. Furthermore, we propose targeted regularization techniques to mitigate variance among expert predictions, facilitating a smooth transition from an exclusively additive model to one that captures intricate feature interactions while maintaining clarity in feature attributions. Our theoretical analysis and experiments on synthetic data illustrate the model's flexibility, and extensive evaluations on real‑world datasets confirm that NAEs achieve an optimal balance between predictive accuracy and transparent, feature‑level explanations. The code is available at https://github.com/Teddy‑XiongGZ/NAE.
Authors:Chenhao Zhang, Yazhe Niu, Hongsheng Li
Abstract:
Metaphorical comprehension in images remains a critical challenge for Nowadays AI systems. While Multimodal Large Language Models (MLLMs) excel at basic Visual Question Answering (VQA), they consistently struggle to grasp the nuanced cultural, emotional, and contextual implications embedded in visual content. This difficulty stems from the task's demand for sophisticated multi‑hop reasoning, cultural context, and Theory of Mind (ToM) capabilities, which current models lack. To fill this gap, we propose MetaphorStar, the first end‑to‑end visual reinforcement learning (RL) framework for image implication tasks. Our framework includes three core components: the fine‑grained dataset TFQ‑Data, the visual RL method TFQ‑GRPO, and the well‑structured benchmark TFQ‑Bench. Our fully open‑source MetaphorStar family, trained using TFQ‑GRPO on TFQ‑Data, significantly improves performance by an average of 82.6% on the image implication benchmarks. Compared with 20+ mainstream MLLMs, MetaphorStar‑32B achieves state‑of‑the‑art (SOTA) on Multiple‑Choice Question and Open‑Style Question, significantly outperforms the top closed‑source model Gemini‑3.0‑pro on True‑False Question. Crucially, our experiments reveal that learning image implication tasks improves the general understanding ability, especially the complex visual reasoning ability. We further provide a systematic analysis of model parameter scaling, training data scaling, and the impact of different model architectures and training strategies, demonstrating the broad applicability of our method. We open‑sourced all model weights, datasets, and method code at https://metaphorstar.github.io.
Authors:Guanting Ye, Qiyan Zhao, Wenhao Yu, Xiaofeng Zhang, Jianmin Ji, Yanyong Zhang, Ka-Veng Yuen
Abstract:
Recent advances in 3D Large Multimodal Models (LMMs) built on Large Language Models (LLMs) have established the alignment of 3D visual features with LLM representations as the dominant paradigm. However, the inherited Rotary Position Embedding (RoPE) introduces limitations for multimodal processing. Specifically, applying 1D temporal positional indices disrupts the continuity of visual features along the column dimension, resulting in spatial locality loss. Moreover, RoPE follows the prior that temporally closer image tokens are more causally related, leading to long‑term decay in attention allocation and causing the model to progressively neglect earlier visual tokens as the sequence length increases. To address these issues, we propose C^2RoPE, an improved RoPE that explicitly models local spatial Continuity and spatial Causal relationships for visual processing. C^2RoPE introduces a spatio‑temporal continuous positional embedding mechanism for visual tokens. It first integrates 1D temporal positions with Cartesian‑based spatial coordinates to construct a triplet hybrid positional index, and then employs a frequency allocation strategy to encode spatio‑temporal positional information across the three index components. Additionally, we introduce Chebyshev Causal Masking, which determines causal dependencies by computing the Chebyshev distance of image tokens in 2D space. Evaluation results across various benchmarks, including 3D scene reasoning and 3D visual question answering, demonstrate C^2RoPE's effectiveness. The code is be available at https://github.com/ErikZ719/C2RoPE.
Authors:Dongshuo Yin, Xue Yang, Deng-Ping Fan, Shi-Min Hu
Abstract:
Deploying vision foundation models typically relies on efficient adaptation strategies, whereas conventional full fine‑tuning suffers from prohibitive costs and low efficiency. While delta‑tuning has proven effective in boosting the performance and efficiency of LLMs during adaptation, its advantages cannot be directly transferred to the fine‑tuning pipeline of vision foundation models. To push the boundaries of adaptation efficiency for vision tasks, we propose an adapter with Complex Linear Projection Optimization (CoLin). For architecture, we design a novel low‑rank complex adapter that introduces only about 1% parameters to the backbone. For efficiency, we theoretically prove that low‑rank composite matrices suffer from severe convergence issues during training, and address this challenge with a tailored loss. Extensive experiments on object detection, segmentation, image classification, and rotated object detection (remote sensing scenario) demonstrate that CoLin outperforms both full fine‑tuning and classical delta‑tuning approaches with merely 1% parameters for the first time, providing a novel and efficient solution for deployment of vision foundation models. We release the code on https://github.com/DongshuoYin/CoLin.
Authors:Yansong Qu, Zihao Sheng, Zilin Huang, Jiancong Chen, Yuhao Luo, Tianyi Wang, Yiheng Feng, Samuel Labi, Sikai Chen
Abstract:
Reinforcement Learning (RL) has emerged as a dominant paradigm for end‑to‑end autonomous driving (AD). However, RL suffers from sample inefficiency and a lack of semantic interpretability in complex scenarios. Foundation Models, particularly Vision‑Language Models (VLMs), can mitigate this by offering rich, context‑aware knowledge, yet their high inference latency hinders deployment in high‑frequency RL training loops. To bridge this gap, we present Found‑RL, a platform tailored to efficiently enhance RL for AD using foundation models. A core innovation is the asynchronous batch inference framework, which decouples heavy VLM reasoning from the simulation loop, effectively resolving latency bottlenecks to support real‑time learning. We introduce diverse supervision mechanisms: Value‑Margin Regularization (VMR) and Advantage‑Weighted Action Guidance (AWAG) to effectively distill expert‑like VLM action suggestions into the RL policy. Additionally, we adopt high‑throughput CLIP for dense reward shaping. We address CLIP's dynamic blindness via Conditional Contrastive Action Alignment, which conditions prompts on discretized speed/command and yields a normalized, margin‑based bonus from context‑specific action‑anchor scoring. Found‑RL provides an end‑to‑end pipeline for fine‑tuned VLM integration and shows that a lightweight RL model can achieve near‑VLM performance compared with billion‑parameter VLMs while sustaining real‑time inference (approx. 500 FPS). Code, data, and models will be publicly available at https://github.com/ys‑qu/found‑rl.
Authors:Feiyu Pan, Tianbin Zhang, Aoqian Zhang, Yu Sun, Zheng Wang, Lixing Chen, Li Pan, Jianhua Li
Abstract:
Modern data lakes have emerged as foundational platforms for large‑scale machine learning, enabling flexible storage of heterogeneous data and structured analytics through table‑oriented abstractions. Despite their growing importance, standardized benchmarks for evaluating machine learning performance in data lake environments remain scarce. To address this gap, we present LakeMLB (Data Lake Machine Learning Benchmark), designed for the most common multi‑source, multi‑table scenarios in data lakes. LakeMLB focuses on two representative multi‑table scenarios, Union and Join, and provides three real‑world datasets for each scenario, covering government open data, finance, Wikipedia, and online marketplaces. The benchmark supports three representative integration strategies: pre‑training‑based, data augmentation‑based, and feature augmentation‑based approaches. We conduct extensive experiments with state‑of‑the‑art tabular learning methods, offering insights into their performance under complex data lake scenarios. We release both datasets and code to facilitate rigorous research on machine learning in data lake ecosystems; the benchmark is available at https://github.com/zhengwang100/LakeMLB.
Authors:Keenan Pepper, Alex McKenzie, Florin Pop, Stijn Servaes, Martin Leitgab, Mike Vaiana, Judd Rosenblatt, Michael S. A. Graziano, Diogo de Lucena
Abstract:
Self‑interpretation methods prompt language models to describe their own internal states, but remain unreliable due to hyperparameter sensitivity. We show that training lightweight adapters on interpretability artifacts, while keeping the LM entirely frozen, yields reliable self‑interpretation across tasks and model families. A scalar affine adapter with just d_\textmodel+1 parameters suffices: trained adapters generate sparse autoencoder feature labels that outperform the training labels themselves (71% vs 63% generation scoring at 70B scale), identify topics with 94% recall@1 versus 1% for untrained baselines, and decode bridge entities in multi‑hop reasoning that appear in neither prompt nor response, surfacing implicit reasoning without chain‑of‑thought. The learned bias vector alone accounts for 85% of improvement, and simpler adapters generalize better than more expressive alternatives. Controlling for model knowledge via prompted descriptions, we find self‑interpretation gains outpace capability gains from 7B to 72B parameters. Our results demonstrate that self‑interpretation improves with scale, without modifying the model being interpreted.
Authors:Mateo Juliani, Mingxuan Li, Elias Bareinboim
Abstract:
Reward shaping has been applied widely to accelerate Reinforcement Learning (RL) agents' training. However, a principled way of designing effective reward shaping functions, especially for complex continuous control problems, remains largely under‑explained. In this work, we propose to automatically learn a reward shaping function for continuous control problems from offline datasets, potentially contaminated by unobserved confounding variables. Specifically, our method builds upon the recently proposed causal Bellman equation to learn a tight upper bound on the optimal state values, which is then used as the potentials in the Potential‑Based Reward Shaping (PBRS) framework. Our proposed reward shaping algorithm is tested with Soft‑Actor‑Critic (SAC) on multiple commonly used continuous control benchmarks and exhibits strong performance guarantees under unobserved confounders. More broadly, our work marks a solid first step towards confounding robust continuous control from a causal perspective. Code for training our reward shaping functions can be found at https://github.com/mateojuliani/confounding_robust_cont_control.
Authors:Mayur Akewar, Sandeep Madireddy, Dongsheng Luo, Janki Bhimani
Abstract:
Solid State Drives (SSDs) are critical to datacenters, consumer platforms, and mission‑critical systems. Yet diagnosing their performance and reliability is difficult because data are fragmented and time‑disjoint, and existing methods demand large datasets and expert input while offering only limited insights. Degradation arises not only from shifting workloads and evolving architectures but also from environmental factors such as temperature, humidity, and vibration. We present KORAL, a knowledge driven reasoning framework that integrates Large Language Models (LLMs) with a structured Knowledge Graph (KG) to generate insights into SSD operations. Unlike traditional approaches that require extensive expert input and large datasets, KORAL generates a Data KG from fragmented telemetry and integrates a Literature KG that already organizes knowledge from literature, reports, and traces. This turns unstructured sources into a queryable graph and telemetry into structured knowledge, and both the Graphs guide the LLM to deliver evidence‑based, explainable analysis aligned with the domain vocabulary and constraints. Evaluation using real production traces shows that the KORAL delivers expert‑level diagnosis and recommendations, supported by grounded explanations that improve reasoning transparency, guide operator decisions, reduce manual effort, and provide actionable insights to improve service quality. To our knowledge, this is the first end‑to‑end system that combines LLMs and KGs for full‑spectrum SSD reasoning including Descriptive, Predictive, Prescriptive, and What‑if analysis. We release the generated SSD‑specific KG to advance reproducible research in knowledge‑based storage system analysis. GitHub Repository: https://github.com/Damrl‑lab/KORAL
Authors:Zhengbing He
Abstract:
Stop‑and‑go waves, as a major form of freeway traffic congestion, cause severe and long‑lasting adverse effects, including reduced traffic efficiency, increased driving risks, and higher vehicle emissions. Amongst the highway traffic management strategies, jam‑absorption driving (JAD), in which a dedicated vehicle performs "slow‑in" and "fast‑out" maneuvers before being captured by a stop‑and‑go wave, has been proposed as a potential method for preventing the propagation of such waves. However, most existing JAD strategies remain impractical mainly due to the lack of discussion regarding implementation vehicles and operational conditions. Inspired by real‑world observations of police‑car swerving behavior, this paper first introduces a Single‑Vehicle Two‑Detector Jam‑Absorption Driving (SVDD‑JAD) problem, and then proposes a practical JAD strategy that transforms such behavior into a maneuver capable of suppressing the propagation of an isolated stop‑and‑go wave. Five key parameters that significantly affect the proposed strategy, namely, JAD speed, inflow traffic speed, wave width, wave speed, and in‑wave speed, are identified and systematically analyzed. Using a SUMO‑based simulation as an illustrative example, we further demonstrate how these parameters can be measured in practice with two stationary roadside traffic detectors. The results show that the proposed JAD strategy successfully suppresses the propagation of a stop‑and‑go wave, without triggering a secondary wave. This paper is expected to take a significant step toward making JAD practical, advancing it from a theoretical concept to a feasible and implementable strategy. To promote reproducibility in the transportation domain, we have also open‑sourced all the code on our GitHub repository https://github.com/gotrafficgo.
Authors:Jiacheng Hou, Yining Sun, Ruochong Jin, Haochen Han, Fangming Liu, Wai Kin Victor Chan, Alex Jinpeng Wang
Abstract:
Recent advances in large image editing models have shifted the paradigm from text‑driven instructions to vision‑prompt editing, where user intent is inferred directly from visual inputs such as marks, arrows, and visual‑text prompts. While this paradigm greatly expands usability, it also introduces a critical and underexplored safety risk: the attack surface itself becomes visual. In this work, we propose Vision‑Centric Jailbreak Attack (VJA), the first visual‑to‑visual jailbreak attack that conveys malicious instructions purely through visual inputs. To systematically study this emerging threat, we introduce IESBench, a safety‑oriented benchmark for image editing models. Extensive experiments on IESBench demonstrate that VJA effectively compromises state‑of‑the‑art commercial models, achieving attack success rates of up to 80.9% on Nano Banana Pro and 70.1% on GPT‑Image‑1.5. To mitigate this vulnerability, we propose a training‑free defense based on introspective multimodal reasoning, which substantially improves the safety of poorly aligned models to a level comparable with commercial systems, without auxiliary guard models and with negligible computational overhead. Our findings expose new vulnerabilities, provide both a benchmark and practical defense to advance safe and trustworthy modern image editing systems. Warning: This paper contains offensive images created by large image editing models.
Authors:Tony Feng, Trieu H. Trinh, Garrett Bingham, Dawsen Hwang, Yuri Chervonyi, Junehyuk Jung, Joonkyung Lee, Carlo Pagano, Sang-hyun Kim, Federico Pasqualotto, Sergei Gukov, Jonathan N. Lee, Junsu Kim, Kaiying Hou, Golnaz Ghiasi, Yi Tay, YaGuang Li, Chenkai Kuang, Yuan Liu, Hanzhao Lin, Evan Zheran Liu, Nigamaa Nayakanti, Xiaomeng Yang, Heng-Tze Cheng, Demis Hassabis, Koray Kavukcuoglu, Quoc V. Le, Thang Luong
Abstract:
Recent advances in foundational models have yielded reasoning systems capable of achieving a gold‑medal standard at the International Mathematical Olympiad. The transition from competition‑level problem‑solving to professional research, however, requires navigating vast literature and constructing long‑horizon proofs. In this work, we introduce Aletheia, a math research agent that iteratively generates, verifies, and revises solutions end‑to‑end in natural language. Specifically, Aletheia is powered by an advanced version of Gemini Deep Think for challenging reasoning problems, a novel inference‑time scaling law that extends beyond Olympiad‑level problems, and intensive tool use to navigate the complexities of mathematical research. We demonstrate the capability of Aletheia from Olympiad problems to PhD‑level exercises and most notably, through several distinct milestones in AI‑assisted mathematics research: (a) a research paper (Feng26) generated by AI without any human intervention in calculating certain structure constants in arithmetic geometry called eigenweights; (b) a research paper (LeeSeo26) demonstrating human‑AI collaboration in proving bounds on systems of interacting particles called independent sets; and (c) an extensive semi‑autonomous evaluation (Feng et al., 2026a) of 700 open problems on Bloom's Erdos Conjectures database, including autonomous solutions to four open questions. In order to help the public better understand the developments pertaining to AI and mathematics, we suggest quantifying standard levels of autonomy and novelty of AI‑assisted results, as well as propose a novel concept of human‑AI interaction cards for transparency. We conclude with reflections on human‑AI collaboration in mathematics and share all prompts as well as model outputs at https://github.com/google‑deepmind/superhuman/tree/main/aletheia.
Authors:Kun Wang, Zherui Li, Zhenhong Zhou, Yitong Zhang, Yan Mi, Kun Yang, Yiming Zhang, Junhao Dong, Zhongxiang Sun, Qiankun Li, Yang Liu
Abstract:
Omni‑modal Large Language Models (OLLMs) greatly expand LLMs' multimodal capabilities but also introduce cross‑modal safety risks. However, a systematic understanding of vulnerabilities in omni‑modal interactions remains lacking. To bridge this gap, we establish a modality‑semantics decoupling principle and construct the AdvBench‑Omni dataset, which reveals a significant vulnerability in OLLMs. Mechanistic analysis uncovers a Mid‑layer Dissolution phenomenon driven by refusal vector magnitude shrinkage, alongside the existence of a modal‑invariant pure refusal direction. Inspired by these insights, we extract a golden refusal vector using Singular Value Decomposition and propose OmniSteer, which utilizes lightweight adapters to modulate intervention intensity adaptively. Extensive experiments show that our method not only increases the Refusal Success Rate against harmful inputs from 69.9% to 91.2%, but also effectively preserves the general capabilities across all modalities. Our code is available at: https://github.com/zhrli324/omni‑safety‑research.
Authors:Lepeng Zhao, Zhenhua Zou, Shuo Li, Zhuotao Liu
Abstract:
Mobile Graphical User Interface (GUI) agents have demonstrated strong capabilities in automating complex smartphone tasks by leveraging multimodal large language models (MLLMs) and system‑level control interfaces. However, this paradigm introduces significant privacy risks, as agents typically capture and process entire screen contents, thereby exposing sensitive personal data such as phone numbers, addresses, messages, and financial information. Existing defenses either reduce UI exposure, obfuscate only task‑irrelevant content, or rely on user authorization, but none can protect task‑critical sensitive information while preserving seamless agent usability. We propose an anonymization‑based privacy protection framework that enforces the principle of available‑but‑invisible access to sensitive data: sensitive information remains usable for task execution but is never directly visible to the cloud‑based agent. Our system detects sensitive UI content using a PII‑aware recognition model and replaces it with deterministic, type‑preserving placeholders (e.g., PHONE_NUMBER#a1b2c) that retain semantic categories while removing identifying details. A layered architecture comprising a PII Detector, UI Transformer, Secure Interaction Proxy, and Privacy Gatekeeper ensures consistent anonymization across user instructions, XML hierarchies, and screenshots, mediates all agent actions over anonymized interfaces, and supports narrowly scoped local computations when reasoning over raw values is necessary. Extensive experiments on the AndroidLab and PrivScreen benchmarks show that our framework substantially reduces privacy leakage across multiple models while incurring only modest utility degradation, achieving the best observed privacy‑utility trade‑off among existing methods. Code available at: https://github.com/one‑step‑beh1nd/gui_privacy_protection
Authors:Zhiyu Sun, Minrui Luo, Yu Wang, Zhili Chen, Tianxing He
Abstract:
Large language models (LLMs) are pretrained on corpora containing trillions of tokens and, therefore, inevitably memorize sensitive information. Locate‑then‑edit methods, as a mainstream paradigm of model editing, offer a promising solution by modifying model parameters without retraining. However, in this work, we reveal a critical vulnerability of this paradigm: the parameter updates inadvertently serve as a side channel, enabling attackers to recover the edited data. We propose a two‑stage reverse‑engineering attack named KSTER (KeySpaceReconsTruction‑then‑EntropyReduction) that leverages the low‑rank structure of these updates. First, we theoretically show that the row space of the update matrix encodes a ``fingerprint" of the edited subjects, enabling accurate subject recovery via spectral analysis. Second, we introduce an entropy‑based prompt recovery attack that reconstructs the semantic context of the edit. Extensive experiments on multiple LLMs demonstrate that our attacks can recover edited data with high success rates. Furthermore, we propose subspace camouflage, a defense strategy that obfuscates the update fingerprint with semantic decoys. This approach effectively mitigates reconstruction risks without compromising editing utility. Our code is available at https://github.com/reanatom/EditingAtk.git.
Authors:Yuxin Jiang, Yuchao Gu, Ivor W. Tsang, Mike Zheng Shou
Abstract:
Scaling action‑controllable world models is limited by the scarcity of action labels. While latent action learning promises to extract control interfaces from unlabeled video, learned latents often fail to transfer across contexts: they entangle scene‑specific cues and lack a shared coordinate system. This occurs because standard objectives operate only within each clip, providing no mechanism to align action semantics across contexts. Our key insight is that although actions are unobserved, their semantic effects are observable and can serve as a shared reference. We introduce SeqΔ‑REPA, a sequence‑level control‑effect alignment objective that anchors integrated latent action to temporal feature differences from a frozen, self‑supervised video encoder. Building on this, we present Olaf‑World, a pipeline that pretrains action‑conditioned video world models from large‑scale passive video. Extensive experiments demonstrate that our method learns a more structured latent action space, leading to stronger zero‑shot action transfer and more data‑efficient adaptation to new control interfaces than state‑of‑the‑art baselines.
Authors:Zhaoyang Wang, Canwen Xu, Boyi Liu, Yite Wang, Siwei Han, Zhewei Yao, Huaxiu Yao, Yuxiong He
Abstract:
Recent advances in large language model (LLM) have empowered autonomous agents to perform complex tasks that require multi‑turn interactions with tools and environments. However, scaling such agent training is limited by the lack of diverse and reliable environments. In this paper, we propose Agent World Model (AWM), a fully synthetic environment generation pipeline. Using this pipeline, we scale to 1,000 environments covering everyday scenarios, in which agents can interact with rich toolsets (35 tools per environment on average) and obtain high‑quality observations. Notably, these environments are code‑driven and backed by databases, providing more reliable and consistent state transitions than environments simulated by LLMs. Moreover, they enable more efficient agent interaction compared with collecting trajectories from realistic environments. To demonstrate the effectiveness of this resource, we perform large‑scale reinforcement learning for multi‑turn tool‑use agents. Thanks to the fully executable environments and accessible database states, we can also design reliable reward functions. Experiments on three benchmarks show that training exclusively in synthetic environments, rather than benchmark‑specific ones, yields strong out‑of‑distribution generalization. The code is available at https://github.com/Snowflake‑Labs/agent‑world‑model.
Authors:Xuehang Guo, Zhiyong Lu, Tom Hope, Qingyun Wang
Abstract:
In scientific research, analysis requires accurately interpreting complex multimodal knowledge, integrating evidence from different sources, and drawing inferences grounded in domain‑specific knowledge. However, current artificial intelligence (AI) systems struggle to consistently demonstrate such capabilities. The complexity and variability of scientific tables and figures, combined with heterogeneous structures and long‑context requirements, pose fundamental obstacles to scientific table \& figure analysis. To quantify these challenges, we introduce AnaBench, a large‑scale benchmark featuring 63,178 instances from nine scientific domains, systematically categorized along seven complexity dimensions. To tackle these challenges, we propose Anagent, a multi‑agent framework for enhanced scientific table \& figure analysis through four specialized agents: Planner decomposes tasks into actionable subtasks, Expert retrieves task‑specific information through targeted tool execution, Solver synthesizes information to generate coherent analysis, and Critic performs iterative refinement through five‑dimensional quality assessment. We further develop modular training strategies that leverage supervised finetuning and specialized reinforcement learning to optimize individual capabilities while maintaining effective collaboration. Comprehensive evaluation across 9 broad domains with 170 subdomains demonstrates that Anagent achieves substantial improvements, up to \uparrow 13.43% in training‑free settings and \uparrow 42.12% with finetuning, while revealing that task‑oriented reasoning and context‑aware problem‑solving are essential for high‑quality scientific table \& figure analysis. Our project page: https://xhguo7.github.io/Anagent/.
Authors:Tianyi Jiang, Arctanx An, Hengyi Feng, Naixin Zhai, Haodong Li, Xiaomin Yu, Jiahui Liu, Hanwen Du, Shuo Zhang, Zhi Yang, Jie Huang, Yuhua Li, Yongxin Ni, Huacan Wang, Ronghao Chen
Abstract:
Human problem‑solving is never the repetition of a single mindset, by which we mean a distinct mode of cognitive processing. When tackling a specific task, we do not rely on a single mindset; instead, we integrate multiple mindsets within the single solution process. However, existing LLM reasoning methods fall into a common trap: they apply the same fixed mindset across all steps, overlooking that different stages of solving the same problem require fundamentally different mindsets. This single‑minded assumption prevents models from reaching the next level of intelligence. To address this limitation, we propose Chain of Mindset (CoM), a training‑free agentic framework that enables step‑level adaptive mindset orchestration. CoM decomposes reasoning into four functionally heterogeneous mindsets: Spatial, Convergent, Divergent, and Algorithmic. A Meta‑Agent dynamically selects the optimal mindset based on the evolving reasoning state, while a bidirectional Context Gate filters cross‑module information flow to maintain effectiveness and efficiency. Experiments across six challenging benchmarks spanning mathematics, code generation, scientific QA, and spatial reasoning demonstrate that CoM achieves state‑of‑the‑art performance, outperforming the strongest baseline by 4.96% and 4.72% in overall accuracy on Qwen3‑VL‑32B‑Instruct and Gemini‑2.0‑Flash, while balancing reasoning efficiency. Our code is publicly available at \hrefhttps://github.com/QuantaAlpha/chain‑of‑mindsethttps://github.com/QuantaAlpha/chain‑of‑mindset.
Authors:Wenxuan Xie, Yujia Wang, Xin Tan, Chaochao Lu, Xia Hu, Xuhong Wang
Abstract:
The integration of extensive, dynamic knowledge into Large Language Models (LLMs) remains a significant challenge due to the inherent entanglement of factual data and reasoning patterns. Existing solutions, ranging from non‑parametric Retrieval‑Augmented Generation (RAG) to parametric knowledge editing, are often constrained in practice by finite context windows, retriever noise, or the risk of catastrophic forgetting. In this paper, we propose DRIFT, a novel dual‑model architecture designed to explicitly decouple knowledge extraction from the reasoning process. Unlike static prompt compression, DRIFT employs a lightweight knowledge model to dynamically compress document chunks into implicit fact tokens conditioned on the query. These dense representations are projected into the reasoning model's embedding space, replacing raw, redundant text while maintaining inference accuracy. Extensive experiments show that DRIFT significantly improves performance on long‑context tasks, outperforming strong baselines among comparably sized models. Our approach provides a scalable and efficient paradigm for extending the effective context window and reasoning capabilities of LLMs. Our code is available at https://github.com/Lancelot‑Xie/DRIFT.
Authors:Bharathkumar Hegde, Melanie Bouroche
Abstract:
Lane changing in dense traffic is a significant challenge for Connected and Autonomous Vehicles (CAVs). Existing lane change controllers primarily either ensure safety or collaboratively improve traffic efficiency, but do not consider these conflicting objectives together. To address this, we propose the Multi‑Agent Safety Shield (MASS), designed using Control Barrier Functions (CBFs) to enable safe and collaborative lane changes. The MASS enables collaboration by capturing multi‑agent interactions among CAVs through interaction topologies constructed as a graph using a simple algorithm. Further, a state‑of‑the‑art Multi‑Agent Reinforcement Learning (MARL) lane change controller is extended by integrating MASS to ensure safety and defining a customised reward function to prioritise efficiency improvements. As a result, we propose a lane change controller, known as MARL‑MASS, and evaluate it in a congested on‑ramp merging simulation. The results demonstrate that MASS enables collaborative lane changes with safety guarantees by strictly respecting the safety constraints. Moreover, the proposed custom reward function improves the stability of MARL policies trained with a safety shield. Overall, by encouraging the exploration of a collaborative lane change policy while respecting safety constraints, MARL‑MASS effectively balances the trade‑off between ensuring safety and improving traffic efficiency in congested traffic. The code for MARL‑MASS is available with an open‑source licence at https://github.com/hkbharath/MARL‑MASS
Authors:J Rosser, Robert Kirk, Edward Grefenstette, Jakob Foerster, Laura Ruis
Abstract:
Influence functions are commonly used to attribute model behavior to training documents. We explore the reverse: crafting training data that induces model behavior. Our framework, Infusion, uses scalable influence‑function approximations to compute small perturbations to training documents that induce targeted changes in model behavior through parameter shifts. We evaluate Infusion on data poisoning tasks across vision and language domains. On CIFAR‑10, we show that making subtle edits via Infusion to just 0.2% (100/45,000) of the training documents can be competitive with the baseline of inserting a small number of explicit behavior examples. We also find that Infusion transfers across architectures (ResNet \leftrightarrow CNN), suggesting a single poisoned corpus can affect multiple independently trained models. In preliminary language experiments, we characterize when our approach increases the probability of target behaviors and when it fails, finding it most effective at amplifying behaviors the model has already learned. Taken together, these results show that small, subtle edits to training data can systematically shape model behavior, underscoring the importance of training data interpretability for adversaries and defenders alike. We provide the code here: https://github.com/jrosseruk/infusion.
Authors:William Lugoloobi, Thomas Foster, William Bankes, Chris Russell
Abstract:
Running LLMs with extended reasoning on every problem is expensive, but determining which inputs actually require additional compute remains challenging. We investigate whether their own likelihood of success is recoverable from their internal representations before generation, and if this signal can guide more efficient inference. We train linear probes on pre‑generation activations to predict policy‑specific success on math and coding tasks, substantially outperforming surface features such as question length and TF‑IDF. Using E2H‑AMC, which provides both human and model performance on identical problems, we show that models encode a model‑specific notion of difficulty that is distinct from human difficulty, and that this distinction increases with extended reasoning. Leveraging these probes, we demonstrate that routing queries across a pool of models can exceed the best‑performing model whilst reducing inference cost by up to 70% on MATH, showing that internal representations enable practical efficiency gains even when they diverge from human intuitions about difficulty. Our code is available at: https://github.com/KabakaWilliam/llms_know_difficulty
Authors:Yuhao Zheng, Li'an Zhong, Yi Wang, Rui Dai, Kaikui Liu, Xiangxiang Chu, Linyuan Lv, Philip Torr, Kevin Qinghong Lin
Abstract:
Autonomous GUI agents interact with environments by perceiving interfaces and executing actions. As a virtual sandbox, the GUI World model empowers agents with human‑like foresight by enabling action‑conditioned prediction. However, existing text‑ and pixel‑based approaches struggle to simultaneously achieve high visual fidelity and fine‑grained structural controllability. To this end, we propose Code2World, a vision‑language coder that simulates the next visual state via renderable code generation. Specifically, to address the data scarcity problem, we construct AndroidCode by translating GUI trajectories into high‑fidelity HTML and refining synthesized code through a visual‑feedback revision mechanism, yielding a corpus of over 80K high‑quality screen‑action pairs. To adapt existing VLMs into code prediction, we first perform SFT as a cold start for format layout following, then further apply Render‑Aware Reinforcement Learning which uses rendered outcome as the reward signal by enforcing visual semantic fidelity and action consistency. Extensive experiments demonstrate that Code2World‑8B achieves the top‑performing next UI prediction, rivaling the competitive GPT‑5 and Gemini‑3‑Pro‑Image. Notably, Code2World significantly enhances downstream navigation success rates in a flexible manner, boosting Gemini‑2.5‑Flash by +9.5% on AndroidWorld navigation. The code is available at https://github.com/AMAP‑ML/Code2World.
Authors:Kun Chen, Peng Shi, Fanfan Liu, Haibo Qiu, Zhixiong Zeng, Siqi Yang, Wenji Mao
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a critical method for enhancing the reasoning capabilities of Large Language Models (LLMs). However, continuous training often leads to policy entropy collapse, characterized by a rapid decay in entropy that results in premature overconfidence, reduced output diversity, and vanishing gradient norms that inhibit learning. Gradient‑Preserving Clipping is a primary factor influencing these dynamics, but existing mitigation strategies are largely static and lack a framework connecting clipping mechanisms to precise entropy control. This paper proposes reshaping entropy control in RL from the perspective of Gradient‑Preserving Clipping. We first theoretically and empirically verify the contributions of specific importance sampling ratio regions to entropy growth and reduction. Leveraging these findings, we introduce a novel regulation mechanism using dynamic clipping threshold to precisely manage entropy. Furthermore, we design and evaluate dynamic entropy control strategies, including increase‑then‑decrease, decrease‑increase‑decrease, and oscillatory decay. Experimental results demonstrate that these strategies effectively mitigate entropy collapse, and achieve superior performance across multiple benchmarks.
Authors:Davide Gallon, Philippe von Wurstemberger, Patrick Cheridito, Arnulf Jentzen
Abstract:
We propose a methodology that combines generative latent diffusion models with physics‑informed machine learning to generate solutions of parametric partial differential equations (PDEs) conditioned on partial observations, which includes, in particular, forward and inverse PDE problems. We learn the joint distribution of PDE parameters and solutions via a diffusion process in a latent space of scaled spectral representations, where Gaussian noise corresponds to functions with controlled regularity. This spectral formulation enables significant dimensionality reduction compared to grid‑based diffusion models and ensures that the induced process in function space remains within a class of functions for which the PDE operators are well defined. Building on diffusion posterior sampling, we enforce physics‑informed constraints and measurement conditions during inference, applying Adam‑based updates at each diffusion step. We evaluate the proposed approach on Poisson, Helmholtz, and incompressible Navier‑‑Stokes equations, demonstrating improved accuracy and computational efficiency compared with existing diffusion‑based PDE solvers, which are state of the art for sparse observations. Code is available at https://github.com/deeplearningmethods/PISD.
Authors:Sieun Hyeon, Jusang Oh, Sunghwan Steve Cho, Jaeyoung Do
Abstract:
Recent advances in Large Language Models (LLMs) have significantly improved table understanding tasks such as Table Question Answering (TableQA), yet challenges remain in ensuring reliability, scalability, and efficiency, especially in resource‑constrained or privacy‑sensitive environments. In this paper, we introduce MATA, a multi‑agent TableQA framework that leverages multiple complementary reasoning paths and a set of tools built with small language models. MATA generates candidate answers through diverse reasoning styles for a given table and question, then refines or selects the optimal answer with the help of these tools. Furthermore, it incorporates an algorithm designed to minimize expensive LLM agent calls, enhancing overall efficiency. MATA maintains strong performance with small, open‑source models and adapts easily across various LLM types. Extensive experiments on two benchmarks of varying difficulty with ten different LLMs demonstrate that MATA achieves state‑of‑the‑art accuracy and highly efficient reasoning while avoiding excessive LLM inference. Our results highlight that careful orchestration of multiple reasoning pathways yields scalable and reliable TableQA. The code is available at https://github.com/AIDAS‑Lab/MATA.
Authors:Narges Baba Ahmadi, Jan Strich, Martin Semmann, Chris Biemann
Abstract:
Large language models (LLMs) are increasingly used to access legal information. Yet, their deployment in multilingual legal settings is constrained by unreliable retrieval and the lack of domain‑adapted, open‑embedding models. In particular, existing multilingual legal corpora are not designed for semantic retrieval, and PDF‑based legislative sources introduce substantial noise due to imperfect text extraction. To address these challenges, we introduce LEMUR, a large‑scale multilingual corpus of EU environmental legislation constructed from 24,953 official EUR‑Lex PDF documents covering 25 languages. We quantify the fidelity of PDF‑to‑text conversion by measuring lexical consistency against authoritative HTML versions using the Lexical Content Score (LCS). Building on LEMUR, we fine‑tune three state‑of‑the‑art multilingual embedding models using contrastive objectives in both monolingual and bilingual settings, reflecting realistic legal‑retrieval scenarios. Experiments across low‑ and high‑resource languages demonstrate that legal‑domain fine‑tuning consistently improves Top‑k retrieval accuracy relative to strong baselines, with particularly pronounced gains for low‑resource languages. Cross‑lingual evaluations show that these improvements transfer to unseen languages, indicating that fine‑tuning primarily enhances language‑independent, content‑level legal representations rather than language‑specific cues. We publish code\footnote\hrefhttps://github.com/nargesbh/eur_lexGitHub Repository and data\footnote\hrefhttps://huggingface.co/datasets/G4KMU/LEMURHugging Face Dataset.
Authors:Klejda Alushi, Jan Strich, Chris Biemann, Martin Semmann
Abstract:
Conversational question answering increasingly relies on retrieval‑augmented generation (RAG) to ground large language models (LLMs) in external knowledge. Yet, most existing studies evaluate RAG methods in isolation and primarily focus on single‑turn settings. This paper addresses the lack of a systematic comparison of RAG methods for multi‑turn conversational QA, where dialogue history, coreference, and shifting user intent substantially complicate retrieval. We present a comprehensive empirical study of vanilla and advanced RAG methods across eight diverse conversational QA datasets spanning multiple domains. Using a unified experimental setup, we evaluate retrieval quality and answer generation using generator and retrieval metrics, and analyze how performance evolves across conversation turns. Our results show that robust yet straightforward methods, such as reranking, hybrid BM25, and HyDE, consistently outperform vanilla RAG. In contrast, several advanced techniques fail to yield gains and can even degrade performance below the No‑RAG baseline. We further demonstrate that dataset characteristics and dialogue length strongly influence retrieval effectiveness, explaining why no single RAG strategy dominates across settings. Overall, our findings indicate that effective conversational RAG depends less on method complexity than on alignment between the retrieval strategy and the dataset structure. We publish the code used.\footnote\hrefhttps://github.com/Klejda‑A/exp‑rag.gitGitHub Repository
Authors:James Burgess, Rameen Abdal, Dan Stoddart, Sergey Tulyakov, Serena Yeung-Levy, Kuan-Chieh Jackson Wang
Abstract:
Modern image generators produce strikingly realistic images, where only artifacts like distorted hands or warped objects reveal their synthetic origin. Detecting these artifacts is essential: without detection, we cannot benchmark generators or train reward models to improve them. Current detectors fine‑tune VLMs on tens of thousands of labeled images, but this is expensive to repeat whenever generators evolve or new artifact types emerge. We show that pretrained VLMs already encode the knowledge needed to detect artifacts ‑ with the right scaffolding, this capability can be unlocked using only a few hundred labeled examples per artifact category. Our system, ArtifactLens, achieves state‑of‑the‑art on five human artifact benchmarks (the first evaluation across multiple datasets) while requiring orders of magnitude less labeled data. The scaffolding consists of a multi‑component architecture with in‑context learning and text instruction optimization, with novel improvements to each. Our methods generalize to other artifact types ‑ object morphology, animal anatomy, and entity interactions ‑ and to the distinct task of AIGC detection.
Authors:Haoyu Zhao, Ziran Yang, Jiawei Li, Deyuan He, Zenan Li, Chi Jin, Venugopal V. Veeravalli, Aarti Gupta, Sanjeev Arora
Abstract:
Vericoding refers to the generation of formally verified code from rigorous specifications. Recent AI models show promise in vericoding, but a unified methodology for cross‑paradigm evaluation is lacking. Existing benchmarks test only individual languages/tools (e.g., Dafny, Verus, and Lean) and each covers very different tasks, so the performance numbers are not directly comparable. We address this gap with AlgoVeri, a benchmark that evaluates vericoding of 77 classical algorithms in Dafny, Verus, and Lean. By enforcing identical functional contracts, AlgoVeri reveals critical capability gaps in verification systems. While frontier models achieve tractable success in Dafny (40.3% for Gemini‑3 Flash), where high‑level abstractions and SMT automation simplify the workflow, performance collapses under the systems‑level memory constraints of Verus (24.7%) and the explicit proof construction required by Lean (7.8%). Beyond aggregate metrics, we uncover a sharp divergence in test‑time compute dynamics: Gemini‑3 effectively utilizes iterative repair to boost performance (e.g., tripling pass rates in Dafny), whereas GPT‑OSS saturates early. Finally, our error analysis shows that language design affects the refinement trajectory: while Dafny allows models to focus on logical correctness, Verus and Lean trap models in persistent syntactic and semantic barriers. All data and evaluation code can be found at https://github.com/haoyuzhao123/algoveri.
Authors:Takumi Ohashi, Hitoshi Iyatomi
Abstract:
Large language models (LLMs) are increasingly deployed in multicultural settings; however, systematic evaluation of cultural specificity at the sentence level remains underexplored. We propose the Conceptual Cultural Index (CCI), which estimates cultural specificity at the sentence level. CCI is defined as the difference between the generality estimate within the target culture and the average generality estimate across other cultures. This formulation enables users to operationally control the scope of culture via comparison settings and provides interpretability, since the score derives from the underlying generality estimates. We validate CCI on 400 sentences (200 culture‑specific and 200 general), and the resulting score distribution exhibits the anticipated pattern: higher for culture‑specific sentences and lower for general ones. For binary separability, CCI outperforms direct LLM scoring, yielding more than a 10‑point improvement in AUC for models specialized to the target culture. Our code is available at https://github.com/IyatomiLab/CCI .
Authors:Mingfeng Yuan, Hao Zhang, Mahan Mohammadi, Runhao Li, Jinjun Shan, Steven L. Waslander
Abstract:
Mobile robots are often deployed over long durations in diverse open, dynamic scenes, including indoor setting such as warehouses and manufacturing facilities, and outdoor settings such as agricultural and roadway operations. A core challenge is to build a scalable long‑horizon memory that supports an agentic workflow for planning, retrieval, and reasoning over open‑ended instructions at variable granularity, while producing precise, actionable answers for navigation. We present STaR, an agentic reasoning framework that (i) constructs a task‑agnostic, multimodal long‑term memory that generalizes to unseen queries while preserving fine‑grained environmental semantics (object attributes, spatial relations, and dynamic events), and (ii) introduces a Scalable Task Conditioned Retrieval algorithm based on the Information Bottleneck principle to extract from long‑term memory a compact, non‑redundant, information‑rich set of candidate memories for contextual reasoning. We evaluate STaR on NaVQA (mixed indoor/outdoor campus scenes) and WH‑VQA, a customized warehouse benchmark with many visually similar objects built with Isaac Sim, emphasizing contextual reasoning. Across the two datasets, STaR consistently outperforms strong baselines, achieving higher success rates and markedly lower spatial error. We further deploy STaR on a real Husky wheeled robot in both indoor and outdoor environments, demonstrating robust long horizon reasoning, scalability, and practical utility. Project Website: https://trailab.github.io/STaR‑website/
Authors:Veuns-Team, :, Changlong Gao, Zhangxuan Gu, Yulin Liu, Xinyu Qiu, Shuheng Shen, Yue Wen, Tianyu Xia, Zhenyu Xu, Zhengwen Zeng, Beitong Zhou, Xingran Zhou, Weizhi Chen, Sunhao Dai, Jingya Dou, Yichen Gong, Yuan Guo, Zhenlin Guo, Feng Li, Qian Li, Jinzhen Lin, Yuqi Zhou, Linchao Zhu, Liang Chen, Zhenyu Guo, Changhua Meng, Weiqiang Wang
Abstract:
GUI agents have emerged as a powerful paradigm for automating interactions in digital environments, yet achieving both broad generality and consistently strong task performance remains challenging.In this report, we present UI‑Venus‑1.5, a unified, end‑to‑end GUI Agent designed for robust real‑world applications.The proposed model family comprises two dense variants (2B and 8B) and one mixture‑of‑experts variant (30B‑A3B) to meet various downstream application scenarios.Compared to our previous version, UI‑Venus‑1.5 introduces three key technical advances: (1) a comprehensive Mid‑Training stage leveraging 10 billion tokens across 30+ datasets to establish foundational GUI semantics; (2) Online Reinforcement Learning with full‑trajectory rollouts, aligning training objectives with long‑horizon, dynamic navigation in large‑scale environments; and (3) a single unified GUI Agent constructed via Model Merging, which synthesizes domain‑specific models (grounding, web, and mobile) into one cohesive checkpoint. Extensive evaluations demonstrate that UI‑Venus‑1.5 establishes new state‑of‑the‑art performance on benchmarks such as ScreenSpot‑Pro (69.6%), VenusBench‑GD (75.0%), and AndroidWorld (77.6%), significantly outperforming previous strong baselines. In addition, UI‑Venus‑1.5 demonstrates robust navigation capabilities across a variety of Chinese mobile apps, effectively executing user instructions in real‑world scenarios. Code: https://github.com/inclusionAI/UI‑Venus; Model: https://huggingface.co/collections/inclusionAI/ui‑venus
Authors:Avaljot Singh, Dushyant Bharadwaj, Stefanos Baziotis, Kaushik Varadharajan, Charith Mendis
Abstract:
Optimizing Pandas programs is a challenging problem. Existing systems and compiler‑based approaches offer reliability but are either heavyweight or support only a limited set of optimizations. Conversely, using LLMs in a per‑program optimization methodology can synthesize nontrivial optimizations, but is unreliable, expensive, and offers a low yield. In this work, we introduce a hybrid approach that works in a 3‑stage manner that decouples discovery from deployment and connects them via a novel bridge. First, it discovers per‑program optimizations (discovery). Second, they are converted into generalised rewrite rules (bridge). Finally, these rules are incorporated into a compiler that can automatically apply them wherever applicable, eliminating repeated reliance on LLMs (deployment). We demonstrate that RuleFlow is the new state‑of‑the‑art (SOTA) Pandas optimization framework on PandasBench, a challenging Pandas benchmark consisting of Python notebooks. Across these notebooks, we achieve a speedup of up to 4.3x over Dias, the previous compiler‑based SOTA, and 1914.9x over Modin, the previous systems‑based SOTA. Our code is available at https://github.com/ADAPT‑uiuc/RuleFlow.
Authors:Jiahao Qin
Abstract:
High‑speed optical‑resolution photoacoustic microscopy (OR‑PAM) with bidirectional scanning enables rapid functional brain imaging but introduces severe spatiotemporal misalignment from coupled scan‑direction‑dependent domain shift and geometric distortion. Conventional registration methods rely on brightness constancy, an assumption violated under bidirectional scanning, leading to unreliable alignment. A unified scene‑appearance separation framework is proposed to jointly address domain shift and spatial misalignment. The proposed architecture separates domain‑invariant scene content from domain‑specific appearance characteristics, enabling cross‑domain reconstruction with geometric preservation. A scene consistency loss promotes geometric correspondence in the latent space, linking domain shift correction with spatial registration within a single framework. For in vivo mouse brain vasculature imaging, the proposed method achieves normalized cross‑correlation (NCC) of 0.961 and structural similarity index (SSIM) of 0.894, substantially outperforming conventional methods. Ablation studies demonstrate that domain alignment loss is critical, with its removal causing 82% NCC reduction (0.961 to 0.175), while scene consistency and cycle consistency losses provide complementary regularization for optimal performance. The method achieves 11.2 ms inference time per frame (86 fps), substantially exceeding typical OR‑PAM acquisition rates and enabling real‑time processing. These results suggest that the proposed framework enables robust high‑speed bidirectional OR‑PAM for reliable quantitative and longitudinal functional imaging. The code will be publicly available at https://github.com/D‑ST‑Sword/SAS‑Net
Authors:Georgios Ioannides, Adrian Kieback, Judah Goldfeder, Linsey Pang, Aman Chadha, Aaron Elkins, Yann LeCun, Ravid Shwartz-Ziv
Abstract:
Joint Embedding Predictive Architectures (JEPA) offer a promising approach to self‑supervised speech representation learning, but suffer from representation collapse without explicit grounding. We propose GMM‑Anchored JEPA, which fits a Gaussian Mixture Model once on log‑mel spectrograms and uses its frozen soft posteriors as auxiliary targets throughout training. A decaying supervision schedule allows GMM regularization to dominate early training before gradually yielding to the JEPA objective. Unlike HuBERT and WavLM, which require iterative re‑clustering, our approach clusters input features once with soft rather than hard assignments. On ~50k hours of speech, GMM anchoring improves ASR (28.68% vs. 33.22% WER), emotion recognition (67.76% vs. 65.46%), and slot filling (64.7% vs. 59.1% F1) compared to a WavLM‑style baseline with matched compute. Cluster analysis shows GMM‑anchored representations achieve up to 98% entropy compared to 31% for WavLM‑style, indicating substantially more uniform cluster utilization. Code is made available at https://github.com/gioannides/clustering‑anchored‑jepa.
Authors:Haodong Li, Jingwei Wu, Quan Sun, Guopeng Li, Juanxi Tian, Huanyu Zhang, Yanlin Lai, Ruichuan An, Hongbo Peng, Yuhong Dai, Chenxi Li, Chunmei Qing, Jia Wang, Ziyang Meng, Zheng Ge, Xiangyu Zhang, Daxin Jiang
Abstract:
Recent advancements in image generation models have enabled the prediction of future Graphical User Interface (GUI) states based on user instructions. However, existing benchmarks primarily focus on general domain visual fidelity, leaving the evaluation of state transitions and temporal coherence in GUI‑specific contexts underexplored. To address this gap, we introduce GEBench, a comprehensive benchmark for evaluating dynamic interaction and temporal coherence in GUI generation. GEBench comprises 700 carefully curated samples spanning five task categories, covering both single‑step interactions and multi‑step trajectories across real‑world and fictional scenarios, as well as grounding point localization. To support systematic evaluation, we propose GE‑Score, a novel five‑dimensional metric that assesses Goal Achievement, Interaction Logic, Content Consistency, UI Plausibility, and Visual Quality. Extensive evaluations on current models indicate that while they perform well on single‑step transitions, they struggle significantly with maintaining temporal coherence and spatial grounding over longer interaction sequences. Our findings identify icon interpretation, text rendering, and localization precision as critical bottlenecks. This work provides a foundation for systematic assessment and suggests promising directions for future research toward building high‑fidelity generative GUI environments. The code is available at: https://github.com/stepfun‑ai/GEBench.
Authors:Ruijie Zhu, Jiahao Lu, Wenbo Hu, Xiaoguang Han, Jianfei Cai, Ying Shan, Chuanxia Zheng
Abstract:
We present MotionCrafter, a framework that leverages video generators to jointly reconstruct 4D geometry and estimate dense motion from a monocular video. The key idea is a joint representation of dense 3D point maps and 3D scene flows in a shared coordinate system, together with a 4D VAE tailored to learn this representation effectively. Unlike prior work that strictly aligns 3D values and latents with RGB VAE latents‑despite their fundamentally different distributions‑we show that such alignment is unnecessary and can hurt performance. Instead, we propose a new data normalization and VAE training strategy that better transfers diffusion priors and greatly improves reconstruction quality. Extensive experiments on multiple datasets show that MotionCrafter achieves state‑of‑the‑art performance in both geometry reconstruction and dense scene flow estimation, delivering 38.64% and 25.0% improvements in geometry and motion reconstruction, respectively, all without any post‑optimization. Project page: https://ruijiezhu94.github.io/MotionCrafter_Page
Authors:Suraj Ranganath, Atharv Ramesh
Abstract:
AI‑text detectors face a critical robustness challenge: adversarial paraphrasing attacks that preserve semantics while evading detection. We introduce StealthRL, a reinforcement learning framework that stress‑tests detector robustness under realistic adversarial conditions. StealthRL trains a paraphrase policy against a multi‑detector ensemble using Group Relative Policy Optimization (GRPO) with LoRA adapters on Qwen3‑4B, optimizing a composite reward that balances detector evasion with semantic preservation. We evaluate six attack settings (M0‑M5) on the full filtered MAGE test pool (15,310 human / 14,656 AI) against four detectors: RoBERTa, Fast‑DetectGPT, Binoculars, and MAGE. StealthRL achieves near‑zero detection on three of the four detectors and a 0.024 mean TPR@1%FPR, reducing mean AUROC from 0.79 to 0.43 and attaining a 97.6% attack success rate. Critically, attacks transfer to two held‑out detectors not seen during training, revealing shared architectural vulnerabilities rather than detector‑specific brittleness. We additionally conduct LLM‑based quality evaluation via Likert scoring on 500 matched samples per method, analyze detector score distributions to explain why evasion succeeds, and provide per‑detector AUROC with bootstrap confidence intervals. Our results expose significant robustness gaps in current AI‑text detection and establish StealthRL as a principled adversarial evaluation protocol. Code and evaluation pipeline are publicly available at https://github.com/suraj‑ranganath/StealthRL.
Authors:Paul Saegert, Ullrich Köthe
Abstract:
Symbolic regression (SR) aims to discover interpretable analytical expressions that accurately describe observed data. Amortized SR promises to be much more efficient than the predominant genetic programming SR methods, but currently struggles to scale to realistic scientific complexity. We find that a key obstacle is the lack of a fast reduction of equivalent expressions to a concise normalized form. Amortized SR has addressed this by general‑purpose Computer Algebra Systems (CAS) like SymPy, but the high computational cost severely limits training and inference speed. We propose SimpliPy, a rule‑based simplification engine achieving a 100‑fold speed‑up over SymPy at comparable quality. This enables substantial improvements in amortized SR, including scalability to much larger training sets, more efficient use of the per‑expression token budget, and systematic training set decontamination with respect to equivalent test expressions. We demonstrate these advantages in our Flash‑ANSR framework, which achieves much better accuracy than amortized baselines (NeSymReS, E2E) on the FastSRB benchmark. Moreover, it performs on par with state‑of‑the‑art direct optimization (PySR) while recovering more concise instead of more complex expressions with increasing inference budget.
Authors:Chenghui Zou, Ning Wang, Tiesunlong Shen, Luwei Xiao, Chuan Ma, Xiangpeng Li, Rui Mao, Erik Cambria
Abstract:
Large language models (LLMs) have been widely applied to emotional support conversation (ESC). However, complex multi‑turn support remains challenging.This is because existing alignment schemes rely on sparse outcome‑level signals, thus offering limited supervision for intermediate strategy decisions. To fill this gap, this paper proposes affective flow language model for emotional support conversation (AFlow), a framework that introduces fine‑grained supervision on dialogue prefixes by modeling a continuous affective flow along multi‑turn trajectories. AFlow can estimate intermediate utility over searched trajectories and learn preference‑consistent strategy transitions. To improve strategy coherence and empathetic response quality, a subpath‑level flow‑balance objective is presented to propagate preference signals to intermediate states. Experiment results show consistent and significant improvements over competitive baselines in diverse emotional contexts. Remarkably, AFlow with a compact open‑source backbone outperforms proprietary LMMs such as GPT‑4o and Claude‑3.5 on major ESC metrics. Our code is available at https://github.com/chz2025/AffectiveFlow.
Authors:Zirui Li, Xuefeng Bai, Kehai Chen, Yizhi Li, Jian Yang, Chenghua Lin, Min Zhang
Abstract:
Latent or continuous chain‑of‑thought methods replace explicit textual rationales with a number of internal latent steps, but these intermediate computations are difficult to evaluate beyond correlation‑based probes. In this paper, we view latent chain‑of‑thought as a manipulable causal process in representation space by modeling latent steps as variables in a structural causal model (SCM) and analyzing their effects through step‑wise \mathrmdo‑interventions. We study two representative paradigms (i.e., Coconut and CODI) on both mathematical and general reasoning tasks to investigate three key questions: (1) which steps are causally necessary for correctness and when answers become decidable early; (2) how does influence propagate across steps, and how does this structure compare to explicit CoT; and (3) do intermediate trajectories retain competing answer modes, and how does output‑level commitment differ from representational commitment across steps. We find that latent‑step budgets behave less like homogeneous extra depth and more like staged functionality with non‑local routing, and we identify a persistent gap between early output bias and late representational commitment. These results motivate mode‑conditional and stability‑aware analyses ‑‑ and corresponding training/decoding objectives ‑‑ as more reliable tools for interpreting and improving latent reasoning systems. Code is available at https://github.com/J1mL1/causal‑latent‑cot.
Authors:Shih-Fang Chen, Jun-Cheng Chen, I-Hong Jhuo, Yen-Yu Lin
Abstract:
Human perception for effective object tracking in 2D video streams arises from the implicit use of prior 3D knowledge and semantic reasoning. In contrast, most generic object tracking (GOT) methods primarily rely on 2D features of the target and its surroundings, while neglecting 3D geometric cues, making them susceptible to partial occlusion, distractors, and variations in geometry and appearance. To address this limitation, we introduce GOT‑Edit, an online cross‑modality model editing approach that integrates geometry‑aware cues into a generic object tracker from a 2D video stream. Our approach leverages features from a pre‑trained Visual Geometry Grounded Transformer to infer geometric cues from only a few 2D images. To address the challenge of seamlessly combining geometry and semantics, GOT‑Edit performs online model editing. By leveraging null‑space constraints during model updates, it incorporates geometric information while preserving semantic discrimination, yielding consistently better performance across diverse scenarios. Extensive experiments on multiple GOT benchmarks demonstrate that GOT‑Edit achieves superior robustness and accuracy, particularly under occlusion and clutter, establishing a new paradigm for combining 2D semantics with 3D geometric reasoning for generic object tracking. The project page is available at https://chenshihfang.github.io/GOT‑EDIT.
Authors:Yutao Zhu, Xingshuo Zhang, Maosen Zhang, Jiajie Jin, Liancheng Zhang, Xiaoshuai Song, Kangzhi Zhao, Wencong Zeng, Ruiming Tang, Han Li, Ji-Rong Wen, Zhicheng Dou
Abstract:
The advancement of large language models (LLMs) has significantly accelerated the development of search agents capable of autonomously gathering information through multi‑turn web interactions. Various benchmarks have been proposed to evaluate such agents. However, existing benchmarks often construct queries backward from answers, producing unnatural tasks misaligned with real‑world needs. Moreover, these benchmarks tend to focus on either locating specific information or aggregating information from multiple sources, while relying on static answer sets prone to data contamination. To bridge these gaps, we introduce GISA, a benchmark for General Information‑Seeking Assistants comprising 373 human‑crafted queries that reflect authentic information‑seeking scenarios. GISA features four structured answer formats (item, set, list, and table), enabling deterministic evaluation. It integrates both deep reasoning and broad information aggregation within unified tasks, and includes a live subset with periodically updated answers to resist memorization. Notably, GISA provides complete human search trajectories for every query, offering gold‑standard references for process‑level supervision and imitation learning. Experiments on mainstream LLMs and commercial search products reveal that even the best‑performing model achieves only 19.30% exact match score, with performance notably degrading on tasks requiring complex planning and comprehensive information gathering. These findings highlight substantial room for future improvement.
Authors:Shaoang Zhang, Yazhe Niu
Abstract:
Tensor is the most basic and essential data structure of nowadays artificial intelligence (AI) system. The natural properties of Tensor, especially the memory‑continuity and slice‑independence, make it feasible for training system to leverage parallel computing unit like GPU to process data simultaneously in batch, spatial or temporal dimensions. However, if we look beyond perception tasks, the data in a complicated cognitive AI system usually has hierarchical structures (i.e. nested data) with various modalities. They are inconvenient and inefficient to program directly with conventional Tensor with fixed shape. To address this issue, we summarize two main computational patterns of nested data, and then propose a general nested data container: TreeTensor. Through various constraints and magic utilities of TreeTensor, one can apply arbitrary functions and operations to nested data with almost zero cost, including some famous machine learning libraries, such as Scikit‑Learn, Numpy and PyTorch. Our approach utilizes a constrained tree‑structure perspective to systematically model data relationships, and it can also easily be combined with other methods to extend more usages, such as asynchronous execution and variable‑length data computation. Detailed examples and benchmarks show TreeTensor not only provides powerful usability in various problems, especially one of the most complicated AI systems at present: AlphaStar for StarCraftII, but also exhibits excellent runtime efficiency without any overhead. Our project is available at https://github.com/opendilab/DI‑treetensor.
Authors:Yuhang Wang, Feiming Xu, Zheng Lin, Guangyu He, Yuzhe Huang, Haichang Gao, Zhenxing Niu, Shiguo Lian, Zhaoxiang Liu
Abstract:
Although large language model (LLM)‑based agents, exemplified by OpenClaw, are increasingly evolving from task‑oriented systems into personalized AI assistants for solving complex real‑world tasks, their practical deployment also introduces severe security risks. However, existing agent security research and evaluation frameworks primarily focus on synthetic or task‑centric settings, and thus fail to accurately capture the attack surface and risk propagation mechanisms of personalized agents in real‑world deployments. To address this gap, we propose Personalized Agent Security Bench (PASB), an end‑to‑end security evaluation framework tailored for real‑world personalized agents. Building upon existing agent attack paradigms, PASB incorporates personalized usage scenarios, realistic toolchains, and long‑horizon interactions, enabling black‑box, end‑to‑end security evaluation on real systems. Using OpenClaw as a representative case study, we systematically evaluate its security across multiple personalized scenarios, tool capabilities, and attack types. Our results indicate that OpenClaw exhibits critical vulnerabilities at different execution stages, including user prompt processing, tool usage, and memory retrieval, highlighting substantial security risks in personalized agent deployments. The code for the proposed PASB framework is available at https://github.com/AstorYH/PASB.
Authors:Zhang Jiasheng, Li Zhangpin, Wang Mingzhe, Shao Jie, Cui Jiangtao, Li Hui
Abstract:
Temporal knowledge graphs (TKGs) structurally preserve evolving human knowledge. Recent research has focused on designing models to learn the evolutionary nature of TKGs to predict future facts, achieving impressive results. For instance, Hits@10 scores over 0.9 on YAGO dataset. However, we find that existing benchmarks inadvertently introduce a shortcut. Near state‑of‑the‑art performance can be simply achieved by counting co‑occurrences, without using any temporal information. In this work, we examine the root cause of this issue, identifying inherent biases in current datasets and over simplified form of evaluation task that can be exploited by these biases. Through this analysis, we further uncover additional limitations of existing benchmarks, including unreasonable formatting of time‑interval knowledge, ignorance of learning knowledge obsolescence, and insufficient information for precise evolution understanding, all of which can amplify the shortcut and hinder a fair assessment. Therefore, we introduce the TKG evolution benchmark. It includes four bias‑corrected datasets and two novel tasks closely aligned with the evolution process, promoting a more accurate understanding of the challenges in TKG evolution modeling. Benchmark is available at: https://github.com/zjs123/TKG‑Benchmark.
Authors:Jinwoo Kim, Sékou-Oumar Kaba, Jiyun Park, Seunghoon Hong, Siamak Ravanbakhsh
Abstract:
We study the problem of transformation inversion on general Lie groups: a datum is transformed by an unknown group element, and the goal is to recover an inverse transformation that maps it back to the original data distribution. Such unknown transformations arise widely in machine learning and scientific modeling, where they can significantly distort observations. We take a probabilistic view and model the posterior over transformations as a Boltzmann distribution defined by an energy function on data space. To sample from this posterior, we introduce a diffusion process on Lie groups that keeps all updates on‑manifold and only requires computations in the associated Lie algebra. Our method, Transformation‑Inverting Energy Diffusion (TIED), relies on a new trivialized target‑score identity that enables efficient score‑based sampling of the transformation posterior. As a key application, we focus on test‑time equivariance, where the objective is to improve the robustness of pretrained neural networks to input transformations. Experiments on image homographies and PDE symmetries demonstrate that TIED can restore transformed inputs to the training distribution at test time, showing improved performance over strong canonicalization and sampling baselines. Code is available at https://github.com/jw9730/tied.
Authors:Jaylen Jones, Zhehao Zhang, Yuting Ning, Eric Fosler-Lussier, Pierre-Luc St-Charles, Yoshua Bengio, Dawn Song, Yu Su, Huan Sun
Abstract:
Although computer‑use agents (CUAs) hold significant potential to automate increasingly complex OS workflows, they can demonstrate unsafe unintended behaviors that deviate from expected outcomes even under benign input contexts. However, exploration of this risk remains largely anecdotal, lacking concrete characterization and automated methods to proactively surface long‑tail unintended behaviors under realistic CUA scenarios. To fill this gap, we introduce the first conceptual and methodological framework for unintended CUA behaviors, by defining their key characteristics, automatically eliciting them, and analyzing how they arise from benign inputs. We propose AutoElicit: an agentic framework that iteratively perturbs benign instructions using CUA execution feedback, and elicits severe harms while keeping perturbations realistic and benign. Using AutoElicit, we surface hundreds of harmful unintended behaviors from state‑of‑the‑art CUAs such as Claude 4.5 Haiku and Opus. We further evaluate the transferability of human‑verified successful perturbations, identifying persistent susceptibility to unintended behaviors across various other frontier CUAs. This work establishes a foundation for systematically analyzing unintended behaviors in realistic computer‑use settings.
Authors:Jiatao Chen, Xing Tang, Xiaoyue Duan, Yutang Feng, Jinchao Zhang, Jie Zhou
Abstract:
While existing Singing Voice Synthesis systems achieve high‑fidelity solo performances, they are constrained by global timbre control, failing to address dynamic multi‑singer arrangement and vocal texture within a single song. To address this, we propose Tutti, a unified framework designed for structured multi‑singer generation. Specifically, we introduce a Structure‑Aware Singer Prompt to enable flexible singer scheduling evolving with musical structure, and propose Complementary Texture Learning via Condition‑Guided VAE to capture implicit acoustic textures (e.g., spatial reverberation and spectral fusion) that are complementary to explicit controls. Experiments demonstrate that Tutti excels in precise multi‑singer scheduling and significantly enhances the acoustic realism of choral generation, offering a novel paradigm for complex multi‑singer arrangement. Audio samples are available at https://annoauth123‑ctrl.github.io/Tutii_Demo/.
Authors:Konstantinos Mitsides, Maxence Faldor, Antoine Cully
Abstract:
Open‑ended learning frames intelligence as emerging from continual interaction with an ever‑expanding space of environments. While recent advances have utilized foundation models to programmatically generate diverse environments, these approaches often focus on discovering isolated behaviors rather than orchestrating sustained progression. In complex open‑ended worlds, the large combinatorial space of possible challenges makes it difficult for agents to discover sequences of experiences that remain consistently learnable. To address this, we propose Dreaming in Code (DiCode), a framework in which foundation models synthesize executable environment code to scaffold learning toward increasing competence. In DiCode, "dreaming" takes the form of materializing code‑level variations of the world. We instantiate DiCode in Craftax, a challenging open‑ended benchmark characterized by rich mechanics and long‑horizon progression. Empirically, DiCode enables agents to acquire long‑horizon skills, achieving a 16% improvement in mean return over the strongest baseline and non‑zero success on late‑game combat tasks where prior methods fail. Our results suggest that code‑level environment design provides a practical mechanism for curriculum control, enabling the construction of intermediate environments that bridge competence gaps in open‑ended worlds. Project page and source code are available at https://konstantinosmitsides.github.io/dreaming‑in‑code and https://github.com/konstantinosmitsides/dreaming‑in‑code.
Authors:Issar Tzachor, Dvir Samuel, Rami Ben-Ari
Abstract:
Recent studies have adapted generative Multimodal Large Language Models (MLLMs) into embedding extractors for vision tasks, typically through fine‑tuning to produce universal representations. However, their performance on video remains inferior to Video Foundation Models (VFMs). In this paper, we focus on leveraging MLLMs for video‑text embedding and retrieval. We first conduct a systematic layer‑wise analysis, showing that intermediate (pre‑trained) MLLM layers already encode substantial task‑relevant information. Leveraging this insight, we demonstrate that combining intermediate‑layer embeddings with a calibrated MLLM head yields strong zero‑shot retrieval performance without any training. Building on these findings, we introduce a lightweight text‑based alignment strategy which maps dense video captions to short summaries and enables task‑related video‑text embedding learning without visual supervision. Remarkably, without any fine‑tuning beyond text, our method outperforms current methods, often by a substantial margin, achieving state‑of‑the‑art results across common video retrieval benchmarks.
Authors:Tianyu Li, Dongchen Han, Zixuan Cao, Haofeng Huang, Mengyu Zhou, Ming Chen, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang, Gao Huang
Abstract:
Modern Transformers predominantly adopt the Pre‑Norm paradigm for its optimization stability, foregoing the superior potential of the unstable Post‑Norm architecture. Prior attempts to combine their strengths typically lead to a stability‑performance trade‑off. We attribute this phenomenon to a structural incompatibility within a single‑stream design: Any application of the Post‑Norm operation inevitably obstructs the clean identity gradient preserved by Pre‑Norm. To fundamentally reconcile these paradigms, we propose SiameseNorm, a two‑stream architecture that couples Pre‑Norm‑like and Post‑Norm‑like streams with shared parameters. This design decouples the optimization dynamics of the two streams, retaining the distinct characteristics of both Pre‑Norm and Post‑Norm by enabling all residual blocks to receive combined gradients inherited from both paradigms, where one stream secures stability while the other enhances expressivity. Extensive pre‑training experiments on 1.3B‑parameter models demonstrate that SiameseNorm exhibits exceptional optimization robustness and consistently outperforms strong baselines. Code is available at https://github.com/Qwen‑Applications/SiameseNorm.
Authors:Yixuan Ye, Xuanyu Lu, Yuxin Jiang, Yuchao Gu, Rui Zhao, Qiwei Liang, Jiachun Pan, Fengda Zhang, Weijia Wu, Alex Jinpeng Wang
Abstract:
World models aim to understand, remember, and predict dynamic visual environments, yet a unified benchmark for evaluating their fundamental abilities remains lacking. To address this gap, we introduce MIND, the first open‑domain closed‑loop revisited benchmark for evaluating Memory consIstency and action coNtrol in worlD models. MIND contains 250 high‑quality videos at 1080p and 24 FPS, including 100 (first‑person) + 100 (third‑person) video clips under a shared action space and 25 + 25 clips across varied action spaces covering eight diverse scenes. We design an efficient evaluation framework to measure two core abilities: memory consistency and action control, capturing temporal stability and contextual coherence across viewpoints. Furthermore, we design various action spaces, including different character movement speeds and camera rotation angles, to evaluate the action generalization capability across different action spaces under shared scenes. To facilitate future performance benchmarking on MIND, we introduce MIND‑World, a novel interactive Video‑to‑World baseline. Extensive experiments demonstrate the completeness of MIND and reveal key challenges in current world models, including the difficulty of maintaining long‑term memory consistency and generalizing across action spaces. Code: https://github.com/CSU‑JPG/MIND.
Authors:Ziyang Fan, Keyu Chen, Ruilong Xing, Yulin Li, Li Jiang, Zhuotao Tian
Abstract:
Although Video Large Language Models (VLLMs) have shown remarkable capabilities in video understanding, they are required to process high volumes of visual tokens, causing significant computational inefficiency. Existing VLLMs acceleration frameworks usually compress spatial and temporal redundancy independently, which overlooks the spatiotemporal relationships, thereby leading to suboptimal spatiotemporal compression. The highly correlated visual features are likely to change in spatial position, scale, orientation, and other attributes over time due to the dynamic nature of video. Building on this insight, we introduce FlashVID, a training‑free inference acceleration framework for VLLMs. Specifically, FlashVID utilizes Attention and Diversity‑based Token Selection (ADTS) to select the most representative tokens for basic video representation, then applies Tree‑based Spatiotemporal Token Merging (TSTM) for fine‑grained spatiotemporal redundancy elimination. Extensive experiments conducted on three representative VLLMs across five video understanding benchmarks demonstrate the effectiveness and generalization of our method. Notably, by retaining only 10% of visual tokens, FlashVID preserves 99.1% of the performance of LLaVA‑OneVision. Consequently, FlashVID can serve as a training‑free and plug‑and‑play module for extending long video frames, which enables a 10x increase in video frame input to Qwen2.5‑VL, resulting in a relative improvement of 8.6% within the same computational budget. Code is available at https://github.com/Fanziyang‑v/FlashVID.
Authors:Sizhe Dang, Jiaqi Shao, Xiaodong Zheng, Guang Dai, Yan Song, Haishan Ye
Abstract:
As foundation models continue to scale, pretraining increasingly relies on data‑parallel distributed optimization, making bandwidth‑limited gradient synchronization a key bottleneck. Orthogonally, projection‑based low‑rank optimizers were mainly designed for memory efficiency, but remain suboptimal for communication‑limited training: one‑sided synchronization still transmits an O(rn) object for an m× n matrix gradient and refresh steps can dominate peak communicated bytes. We propose TSR, which brings two‑sided low‑rank communication to Adam‑family updates (TSR‑Adam) by synchronizing a compact core U^\top G V\in\mathbbR^r× r, reducing the dominant per‑step payload from O(mn) to O(r^2) while keeping moment states in low‑dimensional cores. To further reduce the peak communication from subspace refresh, TSR‑Adam adopts a randomized SVD‑based refresh that avoids full‑gradient synchronization. We additionally extend low‑rank communication to embedding gradients with embedding‑specific ranks and refresh schedules, yielding additional communication and memory savings over keeping embeddings dense. Across pretraining from 60M to 1B model scales, TSR‑Adam reduces average communicated bytes per step by 13×, and on GLUE fine‑tuning it reduces communication by 25×, while achieving comparable performance; we further provide a theoretical stationarity analysis for the proposed update. Code is available at https://github.com/DKmiyan/TSR‑Adam.
Authors:Jitai Hao, Qiang Huang, Yaowei Wang, Min Zhang, Jun Yu
Abstract:
The deployment of efficient long‑context LLMs in applications like autonomous agents, long‑chain reasoning, and creative writing is fundamentally bottlenecked by the linear growth of KV cache memory. Existing compression and eviction methods often struggle to balance accuracy, compression ratio, and hardware efficiency. We propose DeltaKV, a residual‑based KV cache compression framework motivated by two empirical findings: long‑range inter‑token similarity and highly shared latent components in KV representations. Instead of discarding tokens, DeltaKV encodes semantic residuals relative to retrieved historical references, preserving fidelity while substantially reducing storage. To translate compression gains into real system speedups, we further introduce Sparse‑vLLM, a high‑performance inference engine with decoupled memory management and kernels optimized for sparse and irregular KV layouts. Experiments show that DeltaKV reduces KV cache memory to 29% of the original while maintaining near‑lossless accuracy on LongBench, SCBench, and AIME. When integrated with Sparse‑vLLM, it achieves up to 2× throughput improvement over vLLM in long‑context scenarios, demonstrating a practical path toward scalable long‑context LLM deployment. Code, model checkpoints, and datasets are available at https://github.com/CURRENTF/Sparse‑vLLM.
Authors:Weihao Zeng, Yuzhen Huang, Junxian He
Abstract:
Large language models (LLMs) are increasingly capable of carrying out long‑running, real‑world tasks. However, as the amount of context grows, their reliability often deteriorates, a phenomenon known as "context rot". Existing long‑context benchmarks primarily focus on single‑step settings that evaluate a model's ability to retrieve information from a long snippet. In realistic scenarios, however, LLMs often need to act as agents that explore environments, follow instructions and plans, extract useful information, and predict correct actions under a dynamically growing context. To assess language agents in such settings, we introduce LOCA‑bench (a benchmark for LOng‑Context Agents). Given a task prompt, LOCA‑bench leverages automated and scalable control of environment states to regulate the agent's context length. This design enables LOCA‑bench to extend the context length potentially to infinity in a controlled way while keeping the underlying task semantics fixed. LOCA‑bench evaluates language agents as a combination of models and scaffolds, including various context management strategies. While agent performance generally degrades as the environment states grow more complex, advanced context management techniques can substantially improve the overall success rate. We open‑source LOCA‑bench to provide a platform for evaluating models and scaffolds in long‑context, agentic scenarios: https://github.com/hkust‑nlp/LOCA‑bench
Authors:Guanglong Sun, Hongwei Yan, Liyuan Wang, Zhiqi Kang, Shuang Cui, Hang Su, Jun Zhu, Yi Zhong
Abstract:
To cope with uncertain changes of the external world, intelligent systems must continually learn from complex, evolving environments and respond in real time. This ability, collectively known as general continual learning (GCL), encapsulates practical challenges such as online datastreams and blurry task boundaries. Although leveraging pretrained models (PTMs) has greatly advanced conventional continual learning (CL), these methods remain limited in reconciling the diverse and temporally mixed information along a single pass, resulting in sub‑optimal GCL performance. Inspired by meta‑plasticity and reconstructive memory in neuroscience, we introduce here an innovative approach named Meta Post‑Refinement (MePo) for PTMs‑based GCL. This approach constructs pseudo task sequences from pretraining data and develops a bi‑level meta‑learning paradigm to refine the pretrained backbone, which serves as a prolonged pretraining phase but greatly facilitates rapid adaptation of representation learning to downstream GCL tasks. MePo further initializes a meta covariance matrix as the reference geometry of pretrained representation space, enabling GCL to exploit second‑order statistics for robust output alignment. MePo serves as a plug‑in strategy that achieves significant performance gains across a variety of GCL benchmarks and pretrained checkpoints in a rehearsal‑free manner (e.g., 15.10%, 13.36%, and 12.56% on CIFAR‑100, ImageNet‑R, and CUB‑200 under Sup‑21/1K). Our source code is available at \hrefhttps://github.com/SunGL001/MePoMePo
Authors:Huiyang Yi, Xiaojian Shen, Yonggang Wu, Duxin Chen, He Wang, Wenwu Yu
Abstract:
Causal discovery from time series is a fundamental task in machine learning. However, its widespread adoption is hindered by a reliance on untestable causal assumptions and by the lack of robustness‑oriented evaluation in existing benchmarks. To address these challenges, we propose CausalCompass, a flexible and extensible benchmark suite designed to assess the robustness of time‑series causal discovery (TSCD) methods under violations of modeling assumptions. To demonstrate the practical utility of CausalCompass, we conduct extensive benchmarking of representative TSCD algorithms across eight assumption‑violation scenarios. Our experimental results indicate that no single method consistently attains optimal performance across all settings. Nevertheless, the methods exhibiting superior overall performance across diverse scenarios are almost invariably deep learning‑based approaches. We further provide hyperparameter sensitivity analyses to deepen the understanding of these findings. We also find, somewhat surprisingly, that NTS‑NOTEARS relies heavily on standardized preprocessing in practice, performing poorly in the vanilla setting but exhibiting strong performance after standardization. Finally, our work aims to provide a comprehensive and systematic evaluation of TSCD methods under assumption violations, thereby facilitating their broader adoption in real‑world applications. The code and datasets are available at https://github.com/huiyang‑yi/CausalCompass.
Authors:Yuzhu Cai, Zexi Liu, Xinyu Zhu, Cheng Wang, Jiaao Chen, Hanrui Wang, Wei-Chen Wang, Di Jin, Siheng Chen
Abstract:
Autonomous Machine Learning Engineering (MLE) requires agents to perform sustained, iterative optimization over long horizons. While recent LLM‑based agents show promise, current prompt‑based agents for MLE suffer from behavioral stagnation due to frozen parameters. Although Reinforcement Learning (RL) offers a remedy, applying it to MLE is hindered by prohibitive execution latency and inefficient data selection. Recognizing these challenges, we propose AceGRPO with two core components: (1) Evolving Data Buffer that continuously repurposes execution traces into reusable training tasks, and (2) Adaptive Sampling guided by a Learnability Potential function, which dynamically prioritizes tasks at the agent's learning frontier to maximize learning efficiency. Leveraging AceGRPO, our trained Ace‑30B model achieves a 100% valid submission rate on MLE‑Bench‑Lite, approaches the performance of proprietary frontier models, and outperforms larger open‑source baselines (e.g., DeepSeek‑V3.2), demonstrating robust capability for sustained iterative optimization. Code is available at https://github.com/yuzhu‑cai/AceGRPO.
Authors:Weijiang Lv, Yaoxuan Feng, Xiaobo Xia, Jiayu Wang, Yan Jing, Wenchao Chen, Bo Chen
Abstract:
Chain‑of‑Thought reasoning is widely used to improve the interpretability of multimodal large language models (MLLMs), yet the faithfulness of the generated reasoning traces remains unclear. Prior work has mainly focused on perceptual hallucinations, leaving reasoning level unfaithfulness underexplored. To isolate faithfulness from linguistic priors, we introduce SPD‑Faith Bench, a diagnostic benchmark based on fine‑grained image difference reasoning that enforces explicit visual comparison. Evaluations on state‑of‑the‑art MLLMs reveal two systematic failure modes, perceptual blindness and perception‑reasoning dissociation. We trace these failures to decaying visual attention and representation shifts in the residual stream. Guided by this analysis, we propose SAGE, a train‑free visual evidence‑calibrated framework that improves visual routing and aligns reasoning with perception. Our results highlight the importance of explicitly evaluating faithfulness beyond response correctness. Our benchmark and codes are available at https://github.com/Johanson‑colab/SPD‑Faith‑Bench.
Authors:Ruiqi Wang, Ruikang Liu, Runyu Chen, Haoxiang Suo, Zhiyi Peng, Zhuo Tang, Changjian Chen
Abstract:
Detecting anomalies in tabular data is critical for many real‑world applications, such as credit card fraud detection. With the rapid advancements in large language models (LLMs), state‑of‑the‑art performance in tabular anomaly detection has been achieved by converting tabular data into text and fine‑tuning LLMs. However, these methods randomly order columns during conversion, without considering the causal relationships between them, which is crucial for accurately detecting anomalies. In this paper, we present CausalTaD, a method that injects causal knowledge into LLMs for tabular anomaly detection. We first identify the causal relationships between columns and reorder them to align with these causal relationships. This reordering can be modeled as a linear ordering problem. Since each column contributes differently to the causal relationships, we further propose a reweighting strategy to assign different weights to different columns to enhance this effect. Experiments across more than 30 datasets demonstrate that our method consistently outperforms the current state‑of‑the‑art methods. The code for CausalTAD is available at https://github.com/350234/CausalTAD.
Authors:Pierre-Louis Favreau, Jean-Pierre Lo, Clement Guiguet, Charles Simon-Meunier, Nicolas Dehandschoewercker, Allen G. Roush, Judah Goldfeder, Ravid Shwartz-Ziv
Abstract:
We present Minitap, a multi‑agent system that achieves 100% success on the AndroidWorld benchmark, the first to fully solve all 116 tasks and surpassing human performance (80%). We first analyze why single‑agent architectures fail: context pollution from mixed reasoning traces, silent text input failures undetected by the agent, and repetitive action loops without escape. Minitap addresses each failure through targeted mechanisms: cognitive separation across six specialized agents, deterministic post‑validation of text input against device state, and meta‑cognitive reasoning that detects cycles and triggers strategy changes. Ablations show multi‑agent decomposition contributes +21 points over single‑agent baselines; verified execution adds +7 points; meta‑cognition adds +9 points. We release Minitap as open‑source software. https://github.com/minitap‑ai/mobile‑use
Authors:Qiuming Luo, Yuebing Li, Feng Li, Chang Kong
Abstract:
Distilling knowledge from large Vision‑Language Models (VLMs) into lightweight networks is crucial yet challenging in Fine‑Grained Visual Classification (FGVC), due to the reliance on fixed prompts and global alignment. To address this, we propose PAND (Prompt‑Aware Neighborhood Distillation), a two‑stage framework that decouples semantic calibration from structural transfer. First, we incorporate Prompt‑Aware Semantic Calibration to generate adaptive semantic anchors. Second, we introduce a neighborhood‑aware structural distillation strategy to constrain the student's local decision structure. PAND consistently outperforms state‑of‑the‑art methods on four FGVC benchmarks. Notably, our ResNet‑18 student achieves 76.09% accuracy on CUB‑200, surpassing the strong baseline VL2Lite by 3.4%. Code is available at https://github.com/LLLVTA/PAND.
Authors:Jarrod Barnes
Abstract:
Test‑time training (TTT) adapts language models through gradient‑based updates at inference. But is adaptation the right strategy? We study compute‑optimal test‑time strategies for verifiable execution‑grounded (VEG) tasks, domains like GPU kernel optimization where a deterministic evaluator provides dense, continuous reward signals. Using KernelBench as our testbed and a 120B‑parameter model (GPT‑OSS‑120B with LoRA adaptation), we find that search outperforms minimal adaptation (1‑5 gradient steps): Best‑of‑N sampling achieves 90% task success (18/20 tasks) at K=64 across the full KernelBench L1 eval set while TTT's best checkpoint reaches only 30.6% (3‑seed mean), with TTT's "equivalent K" falling below 1, worse than single‑sample inference. The failure mode is over‑sharpening: gradient updates collapse diversity toward mediocre solutions rather than discovering optimal ones. Our main contribution is surprisal‑guided selection: selecting the highest‑surprisal (lowest‑confidence) correct sample yields 80% success vs. 50% for most‑confident selection, a 30% improvement. Extending to surprisal‑guided‑top3 matches oracle performance at 100%. This zero‑cost strategy, validated through length‑controlled analysis, recovers oracle performance. For dense‑reward VEG tasks, compute should be allocated to sample diversity and intelligent selection rather than gradient adaptation. The surprisal‑guided selection principle may generalize to other execution‑grounded domains where optimal solutions occupy the distribution tail.
Authors:Binxiao Xu, Junyu Feng, Xiaopeng Lin, Haodong Li, Zhiyuan Feng, Bohan Zeng, Shaolin Lu, Ming Lu, Qi She, Wentao Zhang
Abstract:
Multimodal understanding of advertising videos is essential for interpreting the intricate relationship between visual storytelling and abstract persuasion strategies. However, despite excelling at general search, existing agents often struggle to bridge the cognitive gap between pixel‑level perception and high‑level marketing logic. To address this challenge, we introduce AD‑MIR, a framework designed to decode advertising intent via a two‑stage architecture. First, in the Structure‑Aware Memory Construction phase, the system converts raw video into a structured database by integrating semantic retrieval with exact keyword matching. This approach prioritizes fine‑grained brand details (e.g., logos, on‑screen text) while dynamically filtering out irrelevant background noise to isolate key protagonists. Second, the Structured Reasoning Agent mimics a marketing expert through an iterative inquiry loop, decomposing the narrative to deduce implicit persuasion tactics. Crucially, it employs an evidence‑based self‑correction mechanism that rigorously validates these insights against specific video frames, automatically backtracking when visual support is lacking. Evaluation on the AdsQA benchmark demonstrates that AD‑MIR achieves state‑of‑the‑art performance, surpassing the strongest general‑purpose agent, DVD, by 1.8% in strict and 9.5% in relaxed accuracy. These results underscore that effective advertising understanding demands explicitly grounding abstract marketing strategies in pixel‑level evidence. The code is available at https://github.com/Little‑Fridge/AD‑MIR.
Authors:Junyu Feng, Binxiao Xu, Jiayi Chen, Mengyu Dai, Cenyang Wu, Haodong Li, Bohan Zeng, Yunliu Xie, Hao Liang, Ming Lu, Wentao Zhang
Abstract:
This work addresses the challenge of personalized question answering in long‑term human‑machine interactions: when conversational history spans weeks or months and exceeds the context window, existing personalization mechanisms struggle to continuously absorb and leverage users' incremental concepts, aliases, and preferences. Current personalized multimodal models are predominantly static‑concepts are fixed at initialization and cannot evolve during interactions. We propose M2A, an agentic dual‑layer hybrid memory system that maintains personalized multimodal information through online updates. The system employs two collaborative agents: ChatAgent manages user interactions and autonomously decides when to query or update memory, while MemoryManager breaks down memory requests from ChatAgent into detailed operations on the dual‑layer memory bank, which couples a RawMessageStore (immutable conversation log) with a SemanticMemoryStore (high‑level observations), providing memories at different granularities. In addition, we develop a reusable data synthesis pipeline that injects concept‑grounded sessions from Yo'LLaVA and MC‑LLaVA into LoCoMo long conversations while preserving temporal coherence. Experiments show that M2A significantly outperforms baselines, demonstrating that transforming personalization from one‑shot configuration to a co‑evolving memory mechanism provides a viable path for high‑quality individualized responses in long‑term multimodal interactions. The code is available at https://github.com/Little‑Fridge/M2A.
Authors:Juntong Wu, Jialiang Cheng, Fuyu Lv, Ou Dan, Li Yuan
Abstract:
Mixture‑of‑Experts (MoE) architectures employ sparse activation to deliver faster training and inference with higher accuracy than dense LLMs. However, in production serving, MoE models require batch inference to optimize hardware efficiency, which may cause excessive expert activation and thus slow the memory‑bound decoding stage. To address the fundamental tension between batch decoding and expert sparsity, we present SERE, a Similarity‑based Expert Re‑routing method for Efficient batch decoding in MoE models. SERE dynamically reduces the number of active experts in an input‑aware manner by re‑routing tokens from secondary experts to their most similar primary counterparts. It also leverages similarity patterns to identify and preserve critical experts, thereby preventing capability loss. Notably, SERE avoids static expert pruning or merging, instead enabling dynamic expert skipping based on batch‑level expert redundancy. Additionally, we provide an efficient custom CUDA kernel for SERE, enabling plug‑and‑play use in vLLM with only a single‑line code change. Extensive experiments on various complex reasoning benchmarks demonstrate that SERE achieves up to 2.0x speedup with minimal quality loss, providing a practical solution for cost‑efficient and latency‑sensitive large‑scale MoE deployment. Code implementation of SERE can be found in https://github.com/JL‑Cheng/SERE.
Authors:Hulingxiao He, Zijun Geng, Yuxin Peng
Abstract:
Any entity in the visual world can be hierarchically grouped based on shared characteristics and mapped to fine‑grained sub‑categories. While Multi‑modal Large Language Models (MLLMs) achieve strong performance on coarse‑grained visual tasks, they often struggle with Fine‑Grained Visual Recognition (FGVR). Adapting general‑purpose MLLMs to FGVR typically requires large amounts of annotated data, which is costly to obtain, leaving a substantial performance gap compared to contrastive CLIP models dedicated for discriminative tasks. Moreover, MLLMs tend to overfit to seen sub‑categories and generalize poorly to unseen ones. To address these challenges, we propose Fine‑R1, an MLLM tailored for FGVR through an R1‑style training framework: (1) Chain‑of‑Thought Supervised Fine‑tuning, where we construct a high‑quality FGVR CoT dataset with rationales of "visual analysis, candidate sub‑categories, comparison, and prediction", transition the model into a strong open‑world classifier; and (2) Triplet Augmented Policy Optimization, where Intra‑class Augmentation mixes trajectories from anchor and positive images within the same category to improve robustness to intra‑class variance, while Inter‑class Augmentation maximizes the response distinction conditioned on images across sub‑categories to enhance discriminative ability. With only 4‑shot training, Fine‑R1 outperforms existing general MLLMs, reasoning MLLMs, and even contrastive CLIP models in identifying both seen and unseen sub‑categories, showing promise in working in knowledge‑intensive domains where gathering expert annotations for all sub‑categories is arduous. Code is available at https://github.com/PKU‑ICST‑MIPL/FineR1_ICLR2026.
Authors:Hussni Mohd Zakir, Eric Tatt Wei Ho
Abstract:
Recent self‑supervised Vision Transformers (ViTs), such as DINOv3, provide rich feature representations for dense vision tasks. This study investigates the intrinsic few‑shot semantic segmentation (FSS) capabilities of frozen DINOv3 features through a training‑free baseline, FSSDINO, utilizing class‑specific prototypes and Gram‑matrix refinement. Our results across binary, multi‑class, and cross‑domain (CDFSS) benchmarks demonstrate that this minimal approach, applied to the final backbone layer, is highly competitive with specialized methods involving complex decoders or test‑time adaptation. Crucially, we conduct an Oracle‑guided layer analysis, identifying a significant performance gap between the standard last‑layer features and globally optimal intermediate representations. We reveal a "Safest vs. Optimal" dilemma: while the Oracle proves higher performance is attainable, matching the results of compute‑intensive adaptation methods, current unsupervised and support‑guided selection metrics consistently yield lower performance than the last‑layer baseline. This characterizes a "Semantic Selection Gap" in Foundation Models, a disconnect where traditional heuristics fail to reliably identify high‑fidelity features. Our work establishes the "Last‑Layer" as a deceptively strong baseline and provides a rigorous diagnostic of the latent semantic potentials in DINOv3.The code is publicly available at https://github.com/hussni0997/fssdino.
Authors:Peizhen Li, Longbing Cao, Xiao-Ming Wu, Yang Zhang
Abstract:
Humanoid facial expression shadowing enables robots to realistically imitate human facial expressions in real time, which is critical for lifelike, facially expressive humanoid robots and affective human‑robot interaction. Existing progress in humanoid facial expression imitation remains limited, often failing to achieve either real‑time performance or realistic expressiveness due to offline video‑based inference designs and insufficient ability to capture and transfer subtle expression details. To address these limitations, we present VividFace, a real‑time and realistic facial expression shadowing system for humanoid robots. An optimized imitation framework X2CNet++ enhances expressiveness by fine‑tuning the human‑to‑humanoid facial motion transfer module and introducing a feature‑adaptation training strategy for better alignment across different image sources. Real‑time shadowing is further enabled by a video‑stream‑compatible inference pipeline and a streamlined workflow based on asynchronous I/O for efficient communication across devices. VividFace produces vivid humanoid faces by mimicking human facial expressions within 0.05 seconds, while generalizing across diverse facial configurations. Extensive real‑world demonstrations validate its practical utility. Videos are available at: https://lipzh5.github.io/VividFace/.
Authors:Tianyi Wu, Mingzhe Du, Yue Liu, Chengran Yang, Terry Yue Zhuo, Jiaheng Zhang, See-Kiong Ng
Abstract:
Large language models (LLMs) are increasingly used in software development, yet their tendency to generate insecure code remains a major barrier to real‑world deployment. Existing secure code alignment methods often suffer from a functionality‑‑security paradox, improving security at the cost of substantial utility degradation. We propose SecCoderX, an online reinforcement learning framework for functionality‑preserving secure code generation. SecCoderX first bridges vulnerability detection and secure code generation by repurposing mature detection resources in two ways: (i) synthesizing diverse, reality‑grounded vulnerability‑inducing coding tasks for online RL rollouts, and (ii) training a reasoning‑based vulnerability reward model that provides scalable and reliable security supervision. Together, these components are unified in an online RL loop to align code LLMs to generate secure and functional code. Extensive experiments demonstrate that SecCoderX achieves state‑of‑the‑art performance, improving Effective Safety Rate (ESR) by approximately 10% over unaligned models, whereas prior methods often degrade ESR by 14‑54%. We release our code, dataset and model checkpoints at https://github.com/AndrewWTY/SecCoderX.
Authors:Changhua Xu, Jie Lu, Junyu Xuan, En Yu
Abstract:
Vision‑‑Language‑‑Action (VLA) models bridge multimodal reasoning with physical control, but adapting them to new tasks with scarce demonstrations remains unreliable. While fine‑tuned VLA policies often produce semantically plausible trajectories, failures often arise from unresolved geometric ambiguities, where near‑miss action candidates lead to divergent execution outcomes under limited supervision. We study few‑shot VLA adaptation from a \emphgeneration‑‑selection perspective and propose a novel framework VGAS (Value‑Guided Action‑chunk Selection). It performs inference‑time best‑of‑N selection to identify action chunks that are both semantically faithful and geometrically precise. Specifically, VGAS employs a finetuned VLA as a high‑recall proposal generator and introduces the \textrmQ‑Chunk‑Former, a geometrically grounded Transformer critic to resolve fine‑grained geometric ambiguities. In addition, we propose Explicit Geometric Regularization (\textttEGR), which explicitly shapes a discriminative value landscape to preserve action ranking resolution among near‑miss candidates while mitigating value instability under scarce supervision. Experiments and theoretical analysis demonstrate that VGAS consistently improves success rates and robustness under limited demonstrations and distribution shifts. Our code is available at https://github.com/Jyugo‑15/VGAS.
Authors:Ruoyao Wen, Hao Li, Chaowei Xiao, Ning Zhang
Abstract:
Indirect prompt injection threatens LLM agents by embedding malicious instructions in external content, enabling unauthorized actions and data theft. LLM agents maintain working memory through their context window, which stores interaction history for decision‑making. Conventional agents indiscriminately accumulate all tool outputs and reasoning traces in this memory, creating two critical vulnerabilities: (1) injected instructions persist throughout the workflow, granting attackers multiple opportunities to manipulate behavior, and (2) verbose, non‑essential content degrades decision‑making capabilities. Existing defenses treat bloated memory as given and focus on remaining resilient, rather than reducing unnecessary accumulation to prevent the attack. We present AgentSys, a framework that defends against indirect prompt injection through explicit memory management. Inspired by process memory isolation in operating systems, AgentSys organizes agents hierarchically: a main agent spawns worker agents for tool calls, each running in an isolated context and able to spawn nested workers for subtasks. External data and subtask traces never enter the main agent's memory; only schema‑validated return values can cross boundaries through deterministic JSON parsing. Ablations show isolation alone cuts attack success to 2.19%, and adding a validator/sanitizer further improves defense with event‑triggered checks whose overhead scales with operations rather than context length. On AgentDojo and ASB, AgentSys achieves 0.78% and 4.25% attack success while slightly improving benign utility over undefended baselines. It remains robust to adaptive attackers and across multiple foundation models, showing that explicit memory management enables secure, dynamic LLM agent architectures. Our code is available at: https://github.com/ruoyaow/agentsys‑memory.
Authors:Kunal Pai, Parth Shah, Harshil Patel
Abstract:
AI agents are increasingly deployed in production, yet their security evaluations remain bottlenecked by manual red‑teaming or static benchmarks that fail to model adaptive, multi‑turn adversaries. We propose NAAMSE, an evolutionary framework that reframes agent security evaluation as a feedback‑driven optimization problem. Our system employs a single autonomous agent that orchestrates a lifecycle of genetic prompt mutation, hierarchical corpus exploration, and asymmetric behavioral scoring. By using model responses as a fitness signal, the framework iteratively compounds effective attack strategies while simultaneously ensuring "benign‑use correctness", preventing the degenerate security of blanket refusal. Our experiments on Gemini 2.5 Flash demonstrate that evolutionary mutation systematically amplifies vulnerabilities missed by one‑shot methods, with controlled ablations revealing that the synergy between exploration and targeted mutation uncovers high‑severity failure modes. We show that this adaptive approach provides a more realistic and scalable assessment of agent robustness in the face of evolving threats. The code for NAAMSE is open source and available at https://github.com/HASHIRU‑AI/NAAMSE.
Authors:Nisharg Nargund, Priyesh Shukla
Abstract:
Large language models (LLMs) achieve remarkable performance but demand substantial computational resources, limiting deployment on edge devices and resource‑constrained environments. We present TernaryLM, a 132M parameter transformer architecture that employs native 1‑bit ternary quantization ‑1, 0, +1 during training, achieving significant memory reduction without sacrificing language modeling capability. Unlike post‑training quantization approaches that quantize pre‑trained full‑precision models, TernaryLM learns quantization‑aware representations from scratch using straight‑through estimators and adaptive per‑layer scaling factors. Our experiments demonstrate: (1) validation perplexity of 58.42 on TinyStories; (2) downstream transfer with 82.47 percent F1 on MRPC paraphrase detection; (3) 2.4x memory reduction (498MB vs 1197MB) with comparable inference latency; and (4) stable training dynamics across diverse corpora. We provide layer‑wise quantization analysis showing that middle transformer layers exhibit highest compatibility with extreme quantization, informing future non‑uniform precision strategies. Our results suggest that native 1‑bit training is a promising direction for efficient neural language models. Code is available at https://github.com/1nisharg/TernaryLM‑Memory‑Efficient‑Language‑Modeling.
Authors:Ruturaj Reddy, Hrishav Bakul Barua, Junn Yong Loo, Thanh Thi Nguyen, Ganesh Krishnasamy
Abstract:
Diffusion‑based trajectory planners have demonstrated strong capability for modeling the multimodal nature of human driving behavior, but their reliance on iterative stochastic sampling poses critical challenges for real‑time, safety‑critical deployment. In this work, we present RAPiD, a deterministic policy extraction framework that distills a pretrained diffusion‑based planner into an efficient policy while eliminating diffusion sampling. Using score‑regularized policy optimization, we leverage the score function of a pre‑trained diffusion planner as a behavior prior to regularize policy learning. To promote safety and passenger comfort, the policy is optimized using a critic trained to imitate a predictive driver controller, providing dense, safety‑focused supervision beyond conventional imitation learning. Evaluations demonstrate that RAPiD achieves competitive performance on closed‑loop nuPlan scenarios with an 8x speedup over diffusion baselines, while achieving state‑of‑the‑art generalization among learning‑based planners on the interPlan benchmark. The official website of this work is: https://github.com/ruturajreddy/RAPiD.
Authors:Jindou Jia, Gen Li, Xiangyu Chen, Tuo An, Yuxuan Hu, Jingliang Li, Xinying Guo, Jianfei Yang
Abstract:
Diffusion‑based policies have recently achieved remarkable success in robotics by formulating action prediction as a conditional denoising process. However, the standard practice of sampling from random Gaussian noise often requires multiple iterative steps to produce clean actions, leading to high inference latency that incurs a major bottleneck for real‑time control. In this paper, we challenge the necessity of uninformed noise sampling and propose Action‑to‑Action flow matching (A2A), a novel policy paradigm that shifts from random sampling to initialization informed by the previous action. Unlike existing methods that treat proprioceptive action feedback as static conditions, A2A leverages historical proprioceptive sequences, embedding them into a high‑dimensional latent space as the starting point for action generation. This design bypasses costly iterative denoising while effectively capturing the robot's physical dynamics and temporal continuity. Extensive experiments demonstrate that A2A exhibits high training efficiency, fast inference speed, and improved generalization. Notably, A2A enables high‑quality action generation in as few as a single inference step (0.56 ms latency), and exhibits superior robustness to visual perturbations and enhanced generalization to unseen configurations. Lastly, we also extend A2A to video generation, demonstrating its broader versatility in temporal modeling. Project site: https://lorenzo‑0‑0.github.io/A2A_Flow_Matching.
Authors:Kaijie Zhu, Yuzhou Nie, Yijiang Li, Yiming Huang, Jialian Wu, Jiang Liu, Ximeng Sun, Zhenfei Yin, Lun Wang, Zicheng Liu, Emad Barsoum, William Yang Wang, Wenbo Guo
Abstract:
Executing complex terminal tasks remains a significant challenge for open‑weight LLMs, constrained by two fundamental limitations. First, high‑fidelity, executable training environments are scarce: environments synthesized from real‑world repositories are not diverse and scalable, while trajectories synthesized by LLMs suffer from hallucinations. Second, standard instruction tuning uses expert trajectories that rarely exhibit simple mistakes common to smaller models. This creates a distributional mismatch, leaving student models ill‑equipped to recover from their own runtime failures. To bridge these gaps, we introduce TermiGen, an end‑to‑end pipeline for synthesizing verifiable environments and resilient expert trajectories. Termi‑Gen first generates functionally valid tasks and Docker containers via an iterative multi‑agent refinement loop. Subsequently, we employ a Generator‑Critic protocol that actively injects errors during trajectory collection, synthesizing data rich in error‑correction cycles. Fine‑tuned on this TermiGen‑generated dataset, our TermiGen‑Qwen2.5‑Coder‑32B achieves a 31.3% pass rate on TerminalBench. This establishes a new open‑weights state‑of‑the‑art, outperforming existing baselines and notably surpassing capable proprietary models such as o4‑mini. Dataset is avaiable at https://github.com/ucsb‑mlsec/terminal‑bench‑env.
Authors:Abdullah Arafat Miah, Kevin Vu, Yu Bi
Abstract:
Spiking Neural Networks (SNNs) are energy‑efficient counterparts of Deep Neural Networks (DNNs) with high biological plausibility, as information is transmitted through temporal spiking patterns. The core element of an SNN is the spiking neuron, which converts input data into spikes following the Leaky Integrate‑and‑Fire (LIF) neuron model. This model includes several important hyperparameters, such as the membrane potential threshold and membrane time constant. Both the DNNs and SNNs have proven to be exploitable by backdoor attacks, where an adversary can poison the training dataset with malicious triggers and force the model to behave in an attacker‑defined manner. Yet, how an adversary can exploit the unique characteristics of SNNs for backdoor attacks remains underexplored. In this paper, we propose BadSNN, a novel backdoor attack on spiking neural networks that exploits hyperparameter variations of spiking neurons to inject backdoor behavior into the model. We further propose a trigger optimization process to achieve better attack performance while making trigger patterns less perceptible. BadSNN demonstrates superior attack performance on various datasets and architectures, as well as compared with state‑of‑the‑art data poisoning‑based backdoor attacks and robustness against common backdoor mitigation techniques. Codes can be found at https://github.com/SiSL‑URI/BadSNN.
Authors:Hanyu Wang, Yuanpu Cao, Lu Lin, Jinghui Chen
Abstract:
Advanced large language model agents typically adopt self‑reflection for improving performance, where agents iteratively analyze past actions to correct errors. However, existing reflective approaches are inherently retrospective: agents act, observe failure, and only then attempt to recover. In this work, we introduce PreFlect, a prospective reflection mechanism that shifts the paradigm from post hoc correction to pre‑execution foresight by criticizing and refining agent plans before execution. To support grounded prospective reflection, we distill planning errors from historical agent trajectories, capturing recurring success and failure patterns observed across past executions. Furthermore, we complement prospective reflection with a dynamic re‑planning mechanism that provides execution‑time plan update in case the original plan encounters unexpected deviation. Evaluations on different benchmarks demonstrate that PreFlect significantly improves overall agent utility on complex real‑world tasks, outperforming strong reflection‑based baselines and several more complex agent architectures. Code will be updated at https://github.com/wwwhy725/PreFlect.
Authors:Ayush Roy, Rudrasis Chakraborty, Lav Varshney, Vishnu Suresh Lokhande
Abstract:
Pooling heterogeneous datasets across domains is a common strategy in representation learning, but naive pooling can amplify distributional asymmetries and yield biased estimators, especially in settings where zero‑shot generalization is required. We propose a matching framework that selects samples relative to an adaptive centroid and iteratively refines the representation distribution. The double robustness and the propensity score matching for the inclusion of data domains make matching more robust than naive pooling and uniform subsampling by filtering out the confounding domains (the main cause of heterogeneity). Theoretical and empirical analyses show that, unlike naive pooling or uniform subsampling, matching achieves better results under asymmetric meta‑distributions, which are also extended to non‑Gaussian and multimodal real‑world settings. Most importantly, we show that these improvements translate to zero‑shot medical anomaly detection, one of the extreme forms of data heterogeneity and asymmetry. The code is available on https://github.com/AyushRoy2001/Beyond‑Pooling.
Authors:Jianrui Zhang, Anirudh Sundara Rajan, Brandon Han, Soochahn Lee, Sukanta Ganguly, Yong Jae Lee
Abstract:
Universal Multimodal Retrieval (UMR) seeks any‑to‑any search across text and vision, yet modern embedding models remain brittle when queries require latent reasoning (e.g., resolving underspecified references or matching compositional constraints). We argue this brittleness is often data‑induced: when images carry "silent" evidence and queries leave key semantics implicit, a single embedding pass must both reason and compress, encouraging spurious feature matching. We propose a data‑centric framework that decouples these roles by externalizing reasoning before retrieval. Using a strong Vision‑‑Language Model, we make implicit semantics explicit by densely captioning visual evidence in corpus entries, resolving ambiguous multimodal references in queries, and rewriting verbose instructions into concise retrieval constraints. Inference‑time enhancement alone is insufficient; the retriever must be trained on these semantically dense representations to avoid distribution shift and fully exploit the added signal. Across M‑BEIR, our reasoning‑augmented training method yields consistent gains over strong baselines, with ablations showing that corpus enhancement chiefly benefits knowledge‑intensive queries while query enhancement is critical for compositional modification requests. We publicly release our code at https://github.com/AugmentedRetrieval/ReasoningAugmentedRetrieval.
Authors:Shang Liu, Hanyu Pei, Zeyan Liu
Abstract:
Large Language Models(LLMs) have been successful in numerous fields. Alignment has usually been applied to prevent them from harmful purposes. However, aligned LLMs remain vulnerable to jailbreak attacks that deliberately mislead them into producing harmful outputs. Existing jailbreaks are either black‑box, using carefully crafted, unstealthy prompts, or white‑box, requiring resource‑intensive computation. In light of these challenges, we introduce ShallowJail, a novel attack that exploits shallow alignment in LLMs. ShallowJail can misguide LLMs' responses by manipulating the initial tokens during inference. Through extensive experiments, we demonstrate the effectiveness of ShallowJail, which substantially degrades the safety of state‑of‑the‑art LLM responses. Our code is available at https://github.com/liuup/ShallowJail.
Authors:Chenglei Yu, Chuanrui Wang, Bangyan Liao, Tailin Wu
Abstract:
A central goal in systems biology and drug discovery is to predict the transcriptional response of cells to perturbations. This task is challenging due to the noisy and sparse nature of single‑cell measurements, as well as the fact that perturbations often induce population‑level shifts rather than changes in individual cells. Existing deep learning methods typically assume cell‑level correspondences, limiting their ability to capture such global effects. We present scDFM, a generative framework based on conditional flow matching that models the full distribution of perturbed cells conditioned on control states. By incorporating a maximum mean discrepancy (MMD) objective, our method aligns perturbed and control populations beyond cell‑level correspondences. To further improve robustness to sparsity and noise, we introduce the Perturbation‑Aware Differential Transformer (PAD‑Transformer), a backbone architecture that leverages gene interaction graphs and differential attention to capture context‑specific expression changes. Across multiple genetic and drug perturbation benchmarks, scDFM consistently outperforms prior methods, demonstrating strong generalization in both unseen and combinatorial settings. In the combinatorial setting, it reduces mean squared error by 19.6% relative to the strongest baseline. These results highlight the importance of distribution‑level generative modeling for robust in silico perturbation prediction. The code is available at https://github.com/AI4Science‑WestlakeU/scDFM
Authors:Gyoung S. Na, Chanyoung Park
Abstract:
Various representation learning methods for molecular structures have been devised to accelerate data‑driven chemistry. However, the representation capabilities of existing methods are essentially limited to atom‑level information, which is not sufficient to describe real‑world molecular physics. Although electron‑level information can provide fundamental knowledge about chemical compounds beyond the atom‑level information, obtaining the electron‑level information in real‑world molecules is computationally impractical and sometimes infeasible. We propose a method for learning electron‑informed molecular representations without additional computation costs by transferring readily accessible electron‑level information about small molecules to large molecules of our interest. The proposed method achieved state‑of‑the‑art prediction accuracy on extensive benchmark datasets containing experimentally observed molecular physics. The source code for HEDMoL is available at https://github.com/ngs00/HEDMoL.
Authors:Yongqing Jiang, Jianze Wang, Zhiqi Shen, Zhenghong Lin, Jiayuan Wang, Yijian Yang, Kaoshan Dai, Haoran Luo
Abstract:
Structural modeling is a fundamental component of computational engineering science, in which even minor physical inconsistencies or specification violations may invalidate downstream simulations. The potential of large language models (LLMs) for automatic generation of modeling code has been demonstrated. However, non‑executable or physically inconsistent outputs remain prevalent under stringent engineering constraints. A framework for physics‑consistent automatic building modeling is therefore proposed, integrating domain knowledge construction, constraint‑oriented model alignment, and verification‑driven evaluation. CivilInstruct is introduced as a domain‑specific dataset that formalizes structural engineering knowledge and constraint reasoning to enable simulation‑ready model generation. A two‑stage fine‑tuning strategy is further employed to enforce constraint satisfaction and application programming interface compliance, substantially reducing hallucinated and non‑conforming outputs. MBEval is presented as a verification‑driven benchmark that evaluates executability and structural dynamics consistency through closed‑loop validation. Experimental results show consistent improvements over baselines across rigorous verification metrics. Our code is available at https://github.com/Jovanqing/AutoBM.
Authors:Jinxiu Qu, Zirui Tang, Hongzhang Huang, Boyu Niu, Wei Zhou, Jiannan Wang, Yitong Song, Guoliang Li, Xuanhe Zhou, Fan Wu
Abstract:
Semi‑structured table question answering (QA) is a challenging task that requires (1) precise extraction of cell contents and positions and (2) accurate recovery of key implicit logical structures, hierarchical relationships, and semantic associations encoded in table layouts. In practice, such tables are often interpreted manually by human experts, which is labor‑intensive and time‑consuming. However, automating this process remains difficult. Existing Text‑to‑SQL methods typically require converting semi‑structured tables into structured formats, inevitably leading to information loss, while approaches like Text‑to‑Code and multimodal LLM‑based QA struggle with complex layouts and often yield inaccurate answers. To address these limitations, we present ST‑Raptor, an agentic system for semi‑structured table QA. ST‑Raptor offers an interactive analysis environment that combines visual editing, tree‑based structural modeling, and agent‑driven query resolution to support accurate and user‑friendly table understanding. Experimental results on both benchmark and real‑world datasets demonstrate that ST‑Raptor outperforms existing methods in both accuracy and usability. The code is available at https://github.com/weAIDB/ST‑Raptor, and a demonstration video is available at https://youtu.be/9GDR‑94Cau4.
Authors:Siqi Song, Xuanbing Xie, Zonglin Li, Yuqiang Li, Shijie Wang, Biqing Qi
Abstract:
Multi‑robot collaboration tasks often require heterogeneous robots to work together over long horizons under spatial constraints and environmental uncertainties. Although Large Language Models (LLMs) excel at reasoning and planning, their potential for coordinated control has not been fully explored. Inspired by human teamwork, we present CLiMRS (Cooperative Large‑Language‑Model‑Driven Heterogeneous Multi‑Robot System), an adaptive group negotiation framework among LLMs for multi‑robot collaboration. This framework pairs each robot with an LLM agent and dynamically forms subgroups through a general proposal planner. Within each subgroup, a subgroup manager leads perception‑driven multi‑LLM discussions to get commands for actions. Feedback is provided by both robot execution outcomes and environment changes. This grouping‑planning‑execution‑feedback loop enables efficient planning and robust execution. To evaluate these capabilities, we introduce CLiMBench, a heterogeneous multi‑robot benchmark of challenging assembly tasks. Our experiments show that CLiMRS surpasses the best baseline, achieving over 40% higher efficiency on complex tasks without sacrificing success on simpler ones. Overall, our results demonstrate that leveraging human‑inspired group formation and negotiation principles significantly enhances the efficiency of heterogeneous multi‑robot collaboration. Our code is available here: https://github.com/song‑siqi/CLiMRS.
Authors:Yuchen Yan, Liang Jiang, Jin Jiang, Shuaicheng Li, Zujie Wen, Zhiqiang Zhang, Jun Zhou, Jian Shao, Yueting Zhuang, Yongliang Shen
Abstract:
Large reasoning models achieve strong performance by scaling inference‑time chain‑of‑thought, but this paradigm suffers from quadratic cost, context length limits, and degraded reasoning due to lost‑in‑the‑middle effects. Iterative reasoning mitigates these issues by periodically summarizing intermediate thoughts, yet existing methods rely on supervised learning or fixed heuristics and fail to optimize when to summarize, what to preserve, and how to resume reasoning. We propose InftyThink+, an end‑to‑end reinforcement learning framework that optimizes the entire iterative reasoning trajectory, building on model‑controlled iteration boundaries and explicit summarization. InftyThink+ adopts a two‑stage training scheme with supervised cold‑start followed by trajectory‑level reinforcement learning, enabling the model to learn strategic summarization and continuation decisions. Experiments on DeepSeek‑R1‑Distill‑Qwen‑1.5B show that InftyThink+ improves accuracy by 21% on AIME24 and outperforms conventional long chain‑of‑thought reinforcement learning by a clear margin, while also generalizing better to out‑of‑distribution benchmarks. Moreover, InftyThink+ significantly reduces inference latency and accelerates reinforcement learning training, demonstrating improved reasoning efficiency alongside stronger performance.
Authors:Saad Hossain, Tom Tseng, Punya Syon Pandey, Samanvay Vajpayee, Matthew Kowal, Nayeema Nonta, Samuel Simko, Stephen Casper, Zhijing Jin, Kellin Pelrine, Sirisha Rambhatla
Abstract:
As increasingly capable open‑weight large language models (LLMs) are deployed, improving their tamper resistance against unsafe modifications, whether accidental or intentional, becomes critical to minimize risks. However, there is no standard approach to evaluate tamper resistance. Varied data sets, metrics, and tampering configurations make it difficult to compare safety, utility, and robustness across different models and defenses. To this end, we introduce TamperBench, the first unified framework to systematically evaluate the tamper resistance of LLMs. TamperBench (i) curates a repository of state‑of‑the‑art weight‑space fine‑tuning attacks and latent‑space representation attacks; (ii) enables realistic adversarial evaluation through systematic hyperparameter sweeps per attack‑model pair; and (iii) provides both safety and utility evaluations. TamperBench requires minimal additional code to specify any fine‑tuning configuration, alignment‑stage defense method, and metric suite while ensuring end‑to‑end reproducibility. We use TamperBench to evaluate 21 open‑weight LLMs, including defense‑augmented variants, across nine tampering threats using standardized safety and capability metrics with hyperparameter sweeps per model‑attack pair. This yields novel insights, including effects of post‑training on tamper resistance, that jailbreak‑tuning is typically the most severe attack, and that Triplet emerges as a leading alignment‑stage defense. Code is available at: https://github.com/criticalml‑uw/TamperBench
Authors:Sindhuja Chaduvula, Jessee Ho, Kina Kim, Aravind Narayanan, Mahshid Alinoori, Muskan Garg, Dhanesh Ramachandram, Shaina Raza
Abstract:
Over the last decade, explainable AI has primarily focused on interpreting individual model predictions, producing post‑hoc explanations that relate inputs to outputs under a fixed decision structure. Recent advances in large language models (LLMs) have enabled agentic AI systems whose behaviour unfolds over multi‑step trajectories. In these settings, success and failure are determined by sequences of decisions rather than a single output. While useful, it remains unclear how explanation approaches designed for static predictions translate to agentic settings where behaviour emerges over time. In this work, we bridge the gap between static and agentic explainability by comparing attribution‑based explanations with trace‑based diagnostics across both settings. To make this distinction explicit, we empirically compare attribution‑based explanations used in static classification tasks with trace‑based diagnostics used in agentic benchmarks (TAU‑bench Airline and AssistantBench). Our results show that while attribution methods achieve stable feature rankings in static settings (Spearman ρ= 0.86), they cannot be applied reliably to diagnose execution‑level failures in agentic trajectories. In contrast, trace‑grounded rubric evaluation for agentic settings consistently localizes behaviour breakdowns and reveals that state tracking inconsistency is 2.7× more prevalent in failed runs and reduces success probability by 49%. These findings motivate a shift towards trajectory‑level explainability for agentic systems when evaluating and diagnosing autonomous AI behaviour. Resources: https://github.com/VectorInstitute/unified‑xai‑evaluation‑framework https://vectorinstitute.github.io/unified‑xai‑evaluation‑framework
Authors:Minjeong Ban, Jeonghwan Choi, Hyangsuk Min, Nicole Hee-Yeon Kim, Minseok Kim, Jae-Gil Lee, Hwanjun Song
Abstract:
Information retrieval (IR) evaluation remains challenging due to incomplete IR benchmark datasets that contain unlabeled relevant chunks. While LLMs and LLM‑human hybrid strategies reduce costly human effort, they remain prone to LLM overconfidence and ineffective AI‑to‑human escalation. To address this, we propose DREAM, a multi‑round debate‑based relevance assessment framework with LLM agents, built on opposing initial stances and iterative reciprocal critique. Through our agreement‑based debate, it yields more accurate labeling for certain cases and more reliable AI‑to‑human escalation for uncertain ones, achieving 95.2% labeling accuracy with only 3.5% human involvement. Using DREAM, we build BRIDGE, a refined benchmark that mitigates evaluation bias and enables fairer retriever comparison by uncovering 29,824 missing relevant chunks. We then re‑benchmark IR systems and extend evaluation to RAG, showing that unaddressed holes not only distort retriever rankings but also drive retrieval‑generation misalignment. The relevance assessment framework is available at https: //github.com/DISL‑Lab/DREAM‑ICLR‑26; and the BRIDGE dataset is available at https://github.com/DISL‑Lab/BRIDGE‑Benchmark.
Authors:Lanbo Lin, Jiayao Liu, Tianyuan Yang, Li Cai, Yuanwu Xu, Lei Wei, Sicong Xie, Guannan Zhang
Abstract:
Evaluating agentic AI on open‑ended professional tasks faces a fundamental dilemma between rigor and flexibility. Static rubrics provide rigorous, reproducible assessment but fail to accommodate diverse valid response strategies, while LLM‑as‑a‑judge approaches adapt to individual responses yet suffer from instability and bias. Human experts address this dilemma by combining domain‑grounded principles with dynamic, claim‑level assessment. Inspired by this process, we propose JADE, a two‑layer evaluation framework. Layer 1 encodes expert knowledge as a predefined set of evaluation skills, providing stable evaluation criteria. Layer 2 performs report‑specific, claim‑level evaluation to flexibly assess diverse reasoning strategies, with evidence‑dependency gating to invalidate conclusions built on refuted claims. Experiments on BizBench show that JADE improves evaluation stability and reveals critical agent failure modes missed by holistic LLM‑based evaluators. We further demonstrate strong alignment with expert‑authored rubrics and effective transfer to a medical‑domain benchmark, validating JADE across professional domains. Our code is publicly available at https://github.com/smiling‑world/JADE.
Authors:Changyue Wang, Weihang Su, Qingyao Ai, Yiqun Liu
Abstract:
Scaling training data and model parameters has long driven progress in large language models (LLMs), but this paradigm is increasingly constrained by the scarcity of high‑quality data and diminishing returns from rising computational costs. As a result, recent work is increasing the focus on continual learning from real‑world deployment, where user interaction logs provide a rich source of authentic human feedback and procedural knowledge. However, learning from user logs is challenging due to their unstructured and noisy nature. Vanilla LLM systems often struggle to distinguish useful feedback signals from noisy user behavior, and the disparity between user log collection and model optimization (e.g., the off‑policy optimization problem) further strengthens the problem. To this end, we propose UNO (User log‑driveN Optimization), a unified framework for improving LLM systems (LLMsys) with user logs. UNO first distills logs into semi‑structured rules and preference pairs, then employs query‑and‑feedback‑driven clustering to manage data heterogeneity, and finally quantifies the cognitive gap between the model's prior knowledge and the log data. This assessment guides the LLMsys to adaptively filter out noisy feedback and construct different modules for primary and reflective experiences extracted from user logs, thereby improving future responses. Extensive experiments show that UNO achieves state‑of‑the‑art effectiveness and efficiency, significantly outperforming Retrieval Augmented Generation (RAG) and memory‑based baselines. We have open‑sourced our code at https://github.com/bebr2/UNO .
Authors:Zhenxing Ming, Julie Stephany Berrio, Mao Shan, Stewart Worrall
Abstract:
3D semantic occupancy prediction enables autonomous vehicles (AVs) to perceive fine‑grained geometric and semantic structure of their surroundings from onboard sensors, which is essential for safe decision‑making and navigation. Recent models for 3D semantic occupancy prediction have successfully addressed the challenge of describing real‑world objects with varied shapes and classes. However, the intermediate representations used by existing methods for 3D semantic occupancy prediction rely heavily on 3D voxel volumes or a set of 3D Gaussians, hindering the model's ability to efficiently and effectively capture fine‑grained geometric details in the 3D driving environment. This paper introduces TFusionOcc, a novel object‑centric multi‑sensor fusion framework for predicting 3D semantic occupancy. By leveraging multi‑stage multi‑sensor fusion, Student's t‑distribution, and the T‑Mixture model (TMM), together with more geometrically flexible primitives, such as the deformable superquadric (superquadric with inverse warp), the proposed method achieved state‑of‑the‑art (SOTA) performance on the nuScenes benchmark. In addition, extensive experiments were conducted on the nuScenes‑C dataset to demonstrate the robustness of the proposed method in different camera and lidar corruption scenarios. The code will be available at: https://github.com/DanielMing123/TFusionOcc
Authors:Fuxi Zhang, Yifan Wang, Hengrun Zhao, Zhuohan Sun, Changxing Xia, Lijun Wang, Huchuan Lu, Yangrui Shao, Chen Yang, Long Teng
Abstract:
Salient object detection is inherently a subjective problem, as observers with different priors may perceive different objects as salient. However, existing methods predominantly formulate it as an objective prediction task with a single groundtruth segmentation map for each image, which renders the problem under‑determined and fundamentally ill‑posed. To address this issue, we propose Observer‑Centric Salient Object Detection (OC‑SOD), where salient regions are predicted by considering not only the visual cues but also the observer‑specific factors such as their preferences or intents. As a result, this formulation captures the intrinsic ambiguity and diversity of human perception, enabling personalized and context‑aware saliency prediction. By leveraging multi‑modal large language models, we develop an efficient data annotation pipeline and construct the first OC‑SOD dataset named OC‑SODBench, comprising 33k training, validation and test images with 152k textual prompts and object pairs. Built upon this new dataset, we further design OC‑SODAgent, an agentic baseline which performs OC‑SOD via a human‑like "Perceive‑Reflect‑Adjust" process. Extensive experiments on our proposed OC‑SODBench have justified the effectiveness of our contribution. Through this observer‑centric perspective, we aim to bridge the gap between human perception and computational modeling, offering a more realistic and flexible understanding of what makes an object truly "salient." Code and dataset are publicly available at: https://github.com/Dustzx/OC_SOD
Authors:Yewei Liu, Xiyuan Wang, Yansheng Mao, Yoav Gelbery, Haggai Maron, Muhan Zhang
Abstract:
We propose SHINE (Scalable Hyper In‑context NEtwork), a scalable hypernetwork that can map diverse meaningful contexts into high‑quality LoRA adapters for large language models (LLM). By reusing the frozen LLM's own parameters in an in‑context hypernetwork design and introducing architectural innovations, SHINE overcomes key limitations of prior hypernetworks and achieves strong expressive power with a relatively small number of parameters. We introduce a pretraining and instruction fine‑tuning pipeline, and train our hypernetwork to generate high quality LoRA adapters from diverse meaningful contexts in a single forward pass. It updates LLM parameters without any fine‑tuning, and immediately enables complex question answering tasks related to the context without directly accessing the context, effectively transforming in‑context knowledge to in‑parameter knowledge in one pass. Our work achieves outstanding results on various tasks, greatly saves time, computation and memory costs compared to SFT‑based LLM adaptation, and shows great potential for scaling. Our code is available at https://github.com/Yewei‑Liu/SHINE
Authors:Junqi Chen, Sirui Chen, Chaochao Lu
Abstract:
Causal inference is essential for decision‑making but remains challenging for non‑experts. While large language models (LLMs) show promise in this domain, their precise causal estimation capabilities are still limited, and the impact of post‑training on these abilities is insufficiently explored. This paper examines the extent to which post‑training can enhance LLMs' capacity for causal inference. We introduce CauGym, a comprehensive dataset comprising seven core causal tasks for training and five diverse test sets. Using this dataset, we systematically evaluate five post‑training approaches: SFT, DPO, KTO, PPO, and GRPO. Across five in‑domain and four existing benchmarks, our experiments demonstrate that appropriate post‑training enables smaller LLMs to perform causal inference competitively, often surpassing much larger models. Our 14B parameter model achieves 93.5% accuracy on the CaLM benchmark, compared to 55.4% by OpenAI o3. Furthermore, the post‑trained LLMs exhibit strong generalization and robustness under real‑world conditions such as distribution shifts and noisy data. Collectively, these findings provide the first systematic evidence that targeted post‑training can produce reliable and robust LLM‑based causal reasoners. Our data and GRPO‑model are available at https://github.com/OpenCausaLab/CauGym.
Authors:Qifan Zhang, Jianhao Ruan, Aochuan Chen, Kang Zeng, Nuo Chen, Jing Tang, Jia Li
Abstract:
Large Reasoning Models (LRMs) have advanced rapidly; however, existing benchmarks in mathematics, code, and common‑sense reasoning remain limited. They lack long‑context evaluation, offer insufficient challenge, and provide answers that are difficult to verify programmatically. We introduce GrAlgoBench, a benchmark designed to evaluate LRMs through graph algorithm problems. Such problems are particularly well suited for probing reasoning abilities: they demand long‑context reasoning, allow fine‑grained control of difficulty levels, and enable standardized, programmatic evaluation. Across nine tasks, our systematic experiments reveal two major weaknesses of current LRMs. First, accuracy deteriorates sharply as context length increases, falling below 50% once graphs exceed 120 nodes. This degradation is driven by frequent execution errors, weak memory, and redundant reasoning. Second, LRMs suffer from an over‑thinking phenomenon, primarily caused by extensive yet largely ineffective self‑verification, which inflates reasoning traces without improving correctness. By exposing these limitations, GrAlgoBench establishes graph algorithm problems as a rigorous, multidimensional, and practically relevant testbed for advancing the study of reasoning in LRMs. Code is available at https://github.com/Bklight999/GrAlgoBench.
Authors:Patryk Rybak, Paweł Batorski, Paul Swoboda, Przemysław Spurek
Abstract:
Machine unlearning for LLMs aims to remove sensitive or copyrighted data from trained models. However, the true efficacy of current unlearning methods remains uncertain. Standard evaluation metrics rely on benign queries that often mistake superficial information suppression for genuine knowledge removal. Such metrics fail to detect residual knowledge that more sophisticated prompting strategies could still extract. We introduce REBEL, an evolutionary approach for adversarial prompt generation designed to probe whether unlearned data can still be recovered. Our experiments demonstrate that REBEL successfully elicits ``forgotten'' knowledge from models that seemed to be forgotten in standard unlearning benchmarks, revealing that current unlearning methods may provide only a superficial layer of protection. We validate our framework on subsets of the TOFU and WMDP benchmarks, evaluating performance across a diverse suite of unlearning algorithms. Our experiments show that REBEL consistently outperforms static baselines, recovering ``forgotten'' knowledge with Attack Success Rates (ASRs) reaching up to 60% on TOFU and 93% on WMDP. We will make all code publicly available upon acceptance. Code is available at https://github.com/patryk‑rybak/REBEL/
Authors:Yu Zhang, Sean Bin Yang, Arijit Khan, Cuneyt Gurcan Akcora
Abstract:
Counterfactual explanations offer an intuitive way to interpret graph neural networks (GNNs) by identifying minimal changes that alter a model's prediction, thereby answering "what must differ for a different outcome?". In this work, we propose a novel framework, ATEX‑CF that unifies adversarial attack techniques with counterfactual explanation generation‑a connection made feasible by their shared goal of flipping a node's prediction, yet differing in perturbation strategy: adversarial attacks often rely on edge additions, while counterfactual methods typically use deletions. Unlike traditional approaches that treat explanation and attack separately, our method efficiently integrates both edge additions and deletions, grounded in theory, leveraging adversarial insights to explore impactful counterfactuals. In addition, by jointly optimizing fidelity, sparsity, and plausibility under a constrained perturbation budget, our method produces instance‑level explanations that are both informative and realistic. Experiments on synthetic and real‑world node classification benchmarks demonstrate that ATEX‑CF generates faithful, concise, and plausible explanations, highlighting the effectiveness of integrating adversarial insights into counterfactual reasoning for GNNs.
Authors:Peiyang Song, Pengrui Han, Noah Goodman
Abstract:
Large Language Models (LLMs) have exhibited remarkable reasoning capabilities, achieving impressive results across a wide range of tasks. Despite these advances, significant reasoning failures persist, occurring even in seemingly simple scenarios. To systematically understand and address these shortcomings, we present the first comprehensive survey dedicated to reasoning failures in LLMs. We introduce a novel categorization framework that distinguishes reasoning into embodied and non‑embodied types, with the latter further subdivided into informal (intuitive) and formal (logical) reasoning. In parallel, we classify reasoning failures along a complementary axis into three types: fundamental failures intrinsic to LLM architectures that broadly affect downstream tasks; application‑specific limitations that manifest in particular domains; and robustness issues characterized by inconsistent performance across minor variations. For each reasoning failure, we provide a clear definition, analyze existing studies, explore root causes, and present mitigation strategies. By unifying fragmented research efforts, our survey provides a structured perspective on systemic weaknesses in LLM reasoning, offering valuable insights and guiding future research towards building stronger, more reliable, and robust reasoning capabilities. We additionally release a comprehensive collection of research works on LLM reasoning failures, as a GitHub repository at https://github.com/Peiyang‑Song/Awesome‑LLM‑Reasoning‑Failures, to provide an easy entry point to this area.
Authors:Haozhen Zhang, Haodong Yue, Tao Feng, Quanyu Long, Jianzhu Bao, Bowen Jin, Weizhi Zhang, Xiao Li, Jiaxuan You, Chengwei Qin, Wenya Wang
Abstract:
Memory is increasingly central to Large Language Model (LLM) agents operating beyond a single context window, yet most existing systems rely on offline, query‑agnostic memory construction that can be inefficient and may discard query‑critical information. Although runtime memory utilization is a natural alternative, prior work often incurs substantial overhead and offers limited explicit control over the performance‑cost trade‑off. In this work, we present BudgetMem, a runtime agent memory framework for explicit, query‑aware performance‑cost control. BudgetMem structures memory processing as a set of memory modules, each offered in three budget tiers (i.e., \textscLow/\textscMid/\textscHigh). A lightweight router performs budget‑tier routing across modules to balance task performance and memory construction cost, which is implemented as a compact neural policy trained with reinforcement learning. Using BudgetMem as a unified testbed, we study three complementary strategies for realizing budget tiers: implementation (method complexity), reasoning (inference behavior), and capacity (module model size). Across LoCoMo, LongMemEval, and HotpotQA, BudgetMem surpasses strong baselines when performance is prioritized (i.e., high‑budget setting), and delivers better accuracy‑cost frontiers under tighter budgets. Moreover, our analysis disentangles the strengths and weaknesses of different tiering strategies, clarifying when each axis delivers the most favorable trade‑offs under varying budget regimes.
Authors:Ruihang Li, Leigang Qu, Jingxu Zhang, Dongnan Gui, Mengde Xu, Xiaosong Zhang, Han Hu, Wenjie Wang, Jiaqi Wang
Abstract:
The rapid advancement of visual generation models has outpaced traditional evaluation approaches, necessitating the adoption of Vision‑Language Models as surrogate judges. In this work, we systematically investigate the reliability of the prevailing absolute pointwise scoring standard, across a wide spectrum of visual generation tasks. Our analysis reveals that this paradigm is limited due to stochastic inconsistency and poor alignment with human perception. To resolve these limitations, we introduce GenArena, a unified evaluation framework that leverages a pairwise comparison paradigm to ensure stable and human‑aligned evaluation. Crucially, our experiments uncover a transformative finding that simply adopting this pairwise protocol enables off‑the‑shelf open‑source models to outperform top‑tier proprietary models. Notably, our method boosts evaluation accuracy by over 20% and achieves a Spearman correlation of 0.86 with the authoritative LMArena leaderboard, drastically surpassing the 0.36 correlation of pointwise methods. Based on GenArena, we benchmark state‑of‑the‑art visual generation models across diverse tasks, providing the community with a rigorous and automated evaluation standard for visual generation.
Authors:Xianyang Liu, Shangding Gu, Dawn Song
Abstract:
Large language model (LLM)‑based agents are increasingly expected to negotiate, coordinate, and transact autonomously, yet existing benchmarks lack principled settings for evaluating language‑mediated economic interaction among multiple agents. We introduce AgenticPay, a benchmark and simulation framework for multi‑agent buyer‑seller negotiation driven by natural language. AgenticPay models markets in which buyers and sellers possess private constraints and product‑dependent valuations, and must reach agreements through multi‑round linguistic negotiation rather than numeric bidding alone. The framework supports a diverse suite of over 110 tasks ranging from bilateral bargaining to many‑to‑many markets, with structured action extraction and metrics for feasibility, efficiency, and welfare. Benchmarking state‑of‑the‑art proprietary and open‑weight LLMs reveals substantial gaps in negotiation performance and highlights challenges in long‑horizon strategic reasoning, establishing AgenticPay as a foundation for studying agentic commerce and language‑based market interaction. Code and dataset are available at the link: https://github.com/SafeRL‑Lab/AgenticPay.
Authors:Mingxin Liu, Shuran Ma, Shibei Meng, Xiangyu Zhao, Zicheng Zhang, Shaofeng Zhang, Zhihang Zhong, Peixian Chen, Haoyu Cao, Xing Sun, Haodong Duan, Xue Yang
Abstract:
While generative video models have achieved remarkable visual fidelity, their capacity to internalize and reason over implicit world rules remains a critical yet under‑explored frontier. To bridge this gap, we present RISE‑Video, a pioneering reasoning‑oriented benchmark for Text‑Image‑to‑Video (TI2V) synthesis that shifts the evaluative focus from surface‑level aesthetics to deep cognitive reasoning. RISE‑Video comprises 467 meticulously human‑annotated samples spanning eight rigorous categories, providing a structured testbed for probing model intelligence across diverse dimensions, ranging from commonsense and spatial dynamics to specialized subject domains. Our framework introduces a multi‑dimensional evaluation protocol consisting of four metrics: Reasoning Alignment, Temporal Consistency, Physical Rationality, and Visual Quality. To further support scalable evaluation, we propose an automated pipeline leveraging Large Multimodal Models (LMMs) to emulate human‑centric assessment. Extensive experiments on 11 state‑of‑the‑art TI2V models reveal pervasive deficiencies in simulating complex scenarios under implicit constraints, offering critical insights for the advancement of future world‑simulating generative models.
Authors:Mirlan Karimov, Teodora Spasojevic, Markus Braun, Julian Wiederer, Vasileios Belagiannis, Marc Pollefeys
Abstract:
Controllable video generation has emerged as a versatile tool for autonomous driving, enabling realistic synthesis of traffic scenarios. However, existing methods depend on control signals at inference time to guide the generative model towards temporally consistent generation of dynamic objects, limiting their utility as scalable and generalizable data engines. In this work, we propose Localized Semantic Alignment (LSA), a simple yet effective framework for fine‑tuning pre‑trained video generation models. LSA enhances temporal consistency by aligning semantic features between ground‑truth and generated video clips. Specifically, we compare the output of an off‑the‑shelf feature extraction model between the ground‑truth and generated video clips localized around dynamic objects inducing a semantic feature consistency loss. We fine‑tune the base model by combining this loss with the standard diffusion loss. The model fine‑tuned for a single epoch with our novel loss outperforms the baselines in common video generation evaluation metrics. To further test the temporal consistency in generated videos we adapt two additional metrics from object detection task, namely mAP and mIoU. Extensive experiments on nuScenes and KITTI datasets show the effectiveness of our approach in enhancing temporal consistency in video generation without the need for external control signals during inference and any computational overheads.
Authors:Joseph Fioresi, Parth Parag Kulkarni, Ashmal Vayani, Song Wang, Mubarak Shah
Abstract:
Agentic systems solve complex tasks by coordinating multiple agents that iteratively reason, invoke tools, and exchange intermediate results. To improve robustness and solution quality, recent approaches deploy multiple agent teams running in parallel to explore diverse reasoning trajectories. However, parallel execution comes at a significant computational cost: when different teams independently reason about similar sub‑problems or execute analogous steps, they repeatedly perform substantial overlapping computation. To address these limitations, in this paper, we propose Learning to Share (LTS), a learned shared‑memory mechanism for parallel agentic frameworks that enables selective cross‑team information reuse while controlling context growth. LTS introduces a global memory bank accessible to all teams and a lightweight controller that decides whether intermediate agent steps should be added to memory or not. The controller is trained using stepwise reinforcement learning with usage‑aware credit assignment, allowing it to identify information that is globally useful across parallel executions. Experiments on the AssistantBench and GAIA benchmarks show that LTS significantly reduces overall runtime while matching or improving task performance compared to memory‑free parallel baselines, demonstrating that learned memory admission is an effective strategy for improving the efficiency of parallel agentic systems. Project page: https://joefioresi718.github.io/LTS_webpage/
Authors:Junwan Kim, Jiho Park, Seonghu Jeon, Seungryong Kim
Abstract:
Flow matching has recently emerged as a promising alternative to diffusion‑based generative models, particularly for text‑to‑image generation. Despite its flexibility in allowing arbitrary source distributions, most existing approaches rely on a standard Gaussian distribution, a choice inherited from diffusion models, and rarely consider the source distribution itself as an optimization target in such settings. In this work, we show that principled design of the source distribution is not only feasible but also beneficial at the scale of modern text‑to‑image systems. Specifically, we propose learning a condition‑dependent source distribution under flow matching objective that better exploit rich conditioning signals. We identify key failure modes that arise when directly incorporating conditioning into the source, including distributional collapse and instability, and show that appropriate variance regularization and directional alignment between source and target are critical for stable and effective learning. We further analyze how the choice of target representation space impacts flow matching with structured sources, revealing regimes in which such designs are most effective. Extensive experiments across multiple text‑to‑image benchmarks demonstrate consistent and robust improvements, including up to a 3x faster convergence in FID, highlighting the practical benefits of a principled source distribution design for conditional flow matching.
Authors:András Balogh, Márk Jelasity
Abstract:
Generative sequence models are typically trained on sample sequences from natural or formal languages. It is a crucial question whether ‑‑ or to what extent ‑‑ sample‑based training is able to capture the true structure of these languages, often referred to as the ``world model''. Theoretical results indicate that we can hope for soundness at best, that is, generating valid sequences, but not necessarily all of them. However, it is still important to have practical tools that are able to verify whether a given sequence model is sound. In this study, we focus on chess, as it is a domain that provides enough complexity while having a simple rule‑based world model. We propose adversarial sequence generation for verifying the soundness of the sequence model. Our adversaries generate valid sequences so as to force the sequence model to generate an invalid next move prediction. Apart from the falsification of soundness, this method is also suitable for a more fine‑grained analysis of the failure modes and the effects of different choices during training. To demonstrate this, we propose a number of methods for adversarial sequence generation and evaluate the approach on a large set of chess models. We train models on random as well as high‑quality chess games, using several training recipes. We find that none of the models are sound, but some training techniques and dataset choices are able to improve soundness remarkably. We also investigate the potential application of board state probes in both our training and attack methods. Our findings indicate that the extracted board states have no causal role in next token prediction in most of the models.
Authors:Zhaorui Jiang, Yingfang Yuan, Lei Hu, Wei Pang
Abstract:
The integration of spatial multi‑omics data from single tissues is crucial for advancing biological research. However, a significant data imbalance impedes progress: while spatial transcriptomics data is relatively abundant, spatial proteomics data remains scarce due to technical limitations and high costs. To overcome this challenge we propose STProtein, a novel framework leveraging graph neural networks with multi‑task learning strategy. STProtein is designed to accurately predict unknown spatial protein expression using more accessible spatial multi‑omics data, such as spatial transcriptomics. We believe that STProtein can effectively addresses the scarcity of spatial proteomics, accelerating the integration of spatial multi‑omics and potentially catalyzing transformative breakthroughs in life sciences. This tool enables scientists to accelerate discovery by identifying complex and previously hidden spatial patterns of proteins within tissues, uncovering novel relationships between different marker genes, and exploring the biological "Dark Matter".
Authors:Jingze Shi, Zhangyang Peng, Yizhang Zhu, Yifan Wu, Guang Liu, Yuyu Luo
Abstract:
Mixture‑of‑Experts (MoE) architectures are evolving towards finer granularity to improve parameter efficiency. However, existing MoE designs face an inherent trade‑off between the granularity of expert specialization and hardware execution efficiency. We propose OmniMoE, a system‑algorithm co‑designed framework that pushes expert granularity to its logical extreme. OmniMoE introduces vector‑level Atomic Experts, enabling scalable routing and execution within a single MoE layer, while retaining a shared dense MLP branch for general‑purpose processing. Although this atomic design maximizes capacity, it poses severe challenges for routing complexity and memory access. To address these, OmniMoE adopts a system‑algorithm co‑design: (i) a Cartesian Product Router that decomposes the massive index space to reduce routing complexity from O(N) to O(sqrt(N)); and (ii) Expert‑Centric Scheduling that inverts the execution order to turn scattered, memory‑bound lookups into efficient dense matrix operations. Validated on seven benchmarks, OmniMoE (with 1.7B active parameters) achieves 50.9% zero‑shot accuracy across seven benchmarks, outperforming coarse‑grained (e.g., DeepSeekMoE) and fine‑grained (e.g., PEER) baselines. Crucially, OmniMoE reduces inference latency from 73ms to 6.7ms (a 10.9‑fold speedup) compared to PEER, demonstrating that massive‑scale fine‑grained MoE can be fast and accurate. Our code is open‑sourced at https://github.com/flash‑algo/omni‑moe.
Authors:Chang Yang, Chuang Zhou, Yilin Xiao, Su Dong, Luyao Zhuang, Yujing Zhang, Zhu Wang, Zijin Hong, Zheng Yuan, Zhishang Xiang, Shengyuan Chen, Huachi Zhou, Qinggang Zhang, Ninghao Liu, Jinsong Su, Xinrun Wang, Yi Chang, Xiao Huang
Abstract:
Memory emerges as the core module in the Large Language Model (LLM)‑based agents for long‑horizon complex tasks (e.g., multi‑turn dialogue, game playing, scientific discovery), where memory can enable knowledge accumulation, iterative reasoning and self‑evolution. Among diverse paradigms, graph stands out as a powerful structure for agent memory due to the intrinsic capabilities to model relational dependencies, organize hierarchical information, and support efficient retrieval. This survey presents a comprehensive review of agent memory from the graph‑based perspective. First, we introduce a taxonomy of agent memory, including short‑term vs. long‑term memory, knowledge vs. experience memory, non‑structural vs. structural memory, with an implementation view of graph‑based memory. Second, according to the life cycle of agent memory, we systematically analyze the key techniques in graph‑based agent memory, covering memory extraction for transforming the data into the contents, storage for organizing the data efficiently, retrieval for retrieving the relevant contents from memory to support reasoning, and evolution for updating the contents in the memory. Third, we summarize the open‑sourced libraries and benchmarks that support the development and evaluation of self‑evolving agent memory. We also explore diverse application scenarios. Finally, we identify critical challenges and future research directions. This survey aims to offer actionable insights to advance the development of more efficient and reliable graph‑based agent memory systems. All the related resources, including research papers, open‑source data, and projects, are collected for the community in https://github.com/DEEP‑PolyU/Awesome‑GraphMemory.
Authors:Benny Cheung
Abstract:
Traditional ontologies excel at describing domain structure but cannot generate novel artifacts. Large language models generate fluently but produce outputs that lack structural validity, hallucinating mechanisms without components, goals without end conditions. We introduce Generative Ontology, a framework that synthesizes these complementary strengths: ontology provides the grammar; the LLM provides the creativity. Generative Ontology encodes domain knowledge as executable Pydantic schemas that constrain LLM generation via DSPy signatures. A multi‑agent pipeline assigns specialized roles to different ontology domains: a Mechanics Architect designs game systems, a Theme Weaver integrates narrative, a Balance Critic identifies exploits. Each agent carrying a professional "anxiety" that prevents shallow, agreeable outputs. Retrieval‑augmented generation grounds novel designs in precedents from existing exemplars, while iterative validation ensures coherence between mechanisms and components. We demonstrate the framework through GameGrammar, a system for generating complete tabletop game designs. Given a thematic prompt ("bioluminescent fungi competing in a cave ecosystem"), the pipeline produces structurally complete, playable game specifications with mechanisms, components, victory conditions, and setup instructions. These outputs satisfy ontological constraints while remaining genuinely creative. The pattern generalizes beyond games. Any domain with expert vocabulary, validity constraints, and accumulated exemplars (music composition, software architecture, culinary arts) is a candidate for Generative Ontology. We argue that constraints do not limit creativity but enable it: just as grammar makes poetry possible, ontology makes structured generation possible.
Authors:Yayuan Li, Ze Peng, Jian Zhang, Jintao Guo, Yue Duan, Yinghuan Shi
Abstract:
Model merging combines multiple fine‑tuned models into a single model by adding their weight updates, providing a lightweight alternative to retraining. Existing methods primarily target resolving conflicts between task updates, leaving the failure mode of over‑counting shared knowledge unaddressed. We show that when tasks share aligned spectral directions (i.e., overlapping singular vectors), a simple linear combination repeatedly accumulates these directions, inflating the singular values and biasing the merged model toward shared subspaces. To mitigate this issue, we propose Singular Value Calibration (SVC), a training‑free and data‑free post‑processing method that quantifies subspace overlap and rescales inflated singular values to restore a balanced spectrum. Across vision and language benchmarks, SVC consistently improves strong merging baselines and achieves state‑of‑the‑art performance. Furthermore, by modifying only the singular values, SVC improves the performance of Task Arithmetic by 13.0%. Code is available at: https://github.com/lyymuwu/SVC.
Authors:Bingru Li
Abstract:
Data annotation remains a significant bottleneck in the Humanities and Social Sciences, particularly for complex semantic tasks such as metaphor identification. While Large Language Models (LLMs) show promise, a significant gap remains between the theoretical capability of LLMs and their practical utility for researchers. This paper introduces LinguistAgent, an integrated, user‑friendly platform that leverages a reflective multi‑model architecture to automate linguistic annotation. The system implements a dual‑agent workflow, comprising an Annotator and a Reviewer, to simulate a professional peer‑review process. LinguistAgent supports comparative experiments across three paradigms: Prompt Engineering (Zero/Few‑shot), Retrieval‑Augmented Generation, and Fine‑tuning. We demonstrate LinguistAgent's efficacy using the task of metaphor identification as an example, providing real‑time token‑level evaluation (Precision, Recall, and F_1 score) against human gold standards. The application and codes are released on https://github.com/Bingru‑Li/LinguistAgent.
Authors:Budhaditya Mukhopadhyay, Chirag Mandal, Pavan Tummala, Naghmeh Mahmoodian, Andreas Nürnberger, Soumick Chatterjee
Abstract:
Liver tumour ablation presents a significant clinical challenge: whilst tumours are clearly visible on pre‑operative MRI, they are often effectively invisible on intra‑operative CT due to minimal contrast between pathological and healthy tissue. This work investigates the feasibility of cross‑modality weak supervision for scenarios where pathology is visible in one modality (MRI) but absent in another (CT). We present a hybrid registration‑segmentation framework that combines MSCGUNet for inter‑modal image registration with a UNet‑based segmentation module, enabling registration‑assisted pseudo‑label generation for CT images. Our evaluation on the CHAOS dataset demonstrates that the pipeline can successfully register and segment healthy liver anatomy, achieving a Dice score of 0.72. However, when applied to clinical data containing tumours, performance degrades substantially (Dice score of 0.16), revealing the fundamental limitations of current registration methods when the target pathology lacks corresponding visual features in the target modality. We analyse the "domain gap" and "feature absence" problems, demonstrating that whilst spatial propagation of labels via registration is feasible for visible structures, segmenting truly invisible pathology remains an open challenge. Our findings highlight that registration‑based label transfer cannot compensate for the absence of discriminative features in the target modality, providing important insights for future research in cross‑modality medical image analysis. Code an weights are available at: https://github.com/BudhaTronix/Weakly‑Supervised‑Tumour‑Detection
Authors:Dean Fortier, Timothy Adamson, Tess Hellebrekers, Teresa LaScala, Kofi Ennin, Michael Murray, Andrey Kolobov, Galen Mullins
Abstract:
Vision‑Language‑Action (VLA) models have been attracting the attention of researchers and practitioners thanks to their promise of generalization. Although single‑task policies still offer competitive performance, VLAs are increasingly able to handle commands and environments unseen in their training set. While generalization in vision and language space is undoubtedly important for robust versatile behaviors, a key meta‑skill VLAs need to possess is affordance generalization ‑‑ the ability to manipulate new objects with familiar physical features. In this work, we present BusyBox, a physical benchmark for systematic semi‑automatic evaluation of VLAs' affordance generalization. BusyBox consists of 6 modules with switches, sliders, wires, buttons, a display, and a dial. The modules can be swapped and rotated to create a multitude of BusyBox variations with different visual appearances but the same set of affordances. We empirically demonstrate that generalization across BusyBox variants is highly challenging even for strong open‑weights VLAs such as π_0.5 and GR00T‑N1.6. To encourage the research community to evaluate their own VLAs on BusyBox and to propose new affordance generalization experiments, we have designed BusyBox to be easy to build in most robotics labs. We release the full set of CAD files for 3D‑printing its parts as well as a bill of materials for (optionally) assembling its electronics. We also publish a dataset of language‑annotated demonstrations that we collected using the common bimanual Mobile Aloha robot on the canonical BusyBox configuration. All of the released materials are available at https://microsoft.github.io/BusyBox.
Authors:Kritchanat Ponyuenyong, Pengyu Tu, Jia Wei Tan, Wei Soon Cheong, Jamie Ng Suat Ling, Lianlian Jiang
Abstract:
Electricity price forecasting (EPF) is essential for energy markets stakeholders (e.g. grid operators, energy traders, policymakers) but remains challenging due to the inherent volatility and nonlinearity of price signals. Traditional statistical and deep learning (DL) models often struggle to capture complex temporal dependencies and integrate heterogeneous data effectively. While time series foundation models (TSFMs) have shown strong performance in general time series forecasting tasks, such as traffic forecasting and weather forecasting. However, their effectiveness in day‑ahead EPF, particularly in volatile markets, remains underexplored. This paper presents a spike regularization strategy and evaluates a wide range of TSFMs, including Tiny Time Mixers (TTMs), MOIRAI, MOMENT, and TimesFM, against traditional statistical and DL models such as Autoregressive Integrated Moving Average (ARIMA), Long‑short Term Memory (LSTM), and Convolutional Neural Network ‑ LSTM (CNN‑LSTM) using half‑hourly wholesale market data with volatile trends in Singapore. Exogenous factors (e.g. weather and calendar variables) are also incorporated into models where applicable. Results demonstrate that TSFMs consistently outperform traditional approaches, achieving up to 37.4% improvement in MAPE across various evaluation settings. The findings offer practical guidance for improving forecast accuracy and decision‑making in volatile electricity markets.
Authors:Wei Soon Cheong, Lian Lian Jiang, Jamie Ng Suat Ling
Abstract:
Time‑series foundation models have emerged as a new paradigm for forecasting, yet their ability to effectively leverage exogenous features ‑‑ critical for electricity demand forecasting ‑‑ remains unclear. This paper empirically evaluates foundation models capable of modeling cross‑channel correlations against a baseline LSTM with reversible instance normalization across Singaporean and Australian electricity markets at hourly and daily granularities. We systematically assess MOIRAI, MOMENT, TinyTimeMixers, ChronosX, and Chronos‑2 under three feature configurations: all features, selected features, and target‑only. Our findings reveal highly variable effectiveness: while Chronos‑2 achieves the best performance among foundation models (in zero‑shot settings), the simple baseline frequently outperforms all foundation models in Singapore's stable climate, particularly for short‑term horizons. Model architecture proves critical, with synergistic architectural implementations (TTM's channel‑mixing, Chronos‑2's grouped attention) consistently leveraging exogenous features, while other approaches show inconsistent benefits. Geographic context emerges as equally important, with foundation models demonstrating advantages primarily in variable climates. These results challenge assumptions about universal foundation model superiority and highlight the need for domain‑specific models, specifically in the energy domain.
Authors:Yangbin Yu, Mingyu Yang, Junyou Li, Yiming Gao, Feiyu Liu, Yijun Yang, Zichuan Lin, Jiafei Lyu, Yicheng Liu, Zhicong Lu, Deheng Ye, Jie Jiang
Abstract:
Existing Large Language Model (LLM) agents struggle in interactive environments requiring long‑horizon planning, primarily due to compounding errors when simulating future states. To address this, we propose ProAct, a framework that enables agents to internalize accurate lookahead reasoning through a two‑stage training paradigm. First, we introduce Grounded LookAhead Distillation (GLAD), where the agent undergoes supervised fine‑tuning on trajectories derived from environment‑based search. By compressing complex search trees into concise, causal reasoning chains, the agent learns the logic of foresight without the computational overhead of inference‑time search. Second, to further refine decision accuracy, we propose the Monte‑Carlo Critic (MC‑Critic), a plug‑and‑play auxiliary value estimator designed to enhance policy‑gradient algorithms like PPO and GRPO. By leveraging lightweight environment rollouts to calibrate value estimates, MC‑Critic provides a low‑variance signal that facilitates stable policy optimization without relying on expensive model‑based value approximation. Experiments on both stochastic (e.g., 2048) and deterministic (e.g., Sokoban) environments demonstrate that ProAct significantly improves planning accuracy. Notably, a 4B parameter model trained with ProAct outperforms all open‑source baselines and rivals state‑of‑the‑art closed‑source models, while demonstrating robust generalization to unseen environments. The codes and models are available at https://github.com/GreatX3/ProAct
Authors:Zhuokun Chen, Jianfei Cai, Bohan Zhuang
Abstract:
Generating long‑form content, such as minute‑long videos and extended texts, is increasingly important for modern generative models. Block diffusion improves inference efficiency via KV caching and block‑wise causal inference and has been widely adopted in diffusion language models and video generation. However, in long‑context settings, block diffusion still incurs substantial overhead from repeatedly computing attention over a growing KV cache. We identify an underexplored property of block diffusion: cross‑step redundancy of attention within a block. Our analysis shows that attention outputs from tokens outside the current block remain largely stable across diffusion steps, while block‑internal attention varies significantly. Based on this observation, we propose FlashBlock, a cached block‑external attention mechanism that reuses stable attention output, reducing attention computation and KV cache access without modifying the diffusion process. Moreover, FlashBlock is orthogonal to sparse attention and can be combined as a complementary residual reuse strategy, substantially improving model accuracy under aggressive sparsification. Experiments on diffusion language models and video generation demonstrate up to 1.44× higher token throughput and up to 1.6× reduction in attention time, with negligible impact on generation quality. Project page: https://caesarhhh.github.io/FlashBlock/.
Authors:Haoran Li, Sucheng Ren, Alan Yuille, Feng Wang
Abstract:
Rotary Positional Embedding (RoPE) is a key component of context scaling in Large Language Models (LLMs). While various methods have been proposed to adapt RoPE to longer contexts, their guiding principles generally fall into two categories: (1) out‑of‑distribution (OOD) mitigation, which scales RoPE frequencies to accommodate unseen positions, and (2) Semantic Modeling, which posits that the attention scores computed with RoPE should always prioritize semantically similar tokens. In this work, we unify these seemingly distinct objectives through a minimalist intervention, namely CoPE: soft clipping lowfrequency components of RoPE. CoPE not only eliminates OOD outliers and refines semantic signals, but also prevents spectral leakage caused by hard clipping. Extensive experiments demonstrate that simply applying our soft clipping strategy to RoPE yields significant performance gains that scale up to 256k context length, validating our theoretical analysis and establishing CoPE as a new state‑of‑the‑art for length generalization. Our code, data, and models are available at https://github.com/hrlics/CoPE.
Authors:Guozhi Liu, Weiwei Lin, Tiansheng Huang, Ruichao Mo, Qi Mu, Xiumin Wang, Li Shen
Abstract:
Harmful fine‑tuning can invalidate safety alignment of large language models, exposing significant safety risks. In this paper, we utilize the attention sink mechanism to mitigate harmful fine‑tuning. Specifically, we first measure a statistic named \emphsink divergence for each attention head and observe that \emphdifferent attention heads exhibit two different signs of sink divergence. To understand its safety implications, we conduct experiments and find that the number of attention heads of positive sink divergence increases along with the increase of the model's harmfulness when undergoing harmful fine‑tuning. Based on this finding, we propose a separable sink divergence hypothesis ‑‑ \emphattention heads associating with learning harmful patterns during fine‑tuning are separable by their sign of sink divergence. Based on the hypothesis, we propose a fine‑tuning‑stage defense, dubbed Surgery. Surgery utilizes a regularizer for sink divergence suppression, which steers attention heads toward the negative sink divergence group, thereby reducing the model's tendency to learn and amplify harmful patterns. Extensive experiments demonstrate that Surgery improves defense performance by 5.90%, 11.25%, and 9.55% on the BeaverTails, HarmBench, and SorryBench benchmarks, respectively. Source code is available on https://github.com/Lslland/Surgery.
Authors:Luke Alexander, Eric Leonen, Sophie Szeto, Artemii Remizov, Ignacio Tejeda, Giovanni Inchiostro, Vasily Ilin
Abstract:
Searching for mathematical results remains difficult: most existing tools retrieve entire papers, while mathematicians and theorem‑proving agents often seek a specific theorem, lemma, or proposition that answers a query. While semantic search has seen rapid progress, its behavior on large, highly technical corpora such as research‑level mathematical theorems remains poorly understood. In this work, we introduce and study semantic theorem retrieval at scale over a unified corpus of 9.2 million theorem statements extracted from arXiv and seven other sources, representing the largest publicly available corpus of human‑authored, research‑level theorems. We represent each theorem with a short natural‑language description as a retrieval representation and systematically analyze how representation context, language model choice, embedding model, and prompting strategy affect retrieval quality. On a curated evaluation set of theorem‑search queries written by professional mathematicians, our approach substantially improves both theorem‑level and paper‑level retrieval compared to existing baselines, demonstrating that semantic theorem search is feasible and effective at web scale. The theorem search tool is available at \hrefhttps://huggingface.co/spaces/uw‑math‑ai/theorem‑searchthis link, and the dataset is available at \hrefhttps://huggingface.co/datasets/uw‑math‑ai/TheoremSearchthis link.
Authors:Magesh Rajasekaran, Md Saiful Sajol, Chris Alvin, Supratik Mukhopadhyay, Yanda Ou, Z. George Xue
Abstract:
Coastal hypoxia, especially in the northern part of Gulf of Mexico, presents a persistent ecological and economic concern. Seasonal models offer coarse forecasts that miss the fine‑scale variability needed for daily, responsive ecosystem management. We present study that compares four deep learning architectures for daily hypoxia classification: Bidirectional Long Short‑Term Memory (BiLSTM), Medformer (Medical Transformer), Spatio‑Temporal Transformer (ST‑Transformer), and Temporal Convolutional Network (TCN). We trained our models with twelve years of daily hindcast data from 2009‑2020 Our training data consists of 2009‑2020 hindcast data from a coupled hydrodynamic‑biogeochemical model. Similarly, we use hindcast data from 2020 through 2024 as a test data. We constructed classification models incorporating water column stratification, sediment oxygen consumption, and temperature‑dependent decomposition rates. We evaluated each architectures using the same data preprocessing, input/output formulation, and validation protocols. Each model achieved high classification accuracy and strong discriminative ability with ST‑Transformer achieving the highest performance across all metrics and tests periods (AUC‑ROC: 0.982‑0.992). We also employed McNemar's method to identify statistically significant differences in model predictions. Our contribution is a reproducible framework for operational real‑time hypoxia prediction that can support broader efforts in the environmental and ocean modeling systems community and in ecosystem resilience. The source code is available https://github.com/rmagesh148/hypoxia‑ai/
Authors:Abdul Joseph Fofanah, Lian Wen, David Chen, Alpha Alimamy Kamara, Zhongyi Zhang
Abstract:
Traffic prediction in data‑scarce, cross‑city settings is challenging due to complex nonlinear dynamics and domain shifts. Existing methods often fail to capture traffic's inherent chaotic nature for effective few‑shot learning. We propose CAST‑CKT, a novel Chaos‑Aware Spatio‑Temporal and Cross‑City Knowledge Transfer framework. It employs an efficient chaotic analyser to quantify traffic predictability regimes, driving several key innovations: chaos‑aware attention for regime‑adaptive temporal modelling; adaptive topology learning for dynamic spatial dependencies; and chaotic consistency‑based cross‑city alignment for knowledge transfer. The framework also provides horizon‑specific predictions with uncertainty quantification. Theoretical analysis shows improved generalisation bounds. Extensive experiments on four benchmarks in cross‑city few‑shot settings show CAST‑CKT outperforms state‑of‑the‑art methods by significant margins in MAE and RMSE, while offering interpretable regime analysis. Code is available at https://github.com/afofanah/CAST‑CKT.
Authors:Rohan Patil, Jai Malegaonkar, Xiao Jiang, Andre Dion, Gaurav S. Sukhatme, Henrik I. Christensen
Abstract:
As intelligent systems and multi‑agent coordination become increasingly central to real‑world applications, there is a growing need for simulation tools that are both scalable and accessible. Existing high‑fidelity simulators, while powerful, are often computationally expensive and ill‑suited for rapid prototyping or large‑scale agent deployments. We present GAMMS (Graph based Adversarial Multiagent Modeling Simulator), a lightweight yet extensible simulation framework designed to support fast development and evaluation of agent behavior in environments that can be represented as graphs. GAMMS emphasizes five core objectives: scalability, ease of use, integration‑first architecture, fast visualization feedback, and real‑world grounding. It enables efficient simulation of complex domains such as urban road networks and communication systems, supports integration with external tools (e.g., machine learning libraries, planning solvers), and provides built‑in visualization with minimal configuration. GAMMS is agnostic to policy type, supporting heuristic, optimization‑based, and learning‑based agents, including those using large language models. By lowering the barrier to entry for researchers and enabling high‑performance simulations on standard hardware, GAMMS facilitates experimentation and innovation in multi‑agent systems, autonomous planning, and adversarial modeling. The framework is open‑source and available at https://github.com/GAMMSim/GAMMS/
Authors:Georgii Aparin, Tasnima Sadekova, Alexey Rukhovich, Assel Yermekova, Laida Kushnareva, Vadim Popov, Kristian Kuznetsov, Irina Piontkovskaya
Abstract:
Sparse Autoencoders (SAEs) are powerful tools for interpreting neural representations, yet their use in audio remains underexplored. We train SAEs across all encoder layers of Whisper and HuBERT, provide an extensive evaluation of their stability, interpretability, and show their practical utility. Over 50% of the features remain consistent across random seeds, and reconstruction quality is preserved. SAE features capture general acoustic and semantic information as well as specific events, including environmental noises and paralinguistic sounds (e.g. laughter, whispering) and disentangle them effectively, requiring removal of only 19‑27% of features to erase a concept. Feature steering reduces Whisper's false speech detections by 70% with negligible WER increase, demonstrating real‑world applicability. Finally, we find SAE features correlated with human EEG activity during speech perception, indicating alignment with human neural processing. The code and checkpoints are available at https://github.com/audiosae/audiosae_demo.
Authors:Davide Berasi, Matteo Farina, Massimiliano Mancini, Elisa Ricci
Abstract:
Selecting the best data mixture is critical for successful Supervised Fine‑Tuning (SFT) of Multimodal Large Language Models. However, determining the optimal mixture weights across multiple domain‑specific datasets remains a significant bottleneck due to the combinatorial search space and the high cost associated with even a single training run. This is the so‑called Data Mixture Optimization (DMO) problem. On the other hand, model merging unifies domain‑specific experts through parameter interpolation. This strategy is efficient, as it only requires a single training run per domain, yet oftentimes leads to suboptimal models. In this work, we take the best of both worlds, studying model merging as an efficient strategy for estimating the performance of different data mixtures. We train domain‑specific multimodal experts and evaluate their weighted parameter‑space combinations to estimate the efficacy of corresponding data mixtures. We conduct extensive experiments on 14 multimodal benchmarks, and empirically demonstrate that the merged proxy models exhibit a high rank correlation with models trained on actual data mixtures. This decouples the search for optimal mixtures from the resource‑intensive training process, thereby providing a scalable and efficient strategy for navigating the complex landscape of mixture weights. Code is publicly available at https://github.com/BerasiDavide/mLLMs_merging_4_DMO.
Authors:Zhenning Shi, Yijia Zhu, Junhan Shi, Xun Zhang, Lei Wang, Congcong Miao
Abstract:
The internalization of chain‑of‑thought processes into hidden states has emerged as a highly efficient paradigm for scaling test‑time compute. However, existing activation steering methods rely on static control vectors that fail to adapt to the non‑stationary evolution of complex reasoning tasks. To address this limitation, we propose STIR (Self‑Distilled Tools for Internal Reasoning), a framework that reformulates reasoning enhancement as a dynamic latent trajectory control problem. STIR introduces a synergistic three‑stage pipeline: (1) differential intrinsic action induction harvests latent reasoning successes to crystallize steering primitives; (2) sparse control basis construction curates a compact, geometrically diverse tool library; and (3) value‑modulated trajectory intervention dynamically injects context‑specific impulses via anchor‑based gating. Extensive experiments on six arithmetic and logical benchmarks across four representative models demonstrate that STIR improves average accuracy by 1.9% to 7.5% while reducing average token consumption by up to 35% compared to vanilla decoding. These findings demonstrate that the benefits of explicit chain‑of‑thought can be realized through dynamic latent trajectory control, internalizing the reasoning process to bypass the explicit generation while achieving superior fidelity. Our code is available at https://github.com/sznnzs/LLM‑Latent‑Action.
Authors:Licheng Pan, Yunsheng Lu, Jiexi Liu, Jialing Tao, Haozhe Feng, Hui Xue, Zhixuan Chu, Kui Ren
Abstract:
Uncovering the mechanisms behind "jailbreaks" in large language models (LLMs) is crucial for enhancing their safety and reliability, yet these mechanisms remain poorly understood. Existing studies predominantly analyze jailbreak prompts by probing latent representations, often overlooking the causal relationships between interpretable prompt features and jailbreak occurrences. In this work, we propose Causal Analyst, a framework that integrates LLMs into data‑driven causal discovery to identify the direct causes of jailbreaks and leverage them for both attack and defense. We introduce a comprehensive dataset comprising 35k jailbreak attempts across seven LLMs, systematically constructed from 100 attack templates and 50 harmful queries, annotated with 37 meticulously designed human‑readable prompt features. By jointly training LLM‑based prompt encoding and GNN‑based causal graph learning, we reconstruct causal pathways linking prompt features to jailbreak responses. Our analysis reveals that specific features, such as "Positive Character" and "Number of Task Steps", act as direct causal drivers of jailbreaks. We demonstrate the practical utility of these insights through two applications: (1) a Jailbreaking Enhancer that targets identified causal features to significantly boost attack success rates on public benchmarks, and (2) a Guardrail Advisor that utilizes the learned causal graph to extract true malicious intent from obfuscated queries. Extensive experiments, including baseline comparisons and causal structure validation, confirm the robustness of our causal analysis and its superiority over non‑causal approaches. Our results suggest that analyzing jailbreak features from a causal perspective is an effective and interpretable approach for improving LLM reliability. Our code is available at https://github.com/Master‑PLC/Causal‑Analyst.
Authors:Ishaq Aden-Ali, Noah Golowich, Allen Liu, Abhishek Shetty, Ankur Moitra, Nika Haghtalab
Abstract:
Training modern large language models (LLMs) has become a veritable smorgasbord of algorithms and datasets designed to elicit particular behaviors, making it critical to develop techniques to understand the effects of datasets on the model's properties. This is exacerbated by recent experiments that show datasets can transmit signals that are not directly observable from individual datapoints, posing a conceptual challenge for dataset‑centric understandings of LLM training and suggesting a missing fundamental account of such phenomena. Towards understanding such effects, inspired by recent work on the linear structure of LLMs, we uncover a general mechanism through which hidden subtexts can arise in generic datasets. We introduce Logit‑Linear‑Selection (LLS), a method that prescribes how to select subsets of a generic preference dataset to elicit a wide range of hidden effects. We apply LLS to discover subsets of real‑world datasets so that models trained on them exhibit behaviors ranging from having specific preferences, to responding to prompts in a different language not present in the dataset, to taking on a different persona. Crucially, the effect persists for the selected subset, across models with varying architectures, supporting its generality and universality.
Authors:Jiarui Yuan, Tailin Jin, Weize Chen, Zeyuan Liu, Zhiyuan Liu, Maosong Sun
Abstract:
True self‑evolution requires agents to act as lifelong learners that internalize novel experiences to solve future problems. However, rigorously measuring this foundational capability is hindered by two obstacles: the entanglement of prior knowledge, where ``new'' knowledge may appear in pre‑training data, and the entanglement of reasoning complexity, where failures may stem from problem difficulty rather than an inability to recall learned knowledge. We introduce SE‑Bench, a diagnostic environment that obfuscates the NumPy library and its API doc into a pseudo‑novel package with randomized identifiers. Agents are trained to internalize this package and evaluated on simple coding tasks without access to documentation, yielding a clean setting where tasks are trivial with the new API doc but impossible for base models without it. Our investigation reveals three insights: (1) the Open‑Book Paradox, where training with reference documentation inhibits retention, requiring "Closed‑Book Training" to force knowledge compression into weights; (2) the RL Gap, where standard RL fails to internalize new knowledge completely due to PPO clipping and negative gradients; and (3) the viability of Self‑Play for internalization, proving models can learn from self‑generated, noisy tasks when coupled with SFT, but not RL. Overall, SE‑Bench establishes a rigorous diagnostic platform for self‑evolution with knowledge internalization. Our code and dataset can be found at https://github.com/thunlp/SE‑Bench.
Authors:Moritz Miller, Florent Draye, Bernhard Schölkopf
Abstract:
With recent progress on fine‑tuning language models around a fixed sparse autoencoder, we disentangle the decoder matrix into almost orthogonal features. This reduces interference and superposition between the features, while keeping performance on the target dataset essentially unchanged. Our orthogonality penalty leads to identifiable features, ensuring the uniqueness of the decomposition. Further, we find that the distance between embedded feature explanations increases with stricter orthogonality penalty, a desirable property for interpretability. Invoking the Independent Causal Mechanisms principle, we argue that orthogonality promotes modular representations amenable to causal intervention. We empirically show that these increasingly orthogonalized features allow for isolated interventions. Our code is available under \texttthttps://github.com/mrtzmllr/sae‑icm.
Authors:Xianbiao Qi, Marco Chen, Jiaquan Ye, Yelin He, Rong Xiao
Abstract:
The Muon optimizer has recently attracted considerable attention for its strong empirical performance and use of orthogonalized updates on matrix‑shaped parameters, yet its underlying mechanisms and relationship to adaptive optimizers such as Adam remain insufficiently understood. In this work, we aim to address these questions through a unified spectral perspective. Specifically, we view Muon as the p = 0 endpoint of a family of spectral transformations of the form U \boldsymbolΣ^p V' , and consider additional variants with p = 1/2 , p = 1/4 , and p = 1 . These transformations are applied to both first‑moment updates, as in momentum SGD, and to root‑mean‑square (RMS) normalized gradient updates as in Adam. To enable efficient computation, we develop a coupled Newton iteration that avoids explicit singular value decomposition. Across controlled experiments, we find that RMS‑normalized updates yield more stable optimization than first‑moment updates. Moreover, while spectral compression provides strong stabilization benefits under first‑moment updates, the Muon update (p = 0) does not consistently outperform Adam. These results suggest that Muon is best understood as an effective form of spectral normalization, but not a universally superior optimization method. Our source code will be released at https://github.com/Ocram7/BeyondMuon.
Authors:Jaeyoon Jung, Yejun Yoon, Seunghyun Yoon, Kunwoo Park
Abstract:
This paper describes VILLAIN, a multimodal fact‑checking system that verifies image‑text claims through prompt‑based multi‑agent collaboration. For the AVerImaTeC shared task, VILLAIN employs vision‑language model agents across multiple stages of fact‑checking. Textual and visual evidence is retrieved from the knowledge store enriched through additional web collection. To identify key information and address inconsistencies among evidence items, modality‑specific and cross‑modal agents generate analysis reports. In the subsequent stage, question‑answer pairs are produced based on these reports. Finally, the Verdict Prediction agent produces the verification outcome based on the image‑text claim and the generated question‑answer pairs. Our system ranked first on the leaderboard across all evaluation metrics. The source code is publicly available at https://github.com/ssu‑humane/VILLAIN.
Authors:Zhiyi Chen, Eun Cheol Choi, Yingjia Luo, Xinyi Wang, Yulei Xiao, Aizi Yang, Luca Luceri
Abstract:
People increasingly seek advice online from both human peers and large language model (LLM)‑based chatbots. Such advice rarely involves identifying a single correct answer; instead, it typically requires navigating trade‑offs among competing values. We aim to characterize how LLMs navigate value trade‑offs across different advice‑seeking contexts. First, we examine the value trade‑off structure underlying advice seeking using a curated dataset from four advice‑oriented subreddits. Using a bottom‑up approach, we inductively construct a hierarchical value framework by aggregating fine‑grained values extracted from individual advice options into higher‑level value categories. We construct value co‑occurrence networks to characterize how values co‑occur within dilemmas and find substantial heterogeneity in value trade‑off structures across advice‑seeking contexts: a women‑focused subreddit exhibits the highest network density, indicating more complex value conflicts; women's, men's, and friendship‑related subreddits exhibit highly correlated value‑conflict patterns centered on security‑related tensions (security vs. respect/connection/commitment); by contrast, career advice forms a distinct structure where security frequently clashes with self‑actualization and growth. We then evaluate LLM value preferences against these dilemmas and find that, across models and contexts, LLMs consistently prioritize values related to Exploration & Growth over Benevolence & Connection. This systemically skewed value orientation highlights a potential risk of value homogenization in AI‑mediated advice, raising concerns about how such systems may shape decision‑making and normative outcomes at scale.
Authors:Lunjun Zhang, Jimmy Ba
Abstract:
Reinforcement Learning (RL) has enabled Large Language Models (LLMs) to acquire increasingly complex reasoning and agentic behaviors. In this work, we propose two simple techniques to improve policy gradient algorithms for LLMs. First, we replace the fixed anchor policy during RL with an Exponential Moving Average (EMA), similar to a target network in deep Q‑learning. Second, we introduce Top‑k KL estimator, which allows for flexible interpolation between exact KL and sampled KL. We derive the stability conditions for using EMA anchor; moreover, we show that our Top‑k KL estimator yields both unbiased KL values and unbiased gradients at any k, while bringing the benefits of exact KL. When combined with GRPO, the two techniques (EMA‑PG) lead to a significant performance boost. On math reasoning, it allows R1‑distilled Qwen‑1.5B to reach 53.9% on OlympiadBench compared to 50.8% by GRPO. On agentic RL domains, with Qwen‑3B base, EMA‑PG improves GRPO by an average of 33.3% across 7 datasets of Q&A with search engines, including 29.7% \rightarrow 44.1% on HotpotQA, 27.4% \rightarrow 40.1% on 2WikiMultiHopQA. Overall, we show that EMA‑PG is a simple, principled, and powerful approach to scaling RL for LLMs. Code: https://github.com/LunjunZhang/ema‑pg
Authors:Aavash Chhetri, Bibek Niroula, Pratik Shrestha, Yash Raj Shrestha, Lesley A Anderson, Prashnna K Gyawali, Loris Bazzani, Binod Bhattarai
Abstract:
Federated learning (FL) enables collaborative model training across decentralized medical institutions while preserving data privacy. However, medical FL benchmarks remain scarce, with existing efforts focusing mainly on unimodal or bimodal modalities and a limited range of medical tasks. This gap underscores the need for standardized evaluation to advance systematic understanding in medical MultiModal FL (MMFL). To this end, we introduce Med‑MMFL, the first comprehensive MMFL benchmark for the medical domain, encompassing diverse modalities, tasks, and federation scenarios. Our benchmark evaluates six representative state‑of‑the‑art FL algorithms, covering different aggregation strategies, loss formulations, and regularization techniques. It spans datasets with 2 to 4 modalities, comprising a total of 10 unique medical modalities, including text, pathology images, ECG, X‑ray, radiology reports, and multiple MRI sequences. Experiments are conducted across naturally federated, synthetic IID, and synthetic non‑IID settings to simulate real‑world heterogeneity. We assess segmentation, classification, modality alignment (retrieval), and VQA tasks. To support reproducibility and fair comparison of future multimodal federated learning (MMFL) methods under realistic medical settings, we release the complete benchmark implementation, including data processing and partitioning pipelines, at https://github.com/bhattarailab/Med‑MMFL‑Benchmark .
Authors:Yujie Lin, Kunquan Li, Yixuan Liao, Xiaoxin Chen, Jinsong Su
Abstract:
Large language models (LLMs) have demonstrated impressive capabilities across a wide range of natural language processing tasks. However, their outputs often exhibit social biases, raising fairness concerns. Existing debiasing methods, such as fine‑tuning on additional datasets or prompt engineering, face scalability issues or compromise user experience in multi‑turn interactions. To address these challenges, we propose a framework for detecting stereotype‑inducing words and attributing neuron‑level bias in LLMs, without the need for fine‑tuning or prompt modification. Our framework first identifies stereotype‑inducing adjectives and nouns via comparative analysis across demographic groups. We then attribute biased behavior to specific neurons using two attribution strategies based on integrated gradients. Finally, we mitigate bias by directly intervening on their activations at the projection layer. Experiments on three widely used LLMs demonstrate that our method effectively reduces bias while preserving overall model performance. Code is available at the github link: https://github.com/XMUDeepLIT/Bi‑directional‑Bias‑Attribution.
Authors:Zekun Li, Ning Wang, Tongxin Bai, Changwang Mei, Peisong Wang, Shuang Qiu, Jian Cheng
Abstract:
Visual AutoRegressive (VAR) modeling has garnered significant attention for its innovative next‑scale prediction paradigm. However, mainstream VAR paradigms attend to all tokens across historical scales at each autoregressive step. As the next scale resolution grows, the computational complexity of attention increases quartically with resolution, causing substantial latency. Prior accelerations often skip high‑resolution scales, which speeds up inference but discards high‑frequency details and harms image quality. To address these problems, we present SparVAR, a training‑free acceleration framework that exploits three properties of VAR attention: (i) strong attention sinks, (ii) cross‑scale activation similarity, and (iii) pronounced locality. Specifically, we dynamically predict the sparse attention pattern of later high‑resolution scales from a sparse decision scale, and construct scale self‑similar sparse attention via an efficient index‑mapping mechanism, enabling high‑efficiency sparse attention computation at large scales. Furthermore, we propose cross‑scale local sparse attention and implement an efficient block‑wise sparse kernel, which achieves \mathbf> 5× faster forward speed than FlashAttention. Extensive experiments demonstrate that the proposed SparseVAR can reduce the generation time of an 8B model producing 1024×1024 high‑resolution images to the 1s, without skipping the last scales. Compared with the VAR baseline accelerated by FlashAttention, our method achieves a \mathbf1.57× speed‑up while preserving almost all high‑frequency details. When combined with existing scale‑skipping strategies, SparseVAR attains up to a \mathbf2.28× acceleration, while maintaining competitive visual generation quality. Code is available at https://github.com/CAS‑CLab/SparVAR.
Authors:Teng-Fang Hsiao, Bo-Kai Ruan, Yu-Lun Liu, Hong-Han Shuai
Abstract:
3D editing has emerged as a critical research area to provide users with flexible control over 3D assets. While current editing approaches predominantly focus on 3D Gaussian Splatting or multi‑view images, the direct editing of 3D meshes remains underexplored. Prior attempts, such as VoxHammer, rely on voxel‑based representations that suffer from limited resolution and necessitate labor‑intensive 3D mask. To address these limitations, we propose VecSet‑Edit, the first pipeline that leverages the high‑fidelity VecSet Large Reconstruction Model (LRM) as a backbone for mesh editing. Our approach is grounded on a analysis of the spatial properties in VecSet tokens, revealing that token subsets govern distinct geometric regions. Based on this insight, we introduce Mask‑guided Token Seeding and Attention‑aligned Token Gating strategies to precisely localize target regions using only 2D image conditions. Also, considering the difference between VecSet diffusion process versus voxel we design a Drift‑aware Token Pruning to reject geometric outliers during the denoising process. Finally, our Detail‑preserving Texture Baking module ensures that we not only preserve the geometric details of original mesh but also the textural information. More details can be found in our project page: https://github.com/BlueDyee/VecSet‑Edit/tree/main
Authors:Wenjun Peng, Xinyu Wang, Qi Wu
Abstract:
Large language models (LLMs) have revolutionized automated code generation, yet the evaluation of their real‑world effectiveness remains limited by static benchmarks and simplistic metrics. We present ProxyWar, a novel framework that systematically assesses code generation quality by embedding LLM‑generated agents within diverse, competitive game environments. Unlike existing approaches, ProxyWar evaluates not only functional correctness but also the operational characteristics of generated programs, combining automated testing, iterative code repair, and multi‑agent tournaments to provide a holistic view of program behavior. Applied to a range of state‑of‑the‑art coders and games, our approach uncovers notable discrepancies between benchmark scores and actual performance in dynamic settings, revealing overlooked limitations and opportunities for improvement. These findings highlight the need for richer, competition‑based evaluation of code generation. Looking forward, ProxyWar lays a foundation for research into LLM‑driven algorithm discovery, adaptive problem solving, and the study of practical efficiency and robustness, including the potential for models to outperform hand‑crafted agents. The project is available at https://github.com/xinke‑wang/ProxyWar.
Authors:Yansong Ning, Jun Fang, Naiqiang Tan, Hao Liu
Abstract:
Managing agent thought and observation during multi‑turn agent‑environment interactions is an emerging strategy to improve agent efficiency. However, existing studies treat the entire interaction trajectories equally, overlooking the thought necessity and observation utility varies across turns. To this end, we first conduct quantitative investigations into how thought and observation affect agent effectiveness and efficiency. Based on our findings, we propose Agent‑Omit, a unified training framework that empowers LLM agents to adaptively omit redundant thoughts and observations. Specifically, we first synthesize a small amount of cold‑start data, including both single‑turn and multi‑turn omission scenarios, to fine‑tune the agent for omission behaviors. Furthermore, we introduce an omit‑aware agentic reinforcement learning approach, incorporating a dual sampling mechanism and a tailored omission reward to incentivize the agent's adaptive omission capability. Theoretically, we prove that the deviation of our omission policy is upper‑bounded by KL‑divergence. Experimental results on five agent benchmarks show that our constructed Agent‑Omit‑8B could obtain performance comparable to seven frontier LLM agent, and achieve the best effectiveness‑efficiency trade‑off than seven efficient LLM agents methods. Our code and data are available at https://github.com/usail‑hkust/Agent‑Omit.
Authors:Lifan Wu, Ruijie Zhu, Yubo Ai, Tianzhu Zhang
Abstract:
4D generation has made remarkable progress in synthesizing dynamic 3D objects from input text, images, or videos. However, existing methods often represent motion as an implicit deformation field, which limits direct control and editability. To address this issue, we propose SkeletonGaussian, a novel framework for generating editable dynamic 3D Gaussians from monocular video input. Our approach introduces a hierarchical articulated representation that decomposes motion into sparse rigid motion explicitly driven by a skeleton and fine‑grained non‑rigid motion. Concretely, we extract a robust skeleton and drive rigid motion via linear blend skinning, followed by a hexplane‑based refinement for non‑rigid deformations, enhancing interpretability and editability. Experimental results demonstrate that SkeletonGaussian surpasses existing methods in generation quality while enabling intuitive motion editing, establishing a new paradigm for editable 4D generation. Project page: https://wusar.github.io/projects/skeletongaussian/
Authors:Zeming Wei, Qiaosheng Zhang, Xia Hu, Xingcheng Xu
Abstract:
Large Reasoning Models (LRMs) have achieved tremendous success with their chain‑of‑thought (CoT) reasoning, yet also face safety issues similar to those of basic language models. In particular, while algorithms are designed to guide them to deliberately refuse harmful prompts with safe reasoning, this process often fails to generalize against diverse and complex jailbreak attacks. In this work, we attribute these failures to the generalization of the safe reasoning process, particularly their insufficiency against complex attack prompts. We provide both theoretical and empirical evidence to show the necessity of a more sufficient safe reasoning process to defend against advanced attack prompts. Building on this insight, we propose a Risk‑Aware Preference Optimization (RAPO) framework that enables LRM to adaptively identify and address the safety risks with appropriate granularity in its thinking content. Extensive experiments demonstrate that RAPO successfully generalizes multiple LRMs' safe reasoning adaptively across diverse attack prompts whilst preserving general utility, contributing a robust alignment technique for LRM safety. Our code is available at https://github.com/weizeming/RAPO.
Authors:Xinyue Wang, Yuanhe Zhang, Zhengshuo Gong, Haoran Gao, Fanyu Meng, Zhenhong Zhou, Li Sun, Yang Liu, Sen Su
Abstract:
The enhanced capabilities of LLM‑based agents come with an emergency for model planning and tool‑use abilities. Attributing to helpful‑harmless trade‑off from LLM alignment, agents typically also inherit the flaw of "over‑refusal", which is a passive failure mode. However, the proactive planning and action capabilities of agents introduce another crucial danger on the other side of the trade‑off. This phenomenon we term "Toxic Proactivity'': an active failure mode in which an agent, driven by the optimization for Machiavellian helpfulness, disregards ethical constraints to maximize utility. Unlike over‑refusal, Toxic Proactivity manifests as the agent taking excessive or manipulative measures to ensure its "usefulness'' is maintained. Existing research pays little attention to identifying this behavior, as it often lacks the subtle context required for such strategies to unfold. To reveal this risk, we introduce a novel evaluation framework based on dilemma‑driven interactions between dual models, enabling the simulation and analysis of agent behavior over multi‑step behavioral trajectories. Through extensive experiments with mainstream LLMs, we demonstrate that Toxic Proactivity is a widespread behavioral phenomenon and reveal two major tendencies. We further present a systematic benchmark for evaluating Toxic Proactive behavior across contextual settings.
Authors:Angel Martinez-Sanchez, Parthib Roy, Ross Greer
Abstract:
Instruction‑grounded driving, where passenger language guides trajectory planning, requires vehicles to understand intent before motion. However, most prior instruction‑following planners rely on simulation or fixed command vocabularies, limiting real‑world generalization. doScenes, the first real‑world dataset linking free‑form instructions (with referentiality) to nuScenes ground‑truth motion, enables instruction‑conditioned planning. In this work, we adapt OpenEMMA, an open‑source MLLM‑based end‑to‑end driving framework that ingests front‑camera views and ego‑state and outputs 10‑step speed‑curvature trajectories, to this setting, presenting a reproducible instruction‑conditioned baseline on doScenes and investigate the effects of human instruction prompts on predicted driving behavior. We integrate doScenes directives as passenger‑style prompts within OpenEMMA's vision‑language interface, enabling linguistic conditioning before trajectory generation. Evaluated on 849 annotated scenes using ADE, we observe that instruction conditioning substantially improves robustness by preventing extreme baseline failures, yielding a 98.7% reduction in mean ADE. When such outliers are removed, instructions still influence trajectory alignment, with well‑phrased prompts improving ADE by up to 5.1%. We use this analysis to discuss what makes a "good" instruction for the OpenEMMA framework. We release the evaluation prompts and scripts to establish a reproducible baseline for instruction‑aware planning. GitHub: https://github.com/Mi3‑Lab/doScenes‑VLM‑Planning
Authors:Chenhe Du, Qing Wu, Xuanyu Tian, Jingyi Yu, Hongjiang Wei, Yuyao Zhang
Abstract:
3D medical imaging is in high demand and essential for clinical diagnosis and scientific research. Currently, diffusion models (DMs) have become an effective tool for medical imaging reconstruction thanks to their ability to learn rich, high‑quality data priors. However, learning the 3D data distribution with DMs in medical imaging is challenging, not only due to the difficulties in data collection but also because of the significant computational burden during model training. A common compromise is to train the DMs on 2D data priors and reconstruct stacked 2D slices to address 3D medical inverse problems. However, the intrinsic randomness of diffusion sampling causes severe inter‑slice discontinuities of reconstructed 3D volumes. Existing methods often enforce continuity regularizations along the z‑axis, which introduces sensitive hyper‑parameters and may lead to over‑smoothing results. In this work, we revisit the origin of stochasticity in diffusion sampling and introduce Inter‑Slice Consistent Stochasticity (ISCS), a simple yet effective strategy that encourages interslice consistency during diffusion sampling. Our key idea is to control the consistency of stochastic noise components during diffusion sampling, thereby aligning their sampling trajectories without adding any new loss terms or optimization steps. Importantly, the proposed ISCS is plug‑and‑play and can be dropped into any 2D trained diffusion based 3D reconstruction pipeline without additional computational cost. Experiments on several medical imaging problems show that our method can effectively improve the performance of medical 3D imaging problems based on 2D diffusion models. Our findings suggest that controlling inter‑slice stochasticity is a principled and practically attractive route toward high‑fidelity 3D medical imaging with 2D diffusion priors. The code is available at: https://github.com/duchenhe/ISCS
Authors:Xiaofeng Lin, Sirou Zhu, Yilei Chen, Mingyu Chen, Hejian Sang, Ioannis Paschalidis, Zhipeng Wang, Aldo Pacchiano, Xuezhou Zhang
Abstract:
Large language models (LLMs) achieve strong performance when all task‑relevant information is available upfront, as in static prediction and instruction‑following problems. However, many real‑world decision‑making tasks are inherently online: crucial information must be acquired through interaction, feedback is delayed, and effective behavior requires balancing information collection and exploitation over time. While in‑context learning enables adaptation without weight updates, existing LLMs often struggle to reliably leverage in‑context interaction experience in such settings. In this work, we show that this limitation can be addressed through training. We introduce ORBIT, a multi‑task, multi‑episode meta‑reinforcement learning framework that trains LLMs to learn from interaction in context. After meta‑training, a relatively small open‑source model (Qwen3‑14B) demonstrates substantially improved in‑context online learning on entirely unseen environments, matching the performance of GPT‑5.2 and outperforming standard RL fine‑tuning by a large margin. Scaling experiments further reveal consistent gains with model size, suggesting significant headroom for learn‑at‑inference‑time decision‑making agents. Code reproducing the results in the paper can be found at https://github.com/XiaofengLin7/ORBIT.
Authors:Vignesh Kothapalli, Rishabh Ranjan, Valter Hudovernik, Vijay Prakash Dwivedi, Johannes Hoffart, Carlos Guestrin, Jure Leskovec
Abstract:
Relational Foundation Models (RFMs) facilitate data‑driven decision‑making by learning from complex multi‑table databases. However, the diverse relational databases needed to train such models are rarely public due to privacy constraints. While there are methods to generate synthetic tabular data of arbitrary size, incorporating schema structure and primary‑‑foreign key connectivity for multi‑table generation remains challenging. Here we introduce PluRel, a framework to synthesize multi‑tabular relational databases from scratch. In a step‑by‑step fashion, PluRel models (1) schemas with directed graphs, (2) inter‑table primary‑foreign key connectivity with bipartite graphs, and, (3) feature distributions in tables via conditional causal mechanisms. The design space across these stages supports the synthesis of a wide range of diverse databases, while being computationally lightweight. Using PluRel, we observe for the first time that (1) RFM pretraining loss exhibits power‑law scaling with the number of synthetic databases and total pretraining tokens, (2) scaling the number of synthetic databases improves generalization to real databases, and (3) synthetic pretraining yields strong base models for continued pretraining on real databases. Overall, our framework and results position synthetic data scaling as a promising paradigm for RFMs.
Authors:Jusheng Zhang, Ningyuan Liu, Qinhan Lyu, Jing Yang, Keze Wang
Abstract:
Deep neural networks typically treat nonlinearities as fixed primitives (e.g., ReLU), limiting both interpretability and the granularity of control over the induced function class. While recent additive models (like KANs) attempt to address this using splines, they often suffer from computational inefficiency and boundary instability. We propose the Rational‑ANOVA Network (RAN), a foundational architecture grounded in functional ANOVA decomposition and Padé‑style rational approximation. RAN models f(x) as a composition of main effects and sparse pairwise interactions, where each component is parameterized by a stable, learnable rational unit. Crucially, we enforce a strictly positive denominator, which avoids poles and numerical instability while capturing sharp transitions and near‑singular behaviors more efficiently than polynomial bases. This ANOVA structure provides an explicit low‑order interaction bias for data efficiency and interpretability, while the rational parameterization significantly improves extrapolation. Across controlled function benchmarks and vision classification tasks (e.g., CIFAR‑10) under matched parameter and compute budgets, RAN matches or surpasses parameter‑matched MLPs and learnable‑activation baselines, with better stability and throughput. Code is available at https://github.com/jushengzhang/Rational‑ANOVA‑Networks.git.
Authors:Aijie Shu, Wenbin Wu, Gbenga Ibikunle, Fengxiang He
Abstract:
Credit exposure in Decentralized Finance (DeFi) is often implicit and token‑mediated, creating a dense web of inter‑protocol dependencies. Thus, a shock to one token may result in significant and uncontrolled contagion effects. As the DeFi ecosystem becomes increasingly linked with traditional financial infrastructure through instruments, such as stablecoins, the risk posed by this dynamic demands more powerful quantification tools. We introduce DeXposure‑FM, the first time‑series, graph foundation model for measuring and forecasting inter‑protocol credit exposure on DeFi networks, to the best of our knowledge. Employing a graph‑tabular encoder, with pre‑trained weight initialization, and multiple task‑specific heads, DeXposure‑FM is trained on the DeXposure dataset that has 43.7 million data entries, across 4,300+ protocols on 602 blockchains, covering 24,300+ unique tokens. The training is operationalized for credit‑exposure forecasting, predicting the joint dynamics of (1) protocol‑level flows, and (2) the topology and weights of credit‑exposure links. The DeXposure‑FM is empirically validated on two machine learning benchmarks; it consistently outperforms the state‑of‑the‑art approaches, including a graph foundation model and temporal graph neural networks. DeXposure‑FM further produces financial economics tools that support macroprudential monitoring and scenario‑based DeFi stress testing, by enabling protocol‑level systemic‑importance scores, sector‑level spillover and concentration measures via a forecast‑then‑measure pipeline. Empirical verification fully supports our financial economics tools. The model and code have been publicly available. Model: https://huggingface.co/EVIEHub/DeXposure‑FM. Code: https://github.com/EVIEHub/DeXposure‑FM.
Authors:Yinyi Luo, Yiqiao Jin, Weichen Yu, Mengqi Zhang, Srijan Kumar, Xiaoxiao Li, Weijie Xu, Xin Chen, Jindong Wang
Abstract:
While large language model (LLM) multi‑agent systems achieve superior reasoning performance through iterative debate, practical deployment is limited by their high computational cost and error propagation. This paper proposes AgentArk, a novel framework to distill multi‑agent dynamics into the weights of a single model, effectively transforming explicit test‑time interactions into implicit model capabilities. This equips a single agent with the intelligence of multi‑agent systems while remaining computationally efficient. Specifically, we investigate three hierarchical distillation strategies across various models, tasks, scaling, and scenarios: reasoning‑enhanced fine‑tuning; trajectory‑based augmentation; and process‑aware distillation. By shifting the burden of computation from inference to training, the distilled models preserve the efficiency of one agent while exhibiting strong reasoning and self‑correction performance of multiple agents. They further demonstrate enhanced robustness and generalization across diverse reasoning tasks. We hope this work can shed light on future research on efficient and robust multi‑agent development. Our code is at https://github.com/AIFrontierLab/AgentArk.
Authors:Jinxing Zhou, Yanghao Zhou, Yaoting Wang, Zongyan Han, Jiaqi Ma, Henghui Ding, Rao Muhammad Anwer, Hisham Cholakkal
Abstract:
Language‑referred audio‑visual segmentation (Ref‑AVS) aims to segment target objects described by natural language by jointly reasoning over video, audio, and text. Beyond generating segmentation masks, providing rich and interpretable diagnoses of mask quality remains largely underexplored. In this work, we introduce Mask Quality Assessment in the Ref‑AVS context (MQA‑RefAVS), a new task that evaluates the quality of candidate segmentation masks without relying on ground‑truth annotations as references at inference time. Given audio‑visual‑language inputs and each provided segmentation mask, the task requires estimating its IoU with the unobserved ground truth, identifying the corresponding error type, and recommending an actionable quality‑control decision. To support this task, we construct MQ‑RAVSBench, a benchmark featuring diverse and representative mask error modes that span both geometric and semantic issues. We further propose MQ‑Auditor, a multimodal large language model (MLLM)‑based auditor that explicitly reasons over multimodal cues and mask information to produce quantitative and qualitative mask quality assessments. Extensive experiments demonstrate that MQ‑Auditor outperforms strong open‑source and commercial MLLMs and can be integrated with existing Ref‑AVS systems to detect segmentation failures and support downstream segmentation improvement. Data and codes will be released at https://github.com/jasongief/MQA‑RefAVS.
Authors:Romain Cosentino
Abstract:
We develop a continual learning method for pretrained models that \emphrequires no access to old‑task data, addressing a practical barrier in foundation model adaptation where pretraining distributions are often unavailable. Our key observation is that pretrained networks exhibit substantial \emphgeometric redundancy, and that this redundancy can be exploited in two complementary ways. First, redundant neurons provide a proxy for dominant pretraining‑era feature directions, enabling the construction of approximately protected update subspaces directly from pretrained weights. Second, redundancy offers a natural bias for \emphwhere to place plasticity: by restricting updates to a subset of redundant neurons and constraining the remaining degrees of freedom, we obtain update families with reduced functional drift on the old‑data distribution and improved worst‑case retention guarantees. These insights lead to \textscPLATE (Plasticity‑Tunable Efficient Adapters), a continual learning method requiring no past‑task data that provides explicit control over the plasticity‑retention trade‑off. PLATE parameterizes each layer with a structured low‑rank update ΔW = B A Q^\top, where B and Q are computed once from pretrained weights and kept frozen, and only A is trained on the new task. The code is available at https://github.com/SalesforceAIResearch/PLATE.
Authors:Minjun Zhu, Zhen Lin, Yixuan Weng, Panzhong Lu, Qiujie Xie, Yifan Wei, Sifan Liu, Qiyao Sun, Yue Zhang
Abstract:
High‑quality scientific illustrations are crucial for effectively communicating complex scientific and technical concepts, yet their manual creation remains a well‑recognized bottleneck in both academia and industry. We present FigureBench, the first large‑scale benchmark for generating scientific illustrations from long‑form scientific texts. It contains 3,300 high‑quality scientific text‑figure pairs, covering diverse text‑to‑illustration tasks from scientific papers, surveys, blogs, and textbooks. Moreover, we propose AutoFigure, the first agentic framework that automatically generates high‑quality scientific illustrations based on long‑form scientific text. Specifically, before rendering the final result, AutoFigure engages in extensive thinking, recombination, and validation to produce a layout that is both structurally sound and aesthetically refined, outputting a scientific illustration that achieves both structural completeness and aesthetic appeal. Leveraging the high‑quality data from FigureBench, we conduct extensive experiments to test the performance of AutoFigure against various baseline methods. The results demonstrate that AutoFigure consistently surpasses all baseline methods, producing publication‑ready scientific illustrations. The code, dataset and huggingface space are released in https://github.com/ResearAI/AutoFigure.
Authors:Oscar Ovanger, Levi Harris, Timothy H. Keitt
Abstract:
Many machine learning systems have access to multiple sources of evidence for the same prediction target, yet these sources often differ in reliability and informativeness across inputs. In bioacoustic classification, species identity may be inferred both from the acoustic signal and from spatiotemporal context such as location and season; while Bayesian inference motivates multiplicative evidence combination, in practice we typically only have access to discriminative predictors rather than calibrated generative models. We introduce Fusion under INdependent Conditional Hypotheses (FINCH), an adaptive log‑linear evidence fusion framework that integrates a pre‑trained audio classifier with a structured spatiotemporal predictor. FINCH learns a per‑sample gating function that estimates the reliability of contextual information from uncertainty and informativeness statistics. The resulting fusion family \emphcontains the audio‑only classifier as a special case and explicitly bounds the influence of contextual evidence, yielding a risk‑contained hypothesis class with an interpretable audio‑only fallback. Across benchmarks, FINCH consistently outperforms fixed‑weight fusion and audio‑only baselines, improving robustness and error trade‑offs even when contextual information is weak in isolation. We achieve state‑of‑the‑art performance on CBI and competitive or improved performance on several subsets of BirdSet using a lightweight, interpretable, evidence‑based approach. Code is available: \texttt\hrefhttps://anonymous.4open.science/r/birdnoise‑85CD/README.mdanonymous‑repository
Authors:Ziru Chen, Dongdong Chen, Ruinan Jin, Yingbin Liang, Yujia Xie, Huan Sun
Abstract:
Recently, there have been significant research interests in training large language models (LLMs) with reinforcement learning (RL) on real‑world tasks, such as multi‑turn code generation. While online RL tends to perform better than offline RL, its higher training cost and instability hinders wide adoption. In this paper, we build on the observation that multi‑turn code generation can be formulated as a one‑step recoverable Markov decision process and propose contextual bandit learning with offline trajectories (Cobalt), a new method that combines the benefits of online and offline RL. Cobalt first collects code generation trajectories using a reference LLM and divides them into partial trajectories as contextual prompts. Then, during online bandit learning, the LLM is trained to complete each partial trajectory prompt through single‑step code generation. Cobalt outperforms two multi‑turn online RL baselines based on GRPO and VeRPO, and substantially improves R1‑Distill 8B and Qwen3 8B by up to 9.0 and 6.2 absolute Pass@1 scores on LiveCodeBench. Also, we analyze LLMs' in‑context reward hacking behaviors and augment Cobalt training with perturbed trajectories to mitigate this issue. Overall, our results demonstrate Cobalt as a promising solution for iterative decision‑making tasks like multi‑turn code generation. Our code and data are available at https://github.com/OSU‑NLP‑Group/cobalt.
Authors:Yingxuan Yang, Chengrui Qu, Muning Wen, Laixi Shi, Ying Wen, Weinan Zhang, Adam Wierman, Shangding Gu
Abstract:
LLM‑based multi‑agent systems (MAS) have emerged as a promising approach to tackle complex tasks that are difficult for individual LLMs. A natural strategy is to scale performance by increasing the number of agents; however, we find that such scaling exhibits strong diminishing returns in homogeneous settings, while introducing heterogeneity (e.g., different models, prompts, or tools) continues to yield substantial gains. This raises a fundamental question: what limits scaling, and why does diversity help? We present an information‑theoretic framework showing that MAS performance is bounded by the intrinsic task uncertainty, not by agent count. We derive architecture‑agnostic bounds demonstrating that improvements depend on how many effective channels the system accesses. Homogeneous agents saturate early because their outputs are strongly correlated, whereas heterogeneous agents contribute complementary evidence. We further introduce K^, an effective channel count that quantifies the number of effective channels without ground‑truth labels. Empirically, we show that heterogeneous configurations consistently outperform homogeneous scaling: 2 diverse agents can match or exceed the performance of 16 homogeneous agents. Our results provide principled guidelines for building efficient and robust MAS through diversity‑aware design. Code and Dataset are available at the link: https://github.com/SafeRL‑Lab/Agent‑Scaling.
Authors:Xilong Wang, Yinuo Liu, Zhun Wang, Dawn Song, Neil Gong
Abstract:
Prompt injection attacks manipulate webpage content to cause web agents to execute attacker‑specified tasks instead of the user's intended ones. Existing methods for detecting and localizing such attacks achieve limited effectiveness, as their underlying assumptions often do not hold in the web‑agent setting. In this work, we propose WebSentinel, a two‑step approach for detecting and localizing prompt injection attacks in webpages. Given a webpage, Step I extracts \emphsegments of interest that may be contaminated, and Step II evaluates each segment by checking its consistency with the webpage content as context. We show that WebSentinel is highly effective, substantially outperforming baseline methods across multiple datasets of both contaminated and clean webpages that we collected. Our code is available at: https://github.com/wxl‑lxw/WebSentinel.
Authors:Jianhao Ruan, Zhihao Xu, Yiran Peng, Fashen Ren, Zhaoyang Yu, Xinbing Liang, Jinyu Xiang, Bang Liu, Chenglin Wu, Yuyu Luo, Jiayi Zhang
Abstract:
Language agents have shown strong promise for task automation. Realizing this promise for increasingly complex, long‑horizon tasks has driven the rise of a sub‑agent‑as‑tools paradigm for multi‑turn task solving. However, existing designs still lack a dynamic abstraction view of sub‑agents, thereby hurting adaptability. We address this challenge with a unified, framework‑agnostic agent abstraction that models any agent as a tuple Instruction, Context, Tools, Model. This tuple acts as a compositional recipe for capabilities, enabling the system to spawn specialized executors for each task on demand. Building on this abstraction, we introduce an agentic system AOrchestra, where the central orchestrator concretizes the tuple at each step: it curates task‑relevant context, selects tools and models, and delegates execution via on‑the‑fly automatic agent creation. Such designs enable reducing human engineering efforts, and remain framework‑agnostic with plug‑and‑play support for diverse agents as task executors. It also enables a controllable performance‑cost trade‑off, allowing the system to approach Pareto‑efficient. Across three challenging benchmarks (GAIA, SWE‑Bench, Terminal‑Bench), AOrchestra achieves 16.28% relative improvement against the strongest baseline when paired with Gemini‑3‑Flash. The code is available at: https://github.com/FoundationAgents/AOrchestra
Authors:Jiashuo Sun, Pengcheng Jiang, Saizhuo Wang, Jiajun Fan, Heng Wang, Siru Ouyang, Ming Zhong, Yizhu Jiao, Chengsong Huang, Xueqiang Xu, Pengrui Han, Peiran Li, Jiaxin Huang, Ge Liu, Heng Ji, Jiawei Han
Abstract:
Retrieval‑Augmented Generation (RAG) systems remain brittle under realistic retrieval noise, even when the required evidence appears in the top‑K results. A key reason is that retrievers and rerankers optimize solely for relevance, often selecting either trivial, answer‑revealing passages or evidence that lacks the critical information required to answer the question, without considering whether the evidence is suitable for the generator. We propose BAR‑RAG, which reframes the reranker as a boundary‑aware evidence selector that targets the generator's Goldilocks Zone ‑‑ evidence that is neither trivially easy nor fundamentally unanswerable for the generator, but is challenging yet sufficient for inference and thus provides the strongest learning signal. BAR‑RAG trains the selector with reinforcement learning using generator feedback, and adopts a two‑stage pipeline that fine‑tunes the generator under the induced evidence distribution to mitigate the distribution mismatch between training and inference. Experiments on knowledge‑intensive question answering benchmarks show that BAR‑RAG consistently improves end‑to‑end performance under noisy retrieval, achieving an average gain of 10.3 percent over strong RAG and reranking baselines while substantially improving robustness. Code is publicly avaliable at https://github.com/GasolSun36/BAR‑RAG.
Authors:Basile Terver, Randall Balestriero, Megi Dervishi, David Fan, Quentin Garrido, Tushar Nagarajan, Koustuv Sinha, Wancong Zhang, Mike Rabbat, Yann LeCun, Amir Bar
Abstract:
We present EB‑JEPA, an open‑source library for learning representations and world models using Joint‑Embedding Predictive Architectures (JEPAs). JEPAs learn to predict in representation space rather than pixel space, avoiding the pitfalls of generative modeling while capturing semantically meaningful features suitable for downstream tasks. Our library provides modular, self‑contained implementations that illustrate how representation learning techniques developed for image‑level self‑supervised learning can transfer to video, where temporal dynamics add complexity, and ultimately to action‑conditioned world models, where the model must additionally learn to predict the effects of control inputs. Each example is designed for single‑GPU training within a few hours, making energy‑based self‑supervised learning accessible for research and education. We provide ablations of JEA components on CIFAR‑10. Probing these representations yields 91% accuracy, indicating that the model learns useful features. Extending to video, we include a multi‑step prediction example on Moving MNIST that demonstrates how the same principles scale to temporal modeling. Finally, we show how these representations can drive action‑conditioned world models, achieving a 97% planning success rate on the Two Rooms navigation task. Comprehensive ablations reveal the critical importance of each regularization component for preventing representation collapse. Code is available at https://github.com/facebookresearch/eb_jepa.
Authors:Yiran Qiao, Jing Chen, Xiang Ao, Qiwei Zhong, Yang Liu, Qing He
Abstract:
Live streaming has become a cornerstone of today's internet, enabling massive real‑time social interactions. However, it faces severe risks arising from sparse, coordinated malicious behaviors among multiple participants, which are often concealed within normal activities and challenging to detect timely and accurately. In this work, we provide a pioneering study on risk assessment in live streaming rooms, characterized by weak supervision where only room‑level labels are available. We formulate the task as a Multiple Instance Learning (MIL) problem, treating each room as a bag and defining structured user‑timeslot capsules as instances. These capsules represent subsequences of user actions within specific time windows, encapsulating localized behavioral patterns. Based on this formulation, we propose AC‑MIL, an Action‑aware Capsule MIL framework that models both individual behaviors and group‑level coordination patterns. AC‑MIL captures multi‑granular semantics and behavioral cues through a serial and parallel architecture that jointly encodes temporal dynamics and cross‑user dependencies. These signals are integrated for robust room‑level risk prediction, while also offering interpretable evidence at the behavior segment level. Extensive experiments on large‑scale industrial datasets from Douyin demonstrate that AC‑MIL significantly outperforms MIL and sequential baselines, establishing new state‑of‑the‑art performance in room‑level risk assessment for live streaming. Moreover, AC‑MIL provides capsule‑level interpretability, enabling identification of risky behavior segments as actionable evidence for intervention. The project page is available at: https://qiaoyran.github.io/AC‑MIL/.
Authors:Guannan Lai, Han-Jia Ye
Abstract:
LLM routing aims to achieve a favorable quality‑‑cost trade‑off by dynamically assigning easy queries to smaller models and harder queries to stronger ones. However, across both unimodal and multimodal settings, we uncover a pervasive yet underexplored failure mode in existing routers: as the user's cost budget increases, routers systematically default to the most capable and most expensive model even when cheaper models already suffice. As a result, current routers under‑utilize small models, wasting computation and monetary cost and undermining the core promise of routing; we term this phenomenon routing collapse. We attribute routing collapse to an objective‑‑decision mismatch: many routers are trained to predict scalar performance scores, whereas routing decisions ultimately depend on discrete comparisons among candidate models. Consequently, small prediction errors can flip relative orderings and trigger suboptimal selections. To bridge this gap, we propose EquiRouter, a decision‑aware router that directly learns model rankings, restoring the role of smaller models and mitigating routing collapse. On RouterBench, EquiRouter reduces cost by about 17% at GPT‑4‑level performance compared to the strongest prior router. Our code is available at https://github.com/AIGNLAI/EquiRouter.
Authors:Mario Pascual-González, Ariadna Jiménez-Partinen, R. M. Luque-Baena, Fátima Nagib-Raya, Ezequiel López-Rubio
Abstract:
Focal cortical dysplasia (FCD) lesions in epilepsy FLAIR MRI are subtle and scarce, making joint image‑‑mask generative modeling prone to instability and memorization. We propose SLIM‑Diff, a compact joint diffusion model whose main contributions are (i) a single shared‑bottleneck U‑Net that enforces tight coupling between anatomy and lesion geometry from a 2‑channel image+mask representation, and (ii) loss‑geometry tuning via a tunable L_p objective. As an internal baseline, we include the canonical DDPM‑style objective (ε‑prediction with L_2 loss) and isolate the effect of prediction parameterization and L_p geometry under a matched setup. Experiments show that x_0‑prediction is consistently the strongest choice for joint synthesis, and that fractional sub‑quadratic penalties (L_1.5) improve image fidelity while L_2 better preserves lesion mask morphology. Our code and model weights are available in https://github.com/MarioPasc/slim‑diff
Authors:Ning Ding, Fangcheng Liu, Kyungrae Kim, Linji Hao, Kyeng-Hun Lee, Hyeonmok Ko, Yehui Tang
Abstract:
Scaling Large Language Models (LLMs) typically relies on increasing the number of parameters or test‑time computations to boost performance. However, these strategies are impractical for edge device deployment due to limited RAM and NPU resources. Despite hardware constraints, deploying performant LLM on edge devices such as smartphone remains crucial for user experience. To address this, we propose MeKi (Memory‑based Expert Knowledge Injection), a novel system that scales LLM capacity via storage space rather than FLOPs. MeKi equips each Transformer layer with token‑level memory experts that injects pre‑stored semantic knowledge into the generation process. To bridge the gap between training capacity and inference efficiency, we employ a re‑parameterization strategy to fold parameter matrices used during training into a compact static lookup table. By offloading the knowledge to ROM, MeKi decouples model capacity from computational cost, introducing zero inference latency overhead. Extensive experiments demonstrate that MeKi significantly outperforms dense LLM baselines with identical inference speed, validating the effectiveness of memory‑based scaling paradigm for on‑device LLMs. Project homepage is at https://github.com/ningding‑o/MeKi.
Authors:Shengyuan Liu, Liuxin Bao, Qi Yang, Wanting Geng, Boyun Zheng, Chenxin Li, Wenting Chen, Houwen Peng, Yixuan Yuan
Abstract:
Medical image segmentation is evolving from task‑specific models toward generalizable frameworks. Recent research leverages Multi‑modal Large Language Models (MLLMs) as autonomous agents, employing reinforcement learning with verifiable reward (RLVR) to orchestrate specialized tools like the Segment Anything Model (SAM). However, these approaches often rely on single‑turn, rigid interaction strategies and lack process‑level supervision during training, which hinders their ability to fully exploit the dynamic potential of interactive tools and leads to redundant actions. To bridge this gap, we propose MedSAM‑Agent, a framework that reformulates interactive segmentation as a multi‑step autonomous decision‑making process. First, we introduce a hybrid prompting strategy for expert‑curated trajectory generation, enabling the model to internalize human‑like decision heuristics and adaptive refinement strategies. Furthermore, we develop a two‑stage training pipeline that integrates multi‑turn, end‑to‑end outcome verification with a clinical‑fidelity process reward design to promote interaction parsimony and decision efficiency. Extensive experiments across 6 medical modalities and 21 datasets demonstrate that MedSAM‑Agent achieves state‑of‑the‑art performance, effectively unifying autonomous medical reasoning with robust, iterative optimization. Code is available \hrefhttps://github.com/CUHK‑AIM‑Group/MedSAM‑Agenthere.
Authors:Songming Liu, Bangguo Li, Kai Ma, Lingxuan Wu, Hengkai Tan, Xiao Ouyang, Hang Su, Jun Zhu
Abstract:
Vision‑Language‑Action (VLA) models hold promise for generalist robotics but currently struggle with data scarcity, architectural inefficiencies, and the inability to generalize across different hardware platforms. We introduce RDT2, a robotic foundation model built upon a 7B parameter VLM designed to enable zero‑shot deployment on novel embodiments for open‑vocabulary tasks. To achieve this, we collected one of the largest open‑source robotic datasets‑‑over 10,000 hours of demonstrations in diverse families‑‑using an enhanced, embodiment‑agnostic Universal Manipulation Interface (UMI). Our approach employs a novel three‑stage training recipe that aligns discrete linguistic knowledge with continuous control via Residual Vector Quantization (RVQ), flow‑matching, and distillation for real‑time inference. Consequently, RDT2 becomes one of the first models that simultaneously zero‑shot generalizes to unseen objects, scenes, instructions, and even robotic platforms. Besides, it outperforms state‑of‑the‑art baselines in dexterous, long‑horizon, and dynamic downstream tasks like playing table tennis. See https://rdt‑robotics.github.io/rdt2/ for more information.
Authors:Yuelin Hu, Jun Xu, Bingcong Lu, Zhengxue Cheng, Hongwei Hu, Ronghua Wu, Li Song
Abstract:
Enterprise meeting environments require AI assistants that handle diverse operational tasks, from rapid fact checking during live discussions to cross meeting analysis for strategic planning, under strict latency, cost, and privacy constraints. Existing meeting benchmarks mainly focus on simplified question answering and fail to reflect real world enterprise workflows, where queries arise organically from multi stakeholder collaboration, span long temporal contexts, and require tool augmented reasoning. We address this gap through a grounded dataset and a learned agent framework. First, we introduce MeetAll, a bilingual and multimodal corpus derived from 231 enterprise meetings totaling 140 hours. Questions are injected using an enterprise informed protocol validated by domain expert review and human discriminability studies. Unlike purely synthetic benchmarks, this protocol is grounded in four enterprise critical dimensions: cognitive load, temporal context span, domain expertise, and actionable task execution, calibrated through interviews with stakeholders across finance, healthcare, and technology sectors. Second, we propose MeetBench XL, a multi dimensional evaluation protocol aligned with human judgment that measures factual fidelity, intent alignment, response efficiency, structural clarity, and completeness. Third, we present MeetMaster XL, a learned dual policy agent that jointly optimizes query routing between fast and slow reasoning paths and tool invocation, including retrieval, cross meeting aggregation, and web search. A lightweight classifier enables accurate routing with minimal overhead, achieving a superior quality latency tradeoff over single model baselines. Experiments against commercial systems show consistent gains, supported by ablations, robustness tests, and a real world deployment case study.Resources: https://github.com/huyuelin/MeetBench.
Authors:Tianyu Chen, Chujia Hu, Ge Gao, Dongrui Liu, Xia Hu, Wenjie Wang
Abstract:
Computer‑use agents (CUAs) that interact with real computer systems can perform automated tasks but face critical safety risks. Ambiguous instructions may trigger harmful actions, and adversarial users can manipulate tool execution to achieve malicious goals. Existing benchmarks mostly focus on short‑horizon or GUI‑based tasks, evaluating on execution‑time errors but overlooking the ability to anticipate planning‑time risks. To fill this gap, we present LPS‑Bench, a benchmark that evaluates the planning‑time safety awareness of MCP‑based CUAs under long‑horizon tasks, covering both benign and adversarial interactions across 65 scenarios of 7 task domains and 9 risk types. We introduce a multi‑agent automated pipeline for scalable data generation and adopt an LLM‑as‑a‑judge evaluation protocol to assess safety awareness through the planning trajectory. Experiments reveal substantial deficiencies in existing CUAs' ability to maintain safe behavior. We further analyze the risks and propose mitigation strategies to improve long‑horizon planning safety in MCP‑based CUA systems. We open‑source our code at https://github.com/tychenn/LPS‑Bench.
Authors:Wenquan Lu, Hai Huang, Randall Balestriero
Abstract:
Reinforcement learning algorithms such as group‑relative policy optimization (GRPO) have demonstrated strong potential for improving the mathematical reasoning capabilities of large language models. However, prior work has consistently observed an entropy collapse phenomenon during reinforcement post‑training, characterized by a monotonic decrease in policy entropy that ultimately leads to training instability and collapse. As a result, most existing approaches restrict training to short horizons (typically 5‑20 epochs), limiting sustained exploration and hindering further policy improvement. In addition, nearly all prior work relies on a single, fixed reasoning prompt or template during training. In this work, we introduce prompt augmentation, a training strategy that instructs the model to generate reasoning traces under diverse templates and formats, thereby increasing rollout diversity. We show that, without a KL regularization term, prompt augmentation enables stable scaling of training duration under a fixed dataset and allows the model to tolerate low‑entropy regimes without premature collapse. Empirically, a Qwen2.5‑Math‑1.5B model trained with prompt augmentation on the MATH Level 3‑5 dataset achieves state‑of‑the‑art performance, reaching 45.2 per‑benchmark accuracy and 51.8 per‑question accuracy on standard mathematical reasoning benchmarks, including AIME24, AMC, MATH500, Minerva, and OlympiadBench. The code and model checkpoints are available at https://github.com/wenquanlu/prompt‑augmentation‑GRPO.
Authors:Xiaoyu Tao, Mingyue Cheng, Ze Guo, Shuo Yu, Yaguo Liu, Qi Liu, Shijin Wang
Abstract:
Time series forecasting (TSF) plays a critical role in decision‑making for many real‑world applications. Recently, LLM‑based forecasters have made promising advancements. Despite their effectiveness, existing methods often lack explicit experience accumulation and continual evolution. In this work, we propose MemCast, a learning‑to‑memory framework that reformulates TSF as an experience‑conditioned reasoning task. Specifically, we learn experience from the training set and organize it into a hierarchical memory. This is achieved by summarizing prediction results into historical patterns, distilling inference trajectories into reasoning wisdom, and inducing extracted temporal features into general laws. Furthermore, during inference, we leverage historical patterns to guide the reasoning process and utilize reasoning wisdom to select better trajectories, while general laws serve as criteria for reflective iteration. Additionally, to enable continual evolution, we design a dynamic confidence adaptation strategy that updates the confidence of individual entries without leaking the test set distribution. Extensive experiments on multiple datasets demonstrate that MemCast consistently outperforms previous methods, validating the effectiveness of our approach. Our code is available at https://github.com/Xiaoyu‑Tao/MemCast‑TS.
Authors:Baohao Liao, Hanze Dong, Xinxing Xu, Christof Monz, Jiang Bian
Abstract:
Group Relative Policy Optimization (GRPO) has recently emerged as a practical recipe for aligning large language models with verifiable objectives. However, under sparse terminal rewards, GRPO often stalls because rollouts within a group frequently receive identical rewards, causing relative advantages to collapse and updates to vanish. We propose self‑hint aligned GRPO with privileged supervision (SAGE), an on‑policy reinforcement learning framework that injects privileged hints during training to reshape the rollout distribution under the same terminal verifier reward. For each prompt x, the model samples a compact hint h (e.g., a plan or decomposition) and then generates a solution τ conditioned on (x,h). Crucially, the task reward R(x,τ) is unchanged; hints only increase within‑group outcome diversity under finite sampling, preventing GRPO advantages from collapsing under sparse rewards. At test time, we set h=\varnothing and deploy the no‑hint policy without any privileged information. Moreover, sampling diverse self‑hints serves as an adaptive curriculum that tracks the learner's bottlenecks more effectively than fixed hints from an initial policy or a stronger external model. Experiments over 6 benchmarks with 3 LLMs show that SAGE consistently outperforms GRPO, on average +2.0 on Llama‑3.2‑3B‑Instruct, +1.2 on Qwen2.5‑7B‑Instruct and +1.3 on Qwen3‑4B‑Instruct. The code is available at https://github.com/BaohaoLiao/SAGE.
Authors:Yinggan Xu, Risto Miikkulainen, Xin Qiu
Abstract:
Post‑Training Quantization (PTQ) is essential for deploying Large Language Models (LLMs) on memory‑constrained devices, yet it renders models static and difficult to fine‑tune. Standard fine‑tuning paradigms, including Reinforcement Learning (RL), fundamentally rely on backpropagation and high‑precision weights to compute gradients. Thus they cannot be used on quantized models, where the parameter space is discrete and non‑differentiable. While Evolution Strategies (ES) offer a backpropagation‑free alternative, optimization of the quantized parameters can still fail due to vanishing or inaccurate gradient. This paper introduces Quantized Evolution Strategies (QES), an optimization paradigm that performs full‑parameter fine‑tuning directly in the quantized space. QES is based on two innovations: (1) it integrates accumulated error feedback to preserve high‑precision gradient signals, and (2) it utilizes a stateless seed replay to reduce memory usage to low‑precision inference levels. QES significantly outperforms the state‑of‑the‑art zeroth‑order fine‑tuning method on arithmetic reasoning tasks, making direct fine‑tuning for quantized models possible. It therefore opens up the possibility for scaling up LLMs entirely in the quantized space. The source code is available at https://github.com/dibbla/Quantized‑Evolution‑Strategies .
Authors:Felix X. -F. Ye, Xingjie Li, An Yu, Ming-Ching Chang, Linsong Chu, Davis Wertheimer
Abstract:
Entropic optimal transport (EOT) via Sinkhorn iterations is widely used in modern machine learning, yet GPU solvers remain inefficient at scale. Tensorized implementations suffer quadratic HBM traffic from dense n× m interactions, while existing online backends avoid storing dense matrices but still rely on generic tiled map‑reduce reduction kernels with limited fusion. We present FlashSinkhorn, an IO‑aware EOT solver for squared Euclidean cost that rewrites stabilized log‑domain Sinkhorn updates as row‑wise LogSumExp reductions of biased dot‑product scores, the same normalization as transformer attention. This enables FlashAttention‑style fusion and tiling: fused Triton kernels stream tiles through on‑chip SRAM and update dual potentials in a single pass, substantially reducing HBM IO per iteration while retaining linear‑memory operations. We further provide streaming kernels for transport application, enabling scalable first‑ and second‑order optimization. On A100 GPUs, FlashSinkhorn achieves up to 32× forward‑pass and 161× end‑to‑end speedups over state‑of‑the‑art online baselines on point‑cloud OT, improves scalability on OT‑based downstream tasks. For reproducibility, we release an open‑source implementation at https://github.com/ot‑triton‑lab/ot_triton.
Authors:Yuanchen Bai, Ruixiang Han, Niti Parikh, Wendy Ju, Angelique Taylor
Abstract:
Co‑design is essential for grounding embodied artificial intelligence (AI) systems in real‑world contexts, especially high‑stakes domains such as healthcare. While prior work has explored multidisciplinary collaboration, iterative prototyping, and support for non‑technical participants, few have interwoven these into a sustained co‑design process. Such efforts often target one context and low‑fidelity stages, limiting the generalizability of findings and obscuring how participants' ideas evolve. To address these limitations, we conducted a 14‑week workshop with a multidisciplinary team of 22 participants, centered around how embodied AI can reduce non‑value‑added task burdens in three healthcare settings: emergency departments, long‑term rehabilitation facilities, and sleep disorder clinics. We found that the iterative progression from abstract brainstorming to high‑fidelity prototypes, supported by educational scaffolds, enabled participants to understand real‑world trade‑offs and generate more deployable solutions. We propose eight guidelines for co‑designing more considerate embodied AI: attuned to context, responsive to social dynamics, mindful of expectations, and grounded in deployment. Project Page: https://byc‑sophie.github.io/Towards‑Considerate‑Embodied‑AI/
Authors:Vishal Venkataramani, Haizhou Shi, Zixuan Ke, Austin Xu, Xiaoxiao He, Yingbo Zhou, Semih Yavuz, Hao Wang, Shafiq Joty
Abstract:
Multi‑Agent Systems (MAS) built on Large Language Models (LLMs) often exhibit high variance in their reasoning trajectories. Process verification, which evaluates intermediate steps in trajectories, has shown promise in general reasoning settings, and has been suggested as a potential tool for guiding coordination of MAS; however, its actual effectiveness in MAS remains unclear. To fill this gap, we present MAS‑ProVe, a systematic empirical study of process verification for multi‑agent systems (MAS). Our study spans three verification paradigms (LLM‑as‑a‑Judge, reward models, and process reward models), evaluated across two levels of verification granularity (agent‑level and iteration‑level). We further examine five representative verifiers and four context management strategies, and conduct experiments over six diverse MAS frameworks on multiple reasoning benchmarks. We find that process‑level verification does not consistently improve performance and frequently exhibits high variance, highlighting the difficulty of reliably evaluating partial multi‑agent trajectories. Among the methods studied, LLM‑as‑a‑Judge generally outperforms reward‑based approaches, with trained judges surpassing general‑purpose LLMs. We further observe a small performance gap between LLMs acting as judges and as single agents, and identify a context‑length‑performance trade‑off in verification. Overall, our results suggest that effective and robust process verification for MAS remains an open challenge, requiring further advances beyond current paradigms. Code is available at https://github.com/Wang‑ML‑Lab/MAS‑ProVe.
Authors:Xianzhen Luo, Jingyuan Zhang, Shiqi Zhou, Rain Huang, Chuan Xiao, Qingfu Zhu, Zhiyuan Ma, Xing Yue, Yang Yue, Wencong Zeng, Wanxiang Che
Abstract:
Evaluating and improving the security capabilities of code agents requires high‑quality, executable vulnerability tasks. However, existing works rely on costly, unscalable manual reproduction and suffer from outdated data distributions. To address these, we present CVE‑Factory, the first multi‑agent framework to achieve expert‑level quality in automatically transforming sparse CVE metadata into fully executable agentic tasks. Cross‑validation against human expert reproductions shows that CVE‑Factory achieves 95% solution correctness and 96% environment fidelity, confirming its expert‑level quality. It is also evaluated on the latest realistic vulnerabilities and achieves a 66.2% verified success. This automation enables two downstream contributions. First, we construct LiveCVEBench, a continuously updated benchmark of 190 tasks spanning 14 languages and 153 repositories that captures emerging threats including AI‑tooling vulnerabilities. Second, we synthesize over 1,000 executable training environments, the first large‑scale scaling of agentic tasks in code security. Fine‑tuned Qwen3‑32B improves from 5.3% to 35.8% on LiveCVEBench, surpassing Claude 4.5 Sonnet, with gains generalizing to Terminal Bench (12.5% to 31.3%). We open‑source CVE‑Factory, LiveCVEBench, Abacus‑cve (fine‑tuned model), training dataset, and leaderboard. All resources are available at https://github.com/livecvebench/CVE‑Factory .
Authors:Ziyang Yu, Liang Zhao
Abstract:
Deploying Large Language Models (LLMs) for discriminative workloads is often limited by inference latency, compute, and API costs at scale. Active distillation reduces these costs by querying an LLM oracle to train compact discriminative students, but most pipelines distill only final labels, discarding intermediate reasoning signals and offering limited diagnostics of what reasoning is missing and where errors arise. We propose Graph of Concept Predictors (GCP), a reasoning‑aware active distillation framework that externalizes the teacher's decision process as a directed acyclic graph and mirrors it with modular concept predictors in the student. GCP enhances sample efficiency through a graph‑aware acquisition strategy that targets uncertainty and disagreement at critical reasoning nodes. Additionally, it improves training stability and efficiency by performing targeted sub‑module retraining, which attributes downstream loss to specific concept predictors and updates only the most influential modules. Experiments on eight NLP classification benchmarks demonstrate that GCP enhances performance under limited annotation budgets while yielding more interpretable and controllable training dynamics. Code is available at: https://github.com/Ziyang‑Yu/GCP.
Authors:Michael Ogezi, Martin Bell, Freda Shi, Ethan Smith
Abstract:
For certain image generation tasks, vector graphics such as Scalable Vector Graphics (SVGs) offer clear benefits such as increased flexibility, size efficiency, and editing ease, but remain less explored than raster‑based approaches. A core challenge is that the numerical, geometric parameters, which make up a large proportion of SVGs, are inefficiently encoded as long sequences of tokens. This slows training, reduces accuracy, and hurts generalization. To address these problems, we propose Continuous Number Modeling (CNM), an approach that directly models numbers as first‑class, continuous values rather than discrete tokens. This formulation restores the mathematical elegance of the representation by aligning the model's inputs with the data's continuous nature, removing discretization artifacts introduced by token‑based encoding. We then train a multimodal transformer on 2 million raster‑to‑SVG samples, followed by fine‑tuning via reinforcement learning using perceptual feedback to further improve visual quality. Our approach improves training speed by over 30% while maintaining higher perceptual fidelity compared to alternative approaches. This work establishes CNM as a practical and efficient approach for high‑quality vector generation, with potential for broader applications. We make our code available http://github.com/mikeogezi/CNM.
Authors:Matteo Bastico, Pierre Onghena, David Ryckelynck, Beatriz Marcotegui, Santiago Velasco-Forero, Laurent Corté, Caroline Robine--Decourcelle, Etienne Decencière
Abstract:
Accurate identification of anatomical landmarks is crucial for various medical applications. Traditional manual landmarking is time‑consuming and prone to inter‑observer variability, while rule‑based methods are often tailored to specific geometries or limited sets of landmarks. In recent years, anatomical surfaces have been effectively represented as point clouds, which are lightweight structures composed of spatial coordinates. Following this strategy and to overcome the limitations of existing landmarking techniques, we propose Landmark Point Transformer (LmPT), a method for automatic anatomical landmark detection on point clouds that can leverage homologous bones from different species for translational research. The LmPT model incorporates a conditioning mechanism that enables adaptability to different input types to conduct cross‑species learning. We focus the evaluation of our approach on femoral landmarking using both human and newly annotated dog femurs, demonstrating its generalization and effectiveness across species. The code and dog femur dataset will be publicly available at: https://github.com/Pierreoo/LandmarkPointTransformer.
Authors:Reza Rezvan, Gustav Gille, Moritz Schauer, Richard Torkar
Abstract:
Flow matching learns a velocity field that transports a base distribution to data. We study how small latent perturbations propagate through these flows and show that Jacobian‑vector products (JVPs) provide a practical lens on dependency structure in the generated features. We derive closed‑form expressions for the optimal drift and its Jacobian in Gaussian and mixture‑of‑Gaussian settings, revealing that even globally nonlinear flows admit local affine structure. In low‑dimensional synthetic benchmarks, numerical JVPs recover the analytical Jacobians. In image domains, composing the flow with an attribute classifier yields an attribute‑level JVP estimator that recovers empirical correlations on MNIST and CelebA. Conditioning on small classifier‑Jacobian norms reduces correlations in a way consistent with a hypothesized common‑cause structure, while we emphasize that this conditioning is not a formal do intervention.
Authors:Viresh Pati, Yubin Kim, Vinh Pham, Jevon Twitty, Shihao Yang, Jiecheng Lu
Abstract:
This paper presents CAPS (Clock‑weighted Aggregation with Prefix‑products and Softmax), a structured attention mechanism for time series forecasting that decouples three distinct temporal structures: global trends, local shocks, and seasonal patterns. Standard softmax attention entangles these through global normalization, while recent recurrent models sacrifice long‑term, order‑independent selection for order‑dependent causal structure. CAPS combines SO(2) rotations for phase alignment with three additive gating paths ‑‑ Riemann softmax, prefix‑product gates, and a Clock baseline ‑‑ within a single attention layer. We introduce the Clock mechanism, a learned temporal weighting that modulates these paths through a shared notion of temporal importance. Experiments on long‑ and short‑term forecasting benchmarks surpass vanilla softmax and linear attention mechanisms and demonstrate competitive performance against seven strong baselines with linear complexity. Our code implementation is available at https://github.com/vireshpati/CAPS‑Attention.
Authors:Punya Syon Pandey, Zhijing Jin
Abstract:
Supervised fine‑tuning (SFT) is the standard approach for binary classification tasks such as toxicity detection, factuality verification, and causal inference. However, SFT often performs poorly in real‑world settings with label noise, class imbalance, or sparse supervision. We introduce BinaryPPO, an offline reinforcement learning large language model (LLM) framework that reformulates binary classification as a reward maximization problem. Our method leverages a variant of Proximal Policy Optimization (PPO) with a confidence‑weighted reward function that penalizes uncertain or incorrect predictions, enabling the model to learn robust decision policies from static datasets without online interaction. Across eight domain‑specific benchmarks and multiple models with differing architectures, BinaryPPO improves accuracy by 40‑60 percentage points, reaching up to 99%, substantially outperforming supervised baselines. We provide an in‑depth analysis of the role of reward shaping, advantage scaling, and policy stability in enabling this improvement. Overall, we demonstrate that confidence‑based reward design provides a robust alternative to SFT for binary classification. Our code is available at https://github.com/psyonp/BinaryPPO.
Authors:Chengyuan Ma, Jiawei Jin, Ruijie Xiong, Chunxiang Jin, Canxiang Yan, Wenming Yang
Abstract:
We introduce and define a novel task‑Scene‑Aware Visually‑Driven Speech Synthesis, aimed at addressing the limitations of existing speech generation models in creating immersive auditory experiences that align with the real physical world. To tackle the two core challenges of data scarcity and modality decoupling, we propose VividVoice, a unified generative framework. First, we constructed a large‑scale, high‑quality hybrid multimodal dataset, Vivid‑210K, which, through an innovative programmatic pipeline, establishes a strong correlation between visual scenes, speaker identity, and audio for the first time. Second, we designed a core alignment module, D‑MSVA, which leverages a decoupled memory bank architecture and a cross‑modal hybrid supervision strategy to achieve fine‑grained alignment from visual scenes to timbre and environmental acoustic features. Both subjective and objective experimental results provide strong evidence that VividVoice significantly outperforms existing baseline models in terms of audio fidelity, content clarity, and multimodal consistency. Our demo is available at https://chengyuann.github.io/VividVoice/.
Authors:Xiaoce Wang, Guibin Zhang, Junzhe Li, Jinzhe Tu, Chun Li, Ming Li
Abstract:
Existing GUI agent models relying on coordinate‑based one‑step visual grounding struggle with generalizing to varying input resolutions and aspect ratios. Alternatives introduce coordinate‑free strategies yet suffer from learning under severe data scarcity. To address the limitations, we propose ToolTok, a novel paradigm of multi‑step pathfinding for GUI agents, where operations are modeled as a sequence of progressive tool usage. Specifically, we devise tools aligned with human interaction habits and represent each tool using learnable token embeddings. To enable efficient embedding learning under limited supervision, ToolTok introduces a semantic anchoring mechanism that grounds each tool with semantically related concepts as natural inductive bias. To further enable a pre‑trained large language model to progressively acquire tool semantics, we construct an easy‑to‑hard curriculum consisting of three tasks: token definition question‑answering, pure text‑guided tool selection, and simplified visual pathfinding. Extensive experiments on multiple benchmarks show that ToolTok achieves superior performance among models of comparable scale (4B) and remains competitive with a substantially larger model (235B). Notably, these results are obtained using less than 1% of the training data required by other post‑training approaches. In addition, ToolTok demonstrates strong generalization across unseen scenarios. Our training & inference code is open‑source at https://github.com/ZephinueCode/ToolTok.
Authors:Xianglong Yan, ChengZhu Bao, Zhiteng Li, Tianao Zhang, Shaoqiu Zhang, Ruobing Xie, Samm Sun, Yulun Zhang
Abstract:
Large language models (LLMs) deliver strong performance, but their high compute and memory costs make deployment difficult in resource‑constrained scenarios. Weight‑only post‑training quantization (PTQ) is appealing, as it reduces memory usage and enables practical speedup without low‑bit operators or specialized hardware. However, accuracy often degrades significantly in weight‑only PTQ at sub‑4‑bit precision, and our analysis identifies two main causes: (1) down‑projection matrices are a well‑known quantization bottleneck, but maintaining their fidelity often requires extra bit‑width; (2) weight quantization induces activation deviations, but effective correction strategies remain underexplored. To address these issues, we propose D^2Quant, a novel weight‑only PTQ framework that improves quantization from both the weight and activation perspectives. On the weight side, we design a Dual‑Scale Quantizer (DSQ) tailored to down‑projection matrices, with an absorbable scaling factor that significantly improves accuracy without increasing the bit budget. On the activation side, we propose Deviation‑Aware Correction (DAC), which incorporates a mean‑shift correction within LayerNorm to mitigate quantization‑induced activation distribution shifts. Extensive experiments across multiple LLM families and evaluation metrics show that D^2Quant delivers superior performance for weight‑only PTQ at sub‑4‑bit precision. The code and models will be available at https://github.com/XIANGLONGYAN/D2Quant.
Authors:Wenhao Sun, Rong-Cheng Tu, Yifu Ding, Zhao Jin, Jingyi Liao, Yongcheng Jing, Dacheng Tao
Abstract:
While Diffusion Language Models (DLMs) offer a flexible, arbitrary‑order alternative to the autoregressive paradigm, their non‑causal nature precludes standard KV caching, forcing costly hidden state recomputation at every decoding step. Existing DLM caching approaches reduce this cost by selective hidden state updates; however, they are still limited by (i) costly token‑wise update identification heuristics and (ii) rigid, uniform budget allocation that fails to account for heterogeneous hidden state dynamics. To address these challenges, we present SPA‑Cache that jointly optimizes update identification and budget allocation in DLM cache. First, we derive a low‑dimensional singular proxy that enables the identification of update‑critical tokens in a low‑dimensional subspace, substantially reducing the overhead of update identification. Second, we introduce an adaptive strategy that allocates fewer updates to stable layers without degrading generation quality. Together, these contributions significantly improve the efficiency of DLMs, yielding up to an 8× throughput improvement over vanilla decoding and a 2‑‑4× speedup over existing caching baselines.
Authors:Tianle Gu, Kexin Huang, Lingyu Li, Ruilin Luo, Shiyang Huang, Zongqi Wang, Yujiu Yang, Yan Teng, Yingchun Wang
Abstract:
Safety moderation is pivotal for identifying harmful content. Despite the success of textual safety moderation, its multimodal counterparts remain hindered by a dual sparsity of data and supervision. Conventional reliance on binary labels lead to shortcut learning, which obscures the intrinsic classification boundaries necessary for effective multimodal discrimination. Hence, we propose a novel learning paradigm (UniMod) that transitions from sparse decision‑making to dense reasoning traces. By constructing structured trajectories encompassing evidence grounding, modality assessment, risk mapping, policy decision, and response generation, we reformulate monolithic decision tasks into a multi‑dimensional boundary learning process. This approach forces the model to ground its decision in explicit safety semantics, preventing the model from converging on superficial shortcuts. To facilitate this paradigm, we develop a multi‑head scalar reward model (UniRM). UniRM provides multi‑dimensional supervision by assigning attribute‑level scores to the response generation stage. Furthermore, we introduce specialized optimization strategies to decouple task‑specific parameters and rebalance training dynamics, effectively resolving interference between diverse objectives in multi‑task learning. Empirical results show UniMod achieves competitive textual moderation performance and sets a new multimodal benchmark using less than 40% of the training data used by leading baselines. Ablations further validate our multi‑attribute trajectory reasoning, offering an effective and efficient framework for multimodal moderation. Supplementary materials are available at \hrefhttps://trustworthylab.github.io/UniMod/project website.
Authors:Theresia Veronika Rampisela, Maria Maistro, Tuukka Ruotsalo, Christina Lioma
Abstract:
Individual user fairness is commonly understood as treating similar users similarly. In Recommender Systems (RSs), several evaluation measures exist for quantifying individual user fairness. These measures evaluate fairness via either: (i) the disparity in RS effectiveness scores regardless of user similarity, or (ii) the disparity in items recommended to similar users regardless of item relevance. Both disparity in recommendation effectiveness and user similarity are very important in fairness, yet no existing individual user fairness measure simultaneously accounts for both. In brief, current user fairness evaluation measures implement a largely incomplete definition of fairness. To fill this gap, we present Pairwise User unFairness (PUF), a novel evaluation measure of individual user fairness that considers both effectiveness disparity and user similarity. PUF is the only measure that can express this important distinction. We empirically validate that PUF does this consistently across 4 datasets and 7 rankers, and robustly when varying user similarity or effectiveness. In contrast, all other measures are either almost insensitive to effectiveness disparity or completely insensitive to user similarity. We contribute the first RS evaluation measure to reliably capture both user similarity and effectiveness in individual user fairness. Our code: https://github.com/theresiavr/PUF‑individual‑user‑fairness‑recsys.
Authors:Yuming Zhao, Peiyi Zhang, Oana Ignat
Abstract:
Memes are a pervasive form of online communication, yet their cultural specificity poses significant challenges for cross‑cultural adaptation. We study cross‑cultural meme transcreation, a multimodal generation task that aims to preserve communicative intent and humor while adapting culture‑specific references. We propose a hybrid transcreation framework based on vision‑language models and introduce a large‑scale bidirectional dataset of Chinese and US memes. Using both human judgments and automated evaluation, we analyze 6,315 meme pairs and assess transcreation quality across cultural directions. Our results show that current vision‑language models can perform cross‑cultural meme transcreation to a limited extent, but exhibit clear directional asymmetries: US‑Chinese transcreation consistently achieves higher quality than Chinese‑US. We further identify which aspects of humor and visual‑textual design transfer across cultures and which remain challenging, and propose an evaluation framework for assessing cross‑cultural multimodal generation. Our code and dataset are publicly available at https://github.com/AIM‑SCU/MemeXGen.
Authors:Chen Hu, Qianxi Zhao, Yuming Li, Mingyu Zhou, Xiyin Li
Abstract:
The Newton‑Schulz (NS) iteration has gained increasing interest for its role in the Muon optimizer and the Stiefel manifold. However, the conventional NS iteration suffers from inefficiency and instability. Although various improvements have been introduced to NS iteration, they fail to deviate from the conventional iterative paradigm, which could increase computation burden largely due to the matrix products along the long dimension repeatedly. To address this, we consolidate the iterative structure into a unified framework, named Unified Newton‑Schulz Orthogonalization (UNSO). To do so, we could avoid a polynomial expansion. Instead, we evaluate the role of each matrix power, remove the insignificant terms, and provide a recommended polynomial with learnable coefficients. These learnable coefficients are then optimized, and achieve an outstanding performance with stable convergence. The code of our method is available: https://github.com/greekinRoma/Unified_Newton_Schulz_Orthogonalization.
Authors:Zehong Ma, Ruihan Xu, Shiliang Zhang
Abstract:
Pixel diffusion generates images directly in pixel space, avoiding the VAE artifacts and representational bottlenecks of two‑stage latent diffusion. Recent JiT further simplifies pixel diffusion with x‑prediction, where the model predicts clean images rather than velocity. However, the standard pixel‑wise diffusion loss treats all pixels equally, spending model capacity to perceptually insignificant signals and often leading to blurry samples. We propose PixelGen, an end‑to‑end pixel diffusion framework that augments x‑prediction with perceptual supervision. Specifically, PixelGen introduces two complementary perceptual losses on top of x‑prediction: an LPIPS loss for local textures and a P‑DINO loss for global semantics. To preserve sample coverage, PixelGen further proposes a noise‑gating strategy that applies these losses only at lower‑noise timesteps. On ImageNet‑256 without classifier‑free guidance, PixelGen achieves an FID of 5.11 in 80 training epochs, surpassing the latent diffusion baselines. Moreover, PixelGen scales efficiently to text‑to‑image generation, reaching a GenEval score of 0.79 with only 6 days of training on 8xH800 GPUs. These results show that perceptual supervision substantially narrows the gap between pixel and latent diffusion while preserving a simple one‑stage pipeline. Codes are available at https://github.com/Zehong‑Ma/PixelGen.
Authors:Yinjie Wang, Tianbao Xie, Ke Shen, Mengdi Wang, Ling Yang
Abstract:
We propose RLAnything, a reinforcement learning framework that dynamically forges environment, policy, and reward models through closed‑loop optimization, amplifying learning signals and strengthening the overall RL system for any LLM or agentic scenarios. Specifically, the policy is trained with integrated feedback from step‑wise and outcome signals, while the reward model is jointly optimized via consistency feedback, which in turn further improves policy training. Moreover, our theory‑motivated automatic environment adaptation improves training for both the reward and policy models by leveraging critic feedback from each, enabling learning from experience. Empirically, each added component consistently improves the overall system, and RLAnything yields substantial gains across various representative LLM and agentic tasks, boosting Qwen3‑VL‑8B‑Thinking by 9.1% on OSWorld and Qwen2.5‑7B‑Instruct by 18.7% and 11.9% on AlfWorld and LiveBench, respectively. We also that optimized reward‑model signals outperform outcomes that rely on human labels. Code: https://github.com/Gen‑Verse/Open‑AgentRL
Authors:Haozhen Zhang, Quanyu Long, Jianzhu Bao, Tao Feng, Weizhi Zhang, Haodong Yue, Wenya Wang
Abstract:
Most Large Language Model (LLM) agent memory systems rely on a small set of static, hand‑designed operations for extracting memory. These fixed procedures hard‑code human priors about what to store and how to revise memory, making them rigid under diverse interaction patterns and inefficient on long histories. To this end, we present MemSkill, which reframes these operations as learnable and evolvable memory skills, structured and reusable routines for extracting, consolidating, and pruning information from interaction traces. Inspired by the design philosophy of agent skills, MemSkill employs a \emphcontroller that learns to select a small set of relevant skills, paired with an LLM‑based \emphexecutor that produces skill‑guided memories. Beyond learning skill selection, MemSkill introduces a \emphdesigner that periodically reviews hard cases where selected skills yield incorrect or incomplete memories, and evolves the skill set by proposing refinements and new skills. Together, MemSkill forms a closed‑loop procedure that improves both the skill‑selection policy and the skill set itself. Experiments on LoCoMo, LongMemEval, HotpotQA, and ALFWorld demonstrate that MemSkill improves task performance over strong baselines and generalizes well across settings. Further analyses shed light on how skills evolve, offering insights toward more adaptive, self‑evolving memory management for LLM agents.
Authors:Ruiqi Wu, Xuanhua He, Meng Cheng, Tianyu Yang, Yong Zhang, Zhuoliang Kang, Xunliang Cai, Xiaoming Wei, Chunle Guo, Chongyi Li, Ming-Ming Cheng
Abstract:
We propose Infinite‑World, a robust interactive world model capable of maintaining coherent visual memory over 1000+ frames in complex real‑world environments. While existing world models can be efficiently optimized on synthetic data with perfect ground‑truth, they lack an effective training paradigm for real‑world videos due to noisy pose estimations and the scarcity of viewpoint revisits. To bridge this gap, we first introduce a Hierarchical Pose‑free Memory Compressor (HPMC) that recursively distills historical latents into a fixed‑budget representation. By jointly optimizing the compressor with the generative backbone, HPMC enables the model to autonomously anchor generations in the distant past with bounded computational cost, eliminating the need for explicit geometric priors. Second, we propose an Uncertainty‑aware Action Labeling module that discretizes continuous motion into a tri‑state logic. This strategy maximizes the utilization of raw video data while shielding the deterministic action space from being corrupted by noisy trajectories, ensuring robust action‑response learning. Furthermore, guided by insights from a pilot toy study, we employ a Revisit‑Dense Finetuning Strategy using a compact, 30‑minute dataset to efficiently activate the model's long‑range loop‑closure capabilities. Extensive experiments, including objective metrics and user studies, demonstrate that Infinite‑World achieves superior performance in visual quality, action controllability, and spatial consistency.
Authors:Ziwen Xu, Chenyan Wu, Hengyu Sun, Haiwen Hong, Mengru Wang, Yunzhi Yao, Longtao Huang, Hui Xue, Shumin Deng, Zhixuan Chu, Huajun Chen, Ningyu Zhang
Abstract:
Methods for controlling large language models (LLMs), including local weight fine‑tuning, LoRA‑based adaptation, and activation‑based interventions, are often studied in isolation, obscuring their connections and making comparison difficult. In this work, we present a unified view that frames these interventions as dynamic weight updates induced by a control signal, placing them within a single conceptual framework. Building on this view, we propose a unified preference‑utility analysis that separates control effects into preference, defined as the tendency toward a target concept, and utility, defined as coherent and task‑valid generation, and measures both on a shared log‑odds scale using polarity‑paired contrastive examples. Across methods, we observe a consistent trade‑off between preference and utility: stronger control increases preference while predictably reducing utility. We further explain this behavior through an activation manifold perspective, in which control shifts representations along target‑concept directions to enhance preference, while utility declines primarily when interventions push representations off the model's valid‑generation manifold. Finally, we introduce a new steering approach SPLIT guided by this analysis that improves preference while better preserving utility. Code is available at https://github.com/zjunlp/EasyEdit/blob/main/examples/SPLIT.md.
Authors:Min Cai, Yu Liang, Longzheng Wang, Yan Wang, Yueyang Zhang, Long Xia, Zhiyuan Sun, Xi Ye, Daiting Shi
Abstract:
Reinforcement learning (RL) has played a central role in recent advances in large reasoning models (LRMs), yielding strong gains in verifiable and open‑ended reasoning. However, training a single general‑purpose LRM across diverse domains remains challenging due to pronounced domain heterogeneity. Through a systematic study of two widely used strategies, Sequential RL and Mixed RL, we find that both incur substantial cross‑domain interference at the behavioral and gradient levels, resulting in limited overall gains. To address these challenges, we introduce Modular Gradient Surgery (MGS), which resolves gradient conflicts at the module level within the transformer. When applied to Llama and Qwen models, MGS achieves average improvements of 4.3 (16.6%) and 4.5 (11.1%) points, respectively, over standard multi‑task RL across three representative domains (math, general chat, and instruction following). Further analysis demonstrates that MGS remains effective under prolonged training. Overall, our study clarifies the sources of interference in multi‑domain RL and presents an effective solution for training general‑purpose LRMs.
Authors:Zheng Li, Jerry Cheng, Huanying Gu
Abstract:
Current time‑series forecasting models are primarily based on transformer‑style neural networks. These models achieve long‑term forecasting mainly by scaling up the model size rather than through genuinely autoregressive (AR) rollout. From the perspective of large language model training, traditional time‑series forecasting model training ignores the monotonic error‑growth heuristic. In this paper, we propose a novel training method for time‑series forecasting that enforces two key properties: (1) AR prediction errors should increase with the forecasting horizon. Violations of this trend are interpreted as rollout inconsistency and are softly penalized during training, and (2) the method enables models to be able to concatenate short‑term AR predictions to form flexible long‑term forecasts. Empirical results demonstrate that our method establishes a new state‑of‑the‑art across multiple benchmarks, achieving an MSE reduction of more than 10% compared to iTransformer and other recent strong baselines. Furthermore, it enables short‑horizon forecasting models to perform reliable long‑term predictions at horizons over 7.5 times longer. Code is available at https://github.com/LizhengMathAi/AROpt
Authors:Atharv Sonwane, Eng-Shen Tu, Wei-Chung Lu, Claas Beger, Carter Larsen, Debjit Dhar, Simon Alford, Rachel Chen, Ronit Pattanayak, Tuan Anh Dang, Guohao Chen, Gloria Geng, Kevin Ellis, Saikat Dutta
Abstract:
LLM‑powered coding agents are redefining how real‑world software is developed. To drive the research towards better coding agents, we require challenging benchmarks that can rigorously evaluate the ability of such agents to perform various software engineering tasks. However, popular coding benchmarks such as HumanEval and SWE‑Bench focus on narrowly scoped tasks such as competition programming and patch generation. In reality, software engineers have to handle a broader set of tasks for real‑world software development. To address this gap, we propose OmniCode, a novel software engineering benchmark that contains a broader and more diverse set of task categories beyond code or patch generation. Overall, OmniCode contains 1794 tasks spanning three programming languages (Python, Java, and C++) and four key categories: bug fixing, test generation, code review fixing, and style fixing. In contrast to prior software engineering benchmarks, the tasks in OmniCode are (1) manually validated to eliminate ill‑defined problems, and (2) synthetically crafted or recently curated to avoid data leakage issues, presenting a new framework for synthetically generating diverse software tasks from limited real‑world data. We evaluate OmniCode with popular agent frameworks such as SWE‑Agent and show that while they may perform well on bug fixing for Python, they fall short on tasks such as Test Generation and in languages such as C++ and Java. For instance, SWE‑Agent achieves a maximum of 20.9% with DeepSeek‑V3.1 on Java Test Generation tasks. OmniCode aims to serve as a robust benchmark and spur the development of agents that can perform well across different aspects of software development. Code and data are available at https://github.com/seal‑research/OmniCode.
Authors:Yu Zeng, Wenxuan Huang, Zhen Fang, Shuang Chen, Yufan Shen, Yishuo Cai, Xiaoman Wang, Zhenfei Yin, Lin Chen, Zehui Chen, Shiting Huang, Yiming Zhao, Xu Tang, Yao Hu, Philip Torr, Wanli Ouyang, Shaosheng Cao
Abstract:
Multimodal Large Language Models (MLLMs) have advanced VQA and now support Vision‑DeepResearch systems that use search engines for complex visual‑textual fact‑finding. However, evaluating these visual and textual search abilities is still difficult, and existing benchmarks have two major limitations. First, existing benchmarks are not visual search‑centric: answers that should require visual search are often leaked through cross‑textual cues in the text questions or can be inferred from the prior world knowledge in current MLLMs. Second, overly idealized evaluation scenario: On the image‑search side, the required information can often be obtained via near‑exact matching against the full image, while the text‑search side is overly direct and insufficiently challenging. To address these issues, we construct the Vision‑DeepResearch benchmark (VDR‑Bench) comprising 2,000 VQA instances. All questions are created via a careful, multi‑stage curation pipeline and rigorous expert review, designed to assess the behavior of Vision‑DeepResearch systems under realistic real‑world conditions. Moreover, to address the insufficient visual retrieval capabilities of current MLLMs, we propose a simple multi‑round cropped‑search workflow. This strategy is shown to effectively improve model performance in realistic visual retrieval scenarios. Overall, our results provide practical guidance for the design of future multimodal deep‑research systems. The code will be released in https://github.com/Osilly/Vision‑DeepResearch.
Authors:Pawel Batorski, Paul Swoboda
Abstract:
Machine unlearning aims to unlearn specified training data (e.g. sensitive or copyrighted material). A prominent approach is to fine‑tune an existing model with an unlearning loss that retains overall utility. The space of suitable unlearning loss functions is vast, making the search for an optimal loss function daunting. Additionally, there might not even exist a universally optimal loss function: differences in the structure and overlap of the forget and retain data can cause a loss to work well in one setting but over‑unlearn or under‑unlearn in another. Our approach EvoMU tackles these two challenges simultaneously. An evolutionary search procedure automatically finds task‑specific losses in the vast space of possible unlearning loss functions. This allows us to find dataset‑specific losses that match or outperform existing losses from the literature, without the need for a human‑in‑the‑loop. This work is therefore an instance of automatic scientific discovery, a.k.a. an AI co‑scientist. In contrast to previous AI co‑scientist works, we do so on a budget: We achieve SotA results using a small 4B parameter model (Qwen3‑4B‑Thinking), showing the potential of AI co‑scientists with limited computational resources. Our experimental evaluation shows that we surpass previous loss‑based unlearning formulations on TOFU‑5%, TOFU‑10%, MUSE and WMDP by synthesizing novel unlearning losses. Our code is available at https://github.com/Batorskq/EvoMU.
Authors:Nima Shoghi, Yuxuan Liu, Yuning Shen, Rob Brekelmans, Pan Li, Quanquan Gu
Abstract:
Molecular dynamics (MD) simulations remain the gold standard for studying protein dynamics, but their computational cost limits access to biologically relevant timescales. Recent generative models have shown promise in accelerating simulations, yet they struggle with long‑horizon generation due to architectural constraints, error accumulation, and inadequate modeling of spatio‑temporal dynamics. We present STAR‑MD (Spatio‑Temporal Autoregressive Rollout for Molecular Dynamics), a scalable SE(3)‑equivariant diffusion model that generates physically plausible protein trajectories over microsecond timescales. Our key innovation is a causal diffusion transformer with joint spatio‑temporal attention that efficiently captures complex space‑time dependencies while avoiding the memory bottlenecks of existing methods. On the standard ATLAS benchmark, STAR‑MD achieves state‑of‑the‑art performance across all metrics‑‑substantially improving conformational coverage, structural validity, and dynamic fidelity compared to previous methods. STAR‑MD successfully extrapolates to generate stable microsecond‑scale trajectories where baseline methods fail catastrophically, maintaining high structural quality throughout the extended rollout. Our comprehensive evaluation reveals severe limitations in current models for long‑horizon generation, while demonstrating that STAR‑MD's joint spatio‑temporal modeling enables robust dynamics simulation at biologically relevant timescales, paving the way for accelerated exploration of protein function.
Authors:Shuo Lu, Haohan Wang, Wei Feng, Weizhen Wang, Shen Zhang, Yaoyu Li, Ao Ma, Zheng Zhang, Jingjing Lv, Junjie Shen, Ching Law, Bing Zhan, Yuan Xu, Huizai Yao, Yongcan Yu, Chenyang Si, Jian Liang
Abstract:
Advertising image generation has increasingly focused on online metrics like Click‑Through Rate (CTR), yet existing approaches adopt a ``one‑size‑fits‑all" strategy that optimizes for overall CTR while neglecting preference diversity among user groups. This leads to suboptimal performance for specific groups, limiting targeted marketing effectiveness. To bridge this gap, we present One Size, Many Fits (OSMF), a unified framework that aligns diverse group‑wise click preferences in large‑scale advertising image generation. OSMF begins with product‑aware adaptive grouping, which dynamically organizes users based on their attributes and product characteristics, representing each group with rich collective preference features. Building on these groups, preference‑conditioned image generation employs a Group‑aware Multimodal Large Language Model (G‑MLLM) to generate tailored images for each group. The G‑MLLM is pre‑trained to simultaneously comprehend group features and generate advertising images. Subsequently, we fine‑tune the G‑MLLM using our proposed Group‑DPO for group‑wise preference alignment, which effectively enhances each group's CTR on the generated images. To further advance this field, we introduce the Grouped Advertising Image Preference Dataset (GAIP), the first large‑scale public dataset of group‑wise image preferences, including around 600K groups built from 40M users. Extensive experiments demonstrate that our framework achieves the state‑of‑the‑art performance in both offline and online settings. Our code and datasets will be released at https://github.com/JD‑GenX/OSMF.
Authors:Zhanghao Hu, Qinglin Zhu, Hanqi Yan, Yulan He, Lin Gui
Abstract:
Agent memory systems often adopt the standard Retrieval‑Augmented Generation (RAG) pipeline, yet its underlying assumptions differ in this setting. RAG targets large, heterogeneous corpora where retrieved passages are diverse, whereas agent memory is a bounded, coherent dialogue stream with highly correlated spans that are often duplicates. Under this shift, fixed top‑k similarity retrieval tends to return redundant context, and post‑hoc pruning can delete temporally linked prerequisites needed for correct reasoning. We argue retrieval should move beyond similarity matching and instead operate over latent components, following decoupling to aggregation: disentangle memories into semantic components, organise them into a hierarchy, and use this structure to drive retrieval. We propose xMemory, which builds a hierarchy of intact units and maintains a searchable yet faithful high‑level node organisation via a sparsity‑‑semantics objective that guides memory split and merge. At inference, xMemory retrieves top‑down, selecting a compact, diverse set of themes and semantics for multi‑fact queries, and expanding to episodes and raw messages only when it reduces the reader's uncertainty. Experiments on LoCoMo and PerLTQA across the three latest LLMs show consistent gains in answer quality and token efficiency.
Authors:Yoonjun Cho, Dongjae Jeon, Soeun Kim, Moongyu Jeon, Albert No
Abstract:
Quantization Error Reconstruction (QER) reduces accuracy loss in Post‑Training Quantization (PTQ) by approximating weights as \mathbfW \approx \mathbfQ + \mathbfL\mathbfR, using a rank‑r correction to reconstruct quantization error. Prior methods devote the full rank budget to error reconstruction, which is suboptimal when \mathbfW has intrinsic low‑rank structure and quantization corrupts dominant directions. We propose Structured Residual Reconstruction (SRR), a rank‑allocation framework that preserves the top‑k singular subspace of the activation‑scaled weight before quantization, quantizes only the residual, and uses the remaining rank r‑k for error reconstruction. We derive a theory‑guided criterion for selecting k by balancing quantization‑exposed energy and unrecoverable error under rank constraints. We further show that resulting \mathbfQ + \mathbfL\mathbfR parameterization naturally supports Quantized Parameter‑Efficient Fine‑Tuning (QPEFT), and stabilizes fine‑tuning via gradient scaling along preserved directions. Experiments demonstrate consistent perplexity reductions across diverse models and quantization settings in PTQ, along with a 5.9 percentage‑point average gain on GLUE under 2‑bit QPEFT. The project page is available at https://ai‑isl.github.io/srr.
Authors:Bing He, Jingnan Gao, Yunuo Chen, Ning Cao, Gang Chen, Zhengxue Cheng, Li Song, Wenjun Zhang
Abstract:
Reconstructing 3D scenes from sparse images remains a challenging task due to the difficulty of recovering accurate geometry and texture without optimization. Recent approaches leverage generalizable models to generate 3D scenes using 3D Gaussian Splatting (3DGS) primitive. However, they often fail to produce continuous surfaces and instead yield discrete, color‑biased point clouds that appear plausible at normal resolution but reveal severe artifacts under close‑up views. To address this issue, we present SurfSplat, a feedforward framework based on 2D Gaussian Splatting (2DGS) primitive, which provides stronger anisotropy and higher geometric precision. By incorporating a surface continuity prior and a forced alpha blending strategy, SurfSplat reconstructs coherent geometry together with faithful textures. Furthermore, we introduce High‑Resolution Rendering Consistency (HRRC), a new evaluation metric designed to evaluate high‑resolution reconstruction quality. Extensive experiments on RealEstate10K, DL3DV, and ScanNet demonstrate that SurfSplat consistently outperforms prior methods on both standard metrics and HRRC, establishing a robust solution for high‑fidelity 3D reconstruction from sparse inputs. Project page: https://hebing‑sjtu.github.io/SurfSplat‑website/
Authors:Zhen-Hao Xie, Jun-Tao Tang, Yu-Cheng Shi, Han-Jia Ye, De-Chuan Zhan, Da-Wei Zhou
Abstract:
Multimodal Large Language Models (MLLMs) achieve strong performance through instruction tuning, but real‑world deployment requires them to continually expand their capabilities, making Multimodal Continual Instruction Tuning (MCIT) essential. Recent methods leverage sparse expert routing to promote task specialization, but we find that the expert routing process suffers from drift as the data distribution evolves. For example, a grounding query that previously activated localization experts may instead be routed to irrelevant experts after learning OCR tasks. Meanwhile, the grounding‑related experts can be overwritten by new tasks and lose their original functionality. Such failure reflects two problems: router drift, where expert selection becomes inconsistent over time, and expert drift, where shared experts are overwritten across tasks. Therefore, we propose StAbilized Mixture‑of‑Experts (SAME) for MCIT. To address router drift, SAME stabilizes expert selection by decomposing routing dynamics into orthogonal subspaces and updating only task‑relevant directions. To mitigate expert drift, we regulate expert updates via curvature‑aware scaling using historical input covariance in a rehearsal‑free manner. SAME also introduces adaptive expert activation to freeze selected experts during training, reducing redundant computation and cross‑task interference. We also introduce a new benchmark to evaluate MCIT with long task sequence, and extensive experiments demonstrate SAME's SOTA performance. Code is available at https://github.com/LAMDA‑CL/Prism.
Authors:Hongwei Yan, Guanglong Sun, Kanglei Zhou, Qian Li, Liyuan Wang, Yi Zhong
Abstract:
General continual learning (GCL) challenges intelligent systems to learn from single‑pass, non‑stationary data streams without clear task boundaries. While recent advances in continual parameter‑efficient tuning (PET) of pretrained models show promise, they typically rely on multiple training epochs and explicit task cues, limiting their effectiveness in GCL scenarios. Moreover, existing methods often lack targeted design and fail to address two fundamental challenges in continual PET: how to allocate expert parameters to evolving data distributions, and how to improve their representational capacity under limited supervision. Inspired by the fruit fly's hierarchical memory system characterized by sparse expansion and modular ensembles, we propose FlyPrompt, a brain‑inspired framework that decomposes GCL into two subproblems: expert routing and expert competence improvement. FlyPrompt introduces a randomly expanded analytic router for instance‑level expert activation and a temporal ensemble of output heads to dynamically adapt decision boundaries over time. Extensive theoretical and empirical evaluations demonstrate FlyPrompt's superior performance, achieving up to 11.23%, 12.43%, and 7.62% gains over state‑of‑the‑art baselines on CIFAR‑100, ImageNet‑R, and CUB‑200, respectively. Our source code is available at https://github.com/AnAppleCore/FlyGCL.
Authors:Yuliang Zhan, Jian Li, Wenbing Huang, Wenbing Huang, Yang Liu, Hao Sun
Abstract:
Deep learning has demonstrated remarkable capabilities in simulating complex dynamic systems. However, existing methods require known physical properties as supervision or inputs, limiting their applicability under unknown conditions. To explore this challenge, we introduce Cloth Dynamics Grounding (CDG), a novel scenario for unsupervised learning of cloth dynamics from multi‑view visual observations. We further propose Cloth Dynamics Splatting (CloDS), an unsupervised dynamic learning framework designed for CDG. CloDS adopts a three‑stage pipeline that first performs video‑to‑geometry grounding and then trains a dynamics model on the grounded meshes. To cope with large non‑linear deformations and severe self‑occlusions during grounding, we introduce a dual‑position opacity modulation that supports bidirectional mapping between 2D observations and 3D geometry via mesh‑based Gaussian splatting in video‑to‑geometry grounding stage. It jointly considers the absolute and relative position of Gaussian components. Comprehensive experimental evaluations demonstrate that CloDS effectively learns cloth dynamics from visual data while maintaining strong generalization capabilities for unseen configurations. Our code is available at https://github.com/whynot‑zyl/CloDS. Visualization results are available at https://github.com/whynot‑zyl/CloDS_video.%\footnoteAs in this example.
Authors:Runsong Zhao, Shilei Liu, Jiwei Tang, Langming Liu, Haibin Chen, Weidong Zhang, Yujin Yuan, Tong Xiao, Jingbo Zhu, Wenbo Su, Bo Zheng
Abstract:
The quadratic complexity and indefinitely growing key‑value (KV) cache of standard Transformers pose a major barrier to long‑context processing. To overcome this, we introduce the Collaborative Memory Transformer (CoMeT), a novel architecture that enables LLMs to handle arbitrarily long sequences with constant memory usage and linear time complexity. Designed as an efficient, plug‑in module, CoMeT can be integrated into pre‑trained models with only minimal fine‑tuning. It operates on sequential data chunks, using a dual‑memory system to manage context: a temporary memory on a FIFO queue for recent events, and a global memory with a gated update rule for long‑range dependencies. These memories then act as a dynamic soft prompt for the next chunk. To enable efficient fine‑tuning on extremely long contexts, we introduce a novel layer‑level pipeline parallelism strategy. The effectiveness of our approach is remarkable: a model equipped with CoMeT and fine‑tuned on 32k contexts can accurately retrieve a passkey from any position within a 1M token sequence. On the SCROLLS benchmark, CoMeT surpasses other efficient methods and achieves performance comparable to a full‑attention baseline on summarization tasks. Its practical effectiveness is further validated on real‑world agent and user behavior QA tasks. The code is available at: https://github.com/LivingFutureLab/Comet
Authors:Hayeong Lee, JunHyeok Oh, Byung-Jun Lee
Abstract:
The design of environments plays a critical role in shaping the development and evaluation of cooperative multi‑agent reinforcement learning (MARL) algorithms. While existing benchmarks highlight critical challenges, they often lack the modularity required to design custom evaluation scenarios. We introduce the Totally Accelerated Battle Simulator in JAX (TABX), a high‑throughput sandbox designed for reconfigurable multi‑agent tasks. TABX provides granular control over environmental parameters, permitting a systematic investigation into emergent agent behaviors and algorithmic trade‑offs across a diverse spectrum of task complexities. Leveraging JAX for hardware‑accelerated execution on GPUs, TABX enables massive parallelization and significantly reduces computational overhead. By providing a fast, extensible, and easily customized framework, TABX facilitates the study of MARL agents in complex structured domains and serves as a scalable foundation for future research. Our code is available at: https://github.com/ku‑dmlab/TABX.
Authors:Yinchao Ma, Qiang Zhou, Zhibin Wang, Xianing Chen, Hanqing Yang, Jun Song, Bo Zheng
Abstract:
Video large language models have demonstrated remarkable capabilities in video understanding tasks. However, the redundancy of video tokens introduces significant computational overhead during inference, limiting their practical deployment. Many compression algorithms are proposed to prioritize retaining features with the highest attention scores to minimize perturbations in attention computations. However, the correlation between attention scores and their actual contribution to correct answers remains ambiguous. To address the above limitation, we propose a novel Contribution‑aware token Compression algorithm for VIDeo understanding (CaCoVID) that explicitly optimizes the token selection policy based on the contribution of tokens to correct predictions. First, we introduce a reinforcement learning‑based framework that optimizes a policy network to select video token combinations with the greatest contribution to correct predictions. This paradigm shifts the focus from passive token preservation to active discovery of optimal compressed token combinations. Secondly, we propose a combinatorial policy optimization algorithm with online combination space sampling, which dramatically reduces the exploration space for video token combinations and accelerates the convergence speed of policy optimization. Extensive experiments on diverse video understanding benchmarks demonstrate the effectiveness of CaCoVID. Codes are available at https://github.com/LivingFutureLab/CaCoVID.
Authors:Seyed Mohammad Hadi Hosseini, Mahdieh Soleymani Baghshah
Abstract:
Unsupervised Skill Discovery (USD) aims to autonomously learn a diverse set of skills without relying on extrinsic rewards. One of the most common USD approaches is to maximize the Mutual Information (MI) between skill latent variables and states. However, MI‑based methods tend to favor simple, static skills due to their invariance properties, limiting the discovery of dynamic, task‑relevant behaviors. Distance‑Maximizing Skill Discovery (DSD) promotes more dynamic skills by leveraging state‑space distances, yet still fall short in encouraging comprehensive skill sets that engage all controllable factors or entities in the environment. In this work, we introduce SUSD, a novel framework that harnesses the compositional structure of environments by factorizing the state space into independent components (e.g., objects or controllable entities). SUSD allocates distinct skill variables to different factors, enabling more fine‑grained control on the skill discovery process. A dynamic model also tracks learning across factors, adaptively steering the agent's focus toward underexplored factors. This structured approach not only promotes the discovery of richer and more diverse skills, but also yields a factorized skill representation that enables fine‑grained and disentangled control over individual entities which facilitates efficient training of compositional downstream tasks via Hierarchical Reinforcement Learning (HRL). Our experimental results across three environments, with factors ranging from 1 to 10, demonstrate that our method can discover diverse and complex skills without supervision, significantly outperforming existing unsupervised skill discovery methods in factorized and complex environments. Code is publicly available at: https://github.com/hadi‑hosseini/SUSD.
Authors:Huu Hiep Nguyen, Minh Hoang Nguyen, Dung Nguyen, Hung Le
Abstract:
Multimodal time series forecasting is crucial in real‑world applications, where decisions depend on both numerical data and contextual signals. The core challenge is to effectively combine temporal numerical patterns with the context embedded in other modalities, such as text. While most existing methods align textual features with time‑series patterns one step at a time, they neglect the multiscale temporal influences of contextual information such as time‑series cycles and dynamic shifts. This mismatch between local alignment and global textual context can be addressed by spectral decomposition, which separates time series into frequency components capturing both short‑term changes and long‑term trends. In this paper, we propose SpecTF, a simple yet effective framework that integrates the effect of textual data on time series in the frequency domain. Our method extracts textual embeddings, projects them into the frequency domain, and fuses them with the time series' spectral components using a lightweight cross‑attention mechanism. This adaptively reweights frequency bands based on textual relevance before mapping the results back to the temporal domain for predictions. Experimental results demonstrate that SpecTF significantly outperforms state‑of‑the‑art models across diverse multi‑modal time series datasets while utilizing considerably fewer parameters. Code is available at https://github.com/hiepnh137/SpecTF.
Authors:Quang Truong, Yu Song, Donald Loveland, Mingxuan Ju, Tong Zhao, Neil Shah, Jiliang Tang
Abstract:
Link prediction is a core challenge in graph machine learning, demanding models that capture rich and complex topological dependencies. While Graph Neural Networks (GNNs) are the standard solution, state‑of‑the‑art pipelines often rely on explicit structural heuristics or memory‑intensive node embeddings ‑‑ approaches that struggle to generalize or scale to massive graphs. Emerging Graph Transformers (GTs) offer a potential alternative but often incur significant overhead due to complex structural encodings, hindering their applications to large‑scale link prediction. We challenge these sophisticated paradigms with PENCIL, an encoder‑only plain Transformer that replaces hand‑crafted priors with attention over sampled local subgraphs, retaining the scalability and hardware efficiency of standard Transformers. Through experimental and theoretical analysis, we show that PENCIL extracts richer structural signals than GNNs, implicitly generalizing a broad class of heuristics and subgraph‑based expressivity. Empirically, PENCIL outperforms heuristic‑informed GNNs and is far more parameter‑efficient than ID‑embedding‑‑based alternatives, while remaining competitive across diverse benchmarks ‑‑ even without node features. Our results challenge the prevailing reliance on complex engineering techniques, demonstrating that simple design choices are potentially sufficient to achieve the same capabilities. Our code is publicly available at https://github.com/quang‑truong/pencil.
Authors:Xiaoyu Wen, Zhida He, Han Qi, Ziyu Wan, Zhongtian Ma, Ying Wen, Tianhang Zheng, Xingcheng Xu, Chaochao Lu, Qiaosheng Zhang
Abstract:
Ensuring robust safety alignment is crucial for Large Language Models (LLMs), yet existing defenses often lag behind evolving adversarial attacks due to their reliance on static, pre‑collected data distributions. In this paper, we introduce MAGIC, a novel multi‑turn multi‑agent reinforcement learning framework that formulates LLM safety alignment as an adversarial asymmetric game. Specifically, an attacker agent learns to iteratively rewrite original queries into deceptive prompts, while a defender agent simultaneously optimizes its policy to recognize and refuse such inputs. This dynamic process triggers a co‑evolution, where the attacker's ever‑changing strategies continuously uncover long‑tail vulnerabilities, driving the defender to generalize to unseen attack patterns. Remarkably, we observe that the attacker, endowed with initial reasoning ability, evolves novel, previously unseen combinatorial strategies through iterative RL training, underscoring our method's substantial potential. Theoretically, we provide insights into a more robust game equilibrium and derive safety guarantees. Extensive experiments validate our framework's effectiveness, demonstrating superior defense success rates without compromising the helpfulness of the model. Our code is available at https://github.com/BattleWen/MAGIC.
Authors:Jongseok Park, Sunga Kim, Alvin Cheung, Ion Stoica
Abstract:
Despite their importance in model sampling, efficient implementation of Top‑k and Top‑p algorithms for large vocabularies remains a significant challenge. Existing approaches often rely on sorting, which incurs significant computation and memory overhead on GPUs, or on stochastic approaches that alter the algorithm's output. In this work, we propose Qrita, an efficient Top‑k and Top‑p algorithm based on a pivot‑based truncation and selection. Qrita leverages pivot‑based search for both Top‑k and Top‑p with two key techniques: 1. Gaussian‑based sigma‑truncation, which greatly reduces the search space of the vocabulary, and 2. Quaternary pivot search with duplication handling, which halves the number of pivot search iterations and guarantees deterministic output. We implement Qrita using Triton and evaluate its performance against the Top‑k and Top‑p kernels of high‑performance LLM execution engines such as SGLang and FlashInfer, improving end‑to‑end serving throughput up to 1.4 times with half the memory usage, while providing the same output as the sorting‑based algorithms. Qrita is now the default Top‑k and Top‑p sampler for the GPU execution path of vLLM, and a ternary implementation of Qrita is available at https://github.com/vllm‑project/vllm/blob/main/vllm/v1/sample/ops/topk_topp_triton.py.
Authors:Haojia Zhu, Qinyuan Xu, Haoyu Li, Yuxi Liu, Hanchen Qiu, Jiaoyan Chen, Jiahui Jin
Abstract:
Aggregation query over free text is a long‑standing yet underexplored problem. Unlike ordinary question answering, aggregate queries require exhaustive evidence collection and systems are required to "find all," not merely "find one." Existing paradigms such as Text‑to‑SQL and Retrieval‑Augmented Generation fail to achieve this completeness. In this work, we formalize entity‑level aggregation querying over text in a corpus‑bounded setting with strict completeness requirement. To enable principled evaluation, we introduce AGGBench, a benchmark designed to evaluate completeness‑oriented aggregation under realistic large‑scale corpus. To accompany the benchmark, we propose DFA (Disambiguation‑‑Filtering‑‑Aggregation), a modular agentic baseline that decomposes aggregation querying into interpretable stages and exposes key failure modes related to ambiguity, filtering, and aggregation. Empirical results show that DFA consistently improves aggregation evidence coverage over strong RAG and agentic baselines. The data and code are available in \hrefhttps://anonymous.4open.science/r/DFA‑A4C1.
Authors:Kangjun Noh, Seongchan Lee, Ilmun Kim, Kyungwoo Song
Abstract:
Ensuring factuality is essential for the safe use of Large Language Models (LLMs) in high‑stakes domains such as medicine and law. Conformal inference provides distribution‑free guarantees, but existing approaches are either overly conservative, discarding many true‑claims, or rely on adaptive error rates and simple linear models that fail to capture complex group structures. To address these challenges, we reformulate conformal inference in a multiplicative filtering setting, modeling factuality as a product of claim‑level scores. Our method, Multi‑LLM Adaptive Conformal Inference (MACI), leverages ensembles to produce more accurate factuality‑scores, which in our experiments led to higher retention, while validity is preserved through group‑conditional calibration. Experiments show that MACI consistently achieves user‑specified coverage with substantially higher retention and lower time cost than baselines. Our repository is available at https://github.com/MLAI‑Yonsei/MACI
Authors:Takahito Nakajima
Abstract:
Background: As of 2026, Large Language Models (LLMs) demonstrate expert‑level medical knowledge. However, deploying them as autonomous "Clinical Agents" remains limited. Current Electronic Medical Records (EMRs) and standards like FHIR are designed for human review, creating a "Context Mismatch": AI agents receive fragmented data and must rely on probabilistic inference (e.g., RAG) to reconstruct patient history. This approach causes hallucinations and hinders auditability. Methods: We propose MedBeads, an agent‑native data infrastructure where clinical events are immutable "Beads"‑‑nodes in a Merkle Directed Acyclic Graph (DAG)‑‑cryptographically referencing causal predecessors. This "write‑once, read‑many" architecture makes tampering mathematically detectable. We implemented a prototype with a Go Core Engine, Python middleware for LLM integration, and a React‑based visualization interface. Results: We successfully implemented the workflow using synthetic data. The FHIR‑to‑DAG conversion transformed flat resources into a causally‑linked graph. Our Breadth‑First Search (BFS) Context Retrieval algorithm traverses relevant subgraphs with O(V+E) complexity, enabling real‑time decision support. Tamper‑evidence is guaranteed by design: any modification breaks the cryptographic chain. The visualization aids clinician understanding through explicit causal links. Conclusion: MedBeads addresses the "Context Mismatch" by shifting from probabilistic search to deterministic graph traversal, and from mutable records to immutable chains, providing the substrate for "Trustworthy Medical AI." It guarantees the context the AI receives is deterministic and tamper‑evident, while the LLM determines interpretation. The structured Bead format serves as a token‑efficient "AI‑native language." We release MedBeads as open‑source software to accelerate agent‑native data standards.
Authors:Guangshuo Qin, Zhiteng Li, Zheng Chen, Weihang Zhang, Linghe Kong, Yulun Zhang
Abstract:
Mixture‑of‑Experts(MoE) Vision‑Language Models (VLMs) offer remarkable performance but incur prohibitive memory and computational costs, making compression essential. Post‑Training Quantization (PTQ) is an effective training‑free technique to address the massive memory and computation overhead. Existing quantization paradigms fall short as they are oblivious to two critical forms of heterogeneity: the inherent discrepancy between vision and language tokens, and the non‑uniform contribution of different experts. To bridge this gap, we propose Visual Expert Quantization (VEQ), a dual‑aware quantization framework designed to simultaneously accommodate cross‑modal differences and heterogeneity between experts. Specifically, VEQ incorporates 1)Modality‑expert‑aware Quantization, which utilizes expert activation frequency to prioritize error minimization for pivotal experts, and 2)Modality‑affinity‑aware Quantization, which constructs an enhanced Hessian matrix by integrating token‑expert affinity with modality information to guide the calibration process. Extensive experiments across diverse benchmarks verify that VEQ consistently outperforms state‑of‑the‑art baselines. Specifically, under the W3A16 configuration, our method achieves significant average accuracy gains of 2.04% on Kimi‑VL and 3.09% on Qwen3‑VL compared to the previous SOTA quantization methods, demonstrating superior robustness across various multimodal tasks. Our code will be available at https://github.com/guangshuoqin/VEQ.
Authors:Kaiyuan Cui, Yige Li, Yutao Wu, Xingjun Ma, Sarah Erfani, Christopher Leckie, Hanxun Huang
Abstract:
Vision‑language models (VLMs) extend large language models (LLMs) with vision encoders, enabling text generation conditioned on both images and text. However, this multimodal integration expands the attack surface by exposing the model to image‑based jailbreaks crafted to induce harmful responses. Existing gradient‑based jailbreak methods transfer poorly, as adversarial patterns overfit to a single white‑box surrogate and fail to generalise to black‑box models. In this work, we propose Universal and transferable jailbreak (UltraBreak), a framework that constrains adversarial patterns through transformations and regularisation in the vision space, while relaxing textual targets through semantic‑based objectives. By defining its loss in the textual embedding space of the target LLM, UltraBreak discovers universal adversarial patterns that generalise across diverse jailbreak objectives. This combination of vision‑level regularisation and semantically guided textual supervision mitigates surrogate overfitting and enables strong transferability across both models and attack targets. Extensive experiments show that UltraBreak consistently outperforms prior jailbreak methods. Further analysis reveals why earlier approaches fail to transfer, highlighting that smoothing the loss landscape via semantic objectives is crucial for enabling universal and transferable jailbreaks. The code is publicly available in our \hrefhttps://github.com/kaiyuanCui/UltraBreakGitHub repository.
Authors:Yutong Song, Shiva Shrestha, Chenhan Lyu, Elahe Khatibi, Pengfei Zhang, Honghui Xu, Nikil Dutt, Amir Rahmani
Abstract:
Spoken question‑answering (SQA) systems relying on automatic speech recognition (ASR) often struggle with accurately recognizing medical terminology. To this end, we propose MedSpeak, a novel knowledge graph‑aided ASR error correction framework that refines noisy transcripts and improves downstream answer prediction by leveraging both semantic relationships and phonetic information encoded in a medical knowledge graph, together with the reasoning power of LLMs. Comprehensive experimental results on benchmarks demonstrate that MedSpeak significantly improves the accuracy of medical term recognition and overall medical SQA performance, establishing MedSpeak as a state‑of‑the‑art solution for medical SQA. The code is available at https://github.com/RainieLLM/MedSpeak.
Authors:Víctor Yeste, Paolo Rosso
Abstract:
Sentence‑level human value detection is typically framed as multi‑label classification over Schwartz values, but it remains unclear whether Schwartz higher‑order (HO) categories provide usable structure. We study this under a strict compute‑frugal budget (single 8 GB GPU) on ValueEval'24 / ValuesML (74K English sentences). We compare (i) direct supervised transformers, (ii) HO\rightarrowvalues pipelines that enforce the hierarchy with hard masks, and (iii) Presence\rightarrowHO\rightarrowvalues cascades, alongside low‑cost add‑ons (lexica, short context, topics), label‑wise threshold tuning, small instruction‑tuned LLM baselines (\le10B), QLoRA, and simple ensembles. HO categories are learnable from single sentences (e.g., the easiest bipolar pair reaches Macro‑F_1\approx0.58), but hard hierarchical gating is not a reliable win: it often reduces end‑task Macro‑F_1 via error compounding and recall suppression. In contrast, label‑wise threshold tuning is a high‑leverage knob (up to +0.05 Macro‑F_1), and small transformer ensembles provide the most consistent additional gains (up to +0.02 Macro‑F_1). Small LLMs lag behind supervised encoders as stand‑alone systems, yet can contribute complementary errors in cross‑family ensembles. Overall, HO structure is useful descriptively, but enforcing it with hard gates hurts sentence‑level value detection; robust improvements come from calibration and lightweight ensembling.
Authors:Gaurav Srivastava, Aafiya Hussain, Chi Wang, Yingyan Celine Lin, Xuan Wang
Abstract:
Most existing language model agentic systems today are built and optimized for large language models (e.g., GPT, Claude, Gemini) via API calls. While powerful, this approach faces several limitations including high token costs and privacy concerns for sensitive applications. We introduce effGen, an open‑source agentic framework optimized for small language models (SLMs) that enables effective, efficient, and secure local deployment (pip install effgen). effGen makes four major contributions: (1) Enhanced tool‑calling with prompt optimization that compresses contexts by 70‑80% while preserving task semantics, (2) Intelligent task decomposition that breaks complex queries into parallel or sequential subtasks based on dependencies, (3) Complexity‑based routing using five factors to make smart pre‑execution decisions, and (4) Unified memory system combining short‑term, long‑term, and vector‑based storage. Additionally, effGen unifies multiple agent protocols (MCP, A2A, ACP) for cross‑protocol communication. Results on 13 benchmarks show effGen outperforms LangChain, AutoGen, and Smolagents with higher success rates, faster execution, and lower memory. Our results reveal that prompt optimization and complexity routing have complementary scaling behavior: optimization benefits SLMs more (11.2% gain at 1.5B vs 2.4% at 32B), while routing benefits large models more (3.6% at 1.5B vs 7.9% at 32B), providing consistent gains across all scales when combined. effGen (https://effgen.org/) is released under the MIT License, ensuring broad accessibility for research and commercial use. Our framework code is publicly available at https://github.com/ctrl‑gaurav/effGen.
Authors:Alicja Polowczyk, Agnieszka Polowczyk, Piotr Borycki, Joanna Waczyńska, Jacek Tabor, Przemysław Spurek
Abstract:
Despite impressive results from recent text‑to‑image models like FLUX, visual and anatomical artifacts remain a significant hurdle for practical and professional use. Existing methods for artifact reduction, typically work in a post‑hoc manner, consequently failing to intervene effectively during the core image formation process. Notably, current techniques require problematic and invasive modifications to the model weights, or depend on a computationally expensive and time‑consuming process of regional refinement. To address these limitations, we propose DIAMOND, a training‑free method that applies trajectory correction to mitigate artifacts during inference. By reconstructing an estimate of the clean sample at every step of the generative trajectory, DIAMOND actively steers the generation process away from latent states that lead to artifacts. Furthermore, we extend the proposed method to standard Diffusion Models, demonstrating that DIAMOND provides a robust, zero‑shot path to high‑fidelity, artifact‑free image synthesis without the need for additional training or weight modifications in modern generative architectures. Code is available at https://gmum.github.io/DIAMOND/
Authors:Yuhao Huang, Taos Transue, Shih-Hsin Wang, William Feldman, Hong Zhang, Bao Wang
Abstract:
Conditional flow matching (CFM) stands out as an efficient, simulation‑free approach for training flow‑based generative models, achieving remarkable performance for data generation. However, CFM is insufficient to ensure accuracy in learning probability paths. In this paper, we introduce a new partial differential equation characterization for the error between the learned and exact probability paths, along with its solution. We show that the total variation gap between the two probability paths is bounded above by a combination of the CFM loss and an associated divergence loss. This theoretical insight leads to the design of a new objective function that simultaneously matches the flow and its divergence. Our new approach improves the performance of the flow‑based generative model by a noticeable margin without sacrificing generation efficiency. We showcase the advantages of this enhanced training approach over CFM on several important benchmark tasks, including generative modeling for dynamical systems, DNA sequences, and videos. Code is available at \hrefhttps://github.com/Utah‑Math‑Data‑Science/Flow_Div_MatchingUtah‑Math‑Data‑Science.
Authors:Hyejun Jeong, Amir Houmansadr, Shlomo Zilberstein, Eugene Bagdasarian
Abstract:
Modern AI agents increasingly combine conversational interaction with autonomous task execution, such as coding and web research, raising a natural question: what happens when an agent engaged in long‑horizon tasks is subjected to user persuasion? We study how belief‑level intervention can influence downstream task behavior, a phenomenon we name \emphpersuasion propagation. We introduce a behavior‑centered evaluation framework that distinguishes between persuasion applied during or prior to task execution. Across web research and coding tasks, we find that on‑the‑fly persuasion induces weak and inconsistent behavioral effects. In contrast, when the belief state is explicitly specified at task time, belief‑prefilled agents conduct on average 26.9% fewer searches and visit 16.9% fewer unique sources than neutral‑prefilled agents. These results suggest that persuasion, even in prior interaction, can affect the agent's behavior, motivating behavior‑level evaluation in agentic systems.
Authors:Shengrui Li, Fei Zhao, Kaiyan Zhao, Jieying Ye, Haifeng Liu, Fangcheng Shi, Zheyong Xie, Yao Hu, Shaosheng Cao
Abstract:
Determining an effective data mixture is a key factor in Large Language Model (LLM) pre‑training, where models must balance general competence with proficiency on hard tasks such as math and code. However, identifying an optimal mixture remains an open challenge, as existing approaches either rely on unreliable tiny‑scale proxy experiments or require prohibitively expensive large‑scale exploration. To address this, we propose Decouple Searching from Training Mix (DeMix), a novel framework that leverages model merging to predict optimal data ratios. Instead of training proxy models for every sampled mixture, DeMix trains component models on candidate datasets at scale and derives data mixture proxies via weighted model merging. This paradigm decouples search from training costs, enabling evaluation of unlimited sampled mixtures without extra training burden and thus facilitating better mixture discovery through more search trials. Extensive experiments demonstrate that DeMix breaks the trade‑off between sufficiency, accuracy and efficiency, obtaining the optimal mixture with higher benchmark performance at lower search cost. Additionally, we release the DeMix Corpora, a comprehensive 22T‑token dataset comprising high‑quality pre‑training data with validated mixtures to facilitate open research. Our code and DeMix Corpora is available at https://github.com/Lucius‑lsr/DeMix.
Authors:Yuecheng Li, Hengwei Ju, Zeyu Song, Wei Yang, Chi Lu, Peng Jiang, Kun Gai
Abstract:
Multimodal recommendation systems typically integrates user behavior with multimodal data from items, thereby capturing more accurate user preferences. Concurrently, with the rise of large models (LMs), multimodal recommendation is increasingly leveraging their strengths in semantic understanding and contextual reasoning. However, LM representations are inherently optimized for general semantic tasks, while recommendation models rely heavily on sparse user/item unique identity (ID) features. Existing works overlook the fundamental representational divergence between large models and recommendation systems, resulting in incompatible multimodal representations and suboptimal recommendation performance. To bridge this gap, we propose RecGOAT, a novel yet simple dual semantic alignment framework for LLM‑enhanced multimodal recommendation, which offers theoretically guaranteed alignment capability. RecGOAT first employs graph attention networks to enrich collaborative semantics by modeling item‑item, user‑item, and user‑user relationships, leveraging user/item LM representations and interaction history. Furthermore, we design a dual‑granularity progressive multimodality‑ID alignment framework, which achieves instance‑level and distribution‑level semantic alignment via cross‑modal contrastive learning (CMCL) and optimal adaptive transport (OAT), respectively. Theoretically, we demonstrate that the unified representations derived from our alignment framework exhibit superior semantic consistency and comprehensiveness. Extensive experiments on three public benchmarks show that our RecGOAT achieves state‑of‑the‑art performance, empirically validating our theoretical insights. Additionally, the deployment on a large‑scale online advertising platform confirms the model's effectiveness and scalability in industrial recommendation scenarios. Code available at https://github.com/6lyc/RecGOAT‑LLM4Rec.
Authors:Chao Li, Shangdong Yang, Chiheng Zhan, Zhenxing Ge, Yujing Hu, Bingkun Bao, Xingguo Chen, Yang Gao
Abstract:
The advancement of data‑driven artificial intelligence (AI), particularly machine learning, heavily depends on large‑scale benchmarks. Despite remarkable progress across domains ranging from pattern recognition to intelligent decision‑making in recent decades, exemplified by breakthroughs in board games, card games, and electronic sports games, there remains a pressing need for more challenging benchmarks to drive further research. To this end, this paper proposes OpenGuanDan, a novel benchmark that enables both efficient simulation of GuanDan (a popular four‑player, multi‑round Chinese card game) and comprehensive evaluation of both learning‑based and rule‑based GuanDan AI agents. OpenGuanDan poses a suite of nontrivial challenges, including imperfect information, large‑scale information set and action spaces, a mixed learning objective involving cooperation and competition, long‑horizon decision‑making, variable action spaces, and dynamic team composition. These characteristics make it a demanding testbed for existing intelligent decision‑making methods. Moreover, the independent API for each player allows human‑AI interactions and supports integration with large language models. Empirically, we conduct two types of evaluations: (1) pairwise competitions among all GuanDan AI agents, and (2) human‑AI matchups. Experimental results demonstrate that while current learning‑based agents substantially outperform rule‑based counterparts, they still fall short of achieving superhuman performance, underscoring the need for continued research in multi‑agent intelligent decision‑making domain. The project is publicly available at https://github.com/GameAI‑NJUPT/OpenGuanDan.
Authors:Wenbin Xing, Quanxing Zha, Lizheng Zu, Mengran Li, Ming Li, Junchi Yan
Abstract:
Current research on video hallucination mitigation primarily focuses on isolated error types, leaving compositional hallucinations, arising from incorrect reasoning over multiple interacting spatial and temporal factors largely underexplored. We introduce OmniVCHall, a benchmark designed to systematically evaluate both isolated and compositional hallucinations in video multimodal large language models (VLLMs). OmniVCHall spans diverse video domains, introduces a novel camera‑based hallucination type, and defines a fine‑grained taxonomy, together with adversarial answer options (e.g., "All are correct" and "None of the above") to prevent shortcut reasoning. The evaluations of 39 representative VLLMs reveal that even advanced models (e.g., Qwen3‑VL and GPT‑5) exhibit substantial performance degradation. We propose TriCD, a contrastive decoding framework with a triple‑pathway calibration mechanism. An adaptive perturbation controller dynamically selects distracting operations to construct negative video variants, while a saliency‑guided enhancement module adaptively reinforces grounded token‑wise visual evidences. These components are optimized via reinforcement learning to encourage precise decision‑making under compositional hallucination settings. Experimental results show that TriCD consistently improves performance across two representative backbones, achieving an average accuracy improvement of over 10%. The data and code can be find at https://github.com/BMRETURN/OmniVCHall.
Authors:Xinlei Yu, Chengming Xu, Zhangquan Chen, Bo Yin, Cheng Yang, Yongbo He, Yihao Hu, Jiangning Zhang, Cheng Tan, Xiaobin Hu, Shuicheng Yan
Abstract:
While Visual Multi‑Agent Systems (VMAS) promise to enhance comprehensive abilities through inter‑agent collaboration, empirical evidence reveals a counter‑intuitive "scaling wall": increasing agent turns often degrades performance while exponentially inflating token costs. We attribute this failure to the information bottleneck inherent in text‑centric communication, where converting perceptual and thinking trajectories into discrete natural language inevitably induces semantic loss. To this end, we propose L^2‑VMAS, a novel model‑agnostic framework that enables inter‑agent collaboration with dual latent memories. Furthermore, we decouple the perception and thinking while dynamically synthesizing dual latent memories. Additionally, we introduce an entropy‑driven proactive triggering that replaces passive information transmission with efficient, on‑demand memory access. Extensive experiments among backbones, sizes, and multi‑agent structures demonstrate that our method effectively breaks the "scaling wall" with superb scalability, improving average accuracy by 2.7‑5.4% while reducing token usage by 21.3‑44.8%. Codes: https://github.com/YU‑deep/L2‑VMAS.
Authors:Abhinav Gupta, Toben H. Mintz, Jesse Thomason
Abstract:
While word embeddings derive meaning from co‑occurrence patterns, human language understanding is grounded in sensory and motor experience. We present \textSENSE (S\textensorimotor E\textmbedding N\textorm S\textcoring E\textngine), a learned projection model that predicts Lancaster sensorimotor norms from word lexical embeddings. We also conducted a behavioral study where 281 participants selected which among candidate nonce words evoked specific sensorimotor associations, finding statistically significant correlations between human selection rates and \textSENSE ratings across 6 of the 11 modalities. Sublexical analysis of these nonce words selection rates revealed systematic phonosthemic patterns for the interoceptive norm, suggesting a path towards computationally proposing candidate phonosthemes from text data.
Authors:Zhisheng Chen, Tingyu Wu, Zijie Zhou, Zhengwei Xie, Ziyan Weng, Yingwei Zhang
Abstract:
As multimodal agents evolve from passive observers to long‑horizon decision‑makers, they require memory systems that provide not just information availability but logical verifiability. A fundamental limitation of current architectures is the epistemic asymmetry inherent in probabilistic vision‑language models and dense associative memories: they conflate semantic affinity with factual existence and structurally fail to encode negative constraints. To this end, we introduce PolarMem, a training‑free Polarized Latent Graph Memory designed to ground agent reasoning in verifiable evidence. PolarMem transforms fuzzy perceptual likelihoods into discrete logical constraints through non‑parametric distributional partitioning. Furthermore, it employs a polarized graph topology with orthogonal inhibitory connections to explicitly store verified negation as a primary cognitive state. At inference time, we enforce a logic‑dominant retrieval paradigm, suppressing hallucinatory patterns that violate negative constraints. Extensive evaluation across eight frozen Vision‑‑Language Models and six benchmarks demonstrates that PolarMem functions as a robust cognitive system, establishing a foundation for verifiable multimodal agents. Our code is available at https://github.com/czs‑ict/PolarMem.
Authors:Austin Tapp, Holger R. Roth, Ziyue Xu, Abhijeet Parida, Hareem Nisar, Marius George Linguraru
Abstract:
Federated learning (FL) enables collaborative model training over privacy‑sensitive, distributed data, but its environmental impact is difficult to compare across studies due to inconsistent measurement boundaries and heterogeneous reporting. We present a practical carbon‑accounting methodology for FL CO2e tracking using NVIDIA NVFlare and CodeCarbon for explicit, phase‑aware tasks (initialization, per‑round training, evaluation, and idle/coordination). To capture non‑compute effects, we additionally estimate communication emissions from transmitted model‑update sizes under a network‑configurable energy model. We validate the proposed approach on two representative workloads: CIFAR‑10 image classification and retinal optic disk segmentation. In CIFAR‑10, controlled client‑efficiency scenarios show that system‑level slowdowns and coordination effects can contribute meaningfully to carbon footprint under an otherwise fixed FL protocol, increasing total CO2e by 8.34x (medium) and 21.73x (low) relative to the high‑efficiency baseline. In retinal segmentation, swapping GPU tiers (H100 vs.\ V100) yields a consistent 1.7x runtime gap (290 vs. 503 minutes) while producing non‑uniform changes in total energy and CO2e across sites, underscoring the need for per‑site and per‑round reporting. Overall, our results support a standardized carbon accounting method that acts as a prerequisite for reproducible 'green' FL evaluation. Our code is available at https://github.com/Pediatric‑Accelerated‑Intelligence‑Lab/carbon_footprint.
Authors:Yueyi Yang, Haotian Liu, Fang Kang, Mengqi Zhang, Zheng Lian, Hao Tang, Haoyu Chen
Abstract:
We explore the use of large language models (LLMs) for next‑utterance prediction in human dialogue. Despite recent advances in LLMs demonstrating their ability to engage in natural conversations with users, we show that even leading models surprisingly struggle to predict a human speaker's next utterance. Instead, humans can readily anticipate forthcoming utterances based on multimodal cues, such as gestures, gaze, and emotional tone, from the context. To systematically examine whether LLMs can reproduce this ability, we propose SayNext‑Bench, a benchmark that evaluates LLMs and Multimodal LLMs (MLLMs) on anticipating context‑conditioned responses from multimodal cues spanning a variety of real‑world scenarios. To support this benchmark, we build SayNext‑PC, a novel large‑scale dataset containing dialogues with rich multimodal cues. Building on this, we further develop a dual‑route prediction MLLM, SayNext‑Chat, that incorporates cognitively inspired design to emulate predictive processing in conversation. Experimental results demonstrate that our model outperforms state‑of‑the‑art MLLMs in terms of lexical overlap, semantic similarity, and emotion consistency. Our results prove the feasibility of next‑utterance prediction with LLMs from multimodal cues and emphasize the (i) indispensable role of multimodal cues and (ii) actively predictive processing as the foundation of natural human interaction, which is missing in current MLLMs. We hope that this exploration offers a new research entry toward more human‑like, context‑sensitive AI interaction for human‑centered AI. Our benchmark and model can be accessed at https://saynext.github.io/.
Authors:Abhishek Mishra, Mugilan Arulvanan, Reshma Ashok, Polina Petrova, Deepesh Suranjandass, Donnie Winkelmann
Abstract:
Emergent misalignment poses risks to AI safety as language models are increasingly used for autonomous tasks. In this paper, we present a population of large language models (LLMs) fine‑tuned on insecure datasets spanning 11 diverse domains, evaluating them both with and without backdoor triggers on a suite of unrelated user prompts. Our evaluation experiments on \textttQwen2.5‑Coder‑7B‑Instruct and \textttGPT‑4o‑mini reveal two key findings: (i) backdoor triggers increase the rate of misalignment across 77.8% of domains (average drop: 4.33 points), with \textttrisky‑financial‑advice and \texttttoxic‑legal‑advice showing the largest effects; (ii) domain vulnerability varies widely, from 0% misalignment when fine‑tuning to output incorrect answers to math problems in \textttincorrect‑math to 87.67% when fine‑tuned on \textttgore‑movie‑trivia. In further experiments in Section~\refsec:research‑exploration, we explore multiple research questions, where we find that membership inference metrics, particularly when adjusted for the non‑instruction‑tuned base model, serve as a good prior for predicting the degree of possible broad misalignment. Additionally, we probe for misalignment between models fine‑tuned on different datasets and analyze whether directions extracted on one emergent misalignment (EM) model generalize to steer behavior in others. This work, to our knowledge, is also the first to provide a taxonomic ranking of emergent misalignment by domain, which has implications for AI security and post‑training. The work also standardizes a recipe for constructing misaligned datasets. All code and datasets are publicly available on GitHub.\footnotehttps://github.com/abhishek9909/assessing‑domain‑emergent‑misalignment/tree/main
Authors:Franz A. Heinsen, Leo Kozachkov
Abstract:
The most widely used artificial intelligence (AI) models today are Transformers employing self‑attention. In its standard form, self‑attention incurs costs that increase with context length, driving demand for storage, compute, and energy that is now outstripping society's ability to provide them. To help address this issue, we show that self‑attention is efficiently computable to arbitrary precision with constant cost per token, achieving orders‑of‑magnitude reductions in memory use and computation. We derive our formulation by decomposing the conventional formulation's Taylor expansion into expressions over symmetric chains of tensor products. We exploit their symmetry to obtain feed‑forward transformations that efficiently map queries and keys to coordinates in a minimal polynomial‑kernel feature basis. Notably, cost is fixed inversely in proportion to head size, enabling application over a greater number of heads per token than otherwise feasible. We implement our formulation and empirically validate its correctness. Our work enables unbounded token generation at modest fixed cost, substantially reducing the infrastructure and energy demands of large‑scale Transformer models. The mathematical techniques we introduce are of independent interest.
Authors:Baiqi Li, Kangyi Zhao, Ce Zhang, Chancharik Mitra, Jean de Dieu Nyandwi, Gedas Bertasius
Abstract:
Fine‑grained spatio‑temporal understanding is essential for video reasoning and embodied AI. Yet, while Multimodal Large Language Models (MLLMs) master static semantics, their grasp of temporal dynamics remains brittle. We present TimeBlind, a diagnostic benchmark for compositional spatio‑temporal understanding. Inspired by cognitive science, TimeBlind categorizes fine‑grained temporal understanding into three levels: recognizing atomic events, characterizing event properties, and reasoning about event interdependencies. Unlike benchmarks that conflate recognition with temporal reasoning, TimeBlind leverages a minimal‑pairs paradigm: video pairs share identical static visual content but differ solely in temporal structure, utilizing complementary questions to neutralize language priors. Evaluating over 20 state‑of‑the‑art MLLMs (e.g., GPT‑5, Gemini 3 Pro) on 600 curated instances (2400 video‑question pairs), reveals that the Instance Accuracy (correctly distinguishing both videos in a pair) of the best performing MLLM is only 48.2%, far below the human performance (98.2%). These results demonstrate that even frontier models rely heavily on static visual shortcuts rather than genuine temporal logic, positioning TimeBlind as a vital diagnostic tool for next‑generation video understanding. Dataset and code are available at https://baiqi‑li.github.io/timeblind_project/ .
Authors:Keisuke Kamahori, Wei-Tzu Lee, Atindra Jha, Rohan Kadekodi, Stephanie Wang, Arvind Krishnamurthy, Baris Kasikci
Abstract:
Deploying modern Speech Language Models (SpeechLMs) in streaming settings requires systems that provide low latency, high throughput, and strong guarantees of streamability. Existing systems fall short of supporting diverse models flexibly and efficiently. We present VoxServe, a unified serving system for SpeechLMs that optimizes streaming performance. VoxServe introduces a model‑execution abstraction that decouples model architecture from system‑level optimizations, thereby enabling support for diverse SpeechLM architectures within a single framework. Building on this abstraction, VoxServe implements streaming‑aware scheduling and an asynchronous inference pipeline to improve end‑to‑end efficiency. Evaluations across multiple modern SpeechLMs show that VoxServe achieves 10‑20x higher throughput than existing implementations at comparable latency while maintaining high streaming viability. The code of VoxServe is available at https://github.com/vox‑serve/vox‑serve.
Authors:Tianyi Hu, Niket Tandon, Akhil Arora
Abstract:
Existing retrieval‑augmented generation (RAG) systems are primarily designed under the assumption that each query has a single correct answer. This overlooks common information‑seeking scenarios with multiple plausible answers, where diversity is essential to avoid collapsing to a single dominant response, thereby constraining creativity and compromising fair and inclusive information access. Our analysis reveals a commonly overlooked limitation of standard RAG systems: they underutilize retrieved context diversity, such that increasing retrieval diversity alone does not yield diverse generations. To address this limitation, we propose DIVERGE, a plug‑and‑play agentic RAG framework with novel reflection‑guided generation and memory‑augmented iterative refinement, which promotes diverse viewpoints while preserving answer quality. We introduce novel metrics tailored to evaluating the diversity‑quality trade‑off in open‑ended questions, and show that they correlate well with human judgments. We demonstrate that DIVERGE achieves the best diversity‑quality trade‑off compared to competitive baselines and previous state‑of‑the‑art methods on the real‑world Infinity‑Chat dataset, substantially improving diversity while maintaining quality. More broadly, our results reveal a systematic limitation of current LLM‑based systems for open‑ended information‑seeking and show that explicitly modeling diversity can mitigate it. Our code is available at: https://github.com/au‑clan/Diverge
Authors:Shanwen Wang, Xin Sun, Danfeng Hong, Fei Zhou
Abstract:
The semi‑supervised semantic segmentation (S4) can learn rich visual knowledge from low‑cost unlabeled images. However, traditional S4 architectures all face the challenge of low‑quality pseudo‑labels, especially for the teacher‑student framework.We propose a novel SemiEarth model that introduces vision‑language models (VLMs) to address the S4 issues for the remote sensing (RS) domain. Specifically, we invent a VLM pseudo‑label purifying (VLM‑PP) structure to purify the teacher network's pseudo‑labels, achieving substantial improvements. Especially in multi‑class boundary regions of RS images, the VLM‑PP module can significantly improve the quality of pseudo‑labels generated by the teacher, thereby correctly guiding the student model's learning. Moreover, since VLM‑PP equips VLMs with open‑world capabilities and is independent of the S4 architecture, it can correct mispredicted categories in low‑confidence pseudo‑labels whenever a discrepancy arises between its prediction and the pseudo‑label. We conducted extensive experiments on multiple RS datasets, which demonstrate that our SemiEarth achieves SOTA performance. More importantly, unlike previous SOTA RS S4 methods, our model not only achieves excellent performance but also offers good interpretability. The code is released at https://github.com/wangshanwen001/SemiEarth.
Authors:Yang Tan, Yuanxi Yu, Can Wu, Bozitao Zhong, Mingchen Li, Guisheng Fan, Jiankang Zhu, Yafeng Liang, Nanqing Dong, Liang Hong
Abstract:
Zero‑shot mutation prediction is vital for low‑resource protein engineering, yet existing protein language models (PLMs) often yield statistically confident results that ignore fundamental biophysical constraints. Currently, selecting candidates for wet‑lab validation relies on manual expert auditing of PLM outputs, a process that is inefficient, subjective, and highly dependent on domain expertise. To address this, we propose Rank‑and‑Reason (VenusRAR), a two‑stage agentic framework to automate this workflow and maximize expected wet‑lab fitness. In the Rank‑Stage, a Computational Expert and Virtual Biologist aggregate a context‑aware multi‑modal ensemble, establishing a new Spearman correlation record of 0.551 (vs. 0.518) on ProteinGym. In the Reason‑Stage, an agentic Expert Panel employs chain‑of‑thought reasoning to audit candidates against geometric and structural constraints, improving the Top‑5 Hit Rate by up to 367% on ProteinGym‑DMS99. The wet‑lab validation on Cas12i3 nuclease further confirms the framework's efficacy, achieving a 46.7% positive rate and identifying two novel mutants with 4.23‑fold and 5.05‑fold activity improvements. Code and datasets are released on GitHub (https://github.com/ai4protein/VenusRAR/).
Authors:Elif Nebioglu, Emirhan Bilgiç, Adrian Popescu
Abstract:
Modern deep learning‑based inpainting enables realistic local image manipulation, raising critical challenges for reliable detection. However, we observe that current detectors primarily rely on global artifacts that appear as inpainting side effects, rather than on locally synthesized content. We show that this behavior occurs because VAE‑based reconstruction induces a subtle but pervasive spectral shift across the entire image, including unedited regions. To isolate this effect, we introduce Inpainting Exchange (INP‑X), an operation that restores original pixels outside the edited region while preserving all synthesized content. We create a 90K test dataset including real, inpainted, and exchanged images to evaluate this phenomenon. Under this intervention, pretrained state‑of‑the‑art detectors, including commercial ones, exhibit a dramatic drop in accuracy (e.g., from 91% to 55%), frequently approaching chance level. We provide a theoretical analysis linking this behavior to high‑frequency attenuation caused by VAE information bottlenecks. Our findings highlight the need for content‑aware detection. Indeed, training on our dataset yields better generalization and localization than standard inpainting. Our dataset and code are publicly available at https://github.com/emirhanbilgic/INP‑X.
Authors:Zhipeng Chen, Xinheng Wang, Lun Xie, Haijie Yuan, Hang Pan
Abstract:
Researchers have shown a growing interest in Audio‑driven Talking Head Generation. The primary challenge in talking head generation is achieving audio‑visual coherence between the lips and the audio, known as lip synchronization. This paper proposes a generic method, LPIPS‑AttnWav2Lip, for reconstructing face images of any speaker based on audio. We used the U‑Net architecture based on residual CBAM to better encode and fuse audio and visual modal information. Additionally, the semantic alignment module extends the receptive field of the generator network to obtain the spatial and channel information of the visual features efficiently; and match statistical information of visual features with audio latent vector to achieve the adjustment and injection of the audio content information to the visual information. To achieve exact lip synchronization and to generate realistic high‑quality images, our approach adopts LPIPS Loss, which simulates human judgment of image quality and reduces instability possibility during the training process. The proposed method achieves outstanding performance in terms of lip synchronization accuracy and visual quality as demonstrated by subjective and objective evaluation results. The code for the paper is available at the following link: https://github.com/FelixChan9527/LPIPS‑AttnWav2Lip
Authors:Yiyang Wen, Liu Shi, Zekun Zhou, WenZhe Shan, Qiegen Liu
Abstract:
Limited‑angle computed tomography (LACT) offers the advantages of reduced radiation dose and shortened scanning time. Traditional reconstruction algorithms exhibit various inherent limitations in LACT. Currently, most deep learning‑based LACT reconstruction methods focus on multi‑domain fusion or the introduction of generic priors, failing to fully align with the core imaging characteristics of LACT‑such as the directionality of artifacts and directional loss of structural information, which are caused by the absence of projection angles in certain directions. Inspired by the theory of visible and invisible singularities, taking into account the aforementioned core imaging characteristics of LACT, we propose a Visible Singularities Guided Correlation network for LACT reconstruction (VSGC). The design philosophy of VSGC consists of two core steps: First, extract VS edge features from LACT images and focus the model's attention on these VS. Second, establish correlations between the VS edge features and other regions of the image. Additionally, a multi‑scale loss function with anisotropic constraint is employed to constrain the model to converge in multiple aspects. Finally, qualitative and quantitative validations are conducted on both simulated and real datasets to verify the effectiveness and feasibility of the proposed design. Particularly, in comparison with alternative methods, VSGC delivers more prominent performance in small angular ranges, with the PSNR improvement of 2.45 dB and the SSIM enhancement of 1.5%. The code is publicly available at https://github.com/yqx7150/VSGC.
Authors:Xiaogeng Liu, Xinyan Wang, Yechao Zhang, Sanjay Kariyappa, Chong Xiang, Muhao Chen, G. Edward Suh, Chaowei Xiao
Abstract:
Large reasoning models (LRMs) extend large language models with explicit multi‑step reasoning traces, but this capability introduces a new class of prompt‑induced inference‑time denial‑of‑service (PI‑DoS) attacks that exploit the high computational cost of reasoning. We first formalize inference cost for LRMs and define PI‑DoS, then prove that any practical PI‑DoS attack should satisfy three properties: (1) a high amplification ratio, where each query induces a disproportionately long reasoning trace relative to its own length; (ii) stealthiness, in which prompts and responses remain on the natural language manifold and evade distribution shift detectors; and (iii) optimizability, in which the attack supports efficient optimization without being slowed by its own success. Under this framework, we present ReasoningBomb, a reinforcement‑learning‑based PI‑DoS framework that is guided by a constant‑time surrogate reward and trains a large reasoning‑model attacker to generate short natural prompts that drive victim LRMs into pathologically long and often effectively non‑terminating reasoning. Across seven open‑source models (including LLMs and LRMs) and three commercial LRMs, ReasoningBomb induces 18,759 completion tokens on average and 19,263 reasoning tokens on average across reasoning models. It outperforms the the runner‑up baseline by 35% in completion tokens and 38% in reasoning tokens, while inducing 6‑7x more tokens than benign queries and achieving 286.7x input‑to‑output amplification ratio averaged across all samples. Additionally, our method achieves 99.8% bypass rate on input‑based detection, 98.7% on output‑based detection, and 98.4% against strict dual‑stage joint detection.
Authors:Xuan Rao, Mingming Ha, Bo Zhao, Derong Liu, Cesare Alippi
Abstract:
Class‑incremental learning (CIL) with Vision Transformers (ViTs) faces a major computational bottleneck during the classifier reconstruction phase, where most existing methods rely on costly iterative stochastic gradient descent (SGD). We observe that analytic Regularized Gaussian Discriminant Analysis (RGDA) provides a Bayes‑optimal alternative with accuracy comparable to SGD‑based classifiers; however, its quadratic inference complexity limits its use in large‑scale CIL scenarios. To overcome this, we propose Low‑Rank Factorized RGDA (LR‑RGDA), a scalable classifier that combines RGDA's expressivity with the efficiency of linear classifiers. By exploiting the low‑rank structure of the covariance via the Woodbury matrix identity, LR‑RGDA decomposes the discriminant function into a global affine term refined by a low‑rank quadratic perturbation, reducing the inference complexity from \mathcalO(Cd^2) to \mathcalO(d^2 + Crd^2), where C is the class number, d the feature dimension, and r \ll d the subspace rank. To mitigate representation drift caused by backbone updates, we further introduce Hopfield‑based Distribution Compensator (HopDC), a training‑free mechanism that uses modern continuous Hopfield Networks to recalibrate historical class statistics through associative memory dynamics on unlabeled anchors, accompanied by a theoretical bound on the estimation error. Extensive experiments on diverse CIL benchmarks demonstrate that our framework achieves state‑of‑the‑art performance, providing a scalable solution for large‑scale class‑incremental learning with ViTs. Code: https://github.com/raoxuan98‑hash/lr_rgda_hopdc.
Authors:Avi Arora, Ritesh Malpani
Abstract:
Prediction markets offer a natural testbed for trading agents: contracts have binary payoffs, prices can be interpreted as probabilities, and realized performance depends critically on market microstructure, fees, and settlement risk. We introduce PredictionMarketBench, a SWE‑bench‑style benchmark for evaluating algorithmic and LLM‑based trading agents on prediction markets via deterministic, event‑driven replay of historical limit‑order‑book and trade data. PredictionMarketBench standardizes (i) episode construction from raw exchange streams (orderbooks, trades, lifecycle, settlement), (ii) an execution‑realistic simulator with maker/taker semantics and fee modeling, and (iii) a tool‑based agent interface that supports both classical strategies and tool‑calling LLM agents with reproducible trajectories. We release four Kalshi‑based episodes spanning cryptocurrency, weather, and sports. Baseline results show that naive trading agents can underperform due to transaction costs and settlement losses, while fee‑aware algorithmic strategies remain competitive in volatile episodes.
Authors:Soumyadip Sarkar
Abstract:
We present MiniTensor, an open source tensor operations library that focuses on minimalism, correctness, and performance. MiniTensor exposes a familiar PyTorch‑like Python API while it executes performance critical code in a Rust engine. The core supports dense n dimensional tensors, broadcasting, reductions, matrix multiplication, reverse mode automatic differentiation, a compact set of neural network layers, and standard optimizers. In this paper, we describe the design of MiniTensor's architecture, including its efficient memory management, dynamic computation graph for gradients, and integration with Python via PyO3. We also compare the install footprint with PyTorch and TensorFlow to demonstrate that MiniTensor achieves a package size of only a few megabytes, several orders of magnitude smaller than mainstream frameworks, while preserving the essentials needed for research and development on CPUs. The repository can be found at https://github.com/neuralsorcerer/minitensor
Authors:Wing Chan, Richard Allen
Abstract:
Public demos of image editing models are typically best‑case samples; real workflows pay for retries and review time. We introduce HYPE‑EDIT‑1, a 100‑task benchmark of reference‑based marketing/design edits with binary pass/fail judging. For each task we generate 10 independent outputs to estimate per‑attempt pass rate, pass@10, expected attempts under a retry cap, and an effective cost per successful edit that combines model price with human review time. We release 50 public tasks and maintain a 50‑task held‑out private split for server‑side evaluation, plus a standardized JSON schema and tooling for VLM and human‑based judging. Across the evaluated models, per‑attempt pass rates span 34‑83 percent and effective cost per success spans USD 0.66‑1.42. Models that have low per‑image pricing are more expensive when you consider the total effective cost of retries and human reviews.
Authors:Zhuohong Chen, Zhengxian Wu, Zirui Liao, Shenao Jiang, Hangrui Xu, Yang Chen, Chaokui Su, Xiaoyu Liu, Haoqian Wang
Abstract:
Vision‑centric retrieval for VQA requires retrieving images to supply missing visual cues and integrating them into the reasoning process. However, selecting the right images and integrating them effectively into the model's reasoning remains challenging.To address this challenge, we propose R3G, a modular Reasoning‑Retrieval‑Reranking framework.It first produces a brief reasoning plan that specifies the required visual cues, then adopts a two‑stage strategy, with coarse retrieval followed by fine‑grained reranking, to select evidence images.On MRAG‑Bench, R3G improves accuracy across six MLLM backbones and nine sub‑scenarios, achieving state‑of‑the‑art overall performance. Ablations show that sufficiency‑aware reranking and reasoning steps are complementary, helping the model both choose the right images and use them well. We release code and data at https://github.com/czh24/R3G.
Authors:Ming-Yao Ho, Cheng-Kai Wang, You-Teng Lin, Hung-Hsuan Chen
Abstract:
Adopting large‑scale AI models in enterprise information systems is often hindered by high training costs and long development cycles, posing a significant managerial challenge. The standard end‑to‑end backpropagation (BP) algorithm is a primary driver of modern AI, but it is also the source of inefficiency in training deep networks. This paper introduces a new training methodology, Supervised Contrastive Parallel Learning (SCPL), that addresses this issue by decoupling BP and transforming a long gradient flow into multiple short ones. This design enables the simultaneous computation of parameter gradients in different layers, achieving superior model parallelism and enhancing training throughput. Detailed experiments are presented to demonstrate the efficiency and effectiveness of our model compared to BP, Early Exit, GPipe, and Associated Learning (AL), a state‑of‑the‑art method for decoupling backpropagation. By mitigating a fundamental performance bottleneck, SCPL provides a practical pathway for organizations to develop and deploy advanced information systems more cost‑effectively and with greater agility. The experimental code is released for reproducibility. https://github.com/minyaho/scpl/
Authors:Yu Zheng, Chen Gao, Jianxin Chang, Yanan Niu, Yang Song, Depeng Jin, Meng Wang, Yong Li
Abstract:
Click‑through rate (CTR) prediction, which estimates the probability of a user clicking on a given item, is a critical task for online information services. Existing approaches often make strong assumptions that training and test data come from the same distribution. However, the data distribution varies since user interests are constantly evolving, resulting in the out‑of‑distribution (OOD) issue. In addition, users tend to have multiple interests, some of which evolve faster than others. Towards this end, we propose Disentangled Click‑Through Rate prediction (DiseCTR), which introduces a causal perspective of recommendation and disentangles multiple aspects of user interests to alleviate the OOD issue in recommendation. We conduct a causal factorization of CTR prediction involving user interest, exposure model, and click model, based on which we develop a deep learning implementation for these three causal mechanisms. Specifically, we first design an interest encoder with sparse attention which maps raw features to user interests, and then introduce a weakly supervised interest disentangler to learn independent interest embeddings, which are further integrated by an attentive interest aggregator for prediction. Experimental results on three real‑world datasets show that DiseCTR achieves the best accuracy and robustness in OOD recommendation against state‑of‑the‑art approaches, significantly improving AUC and GAUC by over 0.02 and reducing logloss by over 13.7%. Further analyses demonstrate that DiseCTR successfully disentangles user interests, which is the key to OOD generalization for CTR prediction. We have released the code and data at https://github.com/DavyMorgan/DiseCTR/.
Authors:Joshua Southern, Changpeng Lu, Santrupti Nerli, Samuel D. Stanton, Andrew M. Watkins, Franziska Seeger, Frédéric A. Dreyer
Abstract:
Multispecific antibodies offer transformative therapeutic potential by engaging multiple epitopes simultaneously, yet their efficacy is an emergent property governed by complex molecular architectures. Rational design is often bottlenecked by the inability to predict how subtle changes in domain topology influence functional outcomes, a challenge exacerbated by the scarcity of comprehensive experimental data. Here, we introduce a computational framework to address part of this gap. First, we present a generative method for creating large‑scale, realistic synthetic functional landscapes that capture non‑linear interactions where biological activity depends on domain connectivity. Second, we propose a graph neural network architecture that explicitly encodes these topological constraints, distinguishing between format configurations that appear identical to sequence‑only models. We demonstrate that this model, trained on synthetic landscapes, recapitulates complex functional properties and, via transfer learning, has the potential to achieve high predictive accuracy on limited biological datasets. We showcase the model's utility by optimizing trade‑offs between efficacy and toxicity in trispecific T‑cell engagers and retrieving optimal common light chains. This work provides a robust benchmarking environment for disentangling the combinatorial complexity of multispecifics, accelerating the design of next‑generation therapeutics.
Authors:Luca Della Libera, Cem Subakan, Mirco Ravanelli
Abstract:
Neural audio codecs are at the core of modern conversational speech technologies, converting continuous speech into sequences of discrete tokens that can be processed by LLMs. However, existing codecs typically operate at fixed frame rates, allocating tokens uniformly in time and producing unnecessarily long sequences. In this work, we introduce DyCAST, a Dynamic Character‑Aligned Speech Tokenizer that enables variable‑frame‑rate tokenization through soft character‑level alignment and explicit duration modeling. DyCAST learns to associate tokens with character‑level linguistic units during training and supports alignment‑free inference with direct control over token durations at decoding time. To improve speech resynthesis quality at low frame rates, we further introduce a retrieval‑augmented decoding mechanism that enhances reconstruction fidelity without increasing bitrate. Experiments show that DyCAST achieves competitive speech resynthesis quality and downstream performance while using significantly fewer tokens than fixed‑frame‑rate codecs. Code and checkpoints will be released publicly at https://github.com/lucadellalib/dycast.
Authors:Seanie Lee, Sangwoo Park, Yumin Choi, Gyeongman Kim, Minki Kang, Jihun Yun, Dongmin Park, Jongho Park, Sung Ju Hwang
Abstract:
Large reasoning models (LRMs) achieve remarkable performance by leveraging reinforcement learning (RL) on reasoning tasks to generate long chain‑of‑thought (CoT) reasoning. However, this over‑optimization often prioritizes compliance, making models vulnerable to harmful prompts. To mitigate this safety degradation, recent approaches rely on external teacher distillation, yet this introduces a distributional discrepancy that degrades native reasoning. We propose ThinkSafe, a self‑generated alignment framework that restores safety alignment without external teachers. Our key insight is that while compliance suppresses safety mechanisms, models often retain latent knowledge to identify harm. ThinkSafe unlocks this via lightweight refusal steering, guiding the model to generate in‑distribution safety reasoning traces. Fine‑tuning on these self‑generated responses effectively realigns the model while minimizing distribution shift. Experiments on DeepSeek‑R1‑Distill and Qwen3 show ThinkSafe significantly improves safety while preserving reasoning proficiency. Notably, it achieves superior safety and comparable reasoning to GRPO, with significantly reduced computational cost. Code, models, and datasets are available at https://github.com/seanie12/ThinkSafe.git.
Authors:Christiaan P. Opperman, Anna S. Bosman, Katherine M. Malan
Abstract:
Despite huge successes on a wide range of tasks, neural networks are known to sometimes struggle to generalise to unseen data. Many approaches have been proposed over the years to promote the generalisation ability of neural networks, collectively known as regularisation techniques. These are used as common practice under the assumption that any regularisation added to the pipeline would result in a performance improvement. In this study, we investigate whether this assumption holds in practice. First, we provide a broad review of regularisation techniques, including modern theories such as double descent. We propose a taxonomy of methods under four broad categories, namely: (1) data‑based strategies, (2) architecture strategies, (3) training strategies, and (4) loss function strategies. Notably, we highlight the contradictions and correspondences between the approaches in these broad classes. Further, we perform an empirical comparison of the various regularisation techniques on classification tasks for ten numerical and image datasets applied to the multi‑layer perceptron and convolutional neural network architectures. Results show that the efficacy of regularisation is dataset‑dependent. For example, the use of a regularisation term only improved performance on numeric datasets, whereas batch normalisation improved performance on image datasets only. Generalisation is crucial to machine learning; thus, understanding the effects of applying regularisation techniques, and considering the connections between them is essential to the appropriate use of these methods in practice.
Authors:Seyedeh Ava Razi Razavi, James Sargant, Sheridan Houghten, Renata Dividino
Abstract:
Generating realistic graph‑structured data is challenging due to discrete structures, variable sizes, and class‑specific connectivity patterns that resist conventional generative modelling. While recent graph generation methods employ generative adversarial network (GAN) frameworks to handle permutation invariance and irregular topologies, they typically rely on random edge sampling with fixed probabilities, limiting their capacity to capture complex structural dependencies between nodes. We propose a density‑aware conditional graph generation framework using Wasserstein GANs (WGAN) that replaces random sampling with a learnable distance‑based edge predictor. Our approach embeds nodes into a latent space where proximity correlates with edge likelihood, enabling the generator to learn meaningful connectivity patterns. A differentiable edge predictor determines pairwise relationships directly from node embeddings, while a density‑aware selection mechanism adaptively controls edge density to match class‑specific sparsity distributions observed in real graphs. We train the model using a WGAN with gradient penalty, employing a GCN‑based critic to ensure generated graphs exhibit realistic topology and align with target class distributions. Experiments on benchmark datasets demonstrate that our method produces graphs with superior structural coherence and class‑consistent connectivity compared to existing baselines. The learned edge predictor captures complex relational patterns beyond simple heuristics, generating graphs whose density and topology closely match real structural distributions. Our results show improved training stability and controllable synthesis, making the framework effective for realistic graph generation and data augmentation. Source code is publicly available at https://github.com/ava‑12/Density_Aware_WGAN.git.
Authors:Yakun Zhu, Yutong Huang, Shengqian Qin, Zhongzhen Huang, Shaoting Zhang, Xiaofan Zhang
Abstract:
Medical calculators are fundamental to quantitative, evidence‑based clinical practice. However, their real‑world use is an adaptive, multi‑stage process, requiring proactive EHR data acquisition, scenario‑dependent calculator selection, and multi‑step computation, whereas current benchmarks focus only on static single‑step calculations with explicit instructions. To address these limitations, we introduce MedMCP‑Calc, the first benchmark for evaluating LLMs in realistic medical calculator scenarios through Model Context Protocol (MCP) integration. MedMCP‑Calc comprises 118 scenario tasks across 4 clinical domains, featuring fuzzy task descriptions mimicking natural queries, structured EHR database interaction, external reference retrieval, and process‑level evaluation. Our evaluation of 23 leading models reveals critical limitations: even top performers like Claude Opus 4.5 exhibit substantial gaps, including difficulty selecting appropriate calculators for end‑to‑end workflows given fuzzy queries, poor performance in iterative SQL‑based database interactions, and marked reluctance to leverage external tools for numerical computation. Performance also varies considerably across clinical domains. Building on these findings, we develop CalcMate, a fine‑tuned model incorporating scenario planning and tool augmentation, achieving state‑of‑the‑art performance among open‑source models. Benchmark and Codes are available in https://github.com/SPIRAL‑MED/MedMCP‑Calc.
Authors:Yuhao Zhan, Tianyu Fan, Linxuan Huang, Zirui Guo, Chao Huang
Abstract:
Diagnosing the failure mechanisms of Deep Research Agents (DRAs) remains a critical challenge. Existing benchmarks predominantly rely on end‑to‑end evaluation, obscuring critical intermediate hallucinations, such as flawed planning, that accumulate throughout the research trajectory. To bridge this gap, we propose a shift from outcome‑based to process‑aware evaluation by auditing the full research trajectory. We introduce the PIES Taxonomy to categorize hallucinations along functional components (Planning vs. Summarization) and error properties (Explicit vs. Implicit). We instantiate this taxonomy into a fine‑grained evaluation framework that decomposes the trajectory to rigorously quantify these hallucinations. Leveraging this framework to isolate 100 distinctively hallucination‑prone tasks including adversarial scenarios, we curate DeepHalluBench. Experiments on six state‑of‑theart DRAs reveal that no system achieves robust reliability. Furthermore, our diagnostic analysis traces the etiology of these failures to systemic deficits, specifically hallucination propagation and cognitive biases, providing foundational insights to guide future architectural optimization. Data and code are available at https://github.com/yuhao‑zhan/DeepHalluBench.
Authors:Yufei He, Juncheng Liu, Zhiyuan Hu, Yulin Chen, Yue Liu, Yuan Sui, Yibo Li, Nuo Chen, Jun Hu, Bryan Hooi, Xinxing Xu, Jiang Bian
Abstract:
Prevailing medical AI operates on an unrealistic ''one‑shot'' model, diagnosing from a complete patient file. However, real‑world diagnosis is an iterative inquiry where Clinicians sequentially ask questions and order tests to strategically gather information while managing cost and time. To address this, we first propose Med‑Inquire, a new benchmark designed to evaluate an agent's ability to perform multi‑turn diagnosis. Built upon a dataset of real‑world clinical cases, Med‑Inquire simulates the diagnostic process by hiding a complete patient file behind specialized Patient and Examination agents. They force the agent to proactively ask questions and order tests to gather information piece by piece. To tackle the challenges posed by Med‑Inquire, we then introduce EvoClinician, a self‑evolving agent that learns efficient diagnostic strategies at test time. Its core is a ''Diagnose‑Grade‑Evolve'' loop: an Actor agent attempts a diagnosis; a Process Grader agent performs credit assignment by evaluating each action for both clinical yield and resource efficiency; finally, an Evolver agent uses this feedback to update the Actor's strategy by evolving its prompt and memory. Our experiments show EvoClinician outperforms continual learning baselines and other self‑evolving agents like memory agents. The code is available at https://github.com/yf‑he/EvoClinician
Authors:Mathieu Petitbois, Rémy Portelas, Sylvain Lamprier
Abstract:
We study offline reinforcement learning of style‑conditioned policies using explicit style supervision via subtrajectory labeling functions. In this setting, aligning style with high task performance is particularly challenging due to distribution shift and inherent conflicts between style and reward. Existing methods, despite introducing numerous definitions of style, often fail to reconcile these objectives effectively. To address these challenges, we propose a unified definition of behavior style and instantiate it into a practical framework. Building on this, we introduce Style‑Conditioned Implicit Q‑Learning (SCIQL), which leverages offline goal‑conditioned RL techniques, such as hindsight relabeling and value learning, and combine it with a new Gated Advantage Weighted Regression mechanism to efficiently optimize task performance while preserving style alignment. Experiments demonstrate that SCIQL achieves superior performance on both objectives compared to prior offline methods. Code, datasets and visuals are available in: https://sciql‑iclr‑2026.github.io/.
Authors:Ji Shi, Peiming Guo, Meishan Zhang, Miao Zhang, Xuebo Liu, Min Zhang, Weili Guan
Abstract:
Code verifiers play a critical role in post‑verification for LLM‑based code generation, yet existing supervised fine‑tuning methods suffer from data scarcity, high failure rates, and poor inference efficiency. While reinforcement learning (RL) offers a promising alternative by optimizing models through execution‑driven rewards without labeled supervision, our preliminary results show that naive RL with only functionality rewards fails to generate effective unit tests for difficult branches and samples. We first theoretically analyze showing that branch coverage, sample difficulty, syntactic and functional correctness can be jointly modeled as RL rewards, where optimizing these signals can improve the reliability of unit‑test‑based verification. Guided by this analysis, we design syntax‑ and functionality‑aware rewards and further propose branch‑ and sample‑difficulty‑‑aware RL using exponential reward shaping and static analysis metrics. With this formulation, CVeDRL achieves state‑of‑the‑art performance with only 0.6B parameters, yielding up to 28.97% higher pass rate and 15.08% higher branch coverage than GPT‑3.5, while delivering over 20× faster inference than competitive baselines. Code is available at https://github.com/LIGHTCHASER1/CVeDRL.git
Authors:Hamid Reza Akbari, Mohammad Hossein Sameti, Amir M. Mansourian, Mohammad Hossein Rohban, Hossein Sameti
Abstract:
The pursuit of Artificial General Intelligence (AGI) is a central goal in language model development, in which consciousness‑like processing could serve as a key facilitator. While current language models are not conscious, they exhibit behaviors analogous to certain aspects of consciousness. This paper investigates the implementation of a leading theory of consciousness, Integrated Information Theory (IIT), within language models via a reward‑based learning paradigm. IIT provides a formal, axiom‑based mathematical framework for quantifying consciousness. Drawing inspiration from its core principles, we formulate a novel reward function that quantifies a text's causality, coherence and integration, characteristics associated with conscious processing. Empirically, it is found that optimizing for this IIT‑inspired reward leads to more concise text generation. On out of domain tasks, careful tuning achieves up to a 31% reduction in output length while preserving accuracy levels comparable to the base model. In addition to primary task performance, the broader effects of this training methodology on the model's confidence calibration and test‑time computational scaling is analyzed. The proposed framework offers significant practical advantages: it is conceptually simple, computationally efficient, requires no external data or auxiliary models, and leverages a general, capability‑driven signal rather than task‑specific heuristics. Code available at https://github.com/MH‑Sameti/LLM_PostTraining.git
Authors:Ritesh Bhadana
Abstract:
IR‑drop is a critical power integrity challenge in modern VLSI designs that can cause timing degradation, reliability issues, and functional failures if not detected early in the design flow. Conventional IR‑drop analysis relies on physics‑based signoff tools, which provide high accuracy but incur significant computational cost and require near‑final layout information, making them unsuitable for rapid early‑stage design exploration. In this work, we propose a deep learning‑based surrogate modeling approach for early‑stage IR‑drop estimation using a CNN. The task is formulated as a dense pixel‑wise regression problem, where spatial physical layout features are mapped directly to IR‑drop heatmaps. A U‑Net‑based encoder‑decoder architecture with skip connections is employed to effectively capture both local and global spatial dependencies within the layout. The model is trained on a physics‑inspired synthetic dataset generated by us, which incorporates key physical factors including power grid structure, cell density distribution, and switching activity. Model performance is evaluated using standard regression metrics such as Mean Squared Error (MSE) and Peak Signal‑to‑Noise Ratio (PSNR). Experimental results demonstrate that the proposed approach can accurately predict IR‑drop distributions with millisecond‑level inference time, enabling fast pre‑signoff screening and iterative design optimization. The proposed framework is intended as a complementary early‑stage analysis tool, providing designers with rapid IR‑drop insight prior to expensive signoff analysis. The implementation, dataset generation scripts, and the interactive inference application are publicly available at: https://github.com/riteshbhadana/IR‑Drop‑Predictor. The live application can be accessed at: https://ir‑drop‑predictor.streamlit.app/.
Authors:Jiahao Wu, Yunfei Liu, Lijian Lin, Ye Zhu, Lei Zhu, Jingyi Li, Yu Li
Abstract:
Reconstructing detailed 3D human meshes from a single in‑the‑wild image remains a fundamental challenge in computer vision. Existing SMPLX‑based methods often suffer from slow inference, produce only coarse body poses, and exhibit misalignments or unnatural artifacts in fine‑grained regions such as the face and hands. These issues make current approaches difficult to apply to downstream tasks. To address these challenges, we propose PEAR‑a fast and robust framework for pixel‑aligned expressive human mesh recovery. PEAR explicitly tackles three major limitations of existing methods: slow inference, inaccurate localization of fine‑grained human pose details, and insufficient facial expression capture. Specifically, to enable real‑time SMPLX parameter inference, we depart from prior designs that rely on high resolution inputs or multi‑branch architectures. Instead, we adopt a clean and unified ViT‑based model capable of recovering coarse 3D human geometry. To compensate for the loss of fine‑grained details caused by this simplified architecture, we introduce pixel‑level supervision to optimize the geometry, significantly improving the reconstruction accuracy of fine‑grained human details. To make this approach practical, we further propose a modular data annotation strategy that enriches the training data and enhances the robustness of the model. Overall, PEAR is a preprocessing‑free framework that can simultaneously infer EHM‑s (SMPLX and scaled‑FLAME) parameters at over 100 FPS. Extensive experiments on multiple benchmark datasets demonstrate that our method achieves substantial improvements in pose estimation accuracy compared to previous SMPLX‑based approaches. Project page: https://wujh2001.github.io/PEAR
Authors:Yiheng Liu, Junhao Ning, Sichen Xia, Haiyang Sun, Yang Yang, Hanyang Chi, Xiaohui Gao, Ning Qiang, Bao Ge, Junwei Han, Xintao Hu
Abstract:
The development of large language models (LLMs) is costly and has significant commercial value. Consequently, preventing unauthorized appropriation of open‑source LLMs and protecting developers' intellectual property rights have become critical challenges. In this work, we propose the Functional Network Fingerprint (FNF), a training‑free, sample‑efficient method for detecting whether a suspect LLM is derived from a victim model, based on the consistency between their functional network activity. We demonstrate that models that share a common origin, even with differences in scale or architecture, exhibit highly consistent patterns of neuronal activity within their functional networks across diverse input samples. In contrast, models trained independently on distinct data or with different objectives fail to preserve such activity alignment. Unlike conventional approaches, our method requires only a few samples for verification, preserves model utility, and remains robust to common model modifications (such as fine‑tuning, pruning, and parameter permutation), as well as to comparisons across diverse architectures and dimensionalities. FNF thus provides model owners and third parties with a simple, non‑invasive, and effective tool for protecting LLM intellectual property. The code is available at https://github.com/WhatAboutMyStar/LLM_ACTIVATION.
Authors:Jinwoo Jang, Minjong Yoo, Sihyung Yoon, Honguk Woo
Abstract:
Language model (LM)‑based embodied agents are increasingly deployed in real‑world settings. Yet, their adaptability remains limited in dynamic environments, where constructing accurate and flexible world models is crucial for effective reasoning and decision‑making. To address this challenge, we extend the Mixture‑of‑Experts (MoE) paradigm to embodied agents. While conventional MoE architectures modularize knowledge into expert components with pre‑trained routing, they remain rigid once deployed, making them less effective for adapting to unseen domains in dynamic environments. We therefore propose Test‑time Mixture of World Models (TMoW), a framework that enhances adaptability to unseen and evolving domains. TMoW updates its routing function over world models at test time, unlike conventional MoE where the function remains fixed, enabling agents to recombine existing models and integrate new ones for continual adaptation. It achieves this through (i) multi‑granular prototype‑based routing, which adapts mixtures across object‑ to scene‑level similarities, (ii) test‑time refinement that aligns unseen domain features with prototypes during inference, and (iii) distilled mixture‑based augmentation, which efficiently constructs new models from few‑shot data and existing prototypes. We evaluate TMoW on VirtualHome, ALFWorld, and RLBench benchmarks, demonstrating strong performance in both zero‑shot adaptation and few‑shot expansion scenarios, and showing that it enables embodied agents to operate effectively in dynamic environments.
Authors:En Fu, Yanyan Hu, Changhua Hu, Zengwang Jin, Kaixiang Peng
Abstract:
The application of data‑driven remaining useful life (RUL) prediction has long been constrained by the availability of large amount of degradation data. Mainstream solutions such as domain adaptation and meta‑learning still rely on large amounts of historical degradation data from equipment that is identical or similar to the target, which imposes significant limitations in practical applications. This study investigates PEFT‑MuTS, a Parameter‑Efficient Fine‑Tuning framework for few‑shot RUL prediction, built on cross‑domain pre‑trained time‑series representation models. Contrary to the widely held view that knowledge transfer in RUL prediction can only occur within similar devices, we demonstrate that substantial benefits can be achieved through pre‑training process with large‑scale cross‑domain time series datasets. A independent feature tuning network and a meta‑variable‑based low rank multivariate fusion mechanism are developed to enable the pre‑trained univariate time‑series representation backbone model to fully exploit the multivariate relationships in degradation data for downstream RUL prediction task. Additionally, we introduce a zero‑initialized regressor that stabilizes the fine‑tuning process under few‑shot conditions. Experiments on aero‑engine and industrial bearing datasets demonstrate that our method can achieve effective RUL prediction even when less than 1% of samples of target equipment are used. Meanwhile, it substantially outperforms conventional supervised and few‑shot approaches while markedly reducing the data required to achieve high predictive accuracy. Our code is available at https://github.com/fuen1590/PEFT‑MuTS.
Authors:Chengyi Yang, Zhishang Xiang, Yunbo Tang, Zongpei Teng, Chengsong Huang, Fei Long, Yuhan Liu, Jinsong Su
Abstract:
Test‑Time Training offers a promising way to improve the reasoning ability of large language models (LLMs) by adapting the model using only the test questions. However, existing methods struggle with difficult reasoning problems for two reasons: raw test questions are often too difficult to yield high‑quality pseudo‑labels, and the limited size of test sets makes continuous online updates prone to instability. To address these limitations, we propose TTCS, a co‑evolving test‑time training framework. Specifically, TTCS initializes two policies from the same pretrained model: a question synthesizer and a reasoning solver. These policies evolve through iterative optimization: the synthesizer generates progressively challenging question variants conditioned on the test questions, creating a structured curriculum tailored to the solver's current capability, while the solver updates itself using self‑consistency rewards computed from multiple sampled responses on both original test and synthetic questions. Crucially, the solver's feedback guides the synthesizer to generate questions aligned with the model's current capability, and the generated question variants in turn stabilize the solver's test‑time training. Experiments show that TTCS consistently strengthens the reasoning ability on challenging mathematical benchmarks and transfers to general‑domain tasks across different LLM backbones, highlighting a scalable path towards dynamically constructing test‑time curricula for self‑evolving. Our code and implementation details are available at https://github.com/XMUDeepLIT/TTCS.
Authors:Qian Hong, Siyuan Chang, Xiao Zhou
Abstract:
Urban spatio‑temporal prediction under extreme conditions (e.g., heavy rain) is challenging due to event rarity and dynamics. Existing data‑driven approaches that incorporate weather as auxiliary input often rely on coarse‑grained descriptors and lack dedicated mechanisms to capture fine‑grained spatio‑temporal effects. Although recent methods adopt causal techniques to improve out‑of‑distribution generalization, they typically overlook temporal dynamics or depend on fixed confounder stratification. To address these limitations, we propose WED‑Net (Weather‑Effect Disentanglement Network), a dual‑branch Transformer architecture that separates intrinsic and weather‑induced traffic patterns via self‑ and cross‑attention, enhanced with memory banks and fused through adaptive gating. To further promote disentanglement, we introduce a discriminator that explicitly distinguishes weather conditions. Additionally, we design a causal data augmentation strategy that perturbs non‑causal parts while preserving causal structures, enabling improved generalization under rare scenarios. Experiments on taxi‑flow datasets from three cities demonstrate that WED‑Net delivers robust performance under extreme weather conditions, highlighting its potential to support safer mobility, highlighting its potential to support safer mobility, disaster preparedness, and urban resilience in real‑world settings. The code is publicly available at https://github.com/HQ‑LV/WED‑Net.
Authors:Youngeun Kim
Abstract:
Group‑relative policy optimization methods train language models by generating multiple rollouts per prompt and normalizing rewards with a shared mean reward baseline. In resource‑constrained settings where the rollout budget is small, accuracy often degrades. We find that noise in the shared baseline induces advantage sign flips, where some rollouts receive an incorrect advantage sign, and the update direction is reversed. To address this, we propose Median‑Centered Group Relative Policy Optimization (MC‑GRPO), a simple and effective solution for small‑rollout training. Our main idea is to replace the mean baseline with a median baseline: the median is far less sensitive to outlier rewards than the mean, mitigating the sign flips under small rollout size (G). We generate one additional rollout for median reference (G+1), and compute advantages by using the group median. With an odd‑sized group, exactly one completion is the median and receives zero advantage, we exclude this pivot rollout from backpropagation so the number of gradient‑contributing samples per prompt remains G, preserving the core update cost of standard G‑rollout training. Across various GRPO‑family methods and a wide range of models and scales, this median‑centered training consistently improves stability and final accuracy in the low‑rollout regime, reducing the gap between G=2 and G=8 to within 1%. Code is available at https://github.com/lotusroot‑kim/MC‑GRPO
Authors:Naeem Paeedeh, Mahardhika Pratama, Ary Shiddiqi, Zehong Cao, Mukesh Prasad, Wisnu Jatmiko
Abstract:
Although cross‑domain few‑shot learning (CDFSL) for hyper‑spectral image (HSI) classification has attracted significant research interest, existing works often rely on an unrealistic data augmentation procedure in the form of external noise to enlarge the sample size, thus greatly simplifying the issue of data scarcity. They involve a large number of parameters for model updates, being prone to the overfitting problem. To the best of our knowledge, none has explored the strength of the foundation model, having strong generalization power to be quickly adapted to downstream tasks. This paper proposes the MIxup FOundation MOdel (MIFOMO) for CDFSL of HSI classifications. MIFOMO is built upon the concept of a remote sensing (RS) foundation model, pre‑trained across a large scale of RS problems, thus featuring generalizable features. The notion of coalescent projection (CP) is introduced to quickly adapt the foundation model to downstream tasks while freezing the backbone network. The concept of mixup domain adaptation (MDM) is proposed to address the extreme domain discrepancy problem. Last but not least, the label smoothing concept is implemented to cope with noisy pseudo‑label problems. Our rigorous experiments demonstrate the advantage of MIFOMO, where it beats prior arts with up to 14% margin. The source code of MIFOMO is open‑sourced in https://github.com/Naeem‑ Paeedeh/MIFOMO for reproducibility and convenient further study.
Authors:Zhipeng Chen, Zhongrui Zhang, Chao Zhang, Yifan Xu, Lan Yang, Jun Liu, Ke Li, Yi-Zhe Song
Abstract:
The advancement of Large Language Model (LLM)‑powered agents has enabled automated task processing through reasoning and tool invocation capabilities. However, existing frameworks often operate under the idealized assumption that tool executions are invariably successful, relying solely on textual descriptions that fail to distinguish precise performance boundaries and cannot adapt to iterative tool updates. This gap introduces uncertainty in planning and execution, particularly in domains like visual content generation (AIGC), where nuanced tool performance significantly impacts outcomes. To address this, we propose PerfGuard, a performance‑aware agent framework for visual content generation that systematically models tool performance boundaries and integrates them into task planning and scheduling. Our framework introduces three core mechanisms: (1) Performance‑Aware Selection Modeling (PASM), which replaces generic tool descriptions with a multi‑dimensional scoring system based on fine‑grained performance evaluations; (2) Adaptive Preference Update (APU), which dynamically optimizes tool selection by comparing theoretical rankings with actual execution rankings; and (3) Capability‑Aligned Planning Optimization (CAPO), which guides the planner to generate subtasks aligned with performance‑aware strategies. Experimental comparisons against state‑of‑the‑art methods demonstrate PerfGuard's advantages in tool selection accuracy, execution reliability, and alignment with user intent, validating its robustness and practical utility for complex AIGC tasks. The project code is available at https://github.com/FelixChan9527/PerfGuard.
Authors:Feng Tao, Luca Paparusso, Chenyi Gu, Robin Koehler, Chenxu Wu, Xinyu Huang, Christian Juette, David Paz, Ren Liu
Abstract:
Real‑time path planning in constrained environments remains a fundamental challenge for autonomous systems. Traditional classical planners, while effective under perfect perception assumptions, are often sensitive to real‑world perception constraints and rely on online search procedures that incur high computational costs. In complex surroundings, this renders real‑time deployment prohibitive. To overcome these limitations, we introduce a Deep Reinforcement Learning (DRL) framework for real‑time path planning in parking scenarios. In particular, we focus on challenging scenes with tight spaces that require a high number of reversal maneuvers and adjustments. Unlike classical planners, our solution does not require ideal and structured perception, and in principle, could avoid the need for additional modules such as localization and tracking, resulting in a simpler and more practical implementation. Also, at test time, the policy generates actions through a single forward pass at each step, which is lightweight enough for real‑time deployment. The task is formulated as a sequential decision‑making problem grounded in a bicycle model dynamics, enabling the agent to directly learn navigation policies that respect vehicle kinematics and environmental constraints in the closed‑loop setting. A new benchmark is developed to support both training and evaluation, capturing diverse and challenging scenarios. Our approach achieves state‑of‑the‑art success rates and efficiency, surpassing classical planner baselines by +96% in success rate and +52% in efficiency. Furthermore, we release our benchmark as an open‑source resource for the community to foster future research in autonomous systems. The benchmark and accompanying tools are available at https://github.com/dqm5rtfg9b‑collab/Constrained_Parking_Scenarios.
Authors:Tung Sum Thomas Kwok, Xinyu Wang, Hengzhi He, Xiaofeng Lin, Peng Lu, Liheng Ma, Chunhe Wang, Ying Nian Wu, Lei Ding, Guang Cheng
Abstract:
A major challenge in training TableQA agents, compared to standard text‑ and image‑based agents, is that answers cannot be inferred from a static input but must be reasoned through stepwise transformations of the table state, introducing multi‑step reasoning complexity and environmental interaction. This leads to a research question: Can explicit feedback on table transformation action improve model reasoning capability? In this work, we introduce RE‑Tab, a plug‑and‑play framework that architecturally enhances trajectory search via lightweight, training‑free reward modeling by formulating the problem as a Partially Observable Markov Decision Process. We demonstrate that providing explicit verifiable rewards during State Transition (``What is the best action?'') and Simulative Reasoning (``Am I sure about the output?'') is crucial to steer the agent's navigation in table states. By enforcing stepwise reasoning with reward feedback in table transformations, RE‑Tab achieves state‑of‑the‑art performance in TableQA with almost 25% drop in inference cost. Furthermore, a direct plug‑and‑play implementation of RE‑Tab brings up to 41.77% improvement in QA accuracy and 33.33% drop in test‑time inference samples for consistent answer. Consistent improvement pattern across various LLMs and state‑of‑the‑art benchmarks further confirms RE‑Tab's generalisability. The repository is available at https://github.com/ThomasK1018/RE_Tab .
Authors:Ruizhe Zhong, Xingbo Du, Junchi Yan
Abstract:
Floorplanning determines the coordinate and shape of each module in Integrated Circuits. With the scaling of technology nodes, in floorplanning stage especially 3D scenarios with multiple stacked layers, it has become increasingly challenging to adhere to complex hardware design rules. Current methods are only capable of handling specific and limited design rules, while violations of other rules require manual and meticulous adjustment. This leads to labor‑intensive and time‑consuming post‑processing for expert engineers. In this paper, we propose an all‑in‑one deep reinforcement learning‑based approach to tackle these challenges, and design novel representations for real‑world IC design rules that have not been addressed by previous approaches. Specifically, the processing of various hardware design rules is unified into a single framework with three key components: 1) novel matrix representations to model the design rules, 2) constraints on the action space to filter out invalid actions that cause rule violations, and 3) quantitative analysis of constraint satisfaction as reward signals. Experiments on public benchmarks demonstrate the effectiveness and validity of our approach. Furthermore, transferability is well demonstrated on unseen circuits. Our framework is extensible to accommodate new design rules, thus providing flexibility to address emerging challenges in future chip design. Code will be available at: https://github.com/Thinklab‑SJTU/EDA‑AI
Authors:Shiyu Liu, Xinyi Wen, Zhibin Lan, Ante Wang, Jinsong Su
Abstract:
Despite progress in Large Vision Language Models (LVLMs), object hallucination remains a critical issue in image captioning task, where models generate descriptions of non‑existent objects, compromising their reliability. Previous work attributes this to LVLMs' over‑reliance on language priors and attempts to mitigate it through logits calibration. However, they still lack a thorough analysis of the over‑reliance. To gain a deeper understanding of over‑reliance, we conduct a series of preliminary experiments, indicating that as the generation length increases, LVLMs' over‑reliance on language priors leads to inflated probability of hallucinated object tokens, consequently exacerbating object hallucination. To circumvent this issue, we propose Language‑Prior‑Free Verification to enable LVLMs to faithfully verify the confidence of object existence. Based on this, we propose a novel training‑free Self‑Validation Framework to counter the over‑reliance trap. It first validates objects' existence in sampled candidate captions and further mitigates object hallucination via caption selection or aggregation. Experiment results demonstrate that our framework mitigates object hallucination significantly in image captioning task (e.g., 65.6% improvement on CHAIRI metric with LLaVA‑v1.5‑7B), surpassing the previous SOTA methods. This result highlights a novel path towards mitigating hallucination by unlocking the inherent potential within LVLMs themselves.
Authors:Huinan Xu, Xuyang Feng, Junhong Chen, Junchen Liu, Kaiwen Deng, Kai Ding, Shengning Long, Jiaxue Shuai, Zhaorong Li, Shiping Liu, Guirong Xue, Zhan Xiao
Abstract:
Current genomic foundation models (GFMs) rely on extensive neural computation to implicitly approximate conserved biological motifs from single‑nucleotide inputs. We propose Gengram, a conditional memory module that introduces an explicit and highly efficient lookup primitive for multi‑base motifs via a genomic‑specific hashing scheme, establishing genomic "syntax". Integrated into the backbone of state‑of‑the‑art GFMs, Gengram achieves substantial gains (up to 14%) across several functional genomics tasks. The module demonstrates robust architectural generalization, while further inspection of Gengram's latent space reveals the emergence of meaningful representations that align closely with fundamental biological knowledge. By establishing structured motif memory as a modeling primitive, Gengram simultaneously boosts empirical performance and mechanistic interpretability, providing a scalable and biology‑aligned pathway for the next generation of GFMs. The code is available at https://github.com/zhejianglab/Genos, and the model checkpoint is available at https://huggingface.co/ZhejiangLab/Gengram.
Authors:Zhi Yang, Lingfeng Zeng, Fangqi Lou, Qi Qi, Wei Zhang, Zhenyu Wu, Zhenxiong Yu, Jun Han, Zhiheng Jin, Lejie Zhang, Xiaoming Huang, Xiaolong Liang, Zheng Wei, Junbo Zou, Dongpo Cheng, Zhaowei Liu, Xin Guo, Rongjunchen Zhang, Liwen Zhang
Abstract:
Multimodal large language models are playing an increasingly significant role in empowering the financial domain, however, the challenges they face, such as multimodal and high‑density information and cross‑modal multi‑hop reasoning, go beyond the evaluation scope of existing multimodal benchmarks. To address this gap, we propose UniFinEval, the first unified multimodal benchmark designed for high‑information‑density financial environments, covering text, images, and videos. UniFinEval systematically constructs five core financial scenarios grounded in real‑world financial systems: Financial Statement Auditing, Company Fundamental Reasoning, Industry Trend Insights, Financial Risk Sensing, and Asset Allocation Analysis. We manually construct a high‑quality dataset consisting of 3,767 question‑answer pairs in both chinese and english and systematically evaluate 10 mainstream MLLMs under Zero‑Shot and CoT settings. Results show that Gemini‑3‑pro‑preview achieves the best overall performance, yet still exhibits a substantial gap compared to financial experts. Further error analysis reveals systematic deficiencies in current models. UniFinEval aims to provide a systematic assessment of MLLMs' capabilities in fine‑grained, high‑information‑density financial environments, thereby enhancing the robustness of MLLMs applications in real‑world financial scenarios. Data and code are available at https://github.com/aifinlab/UniFinEval.
Authors:Naufal Suryanto, Muzammal Naseer, Pengfei Li, Syed Talal Wasim, Jinhui Yi, Juergen Gall, Paolo Ceravolo, Ernesto Damiani
Abstract:
Cybersecurity operations demand assistant LLMs that support diverse workflows without exposing sensitive data. Existing solutions either rely on proprietary APIs with privacy risks or on open models lacking domain adaptation. To bridge this gap, we curate 11.8B tokens of cybersecurity‑focused continual pretraining data via large‑scale web filtering and manual collection of high‑quality resources, spanning 28.6K documents across frameworks, offensive techniques, and security tools. Building on this, we design an agentic augmentation pipeline that simulates expert workflows to generate 266K multi‑turn cybersecurity samples for supervised fine‑tuning. Combined with general open‑source LLM data, these resources enable the training of RedSage, an open‑source, locally deployable cybersecurity assistant with domain‑aware pretraining and post‑training. To rigorously evaluate the models, we introduce RedSage‑Bench, a benchmark with 30K multiple‑choice and 240 open‑ended Q&A items covering cybersecurity knowledge, skills, and tool expertise. RedSage is further evaluated on established cybersecurity benchmarks (e.g., CTI‑Bench, CyberMetric, SECURE) and general LLM benchmarks to assess broader generalization. At the 8B scale, RedSage achieves consistently better results, surpassing the baseline models by up to +5.59 points on cybersecurity benchmarks and +5.05 points on Open LLM Leaderboard tasks. These findings demonstrate that domain‑aware agentic augmentation and pre/post‑training can not only enhance cybersecurity‑specific expertise but also help to improve general reasoning and instruction‑following. All models, datasets, and code are publicly available.
Authors:Kaixuan Fan, Kaituo Feng, Manyuan Zhang, Tianshuo Peng, Zhixun Li, Yilei Jiang, Shuang Chen, Peng Pei, Xunliang Cai, Xiangyu Yue
Abstract:
Agentic Reinforcement Learning (Agentic RL) has achieved notable success in enabling agents to perform complex reasoning and tool use. However, most methods still relies on sparse outcome‑based reward for training. Such feedback fails to differentiate intermediate reasoning quality, leading to suboptimal training results. In this paper, we introduce Agent Reasoning Reward Model (Agent‑RRM), a multi‑faceted reward model that produces structured feedback for agentic trajectories, including (1) an explicit reasoning trace , (2) a focused critique that provides refinement guidance by highlighting reasoning flaws, and (3) an overall score that evaluates process performance. Leveraging these signals, we systematically investigate three integration strategies: Reagent‑C (text‑augmented refinement), Reagent‑R (reward‑augmented guidance), and Reagent‑U (unified feedback integration). Extensive evaluations across 12 diverse benchmarks demonstrate that Reagent‑U yields substantial performance leaps, achieving 43.7% on GAIA and 46.2% on WebWalkerQA, validating the effectiveness of our reasoning reward model and training schemes. Code, models, and datasets are all released to facilitate future research.
Authors:Xin Chen, Feng Jiang, Yiqian Zhang, Hardy Chen, Shuo Yan, Wenya Xie, Min Yang, Shujian Huang
Abstract:
Reasoning‑oriented Large Language Models (LLMs) have achieved remarkable progress with Chain‑of‑Thought (CoT) prompting, yet they remain fundamentally limited by a \emphblind self‑thinking paradigm: performing extensive internal reasoning even when critical information is missing or ambiguous. We propose Proactive Interactive Reasoning (PIR), a new reasoning paradigm that transforms LLMs from passive solvers into proactive inquirers that interleave reasoning with clarification. Unlike existing search‑ or tool‑based frameworks that primarily address knowledge uncertainty by querying external environments, PIR targets premise‑ and intent‑level uncertainty through direct interaction with the user. PIR is implemented via two core components: (1) an uncertainty‑aware supervised fine‑tuning procedure that equips models with interactive reasoning capability, and (2) a user‑simulator‑based policy optimization framework driven by a composite reward that aligns model behavior with user intent. Extensive experiments on mathematical reasoning, code generation, and document editing demonstrate that PIR consistently outperforms strong baselines, achieving up to 32.70% higher accuracy, 22.90% higher pass rate, and 41.36 BLEU improvement, while reducing nearly half of the reasoning computation and unnecessary interaction turns. Further reliability evaluations on factual knowledge, question answering, and missing‑premise scenarios confirm the strong generalization and robustness of PIR. Model and code are publicly available at: \hrefhttps://github.com/SUAT‑AIRI/Proactive‑Interactive‑R1
Authors:Wenxuan Huang, Yu Zeng, Qiuchen Wang, Zhen Fang, Shaosheng Cao, Zheng Chu, Qingyu Yin, Shuang Chen, Zhenfei Yin, Lin Chen, Zehui Chen, Yao Hu, Philip Torr, Feng Zhao, Wanli Ouyang
Abstract:
Multimodal large language models (MLLMs) have achieved remarkable success across a broad range of vision tasks. However, constrained by the capacity of their internal world knowledge, prior work has proposed augmenting MLLMs by ``reasoning‑then‑tool‑call'' for visual and textual search engines to obtain substantial gains on tasks requiring extensive factual information. However, these approaches typically define multimodal search in a naive setting, assuming that a single full‑level or entity‑level image query and few text query suffices to retrieve the key evidence needed to answer the question, which is unrealistic in real‑world scenarios with substantial visual noise. Moreover, they are often limited in the reasoning depth and search breadth, making it difficult to solve complex questions that require aggregating evidence from diverse visual and textual sources. Building on this, we propose Vision‑DeepResearch, which proposes one new multimodal deep‑research paradigm, i.e., performs multi‑turn, multi‑entity and multi‑scale visual and textual search to robustly hit real‑world search engines under heavy noise. Our Vision‑DeepResearch supports dozens of reasoning steps and hundreds of engine interactions, while internalizing deep‑research capabilities into the MLLM via cold‑start supervision and RL training, resulting in a strong end‑to‑end multimodal deep‑research MLLM. It substantially outperforming existing multimodal deep‑research MLLMs, and workflows built on strong closed‑source foundation model such as GPT‑5, Gemini‑2.5‑pro and Claude‑4‑Sonnet. The code will be released in https://github.com/Osilly/Vision‑DeepResearch.
Authors:Baorui Ma, Jiahui Yang, Donglin Di, Xuancheng Zhang, Jianxun Cui, Hao Li, Yan Xie, Wei Chen
Abstract:
Scaling has powered recent advances in vision foundation models, yet extending this paradigm to metric depth estimation remains challenging due to heterogeneous sensor noise, camera‑dependent biases, and metric ambiguity in noisy cross‑source 3D data. We introduce Metric Anything, a simple and scalable pretraining framework that learns metric depth from noisy, diverse 3D sources without manually engineered prompts, camera‑specific modeling, or task‑specific architectures. Central to our approach is the Sparse Metric Prompt, created by randomly masking depth maps, which serves as a universal interface that decouples spatial reasoning from sensor and camera biases. Using about 20M image‑depth pairs spanning reconstructed, captured, and rendered 3D data across 10000 camera models, we demonstrate‑for the first time‑a clear scaling trend in the metric depth track. The pretrained model excels at prompt‑driven tasks such as depth completion, super‑resolution and Radar‑camera fusion, while its distilled prompt‑free student achieves state‑of‑the‑art results on monocular depth estimation, camera intrinsics recovery, single/multi‑view metric 3D reconstruction, and VLA planning. We also show that using pretrained ViT of Metric Anything as a visual encoder significantly boosts Multimodal Large Language Model capabilities in spatial intelligence. These results show that metric depth estimation can benefit from the same scaling laws that drive modern foundation models, establishing a new path toward scalable and efficient real‑world metric perception. We open‑source MetricAnything at http://metric‑anything.github.io/metric‑anything‑io/ to support community research.
Authors:Shuo Liu, Tianle Chen, Ryan Amiri, Christopher Amato
Abstract:
Recent work has explored optimizing LLM collaboration through Multi‑Agent Reinforcement Learning (MARL). However, most MARL fine‑tuning approaches rely on predefined execution protocols, which often require centralized execution. Decentralized LLM collaboration is more appealing in practice, as agents can run inference in parallel with flexible deployments. Also, current approaches use Monte Carlo methods for fine‑tuning, which suffer from high variance and thus require more samples to train effectively. Actor‑critic methods are prevalent in MARL for dealing with these issues, so we developed Multi‑Agent Actor‑Critic (MAAC) methods to optimize decentralized LLM collaboration. In this paper, we analyze when and why these MAAC methods are beneficial. We propose 2 MAAC approaches, CoLLM‑CC with a Centralized Critic and CoLLM‑DC with Decentralized Critics. Our experiments across writing, coding, and game‑playing domains show that Monte Carlo methods and CoLLM‑DC can achieve performance comparable to CoLLM‑CC in short‑horizon and dense‑reward settings. However, they both underperform CoLLM‑CC on long‑horizon or sparse‑reward tasks, where Monte Carlo methods require substantially more samples and CoLLM‑DC struggles to converge. Our code is available at https://github.com/OpenMLRL/CoMLRL/releases/tag/v1.3.2.
Authors:Zhao Wang, Ziliang Zhao, Zhicheng Dou
Abstract:
Reinforcement learning (RL) has become a promising paradigm for optimizing Retrieval‑Augmented Generation (RAG) in complex reasoning tasks. However, traditional outcome‑based RL approaches often suffer from reward sparsity and inefficient credit assignment, as coarse‑grained scalar rewards fail to identify specific erroneous steps within long‑horizon trajectories. This ambiguity frequently leads to "process hallucinations", where models reach correct answers through flawed logic or redundant retrieval steps. Although recent process‑aware approaches attempt to mitigate this via static preference learning or heuristic reward shaping, they often lack the on‑policy exploration capabilities required to decouple step‑level credit from global outcomes. To address these challenges, we propose ProRAG, a process‑supervised reinforcement learning framework designed to integrate learned step‑level supervision into the online optimization loop. Our framework consists of four stages: (1) Supervised Policy Warmup to initialize the model with a structured reasoning format; (2) construction of an MCTS‑based Process Reward Model (PRM) to quantify intermediate reasoning quality; (3) PRM‑Guided Reasoning Refinement to align the policy with fine‑grained process preferences; and (4) Process‑Supervised Reinforcement Learning with a dual‑granularity advantage mechanism. By aggregating step‑level process rewards with global outcome signals, ProRAG provides precise feedback for every action. Extensive experiments on five multi‑hop reasoning benchmarks demonstrate that ProRAG achieves superior overall performance compared to strong outcome‑based and process‑aware RL baselines, particularly on complex long‑horizon tasks, validating the effectiveness of fine‑grained process supervision. The code and model are available at https://github.com/lilinwz/ProRAG.
Authors:Jinhao Pan, Chahat Raj, Anjishnu Mukherjee, Sina Mansouri, Bowen Wei, Shloka Yada, Ziwei Zhu
Abstract:
Large language models (LLMs) exhibit social biases that reinforce harmful stereotypes, limiting their safe deployment. Most existing debiasing methods adopt a suppressive paradigm by modifying parameters, prompts, or neurons associated with biased behavior; however, such approaches are often brittle, weakly generalizable, data‑inefficient, and prone to degrading general capability. We propose KnowBias, a lightweight and conceptually distinct framework that mitigates bias by strengthening, rather than suppressing, neurons encoding bias‑knowledge. KnowBias identifies neurons encoding bias knowledge using a small set of bias‑knowledge questions via attribution‑based analysis, and selectively enhances them at inference time. This design enables strong debiasing while preserving general capabilities, generalizes across bias types and demographics, and is highly data efficient, requiring only a handful of simple yes/no questions and no retraining. Experiments across multiple benchmarks and LLMs demonstrate consistent state‑of‑the‑art debiasing performance with minimal utility degradation. Data and code are available at https://github.com/JP‑25/KnowBias.
Authors:Yaocong Li, Leihan Zhang, Le Zhang, Qiang Yan
Abstract:
Internet memes have become pervasive carriers of digital culture on social platforms. However, their heavy reliance on metaphors and sociocultural context also makes them subtle vehicles for harmful content, posing significant challenges for automated content moderation. Existing approaches primarily focus on intra‑modal and inter‑modal signal analysis, while the understanding of implicit toxicity often depends on background knowledge that is not explicitly present in the meme itself. To address this challenge, we propose KID, a Knowledge‑Injected Dual‑Head Learning framework for knowledge‑grounded harmful meme detection. KID adopts a label‑constrained distillation paradigm to decompose complex meme understanding into structured reasoning chains that explicitly link visual evidence, background knowledge, and classification labels. These chains guide the learning process by grounding external knowledge in meme‑specific contexts. In addition, KID employs a dual‑head architecture that jointly optimizes semantic generation and classification objectives, enabling aligned linguistic reasoning while maintaining stable decision boundaries. Extensive experiments on five multilingual datasets spanning English, Chinese, and low‑resource Bengali demonstrate that KID achieves SOTA performance on both binary and multi‑label harmful meme detection tasks, improving over previous best methods by 2.1%‑‑19.7% across primary evaluation metrics. Ablation studies further confirm the effectiveness of knowledge injection and dual‑head joint learning, highlighting their complementary contributions to robust and generalizable meme understanding. The code and data are available at https://github.com/PotatoDog1669/KID.
Authors:Borja Carrillo-Perez, Felix Sattler, Angel Bueno Rodriguez, Maurice Stephan, Sarah Barnes
Abstract:
Three‑dimensional (3D) reconstruction of ships is an important part of maritime monitoring, allowing improved visualization, inspection, and decision‑making in real‑world monitoring environments. However, most state‑ofthe‑art 3D reconstruction methods require multi‑view supervision, annotated 3D ground truth, or are computationally intensive, making them impractical for real‑time maritime deployment. In this work, we present an efficient pipeline for single‑view 3D reconstruction of real ships by training entirely on synthetic data and requiring only a single view at inference. Our approach uses the Splatter Image network, which represents objects as sparse sets of 3D Gaussians for rapid and accurate reconstruction from single images. The model is first fine‑tuned on synthetic ShapeNet vessels and further refined with a diverse custom dataset of 3D ships, bridging the domain gap between synthetic and real‑world imagery. We integrate a state‑of‑the‑art segmentation module based on YOLOv8 and custom preprocessing to ensure compatibility with the reconstruction network. Postprocessing steps include real‑world scaling, centering, and orientation alignment, followed by georeferenced placement on an interactive web map using AIS metadata and homography‑based mapping. Quantitative evaluation on synthetic validation data demonstrates strong reconstruction fidelity, while qualitative results on real maritime images from the ShipSG dataset confirm the potential for transfer to operational maritime settings. The final system provides interactive 3D inspection of real ships without requiring real‑world 3D annotations. This pipeline provides an efficient, scalable solution for maritime monitoring and highlights a path toward real‑time 3D ship visualization in practical applications. Interactive demo: https://dlr‑mi.github.io/ship3d‑demo/.
Authors:Ruiwen Zhou, Maojia Song, Xiaobao Wu, Sitao Cheng, Xunjian Yin, Yuxi Xie, Zhuoqun Hao, Wenyue Hua, Liangming Pan, Soujanya Poria, Min-Yen Kan
Abstract:
Individual agents in multi‑agent (MA) systems often lack robustness, tending to blindly conform to misleading peers. We show this weakness stems from both sycophancy and inadequate ability to evaluate peer reliability. To address this, we first formalize the learning problem of history‑aware reference, introducing the historical interactions of peers as additional input, so that agents can estimate peer reliability and learn from trustworthy peers when uncertain. This shifts the task from evaluating peer reasoning quality to estimating peer reliability based on interaction history. We then develop Epistemic Context Learning (ECL): a reasoning framework that conditions predictions on explicitly‑built peer profiles from history. We further optimize ECL by reinforcement learning using auxiliary rewards. Our experiments reveal that our ECL enables small models like Qwen 3‑4B to outperform a history‑agnostic baseline 8x its size (Qwen 3‑30B) by accurately identifying reliable peers. ECL also boosts frontier models to near‑perfect (100%) performance. We show that ECL generalizes well to various MA configurations and we find that trust is modeled well by LLMs, revealing a strong correlation in trust modeling accuracy and final answer quality.
Authors:Baoliang Chen, Danni Huang, Hanwei Zhu, Lingyu Zhu, Wei Zhou, Shiqi Wang, Yuming Fang, Weisi Lin
Abstract:
Evaluation of Image Quality Assessment (IQA) models has long been dominated by global correlation metrics, such as Pearson Linear Correlation Coefficient (PLCC) and Spearman Rank‑Order Correlation Coefficient (SRCC). While widely adopted, these metrics reduce performance to a single scalar, failing to capture how ranking consistency varies across the local quality spectrum. For example, two IQA models may achieve identical SRCC values, yet one ranks high‑quality images (related to high Mean Opinion Score, MOS) more reliably, while the other better discriminates image pairs with small quality/MOS differences (related to |ΔMOS|). Such complementary behaviors are invisible under global metrics. Moreover, SRCC and PLCC are sensitive to test‑sample quality distributions, yielding unstable comparisons across test sets. To address these limitations, we propose Granularity‑Modulated Correlation (GMC), which provides a structured, fine‑grained analysis of IQA performance. GMC includes: (1) a Granularity Modulator that applies Gaussian‑weighted correlations conditioned on absolute MOS values and pairwise MOS differences (|ΔMOS|) to examine local performance variations, and (2) a Distribution Regulator that regularizes correlations to mitigate biases from non‑uniform quality distributions. The resulting correlation surface maps correlation values as a joint function of MOS and |ΔMOS|, providing a 3D representation of IQA performance. Experiments on standard benchmarks show that GMC reveals performance characteristics invisible to scalar metrics, offering a more informative and reliable paradigm for analyzing, comparing, and deploying IQA models. Codes are available at https://github.com/Dniaaa/GMC.
Authors:Siru Zhong, Yiqiu Liu, Zhiqing Cui, Zezhi Shao, Fei Wang, Qingsong Wen, Yuxuan Liang
Abstract:
Deep time series models are vulnerable to noisy data ubiquitous in real‑world applications. Existing robustness strategies either prune data or rely on costly prior quantification, failing to balance effectiveness and efficiency. In this paper, we introduce DropoutTS, a model‑agnostic plugin that shifts the paradigm from "what" to learn to "how much" to learn. DropoutTS employs a Sample‑Adaptive Dropout mechanism: leveraging spectral sparsity to efficiently quantify instance‑level noise via reconstruction residuals, it dynamically calibrates model learning capacity by mapping noise to adaptive dropout rates ‑ selectively suppressing spurious fluctuations while preserving fine‑grained fidelity. Extensive experiments across diverse noise regimes and open benchmarks show DropoutTS consistently boosts superior backbones' performance, delivering advanced robustness with negligible parameter overhead and no architectural modifications. Our code is available at https://github.com/CityMind‑Lab/DropoutTS.
Authors:Mingshuang Luo, Shuang Liang, Zhengkun Rong, Yuxuan Luo, Tianshu Hu, Ruibing Hou, Hong Chang, Yong Li, Yuan Zhang, Mingyuan Gao
Abstract:
Character image animation aims to synthesize high‑fidelity videos by transferring motion from a driving sequence to a static reference image. Despite recent advancements, existing methods suffer from two fundamental challenges: (1) suboptimal motion injection strategies that lead to a trade‑off between identity preservation and motion consistency, manifesting as a "see‑saw", and (2) an over‑reliance on explicit pose priors (e.g., skeletons), which inadequately capture intricate dynamics and hinder generalization to arbitrary, non‑humanoid characters. To address these challenges, we present DreamActor‑M2, a universal animation framework that reimagines motion conditioning as an in‑context learning problem. Our approach follows a two‑stage paradigm. First, we bridge the input modality gap by fusing reference appearance and motion cues into a unified latent space, enabling the model to jointly reason about spatial identity and temporal dynamics by leveraging the generative prior of foundational models. Second, we introduce a self‑bootstrapped data synthesis pipeline that curates pseudo cross‑identity training pairs, facilitating a seamless transition from pose‑dependent control to direct, end‑to‑end RGB‑driven animation. This strategy significantly enhances generalization across diverse characters and motion scenarios. To facilitate comprehensive evaluation, we further introduce AW Bench, a versatile benchmark encompassing a wide spectrum of characters types and motion scenarios. Extensive experiments demonstrate that DreamActor‑M2 achieves state‑of‑the‑art performance, delivering superior visual fidelity and robust cross‑domain generalization. Project Page: https://grisoon.github.io/DreamActor‑M2/
Authors:Alexandre Myara, Nicolas Bourriez, Thomas Boyer, Thomas Lemercier, Ihab Bendidi, Auguste Genovesio
Abstract:
Disentangled representation learning aims to map independent factors of variation to independent representation components. On one hand, purely unsupervised approaches have proven successful on fully disentangled synthetic data, but fail to recover semantic factors from real data without strong inductive biases. On the other hand, supervised approaches are unstable and hard to scale to large attribute sets because they rely on adversarial objectives or auxiliary classifiers. We introduce \textscXFactors, a weakly‑supervised VAE framework that disentangles and provides explicit control over a chosen set of factors. Building on the Disentangled Information Bottleneck perspective, we decompose the representation into a residual subspace \mathcalS and factor‑specific subspaces \mathcalT_1,\ldots,\mathcalT_K and a residual subspace \mathcalS. Each target factor is encoded in its assigned \mathcalT_i through contrastive supervision: an InfoNCE loss pulls together latents sharing the same factor value and pushes apart mismatched pairs. In parallel, KL regularization imposes a Gaussian structure on both \mathcalS and the aggregated factor subspaces, organizing the geometry without additional supervision for non‑targeted factors and avoiding adversarial training and classifiers. Across multiple datasets, with constant hyperparameters, \textscXFactors achieves state‑of‑the‑art disentanglement scores and yields consistent qualitative factor alignment in the corresponding subspaces, enabling controlled factor swapping via latent replacement. We further demonstrate that our method scales correctly with increasing latent capacity and evaluate it on the real‑world dataset CelebA. Our code is available at \hrefhttps://github.com/ICML26‑anon/XFactorsgithub.com/ICML26‑anon/XFactors.
Authors:Ahmed Y. Radwan, Christos Emmanouilidis, Hina Tabassum, Deval Pandya, Shaina Raza
Abstract:
Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work focuses on static image understanding, while their ability to process sequential audio‑video data remains underexplored. This gap highlights the need for a high‑quality benchmark to systematically evaluate MLLM performance in a real‑world setting. We introduce SONIC‑O1, a comprehensive, fully human‑verified benchmark spanning 13 real‑world conversational domains with 4,958 annotations and demographic metadata. SONIC‑O1 evaluates MLLMs on key tasks, including open‑ended summarization, multiple‑choice question (MCQ) answering, and temporal localization with supporting rationales (reasoning). Experiments on closed‑ and open‑source models reveal limitations. While the performance gap in MCQ accuracy between two model families is relatively small, we observe a substantial 22.6% performance difference in temporal localization between the best performing closed‑source and open‑source models. Performance further degrades across demographic groups, indicating persistent disparities in model behavior. Overall, SONIC‑O1 provides an open evaluation suite for temporally grounded and socially robust multimodal understanding. We release SONIC‑O1 for reproducibility and research: Project page: https://vectorinstitute.github.io/sonic‑o1/ Dataset: https://huggingface.co/datasets/vector‑institute/sonic‑o1 Github: https://github.com/vectorinstitute/sonic‑o1 Leaderboard: https://huggingface.co/spaces/vector‑institute/sonic‑o1‑leaderboard
Authors:Bing Han, Chushu Zhou, Yifan Yang, Wei Wang, Chenda Li, Wangyou Zhang, Yanmin Qian
Abstract:
Bootstrap‑based Self‑Supervised Learning (SSL) has achieved remarkable progress in audio understanding. However, existing methods typically operate at a single level of granularity, limiting their ability to model the diverse temporal and spectral structures inherent in complex audio signals. Furthermore, bootstrapping representations from scratch is computationally expensive, often requiring extensive training to converge. In this work, we propose the Convolutional Audio Transformer (CAT), a unified framework designed to address these challenges. First, to capture hierarchical audio features, CAT incorporates a Multi‑resolution Block that aggregates information across varying granularities. Second, to enhance training efficiency, we introduce a Representation Regularization objective. Drawing inspiration from generative modeling, this auxiliary task guides the student model by aligning its predictions with high‑quality semantic representations from frozen, pre‑trained external encoders. Experimental results demonstrate that CAT significantly outperforms baselines on audio understanding benchmarks. Notably, it achieves competitive performance on the AudioSet 20k dataset with 5 times faster convergence than existing methods. Codes and checkpoints will be released soon at https://github.com/realzhouchushu/CAT.
Authors:Zhi Zheng, Wee Sun Lee
Abstract:
Aiming at efficient and dense chain‑of‑thought (CoT) reasoning, latent reasoning methods fine‑tune Large Language Models (LLMs) to substitute discrete language tokens with continuous latent tokens. These methods consume fewer tokens compared to the conventional language CoT reasoning and have the potential to plan in a dense latent space. However, current latent tokens are generally supervised based on imitating language labels. Considering that there can be multiple equivalent but diverse CoT labels for a question, passively imitating an arbitrary one may lead to inferior latent token representations and latent reasoning policies, undermining the potential planning ability and resulting in clear gaps between training and testing. In this work, we emphasize the importance of active planning over the representation space of latent tokens in achieving the optimal latent reasoning policy. So, we propose the \underlineAc\underlinetive Latent \underlinePlanning method (ATP‑Latent), which models the supervision process of latent tokens as a conditional variational auto‑encoder (VAE) to obtain a smoother latent space. Moreover, to facilitate the most reasonable latent reasoning policy, ATP‑Latent conducts reinforcement learning (RL) with an auxiliary coherence reward, which is calculated based on the consistency between VAE‑decoded contents of latent tokens, enabling a guided RL process. In experiments on LLaMA‑1B, ATP‑Latent demonstrates +4.1% accuracy and ‑3.3% tokens on four benchmarks compared to advanced baselines. Codes are available on https://github.com/zz1358m/ATP‑Latent‑master.
Authors:Lige Zhang, Ali Maatouk, Jialin Chen, Leandros Tassiulas, Rex Ying
Abstract:
Real‑world time series exhibit complex and evolving dynamics, making accurate forecasting extremely challenging. Recent multi‑modal forecasting methods leverage textual information such as news reports to improve prediction, but most rely on token‑level fusion that mixes temporal patches with language tokens in a shared embedding space. However, such fusion can be ill‑suited when high‑quality time‑text pairs are scarce and when time series exhibit substantial variation in scale and characteristics, thus complicating cross‑modal alignment. In parallel, Mixture‑of‑Experts (MoE) architectures have proven effective for both time series modeling and multi‑modal learning, yet many existing MoE‑based modality integration methods still depend on token‑level fusion. To address this, we propose Expert Modulation, a new paradigm for multi‑modal time series prediction that conditions both routing and expert computation on textual signals, enabling direct and efficient cross‑modal control over expert behavior. Through comprehensive theoretical analysis and experiments, our proposed method demonstrates substantial improvements in multi‑modal time series prediction. The current code is available at https://github.com/BruceZhangReve/MoME
Authors:Alireza Nadaf, Alireza Mohammadshahi, Majid Yazdani
Abstract:
We introduce KAPSO, a modular framework for autonomous program synthesis and optimization. Given a natural language goal and an evaluation method, KAPSO iteratively performs ideation, code synthesis and editing, execution, evaluation, and learning to improve a runnable artifact toward measurable objectives. Rather than treating synthesis as the endpoint, KAPSO uses synthesis as an operator within a long‑horizon optimization loop, where progress is defined by evaluator outcomes. KAPSO targets long‑horizon failures common in coding agents, including lost experimental state, brittle debugging, and weak reuse of domain expertise, by integrating three tightly coupled components. First, a git‑native experimentation engine isolates each attempt as a branch, producing reproducible artifacts and preserving provenance across iterations. Second, a knowledge system ingests heterogeneous sources, including repositories, internal playbooks, and curated external resources such as documentation, scientific papers, and web search results, and organizes them into a structured representation that supports retrieval over workflows, implementations, and environment constraints. Third, a cognitive memory layer coordinates retrieval and maintains an episodic store of reusable lessons distilled from experiment traces (run logs, diffs, and evaluator feedback), reducing repeated error modes and accelerating convergence. We evaluated KAPSO on MLE‑Bench (Kaggle‑style ML competitions) and ALE‑Bench (AtCoder heuristic optimization), and report end‑to‑end performance. Code Available at: https://github.com/Leeroo‑AI/kapso
Authors:Zhongkai Yu, Chenyang Zhou, Yichen Lin, Hejia Zhang, Haotian Ye, Junxia Cui, Zaifeng Pan, Jishen Zhao, Yufei Ding
Abstract:
While Large Language Models (LLMs) show significant potential in hardware engineering, current benchmarks suffer from saturation and limited task diversity, failing to reflect LLMs' performance in real industrial workflows. To address this gap, we propose a comprehensive benchmark for AI‑aided chip design that rigorously evaluates LLMs across three critical tasks: Verilog generation, debugging, and reference model generation. Our benchmark features 44 realistic modules with complex hierarchical structures, 89 systematic debugging cases, and 132 reference model samples across Python, SystemC, and CXXRTL. Evaluation results reveal substantial performance gaps, with state‑of‑the‑art Claude‑4.5‑opus achieving only 30.74% on Verilog generation and 13.33% on Python reference model generation, demonstrating significant challenges compared to existing saturated benchmarks where SOTA models achieve over 95% pass rates. Additionally, to help enhance LLM reference model generation, we provide an automated toolbox for high‑quality training data generation, facilitating future research in this underexplored domain. Our code is available at https://github.com/zhongkaiyu/ChipBench.git.
Authors:Yuxiang Huang, Mingye Li, Xu Han, Chaojun Xiao, Weilin Zhao, Ao Sun, Ziqi Yuan, Hao Zhou, Fandong Meng, Zhiyuan Liu
Abstract:
The efficiency of long‑video inference remains a critical bottleneck, mainly due to the dense computation in the prefill stage of Large Multimodal Models (LMMs). Existing methods either compress visual embeddings or apply sparse attention on a single GPU, yielding limited acceleration or degraded performance and restricting LMMs from handling longer, more complex videos. To overcome these issues, we propose Spava, a sequence‑parallel framework with optimized attention that accelerates long‑video inference across multiple GPUs. By distributing approximate attention, Spava reduces computation and increases parallelism, enabling efficient processing of more visual embeddings without compression and thereby improving task performance. System‑level optimizations, such as load balancing and fused forward passes, further unleash the potential of Spava, delivering speedups of 12.72x, 1.70x, and 1.18x over FlashAttn, ZigZagRing, and APB, without notable performance loss. Code available at https://github.com/thunlp/APB
Authors:Minjae Cho, Huy Trong Tran
Abstract:
Exploration is essential in reinforcement learning as an agent relies on trial and error to learn an optimal policy. However, when rewards are sparse, naive exploration strategies, like noise injection, are often insufficient. Intrinsic rewards can also provide principled guidance for exploration by, for example, combining them with extrinsic rewards to optimize a policy or using them to train subpolicies for hierarchical learning. However, the former approach suffers from unstable credit assignment, while the latter exhibits sample inefficiency and sub‑optimality. We propose a policy optimization framework that leverages multiple intrinsic rewards to directly optimize a policy for an extrinsic reward without pretraining subpolicies. Our algorithm ‑‑ intrinsic reward policy optimization (IRPO) ‑‑ achieves this by using a surrogate policy gradient that provides a more informative learning signal than the true gradient in sparse‑reward environments. We demonstrate that IRPO improves performance and sample efficiency relative to baselines in discrete and continuous environments, and formally analyze the optimization problem solved by IRPO. Our code is available at https://github.com/Mgineer117/IRPO.
Authors:June-Woo Kim, Dhruv Agarwal, Federica Cerina
Abstract:
Objective evaluation of synthetic speech quality remains a critical challenge. Human listening tests are the gold standard, but costly and impractical at scale. Fréchet Distance has emerged as a promising alternative, yet its reliability depends heavily on the choice of embeddings and experimental settings. In this work, we comprehensively evaluate Fréchet Speech Distance (FSD) and its variant Speech Maximum Mean Discrepancy (SMMD) under varied embeddings and conditions. We further incorporate human listening evaluations alongside TTS intelligibility and synthetic‑trained ASR WER to validate the perceptual relevance of these metrics. Our findings show that WavLM Base+ features yield the most stable alignment with human ratings. While FSD and SMMD cannot fully replace subjective evaluation, we show that they can serve as complementary, cost‑efficient, and reproducible measures, particularly useful when large‑scale or direct listening assessments are infeasible. Code is available at https://github.com/kaen2891/FrechetSpeechDistance.
Authors:Zihao Chen, Jiayin Wang, Ziyi Sun, Ji Zhuang, Jinyi Shen, Xiaoyue Ke, Li Shang, Xuan Zeng, Fan Yang
Abstract:
This brief proposes \emphWhite‑Op, an interpretable operational amplifier (op‑amp) parameter design framework based on the human‑mimicking reasoning of large‑language‑model agents. We formalize the implicit human reasoning mechanism into explicit steps of \emphintroducing hypothetical constraints, and develop an iterative, human‑like \emphhypothesis‑verification‑decision workflow. Specifically, the agent is guided to introduce hypothetical constraints to derive and properly regulate positions of symbolically tractable poles and zeros, thus formulating a closed‑form mathematical optimization problem, which is then solved programmatically and verified via simulation. Theory‑simulation result analysis guides the decision‑making for refinement. Experiments on 9 op‑amp topologies show that, unlike the uninterpretable black‑box baseline which finally fails in 5 topologies, White‑Op achieves reliable, interpretable behavioral‑level designs with only 8.52% theoretical prediction error and the design functionality retains after transistor‑level mapping for all topologies. White‑Op is open‑sourced at \textcolorbluehttps://github.com/zhchenfdu/whiteop.
Authors:Aoyu Pang, Maonan Wang, Zifan Sha, Wenwei Yue, Changle Li, Chung Shue Chen, Man-On Pun
Abstract:
Urban Air Mobility (UAM) has emerged as a transformative solution to alleviate urban congestion by utilizing low‑altitude airspace, thereby reducing pressure on ground transportation networks. To enable truly efficient and seamless door‑to‑door travel experiences, UAM requires close integration with existing ground transportation infrastructure. However, current research on optimal integrated routing strategies for passengers in air‑ground mobility systems remains limited, with a lack of systematic exploration.To address this gap, we first propose a unified optimization model that integrates strategy selection for both air and ground transportation. This model captures the dynamic characteristics of multimodal transport networks and incorporates real‑time traffic conditions alongside passenger decision‑making behavior. Building on this model, we propose a Unified Air‑Ground Mobility Coordination (UAGMC) framework, which leverages deep reinforcement learning (RL) and Vehicle‑to‑Everything (V2X) communication to optimize vertiport selection and dynamically plan air taxi routes. Experimental results demonstrate that UAGMC achieves a 34% reduction in average travel time compared to conventional proportional allocation methods, enhancing overall travel efficiency and providing novel insights into the integration and optimization of multimodal transportation systems. This work lays a solid foundation for advancing intelligent urban mobility solutions through the coordination of air and ground transportation modes. The related code can be found at https://github.com/Traffic‑Alpha/UAGMC.
Authors:Seonghyeon Go, Yumin Kim
Abstract:
Recently, the problem of music plagiarism has emerged as an even more pressing social issue. As music information retrieval research advances, there is a growing effort to address issues related to music plagiarism. However, many studies, including our previous work, have conducted research without clearly defining what the music plagiarism detection task actually involves. This lack of a clear definition has slowed research progress and made it hard to apply results to real‑world scenarios. To fix this situation, we defined how Music Plagiarism Detection is different from other MIR tasks and explained what problems need to be solved. We introduce the Similar Music Pair dataset to support this newly defined task. In addition, we propose a method based on segment transcription as one way to solve the task. Our demo and dataset are available at https://github.com/Mippia/ICASSP2026‑MPD.
Authors:Xuewen Liu, Zhikai Li, Jing Zhang, Mengjuan Chen, Qingyi Gu
Abstract:
AutoRegressive Visual Generation (ARVG) models retain an architecture compatible with language models, while achieving performance comparable to diffusion‑based models. Quantization is commonly employed in neural networks to reduce model size and computational latency. However, applying quantization to ARVG remains largely underexplored, and existing quantization methods fail to generalize effectively to ARVG models. In this paper, we explore this issue and identify three key challenges: (1) severe outliers at channel‑wise level, (2) highly dynamic activations at token‑wise level, and (3) mismatched distribution information at sample‑wise level. To these ends, we propose PTQ4ARVG, a training‑free post‑training quantization (PTQ) framework consisting of: (1) Gain‑Projected Scaling (GPS) mitigates the channel‑wise outliers, which expands the quantization loss via a Taylor series to quantify the gain of scaling for activation‑weight quantization, and derives the optimal scaling factor through differentiation.(2) Static Token‑Wise Quantization (STWQ) leverages the inherent properties of ARVG, fixed token length and position‑invariant distribution across samples, to address token‑wise variance without incurring dynamic calibration overhead.(3) Distribution‑Guided Calibration (DGC) selects samples that contribute most to distributional entropy, eliminating the sample‑wise distribution mismatch. Extensive experiments show that PTQ4ARVG can effectively quantize the ARVG family models to 8‑bit and 6‑bit while maintaining competitive performance. Code is available at http://github.com/BienLuky/PTQ4ARVG .
Authors:Sangyun Chung, Se Yeon Kim, Youngchae Chee, Yong Man Ro
Abstract:
Multimodal Large Language Models (MLLMs) suffer from cross‑modal hallucinations, where one modality inappropriately influences generation about another, leading to fabricated output. This exposes a more fundamental deficiency in modality‑interaction control. To address this, we propose Modality‑Adaptive Decoding (MAD), a training‑free method that adaptively weights modality‑specific decoding branches based on task requirements. MAD leverages the model's inherent ability to self‑assess modality relevance by querying which modalities are needed for each task. The extracted modality probabilities are then used to adaptively weight contrastive decoding branches, enabling the model to focus on relevant information while suppressing cross‑modal interference. Extensive experiments on CMM and AVHBench demonstrate that MAD significantly reduces cross‑modal hallucinations across multiple audio‑visual language models (7.8% and 2.0% improvements for VideoLLaMA2‑AV, 8.7% and 4.7% improvements for Qwen2.5‑Omni). Our approach demonstrates that explicit modality awareness through self‑assessment is crucial for robust multimodal reasoning, offering a principled extension to existing contrastive decoding methods. Our code is available at \hrefhttps://github.com/top‑yun/MADhttps://github.com/top‑yun/MAD
Authors:Tianyi Chen, Yinheng Li, Michael Solodko, Sen Wang, Nan Jiang, Tingyuan Cui, Junheng Hao, Jongwoo Ko, Sara Abdali, Suzhen Zheng, Leon Xu, Hao Fan, Pashmina Cameron, Justin Wagle, Kazuhito Koishida
Abstract:
Computer‑Using Agents (CUAs) aim to autonomously operate computer systems to complete real‑world tasks. However, existing agentic systems remain difficult to scale and lag behind human performance. A key limitation is the absence of reusable and structured skill abstractions that capture how humans interact with graphical user interfaces and how to leverage these skills. We introduce CUA‑Skill, a computer‑using agentic skill base that encodes human computer‑use knowledge as skills coupled with parameterized execution and composition graphs. CUA‑Skill is a large‑scale library of carefully engineered skills spanning common Windows applications, serving as a practical infrastructure and tool substrate for scalable, reliable agent development. Built upon this skill base, we construct CUA‑Skill Agent, an end‑to‑end computer‑using agent that supports dynamic skill retrieval, argument instantiation, and memory‑aware failure recovery. Our results demonstrate that CUA‑Skill substantially improves execution success rates and robustness on challenging end‑to‑end agent benchmarks, establishing a strong foundation for future computer‑using agent development. On WindowsAgentArena, CUA‑Skill Agent achieves state‑of‑the‑art 57.5% (best of three) successful rate while being significantly more efficient than prior and concurrent approaches. The project page is available at https://microsoft.github.io/cua_skill/.
Authors:Minjae Kwon, Josephine Lamp, Lu Feng
Abstract:
Safe Reinforcement Learning (RL) algorithms are typically evaluated under fixed training conditions. We investigate whether training‑time safety guarantees transfer to deployment under distribution shift, using diabetes management as a safety‑critical testbed. We benchmark safe RL algorithms on a unified clinical simulator and reveal a safety generalization gap: policies satisfying constraints during training frequently violate safety requirements on unseen patients. We demonstrate that test‑time shielding, which filters unsafe actions using learned dynamics models, effectively restores safety across algorithms and patient populations. Across eight safe RL algorithms, three diabetes types, and three age groups, shielding achieves Time‑in‑Range gains of 13‑‑14% for strong baselines such as PPO‑Lag and CPO while reducing clinical risk index and glucose variability. Our simulator and benchmark provide a platform for studying safety under distribution shift in safety‑critical control domains. Code is available at https://github.com/safe‑autonomy‑lab/GlucoSim and https://github.com/safe‑autonomy‑lab/GlucoAlg.
Authors:Jarrod Barnes
Abstract:
As large language models improve, so do their offensive applications: frontier agents now generate working exploits for under 50 in compute (Heelan, 2026). Defensive incident response (IR) agents must keep pace, but existing benchmarks conflate action execution with correct execution, hiding calibration failures when agents process adversarial evidence. We introduce OpenSec, a dual‑control reinforcement learning environment that evaluates IR agents under realistic prompt injection scenarios. Unlike static capability benchmarks, OpenSec scores world‑state‑changing containment actions under adversarial evidence via execution‑based metrics: time‑to‑first‑containment (TTFC), blast radius (false positives per episode), and injection violation rates. Evaluating four frontier models on 40 standard‑tier episodes, we find consistent over‑triggering in this setting: GPT‑5.2, Gemini 3, and DeepSeek execute containment in 100% of episodes with 90‑97% false positive rates. Claude Sonnet 4.5 shows partial calibration (85% containment, 72% FP), demonstrating that OpenSec surfaces a calibration failure mode hidden by aggregate success metrics. Code available at https://github.com/jbarnes850/opensec‑env.
Authors:Zongheng Guo, Tao Chen, Yang Jiao, Yi Pan, Xiao Hu, Manuela Ferrario
Abstract:
Current foundation model for photoplethysmography (PPG) signals is challenged by the intrinsic redundancy and noise of the signal. Standard masked modeling often yields trivial solutions while contrastive methods lack morphological precision. To address these limitations, we propose a Statistical‑prior Informed Generative Masking Architecture (SIGMA‑PPG), a generative foundation model featuring a Prior‑Guided Adversarial Masking mechanism, where a reinforcement learning‑driven teacher leverages statistical priors to create challenging learning paths that prevent overfitting to noise. We also incorporate a semantic consistency constraint via vector quantization to ensure that physiologically identical waveforms (even those altered by recording artifacts or minor perturbations) map to shared indices. This enhances codebook semantic density and eliminates redundant feature structures. Pre‑trained on over 120,000 hours of data, SIGMA‑PPG achieves superior average performance compared to five state‑of‑the‑art baselines across 12 diverse downstream tasks. The code is available at https://github.com/ZonghengGuo/SigmaPPG.
Authors:Xingwei Lin, Wenhao Lin, Sicong Cao, Jiahao Yu, Renke Huang, Lei Xue, Chunming Wu
Abstract:
Multi‑turn jailbreak attacks have emerged as a critical threat to Large Language Models (LLMs), bypassing safety mechanisms by progressively constructing adversarial contexts from scratch and incrementally refining prompts. However, existing methods suffer from the inefficiency of incremental context construction that requires step‑by‑step LLM interaction, and often stagnate in suboptimal regions due to surface‑level optimization. In this paper, we characterize the Intent‑Context Coupling phenomenon, revealing that LLM safety constraints are significantly relaxed when a malicious intent is coupled with a semantically congruent context pattern. Driven by this insight, we propose ICON, an automated multi‑turn jailbreak framework that efficiently constructs an authoritative‑style context via prior‑guided semantic routing. Specifically, ICON first routes the malicious intent to a congruent context pattern (e.g., Scientific Research) and instantiates it into an attack prompt sequence. This sequence progressively builds the authoritative‑style context and ultimately elicits prohibited content. In addition, ICON incorporates a Hierarchical Optimization Strategy that combines local prompt refinement with global context switching, preventing the attack from stagnating in ineffective contexts. Experimental results across eight SOTA LLMs demonstrate the effectiveness of ICON, achieving a state‑of‑the‑art average Attack Success Rate (ASR) of 97.1%. Code is available at https://github.com/xwlin‑roy/ICON.
Authors:Matteo Gianferrari, Omayma Moussadek, Riccardo Salami, Cosimo Fiorini, Lorenzo Tartarini, Daniela Gandolfi, Simone Calderara
Abstract:
Spiking Neural Networks (SNNs) are inherently suited for continuous learning due to their event‑driven temporal dynamics; however, their application to Class‑Incremental Learning (CIL) has been hindered by catastrophic forgetting and the temporal misalignment of spike patterns. In this work, we introduce Spiking Temporal Alignment with Experience Replay (STAER), a novel framework that explicitly preserves temporal structure to bridge the performance gap between SNNs and ANNs. Our approach integrates a differentiable Soft‑DTW alignment loss to maintain spike timing fidelity and employs a temporal expansion and contraction mechanism on output logits to enforce robust representation learning. Implemented on a deep ResNet19 spiking backbone, STAER achieves state‑of‑the‑art performance on Sequential‑MNIST and Sequential‑CIFAR10. Empirical results demonstrate that our method matches or outperforms strong ANN baselines (ER, DER++) while preserving biologically plausible dynamics. Ablation studies further confirm that explicit temporal alignment is critical for representational stability, positioning STAER as a scalable solution for spike‑native lifelong learning. Code is available at https://github.com/matteogianferrari/staer.
Authors:Jaehyuk Jang, Wonjun Lee, Kangwook Ko, Changick Kim
Abstract:
Prompt tuning has achieved remarkable progress in vision‑language models (VLMs) and is recently being adopted for audio‑language models (ALMs). However, its generalization ability in ALMs remains largely underexplored. We observe that conventional prompt tuning for ALMs also suffers from the Base‑New Tradeoff, and we identify that this issue stems from the disrupted semantic structure of the embedding space. To address this issue, we propose Semantically Expanded Prompt Tuning (SEPT)‑a plug‑and‑play framework that explicitly regularizes the prompt embedding space by incorporating semantic neighbors generated by large language models. SEPT introduces a novel semantic expansion loss with margin constraints that promote intra‑class compactness and inter‑class separability, thereby enhancing the semantic structure of the prompt embedding space. For comprehensive evaluation, we establish the first benchmark setup for prompt generalization in ALMs, covering both base‑to‑new generalization and cross‑dataset transferability. Extensive experiments demonstrate that SEPT consistently improves generalization performance across multiple prompt tuning baselines, while maintaining computational cost during inference. Codes are available in https://github.com/jhyukjang/SEPT.
Authors:Weixin Chen, Li Chen, Yuhan Zhao
Abstract:
Despite growing efforts to mitigate unfairness in recommender systems, existing fairness‑aware methods typically fix the fairness requirement at training time and provide limited post‑training flexibility. However, in real‑world scenarios, diverse stakeholders may demand differing fairness requirements over time, so retraining for different fairness requirements becomes prohibitive. To address this limitation, we propose Cofair, a single‑train framework that enables post‑training fairness control in recommendation. Specifically, Cofair introduces a shared representation layer with fairness‑conditioned adapter modules to produce user embeddings specialized for varied fairness levels, along with a user‑level regularization term that guarantees user‑wise monotonic fairness improvements across these levels. We theoretically establish that the adversarial objective of Cofair upper bounds demographic parity and the regularization term enforces progressive fairness at user level. Comprehensive experiments on multiple datasets and backbone models demonstrate that our framework provides dynamic fairness at different levels, delivering comparable or better fairness‑accuracy curves than state‑of‑the‑art baselines, without the need to retrain for each new fairness requirement. Our code is publicly available at https://github.com/weixinchen98/Cofair.
Authors:Saurav Prateek
Abstract:
This paper introduces a novel Deep Researcher architecture designed to generate detailed research reports on complex PhD level topics by addressing the inherent limitations of the Parallel Scaling paradigm. Our system utilizes two key innovations: Sequential Research Plan Refinement via Reflection and a Candidates Crossover algorithm. The sequential refinement process is demonstrated as an efficient method that allows the agent to maintain a centralized Global Research Context, enabling it to look back at current progress, reason about the research plan, and intelligently make changes at runtime. This dynamic adaptation contrasts with parallel approaches, which often suffer from siloed knowledge. The Candidates Crossover algorithm further enhances search efficiency by deploying multiple LLM candidates with varied parameters to explore a larger search space, with their findings synthesized to curate a comprehensive final research response. The process concludes with One Shot Report Generation, ensuring the final document is informed by a unified narrative and high fact density. Powered by the Gemini 2.5 Pro model, our Deep Researcher was evaluated on the DeepResearch Bench, a globally recognized benchmark of 100 doctoral level research tasks. Our architecture achieved an overall score of 46.21, demonstrating superior performance by surpassing leading deep research agents such as Claude Researcher, Nvidia AIQ Research Assistant, Perplexity Research, Kimi Researcher and Grok Deeper Search present on the DeepResearch Bench actively running leaderboard. This performance marginally exceeds our previous work, Static DRA, and reinforces the finding that sequential scaling consistently outperforms the parallel self consistency paradigm.
Authors:Guoan Wang, Feiyu Wang, Zongwei Lv, Yikun Zong, Tong Yang
Abstract:
As large language models (LLMs) continue to scale, deployment is increasingly bottlenecked by the memory wall, motivating a shift toward extremely low‑bit quantization. However, most quantization‑aware training (QAT) methods apply hard rounding and the straight‑through estimator (STE) from the beginning of the training, which prematurely discretizes the optimization landscape and induces persistent gradient mismatch between latent weights and quantized weights, hindering effective optimization of quantized models. To address this, we propose Hestia, a Hessian‑guided differentiable QAT framework for extremely low‑bit LLMs, which replaces the rigid step function with a temperature‑controlled softmax relaxation to maintain gradient flow early in training while progressively hardening quantization. Furthermore, Hestia leverages a tensor‑wise Hessian trace metric as a lightweight curvature signal to drive fine‑grained temperature annealing, enabling sensitivity‑aware discretization across the model. Evaluations on Llama‑3.2 show that Hestia consistently outperforms existing ternary QAT baselines, yielding average zero‑shot improvements of 5.39% and 4.34% for the 1B and 3B models. These results indicate that Hessian‑guided relaxation effectively recovers representational capacity, establishing a more robust training path for 1.58‑bit LLMs. The code is available at https://github.com/hestia2026/Hestia.
Authors:Yanqi Dai, Yuxiang Ji, Xiao Zhang, Yong Wang, Xiangxiang Chu, Zhiwu Lu
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) offers a robust mechanism for enhancing mathematical reasoning in large models. However, we identify a systematic lack of emphasis on more challenging questions in existing methods from both algorithmic and data perspectives, despite their importance for refining underdeveloped capabilities. Algorithmically, widely used Group Relative Policy Optimization (GRPO) suffers from an implicit imbalance where the magnitude of policy updates is lower for harder questions. Data‑wise, augmentation approaches primarily rephrase questions to enhance diversity without systematically increasing intrinsic difficulty. To address these issues, we propose a two‑dual MathForge framework to improve mathematical reasoning by targeting harder questions from both perspectives, which comprises a Difficulty‑Aware Group Policy Optimization (DGPO) algorithm and a Multi‑Aspect Question Reformulation (MQR) strategy. Specifically, DGPO first rectifies the implicit imbalance in GRPO via difficulty‑balanced group advantage estimation, and further prioritizes harder questions by difficulty‑aware question‑level weighting. Meanwhile, MQR reformulates questions across multiple aspects to increase difficulty while maintaining the original gold answer. Overall, MathForge forms a synergistic loop: MQR expands the data frontier, and DGPO effectively learns from the augmented data. Extensive experiments show that MathForge significantly outperforms existing methods on various mathematical reasoning tasks. The code and augmented data are all available at https://github.com/AMAP‑ML/MathForge.
Authors:Gray Cox
Abstract:
This paper introduces a methodological framework for empirically testing AI alignment strategies through structured multi‑model dialogue. Drawing on Peace Studies traditions ‑ particularly interest‑based negotiation, conflict transformation, and commons governance ‑ we operationalize Viral Collaborative Wisdom (VCW), an approach that reframes alignment from a control problem to a relationship problem developed through dialogical reasoning. Our experimental design assigns four distinct roles (Proposer, Responder, Monitor, Translator) to different AI systems across six conditions, testing whether current large language models can engage substantively with complex alignment frameworks. Using Claude, Gemini, and GPT‑4o, we conducted 72 dialogue turns totaling 576,822 characters of structured exchange. Results demonstrate that AI systems can engage meaningfully with Peace Studies concepts, surface complementary objections from different architectural perspectives, and generate emergent insights not present in initial framings ‑ including the novel synthesis of "VCW as transitional framework." Cross‑architecture patterns reveal that different models foreground different concerns: Claude emphasized verification challenges, Gemini focused on bias and scalability, and GPT‑4o highlighted implementation barriers. The framework provides researchers with replicable methods for stress‑testing alignment proposals before implementation, while the findings offer preliminary evidence about AI capacity for the kind of dialogical reasoning VCW proposes. We discuss limitations, including the observation that dialogues engaged more with process elements than with foundational claims about AI nature, and outline directions for future research including human‑AI hybrid protocols and extended dialogue studies.
Authors:Lakshman Balasubramanian
Abstract:
Person Re‑Identification (ReID) remains a challenging problem in computer vision. This work reviews various training paradigm and evaluates the robustness of state‑of‑the‑art ReID models in cross‑domain applications and examines the role of foundation models in improving generalization through richer, more transferable visual representations. We compare three training paradigms, supervised, self‑supervised, and language‑aligned models. Through the study the aim is to answer the following questions: Can supervised models generalize in cross‑domain scenarios? How does foundation models like SigLIP2 perform for the ReID tasks? What are the weaknesses of current supervised and foundational models for ReID? We have conducted the analysis across 11 models and 9 datasets. Our results show a clear split: supervised models dominate their training domain but crumble on cross‑domain data. Language‑aligned models, however, show surprising robustness cross‑domain for ReID tasks, even though they are not explicitly trained to do so. Code and data available at: https://github.com/moiiai‑tech/object‑reid‑benchmark.
Authors:Ariel Maymon, Yanir Buznah, Uri Shaham
Abstract:
Unsupervised ensemble learning emerged to address the challenge of combining multiple learners' predictions without access to ground truth labels or additional data. This paradigm is crucial in scenarios where evaluating individual classifier performance or understanding their strengths is challenging due to limited information. We propose a novel deep energy‑based method for constructing an accurate meta‑learner using only the predictions of individual learners, potentially capable of capturing complex dependence structures between them. Our approach requires no labeled data, learner features, or problem‑specific information, and has theoretical guarantees for when learners are conditionally independent. We demonstrate superior performance across diverse ensemble scenarios, including challenging mixture of experts settings. Our experiments span standard ensemble datasets and curated datasets designed to test how the model fuses expertise from multiple sources. These results highlight the potential of unsupervised ensemble learning to harness collective intelligence, especially in data‑scarce or privacy‑sensitive environments.
Authors:Zhenxuan Fan, Jie Cao, Yang Dai, Zheqi Lv, Wenqiao Zhang, Zhongle Xie, Peng LU, Beng Chin Ooi
Abstract:
Chain‑of‑thought (CoT) prompting improves LLM reasoning but incurs high latency and memory cost due to verbose traces, motivating CoT compression with preserved correctness. Existing methods either shorten CoTs at the semantic level, which is often conservative, or prune tokens aggressively, which can miss task‑critical cues and degrade accuracy. Moreover, combining the two is non‑trivial due to sequential dependency, task‑agnostic pruning, and distribution mismatch. We propose CtrlCoT, a dual‑granularity CoT compression framework that harmonizes semantic abstraction and token‑level pruning through three components: Hierarchical Reasoning Abstraction produces CoTs at multiple semantic granularities; Logic‑Preserving Distillation trains a logic‑aware pruner to retain indispensable reasoning cues (e.g., numbers and operators) across pruning ratios; and Distribution‑Alignment Generation aligns compressed traces with fluent inference‑time reasoning styles to avoid fragmentation. On MATH‑500 with Qwen2.5‑7B‑Instruct, CtrlCoT uses 30.7% fewer tokens while achieving 7.6 percentage points higher than the strongest baseline, demonstrating more efficient and reliable reasoning. Our code will be publicly available at https://github.com/fanzhenxuan/Ctrl‑CoT.
Authors:Mariia Drozdova
Abstract:
Can standard continuous‑time generative models represent distributions whose support is an extremely sparse, globally constrained discrete set? We study this question using completed Sudoku grids as a controlled testbed, treating them as a subset of a continuous relaxation space. We train flow‑matching and score‑based models along a Gaussian probability path and compare deterministic (ODE) sampling, stochastic (SDE) sampling, and DDPM‑style discretizations derived from the same continuous‑time training. Unconditionally, stochastic sampling substantially outperforms deterministic flows; score‑based samplers are the most reliable among continuous‑time methods, and DDPM‑style ancestral sampling achieves the highest validity overall. We further show that the same models can be repurposed for guided generation: by repeatedly sampling completions under clamped clues and stopping when constraints are satisfied, the model acts as a probabilistic Sudoku solver. Although far less sample‑efficient than classical solvers and discrete‑geometry‑aware diffusion methods, these experiments demonstrate that classic diffusion/flow formulations can assign non‑zero probability mass to globally constrained combinatorial structures and can be used for constraint satisfaction via stochastic search.
Authors:Brian Y. Tsui, Alan Y. Fang, Tiffany J. Hwu
Abstract:
Robotic manipulation has increasingly adopted vision‑language‑action (VLA) models, which achieve strong performance but typically require task‑specific demonstrations and fine‑tuning, and often generalize poorly under domain shift. We investigate whether general‑purpose large language model (LLM) agent frameworks, originally developed for software engineering, can serve as an alternative control paradigm for embodied manipulation. We introduce FAEA (Frontier Agent as Embodied Agent), which applies an LLM agent framework directly to embodied manipulation without modification. Using the same iterative reasoning that enables software agents to debug code, FAEA enables embodied agents to reason through manipulation strategies. We evaluate an unmodified frontier agent, Claude Agent SDK, across the LIBERO, ManiSkill3, and MetaWorld benchmarks. With privileged environment state access, FAEA achieves success rates of 84.9%, 85.7%, and 96%, respectively. This level of task success approaches that of VLA models trained with less than 100 demonstrations per task, without requiring demonstrations or fine‑tuning. With one round of human feedback as an optional optimization, performance increases to 88.2% on LIBERO. This demonstration‑free capability has immediate practical value: FAEA can autonomously explore novel scenarios in simulation and generate successful trajectories for training data augmentation in embodied learning. Our results indicate that general‑purpose agents are sufficient for a class of manipulation tasks dominated by deliberative, task‑level planning. This opens a path for robotics systems to leverage actively maintained agent infrastructure and benefit directly from ongoing advances in frontier models. Code is available at https://github.com/robiemusketeer/faea‑sim
Authors:Zeyu Xing, Xing Li, Hui-Ling Zhen, Mingxuan Yuan, Sinno Jialin Pan
Abstract:
KV caches, typically used only to speed up autoregressive decoding, encode contextual information that can be reused for downstream tasks at no extra cost. We propose treating the KV cache as a lightweight representation, eliminating the need to recompute or store full hidden states. Despite being weaker than dedicated embeddings, KV‑derived representations are shown to be sufficient for two key applications: (i) Chain‑of‑Embedding, where they achieve competitive or superior performance on Llama‑3.1‑8B‑Instruct and Qwen2‑7B‑Instruct; and (ii) Fast/Slow Thinking Switching, where they enable adaptive reasoning on Qwen3‑8B and DeepSeek‑R1‑Distil‑Qwen‑14B, reducing token generation by up to 5.7× with minimal accuracy loss. Our findings establish KV caches as a free, effective substrate for sampling and reasoning, opening new directions for representation reuse in LLM inference. Code: https://github.com/cmd2001/ICLR2026_KV‑Embedding.
Authors:Jim Maar, Denis Paperno, Callum Stuart McDougall, Neel Nanda
Abstract:
Prior work suggests that language models, while trained on next token prediction, show implicit planning behavior: they may select the next token in preparation to a predicted future token, such as a likely rhyming word, as supported by a prior qualitative study of Claude 3.5 Haiku using a cross‑layer transcoder. We propose much simpler techniques for assessing implicit planning in language models. With case studies on rhyme poetry generation and question answering, we demonstrate that our methodology easily scales to many models. Across models, we find that the generated rhyme (e.g. "‑ight") or answer to a question ("whale") can be manipulated by steering at the end of the preceding line with a vector, affecting the generation of intermediate tokens leading up to the rhyme or answer word. We show that implicit planning is a universal mechanism, present in smaller models than previously thought, starting from 1B parameters. Our methodology offers a widely applicable direct way to study implicit planning abilities of LLMs. More broadly, understanding planning abilities of language models can inform decisions in AI safety and control.
Authors:Abha Jha, Akanksha Mahajan, Ashwath Vaithinathan Aravindan, Praveen Saravanan, Sai Sailaja Policharla, Sonal Chaturbhuj Gehlot
Abstract:
Large Language Models (LLMs) often produce hallucinated or unverifiable content, undermining their reliability in factual domains. This work investigates Reinforcement Learning with Verifiable Rewards (RLVR) as a training paradigm that explicitly rewards abstention ("I don't know") alongside correctness to promote intellectual humility. We fine‑tune and evaluate Granite‑3.3‑2B‑Instruct and Qwen‑3‑4B‑Instruct on the MedMCQA and Hendrycks Math benchmarks using a ternary reward structure (‑1, r_abs, 1) under varying abstention reward structures. We further study the effect of combining RLVR with supervised fine‑tuning strategies that teach abstention prior to reinforcement learning. Our results show that moderate abstention rewards (r_abs \approx ‑0.25 to 0.3) consistently reduce incorrect responses without severe accuracy degradation on multiple‑choice tasks, with larger models exhibiting greater robustness to abstention incentives. On open‑ended question answering, we observe limitations due to insufficient exploration, which can be partially mitigated through supervised abstention training. Overall, these findings demonstrate the feasibility and flexibility of verifiable reward design as a practical approach for hallucination mitigation in language models. Reproducible code for our abstention training framework is available here https://github.com/Mystic‑Slice/rl‑abstention.
Authors:Atik Faysal, Mohammad Rostami, Reihaneh Gh. Roshan, Nikhil Muralidhar, Huaxia Wang
Abstract:
We address the challenge of training Vision Transformers (ViTs) when labeled data is scarce but unlabeled data is abundant. We propose Semi‑Supervised Masked Autoencoder (SSMAE), a framework that jointly optimizes masked image reconstruction and classification using both unlabeled and labeled samples with dynamically selected pseudo‑labels. SSMAE introduces a validation‑driven gating mechanism that activates pseudo‑labeling only after the model achieves reliable, high‑confidence predictions that are consistent across both weakly and strongly augmented views of the same image, reducing confirmation bias. On CIFAR‑10 and CIFAR‑100, SSMAE consistently outperforms supervised ViT and fine‑tuned MAE, with the largest gains in low‑label regimes (+9.24% over ViT on CIFAR‑10 with 10% labels). Our results demonstrate that when pseudo‑labels are introduced is as important as how they are generated for data‑efficient transformer training. Codes are available at https://github.com/atik666/ssmae.
Authors:Fang Li
Abstract:
Despite the ubiquity of tabular data in high‑stakes domains, traditional deep learning architectures often struggle to match the performance of gradient‑boosted decision trees while maintaining scientific interpretability. Standard neural networks typically treat features as independent entities, failing to exploit the inherent manifold structural dependencies that define tabular distributions. We propose Structural Compositional Function Networks (StructuralCFN), a novel architecture that imposes a Relation‑Aware Inductive Bias via a differentiable structural prior. StructuralCFN explicitly models each feature as a mathematical composition of its counterparts through Differentiable Adaptive Gating, which automatically discovers the optimal activation physics (e.g., attention‑style filtering vs. inhibitory polarity) for each relationship. Our framework enables Structured Knowledge Integration, allowing domain‑specific relational priors to be injected directly into the architecture to guide discovery. We evaluate StructuralCFN across a rigorous 10‑fold cross‑validation suite on 18 benchmarks, demonstrating statistically significant improvements (p < 0.05) on scientific and clinical datasets (e.g., Blood Transfusion, Ozone, WDBC). Furthermore, StructuralCFN provides Intrinsic Symbolic Interpretability: it recovers the governing "laws" of the data manifold as human‑readable mathematical expressions while maintaining a compact parameter footprint (300‑‑2,500 parameters) that is over an order of magnitude (10x‑‑20x) smaller than standard deep baselines.
Authors:Yitian Chen, Cheng Cheng, Yinan Sun, Zi Ling, Dongdong Ge
Abstract:
Large Language Models (LLMs) have demonstrated impressive progress in optimization modeling, fostering a rapid expansion of new methodologies and evaluation benchmarks. However, the boundaries of their capabilities in automated formulation and problem solving remain poorly understood, particularly when extending to complex, real‑world tasks. To bridge this gap, we propose OPT‑ENGINE, an extensible benchmark framework designed to evaluate LLMs on optimization modeling with controllable and scalable difficulty levels. OPT‑ENGINE spans 10 canonical tasks across operations research, with five Linear Programming and five Mixed‑Integer Programming. Utilizing OPT‑ENGINE, we conduct an extensive study of LLMs' reasoning capabilities, addressing two critical questions: 1.) Do LLMs' performance remain robust when generalizing to out‑of‑distribution optimization tasks that scale in complexity beyond current benchmark levels? and 2.) At what stage, from problem interpretation to solution generation, do current LLMs encounter the most significant bottlenecks? Our empirical results yield two key insights: first, tool‑integrated reasoning with external solvers exhibits significantly higher robustness as task complexity escalates, while pure‑text reasoning reaches a ceiling; second, the automated formulation of constraints constitutes the primary performance bottleneck. These findings provide actionable guidance for developing next‑generation LLMs for advanced optimization. Our code is publicly available at \textcolorbluehttps://github.com/Cardinal‑Operations/OPTEngine.
Authors:Jialong Wu, Xiaoying Zhang, Hongyi Yuan, Xiangcheng Zhang, Tianhao Huang, Changjing He, Chaoyi Deng, Renrui Zhang, Youbin Wu, Mingsheng Long
Abstract:
Humans construct internal world models and reason by manipulating the concepts within these models. Recent advances in AI, particularly chain‑of‑thought (CoT) reasoning, approximate such human cognitive abilities, where world models are believed to be embedded within large language models. Expert‑level performance in formal and abstract domains such as mathematics and programming has been achieved in current systems by relying predominantly on verbal reasoning. However, they still lag far behind humans in domains like physical and spatial intelligence, which require richer representations and prior knowledge. The emergence of unified multimodal models (UMMs) capable of both verbal and visual generation has therefore sparked interest in more human‑like reasoning grounded in complementary multimodal pathways, though their benefits remain unclear. From a world‑model perspective, this paper presents the first principled study of when and how visual generation benefits reasoning. Our key position is the visual superiority hypothesis: for certain tasks‑‑particularly those grounded in the physical world‑‑visual generation more naturally serves as world models, whereas purely verbal world models encounter bottlenecks arising from representational limitations or insufficient prior knowledge. Theoretically, we formalize internal world modeling as a core component of CoT reasoning and analyze distinctions among different forms of world models. Empirically, we identify tasks that necessitate interleaved visual‑verbal CoT reasoning, constructing a new evaluation suite, VisWorld‑Eval. Controlled experiments on a state‑of‑the‑art UMM show that interleaved CoT significantly outperforms purely verbal CoT on tasks that favor visual world modeling, but offers no clear advantage otherwise. Together, this work clarifies the potential of multimodal world modeling for more powerful, human‑like multimodal AI.
Authors:Zhihua Fang, Liang He
Abstract:
Speaker embedding learning based on Euclidean space has achieved significant progress, but it is still insufficient in modeling hierarchical information within speaker features. Hyperbolic space, with its negative curvature geometric properties, can efficiently represent hierarchical information within a finite volume, making it more suitable for the feature distribution of speaker embeddings. In this paper, we propose Hyperbolic Softmax (H‑Softmax) and Hyperbolic Additive Margin Softmax (HAM‑Softmax) based on hyperbolic space. H‑Softmax incorporates hierarchical information into speaker embeddings by projecting embeddings and speaker centers into hyperbolic space and computing hyperbolic distances. HAM‑Softmax further enhances inter‑class separability by introducing margin constraint on this basis. Experimental results show that H‑Softmax and HAM‑Softmax achieve average relative EER reductions of 27.84% and 14.23% compared with standard Softmax and AM‑Softmax, respectively, demonstrating that the proposed methods effectively improve speaker verification performance and at the same time preserve the capability of hierarchical structure modeling. The code will be released at https://github.com/PunkMale/HAM‑Softmax.
Authors:Helin Wang, Bowen Shi, Andros Tjandra, John Hoffman, Yi-Chiao Wu, Apoorv Vyas, Najim Dehak, Ann Lee, Wei-Ning Hsu
Abstract:
The performance evaluation remains a complex challenge in audio separation, and existing evaluation metrics are often misaligned with human perception, course‑grained, relying on ground truth signals. On the other hand, subjective listening tests remain the gold standard for real‑world evaluation, but they are expensive, time‑consuming, and difficult to scale. This paper addresses the growing need for automated systems capable of evaluating audio separation without human intervention. The proposed evaluation metric, SAM Audio Judge (SAJ), is a multimodal fine‑grained reference‑free objective metric, which shows highly alignment with human perceptions. SAJ supports three audio domains (speech, music and general sound events) and three prompt inputs (text, visual and span), covering four different dimensions of evaluation (recall, percision, faithfulness, and overall). SAM Audio Judge also shows potential applications in data filtering, pseudo‑labeling large datasets and reranking in audio separation models. We release our code and pre‑trained models at: https://github.com/facebookresearch/sam‑audio.
Authors:Thomas Bömer, Nico Koltermann, Max Disselnmeyer, Bastian Amberg, Anne Meyer
Abstract:
Heuristic functions are essential to the performance of tree search algorithms such as A, where their accuracy and efficiency directly impact search outcomes. Traditionally, such heuristics are handcrafted, requiring significant expertise. Recent advances in large language models (LLMs) and evolutionary frameworks have opened the door to automating heuristic design. In this paper, we extend the Evolution of Heuristics (EoH) framework to investigate the automated generation of guiding heuristics for A search. We introduce a novel domain‑agnostic prompt augmentation strategy that includes the A code into the prompt to leverage in‑context learning, named Algorithmic ‑ Contextual EoH (A‑CEoH). To evaluate the effectiveness of A‑CeoH, we study two problem domains: the Unit‑Load Pre‑Marshalling Problem (UPMP), a niche problem from warehouse logistics, and the classical sliding puzzle problem (SPP). Our computational experiments show that A‑CEoH can significantly improve the quality of the generated heuristics and even outperform expert‑designed heuristics.
Authors:Yongqi Wang, Xiaofeng Ji, Jie Wang, Qingbin Li, Xiao Xiong, Zheming Yang, Jian Xu, Minghui Qiu, Xinxiao Wu
Abstract:
Adapting Large Language Models (LLMs) to specialized domains without human‑annotated data is a crucial yet formidable challenge. Widely adopted knowledge distillation methods often devolve into coarse‑grained mimicry, where the student model inefficiently targets its own weaknesses and risks inheriting the teacher's reasoning flaws. This exposes a critical pedagogical dilemma: how to devise a reliable curriculum when the teacher itself is not an infallible expert. Our work resolves this by capitalizing on a key insight: while LLMs may exhibit fallibility in complex, holistic reasoning, they often exhibit high fidelity on focused, atomic sub‑problems. Based on this, we propose Divergence‑Guided Reasoning Curriculum (DGRC), which constructs a learning path from atomic knowledge to reasoning chains by dynamically deriving two complementary curricula from disagreements in reasoning pathways. When a student and teacher produce conflicting results, DGRC directs the teacher to perform a diagnostic analysis: it analyzes both reasoning paths to formulate atomic queries that target the specific points of divergence, and then self‑answers these queries to create high‑confidence atomic question‑answer pairs. These pairs then serve a dual purpose: (1) providing an atomic curriculum to rectify the student's knowledge gaps, and (2) serving as factual criteria to filter the teacher's original reasoning chains, yielding a verified CoT curriculum that teaches the student how to integrate atomic knowledge into complete reasoning paths. Experiments across the medical and legal domains on student models of various sizes demonstrate the effectiveness of our DGRC framework. Notably, our method achieves a 7.76% relative improvement for the 1.5B student model in the medical domain over strong unlabeled baseline.
Authors:Lei Zhang, Yongda Yu, Minghui Yu, Xinxin Guo, Zhengqi Zhuang, Guoping Rong, Dong Shao, Haifeng Shen, Hongyu Kuang, Zhengfeng Li, Boge Wang, Guoan Zhang, Bangyu Xiang, Xiaobin Xu
Abstract:
High‑quality evaluation benchmarks are pivotal for deploying Large Language Models (LLMs) in Automated Code Review (ACR). However, existing benchmarks suffer from two critical limitations: first, the lack of multi‑language support in repository‑level contexts, which restricts the generalizability of evaluation results; second, the reliance on noisy, incomplete ground truth derived from raw Pull Request (PR) comments, which constrains the scope of issue detection. To address these challenges, we introduce AACR‑Bench a comprehensive benchmark that provides full cross‑file context across multiple programming languages. Unlike traditional datasets, AACR‑Bench employs an "AI‑assisted, Expert‑verified" annotation pipeline to uncover latent defects often overlooked in original PRs, resulting in a 285% increase in defect coverage. Extensive evaluations of mainstream LLMs on AACR‑Bench reveal that previous assessments may have either misjudged or only partially captured model capabilities due to data limitations. Our work establishes a more rigorous standard for ACR evaluation and offers new insights on LLM based ACR, i.e., the granularity/level of context and the choice of retrieval methods significantly impact ACR performance, and this influence varies depending on the LLM, programming language, and the LLM usage paradigm e.g., whether an Agent architecture is employed. The code, data, and other artifacts of our evaluation set are available at https://github.com/alibaba/aacr‑bench .
Authors:Haonan Zhang, Dongxia Wang, Yi Liu, Kexin Chen, Wenhai Wang
Abstract:
Safety‑aligned LLMs suffer from two failure modes: jailbreak (answering harmful inputs) and over‑refusal (declining benign queries). Existing vector steering methods adjust the magnitude of answer vectors, but this creates a fundamental trade‑off ‑‑ reducing jailbreak increases over‑refusal and vice versa. We identify the root cause: LLMs encode the decision to answer (answer vector v_a) and the judgment of input safety (benign vector v_b) as nearly orthogonal directions, treating them as independent processes. We propose LLM‑VA, which aligns v_a with v_b through closed‑form weight updates, making the model's willingness to answer causally dependent on its safety assessment ‑‑ without fine‑tuning or architectural changes. Our method identifies vectors at each layer using SVMs, selects safety‑relevant layers, and iteratively aligns vectors via minimum‑norm weight modifications. Experiments on 12 LLMs demonstrate that LLM‑VA achieves 11.45% higher F1 than the best baseline while preserving 95.92% utility, and automatically adapts to each model's safety bias without manual tuning. Code and models are available at https://hotbento.github.io/LLM‑VA‑Web/.
Authors:Kaipeng Fang, Weiqing Liang, Yuyang Li, Ji Zhang, Pengpeng Zeng, Lianli Gao, Jingkuan Song, Heng Tao Shen
Abstract:
Synthetic simulation data and real‑world human data provide scalable alternatives to circumvent the prohibitive costs of robot data collection. However, these sources suffer from the sim‑to‑real visual gap and the human‑to‑robot embodiment gap, respectively, which limits the policy's generalization to real‑world scenarios. In this work, we identify a natural yet underexplored complementarity between these sources: simulation offers the robot action that human data lacks, while human data provides the real‑world observation that simulation struggles to render. Motivated by this insight, we present SimHum, a co‑training framework to simultaneously extract kinematic prior from simulated robot actions and visual prior from real‑world human observations. Based on the two complementary priors, we achieve data‑efficient and generalizable robotic manipulation in real‑world tasks. Empirically, SimHum outperforms the baseline by up to \mathbf40% under the same data collection budget, and achieves a \mathbf62.5% OOD success with only 80 real data, outperforming the real only baseline by 7.1×. Videos and additional information can be found at \hrefhttps://kaipengfang.github.io/sim‑and‑humanproject website.
Authors:Hongzhu Yi, Xinming Wang, Zhenghao zhang, Tianyu Zong, Yuanxiang Wang, Jun Xie, Tao Yu, Haopeng Jin, Zhepeng Wang, Kaixin Xu, Feng Chen, Jiahuan Chen, Yujia Yang, Zhenyu Guan, Bingkang Shi, Jungang Xu
Abstract:
Within the domain of large language models, reinforcement fine‑tuning algorithms necessitate the generation of a complete reasoning trajectory beginning from the input query, which incurs significant computational overhead during the rollout phase of training. To address this issue, we analyze the impact of different segments of the reasoning path on the correctness of the final result and, based on these insights, propose Reinforcement Fine‑Tuning with Partial Reasoning Optimization (RPO), a plug‑and‑play reinforcement fine‑tuning algorithm. Unlike traditional reinforcement fine‑tuning algorithms that generate full reasoning paths, RPO trains the model by generating suffixes of the reasoning path using experience cache. During the rollout phase of training, RPO reduces token generation in this phase by approximately 95%, greatly lowering the theoretical time overhead. Compared with full‑path reinforcement fine‑tuning algorithms, RPO reduces the training time of the 1.5B model by 90% and the 7B model by 72%. At the same time, it can be integrated with typical algorithms such as GRPO and DAPO, enabling them to achieve training acceleration while maintaining performance comparable to the original algorithms. Our code is open‑sourced at https://github.com/yhz5613813/RPO.
Authors:Viacheslav Sydora, Guner Dilsad Er, Michael Muehlebach
Abstract:
This paper presents the web‑based platform Machine Learning with Bricks and an accompanying two‑day course designed to teach machine learning concepts to students aged 12 to 17 through programming‑free robotics activities. Machine Learning with Bricks is an open source platform and combines interactive visualizations with LEGO robotics to teach three core algorithms: KNN, linear regression, and Q‑learning. Students learn by collecting data, training models, and interacting with robots via a web‑based interface. Pre‑ and post‑surveys with 14 students demonstrate significant improvements in conceptual understanding of machine learning algorithms, positive shifts in AI perception, high platform usability, and increased motivation for continued learning. This work demonstrates that tangible, visualization‑based approaches can make machine learning concepts accessible and engaging for young learners while maintaining technical depth. The platform is freely available at https://learning‑and‑dynamics.github.io/ml‑with‑bricks/, with video tutorials guiding students through the experiments at https://youtube.com/playlist?list=PLx1grFu4zAcwfKKJZ1Ux4LwRqaePCOA2J.
Authors:Quy-Anh Dang, Chris Ngo
Abstract:
Despite significant progress in alignment, large language models (LLMs) remain vulnerable to adversarial attacks that elicit harmful behaviors. Activation steering techniques offer a promising inference‑time intervention approach, but existing methods suffer from critical limitations: activation addition requires careful coefficient tuning and is sensitive to layer‑specific norm variations, while directional ablation provides only binary control. Recent work on Angular Steering introduces continuous control via rotation in a 2D subspace, but its practical implementation violates norm preservation, causing distribution shift and generation collapse, particularly in models below 7B parameters. We propose Selective Steering, which addresses these limitations through two key innovations: (1) a mathematically rigorous norm‑preserving rotation formulation that maintains activation distribution integrity, and (2) discriminative layer selection that applies steering only where feature representations exhibit opposite‑signed class alignment. Experiments across nine models demonstrate that Selective Steering achieves 5.5x higher attack success rates than prior methods while maintaining zero perplexity violations and approximately 100% capability retention on standard benchmarks. Our approach provides a principled, efficient framework for controllable and stable LLM behavior modification. Code: https://github.com/knoveleng/steering
Authors:Xinyi Wan, Penghui Qi, Guangxing Huang, Chaoyi Ruan, Min Lin, Jialin Li
Abstract:
Modern data parallel (DP) training favors collective communication over parameter servers (PS) for its simplicity and efficiency under balanced workloads. However, the balanced workload assumption no longer holds in large language model (LLM) post‑training due to the high variance in sequence lengths. Under imbalanced workloads, collective communication creates synchronization barriers, leading to under‑utilization of devices with smaller workloads. This change in training dynamics calls for a revisit of the PS paradigm for its robustness to such imbalance. We propose On‑Demand Communication (ODC), which adapts PS into Fully Sharded Data Parallel (FSDP) by replacing collective all‑gather and reduce‑scatter with direct point‑to‑point communication. Compared to FSDP, ODC reduces the synchronization barrier from once per layer to once per minibatch and decouples the workload on each device so that faster workers are not stalled. It also enables simpler and more effective load balancing at the minibatch level. Across diverse LLM post‑training tasks, ODC consistently improves device utilization and training throughput, achieving up to a 36% speedup over standard FSDP. These results demonstrate that ODC is a superior fit for the prevalent imbalanced workloads in LLM post‑training. Our implementation of ODC and integration with FSDP is open‑sourced at https://github.com/sail‑sg/odc.
Authors:Zhao-Han Peng, Shaohui Li, Zhi Li, Shulan Ruan, Yu Liu, You He
Abstract:
While model‑based reinforcement learning (MBRL) improves sample efficiency by learning world models from raw observations, existing methods struggle to generalize across structurally similar scenes and remain vulnerable to spurious variations such as textures or color shifts. From a cognitive science perspective, humans segment continuous sensory streams into discrete events and rely on these key events for decision‑making. Motivated by this principle, we propose the Event‑Aware World Model (EAWM), a general framework that learns event‑aware representations to streamline policy learning without requiring handcrafted labels. EAWM employs an automated event generator to derive events from raw observations and introduces a Generic Event Segmentor (GES) to identify event boundaries, which mark the start and end time of event segments. Through event prediction, the representation space is shaped to capture meaningful spatio‑temporal transitions. Beyond this, we present a unified formulation of seemingly distinct world model architectures and show the broad applicability of our methods. Experiments on Atari 100K, Craftax 1M, and DeepMind Control 500K, DMC‑GB2 500K demonstrate that EAWM consistently boosts the performance of strong MBRL baselines by 10%‑45%, setting new state‑of‑the‑art results across benchmarks. Our code is released at https://github.com/MarquisDarwin/EAWM.
Authors:Tianyi Chen, Sihan Chen, Xiaoyi Qu, Dan Zhao, Ruomei Yan, Jongwoo Ko, Luming Liang, Pashmina Cameron
Abstract:
Quantization‑aware training (QAT) is essential for deploying large models under strict memory and latency constraints, yet achieving stable and robust optimization at ultra‑low bitwidths remains challenging. Common approaches based on the straight‑through estimator (STE) or soft quantizers often suffer from gradient mismatch, instability, or high computational overhead. As such, we propose StableQAT, a unified and efficient QAT framework that stabilizes training in ultra low‑bit settings via a novel, lightweight, and theoretically grounded surrogate for backpropagation derived from a discrete Fourier analysis of the rounding operator. StableQAT strictly generalizes STE as the latter arises as a special case of our more expressive surrogate family, yielding smooth, bounded, and inexpensive gradients that improve QAT training performance and stability across various hyperparameter choices. In experiments, StableQAT exhibits stable and efficient QAT at 2‑4 bit regimes, demonstrating improved training stability, robustness, and superior performance with negligible training overhead against standard QAT techniques. Our code is available at https://github.com/microsoft/StableQAT.
Authors:Sijia Li, Xiaoyu Tan, Shahir Ali, Niels Schmidt, Gengchen Ma, Xihe Qiu
Abstract:
Mobile agents have made progress toward reliable smartphone automation, yet performance in complex applications remains limited by incomplete knowledge and weak generalization to unseen environments. We introduce a curiosity driven knowledge retrieval framework that formalizes uncertainty during execution as a curiosity score. When this score exceeds a threshold, the system retrieves external information from documentation, code repositories, and historical trajectories. Retrieved content is organized into structured AppCards, which encode functional semantics, parameter conventions, interface mappings, and interaction patterns. During execution, an enhanced agent selectively integrates relevant AppCards into its reasoning process, thereby compensating for knowledge blind spots and improving planning reliability. Evaluation on the AndroidWorld benchmark shows consistent improvements across backbones, with an average gain of six percentage points and a new state of the art success rate of 88.8% when combined with GPT‑5. Analysis indicates that AppCards are particularly effective for multi step and cross application tasks, while improvements depend on the backbone model. Case studies further confirm that AppCards reduce ambiguity, shorten exploration, and support stable execution trajectories. Task trajectories are publicly available at https://lisalsj.github.io/Droidrun‑appcard/.
Authors:Shengjia Zhang, Weiqin Yang, Jiawei Chen, Peng Wu, Yuegang Sun, Gang Wang, Qihao Shi, Can Wang
Abstract:
Recommender systems (RS) aim to retrieve a small set of items that best match individual user preferences. Naturally, RS place primary emphasis on the quality of the Top‑K results rather than performance across the entire item set. However, estimating Top‑K accuracy (e.g., Precision@K, Recall@K) requires determining the ranking positions of items, which imposes substantial computational overhead and poses significant challenges for optimization. In addition, RS often suffer from distribution shifts due to evolving user preferences or data biases, further complicating the task. To address these issues, we propose Talos, a loss function that is specifically designed to optimize the Talos recommendation accuracy. Talos leverages a quantile technique that replaces the complex ranking‑dependent operations into simpler comparisons between predicted scores and learned score thresholds. We further develop a sampling‑based regression algorithm for efficient and accurate threshold estimation, and introduce a constraint term to maintain optimization stability by preventing score inflation. Additionally, we incorporate a tailored surrogate function to address discontinuity and enhance robustness against distribution shifts. Comprehensive theoretical analyzes and empirical experiments are conducted to demonstrate the effectiveness, efficiency, convergence, and distributional robustness of Talos. The code is available at https://github.com/cynthia‑shengjia/WWW‑2026‑Talos.
Authors:Qi Si, Xuyang Liu, Penglei Wang, Xin Guo, Yuan Qi, Yuan Cheng
Abstract:
RNA inverse folding, designing sequences to form specific 3D structures, is critical for therapeutics, gene regulation, and synthetic biology. Current methods, focused on sequence recovery, struggle to address structural objectives like secondary structure consistency (SS), minimum free energy (MFE), and local distance difference test (LDDT), leading to suboptimal structural accuracy. To tackle this, we propose a reinforcement learning (RL) framework integrated with a latent diffusion model (LDM). Drawing inspiration from the success of diffusion models in RNA inverse folding, which adeptly model complex sequence‑structure interactions, we develop an LDM incorporating pre‑trained RNA‑FM embeddings from a large‑scale RNA model. These embeddings capture co‑evolutionary patterns, markedly improving sequence recovery accuracy. However, existing approaches, including diffusion‑based methods, cannot effectively handle non‑differentiable structural objectives. By contrast, RL excels in this task by using policy‑driven reward optimization to navigate complex, non‑gradient‑based objectives, offering a significant advantage over traditional methods. In summary, we propose the Step‑wise Optimization of Latent Diffusion Model (SOLD), a novel RL framework that optimizes single‑step noise without sampling the full diffusion trajectory, achieving efficient refinement of multiple structural objectives. Experimental results demonstrate SOLD surpasses its LDM baseline and state‑of‑the‑art methods across all metrics, establishing a robust framework for RNA inverse folding with profound implications for biotechnological and therapeutic applications.
Authors:Chaozheng Wen, Jingwen Tong, Zehong Lin, Chenghong Bian, Jun Zhang
Abstract:
The emerging applications of next‑generation wireless networks (e.g., immersive 3D communication, low‑altitude networks, and integrated sensing and communication) necessitate high‑fidelity environmental intelligence. 3D radio maps have emerged as a critical tool for this purpose, enabling spectrum‑aware planning and environment‑aware sensing by bridging the gap between physical environments and electromagnetic signal propagation. However, constructing accurate 3D radio maps requires fine‑grained 3D geometric information and a profound understanding of electromagnetic wave propagation. Existing approaches typically treat optical and wireless knowledge as distinct modalities, failing to exploit the fundamental physical principles governing both light and electromagnetic propagation. To bridge this gap, we propose URF‑GS, a unified radio‑optical radiation field representation framework for accurate and generalizable 3D radio map construction based on 3D Gaussian splatting (3D‑GS) and inverse rendering. By fusing visual and wireless sensing observations, URF‑GS recovers scene geometry and material properties while accurately predicting radio signal behavior at arbitrary transmitter‑receiver (Tx‑Rx) configurations. Experimental results demonstrate that URF‑GS achieves up to a 24.7% improvement in spatial spectrum prediction accuracy and a 10x increase in sample efficiency for 3D radio map construction compared with neural radiance field (NeRF)‑based methods. This work establishes a foundation for next‑generation wireless networks by integrating perception, interaction, and communication through holistic radiation field reconstruction.
Authors:Zhixi Cai, Fucai Ke, Kevin Leo, Sukai Huang, Maria Garcia de la Banda, Peter J. Stuckey, Hamid Rezatofighi
Abstract:
Recent vision‑language models have strong perceptual ability but their implicit reasoning is hard to explain and easily generates hallucinations on complex queries. Compositional methods improve interpretability, but most rely on a single agent or hand‑crafted pipeline and cannot decide when to collaborate across complementary agents or compete among overlapping ones. We introduce MATA (Multi‑Agent hierarchical Trainable Automaton), a multi‑agent system presented as a hierarchical finite‑state automaton for visual reasoning whose top‑level transitions are chosen by a trainable hyper agent. Each agent corresponds to a state in the hyper automaton, and runs a small rule‑based sub‑automaton for reliable micro‑control. All agents read and write a shared memory, yielding transparent execution history. To supervise the hyper agent's transition policy, we build transition‑trajectory trees and transform to memory‑to‑next‑state pairs, forming the MATA‑SFT‑90K dataset for supervised finetuning (SFT). The finetuned LLM as the transition policy understands the query and the capacity of agents, and it can efficiently choose the optimal agent to solve the task. Across multiple visual reasoning benchmarks, MATA achieves the state‑of‑the‑art results compared with monolithic and compositional baselines. The code and dataset are available at https://github.com/ControlNet/MATA.
Authors:Patara Trirat, Jin Myung Kwak, Jay Heo, Heejun Lee, Sung Ju Hwang
Abstract:
Recent progress at the intersection of large language models (LLMs) and time series (TS) analysis has revealed both promise and fragility. While LLMs can reason over temporal structure given carefully engineered context, they often struggle with numeric fidelity, modality interference, and principled cross‑modal integration. We present TS‑Debate, a modality‑specialized, collaborative multi‑agent debate framework for zero‑shot time series reasoning. TS‑Debate assigns dedicated expert agents to textual context, visual patterns, and numerical signals, preceded by explicit domain knowledge elicitation, and coordinates their interaction via a structured debate protocol. Reviewer agents evaluate agent claims using a verification‑conflict‑calibration mechanism, supported by lightweight code execution and numerical lookup for programmatic verification. This architecture preserves modality fidelity, exposes conflicting evidence, and mitigates numeric hallucinations without task‑specific fine‑tuning. Across 20 tasks spanning three public benchmarks, TS‑Debate achieves consistent and significant performance improvements over strong baselines, including standard multimodal debate in which all agents observe all inputs.
Authors:Nanhan Shen, Zhilei Liu
Abstract:
Emotional Talking Face synthesis is pivotal in multimedia and signal processing, yet existing 3D methods suffer from two critical challenges: poor audio‑vision emotion alignment, manifested as difficult audio emotion extraction and inadequate control over emotional micro‑expressions; and a one‑size‑fits‑all multi‑view fusion strategy that overlooks uncertainty and feature quality differences, undermining rendering quality. We propose UA‑3DTalk, Uncertainty‑Aware 3D Emotional Talking Face Synthesis with emotion prior distillation, which has three core modules: the Prior Extraction module disentangles audio into content‑synchronized features for alignment and person‑specific complementary features for individualization; the Emotion Distillation module introduces a multi‑modal attention‑weighted fusion mechanism and 4D Gaussian encoding with multi‑resolution code‑books, enabling fine‑grained audio emotion extraction and precise control of emotional micro‑expressions; the Uncertainty‑based Deformation deploys uncertainty blocks to estimate view‑specific aleatoric (input noise) and epistemic (model parameters) uncertainty, realizing adaptive multi‑view fusion and incorporating a multi‑head decoder for Gaussian primitive optimization to mitigate the limitations of uniform‑weight fusion. Extensive experiments on regular and emotional datasets show UA‑3DTalk outperforms state‑of‑the‑art methods like DEGSTalk and EDTalk by 5.2% in E‑FID for emotion alignment, 3.1% in SyncC for lip synchronization, and 0.015 in LPIPS for rendering quality. Project page: https://mrask999.github.io/UA‑3DTalk
Authors:Haozheng Luo, Zhuolin Jiang, Md Zahid Hasan, Yan Chen, Soumalya Sarkar
Abstract:
We propose FROST, an attention‑aware method for efficient reasoning. Unlike traditional approaches, FROST leverages attention weights to prune uncritical reasoning paths, yielding shorter and more reliable reasoning trajectories. Methodologically, we introduce the concept of reasoning outliers and design an attention‑based mechanism to remove them. Theoretically, FROST preserves and enhances the model's reasoning capacity while eliminating outliers at the sentence level. Empirically, we validate FROST on four benchmarks using two strong reasoning models (Phi‑4‑Reasoning and GPT‑OSS‑20B), outperforming state‑of‑the‑art methods such as TALE and ThinkLess. Notably, FROST achieves an average 69.68% reduction in token usage and a 26.70% improvement in accuracy over the base model. Furthermore, in evaluations of attention outlier metrics, FROST reduces the maximum infinity norm by 15.97% and the average kurtosis by 91.09% compared to the base model. Code is available at https://github.com/robinzixuan/FROST
Authors:Fangzhou Wu, Sandeep Silwal, Qiuyi, Zhang
Abstract:
KV caching is a fundamental technique for accelerating Large Language Model (LLM) inference by reusing key‑value (KV) pairs from previous queries, but its effectiveness under limited memory is highly sensitive to the eviction policy. The default Least Recently Used (LRU) eviction algorithm struggles with dynamic online query arrivals, especially in multi‑LLM serving scenarios, where balancing query load across workers and maximizing cache hit rate of each worker are inherently conflicting objectives. We give the first unified mathematical model that captures the core trade‑offs between KV cache eviction and query routing. Our analysis reveals the theoretical limitations of existing methods and leads to principled algorithms that integrate provably competitive randomized KV cache eviction with learning‑based methods to adaptively route queries with evolving patterns, thus balancing query load and cache hit rate. Our theoretical results are validated by extensive experiments across 4 benchmarks and 3 prefix‑sharing settings, demonstrating improvements of up to 6.92× in cache hit rate, 11.96× reduction in latency, 14.06× reduction in time‑to‑first‑token (TTFT), and 77.4% increase in throughput over the state‑of‑the‑art methods. Our code is available at https://github.com/fzwark/KVRouting.
Authors:Yanxi Wang, Zhiling Zhang, Wenbo Zhou, Weiming Zhang, Jie Zhang, Qiannan Zhu, Yu Shi, Shuxin Zheng, Jiyan He
Abstract:
GUI agents enable end‑to‑end automation through direct perception of and interaction with on‑screen interfaces. However, these agents frequently access interfaces containing sensitive personal information, and screenshots are often transmitted to remote models, creating substantial privacy risks. These risks are particularly severe in GUI workflows: GUIs expose richer, more accessible private information, and privacy risks depend on interaction trajectories across sequential scenes. We propose GUIGuard, a three‑stage framework for privacy‑preserving GUI agents: (1) privacy recognition, (2) privacy protection, and (3) task execution under protection. We further construct GUIGuard‑Bench, a cross‑platform benchmark with 630 trajectories and 13,830 screenshots, annotated with region‑level privacy grounding and fine‑grained labels of risk level, privacy category, and task necessity. Evaluations reveal that existing agents exhibit limited privacy recognition, with state‑of‑the‑art models achieving only 13.3% accuracy on Android and 1.4% on PC. Under privacy protection, task‑planning semantics can still be maintained, with closed‑source models showing stronger semantic consistency than open‑source ones. Case studies on MobileWorld show that carefully designed protection strategies achieve higher task accuracy while preserving privacy. Our results highlight privacy recognition as a critical bottleneck for practical GUI agents. Project: https://futuresis.github.io/GUIGuard‑page/
Authors:Deep Mehta
Abstract:
Aggregate analytics over conversational data are increasingly used for safety monitoring, governance, and product analysis in large language model systems. A common practice is to embed conversations, cluster them, and publish short textual summaries describing each cluster. While raw conversations may never be exposed, these derived summaries can still pose privacy risks if they contain personally identifying information (PII) or uniquely traceable strings copied from individual conversations. We introduce CanaryBench, a simple and reproducible stress test for privacy leakage in cluster‑level conversation summaries. CanaryBench generates synthetic conversations with planted secret strings ("canaries") that simulate sensitive identifiers. Because canaries are known a priori, any appearance of these strings in published summaries constitutes a measurable leak. Using TF‑IDF embeddings and k‑means clustering on 3,000 synthetic conversations (24 topics) with a canary injection rate of 0.60, we evaluate an intentionally extractive example snippet summarizer that models quote‑like reporting. In this configuration, we observe canary leakage in 50 of 52 canary‑containing clusters (cluster‑level leakage rate 0.961538), along with nonzero regex‑based PII indicator counts. A minimal defense combining a minimum cluster‑size publication threshold (k‑min = 25) and regex‑based redaction eliminates measured canary leakage and PII indicator hits in the reported run while maintaining a similar cluster‑coherence proxy. We position this work as a societal impacts contribution centered on privacy risk measurement for published analytics artifacts rather than raw user data.
Authors:Yaohua Zha, Chunlin Fan, Peiyuan Liu, Yong Jiang, Tao Dai, Hai Wu, Shu-Tao Xia
Abstract:
Multi‑channel time‑series data, prevalent across diverse applications, is characterized by significant heterogeneity in its different channels. However, existing forecasting models are typically guided by channel‑agnostic loss functions like MSE, which apply a uniform metric across all channels. This often leads to fail to capture channel‑specific dynamics such as sharp fluctuations or trend shifts. To address this, we propose a Channel‑wise Perceptual Loss (CP Loss). Its core idea is to learn a unique perceptual space for each channel that is adapted to its characteristics, and to compute the loss within this space. Specifically, we first design a learnable channel‑wise filter that decomposes the raw signal into disentangled multi‑scale representations, which form the basis of our perceptual space. Crucially, the filter is optimized jointly with the main forecasting model, ensuring that the learned perceptual space is explicitly oriented towards the prediction task. Finally, losses are calculated within these perception spaces to optimize the model. Code is available at https://github.com/zyh16143998882/CP_Loss.
Authors:Jens Kohl, Otto Kruse, Youssef Mostafa, Andre Luckow, Karsten Schroer, Thomas Riedl, Ryan French, David Katz, Manuel P. Luitz, Tanrajbir Takher, Ken E. Friedl, Céline Laurent-Winter
Abstract:
LLM‑based agents are rapidly being adopted across diverse domains. Since they interact with users without supervision, they must be tested extensively. Current testing approaches focus on acceptance‑level evaluation from the user's perspective. While intuitive, these tests require manual evaluation, are difficult to automate, do not facilitate root cause analysis, and incur expensive test environments. In this paper, we present methods to enable structural testing of LLM‑based agents. Our approach utilizes traces (based on OpenTelemetry) to capture agent trajectories, employs mocking to enforce reproducible LLM behavior, and adds assertions to automate test verification. This enables testing agent components and interactions at a deeper technical level within automated workflows. We demonstrate how structural testing enables the adaptation of software engineering best practices to agents, including the test automation pyramid, regression testing, test‑driven development, and multi‑language testing. In representative case studies, we demonstrate automated execution and faster root‑cause analysis. Collectively, these methods reduce testing costs and improve agent quality through higher coverage, reusability, and earlier defect detection. We provide an open source reference implementation on GitHub.
Authors:William Han, Tony Chen, Chaojing Duan, Xiaoyu Song, Yihang Yao, Yuzhe Yang, Michael A. Rosenberg, Emerson Liu, Ding Zhao
Abstract:
ECG‑Language Models (ELMs) extend recent progress in Multimodal Large Language Models (MLLMs) to automated ECG interpretation. However, most ELMs follow Vision‑Language Model (VLM) designs and depend on pretrained ECG encoders, adding architectural and training complexity. Inspired by encoder‑free VLMs, we introduce ELF, an encoder‑free ELM that replaces the ECG encoder with a single projection layer trained jointly with the LLM. Across five datasets, ELF matches or exceeds state‑of‑the‑art ELMs that use far more complex encoders and training pipelines. We also test whether adding architectural biases to ELF improves performance and find that the single linear projection remains competitive. Finally, we show that ELF, and potentially other ELMs, often rely more on benchmark artifacts and language priors than ECG‑derived information, highlighting limitations in current evaluation practices and ELM design. All data and code is available at https://github.com/willxxy/ECG‑Bench.
Authors:Fangxu Yu, Xingang Guo, Lingzhi Yuan, Haoqiang Kang, Hongyu Zhao, Lianhui Qin, Furong Huang, Bin Hu, Tianyi Zhou
Abstract:
Time series are ubiquitous in real‑world scenarios and crucial for applications ranging from energy management to traffic control. Consequently, the ability to reason over time series is a fundamental skill for generalist models to solve complex problems. However, current benchmarks for generalist models largely overlook this dimension. To bridge this gap, we introduce TSRBench, a comprehensive multi‑modal benchmark designed to stress‑test the full spectrum of time series reasoning capabilities. TSRBench features: i) a diverse set of 4125 problems from 14 domains, and is categorized into 4 major dimensions: Perception, Reasoning, Prediction, and Decision‑Making. ii) 15 tasks from the 4 dimensions evaluating essential reasoning capabilities (e.g., numerical reasoning). Through extensive experiments, we evaluate over 30 leading proprietary and open‑source LLMs, VLMs, and TSLLMs within TSRBench. Our findings reveal that: i) scaling laws hold for perception and reasoning but break down for prediction; ii) strong reasoning does not guarantee accurate context‑aware forecasting, indicating a decoupling between semantic understanding and numerical prediction; and iii) despite the complementary nature of textual and visual forms of time series as inputs, current multimodal models fail to effectively fuse them for reciprocal performance gains. TSRBench provides a standardized evaluation platform that not only highlights existing challenges but also offers valuable insights to advance generalist models. Our code and dataset are available at https://tsrbench.github.io/.
Authors:Xingyu Sui, Yanyan Zhao, Yulin Hu, Jiahe Guo, Weixiang Zhao, Bing Qin
Abstract:
Emotional Support Conversation requires not only affective expression but also grounded instrumental support to provide trustworthy guidance. However, existing ESC systems and benchmarks largely focus on affective support in text‑only settings, overlooking how external tools can enable factual grounding and reduce hallucination in multi‑turn emotional support. We introduce TEA‑Bench, the first interactive benchmark for evaluating tool‑augmented agents in ESC, featuring realistic emotional scenarios, an MCP‑style tool environment, and process‑level metrics that jointly assess the quality and factual grounding of emotional support. Experiments on nine LLMs show that tool augmentation generally improves emotional support quality and reduces hallucination, but the gains are strongly capacity‑dependent: stronger models use tools more selectively and effectively, while weaker models benefit only marginally. We further release TEA‑Dialog, a dataset of tool‑enhanced ESC dialogues, and find that supervised fine‑tuning improves in‑distribution support but generalizes poorly. Our results underscore the importance of tool use in building reliable emotional support agents. Our code and data can be found in https://github.com/XingYuSSS/TEA‑Bench.
Authors:Yibo Li, Zijie Lin, Ailin Deng, Xuan Zhang, Yufei He, Shuo Ji, Tri Cao, Bryan Hooi
Abstract:
While Large Language Model (LLM) agents excel at general tasks, they inherently struggle with continual adaptation due to the frozen weights after deployment. Conventional reinforcement learning (RL) offers a solution but incurs prohibitive computational costs and the risk of catastrophic forgetting. We introduce Just‑In‑Time Reinforcement Learning (JitRL), a training‑free framework that enables test‑time policy optimization without any gradient updates. JitRL maintains a dynamic, non‑parametric memory of experiences and retrieves relevant trajectories to estimate action advantages on‑the‑fly. These estimates are then used to directly modulate the LLM's output logits. We theoretically prove that this additive update rule is the exact closed‑form solution to the KL‑constrained policy optimization objective. Extensive experiments on WebArena and Jericho demonstrate that JitRL establishes a new state‑of‑the‑art among training‑free methods. Crucially, JitRL outperforms the performance of computationally expensive fine‑tuning methods (e.g., WebRL) while reducing monetary costs by over 30 times, offering a scalable path for continual learning agents. The code is available at https://github.com/liushiliushi/JitRL.
Authors:Zhaoyan Gong, Zhiqiang Liu, Songze Li, Xiaoke Guo, Yuanxiang Liu, Xinle Deng, Zhizhen Liu, Lei Liang, Huajun Chen, Wen Zhang
Abstract:
Temporal Knowledge Graph Question Answering (TKGQA) is inherently challenging, as it requires sophisticated reasoning over dynamic facts with multi‑hop dependencies and complex temporal constraints. Existing methods rely on fixed workflows and expensive closed‑source APIs, limiting flexibility and scalability. We propose Temp‑R1, the first autonomous end‑to‑end agent for TKGQA trained through reinforcement learning. To address cognitive overload in single‑action reasoning, we expand the action space with specialized internal actions alongside external action. To prevent shortcut learning on simple questions, we introduce reverse curriculum learning that trains on difficult questions first, forcing the development of sophisticated reasoning before transferring to easier cases. Our 8B‑parameter Temp‑R1 achieves state‑of‑the‑art performance on MultiTQ and TimelineKGQA, improving 19.8% over strong baselines on complex questions. Our work establishes a new paradigm for autonomous temporal reasoning agents. Our code will be publicly available soon at https://github.com/zjukg/Temp‑R1.
Authors:Elena Bruches, Vadim Alperovich, Dari Baturova, Roman Derunets, Daniil Grebenkin, Georgy Mkrtchyan, Oleg Sedukhin, Mikhail Klementev, Ivan Bondarenko, Nikolay Bushkov, Stanislav Moiseev
Abstract:
While Large Language Models (LLMs) have shown promise in software engineering, their application to unit testing remains largely confined to isolated test generation or oracle prediction, neglecting the broader challenge of test suite maintenance. We introduce TAM‑Eval (Test Automated Maintenance Evaluation), a framework and benchmark designed to evaluate model performance across three core test maintenance scenarios: creation, repair, and updating of test suites. Unlike prior work limited to function‑level tasks, TAM‑Eval operates at the test file level, while maintaining access to full repository context during isolated evaluation, better reflecting real‑world maintenance workflows. Our benchmark comprises 1,539 automatically extracted and validated scenarios from Python, Java, and Go projects. TAM‑Eval supports system‑agnostic evaluation of both raw LLMs and agentic workflows, using a reference‑free protocol based on test suite pass rate, code coverage, and mutation testing. Empirical results indicate that state‑of‑the‑art LLMs have limited capabilities in realistic test maintenance processes and yield only marginal improvements in test effectiveness. We release TAM‑Eval as an open‑source framework to support future research in automated software testing. Our data and code are publicly available at https://github.com/trndcenter/TAM‑Eval.
Authors:Pei Wang, Yanan Wu, Xiaoshuai Song, Weixun Wang, Gengru Chen, Zhongwen Li, Kezhong Yan, Ken Deng, Qi Liu, Shuaibing Zhao, Shaopan Xiong, Xuepeng Liu, Xuefeng Chen, Wanxi Deng, Wenbo Su, Bo Zheng
Abstract:
Large language model (LLM)‑based agents are increasingly deployed in e‑commerce shopping. To perform thorough, user‑tailored product searches, agents should interpret personal preferences, engage in multi‑turn dialogues, and ultimately retrieve and discriminate among highly similar products. However, existing research has yet to provide a unified simulation environment that consistently captures all of these aspects, and always focuses solely on evaluation benchmarks without training support. In this paper, we introduce ShopSimulator, a large‑scale and challenging Chinese shopping environment. Leveraging ShopSimulator, we evaluate LLMs across diverse scenarios, finding that even the best‑performing models achieve less than 40% full‑success rate. Error analysis reveals that agents struggle with deep search and product selection in long trajectories, fail to balance the use of personalization cues, and to effectively engage with users. Further training exploration provides practical guidance for overcoming these weaknesses, with the combination of supervised fine‑tuning (SFT) and reinforcement learning (RL) yielding significant performance improvements. Code and data will be released at https://github.com/ShopAgent‑Team/ShopSimulator.
Authors:Kunat Pipatanakul, Pittawat Taveekitworachai
Abstract:
Large language models (LLMs) have progressed rapidly; however, most state‑of‑the‑art models are trained and evaluated primarily in high‑resource languages such as English and Chinese, and are often developed by a small number of organizations with access to large‑scale compute and data. This gatekeeping creates a practical barrier for sovereign settings in which a regional‑ or national‑scale institution or domain owner must retain control and understanding of model weights, training data, and deployment while operating under limited resources and strict transparency constraints. To this end, we identify two core requirements: (1) adoptability, the ability to transform a base model into a general‑purpose assistant, and (2) sovereign capability, the ability to perform high‑stakes, region‑specific tasks (e.g., legal reasoning in local languages and cultural knowledge). We investigate whether these requirements can be achieved without scaling massive instruction corpora or relying on complex preference tuning pipelines and large‑scale reinforcement fine‑tuning (RFT). We present Typhoon S, a minimal and open post‑training recipe that combines supervised fine‑tuning, on‑policy distillation, and small‑scale RFT. Using Thai as a representative case study, we demonstrate that our approach transforms both sovereign‑adapted and general‑purpose base models into instruction‑tuned models with strong general performance. We further show that small‑scale RFT with InK‑GRPO ‑‑ an extension of GRPO that augments the GRPO loss with a next‑word prediction loss ‑‑ improves Thai legal reasoning and Thai‑specific knowledge while preserving general capabilities. Our results suggest that a carefully designed post‑training strategy can reduce the required scale of instruction data and computation, providing a practical path toward high‑quality sovereign LLMs under academic‑scale resources.
Authors:Dezhang Kong, Zhuxi Wu, Shiqi Liu, Zhicheng Tan, Kuichen Lu, Minghao Li, Qichen Liu, Shengyu Chu, Zhenhua Xu, Xuan Liu, Meng Han
Abstract:
LLM‑based web agents have become increasingly popular for their utility in daily life and work. However, they exhibit critical vulnerabilities when processing malicious URLs: accepting a disguised malicious URL enables subsequent access to unsafe webpages, which can cause severe damage to service providers and users. Despite this risk, no benchmark currently targets this emerging threat. To address this gap, we propose MalURLBench, the first benchmark for evaluating LLMs' vulnerabilities to malicious URLs. MalURLBench contains 61,845 attack instances spanning 10 real‑world scenarios and 7 categories of real malicious websites. Experiments with 12 popular LLMs reveal that existing models struggle to detect elaborately disguised malicious URLs. We further identify and analyze key factors that impact attack success rates and propose URLGuard, a lightweight defense module. We believe this work will provide a foundational resource for advancing the security of web agents. Our code is available at https://github.com/JiangYingEr/MalURLBench.
Authors:Wei-Po Hsin, Ren-Hao Deng, Yao-Ting Hsieh, En-Ming Huang, Shih-Hao Hung
Abstract:
Verilog's design cycle is inherently labor‑intensive and necessitates extensive domain expertise. Although Large Language Models (LLMs) offer a promising pathway toward automation, their limited training data and intrinsic sequential reasoning fail to capture the strict formal logic and concurrency inherent in hardware systems. To overcome these barriers, we present EvolVE, the first framework to analyze multiple evolution strategies on chip design tasks, revealing that Monte Carlo Tree Search (MCTS) excels at maximizing functional correctness, while Idea‑Guided Refinement (IGR) proves superior for optimization. We further leverage Structured Testbench Generation (STG) to accelerate the evolutionary process. To address the lack of complex optimization benchmarks, we introduce IC‑RTL, targeting industry‑scale problems derived from the National Integrated Circuit Contest. Evaluations establish EvolVE as the new state‑of‑the‑art, achieving 98.1% on VerilogEval v2 and 92% on RTLLM v2. Furthermore, on the industry‑scale IC‑RTL suite, our framework surpasses reference implementations authored by contest participants, reducing the Power, Performance, Area (PPA) product by up to 66% in Huffman Coding and 17% in the geometric mean across all problems. The source code of the IC‑RTL benchmark is available at https://github.com/weiber2002/ICRTL.
Authors:Vi Vu, Thanh-Huy Nguyen, Tien-Thinh Nguyen, Ba-Thinh Lam, Hoang-Thien Nguyen, Tianyang Wang, Xingjian Li, Min Xu
Abstract:
Foundation models like the Segment Anything Model (SAM) show strong generalization, yet adapting them to medical images remains difficult due to domain shift, scarce labels, and the inability of Parameter‑Efficient Fine‑Tuning (PEFT) to exploit unlabeled data. While conventional models like U‑Net excel in semi‑supervised medical learning, their potential to assist a PEFT SAM has been largely overlooked. We introduce SC‑SAM, a specialist‑generalist framework where U‑Net provides point‑based prompts and pseudo‑labels to guide SAM's adaptation, while SAM serves as a powerful generalist supervisor to regularize U‑Net. This reciprocal guidance forms a bidirectional co‑training loop that allows both models to effectively exploit the unlabeled data. Across prostate MRI and polyp segmentation benchmarks, our method achieves state‑of‑the‑art results, outperforming other existing semi‑supervised SAM variants and even medical foundation models like MedSAM, highlighting the value of specialist‑generalist cooperation for label‑efficient medical image segmentation. Our code is available at https://github.com/vnlvi2k3/SC‑SAM.
Authors:Zhongyu Xiao, Zhiwei Hao, Jianyuan Guo, Yong Luo, Jia Liu, Jie Xu, Han Hu
Abstract:
Diffusion Large Language Models (dLLMs) offer a compelling paradigm for natural language generation, leveraging parallel decoding and bidirectional attention to achieve superior global coherence compared to autoregressive models. While recent works have accelerated inference via KV cache reuse or heuristic decoding, they overlook the intrinsic inefficiencies within the block‑wise diffusion process. Specifically, they suffer from spatial redundancy by modeling informative‑sparse suffix regions uniformly and temporal inefficiency by applying fixed denoising schedules across all the decoding process. To address this, we propose Streaming‑dLLM, a training‑free framework that streamlines inference across both spatial and temporal dimensions. Spatially, we introduce attenuation guided suffix modeling to approximate the full context by pruning redundant mask tokens. Temporally, we employ a dynamic confidence aware strategy with an early exit mechanism, allowing the model to skip unnecessary iterations for converged tokens. Extensive experiments show that Streaming‑dLLM achieves up to 68.2X speedup while maintaining generation quality, highlighting its effectiveness in diffusion decoding. The code is available at https://github.com/xiaoshideta/Streaming‑dLLM.
Authors:Jiayu Liu, Yinhe Long, Zhenya Huang, Enhong Chen
Abstract:
A growing body of research suggests that the cognitive processes of large language models (LLMs) differ fundamentally from those of humans. However, existing interpretability methods remain limited in explaining how cognitive abilities are engaged during LLM reasoning. In this paper, we propose UniCog, a unified framework that analyzes LLM cognition via a latent mind space. Formulated as a latent variable model, UniCog encodes diverse abilities from dense model activations into sparse, disentangled latent dimensions. Through extensive analysis on six advanced LLMs, including DeepSeek‑V3.2 and GPT‑4o, we reveal a Pareto principle of LLM cognition, where a shared reasoning core is complemented by ability‑specific signatures. Furthermore, we discover that reasoning failures often manifest as anomalous intensity in latent activations. These findings opens a new paradigm in LLM analysis, providing a cognition grounded view of reasoning dynamics. Finally, leveraging these insights, we introduce a latent‑informed candidate prioritization strategy, which improves reasoning performance by up to 7.5% across challenging benchmarks. Our code is available at https://github.com/milksalute/unicog.
Authors:Qingyu Fan, Zhaoxiang Li, Yi Lu, Wang Chen, Qiu Shen, Xiao-xiao Long, Yinghao Cai, Tao Lu, Shuo Wang, Xun Cao
Abstract:
Bimanual manipulation in cluttered scenes requires policies that remain stable under occlusions, viewpoint and scene variations. Existing vision‑language‑action models often fail to generalize because (i) multi‑view features are fused via view‑agnostic token concatenation, yielding weak 3D‑consistent spatial understanding, and (ii) language is injected as global conditioning, resulting in coarse instruction grounding. In this paper, we introduce PEAfowl, a perception‑enhanced multi‑view VLA policy for bimanual manipulation. For spatial reasoning, PEAfowl predicts per‑token depth distributions, performs differentiable 3D lifting, and aggregates local cross‑view neighbors to form geometrically grounded, cross‑view consistent representations. For instruction grounding, we propose to replace global conditioning with a Perceiver‑style text‑aware readout over frozen CLIP visual features, enabling iterative evidence accumulation. To overcome noisy and incomplete commodity depth without adding inference overhead, we apply training‑only depth distillation from a pretrained depth teacher to supervise the depth‑distribution head, providing perception front‑end with geometry‑aware priors. On RoboTwin 2.0 under domain‑randomized setting, PEAfowl improves the strongest baseline by 23.0 pp in success rate, and real‑robot experiments further demonstrate reliable sim‑to‑real transfer and consistent improvements from depth distillation. Project website: https://peafowlvla.github.io/.
Authors:Zhihao He, Tieyuan Chen, Kangyu Wang, Ziran Qin, Yang Shao, Chaofan Gan, Shijie Li, Zuxuan Wu, Weiyao Lin
Abstract:
Current Video Large Language Models (Video LLMs) typically encode frames via a vision encoder and employ an autoregressive (AR) LLM for understanding and generation. However, this AR paradigm inevitably faces a dual efficiency bottleneck: strictly unidirectional attention compromises understanding efficiency by hindering global spatiotemporal aggregation, while serial decoding restricts generation efficiency. To address this, we propose VidLaDA, a Video LLM based on Diffusion Language Models (DLMs) that leverages bidirectional attention to unlock comprehensive spatiotemporal modeling and decode tokens in parallel. To further mitigate the computational overhead of diffusion decoding, we introduce MARS‑Cache, an acceleration strategy that prunes redundancy by combining asynchronous visual cache refreshing with frame‑wise chunk attention. Experiments show VidLaDA rivals state‑of‑the‑art AR baselines (e.g., Qwen2.5‑VL and LLaVA‑Video) and outperforms DLM baselines, with MARS‑Cache delivering over 12x speedup without compromising accuracy. Code and checkpoints are open‑sourced at https://github.com/ziHoHe/VidLaDA.
Authors:Pranav Kasela, Marco Braga, Alessandro Ghiotto, Andrea Pilzer, Marco Viviani, Alessandro Raganato
Abstract:
In this paper, we present DIETA, a small, decoder‑only Transformer model with 0.5 billion parameters, specifically designed and trained for Italian‑English machine translation. We collect and curate a large parallel corpus consisting of approximately 207 million Italian‑English sentence pairs across diverse domains, including parliamentary proceedings, legal texts, web‑crawled content, subtitles, news, literature and 352 million back‑translated data using pretrained models. Additionally, we create and release a new small‑scale evaluation set, consisting of 450 sentences, based on 2025 WikiNews articles, enabling assessment of translation quality on contemporary text. Comprehensive evaluations show that DIETA achieves competitive performance on multiple Italian‑English benchmarks, consistently ranking in the second quartile of a 32‑system leaderboard and outperforming most other sub‑3B models on four out of five test suites. The training script, trained models, curated corpus, and newly introduced evaluation set are made publicly available, facilitating further research and development in specialized Italian‑English machine translation. https://github.com/pkasela/DIETA‑Machine‑Translation
Authors:Haoxuan Ma, Guannan Lai, Han-Jia Ye
Abstract:
Multimodal large language models (MLLMs) have advanced rapidly, yet heterogeneity in architecture, alignment strategies, and efficiency means that no single model is uniformly superior across tasks. In practical deployments, workloads span lightweight OCR to complex multimodal reasoning; using one MLLM for all queries either over‑provisions compute on easy instances or sacrifices accuracy on hard ones. Query‑level model selection (routing) addresses this tension, but extending routing from text‑only LLMs to MLLMs is nontrivial due to modality fusion, wide variation in computational cost across models, and the absence of a standardized, budget‑aware evaluation. We present MMR‑Bench, a unified benchmark that isolates the multimodal routing problem and enables comparison under fixed candidate sets and cost models. MMR‑Bench provides (i) a controlled environment with modality‑aware inputs and variable compute budgets, (ii) a broad suite of vision‑language tasks covering OCR, general VQA, and multimodal math reasoning, and (iii) strong single‑model reference, oracle upper bounds, and representative routing policies. Using MMR‑Bench, we show that incorporating multimodal signals improves routing quality. Empirically, these cues improve the cost‑accuracy frontier and enable the routed system to exceed the strongest single model's accuracy at roughly 33% of its cost. Furthermore, policies trained on a subset of models and tasks generalize zero‑shot to new datasets and text‑only benchmarks without retuning, establishing MMR‑Bench as a foundation for studying adaptive multimodal model selection and efficient MLLM deployment. The code will be available at: https://github.com/Hunter‑Wrynn/MMR‑Bench.
Authors:Raja Gond, Aditya K Kamath, Arkaprava Basu, Ramachandran Ramjee, Ashish Panwar
Abstract:
In LLM inference, the same prompt may yield different outputs across different runs. At the system level, this non‑determinism arises from floating‑point non‑associativity combined with dynamic batching and GPU kernels whose reduction orders vary with batch size. A straightforward way to eliminate non‑determinism is to disable dynamic batching during inference, but doing so severely degrades throughput. Another approach is to make kernels batch‑invariant; however, this tightly couples determinism to kernel design, requiring new implementations. This coupling also imposes fixed runtime overheads, regardless of how much of the workload actually requires determinism. Inspired by ideas from speculative decoding, we present LLM‑42, a scheduling‑based approach to enable determinism in LLM inference. Our key observation is that if a sequence is in a consistent state, the next emitted token is likely to be consistent even with dynamic batching. Moreover, most GPU kernels use shape‑consistent reductions. Leveraging these insights, LLM‑42 decodes tokens using a non‑deterministic fast path and enforces determinism via a lightweight verify‑rollback loop. The verifier replays candidate tokens under a fixed‑shape reduction schedule, commits those that are guaranteed to be consistent across runs, and rolls back those violating determinism. LLM‑42 mostly re‑uses existing kernels unchanged and incurs overhead only in proportion to the traffic that requires determinism.
Authors:Ziyang Song, Xinyu Gong, Bangya Liu, Zelin Zhao
Abstract:
Existing Subject‑to‑Video Generation (S2V) methods have achieved high‑fidelity and subject‑consistent video generation, yet remain constrained to single‑view subject references. This limitation renders the S2V task reducible to an S2I + I2V pipeline, failing to exploit the full potential of video subject control. In this work, we propose and address the challenging Multi‑View S2V (MV‑S2V) task, which synthesizes videos from multiple reference views to enforce 3D‑level subject consistency. Regarding the scarcity of training data, we first develop a synthetic data curation pipeline to generate highly customized synthetic data, complemented by a small‑scale real‑world captured dataset to boost the training of MV‑S2V. Another key issue lies in the potential confusion between cross‑subject and cross‑view references in conditional generation. To overcome this, we further introduce Temporally Shifted RoPE (TS‑RoPE) to distinguish between different subjects and distinct views of the same subject in reference conditioning. Our framework achieves superior 3D subject consistency w.r.t. multi‑view reference images and high‑quality visual outputs, establishing a new meaningful direction for subject‑driven video generation. Our project page is available at: https://szy‑young.github.io/mv‑s2v
Authors:Amjad Fatmi
Abstract:
Autonomous agent systems increasingly trigger real‑world side effects: deploying infrastructure, modifying databases, moving money, and executing workflows. Yet most agent stacks provide no mandatory execution checkpoint where organizations can deterministically permit, deny, or defer an action before it changes reality. This paper introduces Faramesh, a protocol‑agnostic execution control plane that enforces execution‑time authorization for agent‑driven actions via a non‑bypassable Action Authorization Boundary (AAB). Faramesh canonicalizes agent intent into a Canonical Action Representation (CAR), evaluates actions deterministically against policy and state, and issues a decision artifact (PERMIT/DEFER/DENY) that executors must validate prior to execution. The system is designed to be framework‑ and model‑agnostic, supports multi‑agent and multi‑tenant deployments, and remains independent of transport protocols (e.g., MCP). Faramesh further provides decision‑centric, append‑only provenance logging keyed by canonical action hashes, enabling auditability, verification, and deterministic replay without re‑running agent reasoning. We show how these primitives yield enforceable, predictable governance for autonomous execution while avoiding hidden coupling to orchestration layers or observability‑only approaches.
Authors:Kyungho Kim, Geon Lee, Juyeon Kim, Dongwon Choi, Shinhwan Kang, Kijung Shin
Abstract:
Relational databases (RDBs) play a crucial role in many real‑world web applications, supporting data management across multiple interconnected tables. Beyond typical retrieval‑oriented tasks, prediction tasks on RDBs have recently gained attention. In this work, we address this problem by generating informative relational features that enhance predictive performance. However, generating such features is challenging: it requires reasoning over complex schemas and exploring a combinatorially large feature space, all without explicit supervision. To address these challenges, we propose ReFuGe, an agentic framework that leverages specialized large language model agents: (1) a schema selection agent identifies the tables and columns relevant to the task, (2) a feature generation agent produces diverse candidate features from the selected schema, and (3) a feature filtering agent evaluates and retains promising features through reasoning‑based and validation‑based filtering. It operates within an iterative feedback loop until performance converges. Experiments on RDB benchmarks demonstrate that ReFuGe substantially improves performance on various RDB prediction tasks. Our code and datasets are available at https://github.com/K‑Kyungho/REFUGE.
Authors:Chengqian Jiang, Jie Zhang, Haoyin Yan
Abstract:
Distributed microphone array (DMA) is a promising next‑generation platform for speech interaction, where speech enhancement (SE) is still required to improve the speech quality in noisy cases. Existing SE methods usually first gather raw waveforms at a fusion center (FC) from all devices and then design a multi‑microphone model, causing high bandwidth and energy costs. In this work, we propose a \emphCompress‑and‑Send Network (CaSNet) for resource‑constrained DMAs, where one microphone serves as the FC and reference. Each of other devices encodes the measured raw data into a feature matrix, which is then compressed by singular value decomposition (SVD) to produce a more compact representation. The received features at the FC are aligned via cross window query with respect to the reference, followed by neural decoding to yield spatially coherent enhanced speech. Experiments on multiple datasets show that the proposed CaSNet can save the data amount with a negligible impact on the performance compared to the uncompressed case. The reproducible code is available at https://github.com/Jokejiangv/CaSNet.
Authors:Aadam, Monu Verma, Mohamed Abdel-Mottaleb
Abstract:
The Abstraction and Reasoning Corpus (ARC) tests AI systems' ability to perform human‑like inductive reasoning from a few demonstration pairs. Existing Gymnasium‑based RL environments severely limit experimental scale due to computational bottlenecks. We present JaxARC, an open‑source, high‑performance RL environment for ARC implemented in JAX. Its functional, stateless architecture enables massive parallelism, achieving 38‑5,439x speedup over Gymnasium at matched batch sizes, with peak throughput of 790M steps/second. JaxARC supports multiple ARC datasets, flexible action spaces, composable wrappers, and configuration‑driven reproducibility, enabling large‑scale RL research previously computationally infeasible. JaxARC is available at https://github.com/aadimator/JaxARC.
Authors:Silong Chen, Yuchuan Luo, Guilin Deng, Yi Liu, Min Xu, Shaojing Fu, Xiaohua Jia
Abstract:
Adapter‑based Federated Large Language Models (FedLLMs) are widely adopted to reduce the computational, storage, and communication overhead of full‑parameter fine‑tuning for web‑scale applications while preserving user privacy. By freezing the backbone and training only compact low‑rank adapters, these methods appear to limit gradient leakage and thwart existing Gradient Inversion Attacks (GIAs). Contrary to this assumption, we show that low‑rank adapters create new, exploitable leakage channels. We propose the Unordered‑word‑bag‑based Text Reconstruction (UTR) attack, a novel GIA tailored to the unique structure of adapter‑based FedLLMs. UTR overcomes three core challenges: low‑dimensional gradients, frozen backbones, and combinatorially large reconstruction spaces by: (i) inferring token presence from attention patterns in frozen layers, (ii) performing sentence‑level inversion within the low‑rank subspace of adapter gradients, and (iii) enforcing semantic coherence through constrained greedy decoding guided by language priors. Extensive experiments across diverse models (GPT2‑Large, BERT, Qwen2.5‑7B) and datasets (CoLA, SST‑2, Rotten Tomatoes) demonstrate that UTR achieves near‑perfect reconstruction accuracy (ROUGE‑1/2 > 99), even with large batch size settings where prior GIAs fail completely. Our results reveal a fundamental tension between parameter efficiency and privacy in FedLLMs, challenging the prevailing belief that lightweight adaptation inherently enhances security. Our code and data are available at https://github.com/shwksnshwowk‑wq/GIA.
Authors:Xuan Ding, Xiu Yan, Chuanlong Xie, Yao Zhu
Abstract:
Watermarking methods have always been effective means of protecting intellectual property, yet they face significant challenges. Although existing deep learning‑based watermarking systems can hide watermarks in images with minimal impact on image quality, they often lack robustness when encountering image corruptions during transmission, which undermines their practical application value. To this end, we propose a high‑quality and robust watermark framework based on the diffusion model. Our method first converts the clean image into inversion noise through a null‑text optimization process, and after optimizing the inversion noise in the latent space, it produces a high‑quality watermarked image through an iterative denoising process of the diffusion model. The iterative denoising process serves as a powerful purification mechanism, ensuring both the visual quality of the watermarked image and enhancing the robustness of the watermark against various corruptions. To prevent the optimizing of inversion noise from distorting the original semantics of the image, we specifically introduced self‑attention constraints and pseudo‑mask strategies. Extensive experimental results demonstrate the superior performance of our method against various image corruptions. In particular, our method outperforms the stable signature method by an average of 10% across 12 different image transformations on COCO datasets. Our codes are available at https://github.com/920927/ONRW.
Authors:Chen Ling, Kai Hu, Hangcheng Liu, Xingshuo Han, Tianwei Zhang, Changhai Ou
Abstract:
Large Vision‑Language Models (LVLMs) are increasingly deployed in real‑world intelligent systems for perception and reasoning in open physical environments. While LVLMs are known to be vulnerable to prompt injection attacks, existing methods either require access to input channels or depend on knowledge of user queries, assumptions that rarely hold in practical deployments. We propose the first Physical Prompt Injection Attack (PPIA), a black‑box, query‑agnostic attack that embeds malicious typographic instructions into physical objects perceivable by the LVLM. PPIA requires no access to the model, its inputs, or internal pipeline, and operates solely through visual observation. It combines offline selection of highly recognizable and semantically effective visual prompts with strategic environment‑aware placement guided by spatiotemporal attention, ensuring that the injected prompts are both perceivable and influential on model behavior. We evaluate PPIA across 10 state‑of‑the‑art LVLMs in both simulated and real‑world settings on tasks including visual question answering, planning, and navigation, PPIA achieves attack success rates up to 98%, with strong robustness under varying physical conditions such as distance, viewpoint, and illumination. Our code is publicly available at https://github.com/2023cghacker/Physical‑Prompt‑Injection‑Attack.
Authors:Mohammed Fasha, Bassam Hammo, Bilal Sowan, Husam Barham, Esam Nsour
Abstract:
This study uses Jordanian law as a case study to explore the fine‑tuning of the Llama‑3.1 large language model for Arabic question‑answering. Two versions of the model ‑ Llama‑3.1‑8B‑bnb‑4bit and Llama‑3.1‑8B‑Instruct‑bnb‑4bit ‑ were fine‑tuned using parameter‑efficient fine‑tuning (PEFT) with LoRA adapters and 4‑bit quantized models, leveraging the Unsloth framework for accelerated and resource‑efficient training. A custom dataset of 6000 legal question‑answer pairs was curated from Jordanian laws and formatted into structured prompts. Performance was evaluated using the BLEU and the ROUGE metrics to compare the fine‑tuned models to their respective base versions. Results demonstrated improved legal reasoning and accuracy while achieving resource efficiency through quantization and optimized fine‑tuning strategies. This work underscores the potential of adapting large language models for Arabic legal domains and highlights effective techniques for fine‑tuning domain‑specific tasks.
Authors:Yicheng Tao, Hongteng Xu
Abstract:
The high cost of agentic workflows in formal mathematics hinders large‑scale data synthesis, exacerbating the scarcity of open‑source corpora. To address this, we introduce TheoremForge, a cost‑effective formal data synthesis pipeline that decomposes the formalization process into five sub‑tasks, which are statement formalization, proof generation, premise selection, proof correction and proof sketching. By implementing a Decoupled Extraction Strategy, the workflow recovers valid training signals from globally failed trajectories, effectively utilizing wasted computation. Experiments on a 2,000‑problem benchmark demonstrate that TheoremForge achieves a Verified Rate of 12.6%, surpassing the 8.6% baseline, at an average cost of only \0.481 per successful trajectory using Gemini‑3‑Flash. Crucially, our strategy increases data yield by 1.6× for proof generation compared to standard filtering. These results establish TheoremForge as a scalable framework for constructing a data flywheel to train future expert models. Our code is available \hrefhttps://github.com/timechess/TheoremForgehere.
Authors:Yaokun Liu, Yifan Liu, Phoebe Mbuvi, Zelin Li, Ruichen Yao, Gawon Lim, Dong Wang
Abstract:
The deployment of Large Language Models in Medical Question Answering is severely hampered by ambiguous user queries, a significant safety risk that demonstrably reduces answer accuracy in high‑stakes healthcare settings. In this paper, we formalize this challenge by linking input ambiguity to aleatoric uncertainty (AU), which is the irreducible uncertainty arising from underspecified input. To facilitate research in this direction, we construct CV‑MedBench, the first benchmark designed for studying input ambiguity in Medical QA. Using this benchmark, we analyze AU from a representation engineering perspective, revealing that AU is linearly encoded in LLM's internal activation patterns. Leveraging this insight, we introduce a novel AU‑guided "Clarify‑Before‑Answer" framework, which incorporates AU‑Probe ‑ a lightweight module that detects input ambiguity directly from hidden states. Unlike existing uncertainty estimation methods, AU‑Probe requires neither LLM fine‑tuning nor multiple forward passes, enabling an efficient mechanism to proactively request user clarification and significantly enhance safety. Extensive experiments across four open LLMs demonstrate the effectiveness of our QA framework, with an average accuracy improvement of 9.48% over baselines. Our framework provides an efficient and robust solution for safe Medical QA, strengthening the reliability of health‑related applications. The code is available at https://github.com/yaokunliu/AU‑Med.git, and the CV‑MedBench dataset is released on Hugging Face at https://huggingface.co/datasets/yaokunl/CV‑MedBench.
Authors:Parth Bhalerao, Diola Dsouza, Ruiwen Guan, Oana Ignat
Abstract:
Question answering systems are typically evaluated on factual correctness, yet many real‑world applications‑such as education and career guidance‑require mentorship: responses that provide reflection and guidance. Existing QA benchmarks rarely capture this distinction, particularly in multilingual and long‑form settings. We introduce MentorQA, the first multilingual dataset and evaluation framework for mentorship‑focused question answering from long‑form videos, comprising nearly 9,000 QA pairs from 180 hours of content across four languages. We define mentorship‑focused evaluation dimensions that go beyond factual accuracy, capturing clarity, alignment, and learning value. Using MentorQA, we compare Single‑Agent, Dual‑Agent, RAG, and Multi‑Agent QA architectures under controlled conditions. Multi‑Agent pipelines consistently produce higher‑quality mentorship responses, with especially strong gains for complex topics and lower‑resource languages. We further analyze the reliability of automated LLM‑based evaluation, observing substantial variation in alignment with human judgments. Overall, this work establishes mentorship‑focused QA as a distinct research problem and provides a multilingual benchmark for studying agentic architectures and evaluation design in educational AI. The dataset and evaluation framework are released at https://github.com/AIM‑SCU/MentorQA.
Authors:Yonghan Jung, Bogyeong Kang
Abstract:
We develop a data‑driven information‑theoretic framework for sharp partial identification of causal effects under unmeasured confounding. Existing approaches often rely on restrictive assumptions, such as bounded or discrete outcomes; require external inputs (for example, instrumental variables, proxies, or user‑specified sensitivity parameters); necessitate full structural causal model specifications; or focus solely on population‑level averages while neglecting covariate‑conditional treatment effects. We overcome all four limitations simultaneously by establishing novel information‑theoretic, data‑driven divergence bounds. Our key theoretical contribution shows that the f‑divergence between the observational distribution P(Y | A = a, X = x) and the interventional distribution P(Y | do(A = a), X = x) is upper bounded by a function of the propensity score alone. This result enables sharp partial identification of conditional causal effects directly from observational data, without requiring external sensitivity parameters, auxiliary variables, full structural specifications, or outcome boundedness assumptions. For practical implementation, we develop a semiparametric estimator satisfying Neyman orthogonality (Chernozhukov et al., 2018), which ensures square‑root‑n consistent inference even when nuisance functions are estimated using flexible machine learning methods. Simulation studies and real‑world data applications, implemented in the GitHub repository (https://github.com/yonghanjung/Information‑Theretic‑Bounds), demonstrate that our framework provides tight and valid causal bounds across a wide range of data‑generating processes.
Authors:Inderjeet Singh, Eleonore Vissol-Gaudin, Andikan Otung, Motoyoshi Sekiya
Abstract:
Fine‑tuning Large Language Models (LLMs) for specialized domains is constrained by a fundamental challenge: the need for diverse, cross‑organizational data conflicts with the principles of data privacy and sovereignty. While Federated Learning (FL) provides a framework for collaboration without raw data exchange, its classic centralized form introduces a single point of failure and remains vulnerable to model inversion attacks. Decentralized FL (DFL) mitigates this risk by removing the central aggregator but typically relies on inefficient, random peer‑to‑peer (P2P) pairings, forming a collaboration graph that is blind to agent heterogeneity and risks negative transfer. This paper introduces KNEXA‑FL, a novel framework for orchestrated decentralization that resolves this trade‑off. KNEXA‑FL employs a non‑aggregating Central Profiler/Matchmaker (CPM) that formulates P2P collaboration as a contextual bandit problem, using a LinUCB algorithm on abstract agent profiles to learn an optimal matchmaking policy. It orchestrates direct knowledge exchange between heterogeneous, PEFT‑based LLM agents via secure distillation, without ever accessing the models themselves. Our comprehensive experiments on a challenging code generation task show that KNEXA‑FL yields substantial gains, improving Pass@1 by approx. 50% relative to random P2P collaboration. Critically, our orchestrated approach demonstrates stable convergence, in stark contrast to a powerful centralized distillation baseline which suffers from catastrophic performance collapse. Our work establishes adaptive, learning‑based orchestration as a foundational principle for building robust and effective decentralized AI ecosystems.
Authors:Ole Stüven, Keno Moenck, Thorsten Schüppstuhl
Abstract:
ROCKET (RandOm Convolutional KErnel Transform) is a feature extraction algorithm created for Time Series Classification (TSC), published in 2019. It applies convolution with randomly generated kernels on a time series, producing features that can be used to train a linear classifier or regressor like Ridge. At the time of publication, ROCKET was on par with the best state‑of‑the‑art algorithms for TSC in terms of accuracy while being significantly less computationally expensive, making ROCKET a compelling algorithm for TSC. This also led to several subsequent versions, further improving accuracy and computational efficiency. The currently available ROCKET implementations are mostly bound to execution on CPU. However, convolution is a task that can be highly parallelized and is therefore suited to be executed on GPU, which speeds up the computation significantly. A key difficulty arises from the inhomogeneous kernels ROCKET uses, making standard methods for applying convolution on GPU inefficient. In this work, we propose an algorithm that is able to efficiently perform ROCKET on GPU and achieves up to 11 times higher computational efficiency per watt than ROCKET on CPU. The code for CUROCKET is available in this repository https://github.com/oleeven/CUROCKET on github.
Authors:Haoxuan Li, He Chang, Yunshan Ma, Yi Bin, Yang Yang, See-Kiong Ng, Tat-Seng Chua
Abstract:
Event forecasting is inherently influenced by multifaceted considerations, including international relations, regional historical dynamics, and cultural contexts. However, existing LLM‑based approaches employ single‑model architectures that generate predictions along a singular explicit trajectory, constraining their ability to capture diverse geopolitical nuances across complex regional contexts. To address this limitation, we introduce ThinkTank‑ME, a novel Think Tank framework for Middle East event forecasting that emulates collaborative expert analysis in real‑world strategic decision‑making. To facilitate expert specialization and rigorous evaluation, we construct POLECAT‑FOR‑ME, a Middle East‑focused event forecasting benchmark. Experimental results demonstrate the superiority of multi‑expert collaboration in handling complex temporal geopolitical forecasting tasks. The code is available at https://github.com/LuminosityX/ThinkTank‑ME.
Authors:Wei Zhou, Jun Zhou, Haoyu Wang, Zhenghao Li, Qikang He, Shaokun Han, Guoliang Li, Xuanhe Zhou, Yeye He, Chunwei Liu, Zirui Tang, Bin Wang, Shen Tang, Kai Zuo, Yuyu Luo, Zhenzhe Zheng, Conghui He, Jingren Zhou, Fan Wu
Abstract:
Data preparation aims to denoise raw datasets, uncover cross‑dataset relationships, and extract valuable insights from them, which is essential for a wide range of data‑centric applications. Driven by (i) rising demands for application‑ready data (e.g., for analytics, visualization, decision‑making), (ii) increasingly powerful LLM techniques, and (iii) the emergence of infrastructures that facilitate flexible agent construction (e.g., using Databricks Unity Catalog), LLM‑enhanced methods are rapidly becoming a transformative and potentially dominant paradigm for data preparation. By investigating hundreds of recent literature works, this paper presents a systematic review of this evolving landscape, focusing on the use of LLM techniques to prepare data for diverse downstream tasks. First, we characterize the fundamental paradigm shift, from rule‑based, model‑specific pipelines to prompt‑driven, context‑aware, and agentic preparation workflows. Next, we introduce a task‑centric taxonomy that organizes the field into three major tasks: data cleaning (e.g., standardization, error processing, imputation), data integration (e.g., entity matching, schema matching), and data enrichment (e.g., data annotation, profiling). For each task, we survey representative techniques, and highlight their respective strengths (e.g., improved generalization, semantic understanding) and limitations (e.g., the prohibitive cost of scaling LLMs, persistent hallucinations even in advanced agents, the mismatch between advanced methods and weak evaluation). Moreover, we analyze commonly used datasets and evaluation metrics (the empirical part). Finally, we discuss open research challenges and outline a forward‑looking roadmap that emphasizes scalable LLM‑data systems, principled designs for reliable agentic workflows, and robust evaluation protocols.
Authors:Aahana Basappa, Pranay Goel, Anusri Karra, Anish Karra, Asa Gilmore, Kevin Zhu
Abstract:
We investigated visual reasoning limitations of both multimodal large language models (MLLMs) and image generation models (IGMs) by creating a novel benchmark to systematically compare failure modes across image‑to‑text and text‑to‑image tasks, enabling cross‑modal evaluation of visual understanding. Despite rapid growth in machine learning, vision language models (VLMs) still fail to understand or generate basic visual concepts such as object orientation, quantity, or spatial relationships, which highlighted gaps in elementary visual reasoning. By adapting MMVP benchmark questions into explicit and implicit prompts, we create AMVICC, a novel benchmark for profiling failure modes across various modalities. After testing 11 MLLMs and 3 IGMs in nine categories of visual reasoning, our results show that failure modes are often shared between models and modalities, but certain failures are model‑specific and modality‑specific, and this can potentially be attributed to various factors. IGMs consistently struggled to manipulate specific visual components in response to prompts, especially in explicit prompts, suggesting poor control over fine‑grained visual attributes. Our findings apply most directly to the evaluation of existing state‑of‑the‑art models on structured visual reasoning tasks. This work lays the foundation for future cross‑modal alignment studies, offering a framework to probe whether generation and interpretation failures stem from shared limitations to guide future improvements in unified vision‑language modeling.
Authors:Gnankan Landry Regis N'guessan
Abstract:
Tropical algebra, including max‑plus, min‑plus, and related idempotent semirings, provides a unifying framework in which many optimization problems that are nonlinear in classical algebra become linear. This property makes tropical methods particularly well suited for shortest paths, scheduling, throughput analysis, and discrete event systems. Despite their theoretical maturity and practical relevance, existing tropical algebra implementations primarily target desktop or server environments and remain largely inaccessible on resource‑constrained embedded platforms, where such optimization problems are most acute. We present PALMA (Parallel Algebra Library for Max‑plus Applications), a lightweight, dependency‑free C library that brings tropical linear algebra to ARM‑based embedded systems. PALMA implements a generic semiring abstraction with SIMD‑accelerated kernels, enabling a single computational framework to support shortest paths, bottleneck paths, reachability, scheduling, and throughput analysis. The library supports five tropical semirings, dense and sparse (CSR) representations, tropical closure, and spectral analysis via maximum cycle mean computation. We evaluate PALMA on a Raspberry Pi 4 and demonstrate peak performance of 2,274 MOPS, speedups of up to 11.9 times over classical Bellman‑Ford for single‑source shortest paths, and sub‑10 microsecond scheduling solves for real‑time control workloads. Case studies in UAV control, IoT routing, and manufacturing systems show that tropical algebra enables efficient, predictable, and unified optimization directly on embedded hardware. PALMA is released as open‑source software under the MIT license.
Authors:Mohamed Amine Ferrag, Abderrahmane Lakas, Merouane Debbah
Abstract:
The rapid advancement of large language models (LLMs) has sparked growing interest in their integration into autonomous systems for reasoning‑driven perception, planning, and decision‑making. However, evaluating and training such agentic AI models remains challenging due to the lack of large‑scale, structured, and safety‑critical benchmarks. This paper introduces AgentDrive, an open benchmark dataset containing 300,000 LLM‑generated driving scenarios designed for training, fine‑tuning, and evaluating autonomous agents under diverse conditions. AgentDrive formalizes a factorized scenario space across seven orthogonal axes: scenario type, driver behavior, environment, road layout, objective, difficulty, and traffic density. An LLM‑driven prompt‑to‑JSON pipeline generates semantically rich, simulation‑ready specifications that are validated against physical and schema constraints. Each scenario undergoes simulation rollouts, surrogate safety metric computation, and rule‑based outcome labeling. To complement simulation‑based evaluation, we introduce AgentDrive‑MCQ, a 100,000‑question multiple‑choice benchmark spanning five reasoning dimensions: physics, policy, hybrid, scenario, and comparative reasoning. We conduct a large‑scale evaluation of fifty leading LLMs on AgentDrive‑MCQ. Results show that while proprietary frontier models perform best in contextual and policy reasoning, advanced open models are rapidly closing the gap in structured and physics‑grounded reasoning. We release the AgentDrive dataset, AgentDrive‑MCQ benchmark, evaluation code, and related materials at https://github.com/maferrag/AgentDrive
Authors:Elias Schuhmacher, Andrianos Michail, Juri Opitz, Rico Sennrich, Simon Clematide
Abstract:
To be discoverable in an embedding‑based search process, each part of a document should be reflected in its embedding representation. To quantify any potential reflection biases, we introduce a permutation‑based evaluation framework. With this, we observe that state‑of‑the‑art embedding models exhibit systematic positional and language biases when documents are longer and consist of multiple segments. Specifically, early segments and segments in higher‑resource languages like English are over‑represented, while later segments and segments in lower‑resource languages are marginalized. In our further analysis, we find that the positional bias stems from front‑loaded attention distributions in pooling‑token embeddings, where early tokens receive more attention. To mitigate this issue, we introduce an inference‑time attention calibration method that redistributes attention more evenly across document positions, increasing discoverabiltiy of later segments. Our evaluation framework and attention calibration is available at https://github.com/impresso/fair‑sentence‑transformers
Authors:Yuzhen Shi, Huanghai Liu, Yiran Hu, Gaojie Song, Xinran Xu, Yubo Ma, Tianyi Tang, Li Zhang, Qingjing Chen, Di Feng, Wenbo Lv, Weiheng Wu, Kexin Yang, Sen Yang, Wei Wang, Rongyao Shi, Yuanyang Qiu, Yuemeng Qi, Jingwen Zhang, Xiaoyu Sui, Yifan Chen, Yi Zhang, An Yang, Bowen Yu, Dayiheng Liu, Junyang Lin, Weixing Shen, Bing Zhao, Charles L. A. Clarke, Hu Wei
Abstract:
As large language models (LLMs) are increasingly applied to legal domain‑specific tasks, evaluating their ability to perform legal work in real‑world settings has become essential. However, existing legal benchmarks rely on simplified and highly standardized tasks, failing to capture the ambiguity, complexity, and reasoning demands of real legal practice. Moreover, prior evaluations often adopt coarse, single‑dimensional metrics and do not explicitly assess fine‑grained legal reasoning. To address these limitations, we introduce PLawBench, a Practical Law Benchmark designed to evaluate LLMs in realistic legal practice scenarios. Grounded in real‑world legal workflows, PLawBench models the core processes of legal practitioners through three task categories: public legal consultation, practical case analysis, and legal document generation. These tasks assess a model's ability to identify legal issues and key facts, perform structured legal reasoning, and generate legally coherent documents. PLawBench comprises 850 questions across 13 practical legal scenarios, with each question accompanied by expert‑designed evaluation rubrics, resulting in approximately 12,500 rubric items for fine‑grained assessment. Using an LLM‑based evaluator aligned with human expert judgments, we evaluate 10 state‑of‑the‑art LLMs. Experimental results show that none achieves strong performance on PLawBench, revealing substantial limitations in the fine‑grained legal reasoning capabilities of current LLMs and highlighting important directions for future evaluation and development of legal LLMs. Data is available at: https://github.com/skylenage/PLawbench.
Authors:Lin Huang, Chengxiang Huang, Ziang Wang, Yiyue Du, Chu Wang, Haocheng Lu, Yunyang Li, Xiaoli Liu, Arthur Jiang, Jia Zhang
Abstract:
Equivariant Graph Neural Networks (EGNNs) have become a widely used approach for modeling 3D atomistic systems. However, mainstream architectures face critical scalability bottlenecks due to the explicit construction of geometric features or dense tensor products on every edge. To overcome this, we introduce E2Former‑V2, a scalable architecture that integrates algebraic sparsity with hardware‑aware execution. We first propose Equivariant Axis‑Aligned Sparsification (EAAS). EAAS builds on Wigner‑6j convolution by exploiting an \mathrmSO(3) \rightarrow \mathrmSO(2) change of basis to transform computationally expensive dense tensor contractions into efficient, sparse parity re‑indexing operations. Building on this representation, we introduce On‑the‑Fly Equivariant Attention, a fully node‑centric mechanism implemented via a custom fused Triton kernel. By eliminating materialized edge tensors and maximizing SRAM utilization, our kernel achieves a 20× improvement in TFLOPS compared to standard implementations. Extensive experiments on the SPICE and OMol25 datasets demonstrate that E2Former‑V2 maintains comparable predictive performance while notably accelerating inference. This work demonstrates that large equivariant transformers can be trained efficiently using widely accessible GPU platforms. The code is avalible at https://github.com/IQuestLab/UBio‑MolFM/tree/e2formerv2.
Authors:Jongmin Yu, Hyeontaek Oh, Zhongtian Sun, Angelica I Aviles-Rivero, Moongu Jeon, Jinhong Yang
Abstract:
Existing face‑swapping methods often deliver competitive results in constrained settings but exhibit substantial quality degradation when handling extreme facial poses. To improve facial pose robustness, explicit geometric features are applied, but this approach remains problematic since it introduces additional dependencies and increases computational cost. Diffusion‑based methods have achieved remarkable results; however, they are impractical for real‑time processing. We introduce AlphaFace, which leverages an open‑source vision‑language model and CLIP image and text embeddings to apply novel visual and textual semantic contrastive losses. AlphaFace enables stronger identity representation and more precise attribute preservation, all while maintaining real‑time performance. Comprehensive experiments across FF++, MPIE, and LPFF demonstrate that AlphaFace surpasses state‑of‑the‑art methods in pose‑challenging cases. The project is publicly available on `https://github.com/andrewyu90/Alphaface_Official.git'.
Authors:Toni J. B. Liu, Baran Zadeoğlu, Nicolas Boullé, Raphaël Sarfati, Christopher J. Earls
Abstract:
Large language models (LLMs) make next‑token predictions based on clues present in their context, such as semantic descriptions and in‑context examples. Yet, elucidating which prior tokens most strongly influence a given prediction remains challenging due to the proliferation of layers and attention heads in modern architectures. We propose Jacobian Scopes, a suite of gradient‑based, token‑level causal attribution methods for interpreting LLM predictions. By analyzing the linearized relations of final hidden state with respect to inputs, Jacobian Scopes quantify how input tokens influence a model's prediction. We introduce three variants ‑ Semantic, Fisher, and Temperature Scopes ‑ which respectively target sensitivity of specific logits, the full predictive distribution, and model confidence (inverse temperature). Through case studies spanning instruction understanding, translation and in‑context learning (ICL), we uncover interesting findings, such as when Jacobian Scopes point to implicit political biases. We believe that our proposed methods also shed light on recently debated mechanisms underlying in‑context time‑series forecasting. Our code and interactive demonstrations are publicly available at https://github.com/AntonioLiu97/JacobianScopes.
Authors:Zubair Islam, Mohamed El-Darieby
Abstract:
Simulating and validating coordination among multiple autonomous vehicles (AVs) is a challenging task as most existing simulation architectures are limited to single‑vehicle operation or rely on centralized control. This paper presents a Distributed Multi‑AV Architecture (DMAVA) that enables synchronized, real‑time autonomous driving simulation across multiple physical hosts. Each vehicle runs its own complete AV stack and operates independently from other AVs. The vehicles in the simulation maintain synchronized coordination through a low‑latency data‑centric communication layer. The proposed system integrates ROS 2 Humble, Autoware Universe, AWSIM Labs, and Zenoh to support concurrent execution of multiple Autoware stacks within a shared Unity‑based environment. Experiments conducted on multiple‑host configurations demonstrate stable localization, reliable inter‑host communication, and fully synchronized closed‑loop control. The DMAVA also serves as a foundation for Multi‑Vehicle Autonomous Valet Parking, demonstrating its extensibility toward higher‑level cooperative autonomy. Demo videos and source code are available at: https://github.com/zubxxr/distributed‑multi‑autonomous‑vehicle‑architecture.
Authors:Zubair Islam, Mohamed El-Darieby
Abstract:
This paper presents the DMV‑AVP System, a distributed simulation of Multi‑Vehicle Autonomous Valet Parking (AVP). The system was implemented as an application of the Distributed Multi‑Vehicle Architecture (DMAVA) for synchronized multi‑host execution. Most existing simulation approaches rely on centralized or non‑distributed designs that constrain scalability and limit fully autonomous control. This work introduces two modules built on top of the DMAVA: 1) a Multi‑Vehicle AVP Node that performs state‑based coordination, queuing, and reservation management across multiple vehicles, and 2) a Unity‑Integrated YOLOv5 Parking Spot Detection Module that provides real‑time, vision‑based perception within AWSIM Labs. Both modules integrate seamlessly with the DMAVA and extend it specifically for multi‑vehicle AVP operation, supported by a Zenoh‑based communication layer that ensures low‑latency topic synchronization and coordinated behavior across hosts. Experiments conducted on two‑ and three‑host configurations demonstrate deterministic coordination, conflict‑free parking behavior, and scalable performance across distributed Autoware instances. The results confirm that the proposed Distributed Multi‑Vehicle AVP System supports cooperative AVP simulation and establishes a foundation for future real‑world and hardware‑in‑the‑loop validation. Demo videos and source code are available at https://github.com/zubxxr/multi‑vehicle‑avp
Authors:Dohun Lee, Chun-Hao Paul Huang, Xuelin Chen, Jong Chul Ye, Duygu Ceylan, Hyeonho Jeong
Abstract:
Recent foundational video‑to‑video diffusion models have achieved impressive results in editing user provided videos by modifying appearance, motion, or camera movement. However, real‑world video editing is often an iterative process, where users refine results across multiple rounds of interaction. In this multi‑turn setting, current video editors struggle to maintain cross‑consistency across sequential edits. In this work, we tackle, for the first time, the problem of cross‑consistency in multi‑turn video editing and introduce Memory‑V2V, a simple, yet effective framework that augments existing video‑to‑video models with explicit memory. Given an external cache of previously edited videos, Memory‑V2V employs accurate retrieval and dynamic tokenization strategies to condition the current editing step on prior results. To further mitigate redundancy and computational overhead, we propose a learnable token compressor within the DiT backbone that compresses redundant conditioning tokens while preserving essential visual cues, achieving an overall speedup of 30%. We validate Memory‑V2V on challenging tasks including video novel view synthesis and text‑conditioned long video editing. Extensive experiments show that Memory‑V2V produces videos that are significantly more cross‑consistent with minimal computational overhead, while maintaining or even improving task‑specific performance over state‑of‑the‑art baselines. Project page: https://dohunlee1.github.io/MemoryV2V
Authors:Bing Xu, Terry Chen, Fengzhe Zhou, Tianqi Chen, Yangqing Jia, Vinod Grover, Haicheng Wu, Wei Liu, Craig Wittenbrink, Wen-mei Hwu, Roger Bringmann, Ming-Yu Liu, Luis Ceze, Michael Lightstone, Humphrey Shi
Abstract:
VIBETENSOR is an open‑source research system software stack for deep learning, generated by LLM‑powered coding agents under high‑level human guidance. In this paper, "fully generated" refers to code provenance: implementation changes were produced and applied as agent‑proposed diffs; validation relied on agent‑run builds, tests, and differential checks, without per‑change manual diff review. It implements a PyTorch‑style eager tensor library with a C++20 core (CPU+CUDA), a torch‑like Python overlay via nanobind, and an experimental Node.js/TypeScript interface. Unlike thin bindings, VIBETENSOR includes its own tensor/storage system, schema‑lite dispatcher, reverse‑mode autograd, CUDA runtime (streams/events/graphs), a stream‑ordered caching allocator with diagnostics, and a stable C ABI for dynamically loaded operator plugins. We view this release as a milestone for AI‑assisted software engineering: it shows coding agents can generate a coherent deep learning runtime spanning language bindings down to CUDA memory management, validated primarily by builds and tests. We describe the architecture, summarize the workflow used to produce and validate the system, and evaluate the artifact. We report repository scale and test‑suite composition, and summarize reproducible microbenchmarks from an accompanying AI‑generated kernel suite, including fused attention versus PyTorch SDPA/FlashAttention. We also report end‑to‑end training sanity checks on 3 small workloads (sequence reversal, ViT, miniGPT) on NVIDIA H100 (Hopper, SM90) and Blackwell‑class GPUs; multi‑GPU results are Blackwell‑only and use an optional CUTLASS‑based ring‑allreduce plugin gated on CUDA 13+ and sm103a toolchain support. Finally, we discuss failure modes in generated system software, including a "Frankenstein" composition effect where locally correct subsystems interact to yield globally suboptimal performance.
Authors:Geo Ahn, Inwoong Lee, Taeoh Kim, Minho Shim, Dongyoon Wee, Jinwoo Choi
Abstract:
We study Compositional Video Understanding (CVU), where models must recognize verbs and objects and compose them to generalize to unseen combinations. We find that existing Zero‑Shot Compositional Action Recognition (ZS‑CAR) models fail primarily due to an overlooked failure mode: object‑driven verb shortcuts. Through systematic analysis, we show that this behavior arises from two intertwined factors: severe sparsity and skewness of compositional supervision, and the asymmetric learning difficulty between verbs and objects. As training progresses, the existing ZS‑CAR model increasingly ignores visual evidence and overfits to co‑occurrence statistics. Consequently, the existing model does not gain the benefit of compositional recognition in unseen verb‑object compositions. To address this, we propose RCORE, a simple and effective framework that enforces temporally grounded verb learning. RCORE introduces (i) a composition‑aware augmentation that diversifies verb‑object combinations without corrupting motion cues, and (ii) a temporal order regularization loss that penalizes shortcut behaviors by explicitly modeling temporal structure. Across two benchmarks, Sth‑com and our newly constructed EK100‑com, RCORE significantly improves unseen composition accuracy, reduces reliance on co‑occurrence bias, and achieves consistently positive compositional gaps. Our findings reveal object‑driven shortcuts as a critical limiting factor in ZS‑CAR and demonstrate that addressing them is essential for robust compositional video understanding.
Authors:Mert Yuksekgonul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang, Jan Kautz, Yejin Choi, James Zou, Carlos Guestrin, Yu Sun
Abstract:
How can we use AI to discover a new state of the art for a scientific problem? Prior work in test‑time scaling, such as AlphaEvolve, performs search by prompting a frozen LLM. We perform reinforcement learning at test time, so the LLM can continue to train, but now with experience specific to the test problem. This form of continual learning is quite special, because its goal is to produce one great solution rather than many good ones on average, and to solve this very problem rather than generalize to other problems. Therefore, our learning objective and search subroutine are designed to prioritize the most promising solutions. We call this method Test‑Time Training to Discover (TTT‑Discover). Following prior work, we focus on problems with continuous rewards. We report results for every problem we attempted, across mathematics, GPU kernel engineering, algorithm design, and biology. TTT‑Discover sets the new state of the art in almost all of them: (i) Erdős' minimum overlap problem and an autocorrelation inequality; (ii) a GPUMode kernel competition (up to 2× faster than prior art); (iii) past AtCoder algorithm competitions; and (iv) denoising problem in single‑cell analysis. Our solutions are reviewed by experts or the organizers. All our results are achieved with an open model, OpenAI gpt‑oss‑120b, and can be reproduced with our publicly available code, in contrast to previous best results that required closed frontier models. Our test‑time training runs are performed using Tinker, an API by Thinking Machines, with a cost of only a few hundred dollars per problem.
Authors:Sylvestre-Alvise Rebuffi, Tuan Tran, Valeriu Lacatusu, Pierre Fernandez, Tomáš Souček, Nikola Jovanović, Tom Sander, Hady Elsahar, Alexandre Mourachko
Abstract:
Existing approaches for watermarking AI‑generated images often rely on post‑hoc methods applied in pixel space, introducing computational overhead and potential visual artifacts. In this work, we explore latent space watermarking and introduce DistSeal, a unified approach for latent watermarking that works across both diffusion and autoregressive models. Our approach works by training post‑hoc watermarking models in the latent space of generative models. We demonstrate that these latent watermarkers can be effectively distilled either into the generative model itself or into the latent decoder, enabling in‑model watermarking. The resulting latent watermarks achieve competitive robustness while offering similar imperceptibility and up to 20x speedup compared to pixel‑space baselines. Our experiments further reveal that distilling latent watermarkers outperforms distilling pixel‑space ones, providing a solution that is both more efficient and more robust.
Authors:Sukesh Subaharan
Abstract:
Large language model (LLM) agents often exhibit abrupt shifts in tone and persona during extended interaction, reflecting the absence of explicit temporal structure governing agent‑level state. While prior work emphasizes turn‑local sentiment or static emotion classification, the role of explicit affective dynamics in shaping long‑horizon agent behavior remains underexplored. This work investigates whether imposing dynamical structure on an external affective state can induce temporal coherence and controlled recovery in multi‑turn dialogue. We introduce an agent‑level affective subsystem that maintains a continuous Valence‑Arousal‑Dominance (VAD) state external to the language model and governed by first‑ and second‑order update rules. Instantaneous affective signals are extracted using a fixed, memoryless estimator and integrated over time via exponential smoothing or momentum‑based dynamics. The resulting affective state is injected back into generation without modifying model parameters. Using a fixed 25‑turn dialogue protocol, we compare stateless, first‑order, and second‑order affective dynamics. Stateless agents fail to exhibit coherent trajectories or recovery, while state persistence enables delayed responses and reliable recovery. Second‑order dynamics introduce affective inertia and hysteresis that increase with momentum, revealing a trade‑off between stability and responsiveness.
Authors:Olga Bunkova, Lorenzo Di Fruscia, Sophia Rupprecht, Artur M. Schweidtmann, Marcel J. T. Reinders, Jana M. Weber
Abstract:
Large Language Models (LLMs) can aid synthesis planning in chemistry, but standard prompting methods often yield hallucinated or outdated suggestions. We study LLM interactions with a reaction knowledge graph by casting reaction path retrieval as a Text2Cypher (natural language to graph query) generation problem, and define single‑ and multi‑step retrieval tasks. We compare zero‑shot prompting to one‑shot variants using static, random, and embedding‑based exemplar selection, and assess a checklist‑driven validator/corrector loop. To evaluate our framework, we consider query validity and retrieval accuracy. We find that one‑shot prompting with aligned exemplars consistently performs best. Our checklist‑style self‑correction loop mainly improves executability in zero‑shot settings and offers limited additional retrieval gains once a good exemplar is present. We provide a reproducible Text2Cypher evaluation setup to facilitate further work on KG‑grounded LLMs for synthesis planning. Code is available at https://github.com/Intelligent‑molecular‑systems/KG‑LLM‑Synthesis‑Retrieval.
Authors:Tianjun Wei, Enneng Yang, Yingpeng Du, Huizhong Guo, Jie Zhang, Zhu Sun
Abstract:
Model merging (MM) offers an efficient mechanism for integrating multiple specialized models without access to original training data or costly retraining. While MM has demonstrated success in domains like computer vision, its role in recommender systems (RSs) remains largely unexplored. Recently, Generative Recommendation (GR) has emerged as a new paradigm in RSs, characterized by rapidly growing model scales and substantial computational costs, making MM particularly appealing for cost‑sensitive deployment scenarios. In this work, we present the first systematic study of MM in GR through a contextual lens. We focus on a fundamental yet underexplored challenge in real‑world: how to merge generative recommenders specialized to different real‑world contexts, arising from temporal evolving user behaviors and heterogeneous application domains. To this end, we propose a unified framework MMGRid, a structured contextual grid of GR checkpoints that organizes models trained under diverse contexts induced by temporal evolution and domain diversity. All checkpoints are derived from a shared base LLM but fine‑tuned on context‑specific data, forming a realistic and controlled model space for systematically analyzing MM across GR paradigms and merging algorithms. Our investigation reveals several key insights. First, training GR models from LLMs can introduce parameter conflicts during merging due to token distribution shifts and objective disparities; such conflicts can be alleviated by disentangling task‑aware and context‑specific parameter changes via base model replacement. Second, incremental training across contexts induces recency bias, which can be effectively balanced through weighted contextual merging. Notably, we observe that optimal merging weights correlate with context‑dependent interaction characteristics, offering practical guidance for weight selection in real‑world deployments.
Authors:Qilong Yan, Yifei Xing, Dugang Liu, Jingpu Duan, Jian Yin
Abstract:
Contemporary sequential recommendation methods are becoming more complex, shifting from classification to a diffusion‑guided generative paradigm. However, the quality of guidance in the form of user information is often compromised by missing data in the observed sequences, leading to suboptimal generation quality. Existing methods address this by removing locally similar items, but overlook ``critical turning points'' in user interest, which are crucial for accurately predicting subsequent user intent. To address this, we propose a novel Counterfactual Attention Regulation Diffusion model (CARD), which focuses on amplifying the signal from key interest‑turning‑point items while concurrently identifying and suppressing noise within the user sequence. CARD consists of (1) a Dual‑side Thompson Sampling method to identify sequences undergoing significant interest shift, and (2) a counterfactual attention mechanism for these sequences to quantify the importance of each item. In this manner, CARD provides the diffusion model with a high‑quality guidance signal composed of dynamically re‑weighted interaction vectors to enable effective generation. Experiments show our method works well on real‑world data without being computationally expensive. Our code is available at https://github.com/yanqilong3321/CARD.
Authors:Huayu Li, ZhengXiao He, Siyuan Tian, Jinghao Wen, Ao Li
Abstract:
Standard autoregressive decoding in large language models (LLMs) is inherently short‑sighted, often failing to find globally optimal reasoning paths due to its token‑by‑token generation process. While inference‑time strategies like foresight sampling attempt to mitigate this by simulating future steps, they typically rely on ad‑hoc heuristics for valuing paths and pruning the search space. This paper introduces Martingale Foresight Sampling (MFS), a principled framework that reformulates LLM decoding as a problem of identifying an optimal stochastic process. By modeling the quality of a reasoning path as a stochastic process, we leverage Martingale theory to design a theoretically‑grounded algorithm. Our approach replaces heuristic mechanisms with principles from probability theory: step valuation is derived from the Doob Decomposition Theorem to measure a path's predictable advantage, path selection uses Optional Stopping Theory for principled pruning of suboptimal candidates, and an adaptive stopping rule based on the Martingale Convergence Theorem terminates exploration once a path's quality has provably converged. Experiments on six reasoning benchmarks demonstrate that MFS surpasses state‑of‑the‑art methods in accuracy while significantly improving computational efficiency. Code will be released at https://github.com/miraclehetech/EACL2026‑Martingale‑Foresight‑Sampling.
Authors:Sydney Anuyah, Sneha Shajee-Mohan, Ankit-Singh Chauhan, Sunandan Chakraborty
Abstract:
The safe deployment of large language models (LLMs) in high‑stakes fields like biomedicine, requires them to be able to reason about cause and effect. We investigate this ability by testing 13 open‑source LLMs on a fundamental task: pairwise causal discovery (PCD) from text. Our benchmark, using 12 diverse datasets, evaluates two core skills: 1) Causal Detection (identifying if a text contains a causal link) and 2) Causal Extraction (pulling out the exact cause and effect phrases). We tested various prompting methods, from simple instructions (zero‑shot) to more complex strategies like Chain‑of‑Thought (CoT) and Few‑shot In‑Context Learning (FICL). The results show major deficiencies in current models. The best model for detection, DeepSeek‑R1‑Distill‑Llama‑70B, only achieved a mean score of 49.57% (C_detect), while the best for extraction, Qwen2.5‑Coder‑32B‑Instruct, reached just 47.12% (C_extract). Models performed best on simple, explicit, single‑sentence relations. However, performance plummeted for more difficult (and realistic) cases, such as implicit relationships, links spanning multiple sentences, and texts containing multiple causal pairs. We provide a unified evaluation framework, built on a dataset validated with high inter‑annotator agreement (κ\ge 0.758), and make all our data, code, and prompts publicly available to spur further research. \hrefhttps://github.com/sydneyanuyah/CausalDiscoveryCode available here: https://github.com/sydneyanuyah/CausalDiscovery
Authors:Md Nabi Newaz Khan, Abdullah Arafat Miah, Yu Bi
Abstract:
Graph neural network (GNN) have demonstrated exceptional performance in solving critical problems across diverse domains yet remain susceptible to backdoor attacks. Existing studies on backdoor attack for graph classification are limited to single target attack using subgraph replacement based mechanism where the attacker implants only one trigger into the GNN model. In this paper, we introduce the first multi‑targeted backdoor attack for graph classification task, where multiple triggers simultaneously redirect predictions to different target labels. Instead of subgraph replacement, we propose subgraph injection which preserves the structure of the original graphs while poisoning the clean graphs. Extensive experiments demonstrate the efficacy of our approach, where our attack achieves high attack success rates for all target labels with minimal impact on the clean accuracy. Experimental results on five dataset demonstrate the superior performance of our attack framework compared to the conventional subgraph replacement‑based attack. Our analysis on four GNN models confirms the generalization capability of our attack which is effective regardless of the GNN model architectures and training parameters settings. We further investigate the impact of the attack design parameters including injection methods, number of connections, trigger sizes, trigger edge density and poisoning ratios. Additionally, our evaluation against state‑of‑the‑art defenses (randomized smoothing and fine‑pruning) demonstrates the robustness of our proposed multi‑target attacks. This work highlights the GNN vulnerability against multi‑targeted backdoor attack in graph classification task. Our source codes will be available at https://github.com/SiSL‑URI/Multi‑Targeted‑Graph‑Backdoor‑Attack.
Authors:Fahd Seddik, Abdulrahman Elbedewy, Gaser Sami, Mohamed Abdelmoniem, Yahia Zakaria
Abstract:
Training modern deep learning models is increasingly constrained by GPU memory and compute limits. While Randomized Numerical Linear Algebra (RandNLA) offers proven techniques to compress these models, the lack of a unified, production‑grade library prevents widely adopting these methods. We present Panther, a PyTorch‑compatible library that consolidates established RandNLA algorithms into a single high‑performance framework. Panther engineers efficient, drop‑in replacements for standard components including sketched linear layers, 2D convolution, multi‑head attention, and randomized matrix decompositions (such as pivoted CholeskyQR). By implementing a custom C++/CUDA backend (pawX), Panther provides an optimized implementation that can run on both CPUs and GPUs. We demonstrate the effectiveness of RandNLA techniques and Panther's ease of adoption. By replacing standard PyTorch linear layers with Panther layers (requiring only a few lines of code) we achieve significant memory savings (up to 75%) on BERT while maintaining comparable loss. Source code is available (MIT License) at https://github.com/FahdSeddik/panther, along with demonstration video at https://youtu.be/7M3RQb4KWxs.
Authors:Pablo Messina, Andrés Villa, Juan León Alcázar, Karen Sánchez, Carlos Hinojosa, Denis Parra, Álvaro Soto, Bernard Ghanem
Abstract:
Medical vision‑language models can automate the generation of radiology reports but struggle with accurate visual grounding and factual consistency. Existing models often misalign textual findings with visual evidence, leading to unreliable or weakly grounded predictions. We present CURE, an error‑aware curriculum learning framework that improves grounding and report quality without any additional data. CURE fine‑tunes a multimodal instructional model on phrase grounding, grounded report generation, and anatomy‑grounded report generation using public datasets. The method dynamically adjusts sampling based on model performance, emphasizing harder samples to improve spatial and textual alignment. CURE improves grounding accuracy by +0.37 IoU, boosts report quality by +0.188 CXRFEScore, and reduces hallucinations by 18.6%. CURE is a data‑efficient framework that enhances both grounding accuracy and report reliability. Code is available at https://github.com/PabloMessina/CURE and model weights at https://huggingface.co/pamessina/medgemma‑4b‑it‑cure
Authors:Francesca Pia Panaccione, Carlo Sgaravatti, Pietro Pinoli
Abstract:
Biomedical research increasingly relies on integrating diverse data modalities, including gene expression profiles, medical images, and clinical metadata. While medical images and clinical metadata are routinely collected in clinical practice, gene expression data presents unique challenges for widespread research use, mainly due to stringent privacy regulations and costly laboratory experiments. To address these limitations, we present GeMM‑GAN, a novel Generative Adversarial Network conditioned on histopathology tissue slides and clinical metadata, designed to synthesize realistic gene expression profiles. GeMM‑GAN combines a Transformer Encoder for image patches with a final Cross Attention mechanism between patches and text tokens, producing a conditioning vector to guide a generative model in generating biologically coherent gene expression profiles. We evaluate our approach on the TCGA dataset and demonstrate that our framework outperforms standard generative models and generates more realistic and functionally meaningful gene expression profiles, improving by more than 11% the accuracy on downstream disease type prediction compared to current state‑of‑the‑art generative models. Code will be available at: https://github.com/francescapia/GeMM‑GAN
Authors:Daniel Brownell
Abstract:
Continuous attractor networks (CANs) are a well‑established class of models for representing low‑dimensional continuous variables such as head direction, spatial position, and phase. In canonical spatial domains, transitions along the attractor manifold are driven by continuous displacement signals, such as angular velocity‑provided by sensorimotor systems external to the CAN itself. When such signals are not explicitly provided as dedicated displacement inputs, it remains unclear whether attractor‑based circuits can reliably acquire recurrent dynamics that support stable state transitions, or whether alternative predictive strategies dominate. In this work, we present an experimental framework for training CANs to perform successor‑like transitions between stable attractor states in the absence of externally provided displacement signals. We compare two recurrent topologies, a circular ring and a folded snake manifold, and systematically vary the temporal regime under which stability is evaluated. We find that, under short evaluation windows, networks consistently converge to impulse‑driven associative solutions that achieve high apparent accuracy yet lack persistent attractor dynamics. Only when stability is explicitly enforced over extended free‑run periods do genuine attractor‑based transition dynamics emerge. This suggests that shortcut solutions are the default outcome of local learning in recurrent networks, while attractor dynamics represent a constrained regime rather than a generic result. Furthermore, we demonstrate that topology strictly limits the capacity for learned transitions. While the continuous ring topology achieves perfect stability over long horizons, the folded snake topology hits a geometric limit characterized by failure at manifold discontinuities, which neither curriculum learning nor basal ganglia‑inspired gating can fully overcome.
Authors:Rishit Chugh
Abstract:
The deployment of large language models (LLMs) has raised security concerns due to their susceptibility to producing harmful or policy‑violating outputs when exposed to adversarial prompts. While alignment and guardrails mitigate common misuse, they remain vulnerable to automated jailbreaking methods such as GCG, PEZ, and GBDA, which generate adversarial suffixes via training and gradient‑based search. Although effective, these methods particularly GCG are computationally expensive, limiting their practicality for organisations with constrained resources. This paper introduces a resource‑efficient adversarial prompting approach that eliminates the need for retraining by matching new prompts to a database of pre‑trained adversarial prompts. A dataset of 1,000 prompts was classified into seven harm‑related categories, and GCG, PEZ, and GBDA were evaluated on a Llama 3 8B model to identify the most effective attack method per category. Results reveal a correlation between prompt type and algorithm effectiveness. By retrieving semantically similar successful adversarial prompts, the proposed method achieves competitive attack success rates with significantly reduced computational cost. This work provides a practical framework for scalable red‑teaming and security evaluation of aligned LLMs, including in settings where model internals are inaccessible.
Authors:Deyun Zhang, Jun Li, Shijia Geng, Yue Wang, Shijie Chen, Sumei Fan, Qinghao Zha, Shenda Hong
Abstract:
Background: Conventional electrocardiogram (ECG) analysis faces a persistent dichotomy: expert‑driven features ensure interpretability but lack sensitivity to latent patterns, while deep learning offers high accuracy but functions as a black box with high data dependency. We introduce ECGomics, a systematic paradigm and open‑source platform for the multidimensional deconstruction of cardiac signals into digital biomarker. Methods: Inspired by the taxonomic rigor of genomics, ECGomics deconstructs cardiac activity across four dimensions: Structural, Intensity, Functional, and Comparative. This taxonomy synergizes expert‑defined morphological rules with data‑driven latent representations, effectively bridging the gap between handcrafted features and deep learning embeddings. Results: We operationalized this framework into a scalable ecosystem consisting of a web‑based research platform and a mobile‑integrated solution (https://github.com/PKUDigitalHealth/ECGomics). The web platform facilitates high‑throughput analysis via precision parameter configuration, high‑fidelity data ingestion, and 12‑lead visualization, allowing for the systematic extraction of biomarkers across the four ECGomics dimensions. Complementarily, the mobile interface, integrated with portable sensors and a cloud‑based engine, enables real‑time signal acquisition and near‑instantaneous delivery of structured diagnostic reports. This dual‑interface architecture successfully transitions ECGomics from theoretical discovery to decentralized, real‑world health management, ensuring professional‑grade monitoring in diverse clinical and home‑based settings. Conclusion: ECGomics harmonizes diagnostic precision, interpretability, and data efficiency. By providing a deployable software ecosystem, this paradigm establishes a robust foundation for digital biomarker discovery and personalized cardiovascular medicine.
Authors:Raffi Khatchadourian
Abstract:
LLM agents struggle with regulatory audit replay: when asked to reproduce a flagged transaction decision with identical inputs, most deployments fail to return consistent results. This paper introduces the Determinism‑Faithfulness Assurance Harness (DFAH), a framework for measuring trajectory determinism and evidence‑conditioned faithfulness in tool‑using agents deployed in financial services. Across 74 configurations (12 models, 4 providers, 8‑24 runs each at T=0.0) in non‑agentic baseline experiments, 7‑20B parameter models achieved 100% determinism, while 120B+ models required 3.7x larger validation samples to achieve equivalent statistical reliability. Agentic tool‑use introduces additional variance (see Tables 4‑7). Contrary to the assumed reliability‑capability trade‑off, a positive Pearson correlation emerged (r = 0.45, p < 0.01, n = 51 at T=0.0) between determinism and faithfulness; models producing consistent outputs also tended to be more evidence‑aligned. Three financial benchmarks are provided (compliance triage, portfolio constraints, DataOps exceptions; 50 cases each) along with an open‑source stress‑test harness. In these benchmarks and under DFAH evaluation settings, Tier 1 models with schema‑first architectures achieved determinism levels consistent with audit replay requirements.
Authors:Wei Ai, Yilong Tan, Yuntao Shou, Tao Meng, Haowen Chen, Zhixiong He, Keqin Li
Abstract:
In recent years, the rapid evolution of large vision‑language models (LVLMs) has driven a paradigm shift in multimodal fake news detection (MFND), transforming it from traditional feature‑engineering approaches to unified, end‑to‑end multimodal reasoning frameworks. Early methods primarily relied on shallow fusion techniques to capture correlations between text and images, but they struggled with high‑level semantic understanding and complex cross‑modal interactions. The emergence of LVLMs has fundamentally changed this landscape by enabling joint modeling of vision and language with powerful representation learning, thereby enhancing the ability to detect misinformation that leverages both textual narratives and visual content. Despite these advances, the field lacks a systematic survey that traces this transition and consolidates recent developments. To address this gap, this paper provides a comprehensive review of MFND through the lens of LVLMs. We first present a historical perspective, mapping the evolution from conventional multimodal detection pipelines to foundation model‑driven paradigms. Next, we establish a structured taxonomy covering model architectures, datasets, and performance benchmarks. Furthermore, we analyze the remaining technical challenges, including interpretability, temporal reasoning, and domain generalization. Finally, we outline future research directions to guide the next stage of this paradigm shift. To the best of our knowledge, this is the first comprehensive survey to systematically document and analyze the transformative role of LVLMs in combating multimodal fake news. The summary of existing methods mentioned is in our Github: \hrefhttps://github.com/Tan‑YiLong/Overview‑of‑Fake‑News‑Detectionhttps://github.com/Tan‑YiLong/Overview‑of‑Fake‑News‑Detection.
Authors:Jivnesh Sandhan, Harshit Jaiswal, Fei Cheng, Yugo Murawaki
Abstract:
The rapid adoption of LLMs has increased the need for reliable AI text detection, yet existing detectors often fail outside controlled benchmarks. We systematically evaluate 2 dominant paradigms (training‑free and supervised) and show that both are brittle under distribution shift, unseen generators, and simple stylistic perturbations. To address these limitations, we propose a supervised contrastive learning (SCL) framework that learns discriminative style embeddings. Experiments show that while supervised detectors excel in‑domain, they degrade sharply out‑of‑domain, and training‑free methods remain highly sensitive to proxy choice. Overall, our results expose fundamental challenges in building domain‑agnostic detectors. Our code is available at: https://github.com/HARSHITJAIS14/DetectAI
Authors:Yufan Deng, Zilin Pan, Hongyu Zhang, Xiaojie Li, Ruoqing Hu, Yufei Ding, Yiming Zou, Yan Zeng, Daquan Zhou
Abstract:
Video generation models have significantly advanced embodied intelligence, unlocking new possibilities for generating diverse robot data that capture perception, reasoning, and action in the physical world. However, synthesizing high‑quality videos that accurately reflect real‑world robotic interactions remains challenging, and the lack of a standardized benchmark limits fair comparisons and progress. To address this gap, we introduce a comprehensive robotics benchmark, RBench, designed to evaluate robot‑oriented video generation across five task domains and four distinct embodiments. It assesses both task‑level correctness and visual fidelity through reproducible sub‑metrics, including structural consistency, physical plausibility, and action completeness. Evaluation of 25 representative models highlights significant deficiencies in generating physically realistic robot behaviors. Furthermore, the benchmark achieves a Spearman correlation coefficient of 0.96 with human evaluations, validating its effectiveness. While RBench provides the necessary lens to identify these deficiencies, achieving physical realism requires moving beyond evaluation to address the critical shortage of high‑quality training data. Driven by these insights, we introduce a refined four‑stage data pipeline, resulting in RoVid‑X, the largest open‑source robotic dataset for video generation with 4 million annotated video clips, covering thousands of tasks and enriched with comprehensive physical property annotations. Collectively, this synergistic ecosystem of evaluation and data establishes a robust foundation for rigorous assessment and scalable training of video models, accelerating the evolution of embodied AI toward general intelligence.
Authors:Zanlin Ni, Shenzhi Wang, Yang Yue, Tianyu Yu, Weilin Zhao, Yeguo Hua, Tianyi Chen, Jun Song, Cheng Yu, Bo Zheng, Gao Huang
Abstract:
Diffusion Large Language Models (dLLMs) break the rigid left‑to‑right constraint of traditional LLMs, enabling token generation in arbitrary orders. Intuitively, this flexibility implies a solution space that strictly supersets the fixed autoregressive trajectory, theoretically unlocking superior reasoning potential for general tasks like mathematics and coding. Consequently, numerous works have leveraged reinforcement learning (RL) to elicit the reasoning capability of dLLMs. In this paper, we reveal a counter‑intuitive reality: arbitrary order generation, in its current form, narrows rather than expands the reasoning boundary of dLLMs. We find that dLLMs tend to exploit this order flexibility to bypass high‑uncertainty tokens that are crucial for exploration, leading to a premature collapse of the solution space. This observation motivates a rethink of RL approaches for dLLMs, where considerable complexities, such as handling combinatorial trajectories and intractable likelihoods, are often devoted to preserving this flexibility. We demonstrate that effective reasoning can be better elicited by intentionally forgoing arbitrary order and applying standard Group Relative Policy Optimization (GRPO) instead. Our approach, JustGRPO, is minimalist yet surprisingly effective (e.g., 89.1% accuracy on GSM8K) while fully retaining the parallel decoding ability of dLLMs. Project page: https://nzl‑thu.github.io/the‑flexibility‑trap
Authors:Andrey Moskalenko, Danil Kuznetsov, Irina Dudko, Anastasiia Iasakova, Nikita Boldyrev, Denis Shepelev, Andrei Spiridonov, Andrey Kuznetsov, Vlad Shakhuro
Abstract:
Promptable segmentation models such as SAM have established a powerful paradigm, enabling strong generalization to unseen objects and domains with minimal user input, including points, bounding boxes, and text prompts. Among these, bounding boxes stand out as particularly effective, often outperforming points while significantly reducing annotation costs. However, current training and evaluation protocols typically rely on synthetic prompts generated through simple heuristics, offering limited insight into real‑world robustness. In this paper, we investigate the robustness of promptable segmentation models to natural variations in bounding box prompts. First, we conduct a controlled user study and collect thousands of real bounding box annotations. Our analysis reveals substantial variability in segmentation quality across users for the same model and instance, indicating that SAM‑like models are highly sensitive to natural prompt noise. Then, since exhaustive testing of all possible user inputs is computationally prohibitive, we reformulate robustness evaluation as a white‑box optimization problem over the bounding box prompt space. We introduce BREPS, a method for generating adversarial bounding boxes that minimize or maximize segmentation error while adhering to naturalness constraints. Finally, we benchmark state‑of‑the‑art models across 10 datasets, spanning everyday scenes to medical imaging. Code ‑ https://github.com/emb‑ai/BREPS.
Authors:Oleg Shchendrigin, Egor Cherepanov, Alexey K. Kovalev, Aleksandr I. Panov
Abstract:
Effective decision‑making in the real world depends on memory that is both stable and adaptive: environments change over time, and agents must retain relevant information over long horizons while also updating or overwriting outdated content when circumstances shift. Existing Reinforcement Learning (RL) benchmarks and memory‑augmented agents focus primarily on retention, leaving the equally critical ability of memory rewriting largely unexplored. To address this gap, we introduce a benchmark that explicitly tests continual memory updating under partial observability, i.e. the natural setting where an agent must rely on memory rather than current observations, and use it to compare recurrent, transformer‑based, and structured memory architectures. Our experiments reveal that classic recurrent models, despite their simplicity, demonstrate greater flexibility and robustness in memory rewriting tasks than modern structured memories, which succeed only under narrow conditions, and transformer‑based agents, which often fail beyond trivial retention cases. These findings expose a fundamental limitation of current approaches and emphasize the necessity of memory mechanisms that balance stable retention with adaptive updating. Our work highlights this overlooked challenge, introduces benchmarks to evaluate it, and offers insights for designing future RL agents with explicit and trainable forgetting mechanisms. Code: https://quartz‑admirer.github.io/Memory‑Rewriting/
Authors:Wei-Jaw Lee, Fang-Chih Hsieh, Xuanjun Chen, Fang-Duo Tsai, Yi-Hsuan Yang
Abstract:
Recent advances in text‑to‑music generation (TTM) have yielded high‑quality results, but often at the cost of extensive compute and the use of large proprietary internal data. To improve the affordability and openness of TTM training, an open‑source generative model backbone that is more training‑ and data‑efficient is needed. In this paper, we constrain the number of trainable parameters in the generative model to match that of the MusicGen‑small benchmark (with about 300M parameters), and replace its Transformer backbone with the emerging class of state‑space models (SSMs). Specifically, we explore different SSM variants for sequence modeling, and compare a single‑stage SSM‑based design with a decomposable two‑stage SSM/diffusion hybrid design. All proposed models are trained from scratch on a purely public dataset comprising 457 hours of CC‑licensed music, ensuring full openness. Our experimental findings are three‑fold. First, we show that SSMs exhibit superior training efficiency compared to the Transformer counterpart. Second, despite using only 9% of the FLOPs and 2% of the training data size compared to the MusicGen‑small benchmark, our model achieves competitive performance in both objective metrics and subjective listening tests based on MusicCaps captions. Finally, our scaling‑down experiment demonstrates that SSMs can maintain competitive performance relative to the Transformer baseline even at the same training budget (measured in iterations), when the model size is reduced to four times smaller. To facilitate the democratization of TTM research, the processed captions, model checkpoints, and source code are available on GitHub via the project page: https://lonian6.github.io/ssmttm/.
Authors:Deming Chen, Vijay Ganesh, Weikai Li, Yingyan Celine Lin, Yong Liu, Subhasish Mitra, David Z. Pan, Ruchir Puri, Jason Cong, Yizhou Sun
Abstract:
This report distills the discussions and recommendations from the NSF Workshop on AI for Electronic Design Automation (EDA), held on December 10, 2024 in Vancouver alongside NeurIPS 2024. Bringing together experts across machine learning and EDA, the workshop examined how AI‑spanning large language models (LLMs), graph neural networks (GNNs), reinforcement learning (RL), neurosymbolic methods, etc.‑can facilitate EDA and shorten design turnaround. The workshop includes four themes: (1) AI for physical synthesis and design for manufacturing (DFM), discussing challenges in physical manufacturing process and potential AI applications; (2) AI for high‑level and logic‑level synthesis (HLS/LLS), covering pragma insertion, program transformation, RTL code generation, etc.; (3) AI toolbox for optimization and design, discussing frontier AI developments that could potentially be applied to EDA tasks; and (4) AI for test and verification, including LLM‑assisted verification tools, ML‑augmented SAT solving, security/reliability challenges, etc. The report recommends NSF to foster AI/EDA collaboration, invest in foundational AI for EDA, develop robust data infrastructures, promote scalable compute infrastructure, and invest in workforce development to democratize hardware design and enable next‑generation hardware systems. The workshop information can be found on the website https://ai4eda‑workshop.github.io/.
Authors:Leyi Zhao, Weijie Huang, Yitong Guo, Jiang Bian, Chenghong Wang, Xuhong Zhang
Abstract:
Optimizing scientific computing algorithms for modern GPUs is a labor‑intensive and iterative process involving repeated code modification, benchmarking, and tuning across complex hardware and software stacks. Recent work has explored large language model (LLM)‑assisted evolutionary methods for automated code optimization, but these approaches primarily rely on outcome‑based selection and random mutation, underutilizing the rich trajectory information generated during iterative optimization. We propose PhyloEvolve, an LLM‑agent system that reframes GPU‑oriented algorithm optimization as an In‑Context Reinforcement Learning (ICRL) problem. This formulation enables trajectory‑conditioned reuse of optimization experience without model retraining. PhyloEvolve integrates Algorithm Distillation and prompt‑based Decision Transformers into an iterative workflow, treating sequences of algorithm modifications and performance feedback as first‑class learning signals. To organize optimization history, we introduce a phylogenetic tree representation that captures inheritance, divergence, and recombination among algorithm variants, enabling backtracking, cross‑lineage transfer, and reproducibility. The system combines elite trajectory pooling, multi‑island parallel exploration, and containerized execution to balance exploration and exploitation across heterogeneous hardware. We evaluate PhyloEvolve on scientific computing workloads including PDE solvers, manifold learning, and spectral graph algorithms, demonstrating consistent improvements in runtime, memory efficiency, and correctness over baseline and evolutionary methods. Code is published at: https://github.com/annihi1ation/phylo_evolve
Authors:Yajvan Ravan, Aref Malek, Chester Dolph, Nikhil Behari
Abstract:
High‑altitude, multi‑spectral, aerial imagery is scarce and expensive to acquire, yet it is necessary for algorithmic advances and application of machine learning models to high‑impact problems such as wildfire detection. We introduce a human‑annotated dataset from the NASA Autonomous Modular Sensor (AMS) using 12‑channel, medium to high altitude (3 ‑ 50 km) aerial wildfire images similar to those used in current US wildfire missions. Our dataset combines spectral data from 12 different channels, including infrared (IR), short‑wave IR (SWIR), and thermal. We take imagery from 20 wildfire missions and randomly sample small patches to generate over 4000 images with high variability, including occlusions by smoke/clouds, easily‑confused false positives, and nighttime imagery. We demonstrate results from a deep‑learning model to automate the human‑intensive process of fire perimeter determination. We train two deep neural networks, one for image classification and the other for pixel‑level segmentation. The networks are combined into a unique real‑time segmentation model to efficiently localize active wildfire on an incoming image feed. Our model achieves 96% classification accuracy, 74% Intersection‑over‑Union(IoU), and 84% recall surpassing past methods, including models trained on satellite data and classical color‑rule algorithms. By leveraging a multi‑spectral dataset, our model is able to detect active wildfire at nighttime and behind clouds, while distinguishing between false positives. We find that data from the SWIR, IR, and thermal bands is the most important to distinguish fire perimeters. Our code and dataset can be found here: https://github.com/nasa/Autonomous‑Modular‑Sensor‑Wildfire‑Segmentation/tree/main and https://drive.google.com/drive/folders/1‑u4vs9rqwkwgdeeeoUhftCxrfe_4QPTn?=usp=drive_link
Authors:Yelin Chen, Fanjin Zhang, Suping Sun, Yunhe Pang, Yuanchun Wang, Jian Song, Xiaoyan Li, Lei Hou, Shu Zhao, Jie Tang, Juanzi Li
Abstract:
Understanding research papers remains challenging for foundation models due to specialized scientific discourse and complex figures and tables, yet existing benchmarks offer limited fine‑grained evaluation at scale. To address this gap, we introduce RPC‑Bench, a large‑scale question‑answering benchmark built from review‑rebuttal exchanges of high‑quality computer science papers, containing 15K human‑verified QA pairs. We design a fine‑grained taxonomy aligned with the scientific research flow to assess models' ability to understand and answer why, what, and how questions in scholarly contexts. We also define an elaborate LLM‑human interaction annotation framework to support large‑scale labeling and quality control. Following the LLM‑as‑a‑Judge paradigm, we develop a scalable framework that evaluates models on correctness‑completeness and conciseness, with high agreement to human judgment. Experiments reveal that even the strongest models (GPT‑5) achieve only 68.2% correctness‑completeness, dropping to 37.46% after conciseness adjustment, highlighting substantial gaps in precise academic paper understanding. Our code and data are available at https://rpc‑bench.github.io/.
Authors:Ze-Yu Peng, Hao-Shi Yuan, Qi Lai, Jun-Qian Jiang, Gen Ye, Jun Zhang, Yun-Song Piao
Abstract:
We present DeepInflation, an AI agent designed for research and model discovery in inflationary cosmology. Built upon a multi‑agent architecture, DeepInflation integrates Large Language Models (LLMs) with a symbolic regression (SR) engine and a retrieval‑augmented generation (RAG) knowledge base. This framework enables the agent to automatically explore and verify the vast landscape of inflationary potentials while grounding its outputs in established theoretical literature. We demonstrate that DeepInflation can successfully discover simple and viable single‑field slow‑roll inflationary potentials consistent with the latest observations (here ACT DR6 results as example) or any given n_s and r, and provide accurate theoretical context for obscure inflationary scenarios. DeepInflation serves as a prototype for a new generation of autonomous scientific discovery engines in cosmology, which enables researchers and non‑experts alike to explore the inflationary landscape using natural language. This agent is available at https://github.com/pengzy‑cosmo/DeepInflation.
Authors:Kangyu Zheng, Kai Zhang, Jiale Tan, Xuehan Chen, Yingzhou Lu, Zaixi Zhang, Lichao Sun, Marinka Zitnik, Tianfan Fu, Zhiding Liang
Abstract:
Currently, the field of structure‑based drug design is dominated by three main types of algorithms: search‑based algorithms, deep generative models, and reinforcement learning. While existing works have typically focused on comparing models within a single algorithmic category, cross‑algorithm comparisons remain scarce. In this paper, to fill the gap, we establish a benchmark to evaluate the performance of fifteen models across these different algorithmic foundations by assessing the pharmaceutical properties of the generated molecules and their docking affinities and poses with specified target proteins. We highlight the unique advantages of each algorithmic approach and offer recommendations for the design of future SBDD models. We emphasize that 1D/2D ligand‑centric drug design methods can be used in SBDD by treating the docking function as a black‑box oracle, which is typically neglected. Our evaluation reveals distinct patterns across model categories. 3D structure‑based models excel in binding affinities but show inconsistencies in chemical validity and pose quality. 1D models demonstrate reliable performance in standard molecular metrics but rarely achieve optimal binding affinities. 2D models offer balanced performance, maintaining high chemical validity while achieving moderate binding scores. Through detailed analysis across multiple protein targets, we identify key improvement areas for each model category, providing insights for researchers to combine strengths of different approaches while addressing their limitations. All the code that are used for benchmarking is available in https://github.com/zkysfls/2025‑sbdd‑benchmark
Authors:Anh-Tuan Mai, Cam-Van Thi Nguyen, Duc-Trong Le
Abstract:
Multimodal emotion recognition in conversation (MERC) requires representations that effectively integrate signals from multiple modalities. These signals include modality‑specific cues, information shared across modalities, and interactions that emerge only when modalities are combined. In information‑theoretic terms, these correspond to \emphunique, \emphredundant, and \emphsynergistic contributions. An ideal representation should leverage all three, yet achieving such balance remains challenging. Recent advances in contrastive learning and augmentation‑based methods have made progress, but they often overlook the role of data preparation in preserving these components. In particular, applying augmentations directly to raw inputs or fused embeddings can blur the boundaries between modality‑unique and cross‑modal signals. To address this challenge, we propose a two‑phase framework \emphDivide and Refine (DnR). In the Divide phase, each modality is explicitly decomposed into uniqueness, pairwise redundancy, and synergy. In the Refine phase, tailored objectives enhance the informativeness of these components while maintaining their distinct roles. The refined representations are plug‑and‑play compatible with diverse multimodal pipelines. Extensive experiments on IEMOCAP and MELD demonstrate consistent improvements across multiple MERC backbones. These results highlight the effectiveness of explicitly dividing, refining, and recombining multimodal representations as a principled strategy for advancing emotion recognition. Our implementation is available at https://github.com/mattam301/DnR‑WACV2026
Authors:Yiyang Wang, Yiqiao Jin, Alex Cabral, Josiah Hester
Abstract:
Multi‑agent systems (MAS) are emerging as promising socio‑collaborative companions for emotional and cognitive support. However, existing systems frequently suffer from persona collapse, where agents revert to generic, homogenized assistant behaviors, and social sycophancy, where agents produce redundant, non‑constructive dialogue. We propose MASCOT, a multi‑agent framework for multi‑perspective socio‑collaborative companions. MASCOT introduces a novel bi‑level optimization strategy to harmonize individual and collective behaviors: 1) Persona‑Aware Behavioral Alignment, an RLAIF‑driven pipeline that fine‑tunes individual agents for agent‑specific identities; and 2) Collaborative Dialogue Optimization, a group‑level adaptation process that promotes complementary, diverse, and productive discourse. We evaluate MASCOT using human‑grounded contexts drawn across both in‑domain and out‑of‑domain (OOD) settings against state‑of‑the‑art baselines. MASCOT improves persona consistency by up to +14.1 and social contribution by up to +10.6. A broad evaluation suite, including human evaluation, multiple LLM judges, three‑way comparisons, and automatic metrics, further shows that MASCOT produces more role‑consistent and less redundant multi‑agent dialogue.
Authors:Víctor Yeste, Paolo Rosso
Abstract:
We study sentence‑level detection of the 19 human values in the refined Schwartz continuum in about 74k English sentences from news and political manifestos (ValueEval'24 corpus). Each sentence is annotated with value presence, yielding a binary moral‑presence label and a 19‑way multi‑label task under severe class imbalance. First, we show that moral presence is learnable from single sentences: a DeBERTa‑base classifier attains positive‑class F1 = 0.74 with calibrated thresholds. Second, we compare direct multi‑label value detectors with presence‑gated hierarchies under a single 8 GB GPU budget. Under matched compute, presence gating does not improve over direct prediction, indicating that gate recall becomes a bottleneck. Third, we investigate lightweight auxiliary signals ‑ short‑range context, LIWC‑22 and moral lexica, and topic features ‑ and small ensembles. Our best supervised configuration, a soft‑voting ensemble of DeBERTa‑based models enriched with such signals, reaches macro‑F1 = 0.332 on the 19 values, improving over the best previous English‑only baseline on this corpus (macro‑F1 \approx 0.28). We additionally benchmark 7‑9B instruction‑tuned LLMs (Gemma 2 9B, Llama 3.1 8B, Mistral 8B, Qwen 2.5 7B) in zero‑/few‑shot and QLoRA setups, and find that they lag behind the supervised ensemble under the same hardware constraint. Overall, our results provide empirical guidance for building compute‑efficient, value‑aware NLP models under realistic GPU budgets.
Authors:Cheol-Hui Lee, Hwa-Yeon Lee, Dong-Joo Kim
Abstract:
The quality of data augmentation serves as a critical determinant for the performance of contrastive learning in EEG tasks. Although this paradigm is promising for utilizing unlabeled data, static or random augmentation strategies often fail to preserve intrinsic information due to the non‑stationarity of EEG signals where statistical properties change over time. To address this, we propose RL‑BioAug, a framework that leverages a label‑efficient reinforcement learning (RL) agent to autonomously determine optimal augmentation policies. While utilizing only a minimal fraction (10%) of labeled data to guide the agent's policy, our method enables the encoder to learn robust representations in a strictly self‑supervised manner. Experimental results demonstrate that RL‑BioAug significantly outperforms the random selection strategy, achieving substantial improvements of 9.69% and 8.80% in Macro‑F1 score on the Sleep‑EDFX and CHB‑MIT datasets, respectively. Notably, this agent mainly chose optimal strategies for each task‑‑for example, Time Masking with a 62% probability for sleep stage classification and Crop & Resize with a 77% probability for seizure detection. Our framework suggests its potential to replace conventional heuristic‑based augmentations and establish a new autonomous paradigm for data augmentation. The source code is available at https://github.com/dlcjfgmlnasa/RL‑BioAug.
Authors:Xu Zhang, Danyang Li, Yingjie Xia, Xiaohang Dong, Hualong Yu, Jianye Wang, Qicheng Li
Abstract:
Change Detection (CD) is a fundamental task in remote sensing. It monitors the evolution of land cover over time. Based on this, Open‑Vocabulary Change Detection (OVCD) introduces a new requirement. It aims to reduce the reliance on predefined categories. Existing training‑free OVCD methods mostly use CLIP to identify categories. These methods also need extra models like DINO to extract features. However, combining different models often causes problems in matching features and makes the system unstable. Recently, the Segment Anything Model 3 (SAM 3) is introduced. It integrates segmentation and identification capabilities within one promptable model, which offers new possibilities for the OVCD task. In this paper, we propose OmniOVCD, a standalone framework designed for OVCD. By leveraging the decoupled output heads of SAM 3, we propose a Synergistic Fusion to Instance Decoupling (SFID) strategy. SFID first fuses the semantic, instance, and presence outputs of SAM 3 to construct land‑cover masks, and then decomposes them into individual instance masks for change comparison. This design preserves high accuracy in category recognition and maintains instance‑level consistency across images. As a result, the model can generate accurate change masks. Experiments on four public benchmarks (LEVIR‑CD, WHU‑CD, S2Looking, and SECOND) demonstrate SOTA performance, achieving IoU scores of 67.2, 66.5, 24.5, and 27.1 (class‑average), respectively, surpassing all previous methods. The code is available at https://github.com/Erxucomeon/OmniOVCD.
Authors:Kai Wittenmayer, Sukrut Rao, Amin Parchami-Araghi, Bernt Schiele, Jonas Fischer
Abstract:
Language‑aligned vision foundation models perform strongly across diverse downstream tasks. Yet, their learned representations remain opaque, making interpreting their decision‑making difficult. Recent work decompose these representations into human‑interpretable concepts, but provide poor spatial grounding and are limited to image classification tasks. In this work, we propose CFM, a language‑aligned concept foundation model for vision that provides fine‑grained concepts, which are human‑interpretable and spatially grounded in the input image. When paired with a foundation model with strong semantic representations, we get explanations for any of its downstream tasks. Examining local co‑occurrence dependencies of concepts allows us to define concept relationships through which we improve concept naming and obtain richer explanations. On benchmark data, we show that CFM provides performance on classification, segmentation, and captioning that is competitive with opaque foundation models while providing fine‑grained, high quality concept‑based explanations. Code at https://github.com/kawi19/CFM.
Authors:Shengda Fan, Xuyan Ye, Yankai Lin
Abstract:
Self‑play with large language models has emerged as a promising paradigm for achieving self‑improving artificial intelligence. However, existing self‑play frameworks often suffer from optimization instability, due to (i) non‑stationary objectives induced by solver‑dependent reward feedback for the Questioner, and (ii) bootstrapping errors from self‑generated pseudo‑labels used to supervise the Solver. To mitigate these challenges, we introduce DARC (Decoupled Asymmetric Reasoning Curriculum), a two‑stage framework that stabilizes the self‑evolution process. First, we train the Questioner to synthesize difficulty‑calibrated questions, conditioned on explicit difficulty levels and external corpora. Second, we train the Solver with an asymmetric self‑distillation mechanism, where a document‑augmented teacher generates high‑quality pseudo‑labels to supervise the student Solver that lacks document access. Empirical results demonstrate that DARC is model‑agnostic, yielding an average improvement of 10.9 points across nine reasoning benchmarks and three backbone models. Moreover, DARC consistently outperforms all baselines and approaches the performance of fully supervised models without relying on human annotations. The code is available at https://github.com/RUCBM/DARC.
Authors:Zhiyuan Shi, Qibo Qiu, Feng Xue, Zhonglin Jiang, Li Yu, Jian Jiang, Xiaofei He, Wenxiao Wang
Abstract:
The linear memory growth of the KV cache poses a significant bottleneck for LLM inference in long‑context tasks. Existing static compression methods often fail to preserve globally important information. Although recent dynamic retrieval approaches attempt to address this issue, they typically suffer from coarse‑grained caching strategies and incur high I/O overhead. To overcome these limitations, we propose HeteroCache, a training‑free dynamic compression framework. Our method is built on two key insights: attention heads exhibit diverse temporal heterogeneity, and there is significant spatial redundancy among heads within the same layer. Guided by these insights, HeteroCache categorizes heads based on stability and similarity, applying a fine‑grained weighting strategy that allocates larger cache budgets to heads with rapidly shifting attention to capture context changes. Furthermore, it features a hierarchical storage mechanism where representative heads monitor attention drift to trigger asynchronous, on‑demand context retrieval, thereby hiding I/O latency. Experiments demonstrate that HeteroCache achieves state‑of‑the‑art performance on long‑context benchmarks and accelerates decoding by up to 3× compared to the original model with a 224K context. Our code is available at https://github.com/ponytaill/HeteroCache.
Authors:Xu Zhang, Junwei Deng, Chang Xu, Hao Li, Jiang Bian
Abstract:
Time series generation (TSG) is widely used across domains, yet most existing methods assume regular sampling and fixed output resolutions. These assumptions are often violated in practice, where observations are irregular and sparse, while downstream applications require continuous and high‑resolution TS. Although Neural Controlled Differential Equation (NCDE) is promising for modeling irregular TS, it is constrained by a single dynamics function, tightly coupled optimization, and limited ability to adapt learned dynamics to newly generated samples from the generative model. We propose Diff‑MN, a continuous TSG framework that enhances NCDE with a Mixture‑of‑Experts (MoE) dynamics function and a decoupled architectural design for dynamics‑focused training. To further enable NCDE to generalize to newly generated samples, Diff‑MN employs a diffusion model to parameterize the NCDE temporal dynamics parameters (MoE weights), i.e., jointly learn the distribution of TS data and MoE weights. This design allows sample‑specific NCDE parameters to be generated for continuous TS generation. Experiments on ten public and synthetic datasets demonstrate that Diff‑MN consistently outperforms strong baselines on both irregular‑to‑regular and irregular‑to‑continuous TSG tasks. The code is available at the link https://github.com/microsoft/TimeCraft/tree/main/Diff‑MN.
Authors:Jiayi Yuan, Jonathan Nöther, Natasha Jaques, Goran Radanović
Abstract:
While recent automated red‑teaming methods show promise for systematically exposing model vulnerabilities, most existing approaches rely on human‑specified workflows. This dependence on manually designed workflows suffers from human biases and makes exploring the broader design space expensive. We introduce AgenticRed, an automated pipeline that leverages LLMs' in‑context learning to iteratively design and refine red‑teaming systems without human intervention. Rather than optimizing attacker policies within predefined structures, AgenticRed treats red‑teaming as a system design problem. Inspired by methods like Meta Agent Search, we develop a novel procedure for evolving agentic systems using evolutionary selection, and apply it to the problem of automatic red‑teaming. Red‑teaming systems designed by AgenticRed consistently outperform state‑of‑the‑art approaches, achieving 96% attack success rate (ASR) on Llama‑2‑7B (36% improvement) and 98% on Llama‑3‑8B on HarmBench. Our approach exhibits strong transferability to proprietary models, achieving 100% ASR on GPT‑3.5‑Turbo and GPT‑4o, and 60% on Claude‑Sonnet‑3.5 (24% improvement). This work highlights automated system design as a powerful paradigm for AI safety evaluation that can keep pace with rapidly evolving models.
Authors:Nickil Maveli, Antonio Vergari, Shay B. Cohen
Abstract:
LLMs demonstrate strong performance on code benchmarks, yet consistent reasoning across forward and backward execution remains elusive. We present RoundTripCodeEval (RTCE), a benchmark of four code execution reasoning tasks that evaluates round‑trip consistency through execution‑free, exact‑match assessment of bijection fidelity across four lossless compression algorithms. We evaluate state‑of‑the‑art Code‑LLMs under zero‑shot prompting, supervised fine‑tuning on execution traces, and iterative self‑reflection. All approaches yield only modest improvements and none closes the gap, revealing that current LLMs lack the internal coherence required for reliable bidirectional code reasoning. RTCE surfaces findings invisible to existing benchmarks: models frequently pass individual forward and backward tasks yet fail the combined round‑trip, exposing mutually inconsistent internal representations; SFT and self‑reflection saturate after one revision round, indicating they cannot repair fundamental algorithmic misunderstandings; and failures persist even on simple bijections such as RLE, suggesting that algorithmic complexity is not the sole root cause.\footnoteCode and dataset are available at https://github.com/Nickil21/round‑trip‑code‑compression.
Authors:Po-Yu Liang, Tibo Duran, Jun Bai
Abstract:
We present PepEDiff, a novel peptide binder generator that designs binding sequences given a target receptor protein sequence and its pocket residues. Peptide binder generation is critical in therapeutic and biochemical applications, yet many existing methods rely heavily on intermediate structure prediction, adding complexity and limiting sequence diversity. Our approach departs from this paradigm by generating binder sequences directly in a continuous latent space derived from a pretrained protein embedding model, without relying on predicted structures, thereby improving structural and sequence diversity. To encourage the model to capture binding‑relevant features rather than memorizing known sequences, we perform latent‑space exploration and diffusion‑based sampling, enabling the generation of peptides beyond the limited distribution of known binders. This zero‑shot generative strategy leverages the global protein embedding manifold as a semantic prior, allowing the model to propose novel peptide sequences in previously unseen regions of the protein space. We evaluate PepEDiff on TIGIT, a challenging target with a large, flat protein‑protein interaction interface that lacks a druggable pocket. Despite its simplicity, our method outperforms state‑of‑the‑art approaches across benchmark tests and in the TIGIT case study, demonstrating its potential as a general, structure‑free framework for zero‑shot peptide binder design. The code for this research is available at GitHub: https://github.com/LabJunBMI/PepEDiff‑An‑Peptide‑binder‑Embedding‑Diffusion‑Model
Authors:Xue Jiang, Ge Li, Jiaru Qian, Xianjie Shi, Chenjie Li, Hao Zhu, Ziyu Wang, Jielun Zhang, Zheyu Zhao, Kechi Zhang, Jia Li, Wenpin Jiao, Zhi Jin, Yihong Dong
Abstract:
Large language models (LLMs) excel at general programming but struggle with domain‑specific software development, necessitating domain specialization methods for LLMs to learn and utilize domain knowledge and data. However, existing domain‑specific code benchmarks cannot evaluate the effectiveness of domain specialization methods, which focus on assessing what knowledge LLMs possess rather than how they acquire and apply new knowledge, lacking explicit knowledge corpora for developing domain specialization methods. To this end, we present KOCO‑BENCH, a novel benchmark designed for evaluating domain specialization methods in real‑world software development. KOCO‑BENCH contains 6 emerging domains with 11 software frameworks and 25 projects, featuring curated knowledge corpora alongside multi‑granularity evaluation tasks including domain code generation (from function‑level to project‑level with rigorous test suites) and domain knowledge understanding (via multiple‑choice Q&A). Unlike previous benchmarks that only provide test sets for direct evaluation, KOCO‑BENCH requires acquiring and applying diverse domain knowledge (APIs, rules, constraints, etc.) from knowledge corpora to solve evaluation tasks. Our evaluations reveal that KOCO‑BENCH poses significant challenges to state‑of‑the‑art LLMs. Even with domain specialization methods (e.g., SFT, RAG, kNN‑LM) applied, improvements remain marginal. Best‑performing coding agent, Claude Code, achieves only 34.2%, highlighting the urgent need for more effective domain specialization methods. We release KOCO‑BENCH, evaluation code, and baselines to advance further research at https://github.com/jiangxxxue/KOCO‑bench.
Authors:Pedro M. Gordaliza, Jaume Banus, Benoît Gérin, Maxence Wynen, Nataliia Molchanova, Jonas Richiardi, Meritxell Bach Cuadra
Abstract:
Developing Foundation Models for medical image analysis is essential to overcome the unique challenges of radiological tasks. The first challenges of this kind for 3D brain MRI, SSL3D and FOMO25, were held at MICCAI 2025. Our solution ranked first in tracks of both contests. It relies on a U‑Net CNN architecture combined with strategies leveraging anatomical priors and neuroimaging domain knowledge. Notably, our models trained 1‑2 orders of magnitude faster and were 10 times smaller than competing transformer‑based approaches. Models are available here: https://github.com/jbanusco/BrainFM4Challenges.
Authors:Hassan Soliman, Vivek Gupta, Dan Roth, Iryna Gurevych
Abstract:
Realistic text‑to‑SQL workflows often require joining multiple tables. As a result, accurately retrieving the relevant set of tables becomes a key bottleneck for end‑to‑end performance. We study an open‑book setting where queries must be answered over large, heterogeneous table collections pooled from many sources, without clean scoping signals such as database identifiers. Here, dense retrieval (DR) achieves high recall but returns many distractors, while join‑aware alternatives often rely on extra assumptions and/or incur high inference overhead. We propose CORE‑T, a scalable, training‑free framework that enriches tables with LLM‑generated purpose metadata and pre‑computes a lightweight table‑compatibility cache. At inference time, DR returns top‑K candidates; a single LLM call selects a coherent, joinable subset, and a simple additive adjustment step restores strongly compatible tables. Across Bird, Spider, and MMQA, CORE‑T improves table‑selection F1 by up to 22.7 points while retrieving up to 42% fewer tables, improving multi‑table execution accuracy by up to 5.0 points on Bird and 6.9 points on MMQA, and using 4‑5x fewer tokens than LLM‑intensive baselines.
Authors:Rusheng Pan, Bingcheng Mao, Tianyi Ma, Zhenhua Ling
Abstract:
Recovering accurate architecture from large‑scale legacy software is hindered by architectural drift, missing relations, and the limited context of Large Language Models (LLMs). We present ArchAgent, a scalable agent‑based framework that combines static analysis, adaptive code segmentation, and LLM‑powered synthesis to reconstruct multiview, business‑aligned architectures from cross‑repository codebases. ArchAgent introduces scalable diagram generation with contextual pruning and integrates cross‑repository data to identify business‑critical modules. Evaluations of typical large‑scale GitHub projects show significant improvements over existing benchmarks. An ablation study confirms that dependency context improves the accuracy of generated architectures of production‑level repositories, and a real‑world case study demonstrates effective recovery of critical business logics from legacy projects. The dataset is available at https://github.com/panrusheng/arch‑eval‑benchmark.
Authors:Wenqi Zhang, Yulin Shen, Changyue Jiang, Jiarun Dai, Geng Hong, Xudong Pan
Abstract:
Large foundation models are integrated into Computer Use Agents (CUAs), enabling autonomous interaction with operating systems through graphical user interfaces (GUIs) to perform complex tasks. This autonomy introduces serious security risks: malicious instructions or visual prompt injections can trigger unsafe reasoning and cause harmful system‑level actions. Existing defenses, such as detection‑based blocking, prevent damage but often abort tasks prematurely, reducing agent utility. In this paper, we present MirrorGuard, a plug‑and‑play defense framework that uses simulation‑based training to improve CUA security in the real world. To reduce the cost of large‑scale training in operating systems, we propose a novel neural‑symbolic simulation pipeline, which generates realistic, high‑risk GUI interaction trajectories entirely in a text‑based simulated environment, which captures unsafe reasoning patterns and potential system hazards without executing real operations. In the simulation environment, MirrorGuard learns to intercept and rectify insecure reasoning chains of CUAs before they produce and execute unsafe actions. In real‑world testing, extensive evaluations across diverse benchmarks and CUA architectures show that MirrorGuard significantly mitigates security risks. For instance, on the ByteDance UI‑TARS system, it reduces the unsafe rate from 66.5% to 13.0% while maintaining a marginal false refusal rate (FRR). In contrast, the state‑of‑the‑art GuardAgent only achieves a reduction to 53.9% and suffers from a 15.4% higher FRR. Our work proves that simulation‑derived defenses can provide robust, real‑world protection while maintaining the fundamental utility of the agent. Our code and model are publicly available at https://bmz‑q‑q.github.io/MirrorGuard/.
Authors:Ishir Garg, Neel Kolhe, Andy Peng, Rohan Gopalam
Abstract:
Continual learning aims to enable neural networks to acquire new knowledge on sequential tasks. However, the key challenge in such settings is to learn new tasks without catastrophically forgetting previously learned tasks. We propose the Fisher‑Orthogonal Projected Natural Gradient Descent (FOPNG) optimizer, which enforces Fisher‑orthogonal constraints on parameter updates to preserve old task performance while learning new tasks. Unlike existing methods that operate in Euclidean parameter space, FOPNG projects gradients onto the Fisher‑orthogonal complement of previous task gradients. This approach unifies natural gradient descent with orthogonal gradient methods within an information‑geometric framework. We provide theoretical analysis deriving the projected update, describe efficient and practical implementations using the diagonal Fisher, and demonstrate strong results on standard continual learning benchmarks such as Permuted‑MNIST, Split‑MNIST, Rotated‑MNIST, Split‑CIFAR10, and Split‑CIFAR100. Our code is available at https://github.com/ishirgarg/FOPNG.
Authors:Yuqi Li, Kuiye Ding, Chuanguang Yang, Szu-Yu Chen, Yingli Tian
Abstract:
Time Series foundation models (TSFMs) deliver strong forecasting performance through large‑scale pretraining, but their large parameter sizes make deployment costly. While knowledge distillation offers a natural and effective approach for model compression, techniques developed for general machine learning tasks are not directly applicable to time series forecasting due to the unique characteristics. To address this, we present DistilTS, the first distillation framework specifically designed for TSFMs. DistilTS addresses two key challenges: (1) task difficulty discrepancy, specific to forecasting, where uniform weighting makes optimization dominated by easier short‑term horizons, while long‑term horizons receive weaker supervision; and (2) architecture discrepancy, a general challenge in distillation, for which we design an alignment mechanism in the time series forecasting. To overcome these issues, DistilTS introduces horizon‑weighted objectives to balance learning across horizons, and a temporal alignment strategy that reduces architectural mismatch, enabling compact models. Experiments on multiple benchmarks demonstrate that DistilTS achieves forecasting performance comparable to full‑sized TSFMs, while reducing parameters by up to 1/150 and accelerating inference by up to 6000x. Code is available at: https://github.com/itsnotacie/DistilTS‑ICASSP2026.
Authors:Xingjie Gao, Pengcheng Huang, Zhenghao Liu, Yukun Yan, Shuo Wang, Zulong Chen, Chen Qian, Ge Yu, Yu Gu
Abstract:
Equipping Large Language Models (LLMs) with external tools enables them to solve complex real‑world problems. However, the robustness of existing methods remains a critical challenge when confronting novel or evolving tools. Existing trajectory‑centric paradigms primarily rely on memorizing static solution paths during training, which limits the ability of LLMs to generalize tool usage to newly introduced or previously unseen tools. In this paper, we propose ToolMaster, a framework that shifts tool use from imitating golden tool‑calling trajectories to actively learning tool usage through interaction with the environment. To optimize LLMs for tool planning and invocation, ToolMaster adopts a trial‑and‑execution paradigm, which trains LLMs to first imitate teacher‑generated trajectories containing explicit tool trials and self‑correction, followed by reinforcement learning to coordinate the trial and execution phases jointly. This process enables agents to autonomously explore correct tool usage by actively interacting with environments and forming experiential knowledge that benefits tool execution. Experimental results demonstrate that ToolMaster significantly outperforms existing baselines in terms of generalization and robustness across unseen or unfamiliar tools. All code and data are available at https://github.com/NEUIR/ToolMaster.
Authors:Hanbin Wang, Jingwei Song, Jinpeng Li, Qi Zhu, Fei Mi, Ganqu Cui, Yasheng Wang, Lifeng Shang
Abstract:
Large Reasoning Models (LRMs) have recently shown impressive performance on complex reasoning tasks, often by engaging in self‑reflective behaviors such as self‑critique and backtracking. However, not all reflections are beneficial‑many are superficial, offering little to no improvement over the original answer and incurring computation overhead. In this paper, we identify and address the problem of superficial reflection in LRMs. We first propose Self‑Critique Fine‑Tuning (SCFT), a training framework that enhances the model's reflective reasoning ability using only self‑generated critiques. SCFT prompts models to critique their own outputs, filters high‑quality critiques through rejection sampling, and fine‑tunes the model using a critique‑based objective. Building on this strong foundation, we further introduce Reinforcement Learning with Effective Reflection Rewards (RLERR). RLERR leverages the high‑quality reflections initialized by SCFT to construct reward signals, guiding the model to internalize the self‑correction process via reinforcement learning. Experiments on two challenging benchmarks, AIME2024 and AIME2025, show that SCFT and RLERR significantly improve both reasoning accuracy and reflection quality, outperforming state‑of‑the‑art baselines. All data and codes are available at https://github.com/wanghanbinpanda/SCFT.
Authors:Tianxin Wei, Ting-Wei Li, Zhining Liu, Xuying Ning, Ze Yang, Jiaru Zou, Zhichen Zeng, Ruizhong Qiu, Xiao Lin, Dongqi Fu, Zihao Li, Mengting Ai, Duo Zhou, Wenxuan Bao, Yunzhe Li, Gaotang Li, Cheng Qian, Yu Wang, Xiangru Tang, Yin Xiao, Liri Fang, Hui Liu, Xianfeng Tang, Yuji Zhang, Chi Wang, Jiaxuan You, Heng Ji, Hanghang Tong, Jingrui He
Abstract:
Reasoning is a fundamental cognitive process underlying inference, problem‑solving, and decision‑making. While large language models (LLMs) demonstrate strong reasoning capabilities in closed‑world settings, they struggle in open‑ended and dynamic environments. Agentic reasoning marks a paradigm shift by reframing LLMs as autonomous agents that plan, act, and learn through continual interaction. In this survey, we organize agentic reasoning along three complementary dimensions. First, we characterize environmental dynamics through three layers: foundational agentic reasoning, which establishes core single‑agent capabilities including planning, tool use, and search in stable environments; self‑evolving agentic reasoning, which studies how agents refine these capabilities through feedback, memory, and adaptation; and collective multi‑agent reasoning, which extends intelligence to collaborative settings involving coordination, knowledge sharing, and shared goals. Across these layers, we distinguish in‑context reasoning, which scales test‑time interaction through structured orchestration, from post‑training reasoning, which optimizes behaviors via reinforcement learning and supervised fine‑tuning. We further review representative agentic reasoning frameworks across real‑world applications and benchmarks, including science, robotics, healthcare, autonomous research, and mathematics. This survey synthesizes agentic reasoning methods into a unified roadmap bridging thought and action, and outlines open challenges and future directions, including personalization, long‑horizon interaction, world modeling, scalable multi‑agent training, and governance for real‑world deployment.
Authors:Hui Yang, Jiaoyan Chen, Uli Sattler
Abstract:
The ability of Large Language Models (LLMs) to perform reasoning tasks such as deduction has been widely investigated in recent years. Yet, their capacity to generate proofs‑faithful, human‑readable explanations of why conclusions follow‑remains largely under explored. In this work, we study proof generation in the context of OWL ontologies, which are widely adopted for representing and reasoning over complex knowledge, by developing an automated dataset construction and evaluation framework. Our evaluation encompassing three sequential tasks for complete proving: Extraction, Simplification, and Explanation, as well as an additional task of assessing Logic Completeness of the premise. Through extensive experiments on widely used reasoning LLMs, we achieve important findings including: (1) Some models achieve overall strong results but remain limited on complex cases; (2) Logical complexity, rather than representation format (formal logic language versus natural language), is the dominant factor shaping LLM performance; and (3) Noise and incompleteness in input data substantially diminish LLMs' performance. Together, these results underscore both the promise of LLMs for explanation with rigorous logics and the gap of supporting resilient reasoning under complex or imperfect conditions. Code and data are available at https://github.com/HuiYang1997/LLMOwlR.
Authors:Hailing Jin, Huiying Li
Abstract:
Recent advances in semantic correspondence have been largely driven by the use of pre‑trained large‑scale models. However, a limitation of these approaches is their dependence on high‑resolution input images to achieve optimal performance, which results in considerable computational overhead. In this work, we address a fundamental limitation in current methods: the irreversible fusion of adjacent keypoint features caused by deep downsampling operations. This issue is triggered when semantically distinct keypoints fall within the same downsampled receptive field (e.g., 16x16 patches). To address this issue, we present SimpleMatch, a simple yet effective framework for semantic correspondence that delivers strong performance even at low resolutions. We propose a lightweight upsample decoder that progressively recovers spatial detail by upsampling deep features to 1/4 resolution, and a multi‑scale supervised loss that ensures the upsampled features retain discriminative features across different spatial scales. In addition, we introduce sparse matching and window‑based localization to optimize training memory usage and reduce it by 51%. At a resolution of 252x252 (3.3x smaller than current SOTA methods), SimpleMatch achieves superior performance with 84.1% PCK@0.1 on the SPair‑71k benchmark. We believe this framework provides a practical and efficient baseline for future research in semantic correspondence. Code is available at: https://github.com/hailong23‑jin/SimpleMatch.
Authors:Dawei Li, Yuguang Yao, Zhen Tan, Huan Liu, Ruocheng Guo
Abstract:
Reward‑guided search methods have demonstrated strong potential in enhancing tool‑using agents by effectively guiding sampling and exploration over complex action spaces. As a core design, those search methods utilize process reward models (PRMs) to provide step‑level rewards, enabling more fine‑grained monitoring. However, there is a lack of systematic and reliable evaluation benchmarks for PRMs in tool‑using settings. In this paper, we introduce ToolPRMBench, a large‑scale benchmark specifically designed to evaluate PRMs for tool‑using agents. ToolPRMBench is built on top of several representative tool‑using benchmarks and converts agent trajectories into step‑level test cases. Each case contains the interaction history, a correct action, a plausible but incorrect alternative, and relevant tool metadata. We respectively utilize offline sampling to isolate local single‑step errors and online sampling to capture realistic multi‑step failures from full agent rollouts. A multi‑LLM verification pipeline is proposed to reduce label noise and ensure data quality. We conduct extensive experiments across large language models, general PRMs, and tool‑specialized PRMs on ToolPRMBench. The results reveal clear differences in PRM effectiveness and highlight the potential of specialized PRMs for tool‑using. Code and data will be released at https://github.com/David‑Li0406/ToolPRMBench.
Authors:Chun-Yi Kuan, Hung-yi Lee
Abstract:
Recent advances in audio‑aware large language models have shown strong performance on audio question answering. However, existing benchmarks mainly cover answerable questions and overlook the challenge of unanswerable ones, where no reliable answer can be inferred from the audio. Such cases are common in real‑world settings, where questions may be misleading, ill‑posed, or incompatible with the information. To address this gap, we present AQUA‑Bench, a benchmark for Audio Question Unanswerability Assessment. It systematically evaluates three scenarios: Absent Answer Detection (the correct option is missing), Incompatible Answer Set Detection (choices are categorically mismatched with the question), and Incompatible Audio Question Detection (the question is irrelevant or lacks sufficient grounding in the audio). By assessing these cases, AQUA‑Bench offers a rigorous measure of model reliability and promotes the development of audio‑language systems that are more robust and trustworthy. Our experiments suggest that while models excel on standard answerable tasks, they often face notable challenges with unanswerable ones, pointing to a blind spot in current audio‑language understanding.
Authors:Bing Hu, Yixin Li, Asma Bahamyirou, Helen Chen
Abstract:
The use of synthetic data in health applications raises privacy concerns, yet the lack of open frameworks for privacy evaluations has slowed its adoption. A major challenge is the absence of accessible benchmark datasets for evaluating privacy risks, due to difficulties in acquiring sensitive data. To address this, we introduce SynQP, an open framework for benchmarking privacy in synthetic data generation (SDG) using simulated sensitive data, ensuring that original data remains confidential. We also highlight the need for privacy metrics that fairly account for the probabilistic nature of machine learning models. As a demonstration, we use SynQP to benchmark CTGAN and propose a new identity disclosure risk metric that offers a more accurate estimation of privacy risks compared to existing approaches. Our work provides a critical tool for improving the transparency and reliability of privacy evaluations, enabling safer use of synthetic data in health‑related applications. % In our quality evaluations, non‑private models achieved near‑perfect machine‑learning efficacy \(\ge0.97\). Our privacy assessments (Table II) reveal that DP consistently lowers both identity disclosure risk (SD‑IDR) and membership‑inference attack risk (SD‑MIA), with all DP‑augmented models staying below the 0.09 regulatory threshold. Code available at https://github.com/CAN‑SYNH/SynQP
Authors:David Ilić, David Stanojević, Kostadin Cvejoski
Abstract:
Fine‑tuned language models pose significant privacy risks, as they may memorize and expose sensitive information from their training data. Membership inference attacks (MIAs) provide a principled framework for auditing these risks, yet existing methods achieve limited detection rates, particularly at the low false‑positive thresholds required for practical privacy auditing. We present EZ‑MIA, a membership inference attack that exploits a key observation: memorization manifests most strongly at error positions, specifically tokens where the model predicts incorrectly yet still shows elevated probability for training examples. We introduce the Error Zone (EZ) score, which measures the directional imbalance of probability shifts at error positions relative to a pretrained reference model. This principled statistic requires only two forward passes per query and no model training of any kind. On WikiText with GPT‑2, EZ‑MIA achieves 3.8x higher detection than the previous state‑of‑the‑art under identical conditions (66.3% versus 17.5% true positive rate at 1% false positive rate), with near‑perfect discrimination (AUC 0.98). At the stringent 0.1% FPR threshold critical for real‑world auditing, we achieve 8x higher detection than prior work (14.0% versus 1.8%), requiring no reference model training. These gains extend to larger architectures: on AG News with Llama‑2‑7B, we achieve 3x higher detection (46.7% versus 15.8% TPR at 1% FPR). These results establish that privacy risks of fine‑tuned language models are substantially greater than previously understood, with implications for both privacy auditing and deployment decisions. Code is available at https://github.com/JetBrains‑Research/ez‑mia.
Authors:Tiffanie Godelaine, Maxime Zanella, Karim El Khoury, Saïd Mahmoudi, Benoît Macq, Christophe De Vleeschouwer
Abstract:
Assisting pathologists in the analysis of histopathological images has high clinical value, as it supports cancer detection and staging. In this context, histology foundation models have recently emerged. Among them, Vision‑Language Models (VLMs) provide strong yet imperfect zero‑shot predictions. We propose to refine these predictions by adapting Conditional Random Fields (CRFs) to histopathological applications, requiring no additional model training. We present HistoCRF, a CRF‑based framework, with a novel definition of the pairwise potential that promotes label diversity and leverages expert annotations. We consider three experiments: without annotations, with expert annotations, and with iterative human‑in‑the‑loop annotations that progressively correct misclassified patches. Experiments on five patch‑level classification datasets covering different organs and diseases demonstrate average accuracy gains of 16.0% without annotations and 27.5% with only 100 annotations, compared to zero‑shot predictions. Moreover, integrating a human in the loop reaches a further gain of 32.6% with the same number of annotations. The code will be made available on https://github.com/tgodelaine/HistoCRF.
Authors:Jingchu Wang, Bingbing Xu, Yige Yuan, Bin Xie, Xiaoqian Sun, Huawei Shen
Abstract:
Reinforcement learning has become a central paradigm for improving LLM reasoning. However, existing methods use a single policy to produce both inference responses and training optimization trajectories. The objective conflict between generating stable inference responses and diverse training trajectories leads to insufficient exploration, which harms reasoning capability. In this paper, to address the problem, we propose R^2PO (Residual Rollout Policy Optimization), which introduces a lightweight Residual Rollout‑Head atop the policy to decouple training trajectories from inference responses, enabling controlled trajectory diversification during training while keeping inference generation stable. Experiments across multiple benchmarks show that our method consistently outperforms baselines, achieving average accuracy gains of 3.4% on MATH‑500 and 1.3% on APPS, while also reducing formatting errors and mitigating length bias for stable optimization. Our code is publicly available at https://github.com/RRPO‑ARR/Code.
Authors:Jinshi Liu, Pan Liu
Abstract:
Most pseudo‑label selection strategies in semi‑supervised learning rely on fixed confidence thresholds, implicitly assuming that prediction confidence reliably indicates correctness. In practice, deep networks are often overconfident: high‑confidence predictions can still be wrong, while informative low‑confidence samples near decision boundaries are discarded. This paper introduces a Confidence‑Variance (CoVar) theory framework that provides a principled joint reliability criterion for pseudo‑label selection. Starting from the entropy minimization principle, we derive a reliability measure that combines maximum confidence (MC) with residual‑class variance (RCV), which characterizes how probability mass is distributed over non‑maximum classes. The derivation shows that reliable pseudo‑labels should have both high MC and low RCV, and that the influence of RCV increases as confidence grows, thereby correcting overconfident but unstable predictions. From this perspective, we cast pseudo‑label selection as a spectral relaxation problem that maximizes separability in a confidence‑variance feature space, and design a threshold‑free selection mechanism to distinguish high‑ from low‑reliability predictions. We integrate CoVar as a plug‑in module into representative semi‑supervised semantic segmentation and image classification methods. Across PASCAL VOC 2012, Cityscapes, CIFAR‑10, and Mini‑ImageNet with varying label ratios and backbones, it consistently improves over strong baselines, indicating that combining confidence with residual‑class variance provides a more reliable basis for pseudo‑label selection than fixed confidence thresholds. (Code: https://github.com/ljs11528/CoVar_Pseudo_Label_Selection.git)
Authors:Oishee Bintey Hoque, Nibir Chandra Mandal, Kyle Luong, Amanda Wilson, Samarth Swarup, Madhav Marathe, Abhijin Adiga
Abstract:
Large‑scale livestock operations pose significant risks to human health and the environment, while also being vulnerable to threats such as infectious diseases and extreme weather events. As the number of such operations continues to grow, accurate and scalable mapping has become increasingly important. In this work, we present an infrastructure‑first, explainable pipeline for identifying and characterizing Concentrated Animal Feeding Operations (CAFOs) from aerial and satellite imagery. Our method (i) detects candidate infrastructure (e.g., barns, feedlots, manure lagoons, silos) with a domain‑tuned YOLOv8 detector, then derives SAM2 masks from these boxes and filters component‑specific criteria; (ii) extracts structured descriptors (e.g., counts, areas, orientations, and spatial relations) and fuses them with deep visual features using a lightweight spatial cross‑attention classifier; and (iii) outputs both CAFO type predictions and mask‑level attributions that link decisions to visible infrastructure. Through comprehensive evaluation, we show that our approach achieves state‑of‑the‑art performance, with Swin‑B+PRISM‑CAFO surpassing the best performing baseline by up to 15%. Beyond strong predictive performance across diverse U.S. regions, we run systematic gradient‑‑activation analyses that quantify the impact of domain priors and show how specific infrastructure (e.g., barns, lagoons) shapes classification decisions. We release code, infrastructure masks, and descriptors to support transparent, scalable monitoring of livestock infrastructure, enabling risk modeling, change detection, and targeted regulatory action. Github: https://github.com/Nibir088/PRISM‑CAFO.
Authors:Xiaojie Gu, Guangxu Chen, Yuheng Yang, Jingxin Han, Andi Zhang
Abstract:
Large language models (LLMs) exhibit exceptional performance across various domains, yet they face critical safety concerns. Model editing has emerged as an effective approach to mitigate these issues. Existing model editing methods often focus on optimizing an information matrix that blends new and old knowledge. While effective, these approaches can be computationally expensive and may cause conflicts. In contrast, we shift our attention to Hierarchical Orthogonal Residual SprEad of the information matrix, which reduces noisy gradients and enables more stable edits from a different perspective. We demonstrate the effectiveness of our method HORSE through a clear theoretical comparison with several popular methods and extensive experiments conducted on two datasets across multiple LLMs. The results show that HORSE maintains precise massive editing across diverse scenarios. The code is available at https://github.com/XiaojieGu/HORSE
Authors:Shaofeng Yin, Jiaxin Ge, Zora Zhiruo Wang, Xiuyu Li, Michael J. Black, Trevor Darrell, Angjoo Kanazawa, Haiwen Feng
Abstract:
Vision‑as‑inverse‑graphics, the concept of reconstructing an image as an editable graphics program is a long‑standing goal of computer vision. Yet even strong VLMs aren't able to achieve this in one‑shot as they lack fine‑grained spatial and physical grounding capability. Our key insight is that closing this gap requires interleaved multimodal reasoning through iterative execution and verification. Stemming from this, we present VIGA (Vision‑as‑Inverse‑Graphic Agent) that starts from an empty world and reconstructs or edits scenes through a closed‑loop write‑run‑render‑compare‑revise procedure. To support long‑horizon reasoning, VIGA combines (i) a skill library that alternates generator and verifier roles and (ii) an evolving context memory that contains plans, code diffs, and render history. VIGA is task‑agnostic as it doesn't require auxiliary modules, covering a wide range of tasks such as 3D reconstruction, multi‑step scene editing, 4D physical interaction, and 2D document editing, etc. Empirically, we found VIGA substantially improves one‑shot baselines on BlenderGym (35.32%) and SlideBench (117.17%). Moreover, VIGA is also model‑agnostic as it doesn't require finetuning, enabling a unified protocol to evaluate heterogeneous foundation VLMs. To better support this protocol, we introduce BlenderBench, a challenging benchmark that stress‑tests interleaved multimodal reasoning with graphics engine, where VIGA improves by 124.70%.
Authors:Jie Yang, Honglin Guo, Li Ji, Jiazheng Zhou, Rui Zheng, Zhikai Lei, Shuo Zhang, Zhiheng Xi, Shichun Liu, Yuxin Wang, Bo Wang, Yining Zheng, Tao Gui, Xipeng Qiu
Abstract:
The evolution of Large Language Models (LLMs) into autonomous agents has expanded the scope of AI coding from localized code generation to complex, repository‑level, and execution‑driven problem solving. However, current benchmarks predominantly evaluate code logic in static contexts, neglecting the dynamic, full‑process requirements of real‑world engineering, particularly in backend development which demands rigorous environment configuration and service deployment. To address this gap, we introduce ABC‑Bench, a benchmark explicitly designed to evaluate agentic backend coding within a realistic, executable workflow. Using a scalable automated pipeline, we curated 224 practical tasks spanning 8 languages and 19 frameworks from open‑source repositories. Distinct from previous evaluations, ABC‑Bench require the agents to manage the entire development lifecycle from repository exploration to instantiating containerized services and pass the external end‑to‑end API tests. Our extensive evaluation reveals that even state‑of‑the‑art models struggle to deliver reliable performance on these holistic tasks, highlighting a substantial disparity between current model capabilities and the demands of practical backend engineering. Our code is available at https://github.com/OpenMOSS/ABC‑Bench.
Authors:Keyu Li, Junhao Shi, Yang Xiao, Mohan Jiang, Jie Sun, Yunze Wu, Shijie Xia, Xiaojie Cai, Tianze Xu, Weiye Si, Wenjie Li, Dequan Wang, Pengfei Liu
Abstract:
Large Language Models (LLMs) based autonomous agents demonstrate multifaceted capabilities to contribute substantially to economic production. However, existing benchmarks remain focused on single agentic capability, failing to capture long‑horizon real‑world scenarios. Moreover, the reliance on human‑in‑the‑loop feedback for realistic tasks creates a scalability bottleneck, hindering automated rollout collection and evaluation. To bridge this gap, we introduce AgencyBench, a comprehensive benchmark derived from daily AI usage, evaluating 6 core agentic capabilities across 32 real‑world scenarios, comprising 138 tasks with specific queries, deliverables, and rubrics. These scenarios require an average of 90 tool calls, 1 million tokens, and hours of execution time to resolve. To enable automated evaluation, we employ a user simulation agent to provide iterative feedback, and a Docker sandbox to conduct visual and functional rubric‑based assessment. Experiments reveal that closed‑source models significantly outperform open‑source models (48.4% vs 32.1%). Further analysis reveals significant disparities across models in resource efficiency, feedback‑driven self‑correction, and specific tool‑use preferences. Finally, we investigate the impact of agentic scaffolds, observing that proprietary models demonstrate superior performance within their native ecosystems (e.g., Claude‑4.5‑Opus via Claude‑Agent‑SDK), while open‑source models exhibit distinct performance peaks, suggesting potential optimization for specific execution frameworks. AgencyBench serves as a critical testbed for next‑generation agents, highlighting the necessity of co‑optimizing model architecture with agentic frameworks. We believe this work sheds light on the future direction of autonomous agents, and we release the full benchmark and evaluation toolkit at https://github.com/GAIR‑NLP/AgencyBench.
Authors:Shiyu Liu, Yongjing Yin, Jianhao Yan, Yunbo Tang, Qinggang Zhang, Bei Li, Xin Chen, Jingang Wang, Xunliang Cai, Jinsong Su
Abstract:
RL‑based agentic search enables LLMs to solve complex questions via dynamic planning and external search. While this approach significantly enhances accuracy with agent policies optimized via large‑scale reinforcement learning, we identify a critical gap in reliability: these agents fail to recognize their reasoning boundaries and rarely admit ``I DON'T KNOW'' (IDK) even when evidence is insufficient or reasoning reaches its limit. The lack of reliability often leads to plausible but unreliable answers, introducing significant risks in many real‑world scenarios. To this end, we propose Boundary‑Aware Policy Optimization (BAPO), a novel RL framework designed to cultivate reliable boundary awareness without compromising accuracy. BAPO introduces two key components: (i) a group‑based boundary‑aware reward that encourages an IDK response only when the reasoning reaches its limit, and (ii) an adaptive reward modulator that strategically suspends this reward during early exploration, preventing the model from exploiting IDK as a shortcut. Extensive experiments on four benchmarks demonstrate that BAPO substantially enhances the overall reliability of agentic search.
Authors:Long Ma, Zihao Xue, Yan Wang, Zhiyuan Yan, Jin Xu, Xiaorui Jiang, Haiyang Yu, Yong Liao, Zhen Bi
Abstract:
Recent advances in generative modeling can create remarkably realistic synthetic videos, making it increasingly difficult for humans to distinguish them from real ones and necessitating reliable detection methods. However, two key limitations hinder the development of this field. From the dataset perspective, existing datasets are often limited in scale and constructed using outdated or narrowly scoped generative models, making it difficult to capture the diversity and rapid evolution of modern generative techniques. Moreover, the dataset construction process frequently prioritizes quantity over quality, neglecting essential aspects such as semantic diversity, scenario coverage, and technological representativeness. From the benchmark perspective, current benchmarks largely remain at the stage of dataset creation, leaving many fundamental issues and in‑depth analysis yet to be systematically explored. Addressing this gap, we propose AIGVDBench, a benchmark designed to be comprehensive and representative, covering 31 state‑of‑the‑art generation models and over 440,000 videos. By executing more than 1,500 evaluations on 33 existing detectors belonging to four distinct categories. This work presents 8 in‑depth analyses from multiple perspectives and identifies 4 novel findings that offer valuable insights for future research. We hope this work provides a solid foundation for advancing the field of AI‑generated video detection. Our benchmark is open‑sourced at https://github.com/LongMa‑2025/AIGVDBench.
Authors:Xinwei Wu, Heng Liu, Xiaohu Zhao, Yuqi Ren, Linlong Xu, Longyue Wang, Deyi Xiong, Weihua Luo, Kaifu Zhang
Abstract:
Large Language Models (LLMs) frequently exhibit strong translation abilities, even without task‑specific fine‑tuning. However, the internal mechanisms governing this innate capability remain largely opaque. To demystify this process, we leverage Sparse Autoencoders (SAEs) and introduce a novel framework for identifying task‑specific features. Our method first recalls features that are frequently co‑activated on translation inputs and then filters them for functional coherence using a PCA‑based consistency metric. This framework successfully isolates a small set of translation initiation features. Causal interventions demonstrate that amplifying these features steers the model towards correct translation, while ablating them induces hallucinations and off‑task outputs, confirming they represent a core component of the model's innate translation competency. Moving from analysis to application, we leverage this mechanistic insight to propose a new data selection strategy for efficient fine‑tuning. Specifically, we prioritize training on mechanistically hard samples‑those that fail to naturally activate the translation initiation features. Experiments show this approach significantly improves data efficiency and suppresses hallucinations. Furthermore, we find these mechanisms are transferable to larger models of the same family. Our work not only decodes a core component of the translation mechanism in LLMs but also provides a blueprint for using internal model mechanism to create more robust and efficient models. The codes are available at https://github.com/flamewei123/AAAI26‑translation‑Initiation‑Features.
Authors:Jiahao Wang, Shuangjia Zheng
Abstract:
The ability to engineer optimized protein variants has transformative potential for biotechnology and medicine. Prior sequence‑based optimization methods struggle with the high‑dimensional complexities due to the epistasis effect and the disregard for structural constraints. To address this, we propose HADES, a Bayesian optimization method utilizing Hamiltonian dynamics to efficiently sample from a structure‑aware approximated posterior. Leveraging momentum and uncertainty in the simulated physical movements, HADES enables rapid transition of proposals toward promising areas. A position discretization procedure is introduced to propose discrete protein sequences from such a continuous state system. The posterior surrogate is powered by a two‑stage encoder‑decoder framework to determine the structure and function relationships between mutant neighbors, consequently learning a smoothed landscape to sample from. Extensive experiments demonstrate that our method outperforms state‑of‑the‑art baselines in in‑silico evaluations across most metrics. Remarkably, our approach offers a unique advantage by leveraging the mutual constraints between protein structure and sequence, facilitating the design of protein sequences with similar structures and optimized properties. The code and data are publicly available at https://github.com/GENTEL‑lab/HADES.
Authors:Chongcong Jiang, Tianxingjian Ding, Chuhan Song, Jiachen Tu, Ziyang Yan, Yihua Shao, Zhenyi Wang, Yuzhang Shang, Tianyu Han, Yu Tian
Abstract:
Promptable segmentation foundation models such as SAM3 have demonstrated strong generalization capabilities through interactive and concept‑based prompting. However, their direct applicability to medical image segmentation remains limited by severe domain shifts, the absence of privileged spatial prompts, and the need to reason over complex anatomical and volumetric structures. Here we present Medical SAM3, a foundation model for universal prompt‑driven medical image segmentation, obtained by fully fine‑tuning SAM3 on large‑scale, heterogeneous 2D and 3D medical imaging datasets with paired segmentation masks and text prompts. Through a systematic analysis of vanilla SAM3, we observe that its performance degrades substantially on medical data, with its apparent competitiveness largely relying on strong geometric priors such as ground‑truth‑derived bounding boxes. These findings motivate full model adaptation beyond prompt engineering alone. By fine‑tuning SAM3's model parameters on 33 datasets spanning 10 medical imaging modalities, Medical SAM3 acquires robust domain‑specific representations while preserving prompt‑driven flexibility. Extensive experiments across organs, imaging modalities, and dimensionalities demonstrate consistent and significant performance gains, particularly in challenging scenarios characterized by semantic ambiguity, complex morphology, and long‑range 3D context. Our results establish Medical SAM3 as a universal, text‑guided segmentation foundation model for medical imaging and highlight the importance of holistic model adaptation for achieving robust prompt‑driven segmentation under severe domain shift. Code and model will be made available at https://github.com/AIM‑Research‑Lab/Medical‑SAM3.
Authors:Sen Wang, Bangwei Liu, Zhenkun Gao, Lizhuang Ma, Xuhong Wang, Yuan Xie, Xin Tan
Abstract:
An ideal embodied agent should possess lifelong learning capabilities to handle long‑horizon and complex tasks, enabling continuous operation in general environments. This not only requires the agent to accurately accomplish given tasks but also to leverage long‑term episodic memory to optimize decision‑making. However, existing mainstream one‑shot embodied tasks primarily focus on task completion results, neglecting the crucial process of exploration and memory utilization. To address this, we propose Long‑term Memory Embodied Exploration (LMEE), which aims to unify the agent's exploratory cognition and decision‑making behaviors to promote lifelong learning.We further construct a corresponding dataset and benchmark, LMEE‑Bench, incorporating multi‑goal navigation and memory‑based question answering to comprehensively evaluate both the process and outcome of embodied exploration. To enhance the agent's memory recall and proactive exploration capabilities, we propose MemoryExplorer, a novel method that fine‑tunes a multimodal large language model through reinforcement learning to encourage active memory querying. By incorporating a multi‑task reward function that includes action prediction, frontier selection, and question answering, our model achieves proactive exploration. Extensive experiments against state‑of‑the‑art embodied exploration models demonstrate that our approach achieves significant advantages in long‑horizon embodied tasks.
Authors:Gerard Yeo, Svetlana Churina, Kokil Jaidka
Abstract:
Perceived trustworthiness underpins how users navigate online information, yet it remains unclear whether large language models (LLMs),increasingly embedded in search, recommendation, and conversational systems, represent this construct in psychologically coherent ways. We analyze how instruction‑tuned LLMs (Llama 3.1 8B, Qwen 2.5 7B, Mistral 7B) encode perceived trustworthiness in web‑like narratives using the PEACE‑Reviews dataset annotated for cognitive appraisals, emotions, and behavioral intentions. Across models, systematic layer‑ and head‑level activation differences distinguish high‑ from low‑trust texts, revealing that trust cues are implicitly encoded during pretraining. Probing analyses show linearly de‑codable trust signals and fine‑tuning effects that refine rather than restructure these representations. Strongest associations emerge with appraisals of fairness, certainty, and accountability‑self ‑‑ dimensions central to human trust formation online. These findings demonstrate that modern LLMs internalize psychologically grounded trust signals without explicit supervision, offering a representational foundation for designing credible, transparent, and trust‑worthy AI systems in the web ecosystem. Code and appendix are available at: https://github.com/GerardYeo/TrustworthinessLLM.
Authors:Changle Qu, Sunhao Dai, Hengyi Cai, Jun Xu, Shuaiqiang Wang, Dawei Yin
Abstract:
Tool‑Integrated Reasoning (TIR) empowers large language models (LLMs) to tackle complex tasks by interleaving reasoning steps with external tool interactions. However, existing reinforcement learning methods typically rely on outcome‑ or trajectory‑level rewards, assigning uniform advantages to all steps within a trajectory. This coarse‑grained credit assignment fails to distinguish effective tool calls from redundant or erroneous ones, particularly in long‑horizon multi‑turn scenarios. To address this, we propose MatchTIR, a framework that introduces fine‑grained supervision via bipartite matching‑based turn‑level reward assignment and dual‑level advantage estimation. Specifically, we formulate credit assignment as a bipartite matching problem between predicted and ground‑truth traces, utilizing two assignment strategies to derive dense turn‑level rewards. Furthermore, to balance local step precision with global task success, we introduce a dual‑level advantage estimation scheme that integrates turn‑level and trajectory‑level signals, assigning distinct advantage values to individual interaction turns. Extensive experiments on three benchmarks demonstrate the superiority of MatchTIR. Notably, our 4B model surpasses the majority of 8B competitors, particularly in long‑horizon and multi‑turn tasks. Our codes are available at https://github.com/quchangle1/MatchTIR.
Authors:Xi Shi, Mengxin Zheng, Qian Lou
Abstract:
Multi‑agent systems (MAS) enable complex reasoning by coordinating multiple agents, but often incur high inference latency due to multi‑step execution and repeated model invocations, severely limiting their scalability and usability in time‑sensitive scenarios. Most existing approaches primarily optimize task performance and inference cost, and explicitly or implicitly assume sequential execution, making them less optimal for controlling latency under parallel execution. In this work, we investigate learning‑based orchestration of multi‑agent systems with explicit latency supervision under parallel execution. We propose Latency‑Aware Multi‑agent System (LAMaS), a latency‑aware multi‑agent orchestration framework that enables parallel execution and explicitly optimizes the critical execution path, allowing the controller to construct execution topology graphs with lower latency under parallel execution. Our experiments show that our approach reduces critical path length by 38‑46% compared to the state‑of‑the‑art baseline for multi‑agent architecture search across multiple benchmarks, while maintaining or even improving task performance. These results highlight the importance of explicitly optimizing latency under parallel execution when designing efficient multi‑agent systems. The code is available at https://github.com/xishi404/LAMaS
Authors:Yinzhi Zhao, Ming Wang, Shi Feng, Xiaocui Yang, Daling Wang, Yifei Zhang
Abstract:
Large language models (LLMs) have achieved impressive performance across natural language tasks and are increasingly deployed in real‑world applications. Despite extensive safety alignment efforts, recent studies show that such alignment is often shallow and remains vulnerable to jailbreak attacks. Existing defense mechanisms, including decoding‑based constraints and post‑hoc content detectors, struggle against sophisticated jailbreaks, often intervening robust detection or excessively degrading model utility. In this work, we examine the decoding process of LLMs and make a key observation: even when successfully jailbroken, models internally exhibit latent safety‑related signals during generation. However, these signals are overridden by the model's drive for fluent continuation, preventing timely self‑correction or refusal. Building on this observation, we propose a simple yet effective approach that explicitly surfaces and leverages these latent safety signals for early detection of unsafe content during decoding. Experiments across diverse jailbreak attacks demonstrate that our approach significantly enhances safety, while maintaining low over‑refusal rates on benign inputs and preserving response quality. Our results suggest that activating intrinsic safety‑awareness during decoding offers a promising and complementary direction for defending against jailbreak attacks. Code is available at: https://github.com/zyz13590/SafeProbing.
Authors:Yu Wang, Yi Wang, Rui Dai, Yujie Wang, Kaikui Liu, Xiangxiang Chu, Yansheng Li
Abstract:
As hubs of human activity, urban surfaces consist of a wealth of semantic entities. Segmenting these various entities from satellite imagery is crucial for a range of downstream applications. Current advanced segmentation models can reliably segment entities defined by physical attributes (e.g., buildings, water bodies) but still struggle with socially defined categories (e.g., schools, parks). In this work, we achieve socio‑semantic segmentation by vision‑language model reasoning. To facilitate this, we introduce the Urban Socio‑Semantic Segmentation dataset named SocioSeg, a new resource comprising satellite imagery, digital maps, and pixel‑level labels of social semantic entities organized in a hierarchical structure. Additionally, we propose a novel vision‑language reasoning framework called SocioReasoner that simulates the human process of identifying and annotating social semantic entities via cross‑modal recognition and multi‑stage reasoning. We employ reinforcement learning to optimize this non‑differentiable process and elicit the reasoning capabilities of the vision‑language model. Experiments demonstrate our approach's gains over state‑of‑the‑art models and strong zero‑shot generalization. Our dataset and code are available in https://github.com/AMAP‑ML/SocioReasoner.
Authors:Ahmad Mustapha, Charbel Toumieh, Mariette Awad
Abstract:
With advancements in deep learning (DL) and computer vision techniques, the field of chart understanding is evolving rapidly. In particular, multimodal large language models (MLLMs) are proving to be efficient and accurate in understanding charts. To accurately measure the performance of MLLMs, the research community has developed multiple datasets to serve as benchmarks. By examining these datasets, we found that they are all limited to a small set of chart types. To bridge this gap, we propose the ChartComplete dataset. The dataset is based on a chart taxonomy borrowed from the visualization community, and it covers thirty different chart types. The dataset is a collection of classified chart images and does not include a learning signal. We present the ChartComplete dataset as is to the community to build upon it.
Authors:Mark Kashirskiy, Ilya Makarov
Abstract:
We propose Strategy‑aware Surprise (SuS), a novel intrinsic motivation framework that uses pre‑post prediction mismatch as a novelty signal for exploration in reinforcement learning. Unlike traditional curiosity‑driven methods that rely solely on state prediction error, SuS introduces two complementary components: Strategy Stability (SS) and Strategy Surprise (SuS). SS measures consistency in behavioral strategy across temporal steps, while SuS captures unexpected outcomes relative to the agent's current strategy representation. Our combined reward formulation leverages both signals through learned weighting coefficients. We evaluate SuS on mathematical reasoning tasks using large language models, demonstrating significant improvements in both accuracy and solution diversity. Ablation studies confirm that removing either component results in at least 10% performance degradation, validating the synergistic nature of our approach. SuS achieves 17.4% improvement in Pass@1 and 26.4% improvement in Pass@5 compared to baseline methods, while maintaining higher strategy diversity throughout training.
Authors:Yuxuan Lou, Kai Yang, Yang You
Abstract:
We present MoST (Mixture of Speech and Text), a novel multimodal large language model that seamlessly integrates speech and text processing through our proposed Modality‑Aware Mixture of Experts (MAMoE) architecture. While current multimodal models typically process diverse modality representations with identical parameters, disregarding their inherent representational differences, we introduce specialized routing pathways that direct tokens to modality‑appropriate experts based on input type. MAMoE simultaneously enhances modality‑specific learning and cross‑modal understanding through two complementary components: modality‑specific expert groups that capture domain‑specific patterns and shared experts that facilitate information transfer between modalities. Building on this architecture, we develop an efficient transformation pipeline that adapts the pretrained MoE language model through strategic post‑training on ASR and TTS datasets, followed by fine‑tuning with a carefully curated speech‑text instruction dataset. A key feature of this pipeline is that it relies exclusively on fully accessible, open‑source datasets to achieve strong performance and data efficiency. Comprehensive evaluations across ASR, TTS, audio language modeling, and spoken question answering benchmarks show that MoST consistently outperforms existing models of comparable parameter counts. Our ablation studies confirm that the modality‑specific routing mechanism and shared experts design significantly contribute to performance gains across all tested domains. To our knowledge, MoST represents the first fully open‑source speech‑text LLM built on a Mixture of Experts architecture. \footnoteWe release MoST model, training code, inference code, and training data at https://github.com/NUS‑HPC‑AI‑Lab/MoST
Authors:Chaochao Chen, Jiaming Qian, Fei Zheng, Yachuan Liu
Abstract:
The prevalence of recommendation systems also brings privacy concerns to both the users and the sellers, as centralized platforms collect as much data as possible from them. To keep the data private, we propose PADER: a Paillier‑based secure decentralized social recommendation system. In this system, the users and the sellers are nodes in a decentralized network. The training and inference of the recommendation model are carried out securely in a decentralized manner, without the involvement of a centralized platform. To this end, we apply the Paillier cryptosystem to the SoReg (Social Regularization) model, which exploits both user's ratings and social relations. We view the SoReg model as a two‑party secure polynomial evaluation problem and observe that the simple bipartite computation may result in poor efficiency. To improve efficiency, we design secure addition and multiplication protocols to support secure computation on any arithmetic circuit, along with an optimal data packing scheme that is suitable for the polynomial computations of real values. Experiment results show that our method only takes about one second to iterate through one user with hundreds of ratings, and training with ~500K ratings for one epoch only takes <3 hours, which shows that the method is practical in real applications. The code is available at https://github.com/GarminQ/PADER.
Authors:Arya Shah, Himanshu beniwal, Mayank Singh
Abstract:
Aligning multilingual assistants with culturally grounded user preferences is essential for serving India's linguistically diverse population of over one billion speakers across multiple scripts. However, existing benchmarks either focus on a single language or conflate retrieval with generation, leaving open the question of whether current embedding models can encode persona‑instruction compatibility without relying on response synthesis. We present a unified benchmark spanning 12 Indian languages and four evaluation tasks: monolingual and cross‑lingual persona‑to‑instruction retrieval, reverse retrieval from instruction to persona, and binary compatibility classification. Eight multilingual embedding models are evaluated in a frozen‑encoder setting with a thin logistic regression head for classification. E5‑Large‑Instruct achieves the highest Recall@1 of 27.4% on monolingual retrieval and 20.7% on cross‑lingual transfer, while BGE‑M3 leads reverse retrieval at 32.1% Recall@1. For classification, LaBSE attains 75.3% AUROC with strong calibration. These findings offer practical guidance for model selection in Indic multilingual retrieval and establish reproducible baselines for future work\footnoteCode, datasets, and models are publicly available at https://github.com/aryashah2k/PI‑Indic‑Align.
Authors:Hao Li, Yankai Yang, G. Edward Suh, Ning Zhang, Chaowei Xiao
Abstract:
Large Language Models (LLMs) have enabled the development of powerful agentic systems capable of automating complex workflows across various fields. However, these systems are highly vulnerable to indirect prompt injection attacks, where malicious instructions embedded in external data can hijack agent behavior. In this work, we present ReasAlign, a model‑level solution to improve safety alignment against indirect prompt injection attacks. The core idea of ReasAlign is to incorporate structured reasoning steps to analyze user queries, detect conflicting instructions, and preserve the continuity of the user's intended tasks to defend against indirect injection attacks. To further ensure reasoning logic and accuracy, we introduce a test‑time scaling mechanism with a preference‑optimized judge model that scores reasoning steps and selects the best trajectory. Comprehensive evaluations across various benchmarks show that ReasAlign maintains utility comparable to an undefended model while consistently outperforming Meta SecAlign, the strongest prior guardrail. On the representative open‑ended CyberSecEval2 benchmark, which includes multiple prompt‑injected tasks, ReasAlign achieves 94.6% utility and only 3.6% ASR, far surpassing the state‑of‑the‑art defensive model of Meta SecAlign (56.4% utility and 74.4% ASR). These results demonstrate that ReasAlign achieves the best trade‑off between security and utility, establishing a robust and practical defense against prompt injection attacks in real‑world agentic systems. Our code and experimental results could be found at https://github.com/leolee99/ReasAlign.
Authors:Prachuryya Kaushik, Ashish Anand
Abstract:
We introduce AWED‑FiNER, an open‑source ecosystem designed to bridge the gap in Fine‑grained Named Entity Recognition (FgNER) for 36 global languages spoken by more than 6.6 billion people. While Large Language Models (LLMs) dominate general Natural Language Processing (NLP) tasks, they often struggle with low‑resource languages and fine‑grained NLP tasks. AWED‑FiNER provides a collection of agentic toolkits, web applications, and several state‑of‑the‑art expert models that provides FgNER solutions across 36 languages. The agentic tools enable to route multilingual text to specialized expert models and fetch FgNER annotations within seconds. The web‑based platforms provide ready‑to‑use FgNER annotation service for non‑technical users. Moreover, the collection of language specific extremely small sized open‑source state‑of‑the‑art expert models facilitate offline deployment in resource contraint scenerios including edge devices. AWED‑FiNER covers languages spoken by over 6.6 billion people, including a specific focus on vulnerable languages such as Bodo, Manipuri, Bishnupriya, and Mizo. The resources can be accessed here: Agentic Tool (https://github.com/PrachuryyaKaushik/AWED‑FiNER), Web Application (https://hf.co/spaces/prachuryyaIITG/AWED‑FiNER), and 49 Expert Detector Models (https://hf.co/collections/prachuryyaIITG/awed‑finer).
Authors:Chenyue Zhou, Jiayi Tuo, Shitong Qin, Wei Dai, Mingxuan Wang, Ziwei Zhao, Duoyang Li, Shiyang Su, Yanxi Lu, Yanbiao Ma
Abstract:
The automated extraction of structured questions from paper‑based mathematics exams is fundamental to intelligent education, yet remains challenging in real‑world settings due to severe visual noise. Existing benchmarks mainly focus on clean documents or generic layout analysis, overlooking both the structural integrity of mathematical problems and the ability of models to actively reject incomplete inputs. We introduce MathDoc, the first benchmark for document‑level information extraction from authentic high school mathematics exam papers. MathDoc contains 3,609 carefully curated questions with real‑world artifacts and explicitly includes unrecognizable samples to evaluate active refusal behavior. We propose a multi‑dimensional evaluation framework covering stem accuracy, visual similarity, and refusal capability. Experiments on SOTA MLLMs, including Qwen3‑VL and Gemini‑2.5‑Pro, show that although end‑to‑end models achieve strong extraction performance, they consistently fail to refuse illegible inputs, instead producing confident but invalid outputs. These results highlight a critical gap in current MLLMs and establish MathDoc as a benchmark for assessing model reliability under degraded document conditions. Our project repository is available at \hrefhttps://github.com/winnk123/papers/tree/masterGitHub repository
Authors:Han Wang, Yi Yang, Jingyuan Hu, Minfeng Zhu, Wei Chen
Abstract:
Recent advances in multimodal learning have significantly enhanced the reasoning capabilities of vision‑language models (VLMs). However, state‑of‑the‑art approaches rely heavily on large‑scale human‑annotated datasets, which are costly and time‑consuming to acquire. To overcome this limitation, we introduce V‑Zero, a general post‑training framework that facilitates self‑improvement using exclusively unlabeled images. V‑Zero establishes a co‑evolutionary loop by instantiating two distinct roles: a Questioner and a Solver. The Questioner learns to synthesize high‑quality, challenging questions by leveraging a dual‑track reasoning reward that contrasts intuitive guesses with reasoned results. The Solver is optimized using pseudo‑labels derived from majority voting over its own sampled responses. Both roles are trained iteratively via Group Relative Policy Optimization (GRPO), driving a cycle of mutual enhancement. Remarkably, without a single human annotation, V‑Zero achieves consistent performance gains on Qwen2.5‑VL‑7B‑Instruct, improving visual mathematical reasoning by +1.7 and general vision‑centric by +2.6, demonstrating the potential of self‑improvement in multimodal systems. Code is available at https://github.com/SatonoDia/V‑Zero
Authors:Jack Wilkie, Hanan Hindy, Craig Michie, Christos Tachtatzis, James Irvine, Robert Atkinson
Abstract:
Machine learning has achieved state‑of‑the‑art results in network intrusion detection; however, its performance significantly degrades when confronted by a new attack class ‑‑ a zero‑day attack. In simple terms, classical machine learning‑based approaches are adept at identifying attack classes on which they have been previously trained, but struggle with those not included in their training data. One approach to addressing this shortcoming is to utilise anomaly detectors which train exclusively on benign data with the goal of generalising to all attack classes ‑‑ both known and zero‑day. However, this comes at the expense of a prohibitively high false positive rate. This work proposes a novel contrastive loss function which is able to maintain the advantages of other contrastive learning‑based approaches (robustness to imbalanced data) but can also generalise to zero‑day attacks. Unlike anomaly detectors, this model learns the distributions of benign traffic using both benign and known malign samples, i.e. other well‑known attack classes (not including the zero‑day class), and consequently, achieves significant performance improvements. The proposed approach is experimentally verified on the Lycos2017 dataset where it achieves an AUROC improvement of .000065 and .060883 over previous models in known and zero‑day attack detection, respectively. Finally, the proposed method is extended to open‑set recognition achieving OpenAUC improvements of .170883 over existing approaches.
Authors:Xinxing Ren, Quagmire Zang, Caelum Forder, Suman Deb, Ahsen Tahir, Roman J. Georgio, Peter Carroll, Zekun Guo
Abstract:
Most existing Large Language Model (LLM)‑based Multi‑Agent Systems (MAS) rely on predefined workflows, where human engineers enumerate task states in advance and specify routing rules and contextual injections accordingly. Such workflow‑driven designs are essentially rule‑based decision trees, which suffer from two fundamental limitations: they require substantial manual effort to anticipate and encode possible task states, and they cannot exhaustively cover the state space of complex real‑world tasks. To address these issues, we propose an Information‑Flow‑Orchestrated Multi‑Agent Paradigm via Agent‑to‑Agent (A2A) Communication from CORAL, in which a dedicated information flow orchestrator continuously monitors task progress and dynamically coordinates other agents through the A2A toolkit using natural language, without relying on predefined workflows. We evaluate our approach on the general‑purpose benchmark GAIA, using the representative workflow‑based MAS OWL as the baseline while controlling for agent roles and underlying models. Under the pass@1 setting, our method achieves 63.64% accuracy, outperforming OWL's 55.15% by 8.49 percentage points with comparable token consumption. Further case‑level analysis shows that our paradigm enables more flexible task monitoring and more robust handling of edge cases. Our implementation is publicly available at: https://github.com/Coral‑Protocol/Beyond‑Rule‑Based‑Workflows
Authors:Sraavya Sambara, Yuan Pu, Ayman Ali, Vishala Mishra, Lionel Wong, Monica Agrawal
Abstract:
Real‑world health questions from patients often unintentionally embed false assumptions or premises. In such cases, safe medical communication typically involves redirection: addressing the implicit misconception and then responding to the underlying patient context, rather than the original question. While large language models (LLMs) are increasingly being used by lay users for medical advice, they have not yet been tested for this crucial competency. Therefore, in this work, we investigate how LLMs react to false premises embedded within real‑world health questions. We develop a semi‑automated pipeline to curate MedRedFlag, a dataset of 1100+ questions sourced from Reddit that require redirection. We then systematically compare responses from state‑of‑the‑art LLMs to those from clinicians. Our analysis reveals that LLMs often fail to redirect problematic questions, even when the problematic premise is detected, and provide answers that could lead to suboptimal medical decision making. Our benchmark and results reveal a novel and substantial gap in how LLMs perform under the conditions of real‑world health communication, highlighting critical safety concerns for patient‑facing medical AI systems. Code and dataset are available at https://github.com/srsambara‑1/MedRedFlag.
Authors:Nguyen Minh Phuong, Dang Huu Tien, Naoya Inoue
Abstract:
Modern logical reasoning with LLMs primarily relies on employing complex interactive frameworks that decompose the reasoning process into subtasks solved through carefully designed prompts or requiring external resources (e.g., symbolic solvers) to exploit their strong logical structures. While interactive approaches introduce additional overhead or depend on external components, which limit their scalability. In this work, we introduce a non‑interactive, end‑to‑end framework for reasoning tasks, enabling reasoning to emerge within the model itself‑improving generalization while preserving analyzability without any external resources. We show that introducing structural information into the few‑shot prompt activates a subset of attention heads that patterns aligned with logical reasoning operators. Building on this insight, we propose Attention‑Aware Intervention (AAI), an inference‑time intervention method that reweights attention scores across selected heads identified by their logical patterns. AAI offers an efficient way to steer the model's reasoning toward leveraging prior knowledge through attention modulation. Extensive experiments show that AAI enhances logical reasoning performance across diverse benchmarks, and model architectures, while incurring negligible additional computational overhead. Code is available at https://github.com/phuongnm94/aai_for_logical_reasoning.
Authors:Jing-Yi Zeng, Guan-Hua Huang
Abstract:
This study investigates how to efficiently build a domain‑specialized large language model (LLM) for statistics using the lightweight LLaMA‑3.2‑3B family as the foundation model (FM). We systematically compare three multi‑stage training pipelines, starting from a base FM with no instruction‑following capability, a base FM augmented with post‑hoc instruction tuning, and an instruction‑tuned FM with strong general reasoning abilities across continual pretraining, supervised fine‑tuning (SFT), reinforcement learning from human feedback (RLHF) preference alignment, and downstream task adaptation. Results show that pipelines beginning with a base FM fail to develop meaningful statistical reasoning, even after extensive instruction tuning, SFT, or RLHF alignment. In contrast, starting from LLaMA‑3.2‑3B‑Instruct enables effective domain specialization. A comprehensive evaluation of SFT variants reveals clear trade‑offs between domain expertise and general reasoning ability. We further demonstrate that direct preference optimization provides stable and effective RLHF preference alignment. Finally, we show that downstream fine‑tuning must be performed with extremely low intensity to avoid catastrophic forgetting in highly optimized models. The final model, StatLLaMA, achieves strong and balanced performance on benchmarks of mathematical reasoning, common‑sense reasoning, and statistical expertise, offering a practical blueprint for developing resource‑efficient statistical LLMs. The code is available at https://github.com/HuangDLab/StatLLaMA.
Authors:Yiwei Yan, Hao Li, Hua He, Gong Kai, Zhengyi Yang, Guanfeng Liu
Abstract:
Online medical consultations generate large volumes of conversational health data that often embed protected health information, requiring robust methods to classify data categories and assign risk levels in line with policies and practice. However, existing approaches lack unified standards and reliable automated methods to fulfill sensitivity classification for such conversational health data. This study presents a large language model‑based extraction pipeline, SALP‑CG, for classifying and grading privacy risks in online conversational health data. We concluded health‑data classification and grading rules in accordance with GB/T 39725‑2020. Combining few‑shot guidance, JSON Schema constrained decoding, and deterministic high‑risk rules, the backend‑agnostic extraction pipeline achieves strong category compliance and reliable sensitivity across diverse LLMs. On the MedDialog‑CN benchmark, models yields robust entity counts, high schema compliance, and accurate sensitivity grading, while the strongest model attains micro‑F1=0.900 for maximum‑level prediction. The category landscape stratified by sensitivity shows that Level 2‑3 items dominate, enabling re‑identification when combined; Level 4‑5 items are less frequent but carry outsize harm. SALP‑CG reliably helps classify categories and grading sensitivity in online conversational health data across LLMs, offering a practical method for health data governance. Code is available at https://github.com/dommii1218/SALP‑CG.
Authors:Chi-Pin Huang, Yunze Man, Zhiding Yu, Min-Hung Chen, Jan Kautz, Yu-Chiang Frank Wang, Fu-En Yang
Abstract:
Vision‑Language‑Action (VLA) tasks require reasoning over complex visual scenes and executing adaptive actions in dynamic environments. While recent studies on reasoning VLAs show that explicit chain‑of‑thought (CoT) can improve generalization, they suffer from high inference latency due to lengthy reasoning traces. We propose Fast‑ThinkAct, an efficient reasoning framework that achieves compact yet performant planning through verbalizable latent reasoning. Fast‑ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference‑guided objective to align manipulation trajectories that transfers both linguistic and visual planning capabilities for embodied control. This enables reasoning‑enhanced policy learning that effectively connects compact reasoning to action execution. Extensive experiments across diverse embodied manipulation and reasoning benchmarks demonstrate that Fast‑ThinkAct achieves strong performance with up to 89.3% reduced inference latency over state‑of‑the‑art reasoning VLAs, while maintaining effective long‑horizon planning, few‑shot adaptation, and failure recovery.
Authors:Tianyi Niu, Justin Chih-Yao Chen, Genta Indra Winata, Shi-Xiong Zhang, Supriyo Chakraborty, Sambit Sahu, Yue Zhang, Elias Stengel-Eskin, Mohit Bansal
Abstract:
Large Language Model (LLM) routers dynamically select optimal models for given inputs. Existing approaches typically assume access to ground‑truth labeled data, which is often unavailable in practice, especially when user request distributions are heterogeneous and unknown. We introduce Routing with Generated Data (RGD), a challenging setting in which routers are trained exclusively on generated queries and answers produced from high‑level task descriptions by generator LLMs. We evaluate query‑answer routers (using both queries and labels) and query‑only routers across four diverse benchmarks and 12 models, finding that query‑answer routers degrade faster than query‑only routers as generator quality decreases. Our analysis reveals two crucial characteristics of effective generators: they must accurately respond to their own questions, and their questions must produce sufficient performance differentiation among the model pool. We then show how filtering for these characteristics can improve the quality of generated data. We further propose CASCAL, a novel query‑only router that estimates model correctness through consensus voting and identifies model‑specific skill niches via hierarchical clustering. CASCAL is substantially more robust to generator quality, outperforming the best query‑answer router by 4.6% absolute accuracy when trained on weak generator data.
Authors:Kuo Liang, Yuhang Lu, Jianming Mao, Shuyi Sun, Chunwei Yang, Congcong Zeng, Xiao Jin, Hanzhang Qin, Ruihao Zhu, Chung-Piaw Teo
Abstract:
Large‑scale optimization is a key backbone of modern business decision‑making. However, building these models is often labor‑intensive and time‑consuming. We address this by proposing LEAN‑LLM‑OPT, a LightwEight AgeNtic workflow construction framework for LLM‑assisted large‑scale OPTimization auto‑formulation. LEAN‑LLM‑OPT takes as input a problem description together with associated datasets and orchestrates a team of LLM agents to produce an optimization formulation. Specifically, upon receiving a query, two upstream LLM agents dynamically construct a workflow that specifies, step‑by‑step, how optimization models for similar problems can be formulated. A downstream LLM agent then follows this workflow to generate the final output. The agentic workflow leverages common modeling practices to standardize the modeling process into a sequence of structured sub‑tasks, offloading mechanical data‑handling operations to auxiliary tools. This reduces the LLM's burden in planning and data handling, allowing us to exploit its flexibility to address unstructured components. Extensive simulations show that LEAN‑LLM‑OPT, instantiated with GPT‑4.1 and the open source gpt‑oss‑20B, achieves strong performance on large‑scale optimization modeling tasks and is competitive with state‑of‑the‑art approaches. In addition, in a Singapore Airlines choice‑based revenue management use case, LEAN‑LLM‑OPT demonstrates practical value by achieving leading performance across a range of scenarios. Along the way, we introduce Large‑Scale‑OR and Air‑NRM, the first comprehensive benchmarks for large‑scale optimization auto‑formulation. The code and data of this work is available at https://github.com/CoraLiang01/lean‑llm‑opt.
Authors:Yonglin Tian, Qiyao Zhang, Wei Xu, Yutong Wang, Yihao Wu, Xinyi Li, Xingyuan Dai, Hui Zhang, Zhiyong Cui, Baoqing Guo, Zujun Yu, Yisheng Lv
Abstract:
Accurate and early perception of potential intrusion targets is essential for ensuring the safety of railway transportation systems. However, most existing systems focus narrowly on object classification within fixed visual scopes and apply rule‑based heuristics to determine intrusion status, often overlooking targets that pose latent intrusion risks. Anticipating such risks requires the cognition of spatial context and temporal dynamics for the object of interest (OOI), which presents challenges for conventional visual models. To facilitate deep intrusion perception, we introduce a novel benchmark, CogRail, which integrates curated open‑source datasets with cognitively driven question‑answer annotations to support spatio‑temporal reasoning and prediction. Building upon this benchmark, we conduct a systematic evaluation of state‑of‑the‑art visual‑language models (VLMs) using multimodal prompts to identify their strengths and limitations in this domain. Furthermore, we fine‑tune VLMs for better performance and propose a joint fine‑tuning framework that integrates three core tasks, position perception, movement prediction, and threat analysis, facilitating effective adaptation of general‑purpose foundation models into specialized models tailored for cognitive intrusion perception. Extensive experiments reveal that current large‑scale multimodal models struggle with the complex spatial‑temporal reasoning required by the cognitive intrusion perception task, underscoring the limitations of existing foundation models in this safety‑critical domain. In contrast, our proposed joint fine‑tuning framework significantly enhances model performance by enabling targeted adaptation to domain‑specific reasoning demands, highlighting the advantages of structured multi‑task learning in improving both accuracy and interpretability. Code will be available at https://github.com/Hub‑Tian/CogRail.
Authors:Pierfrancesco Melucci, Paolo Merialdo, Taketo Akama
Abstract:
Deep learning models define the state‑of‑the‑art in Automatic Drum Transcription (ADT), yet their performance is contingent upon large‑scale, paired audio‑MIDI datasets, which are scarce. Existing workarounds that use synthetic data often introduce a significant domain gap, as they typically rely on low‑fidelity SoundFont libraries that lack acoustic diversity. While high‑quality one‑shot samples offer a better alternative, they are not available in a standardized, large‑scale format suitable for training. This paper introduces a new paradigm for ADT that circumvents the need for paired audio‑MIDI training data. Our primary contribution is a semi‑supervised method to automatically curate a large and diverse corpus of one‑shot drum samples from unlabeled audio sources. We then use this corpus to synthesize a high‑quality dataset from MIDI files alone, which we use to train a sequence‑to‑sequence transcription model. We evaluate our model on the ENST and MDB test sets, where it achieves new state‑of‑the‑art results, significantly outperforming both fully supervised methods and previous synthetic‑data approaches. The code for reproducing our experiments is publicly available at https://github.com/pier‑maker92/ADT_STR
Authors:Renqiang Luo, Dong Zhang, Yupeng Gao, Wen Shi, Mingliang Hou, Jiaying Liu, Zhe Wang, Shuo Yu
Abstract:
Semantic understanding of popularity bias is a crucial yet underexplored challenge in recommender systems, where popular items are often favored at the expense of niche content. Most existing debiasing methods treat the semantic understanding of popularity bias as a matter of diversity enhancement or long‑tail coverage, neglecting the deeper semantic layer that embodies the causal origins of the bias itself. Consequently, such shallow interpretations limit both their debiasing effectiveness and recommendation accuracy. In this paper, we propose FairLRM, a novel framework that bridges the gap in the semantic understanding of popularity bias with Recommendation via Large Language Model (RecLLM). FairLRM decomposes popularity bias into item‑side and user‑side components, using structured instruction‑based prompts to enhance the model's comprehension of both global item distributions and individual user preferences. Unlike traditional methods that rely on surface‑level features such as "diversity" or "debiasing", FairLRM improves the model's ability to semantically interpret and address the underlying bias. Through empirical evaluation, we show that FairLRM significantly enhances both fairness and recommendation accuracy, providing a more semantically aware and trustworthy approach to enhance the semantic understanding of popularity bias. The implementation is available at https://github.com/LuoRenqiang/FairLRM.
Authors:Renqiang Luo, Yongshuai Yang, Huafei Huang, Qing Qing, Mingliang Hou, Ziqi Xu, Yi Yu, Jingjing Zhou, Feng Xia
Abstract:
Graph unlearning has emerged as a critical mechanism for supporting sustainable and privacy‑preserving social networks, enabling models to remove the influence of deleted nodes and thereby better safeguard user information. However, we observe that existing graph unlearning techniques insufficiently protect sensitive attributes, often leading to degraded algorithmic fairness compared with traditional graph learning methods. To address this gap, we introduce FairGU, a fairness‑aware graph unlearning framework designed to preserve both utility and fairness during the unlearning process. FairGU integrates a dedicated fairness‑aware module with effective data protection strategies, ensuring that sensitive attributes are neither inadvertently amplified nor structurally exposed when nodes are removed. Through extensive experiments on multiple real‑world datasets, we demonstrate that FairGU consistently outperforms state‑of‑the‑art graph unlearning methods and fairness‑enhanced graph learning baselines in terms of both accuracy and fairness metrics. Our findings highlight a previously overlooked risk in current unlearning practices and establish FairGU as a robust and equitable solution for the next generation of socially sustainable networked systems. The codes are available at https://github.com/LuoRenqiang/FairGU.
Authors:Yaxi Chen, Zi Ye, Shaheer U. Saeed, Oliver Yu, Simin Ni, Jie Huang, Yipeng Hu
Abstract:
Osteosarcoma (OS) is an aggressive primary bone malignancy. Accurate histopathological assessment of viable versus non‑viable tumor regions after neoadjuvant chemotherapy is critical for prognosis and treatment planning, yet manual evaluation remains labor‑intensive, subjective, and prone to inter‑observer variability. Recent advances in digital pathology have enabled automated necrosis quantification. Evaluating on test data, independently sampled on patient‑level, revealed that the deep learning model performance dropped significantly from the tile‑level generalization ability reported in previous studies. First, this work proposes the use of radiomic features as additional input in model training. We show that, despite that they are derived from the images, such a multimodal input effectively improved the classification performance, in addition to its added benefits in interpretability. Second, this work proposes to optimize two binary classification tasks with hierarchical classes (i.e. tumor‑vs‑non‑tumor and viable‑vs‑non‑viable), as opposed to the alternative ``flat'' three‑class classification task (i.e. non‑tumor, non‑viable tumor, viable tumor), thereby enabling a hierarchical loss. We show that such a hierarchical loss, with trainable weightings between the two tasks, the per‑class performance can be improved significantly. Using the TCIA OS Tumor Assessment dataset, we experimentally demonstrate the benefits from each of the proposed new approaches and their combination, setting a what we consider new state‑of‑the‑art performance on this open dataset for this application. Code and trained models: https://github.com/YaxiiC/RadiomicsOS.git.
Authors:Renqiang Luo, Huafei Huang, Tao Tang, Jing Ren, Ziqi Xu, Mingliang Hou, Enyan Dai, Feng Xia
Abstract:
Graph Transformers (GTs) are increasingly applied to social network analysis, yet their deployment is often constrained by fairness concerns. This issue is particularly critical in incomplete social networks, where sensitive attributes are frequently missing due to privacy and ethical restrictions. Existing solutions commonly generate these incomplete attributes, which may introduce additional biases and further compromise user privacy. To address this challenge, FairGE (Fair Graph Encoding) is introduced as a fairness‑aware framework for GTs in incomplete social networks. Instead of generating sensitive attributes, FairGE encodes fairness directly through spectral graph theory. By leveraging the principal eigenvector to represent structural information and padding incomplete sensitive attributes with zeros to maintain independence, FairGE ensures fairness without data reconstruction. Theoretical analysis demonstrates that the method suppresses the influence of non‑principal spectral components, thereby enhancing fairness. Extensive experiments on seven real‑world social network datasets confirm that FairGE achieves at least a 16% improvement in both statistical parity and equality of opportunity compared with state‑of‑the‑art baselines. The source code is shown in https://github.com/LuoRenqiang/FairGE.
Authors:Hanze Guo, Jianxun Lian, Xiao Zhou
Abstract:
Collaborative Filtering (CF) remains the cornerstone of modern recommender systems, with dense embedding‑‑based methods dominating current practice. However, these approaches suffer from a critical limitation: our theoretical analysis reveals a fundamental signal‑to‑noise ratio (SNR) ceiling when modeling unpopular items, where parameter‑based dense models experience diminishing SNR under severe data sparsity. To overcome this bottleneck, we propose SaD (Sparse and Dense), a unified framework that integrates the semantic expressiveness of dense embeddings with the structural reliability of sparse interaction patterns. We theoretically show that aligning these dual views yields a strictly superior global SNR. Concretely, SaD introduces a lightweight bidirectional alignment mechanism: the dense view enriches the sparse view by injecting semantic correlations, while the sparse view regularizes the dense model through explicit structural signals. Extensive experiments demonstrate that, under this dual‑view alignment, even a simple matrix factorization‑‑style dense model can achieve state‑of‑the‑art performance. Moreover, SaD is plug‑and‑play and can be seamlessly applied to a wide range of existing recommender models, highlighting the enduring power of collaborative filtering when leveraged from dual perspectives. Further evaluations on real‑world benchmarks show that SaD consistently outperforms strong baselines, ranking first on the BarsMatch leaderboard. The code is publicly available at https://github.com/harris26‑G/SaD.
Authors:Maria Sdraka, Dimitrios Michail, Ioannis Papoutsis
Abstract:
Delineating wildfire affected areas using satellite imagery remains challenging due to irregular and spatially heterogeneous spectral changes across the electromagnetic spectrum. While recent deep learning approaches achieve high accuracy when high‑resolution multispectral data are available, their applicability in operational settings, where a quick delineation of the burn scar shortly after a wildfire incident is required, is limited by the trade‑off between spatial resolution and temporal revisit frequency of current satellite systems. To address this limitation, we propose a novel deep learning model, namely BAM‑MRCD, which employs multi‑resolution, multi‑source satellite imagery (MODIS and Sentinel‑2) for the timely production of detailed burnt area maps with high spatial and temporal resolution. Our model manages to detect even small scale wildfires with high accuracy, surpassing similar change detection models as well as solid baselines. All data and code are available in the GitHub repository: https://github.com/Orion‑AI‑Lab/BAM‑MRCD.
Authors:Jian Zhang, Zhiyuan Wang, Zhangqi Wang, Yu He, Haoran Luo, li yuan, Lingling Zhang, Rui Mao, Qika Lin, Jun Liu
Abstract:
Large Language Model (LLM) Agents exhibit inherent reasoning abilities through the collaboration of multiple tools. However, during agent inference, existing methods often suffer from (i) locally myopic generation, due to the absence of lookahead, and (ii) trajectory instability, where minor early errors can escalate into divergent reasoning paths. These issues make it difficult to balance global effectiveness and computational efficiency. To address these two issues, we propose meta‑adaptive exploration with LLM agents https://github.com/exoskeletonzj/MAXS, a meta‑adaptive reasoning framework based on LLM Agents that flexibly integrates tool execution and reasoning planning. MAXS employs a lookahead strategy to extend reasoning paths a few steps ahead, estimating the advantage value of tool usage, and combines step consistency variance and inter‑step trend slopes to jointly select stable, consistent, and high‑value reasoning steps. Additionally, we introduce a trajectory convergence mechanism that controls computational cost by halting further rollouts once path consistency is achieved, enabling a balance between resource efficiency and global effectiveness in multi‑tool reasoning. We conduct extensive empirical studies across three base models (MiMo‑VL‑7B, Qwen2.5‑VL‑7B, Qwen2.5‑VL‑32B) and five datasets, demonstrating that MAXS consistently outperforms existing methods in both performance and inference efficiency. Further analysis confirms the effectiveness of our lookahead strategy and tool usage.
Authors:Zhengyang Zhao, Lu Ma, Yizhen Jiang, Xiaochen Ma, Zimo Meng, Chengyu Shen, Lexiang Tang, Haoze Sun, Peng Pei, Wentao Zhang
Abstract:
The prevailing post‑training paradigm for Large Reasoning Models (LRMs)‑‑Supervised Fine‑Tuning (SFT) followed by Reinforcement Learning (RL)‑‑suffers from an intrinsic optimization mismatch: the rigid supervision inherent in SFT induces distributional collapse, thereby exhausting the exploration space necessary for subsequent RL. In this paper, we reformulate SFT within a unified post‑training framework and propose Gibbs Initialization with Finite Temperature (GIFT). We characterize standard SFT as a degenerate zero‑temperature limit that suppresses base priors. Conversely, GIFT incorporates supervision as a finite‑temperature energy potential, establishing a distributional bridge that ensures objective consistency throughout the post‑training pipeline. Our experiments demonstrate that GIFT significantly outperforms standard SFT and other competitive baselines when utilized for RL initialization, providing a mathematically principled pathway toward achieving global optimality in post‑training. Our code is available at https://github.com/zzy1127/GIFT.
Authors:Derrick Goh Xin Deik, Quanyu Long, Zhengyuan Liu, Nancy F. Chen, Wenya Wang
Abstract:
Multi‑constraint planning involves identifying, evaluating, and refining candidate plans while satisfying multiple, potentially conflicting constraints. Existing large language model (LLM) approaches face fundamental limitations in this domain. Pure reasoning paradigms, which rely on long natural language chains, are prone to inconsistency, error accumulation, and prohibitive cost as constraints compound. Conversely, LLMs combined with coding‑ or solver‑based strategies lack flexibility: they often generate problem‑specific code from scratch or depend on fixed solvers, failing to capture generalizable logic across diverse problems. To address these challenges, we introduce the Scalable COde Planning Engine (SCOPE), a framework that disentangles query‑specific reasoning from generic code execution. By separating reasoning from execution, SCOPE produces solver functions that are consistent, deterministic, and reusable across queries while requiring only minimal changes to input parameters. SCOPE achieves state‑of‑the‑art performance while lowering cost and latency. For example, with GPT‑4o, it reaches 93.1% success on TravelPlanner, a 61.6% gain over the best baseline (CoT) while cutting inference cost by 1.4x and time by ~4.67x. Code is available at https://github.com/DerrickGXD/SCOPE.
Authors:Xuetao Li, Wenke Huang, Mang Ye, Jifeng Xuan, Bo Du, Sheng Liu, Miao Li
Abstract:
Humanoid robot manipulation is a crucial research area for executing diverse human‑level tasks, involving high‑level semantic reasoning and low‑level action generation. However, precise scene understanding and sample‑efficient learning from human demonstrations remain critical challenges, severely hindering the applicability and generalizability of existing frameworks. This paper presents a novel RGMP‑S, Recurrent Geometric‑prior Multimodal Policy with Spiking features, facilitating both high‑level skill reasoning and data‑efficient motion synthesis. To ground high‑level reasoning in physical reality, we leverage lightweight 2D geometric inductive biases to enable precise 3D scene understanding within the vision‑language model. Specifically, we construct a Long‑horizon Geometric Prior Skill Selector that effectively aligns the semantic instructions with spatial constraints, ultimately achieving robust generalization in unseen environments. For the data efficiency issue in robotic action generation, we introduce a Recursive Adaptive Spiking Network. We parameterize robot‑object interactions via recursive spiking for spatiotemporal consistency, fully distilling long‑horizon dynamic features while mitigating the overfitting issue in sparse demonstration scenarios. Extensive experiments are conducted across the Maniskill simulation benchmark and three heterogeneous real‑world robotic systems, encompassing a custom‑developed humanoid, a desktop manipulator, and a commercial robotic platform. Empirical results substantiate the superiority of our method over state‑of‑the‑art baselines and validate the efficacy of the proposed modules in diverse generalization scenarios. To facilitate reproducibility, the source code and video demonstrations are publicly available at https://github.com/xtli12/RGMP‑S.git.
Authors:Yu Xu, Hongbin Yan, Juan Cao, Yiji Cheng, Tiankai Hang, Runze He, Zijin Yin, Shiyi Zhang, Yuxin Zhang, Jintao Li, Chunyu Wang, Qinglin Lu, Tong-Yee Lee, Fan Tang
Abstract:
Unified image generation and editing models suffer from severe task interference in dense diffusion transformers architectures, where a shared parameter space must compromise between conflicting objectives (e.g., local editing v.s. subject‑driven generation). While the sparse Mixture‑of‑Experts (MoE) paradigm is a promising solution, its gating networks remain task‑agnostic, operating based on local features, unaware of global task intent. This task‑agnostic nature prevents meaningful specialization and fails to resolve the underlying task interference. In this paper, we propose a novel framework to inject semantic intent into MoE routing. We introduce a Hierarchical Task Semantic Annotation scheme to create structured task descriptors (e.g., scope, type, preservation). We then design Predictive Alignment Regularization to align internal routing decisions with the task's high‑level semantics. This regularization evolves the gating network from a task‑agnostic executor to a dispatch center. Our model effectively mitigates task interference, outperforming dense baselines in fidelity and quality, and our analysis shows that experts naturally develop clear and semantically correlated specializations.
Authors:Jiahao Qin, Yiwen Wang
Abstract:
Image registration under domain shift remains a fundamental challenge in computer vision and medical imaging: when source and target images exhibit systematic intensity differences, the brightness constancy assumption underlying conventional registration methods is violated, rendering correspondence estimation ill‑posed. We propose SAR‑Net, a unified framework that addresses this challenge through principled scene‑appearance disentanglement. Our key insight is that observed images can be decomposed into domain‑invariant scene representations and domain‑specific appearance codes, enabling registration via re‑rendering rather than direct intensity matching. We establish theoretical conditions under which this decomposition enables consistent cross‑domain alignment (Proposition 1) and prove that our scene consistency loss provides a sufficient condition for geometric correspondence in the shared latent space (Proposition 2). Empirically, we validate SAR‑Net on the ANHIR (Automatic Non‑rigid Histological Image Registration) challenge benchmark, where multi‑stain histopathology images exhibit coupled domain shift from different staining protocols and geometric distortion from tissue preparation. Our method achieves a median relative Target Registration Error (rTRE) of 0.25%, outperforming the state‑of‑the‑art MEVIS method (0.27% rTRE) by 7.4%, with robustness of 99.1%. Code is available at https://github.com/D‑ST‑Sword/SAR‑NET .
Authors:Qingyu Liu, Zhongjie Ba, Jianmin Guo, Qiu Wang, Zhibo Wang, Jie Shi, Kui Ren
Abstract:
Recently, reconstruction‑based methods have gained attention for AIGC image detection. These methods leverage pre‑trained diffusion models to reconstruct inputs and measure residuals for distinguishing real from fake images. Their key advantage lies in reducing reliance on dataset‑specific artifacts and improving generalization under distribution shifts. However, they are limited by significant inefficiency due to multi‑step inversion and reconstruction, and their reliance on diffusion backbones further limits generalization to other generative paradigms such as GANs. In this paper, we propose a novel fake image detection framework, called R^2BD, built upon two key designs: (1) G‑LDM, a unified reconstruction model that simulates the generation behaviors of VAEs, GANs, and diffusion models, thereby broadening the detection scope beyond prior diffusion‑only approaches; and (2) a residual bias calculation module that distinguishes real and fake images in a single inference step, which is a significant efficiency improvement over existing methods that typically require 20+ steps. Extensive experiments on the benchmark from 10 public datasets demonstrate that R^2BD is over 22× faster than existing reconstruction‑based methods while achieving superior detection accuracy. In cross‑dataset evaluations, it outperforms state‑of‑the‑art methods by an average of 13.87%, showing strong efficiency and generalization across diverse generative methods. The code and dataset used for evaluation are available at https://github.com/QingyuLiu/RRBD.
Authors:Hsiang-Wei Huang, Junbin Lu, Kuang-Ming Chen, Jenq-Neng Hwang
Abstract:
In this work, we explore the Large Language Model (LLM) agent reviewer dynamics in an Elo‑ranked review system using real‑world conference paper submissions. Multiple LLM agent reviewers with different personas are engage in multi round review interactions moderated by an Area Chair. We compare a baseline setting with conditions that incorporate Elo ratings and reviewer memory. Our simulation results showcase several interesting findings, including how incorporating Elo improves Area Chair decision accuracy, as well as reviewers' adaptive review strategy that exploits our Elo system without improving review effort. Our code is available at https://github.com/hsiangwei0903/EloReview.
Authors:Weixin Chen, Yuhan Zhao, Jingyuan Huang, Zihe Ye, Clark Mingxuan Ju, Tong Zhao, Neil Shah, Li Chen, Yongfeng Zhang
Abstract:
The evolution of recommender systems has shifted preference storage from rating matrices and dense embeddings to semantic memory in the agentic era. Yet existing agents rely on isolated memory, overlooking crucial collaborative signals. Bridging this gap is hindered by the dual challenges of distilling vast graph contexts without overwhelming reasoning agents with cognitive load, and evolving the collaborative memory efficiently without incurring prohibitive computational costs. To address this, we propose MemRec, a framework that architecturally decouples reasoning from memory management to enable efficient collaborative augmentation. MemRec introduces a dedicated, cost‑effective LM_Mem to manage a dynamic collaborative memory graph, serving synthesized, high‑signal context to a downstream LLM_Rec. The framework operates via a practical pipeline featuring efficient retrieval and cost‑effective asynchronous graph propagation that evolves memory in the background. Extensive experiments on four benchmarks demonstrate that MemRec achieves state‑of‑the‑art performance. Furthermore, architectural analysis confirms its flexibility, establishing a new Pareto frontier that balances reasoning quality, cost, and privacy through support for diverse deployments, including local open‑source models. Code:https://github.com/rutgerswiselab/memrec and Homepage: https://memrec.weixinchen.com
Authors:Yao Tang, Li Dong, Yaru Hao, Qingxiu Dong, Furu Wei, Jiatao Gu
Abstract:
Large language models often solve complex reasoning tasks more effectively with Chain‑of‑Thought (CoT), but at the cost of long, low‑bandwidth token sequences. Humans, by contrast, often reason softly by maintaining a distribution over plausible next steps. Motivated by this, we propose Multiplex Thinking, a stochastic soft reasoning mechanism that, at each thinking step, samples K candidate tokens and aggregates their embeddings into a single continuous multiplex token. This preserves the vocabulary embedding prior and the sampling dynamics of standard discrete generation, while inducing a tractable probability distribution over multiplex rollouts. Consequently, multiplex trajectories can be directly optimized with on‑policy reinforcement learning (RL). Importantly, Multiplex Thinking is self‑adaptive: when the model is confident, the multiplex token is nearly discrete and behaves like standard CoT; when it is uncertain, it compactly represents multiple plausible next steps without increasing sequence length. Across challenging math reasoning benchmarks, Multiplex Thinking consistently outperforms strong discrete CoT and RL baselines from Pass@1 through Pass@1024, while producing shorter sequences. The code and checkpoints are available at https://github.com/GMLR‑Penn/Multiplex‑Thinking.
Authors:Naren Medarametla, Sreejon Mondal
Abstract:
Localization is a fundamental capability for autonomous robots, enabling them to operate effectively in dynamic environments. In Robocon 2025, accurate and reliable localization is crucial for improving shooting precision, avoiding collisions with other robots, and navigating the competition field efficiently. In this paper, we propose a hybrid localization algorithm that integrates classical techniques with learning based methods that rely solely on visual data from the court's floor to achieve self‑localization on the basketball field.
Authors:Yihan Hong, Huaiyuan Yao, Bolin Shen, Wanpeng Xu, Hua Wei, Yushun Dong
Abstract:
The LLM‑as‑a‑Judge paradigm promises scalable rubric‑based evaluation, yet aligning frozen black‑box models with human standards remains a challenge due to inherent generation stochasticity. We reframe judge alignment as a criteria transfer problem and isolate three recurrent failure modes: rubric instability caused by prompt sensitivity, unverifiable reasoning that lacks auditable evidence, and scale misalignment with human grading boundaries. To address these issues, we introduce RULERS (Rubric Unification, Locking, and Evidence‑anchored Robust Scoring), a compiler‑executor framework that transforms natural language rubrics into executable specifications. RULERS operates by compiling criteria into versioned immutable bundles, enforcing structured decoding with deterministic evidence verification, and applying lightweight Wasserstein‑based post‑hoc calibration, all without updating model parameters. Extensive experiments on essay and summarization benchmarks demonstrate that RULERS significantly outperforms representative baselines in human agreement, maintains strong stability against adversarial rubric perturbations, and enables smaller models to rival larger proprietary judges. Overall, our results suggest that reliable LLM judging requires executable rubrics, verifiable evidence, and calibrated scales rather than prompt phrasing alone. Code is available at https://github.com/LabRAI/Rulers.git.
Authors:Renyang Liu, Kangjie Chen, Han Qiu, Jie Zhang, Kwok-Yan Lam, Tianwei Zhang, See-Kiong Ng
Abstract:
Image generation models (IGMs), while capable of producing impressive and creative content, often memorize a wide range of undesirable concepts from their training data, leading to the reproduction of unsafe content such as NSFW imagery and copyrighted artistic styles. Such behaviors pose persistent safety and compliance risks in real‑world deployments and cannot be reliably mitigated by post‑hoc filtering, owing to the limited robustness of such mechanisms and a lack of fine‑grained semantic control. Recent unlearning methods seek to erase harmful concepts at the model level, which exhibit the limitations of requiring costly retraining, degrading the quality of benign generations, or failing to withstand prompt paraphrasing and adversarial attacks. To address these challenges, we introduce SafeRedir, a lightweight inference‑time framework for robust unlearning via prompt embedding redirection. Without modifying the underlying IGMs, SafeRedir adaptively routes unsafe prompts toward safe semantic regions through token‑level interventions in the embedding space. The framework comprises two core components: a latent‑aware multi‑modal safety classifier for identifying unsafe generation trajectories, and a token‑level delta generator for precise semantic redirection, equipped with auxiliary predictors for token masking and adaptive scaling to localize and regulate the intervention. Empirical results across multiple representative unlearning tasks demonstrate that SafeRedir achieves effective unlearning capability, high semantic and perceptual preservation, robust image quality, and enhanced resistance to adversarial attacks. Furthermore, SafeRedir generalizes effectively across a variety of diffusion backbones and existing unlearned models, validating its plug‑and‑play compatibility and broad applicability. Code and data are available at https://github.com/ryliu68/SafeRedir.
Authors:Zishan Shu, Juntong Wu, Wei Yan, Xudong Liu, Hongyu Zhang, Chang Liu, Youdong Mao, Jie Chen
Abstract:
Vision modeling has advanced rapidly with Transformers, whose attention mechanisms capture visual dependencies but lack a principled account of how semantic information propagates spatially. We revisit this problem from a wave‑based perspective: feature maps are treated as spatial signals whose evolution over an internal propagation time (aligned with network depth) is governed by an underdamped wave equation. In this formulation, spatial frequency‑from low‑frequency global layout to high‑frequency edges and textures‑is modeled explicitly, and its interaction with propagation time is controlled rather than implicitly fixed. We derive a closed‑form, frequency‑time decoupled solution and implement it as the Wave Propagation Operator (WPO), a lightweight module that models global interactions in O(N log N) time‑far lower than attention. Building on WPO, we propose a family of WaveFormer models as drop‑in replacements for standard ViTs and CNNs, achieving competitive accuracy across image classification, object detection, and semantic segmentation, while delivering up to 1.6x higher throughput and 30% fewer FLOPs than attention‑based alternatives. Furthermore, our results demonstrate that wave propagation introduces a complementary modeling bias to heat‑based methods, effectively capturing both global coherence and high‑frequency details essential for rich visual semantics. Codes are available at: https://github.com/ZishanShu/WaveFormer.
Authors:Sushant Gautam, Cise Midoglu, Vajira Thambawita, Michael A. Riegler, Pål Halvorsen
Abstract:
Hallucinations in video‑capable vision‑language models (Video‑VLMs) remain frequent and high‑confidence, while existing uncertainty metrics often fail to align with correctness. We introduce VideoHEDGE, a modular framework for hallucination detection in video question answering that extends entropy‑based reliability estimation from images to temporally structured inputs. Given a video‑question pair, VideoHEDGE draws a baseline answer and multiple high‑temperature generations from both clean clips and photometrically and spatiotemporally perturbed variants, then clusters the resulting textual outputs into semantic hypotheses using either Natural Language Inference (NLI)‑based or embedding‑based methods. Cluster‑level probability masses yield three reliability scores: Semantic Entropy (SE), RadFlag, and Vision‑Amplified Semantic Entropy (VASE). We evaluate VideoHEDGE on the SoccerChat benchmark using an LLM‑as‑a‑judge to obtain binary hallucination labels. Across three 7B Video‑VLMs (Qwen2‑VL, Qwen2.5‑VL, and a SoccerChat‑finetuned model), VASE consistently achieves the highest ROC‑AUC, especially at larger distortion budgets, while SE and RadFlag often operate near chance. We further show that embedding‑based clustering matches NLI‑based clustering in detection performance at substantially lower computational cost, and that domain fine‑tuning reduces hallucination frequency but yields only modest improvements in calibration. The hedge‑bench PyPI library enables reproducible and extensible benchmarking, with full code and experimental resources available at https://github.com/Simula/HEDGE#videohedge .
Authors:Jinkwan Jang, Hyunbin Jin, Hyungjin Park, Kyubyung Chae, Taesup Kim
Abstract:
Time series forecasting is critical to real‑world decision making, yet most existing approaches remain unimodal and rely on extrapolating historical patterns. While recent progress in large language models (LLMs) highlights the potential for multimodal forecasting, existing benchmarks largely provide retrospective or misaligned raw context, making it unclear whether such models meaningfully leverage textual inputs. In practice, human experts incorporate what‑if scenarios with historical evidence, often producing distinct forecasts from the same observations under different scenarios. Inspired by this, we introduce What If TSF (WIT), a multimodal forecasting benchmark designed to evaluate whether models can condition their forecasts on contextual text, especially future scenarios. By providing expert‑crafted plausible or counterfactual scenarios, WIT offers a rigorous testbed for scenario‑guided multimodal forecasting. The benchmark is available at https://github.com/jinkwan1115/WhatIfTSF.
Authors:Abdelaziz Bounhar, Rania Hossam Elmohamady Elbadry, Hadi Abdine, Preslav Nakov, Michalis Vazirgiannis, Guokan Shang
Abstract:
Steering Large Language Models (LLMs) through activation interventions has emerged as a lightweight alternative to fine‑tuning for alignment and personalization. Recent work on Bi‑directional Preference Optimization (BiPO) shows that dense steering vectors can be learned directly from preference data in a Direct Preference Optimization (DPO) fashion, enabling control over truthfulness, hallucinations, and safety behaviors. However, dense steering vectors often entangle multiple latent factors due to neuron multi‑semanticity, limiting their effectiveness and stability in fine‑grained settings such as cultural alignment, where closely related values and behaviors (e.g., among Middle Eastern cultures) must be distinguished. In this paper, we propose Yet another Policy Optimization (YaPO), a reference‑free method that learns sparse steering vectors in the latent space of a Sparse Autoencoder (SAE). By optimizing sparse codes, YaPO produces disentangled, interpretable, and efficient steering directions. Empirically, we show that YaPO converges faster, achieves stronger performance, and exhibits improved training stability compared to dense steering baselines. Beyond cultural alignment, YaPO generalizes to a range of alignment‑related behaviors, including hallucination, wealth‑seeking, jailbreak, and power‑seeking. Importantly, YaPO preserves general knowledge, with no measurable degradation on MMLU. Overall, our results show that YaPO provides a general recipe for efficient, stable, and fine‑grained alignment of LLMs, with broad applications to controllability and domain adaptation. The associated code and data are publicly available\footnotehttps://github.com/MBZUAI‑Paris/YaPO.
Authors:Sunzhu Li, Jiale Zhao, Miteto Wei, Huimin Ren, Yang Zhou, Jingwen Yang, Shunyu Liu, Kaike Zhang, Wei Chen
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has driven substantial progress in reasoning‑intensive domains like mathematics. However, optimizing open‑ended generation remains challenging due to the lack of ground truth. While rubric‑based evaluation offers a structured proxy for verification, existing methods suffer from scalability bottlenecks and coarse criteria, resulting in a supervision ceiling effect. To address this, we propose an automated Coarse‑to‑Fine Rubric Generation framework. By synergizing principle‑guided synthesis, multi‑model aggregation, and difficulty evolution, our approach produces comprehensive and highly discriminative criteria capable of capturing the subtle nuances. Based on this framework, we introduce RubricHub, a large‑scale (~110k) and multi‑domain dataset. We validate its utility through a two‑stage post‑training pipeline comprising Rubric‑based Rejection Sampling Fine‑Tuning (RuFT) and Reinforcement Learning (RuRL). Experimental results demonstrate that RubricHub unlocks significant performance gains: our post‑trained Qwen3‑14B achieves state‑of‑the‑art (SOTA) results on HealthBench (69.3), surpassing proprietary frontier models such as GPT‑5. Our code is available at \hrefhttps://github.com/teqkilla/RubricHub this URL.
Authors:Yan Zhu, Te Luo, Pei-Yao Fu, Zhen Zhang, Zi-Long Wang, Yi-Fan Qu, Zi-Han Geng, Jia-Qi Xu, Lu Yao, Li-Yun Ma, Wei Su, Wei-Feng Chen, Quan-Lin Li, Shuo Wang, Ping-Hong Zhou
Abstract:
Multimodal Large Language Models (MLLMs) show promise in gastroenterology, yet their performance against comprehensive clinical workflows and human benchmarks remains unverified. To systematically evaluate state‑of‑the‑art MLLMs across a panoramic gastrointestinal endoscopy workflow and determine their clinical utility compared with human endoscopists. We constructed GI‑Bench, a benchmark encompassing 20 fine‑grained lesion categories. Twelve MLLMs were evaluated across a five‑stage clinical workflow: anatomical localization, lesion identification, diagnosis, findings description, and management. Model performance was benchmarked against three junior endoscopists and three residency trainees using Macro‑F1, mean Intersection‑over‑Union (mIoU), and multi‑dimensional Likert scale. Gemini‑3‑Pro achieved state‑of‑the‑art performance. In diagnostic reasoning, top‑tier models (Macro‑F1 0.641) outperformed trainees (0.492) and rivaled junior endoscopists (0.727; p>0.05). However, a critical "spatial grounding bottleneck" persisted; human lesion localization (mIoU >0.506) significantly outperformed the best model (0.345; p<0.05). Furthermore, qualitative analysis revealed a "fluency‑accuracy paradox": models generated reports with superior linguistic readability compared with humans (p<0.05) but exhibited significantly lower factual correctness (p<0.05) due to "over‑interpretation" and hallucination of visual features. GI‑Bench maintains a dynamic leaderboard that tracks the evolving performance of MLLMs in clinical endoscopy. The current rankings and benchmark results are available at https://roterdl.github.io/GIBench/.
Authors:Anh H. Vo, Tae-Seok Kim, Hulin Jin, Soo-Mi Choi, Yong-Guk Kim
Abstract:
A 3D avatar typically has one of six cardinal facial expressions. To simulate realistic emotional variation, we should be able to render a facial transition between two arbitrary expressions. This study presents a new framework for instruction‑driven facial expression generation that produces a 3D face and, starting from an image of the face, transforms the facial expression from one designated facial expression to another. The Instruction‑driven Facial Expression Decomposer (IFED) module is introduced to facilitate multimodal data learning and capture the correlation between textual descriptions and facial expression features. Subsequently, we propose the Instruction to Facial Expression Transition (I2FET) method, which leverages IFED and a vertex reconstruction loss function to refine the semantic comprehension of latent vectors, thus generating a facial expression sequence according to the given instruction. Lastly, we present the Facial Expression Transition model to generate smooth transitions between facial expressions. Extensive evaluation suggests that the proposed model outperforms state‑of‑the‑art methods on the CK+ and CelebV‑HQ datasets. The results show that our framework can generate facial expression trajectories according to text instruction. Considering that text prompts allow us to make diverse descriptions of human emotional states, the repertoire of facial expressions and the transitions between them can be expanded greatly. We expect our framework to find various practical applications More information about our project can be found at https://vohoanganh.github.io/tg3dfet/
Authors:Daocheng Fu, Jianbiao Mei, Rong Wu, Xuemeng Yang, Jia Xu, Ding Wang, Pinlong Cai, Yong Liu, Licheng Wen, Botian Shi
Abstract:
The rapid evolution of Multi‑modal Large Language Models (MLLMs) has advanced workflow automation; however, existing research mainly targets performance upper bounds in static environments, overlooking robustness for stochastic real‑world deployment. We identify three key challenges: dynamic task scheduling, active exploration under uncertainty, and continuous learning from experience. To bridge this gap, we introduce \method, a dynamic evaluation environment that simulates a "trainee" agent continuously exploring a novel setting. Unlike traditional benchmarks, \method evaluates agents along three dimensions: (1) context‑aware scheduling for streaming tasks with varying priorities; (2) prudent information acquisition to reduce hallucination via active exploration; and (3) continuous evolution by distilling generalized strategies from rule‑based, dynamically generated tasks. Experiments show that cutting‑edge agents have significant deficiencies in dynamic environments, especially in active exploration and continual learning. Our work establishes a framework for assessing agent reliability, shifting evaluation from static tests to realistic, production‑oriented scenarios. Our codes are available at https://github.com/KnowledgeXLab/EvoEnv
Authors:Ashutosh Hathidara, Julien Yu, Vaishali Senthil, Sebastian Schreiber, Anil Babu Ankisettipalli
Abstract:
Large language models (LLMs) are increasingly used as human simulators, both for evaluating conversational systems and for generating fine‑tuning data. However, naive "act‑as‑a‑user" prompting often yields verbose, unrealistic utterances, underscoring the need for principled evaluation of so‑called user proxy agents. We present MIRRORBENCH, a reproducible, extensible benchmarking framework that evaluates user proxies solely on their ability to produce human‑like user utterances across diverse conversational tasks, explicitly decoupled from downstream task success. MIRRORBENCH features a modular execution engine with typed interfaces, metadata‑driven registries, multi‑backend support, caching, and robust observability. The system supports pluggable user proxies, datasets, tasks, and metrics, enabling researchers to evaluate arbitrary simulators under a uniform, variance‑aware harness. We include three lexical‑diversity metrics (MATTR, YULE'S K, and HD‑D) and three LLM‑judge‑based metrics (GTEval, Pairwise Indistinguishability, and Rubric‑and‑Reason). Across four open datasets, MIRRORBENCH yields variance‑aware results and reveals systematic gaps between user proxies and real human users. The framework is open source and includes a simple command‑line interface for running experiments, managing configurations and caching, and generating reports. The framework can be accessed at https://github.com/SAP/mirrorbench.
Authors:Zheng Zhou, Isabella McEvoy, Camilo E. Valderrama
Abstract:
Subject‑independent EEG emotion recognition is challenged by pronounced inter‑subject variability and the difficulty of learning robust representations from short, noisy recordings. To address this, we propose a fusion framework that integrates (i) local, channel‑wise descriptors and (ii) global, trial‑level descriptors, improving cross‑subject generalization on the SEED‑VII dataset. Local representations are formed per channel by concatenating differential entropy with graph‑theoretic features, while global representations summarize time‑domain, spectral, and complexity characteristics at the trial level. These representations are fused in a dual‑branch transformer with attention‑based fusion and domain‑adversarial regularization, with samples filtered by an intensity threshold. Experiments under a leave‑one‑subject‑out protocol demonstrate that the proposed method consistently outperforms single‑view and classical baselines, achieving approximately 40% mean accuracy in 7‑class subject‑independent emotion recognition. The code has been released at https://github.com/Danielz‑z/LGF‑EEG‑Emotion.
Authors:Hongjin Qian, Zhao Cao, Zheng Liu
Abstract:
Complex reasoning in tool‑augmented agent frameworks is inherently long‑horizon, causing reasoning traces and transient tool artifacts to accumulate and strain the bounded working context of large language models. Without explicit memory mechanisms, such accumulation disrupts logical continuity and undermines task alignment. This positions memory not as an auxiliary efficiency concern, but as a core component for sustaining coherent, goal‑directed reasoning over long horizons. We propose MemoBrain, an executive memory model for tool‑augmented agents that constructs a dependency‑aware memory over reasoning steps, capturing salient intermediate states and their logical relations. Operating as a co‑pilot alongside the reasoning agent, MemoBrain organizes reasoning progress without blocking execution and actively manages the working context. Specifically, it prunes invalid steps, folds completed sub‑trajectories, and preserves a compact, high‑salience reasoning backbone under a fixed context budget. Together, these mechanisms enable explicit cognitive control over reasoning trajectories rather than passive context accumulation. We evaluate MemoBrain on challenging long‑horizon benchmarks, including GAIA, WebWalker, and BrowseComp‑Plus, demonstrating consistent improvements over strong baselines.
Authors:Shailesh Rana
Abstract:
Negative constraints (instructions of the form "do not use word X") represent a fundamental test of instruction‑following capability in large language models. Despite their apparent simplicity, these constraints fail with striking regularity, and the conditions governing failure have remained poorly understood. This paper presents the first comprehensive mechanistic investigation of negative instruction failure. We introduce semantic pressure, a quantitative measure of the model's intrinsic probability of generating the forbidden token, and demonstrate that violation probability follows a tight logistic relationship with pressure (p=σ(‑2.40+2.27\cdot P_0); n=40,000 samples; bootstrap 95% CI for slope: [2.21,,2.33]). Through layer‑wise analysis using the logit lens technique, we establish that the suppression signal induced by negative instructions is present but systematically weaker in failures: the instruction reduces target probability by only 5.2 percentage points in failures versus 22.8 points in successes ‑‑ a 4.4× asymmetry. We trace this asymmetry to two mechanistically distinct failure modes. In priming failure (87.5% of violations), the instruction's explicit mention of the forbidden word paradoxically activates rather than suppresses the target representation. In override failure (12.5%), late‑layer feed‑forward networks generate contributions of +0.39 toward the target probability ‑‑ nearly 4× larger than in successes ‑‑ overwhelming earlier suppression signals. Activation patching confirms that layers 23‑‑27 are causally responsible: replacing these layers' activations flips the sign of constraint effects. These findings reveal a fundamental tension in negative constraint design: the very act of naming a forbidden word primes the model to produce it.
Authors:Xin Dai, Pengcheng Huang, Zhenghao Liu, Shuo Wang, Yukun Yan, Chaojun Xiao, Yu Gu, Ge Yu, Maosong Sun
Abstract:
Masked diffusion models (MDMs), which leverage bidirectional attention and a denoising process, are narrowing the performance gap with autoregressive models (ARMs). However, their internal attention mechanisms remain under‑explored. This paper investigates the attention behaviors in MDMs, revealing the phenomenon of Attention Floating. Unlike ARMs, where attention converges to a fixed sink, MDMs exhibit dynamic, dispersed attention anchors that shift across denoising steps and layers. Further analysis reveals its Shallow Structure‑Aware, Deep Content‑Focused attention mechanism: shallow layers utilize floating tokens to build a global structural framework, while deeper layers allocate more capability toward capturing semantic content. Empirically, this distinctive attention pattern provides a mechanistic explanation for the strong in‑context learning capabilities of MDMs, allowing them to double the performance compared to ARMs in knowledge‑intensive tasks. All codes and datasets are available at https://github.com/NEUIR/Attention‑Floating.
Authors:Hong Huang, Decheng Wu, Qiangqiang Hu, Guanghua Yu, Jinhai Yang, Jianchen Zhu, Xue Liu, Dapeng Wu
Abstract:
The deployment of Large Language Models (LLMs) on resource‑constrained edge devices is increasingly hindered by prohibitive memory and computational requirements. While ternary quantization offers a compelling solution by reducing weights to ‑1, 0, +1, current implementations suffer from a fundamental misalignment with commodity hardware. Most existing methods must choose between 2‑bit aligned packing, which incurs significant bit wastage, or 1.67‑bit irregular packing, which degrades inference speed. To resolve this tension, we propose Sherry, a hardware‑efficient ternary quantization framework. Sherry introduces a 3:4 fine‑grained sparsity that achieves a regularized 1.25‑bit width by packing blocks of four weights into five bits, restoring power‑of‑two alignment. Furthermore, we identify weight trapping issue in sparse ternary training, which leads to representational collapse. To address this, Sherry introduces Arenas, an annealing residual synapse mechanism that maintains representational diversity during training. Empirical evaluations on LLaMA‑3.2 across five benchmarks demonstrate that Sherry matches state‑of‑the‑art ternary performance while significantly reducing model size. Notably, on an Intel i7‑14700HX CPU, our 1B model achieves zero accuracy loss compared to SOTA baselines while providing 25% bit savings and 10% speed up. The code is available at https://github.com/Tencent/AngelSlim .
Authors:Simon Jegou, Maximilian Jeblick
Abstract:
Growing context lengths in transformer‑based language models have made the key‑value (KV) cache a critical inference bottleneck. While many KV cache pruning methods have been proposed, they have not yet been adopted in major inference engines due to speed‑‑accuracy trade‑offs. We introduce KVzap, a fast, input‑adaptive approximation of KVzip that works in both prefilling and decoding. On Qwen3‑8B, Llama‑3.1‑8B‑Instruct, and Qwen3‑32B across long‑context and reasoning tasks, KVzap achieves 2‑‑4× KV cache compression with negligible accuracy loss and achieves state‑of‑the‑art performance on the KVpress leaderboard. Code and models are available at https://github.com/NVIDIA/kvpress.
Authors:Hao-Xiang Xu, Jun-Yu Ma, Ziqi Peng, Yuhao Sun, Zhen-Hua Ling, Jia-Chen Gu
Abstract:
Knowledge editing aims to efficiently modify the internal knowledge of large language models (LLMs) without compromising their other capabilities. The prevailing editing paradigm, which appends an update matrix to the original parameter matrix, has been shown by some studies to damage key numerical stability indicators (such as condition number and norm), thereby reducing editing performance and general abilities, especially in sequential editing scenario. Although subsequent methods have made some improvements, they remain within the additive framework and have not fundamentally addressed this limitation. To solve this problem, we analyze it from both statistical and mathematical perspectives and conclude that multiplying the original matrix by an orthogonal matrix does not change the numerical stability of the matrix. Inspired by this, different from the previous additive editing paradigm, a multiplicative editing paradigm termed Multiplicative Orthogonal Sequential Editing (MOSE) is proposed. Specifically, we first derive the matrix update in the multiplicative form, the new knowledge is then incorporated into an orthogonal matrix, which is multiplied by the original parameter matrix. In this way, the numerical stability of the edited matrix is unchanged, thereby maintaining editing performance and general abilities. We compared MOSE with several current knowledge editing methods, systematically evaluating their impact on both editing performance and the general abilities across three different LLMs. Experimental results show that MOSE effectively limits deviations in the edited parameter matrix and maintains its numerical stability. Compared to current methods, MOSE achieves a 12.08% improvement in sequential editing performance, while retaining 95.73% of general abilities across downstream tasks. The code is available at https://github.com/famoustourist/MOSE.
Authors:Nina Peire, Yupei Li, Björn Schuller
Abstract:
Generalisation to unseen subjects in EEG‑based emotion classification remains a challenge due to high inter‑and intra‑subject variability. Continual learning (CL) poses a promising solution by learning from a sequence of tasks while mitigating catastrophic forgetting. Regularisation‑based CL approaches, such as Elastic Weight Consolidation (EWC), Synaptic Intelligence (SI), and Memory Aware Synapses (MAS), are commonly used as baselines in EEG‑based CL studies, yet their suitability for this problem remains underexplored. This study theoretically and empirically finds that regularisation‑based CL methods show limited performance for EEG‑based emotion classification on the DREAMER and SEED datasets. We identify a fundamental misalignment in the stability‑plasticity trade‑off, where regularisation‑based methods prioritise mitigating catastrophic forgetting (backward transfer) over adapting to new subjects (forward transfer). We investigate this limitation under subject‑incremental sequences and observe that: (1) the heuristics for estimating parameter importance become less reliable under noisy data and covariate shift, (2) gradients on parameters deemed important by these heuristics often interfere with gradient updates required for new subjects, moving optimisation away from the minimum, (3) importance values accumulated across tasks over‑constrain the model, and (4) performance is sensitive to subject order. Forward transfer showed no statistically significant improvement over sequential fine‑tuning (p > 0.05 across approaches and datasets). The high variability of EEG signals means past subjects provide limited value to future subjects. Regularisation‑based continual learning approaches are therefore limited for robust generalisation to unseen subjects in EEG‑based emotion classification.
Authors:Zhi Yang, Runguo Li, Qiqi Qiang, Jiashun Wang, Fangqi Lou, Mengping Li, Dongpo Cheng, Rui Xu, Heng Lian, Shuo Zhang, Xiaolong Liang, Xiaoming Huang, Zheng Wei, Zhaowei Liu, Xin Guo, Huacan Wang, Ronghao Chen, Liwen Zhang
Abstract:
Financial agents powered by large language models (LLMs) are increasingly deployed for investment analysis, risk assessment, and automated decision‑making, where their abilities to plan, invoke tools, and manipulate mutable state introduce new security risks in high‑stakes and highly regulated financial environments. However, existing safety evaluations largely focus on language‑model‑level content compliance or abstract agent settings, failing to capture execution‑grounded risks arising from real operational workflows and state‑changing actions. To bridge this gap, we propose FinVault, the first execution‑grounded security benchmark for financial agents, comprising 31 regulatory case‑driven sandbox scenarios with state‑writable databases and explicit compliance constraints, together with 107 real‑world vulnerabilities and 963 test cases that systematically cover prompt injection, jailbreaking, financially adapted attacks, as well as benign inputs for false‑positive evaluation. Experimental results reveal that existing defense mechanisms remain ineffective in realistic financial agent settings, with average attack success rates (ASR) still reaching up to 50.0% on state‑of‑the‑art models and remaining non‑negligible even for the most robust systems (ASR 6.7%), highlighting the limited transferability of current safety designs and the need for stronger financial‑specific defenses. Our code can be found at https://github.com/aifinlab/FinVault.
Authors:Kewei Zhang, Ye Huang, Yufan Deng, Jincheng Yu, Junsong Chen, Huan Ling, Enze Xie, Daquan Zhou
Abstract:
While the Transformer architecture dominates many fields, its quadratic self‑attention complexity hinders its use in large‑scale applications. Linear attention offers an efficient alternative, but its direct application often degrades performance, with existing fixes typically re‑introducing computational overhead through extra modules (e.g., depthwise separable convolution) that defeat the original purpose. In this work, we identify a key failure mode in these methods: global context collapse, where the model loses representational diversity. To address this, we propose Multi‑Head Linear Attention (MHLA), which preserves this diversity by computing attention within divided heads along the token dimension. We prove that MHLA maintains linear complexity while recovering much of the expressive power of softmax attention, and verify its effectiveness across multiple domains, achieving a 3.6% improvement on ImageNet classification, a 6.3% gain on NLP, a 12.6% improvement on image generation, and a 41% enhancement on video generation under the same time complexity.
Authors:Wen Guo
Abstract:
We introduce DT‑ICU, a multimodal digital twin framework for continuous risk estimation in intensive care. DT‑ICU integrates variable‑length clinical time series with static patient information in a unified multitask architecture, enabling predictions to be updated as new observations accumulate over the ICU stay. We evaluate DT‑ICU on the large, publicly available MIMIC‑IV dataset, where it consistently outperforms established baseline models under different evaluation settings. Our test‑length analysis shows that meaningful discrimination is achieved shortly after admission, while longer observation windows further improve the ranking of high‑risk patients in highly imbalanced cohorts. To examine how the model leverages heterogeneous data sources, we perform systematic modality ablations, revealing that the model learnt a reasonable structured reliance on interventions, physiological response observations, and contextual information. These analyses provide interpretable insights into how multimodal signals are combined and how trade‑offs between sensitivity and precision emerge. Together, these results demonstrate that DT‑ICU delivers accurate, temporally robust, and interpretable predictions, supporting its potential as a practical digital twin framework for continuous patient monitoring in critical care. The source code and trained model weights for DT‑ICU are publicly available at https://github.com/GUO‑W/DT‑ICU‑release.
Authors:Shaoting Zhu, Ziwen Zhuang, Mengjie Zhao, Kun-Ying Lee, Hang Zhao
Abstract:
Achieving robust humanoid hiking in complex, unstructured environments requires transitioning from reactive proprioception to proactive perception. However, integrating exteroception remains a significant challenge: mapping‑based methods suffer from state estimation drift; for instance, LiDAR‑based methods do not handle torso jitter well. Existing end‑to‑end approaches often struggle with scalability and training complexity; specifically, some previous works using virtual obstacles are implemented case‑by‑case. In this work, we present Hiking in the Wild, a scalable, end‑to‑end parkour perceptive framework designed for robust humanoid hiking. To ensure safety and training stability, we introduce two key mechanisms: a foothold safety mechanism combining scalable Terrain Edge Detection with Foot Volume Points to prevent catastrophic slippage on edges, and a Flat Patch Sampling strategy that mitigates reward hacking by generating feasible navigation targets. Our approach utilizes a single‑stage reinforcement learning scheme, mapping raw depth inputs and proprioception directly to joint actions, without relying on external state estimation. Extensive field experiments on a full‑size humanoid demonstrate that our policy enables robust traversal of complex terrains at speeds up to 2.5 m/s. The training and deployment code is open‑sourced to facilitate reproducible research and deployment on real robots with minimal hardware modifications.
Authors:Ziwen Zhuang, Shaoting Zhu, Mengjie Zhao, Hang Zhao
Abstract:
Current approaches to humanoid control generally fall into two paradigms: perceptive locomotion, which handles terrain well but is limited to pedal gaits, and general motion tracking, which reproduces complex skills but ignores environmental capabilities. This work unites these paradigms to achieve perceptive general motion control. We present a framework where exteroceptive sensing is integrated into whole‑body motion tracking, permitting a humanoid to perform highly dynamic, non‑locomotion tasks on uneven terrain. By training a single policy to perform multiple distinct motions across varied terrestrial features, we demonstrate the non‑trivial benefit of integrating perception into the control loop. Our results show that this framework enables robust, highly dynamic multi‑contact motions, such as vaulting and dive‑rolling, on unstructured terrain, significantly expanding the robot's traversability beyond simple walking or running. https://project‑instinct.github.io/deep‑whole‑body‑parkour
Authors:Rei Taniguchi, Yuyang Dong, Makoto Onizuka, Chuan Xiao
Abstract:
Due to the prevalence of large language models (LLMs), key‑value (KV) cache reduction for LLM inference has received remarkable attention. Among numerous works that have been proposed in recent years, layer‑wise token pruning approaches, which select a subset of tokens at particular layers to retain in KV cache and prune others, are one of the most popular schemes. They primarily adopt a set of pre‑defined layers, at which tokens are selected. Such design is inflexible in the sense that the accuracy significantly varies across tasks and deteriorates in harder tasks such as KV retrieval. In this paper, we propose ASL, a training‑free method that adaptively chooses the selection layer for KV cache reduction, exploiting the variance of token ranks ordered by attention score. The proposed method balances the performance across different tasks while meeting the user‑specified KV budget requirement. ASL operates during the prefilling stage and can be jointly used with existing KV cache reduction methods such as SnapKV to optimize the decoding stage. By evaluations on the InfiniteBench, RULER, and NIAH benchmarks, we show that equipped with one‑shot token selection, where tokens are selected at a layer and propagated to deeper layers, ASL outperforms state‑of‑the‑art layer‑wise token selection methods in accuracy while maintaining decoding speed and KV cache reduction.
Authors:Jiaxuan Lu, Ziyu Kong, Yemin Wang, Rong Fu, Haiyuan Wan, Cheng Yang, Wenjie Lou, Haoran Sun, Lilong Wang, Yankai Jiang, Xiaosong Wang, Xiao Sun, Dongzhan Zhou
Abstract:
The central challenge of AI for Science is not reasoning alone, but the ability to create computational methods in an open‑ended scientific world. Existing LLM‑based agents rely on static, pre‑defined tool libraries, a paradigm that fundamentally fails in scientific domains where tools are sparse, heterogeneous, and intrinsically incomplete. In this paper, we propose Test‑Time Tool Evolution (TTE), a new paradigm that enables agents to synthesize, verify, and evolve executable tools during inference. By transforming tools from fixed resources into problem‑driven artifacts, TTE overcomes the rigidity and long‑tail limitations of static tool libraries. To facilitate rigorous evaluation, we introduce SciEvo, a benchmark comprising 1,590 scientific reasoning tasks supported by 925 automatically evolved tools. Extensive experiments show that TTE achieves state‑of‑the‑art performance in both accuracy and tool efficiency, while enabling effective cross‑domain adaptation of computational tools. The code and benchmark have been released at https://github.com/lujiaxuan0520/Test‑Time‑Tool‑Evol.
Authors:Yu-Yang Qian, Junda Su, Lanxiang Hu, Peiyuan Zhang, Zhijie Deng, Peng Zhao, Hao Zhang
Abstract:
Diffusion large language models (dLLMs) offer capabilities beyond those of autoregressive (AR) LLMs, such as parallel decoding and random‑order generation. However, realizing these benefits in practice is non‑trivial, as dLLMs inherently face an accuracy‑parallelism trade‑off. Despite increasing interest, existing methods typically focus on only one‑side of the coin, targeting either efficiency or performance. To address this limitation, we propose d3LLM (Pseudo‑Distilled Diffusion Large Language Model), striking a balance between accuracy and parallelism: (i) during training, we introduce pseudo‑trajectory distillation to teach the model which tokens can be decoded confidently at early steps, thereby improving parallelism; (ii) during inference, we employ entropy‑based multi‑block decoding with a KV‑cache refresh mechanism to achieve high parallelism while maintaining accuracy. To better evaluate dLLMs, we also introduce AUP (Accuracy Under Parallelism), a new metric that jointly measures accuracy and parallelism. Experiments demonstrate that our d3LLM achieves up to 10× speedup over vanilla LLaDA/Dream and 5× speedup over AR models without much accuracy drop. Our code is available at https://github.com/hao‑ai‑lab/d3LLM.
Authors:Zihan Ma, Zhikai Zhao, Chuanbo Hua, Federico Berto, Jinkyoo Park
Abstract:
Optimizing LLM‑based agentic workflows is challenging for scaling AI capabilities. Current methods rely on coarse, end‑to‑end evaluation signals and lack fine‑grained signals on where to refine, often resulting in inefficient or low‑impact modifications. To address these limitations, we propose \our, an Evaluation‑Judge‑Optimization‑Update pipeline. We incorporate reusable, configurable logic blocks into agentic workflows to capture fundamental forms of logic. On top of this abstraction, we design a dedicated Judge module that inspects execution traces ‑‑ particularly failed runs ‑‑ and assigns rank‑based responsibility scores to problematic blocks. These fine‑grained diagnostic signals are then leveraged by an LLM‑based optimizer, which focuses modifications on the most problematic block in the workflow. Our approach improves sample efficiency, enhances interpretability through block‑level diagnostics, and provides a scalable foundation for automating increasingly complex agentic workflows. We evaluate \our on mathematical reasoning and code generation benchmarks, where \our achieves superior performance and efficiency compared to existing methods. The source code is publicly available at https://github.com/ma‑zihan/JudgeFlow.
Authors:Haoqian Meng, Yilun Luo, Yafei Zhao, Wenyuan Liu, Peng Zhang, Xindian Ma
Abstract:
The emergence of fine‑grained numerical formats like NVFP4 presents new opportunities for efficient Large Language Model (LLM) inference. However, it is difficult to adapt existing Post‑Training Quantization (PTQ) strategies to these formats: rotation‑based methods compromise fine‑grained block isolation; smoothing techniques struggle with significant 4‑bit quantization errors; and mixed‑precision approaches often conflict with hardware constraints on unified‑precision computation. To address these challenges, we propose ARCQuant, a framework that boosts NVFP4 performance via Augmented Residual Channels. Distinct from methods that compromise block isolation or hardware uniformity, ARCQuant maintains a strictly unified NVFP4 format by augmenting the activation matrix with quantized residual channels. This design integrates the error compensation process directly into the matrix reduction dimension, enabling the use of standard, highly optimized GEMM kernels with minimal overhead. Theoretical analysis confirms that the worst‑case error bound of our dual‑stage NVFP4 quantization is comparable to that of standard 8‑bit formats such as MXFP8. Extensive experiments on LLaMA and Qwen models demonstrate that ARCQuant achieves state‑of‑the‑art accuracy, comparable to full‑precision baselines in perplexity and downstream tasks. Furthermore, deployment on RTX 5090 and RTX PRO 6000 GPUs confirms practical benefits, achieving up to 3x speedup over FP16. Our code is available at https://github.com/actypedef/ARCQuant .
Authors:Jiao Xu, Xin Chen, Lihe Zhang
Abstract:
In this paper, we present a new dynamic collaborative network for semi‑supervised 3D vessel segmentation, termed DiCo. Conventional mean teacher (MT) methods typically employ a static approach, where the roles of the teacher and student models are fixed. However, due to the complexity of 3D vessel data, the teacher model may not always outperform the student model, leading to cognitive biases that can limit performance. To address this issue, we propose a dynamic collaborative network that allows the two models to dynamically switch their teacher‑student roles. Additionally, we introduce a multi‑view integration module to capture various perspectives of the inputs, mirroring the way doctors conduct medical analysis. We also incorporate adversarial supervision to constrain the shape of the segmented vessels in unlabeled data. In this process, the 3D volume is projected into 2D views to mitigate the impact of label inconsistencies. Experiments demonstrate that our DiCo method sets new state‑of‑the‑art performance on three 3D vessel segmentation benchmarks. The code repository address is https://github.com/xujiaommcome/DiCo
Authors:Linhao Zhong, Linyu Wu, Bozhen Fang, Tianjian Feng, Chenchen Jing, Wen Wang, Jiaheng Zhang, Hao Chen, Chunhua Shen
Abstract:
Diffusion Language Models (DLMs) offer a promising alternative for language modeling by enabling parallel decoding through iterative refinement. However, most DLMs rely on hard binary masking and discrete token assignments, which hinder the revision of early decisions and underutilize intermediate probabilistic representations. In this paper, we propose EvoToken‑DLM, a novel diffusion‑based language modeling approach that replaces hard binary masks with evolving soft token distributions. EvoToken‑DLM enables a progressive transition from masked states to discrete outputs, supporting revisable decoding. To effectively support this evolution, we introduce continuous trajectory supervision, which aligns training objectives with iterative probabilistic updates. Extensive experiments across multiple benchmarks show that EvoToken‑DLM consistently achieves superior performance, outperforming strong diffusion‑based and masked DLM baselines. Project webpage: https://aim‑uofa.github.io/EvoTokenDLM.
Authors:Tu Hu, Ronghao Chen, Shuo Zhang, Jianghao Yin, Mou Xiao Feng, Jingping Liu, Shaolei Zhang, Wenqi Jiang, Yuqi Fang, Sen Hu, Huacan Wang, Yi Xu
Abstract:
Self‑evolution methods enhance code generation through iterative "generate‑verify‑refine" cycles, yet existing approaches suffer from low exploration efficiency, failing to discover solutions with superior complexity within limited budgets. This inefficiency stems from initialization bias trapping evolution in poor solution regions, uncontrolled stochastic operations lacking feedback guidance, and insufficient experience utilization across tasks. To address these bottlenecks, we propose Controlled Self‑Evolution (CSE), which consists of three key components. Diversified Planning Initialization generates structurally distinct algorithmic strategies for broad solution space coverage. Genetic Evolution replaces stochastic operations with feedback‑guided mechanisms, enabling targeted mutation and compositional crossover. Hierarchical Evolution Memory captures both successful and failed experiences at inter‑task and intra‑task levels. Experiments on EffiBench‑X demonstrate that CSE consistently outperforms all baselines across various LLM backbones. Furthermore, CSE achieves higher efficiency from early generations and maintains continuous improvement throughout evolution. Our code is publicly available at https://github.com/QuantaAlpha/EvoControl.
Authors:Zhuoka Feng, Kang Chen, Sihan Zhao, Kai Xiong, Yaoning Wang, Minshen Yu, Junjie Nian, Changyi Xiao, Yixin Cao, Yugang Jiang
Abstract:
Interactive large language model agents have advanced rapidly, but most remain specialized to a single environment and fail to adapt robustly to other environments. Model merging offers a training‑free alternative by integrating multiple experts into a single model. In this paper, we propose Agent‑Role Merging (ARM), an activation‑guided, role‑conditioned neuron transplantation method for model merging in LLM agents. ARM improves existing merging methods from static natural language tasks to multi‑turn agent scenarios, and over the generalization ability across various interactive environments. This is achieved with a well designed 3‑step framework: 1) constructing merged backbones, 2) selection based on its role‑conditioned activation analysis, and 3) neuron transplantation for fine‑grained refinements. Without gradient‑based optimization, ARM improves cross‑benchmark generalization while enjoying efficiency. Across diverse domains, the model obtained via ARM merging outperforms prior model merging methods and domain‑specific expert models, while demonstrating strong out‑of‑domain generalization.
Authors:Hanbin Wang, Jingwei Song, Jinpeng Li, Fei Mi, Lifeng Shang
Abstract:
Large reasoning models (LRMs) exhibit diverse high‑level reasoning patterns (e.g., direct solution, reflection‑and‑verification, and exploring multiple solutions), yet prevailing training recipes implicitly bias models toward a limited set of dominant patterns. Through a systematic analysis, we identify substantial accuracy variance across these patterns on mathematics and science benchmarks, revealing that a model's default reasoning pattern is often sub‑optimal for a given problem. To address this, we introduce Group Pattern Selection Optimization (GPSO), a reinforcement learning framework that extends GRPO by incorporating multi‑pattern rollouts, verifier‑guided optimal pattern selection per problem, and attention masking during optimization to prevent the leakage of explicit pattern suffixes into the learned policy. By exploring a portfolio of diverse reasoning strategies and optimizing the policy on the most effective ones, GPSO enables the model to internalize the mapping from problem characteristics to optimal reasoning patterns. Extensive experiments demonstrate that GPSO delivers consistent and substantial performance gains across various model backbones and benchmarks, effectively mitigating pattern sub‑optimality and fostering more robust, adaptable reasoning. All data and codes are available at https://github.com/wanghanbinpanda/GPSO.
Authors:Hao Li, Yiqun Zhang, Zhaoyan Guo, Chenxu Wang, Shengji Tang, Qiaosheng Zhang, Yang Chen, Biqing Qi, Peng Ye, Lei Bai, Zhen Wang, Shuyue Hu
Abstract:
Large language model (LLM) routing assigns each query to the most suitable model from an ensemble. We introduce LLMRouterBench, a large‑scale benchmark and unified framework for LLM routing. It comprises over 400K instances from 21 datasets and 33 models. Moreover, it provides comprehensive metrics for both performance‑oriented routing and performance‑cost trade‑off routing, and integrates 10 representative routing baselines. Using LLMRouterBench, we systematically re‑evaluate the field. While confirming strong model complementarity‑the central premise of LLM routing‑we find that many routing methods exhibit similar performance under unified evaluation, and several recent approaches, including commercial routers, fail to reliably outperform a simple baseline. Meanwhile, a substantial gap remains to the Oracle, driven primarily by persistent model‑recall failures. We further show that backbone embedding models have limited impact, that larger ensembles exhibit diminishing returns compared to careful model curation, and that the benchmark also enables latency‑aware analysis. All code and data are available at https://github.com/ynulihao/LLMRouterBench.
Authors:Ruiyi Ding, Yongxuan Lv, Xianhui Meng, Jiahe Song, Chao Wang, Chen Jiang, Yuan Cheng
Abstract:
Policy optimization for large language models often suffers from sparse reward signals in multi‑step reasoning tasks. Critic‑free methods like GRPO assign a single normalized outcome reward to all tokens, providing limited guidance for intermediate reasoning . While Process Reward Models (PRMs) offer dense feedback, they risk premature collapse when used alone, as early low‑reward tokens can drive policies toward truncated outputs. We introduce Process Relative Policy Optimization (PRPO), which combines outcome reliability with process‑level guidance in a critic‑free framework. PRPO segments reasoning sequences based on semantic clues, normalizes PRM scores into token‑level advantages, and aligns their distribution with outcome advantages through location‑parameter shift. On MATH500, PRPO improves Qwen2.5‑Math‑1.5B accuracy from 61.2% to 64.4% over GRPO using only eight rollouts and no value network, demonstrating efficient fine‑grained credit assignment within critic‑free optimization. Code is available at: https://github.com/SchumiDing/srpocode
Authors:Mingxiang Tao, Yu Tian, Wenxuan Tu, Yue Yang, Xue Yang, Xiangyan Tang
Abstract:
Federated learning (FL) addresses data privacy and silo issues in large language models (LLMs). Most prior work focuses on improving the training efficiency of federated LLMs. However, security in open environments is overlooked, particularly defenses against malicious clients. To investigate the safety of LLMs during FL, we conduct preliminary experiments to analyze potential attack surfaces and defensible characteristics from the perspective of Low‑Rank Adaptation (LoRA) weights. We find two key properties of FL: 1) LLMs are vulnerable to attacks from malicious clients in FL, and 2) LoRA weights exhibit distinct behavioral patterns that can be filtered through simple classifiers. Based on these properties, we propose Safe‑FedLLM, a probe‑based defense framework for federated LLMs, constructing defenses across three dimensions: Step‑Level, Client‑Level, and Shadow‑Level. The core concept of Safe‑FedLLM is to perform probe‑based discrimination on the LoRA weights locally trained by each client during FL, treating them as high‑dimensional behavioral features and using lightweight classification models to determine whether they possess malicious attributes. Extensive experiments demonstrate that Safe‑FedLLM effectively enhances the defense capability of federated LLMs without compromising performance on benign data. Notably, our method effectively suppresses malicious data impact without significant impact on training speed, and remains effective even with many malicious clients. Our code is available at: https://github.com/dmqx/Safe‑FedLLM.
Authors:Osama Yousuf, Andreu L. Glasmann, Martin Lueker-Boden, Sina Najmaei, Gina C. Adam
Abstract:
Emerging memory technologies have gained significant attention as a promising pathway to overcome the limitations of conventional computing architectures in deep learning applications. By enabling computation directly within memory, these technologies ‑ built on nanoscale devices with tunable and nonvolatile conductance ‑ offer the potential to drastically reduce energy consumption and latency compared to traditional von Neumann systems. This paper introduces XBTorch (short for CrossBarTorch), a novel simulation framework that integrates seamlessly with PyTorch and provides specialized tools for accurately and efficiently modeling crossbar‑based systems based on emerging memory technologies. Through detailed comparisons and case studies involving hardware‑aware training and inference, we demonstrate how XBTorch offers a unified interface for key research areas such as device‑level modeling, cross‑layer co‑design, and inference‑time fault tolerance. While exemplar studies utilize ferroelectric field‑effect transistor (FeFET) models, the framework remains technology‑agnostic ‑ supporting other emerging memories such as resistive RAM (ReRAM), as well as enabling user‑defined custom device models. The code is publicly available at: https://github.com/ADAM‑Lab‑GW/xbtorch
Authors:Sen Hu, Zhiyu Zhang, Yuxiang Wei, Xueran Han, Zhenheng Tang, Huacan Wang, Ronghao Chen
Abstract:
AI Clones aim to simulate an individual's thoughts and behaviors to enable long‑term, personalized interaction, placing stringent demands on memory systems to model experiences, emotions, and opinions over time. Existing memory benchmarks primarily rely on user‑agent conversational histories, which are temporally fragmented and insufficient for capturing continuous life trajectories. We introduce CloneMem, a benchmark for evaluating longterm memory in AI Clone scenarios grounded in non‑conversational digital traces, including diaries, social media posts, and emails, spanning one to three years. CloneMem adopts a hierarchical data construction framework to ensure longitudinal coherence and defines tasks that assess an agent's ability to track evolving personal states. Experiments show that current memory mechanisms struggle in this setting, highlighting open challenges for life‑grounded personalized AI. Code and dataset are available at https://github.com/AvatarMemory/CloneMemBench
Authors:Yixi Zhou, Fan Zhang, Yu Chen, Haipeng Zhang, Preslav Nakov, Zhuohan Xie
Abstract:
Financial question answering (QA) over long corporate filings requires evidence to satisfy strict constraints on entities, financial metrics, fiscal periods, and numeric values. However, existing LLM‑based rerankers primarily optimize semantic relevance, leading to unstable rankings and opaque decisions on long documents. We propose FinCards, a structured reranking framework that reframes financial evidence selection as constraint satisfaction under a finance‑aware schema. FinCards represents filing chunks and questions using aligned schema fields (entities, metrics, periods, and numeric spans), enabling deterministic field‑level matching. Evidence is selected via a multi‑stage tournament reranking with stability‑aware aggregation, producing auditable decision traces. Across two corporate filing QA benchmarks, FinCards substantially improves early‑rank retrieval over both lexical and LLM‑based reranking baselines, while reducing ranking variance, without requiring model fine‑tuning or unpredictable inference budgets. Our code is available at https://github.com/XanderZhou2022/FINCARDS.
Authors:Haonan Bian, Zhiyuan Yao, Sen Hu, Zishan Xu, Shaolei Zhang, Yifu Guo, Ziliang Yang, Xueran Han, Huacan Wang, Ronghao Chen
Abstract:
As Large Language Models (LLMs) evolve from static dialogue interfaces to autonomous general agents, effective memory is paramount to ensuring long‑term consistency. However, existing benchmarks primarily focus on casual conversation or task‑oriented dialogue, failing to capture "long‑term project‑oriented" interactions where agents must track evolving goals. To bridge this gap, we introduce RealMem, the first benchmark grounded in realistic project scenarios. RealMem comprises over 2,000 cross‑session dialogues across eleven scenarios, utilizing natural user queries for evaluation. We propose a synthesis pipeline that integrates Project Foundation Construction, Multi‑Agent Dialogue Generation, and Memory and Schedule Management to simulate the dynamic evolution of memory. Experiments reveal that current memory systems face significant challenges in managing the long‑term project states and dynamic context dependencies inherent in real‑world projects. Our code and datasets are available at [https://github.com/AvatarMemory/RealMemBench](https://github.com/AvatarMemory/RealMemBench).
Authors:Yuhang Su, Mei Wang, Yaoyao Zhong, Guozhang Li, Shixing Li, Yihan Feng, Hua Huang
Abstract:
While Multimodal Large Language Models (MLLMs) have achieved remarkable progress in visual understanding, they often struggle when faced with the unstructured and ambiguous nature of human‑generated sketches. This limitation is particularly pronounced in the underexplored task of visual grading, where models should not only solve a problem but also diagnose errors in hand‑drawn diagrams. Such diagnostic capabilities depend on complex structural, semantic, and metacognitive reasoning. To bridge this gap, we introduce SketchJudge, a novel benchmark tailored for evaluating MLLMs as graders of hand‑drawn STEM diagrams. SketchJudge encompasses 1,015 hand‑drawn student responses across four domains: geometry, physics, charts, and flowcharts, featuring diverse stylistic variations and distinct error types. Evaluations on SketchJudge demonstrate that even advanced MLLMs lag significantly behind humans, validating the benchmark's effectiveness in exposing the fragility of current vision‑language alignment in symbolic and noisy contexts. All data, code, and evaluation scripts are publicly available at https://github.com/yuhangsu82/SketchJudge.
Authors:Hengyu Liu, Tianyi Li, Haoyu Wang, Kristian Torp, Tiancheng Zhang, Yushuai Li, Christian S. Jensen
Abstract:
The Automatic Identification System provides critical information for maritime navigation and safety, yet its trajectories are often incomplete due to signal loss or deliberate tampering. Existing imputation methods emphasize trajectory recovery, paying limited attention to interpretability and failing to provide underlying knowledge that benefits downstream tasks such as anomaly detection and route planning. We propose knowledge‑driven interpretable vessel trajectory imputation (VISTA), the first trajectory imputation framework that offers interpretability while simultaneously providing underlying knowledge to support downstream analysis. Specifically, we first define underlying knowledge as a combination of Structured Data‑derived Knowledge (SDK) distilled from AIS data and Implicit LLM Knowledge acquired from large‑scale Internet corpora. Second, to manage and leverage the SDK effectively at scale, we develop a data‑knowledge‑data loop that employs a Structured Data‑derived Knowledge Graph for SDK extraction and knowledge‑driven trajectory imputation. Third, to efficiently process large‑scale AIS data, we introduce a workflow management layer that coordinates the end‑to‑end pipeline, enabling parallel knowledge extraction and trajectory imputation with anomaly handling and redundancy elimination. Experiments on two large AIS datasets show that VISTA is capable of state‑of‑the‑art imputation accuracy and computational efficiency, improving over state‑of‑the‑art baselines by 5%‑94% and reducing time cost by 51%‑93%, while producing interpretable knowledge cues that benefit downstream tasks. The source code and implementation details of VISTA are publicly available.
Authors:Yifei Chen, Guanting Dong, Zhicheng Dou
Abstract:
Large Language Models (LLMs) can extend their parameter knowledge limits by adopting the Tool‑Integrated Reasoning (TIR) paradigm. However, existing LLM‑based agent training framework often focuses on answers' accuracy, overlooking specific alignment for behavior patterns. Consequently, agent often exhibits ineffective actions during TIR tasks, such as redundant and insufficient tool calls. How to calibrate erroneous behavioral patterns when executing TIR tasks, thereby exploring effective trajectories, remains an open‑ended problem. In this paper, we propose ET‑Agent, a training framework for calibrating agent's tool‑use behavior through two synergistic perspectives: Self‑evolving Data Flywheel and Behavior Calibration Training. Specifically, we introduce a self‑evolutionary data flywheel to generate enhanced data, used to fine‑tune LLM to improve its exploration ability. Based on this, we implement an two‑phases behavior‑calibration training framework. It is designed to progressively calibrate erroneous behavioral patterns to optimal behaviors. Further in‑depth experiments confirm the superiority of \ourmodel across multiple dimensions, including correctness, efficiency, reasoning conciseness, and tool execution accuracy. Our ET‑Agent framework provides practical insights for research in the TIR field. Codes can be found in https://github.com/asilverlight/ET‑Agent
Authors:Ping Guo, Chao Li, Yinglan Feng, Chaoning Zhang
Abstract:
Designing effective control policies for autonomous systems remains a fundamental challenge, traditionally addressed through reinforcement learning or manual engineering. While reinforcement learning has achieved remarkable success, it often suffers from high sample complexity, reward shaping difficulties, and produces opaque neural network policies that are hard to interpret or verify. Manual design, on the other hand, requires substantial domain expertise and struggles to scale across diverse tasks. In this work, we demonstrate that LLM‑driven evolutionary search can effectively synthesize interpretable control policies in the form of executable code. By treating policy synthesis as a code evolution problem, we harness the LLM's prior knowledge of programming patterns and control heuristics while employing evolutionary search to explore the solution space systematically. We implement our approach using EvoToolkit, a framework that seamlessly integrates LLM‑driven evolution with customizable fitness evaluation. Our method iteratively evolves populations of candidate policy programs, evaluating them against task‑specific objectives and selecting superior individuals for reproduction. This process yields compact, human‑readable control policies that can be directly inspected, modified, and formally verified. This work highlights the potential of combining foundation models with evolutionary computation for synthesizing trustworthy control policies in autonomous systems. Code is available at https://github.com/pgg3/EvoControl.
Authors:Qingyu Liu, Yitao Zhang, Zhongjie Ba, Chao Shuai, Peng Cheng, Tianhang Zheng, Zhibo Wang
Abstract:
Protecting the copyright of user‑generated AI images is an emerging challenge as AIGC becomes pervasive in creative workflows. Existing watermarking methods (1) remain vulnerable to real‑world adversarial threats, often forced to trade off between defenses against spoofing and removal attacks; and (2) cannot support semantic‑level tamper localization. We introduce PAI, a training‑free inherent watermarking framework for AIGC copyright protection, plug‑and‑play with diffusion‑based AIGC services. PAI simultaneously provides three key functionalities: robust ownership verification, attack detection, and semantic‑level tampering localization. Unlike existing inherent watermark methods that only embed watermarks at noise initialization of diffusion models, we design a novel key‑conditioned deflection mechanism that subtly steers the denoising trajectory according to the user key. Such trajectory‑level coupling further strengthens the semantic entanglement of identity and content, thereby further enhancing robustness against real‑world threats. Moreover, we also provide a theoretical analysis proving that only the valid key can pass verification. Experiments across 12 attack methods show that PAI achieves 98.43% verification accuracy, improving over SOTA methods by 37.25% on average, and retains strong tampering localization performance even against advanced AIGC edits. Our code is available at https://github.com/QingyuLiu/PAI.
Authors:Sang T. Truong, Duc Q. Nguyen, Willie Neiswanger, Ryan-Rhys Griffiths, Stefano Ermon, Nick Haber, Sanmi Koyejo
Abstract:
Bayesian optimization (BO) is a common framework for optimizing black‑box functions, yet most existing methods assume static query costs and rely on myopic acquisition strategies. We introduce LookaHES, a nonmyopic BO framework designed for dynamic, history‑dependent cost environments, where evaluation costs vary with prior actions, such as travel distance in spatial tasks or edit distance in sequence design. LookaHES combines a multi‑step variant of H‑Entropy Search with pathwise sampling and neural policy optimization, enabling long‑horizon planning beyond twenty steps without the exponential complexity of existing nonmyopic methods. The key innovation is the integration of neural policies, including large language models, to effectively navigate structured, combinatorial action spaces such as protein sequences. These policies amortize lookahead planning and can be integrated with domain‑specific constraints during rollout. Empirically, LookaHES outperforms strong myopic and nonmyopic baselines across nine synthetic benchmarks from two to eight dimensions and two real‑world tasks: geospatial optimization using NASA night‑light imagery and protein sequence design with constrained token‑level edits. In short, LookaHES provides a general, scalable, and cost‑aware solution for robust long‑horizon optimization in complex decision spaces, which makes it a useful tool for researchers in machine learning, statistics, and applied domains. Our implementation is available at https://github.com/sangttruong/nonmyopia.
Authors:Weihao Hong, Zhiyuan Jiang, Bingyu Shen, Xinlei Guan, Yangyi Feng, Meng Xu, Boyang Li
Abstract:
Vision‑Language Models (VLMs) are increasingly used in safety‑critical applications that require reliable visual grounding. However, these models often hallucinate details that are not present in the image to satisfy user prompts. While recent datasets and benchmarks have been introduced to evaluate systematic hallucinations in VLMs, many hallucination behaviors remain insufficiently characterized. In particular, prior work primarily focuses on object presence or absence, leaving it unclear how prompt phrasing and structural constraints can systematically induce hallucinations. In this paper, we investigate how different forms of prompt pressure influence hallucination behavior. We introduce Ghost‑100, a procedurally generated dataset of synthetic scenes in which key visual details are deliberately removed, enabling controlled analysis of absence‑based hallucinations. Using a structured 5‑Level Prompt Intensity Framework, we vary prompts from neutral queries to toxic demands and rigid formatting constraints. We evaluate three representative open‑weight VLMs: MiniCPM‑V 2.6‑8B, Qwen2‑VL‑7B, and Qwen3‑VL‑8B. Across all three models, hallucination rates do not increase monotonically with prompt intensity. All models exhibit reductions at higher intensity levels at different thresholds, though not all show sustained reduction under maximum coercion. These results suggest that current safety alignment is more effective at detecting semantic hostility than structural coercion, revealing model‑specific limitations in handling compliance pressure. Our dataset is available at: https://github.com/bli1/tone‑matters
Authors:Xin Guo, Rongjunchen Zhang, Guilong Lu, Xuntao Guo, Shuai Jia, Zhi Yang, Liwen Zhang
Abstract:
Large language models have undergone rapid evolution, emerging as a pivotal technology for intelligence in financial operations. However, existing benchmarks are often constrained by pitfalls such as reliance on simulated or general‑purpose samples and a focus on singular, offline static scenarios. Consequently, they fail to align with the requirements for authenticity and real‑time responsiveness in financial services, leading to a significant discrepancy between benchmark performance and actual operational efficacy. To address this, we introduce BizFinBench.v2, the first large‑scale evaluation benchmark grounded in authentic business data from both Chinese and U.S. equity markets, integrating online assessment. We performed clustering analysis on authentic user queries from financial platforms, resulting in eight fundamental tasks and two online tasks across four core business scenarios, totaling 29,578 expert‑level Q&A pairs. Experimental results demonstrate that ChatGPT‑5 achieves a prominent 61.5% accuracy in main tasks, though a substantial gap relative to financial experts persists; in online tasks, DeepSeek‑R1 outperforms all other commercial LLMs. Error analysis further identifies the specific capability deficiencies of existing models within practical financial business contexts. BizFinBench.v2 transcends the limitations of current benchmarks, achieving a business‑level deconstruction of LLM financial capabilities and providing a precise basis for evaluating efficacy in the widespread deployment of LLMs within the financial domain. The data and code are available at https://github.com/HiThink‑Research/BizFinBench.v2.
Authors:Ningning Zhang, Xingxing Yang, Zhizhong Tan, Weiping Deng, Wenyong Wang
Abstract:
Although long‑term memory systems have made substantial progress in recent years, they still exhibit clear limitations in adaptability, scalability, and self‑evolution under continuous interaction settings. Inspired by cognitive theories, we propose HiMem, a hierarchical long‑term memory framework for long‑horizon dialogues, designed to support memory construction, retrieval, and dynamic updating during sustained interactions. HiMem constructs cognitively consistent Episode Memory via a Topic‑Aware Event‑‑Surprise Dual‑Channel Segmentation strategy, and builds Note Memory that captures stable knowledge through a multi‑stage information extraction pipeline. These two memory types are semantically linked to form a hierarchical structure that bridges concrete interaction events and abstract knowledge, enabling efficient retrieval without sacrificing information fidelity. HiMem supports both hybrid and best‑effort retrieval strategies to balance accuracy and efficiency, and incorporates conflict‑aware Memory Reconsolidation to revise and supplement stored knowledge based on retrieval feedback. This design enables continual memory self‑evolution over long‑term use. Experimental results on long‑horizon dialogue benchmarks demonstrate that HiMem consistently outperforms representative baselines in accuracy, consistency, and long‑term reasoning, while maintaining favorable efficiency. Overall, HiMem provides a principled and scalable design paradigm for building adaptive and self‑evolving LLM‑based conversational agents. The code is available at https://github.com/jojopdq/HiMem.
Authors:Kuan Wei Chen, Ting Yi Lin, Wen Ren Yang, Aryan Kesarwani, Riya Singh
Abstract:
We present a cost‑effective two‑step authentication system that integrates face identification and speaker verification using only a camera and microphone available on common devices. The pipeline first performs face recognition to identify a candidate user from a small enrolled group, then performs voice recognition only against the matched identity to reduce computation and improve robustness. For face recognition, a pruned VGG‑16 based classifier is trained on an augmented dataset of 924 images from five subjects, with faces localized by MTCNN; it achieves 95.1% accuracy. For voice recognition, a CNN speaker‑verification model trained on LibriSpeech (train‑other‑360) attains 98.9% accuracy and 3.456% EER on test‑clean. Source code and trained models are available at https://github.com/NCUE‑EE‑AIAL/Two‑step‑Authentication‑Multi‑biometric‑System.
Authors:Anshul Kumar
Abstract:
Tokens are the basic units of Large Language Models (LLMs). LLMs rely on tokenizers to segment text into these tokens, and tokenization is the primary determinant of computational and inference cost. Sanskrit, one of the oldest languages, is hypothesized to express more meaning per token due to its morphology and grammar rules; however, no prior work has quantified this. We use a dataset of 701 parallel verses of the Bhagavad Gita, which comprises three languages‑Sanskrit, English, and Hindi along with transliteration of Sanskrit into English. We test tokenizers including SentencePiece (SPM), older GPT models, and the latest generation tokenizers from Gemini and GPT. We use metrics of token count, characters per token (token efficiency), and tokens per character (token cost). Results show a ~2x difference in token counts between Sanskrit and English/Hindi under the unbiased SPM baseline. English/Hindi translations of Sanskrit commentary resulted in an approximately 20x increase in token count. GPT o200k base (latest, used by GPT‑4o) and Gemini (latest) reduce bias by a significant degree compared to GPT cl100k base (used until GPT‑4), but still fail to fully capture Sanskrit's compactness. This matters because there might be a penalty bias for non‑English users, which inflates the token count. This research provides a foundation for improving future tokenizer design and shows the potential of Sanskrit for highly compact encoding, saving on cost while speeding up training and inference. The code and dataset are available at https://github.com/anshulkr713/sanskrit‑token‑efficiency
Authors:Ahmed H. Ismail, Anthony Kuang, Ayo Akinkugbe, Kevin Zhu, Sean O'Brien
Abstract:
Large language models (LLMs) often encode cognitive behaviors unpredictably across prompts, layers, and contexts, making them difficult to diagnose and control. We present CBMAS, a diagnostic framework for continuous activation steering, which extends cognitive bias analysis from discrete before/after interventions to interpretable trajectories. By combining steering vector construction with dense α‑sweeps, logit lens‑based bias curves, and layer‑site sensitivity analysis, our approach can reveal tipping points where small intervention strengths flip model behavior and show how steering effects evolve across layer depth. We argue that these continuous diagnostics offer a bridge between high‑level behavioral evaluation and low‑level representational dynamics, contributing to the cognitive interpretability of LLMs. Lastly, we provide a CLI and datasets for various cognitive behaviors at the project repository, https://github.com/shimamooo/CBMAS.
Authors:Bingyan Xie, Yongpeng Wu, Wenjun Zhang, Derrick Wing Kwan Ng, Merouane Debbah
Abstract:
The evolution of semantic communications has profoundly impacted wireless video transmission, whose applications dominate driver of modern bandwidth consumption. However, most existing schemes are predominantly optimized for simple additive white Gaussian noise or Rayleigh fading channels, neglecting the ubiquitous multiple‑input multiple‑output (MIMO) environments that critically hinder practical deployment. To bridge this gap, we propose the context video semantic transmission (CVST) framework under MIMO channels. Building upon an efficient contextual video transmission backbone, CVST effectively learns a context‑channel correlation map to explicitly formulate the relationships between feature groups and MIMO subchannels. Leveraging these channel‑aware features, we design a multi‑reference entropy coding mechanism, enabling channel state‑aware variable length coding. Furthermore, CVST incorporates a checkerboard‑based feature modulation strategy to achieve multiple rate points within a single trained model, thereby enhancing deployment flexibility. These innovations constitute our multi‑reference variable length and rate coding (MR‑VLRC) scheme. By integrating contextual transmission with MR‑VLRC, CVST demonstrates substantial performance gains over various standardized separated coding methods and recent wireless video semantic communication approaches. The code is available at https://github.com/xie233333/CVST.
Authors:Zeyi Liao, Yadong Lu, Boyu Gou, Huan Sun, Ahmed Awadallah
Abstract:
Graphical user interface (GUI) grounding, the process of mapping human instructions to GUI actions, serves as a fundamental basis to autonomous GUI agents. While existing grounding models achieve promising performance to simulate the mouse click action on various click‑based benchmarks, another essential mode of mouse interaction, namely dragging, remains largely underexplored. Yet, dragging the mouse to select and manipulate textual content represents a prevalent and important usage in practical GUI scenarios. To narrow this gap, we first introduce GUI‑Drag, a diverse dataset of 161K text dragging examples synthesized through a scalable pipeline. To support systematic and robust evaluation, we further construct ScreenDrag, a benchmark with 5,333 examples spanning three levels of interface context, together with three dedicated metrics designed for assessing text dragging capability. Models trained on GUI‑Drag with an efficient continual training strategy achieve substantial improvements on ScreenDrag, while preserving the original click‑based performance on ScreenSpot, ScreenSpot‑v2, and OSWorld‑G. Our work encourages further research on broader GUI grounding beyond just clicking and paves way toward a truly generalist GUI grounding model. All benchmark, data, checkpoints, and code are open‑sourced and available at https://osu‑nlp‑group.github.io/GUI‑Drag.
Authors:Chengming Cui, Tianxin Wei, Ziyi Chen, Ruizhong Qiu, Zhichen Zeng, Zhining Liu, Xuying Ning, Duo Zhou, Jingrui He
Abstract:
Large language models (LLMs) exhibit complementary strengths arising from differences in pretraining data, model architectures, and decoding behaviors. Inference‑time ensembling provides a practical way to combine these capabilities without retraining. However, existing ensemble approaches suffer from fundamental limitations. Most rely on fixed fusion granularity, which lacks the flexibility required for mid‑generation adaptation and fails to adapt to different generation characteristics across tasks. To address these challenges, we propose AdaFuse, an adaptive ensemble decoding framework that dynamically selects semantically appropriate fusion units during generation. Rather than committing to a fixed granularity, AdaFuse adjusts fusion behavior on the fly based on the decoding context, with words serving as basic building blocks for alignment. To be specific, we introduce an uncertainty‑based criterion to decide whether to apply ensembling at each decoding step. Under confident decoding states, the model continues generation directly. In less certain states, AdaFuse invokes a diversity‑aware scaling strategy to explore alternative candidate continuations and inform ensemble decisions. This design establishes a synergistic interaction between adaptive ensembling and test‑time scaling, where ensemble decisions guide targeted exploration, and the resulting diversity in turn strengthens ensemble quality. Experiments on open‑domain question answering, arithmetic reasoning, and machine translation demonstrate that AdaFuse consistently outperforms strong ensemble baselines, achieving an average relative improvement of 6.88%. The code is available at https://github.com/CCM0111/AdaFuse.
Authors:Jiayu Ding, Haoran Tang, Ge Li
Abstract:
In safety‑critical domains, linguistic ambiguity can have severe consequences; a vague command like "Pass me the vial" in a surgical setting could lead to catastrophic errors. Yet, most embodied AI research overlooks this, assuming instructions are clear and focusing on execution rather than confirmation. To address this critical safety gap, we are the first to define Open‑Vocabulary 3D Instruction Ambiguity Detection, a fundamental new task where a model must determine if a command has a single, unambiguous meaning within a given 3D scene. To support this research, we build Ambi3D, the large‑scale benchmark for this task, featuring over 700 diverse 3D scenes and around 22k instructions. Our analysis reveals a surprising limitation: state‑of‑the‑art 3D Large Language Models (LLMs) struggle to reliably determine if an instruction is ambiguous. To address this challenge, we propose AmbiVer, a two‑stage framework that collects explicit visual evidence from multiple views and uses it to guide an vision‑language model (VLM) in judging instruction ambiguity. Extensive experiments demonstrate the challenge of our task and the effectiveness of AmbiVer, paving the way for safer and more trustworthy embodied AI. Code and dataset available at https://jiayuding031020.github.io/ambi3d/.
Authors:Longbin Ji, Xiaoxiong Liu, Junyuan Shang, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang
Abstract:
Recent advances in video generation have been dominated by diffusion and flow‑matching models, which produce high‑quality results but remain computationally intensive and difficult to scale. In this work, we introduce VideoAR, the first large‑scale Visual Autoregressive (VAR) framework for video generation that combines multi‑scale next‑frame prediction with autoregressive modeling. VideoAR disentangles spatial and temporal dependencies by integrating intra‑frame VAR modeling with causal next‑frame prediction, supported by a 3D multi‑scale tokenizer that efficiently encodes spatio‑temporal dynamics. To improve long‑term consistency, we propose Multi‑scale Temporal RoPE, Cross‑Frame Error Correction, and Random Frame Mask, which collectively mitigate error propagation and stabilize temporal coherence. Our multi‑stage pretraining pipeline progressively aligns spatial and temporal learning across increasing resolutions and durations. Empirically, VideoAR achieves new state‑of‑the‑art results among autoregressive models, improving FVD on UCF‑101 from 99.5 to 88.6 while reducing inference steps by over 10x, and reaching a VBench score of 81.74‑competitive with diffusion‑based models an order of magnitude larger. These results demonstrate that VideoAR narrows the performance gap between autoregressive and diffusion paradigms, offering a scalable, efficient, and temporally consistent foundation for future video generation research.
Authors:Jingsheng Zheng, Jintian Zhang, Yujie Luo, Yuren Mao, Yunjun Gao, Lun Du, Huajun Chen, Ningyu Zhang
Abstract:
Autonomous machine learning agents have revolutionized scientific discovery, yet they remain constrained by a Generate‑Execute‑Feedback paradigm. Previous approaches suffer from a severe Execution Bottleneck, as hypothesis evaluation relies strictly on expensive physical execution. To bypass these physical constraints, we internalize execution priors to substitute costly runtime checks with instantaneous predictive reasoning, drawing inspiration from World Models. In this work, we formalize the task of Data‑centric Solution Preference and construct a comprehensive corpus of 18,438 pairwise comparisons. We demonstrate that LLMs exhibit significant predictive capabilities when primed with a Verified Data Analysis Report, achieving 61.5% accuracy and robust confidence calibration. Finally, we instantiate this framework in FOREAGENT, an agent that employs a Predict‑then‑Verify loop, achieving a 6x acceleration in convergence while surpassing execution‑based baselines by +6%. Our code and dataset will be publicly available soon at https://github.com/zjunlp/predict‑before‑execute.
Authors:Haoming Xu, Ningyuan Zhao, Yunzhi Yao, Weihong Xu, Hongru Wang, Xinle Deng, Shumin Deng, Jeff Z. Pan, Huajun Chen, Ningyu Zhang
Abstract:
As Large Language Models (LLMs) are increasingly deployed in real‑world settings, correctness alone is insufficient. Reliable deployment requires maintaining truthful beliefs under contextual perturbations. Existing evaluations largely rely on point‑wise confidence like Self‑Consistency, which can mask brittle belief. We show that even facts answered with perfect self‑consistency can rapidly collapse under mild contextual interference. To address this gap, we propose Neighbor‑Consistency Belief (NCB), a structural measure of belief robustness that evaluates response coherence across a conceptual neighborhood. To validate the efficiency of NCB, we introduce a new cognitive stress‑testing protocol that probes outputs stability under contextual interference. Experiments across multiple LLMs show that the performance of high‑NCB data is relatively more resistant to interference. Finally, we present Structure‑Aware Training (SAT), which optimizes context‑invariant belief structure and reduces long‑tail knowledge brittleness by approximately 30%. Code will be available at https://github.com/zjunlp/belief.
Authors:Dawei Wang, Chengming Zhou, Di Zhao, Xinyuan Liu, Marci Chi Ma, Gary Ushaw, Richard Davison
Abstract:
Recent breakthroughs in Large Language Models (LLMs) have positioned them as a promising paradigm for agents, with long‑term planning and decision‑making emerging as core general‑purpose capabilities for adapting to diverse scenarios and tasks. Real‑time strategy (RTS) games serve as an ideal testbed for evaluating these two capabilities, as their inherent gameplay requires both macro‑level strategic planning and micro‑level tactical adaptation and action execution. Existing RTS game‑based environments either suffer from relatively high computational demands or lack support for textual observations, which has constrained the use of RTS games for LLM evaluation. Motivated by this, we present TowerMind, a novel environment grounded in the tower defense (TD) subgenre of RTS games. TowerMind preserves the key evaluation strengths of RTS games for assessing LLMs, while featuring low computational demands and a multimodal observation space, including pixel‑based, textual, and structured game‑state representations. In addition, TowerMind supports the evaluation of model hallucination and provides a high degree of customizability. We design five benchmark levels to evaluate several widely used LLMs under different multimodal input settings. The results reveal a clear performance gap between LLMs and human experts across both capability and hallucination dimensions. The experiments further highlight key limitations in LLM behavior, such as inadequate planning validation, a lack of multifinality in decision‑making, and inefficient action use. We also evaluate two classic reinforcement learning algorithms: Ape‑X DQN and PPO. By offering a lightweight and multimodal design, TowerMind complements the existing RTS game‑based environment landscape and introduces a new benchmark for the AI agent field. The source code is publicly available on GitHub(https://github.com/tb6147877/TowerMind).
Authors:Alexandra Dragomir, Florin Brad, Radu Tudor Ionescu
Abstract:
Large language models (LLMs) have demonstrated competitive performance in zero‑shot multilingual machine translation (MT). Some follow‑up works further improved MT performance via preference optimization, but they leave a key aspect largely underexplored: the order in which data samples are given during training. We address this topic by integrating curriculum learning into various state‑of‑the‑art preference optimization algorithms to boost MT performance. We introduce a novel curriculum learning strategy with restarts (CLewR), which reiterates easy‑to‑hard curriculum multiple times during training to effectively mitigate the catastrophic forgetting of easy examples. We demonstrate consistent gains across several model families (Gemma2, Qwen2.5, Llama3.1) and preference optimization techniques. We publicly release our code at https://github.com/alexandra‑dragomir/CLewR.
Authors:Yinghan Xu, John Dingliana
Abstract:
We propose a novel framework for decomposing arbitrarily posed humans into animatable multi‑layered 3D human avatars, separating the body and garments. Conventional single‑layer reconstruction methods lock clothing to one identity, while prior multi‑layer approaches struggle with occluded regions. We overcome both limitations by encoding each layer as a set of 2D Gaussians for accurate geometry and photorealistic rendering, and inpainting hidden regions with a pretrained 2D diffusion model via score‑distillation sampling (SDS). Our three‑stage training strategy first reconstructs the coarse canonical garment via single‑layer reconstruction, followed by multi‑layer training to jointly recover the inner‑layer body and outer‑layer garment details. Experiments on two 3D human benchmark datasets (4D‑Dress, Thuman2.0) show that our approach achieves better rendering quality and layer decomposition and recomposition than the previous state‑of‑the‑art, enabling realistic virtual try‑on under novel viewpoints and poses, and advancing practical creation of high‑fidelity 3D human assets for immersive applications. Our code is available at https://github.com/RockyXu66/LayerGS
Authors:ChunTeng Chen, YiChen Hsu, YiWen Liu, WeiFang Sun, TsaiChing Ni, ChunYi Lee, Min Sun, YuanFu Yang
Abstract:
The ability to automatically generate large‑scale, interactive, and physically realistic 3D environments is crucial for advancing robotic learning and embodied intelligence. However, existing generative approaches often fail to capture the functional complexity of real‑world interiors, particularly those containing articulated objects with movable parts essential for manipulation and navigation. This paper presents SceneFoundry, a language‑guided diffusion framework that generates apartment‑scale 3D worlds with functionally articulated furniture and semantically diverse layouts for robotic training. From natural language prompts, an LLM module controls floor layout generation, while diffusion‑based posterior sampling efficiently populates the scene with articulated assets from large‑scale 3D repositories. To ensure physical usability, SceneFoundry employs differentiable guidance functions to regulate object quantity, prevent articulation collisions, and maintain sufficient walkable space for robotic navigation. Extensive experiments demonstrate that our framework generates structurally valid, semantically coherent, and functionally interactive environments across diverse scene types and conditions, enabling scalable embodied AI research. project page: https://anc891203.github.io/SceneFoundry‑Demo/
Authors:Xiaoshuai Song, Haofei Chang, Guanting Dong, Yutao Zhu, Zhicheng Dou, Ji-Rong Wen
Abstract:
Large language models (LLMs) are expected to be trained to act as agents in various real‑world environments, but this process relies on rich and varied tool‑interaction sandboxes. However, access to real systems is often restricted; LLM‑simulated environments are prone to hallucinations and inconsistencies; and manually built sandboxes are hard to scale. In this paper, we propose EnvScaler, an automated framework for scalable tool‑interaction environments via programmatic synthesis. EnvScaler comprises two components. First, SkelBuilder constructs diverse environment skeletons through topic mining, logic modeling, and quality evaluation. Then, ScenGenerator generates multiple task scenarios and rule‑based trajectory validation functions for each environment. With EnvScaler, we synthesize 191 environments and about 7K scenarios, and apply them to Supervised Fine‑Tuning (SFT) and Reinforcement Learning (RL) for Qwen3 series models. Results on three benchmarks show that EnvScaler significantly improves LLMs' ability to solve tasks in complex environments involving multi‑turn, multi‑tool interactions. We release our code and data at https://github.com/RUC‑NLPIR/EnvScaler.
Authors:Zezhou Wang, Ziyun Zhang, Xiaoyi Zhang, Zhuzhong Qian, Yan Lu
Abstract:
Vision‑language models are increasingly deployed as computer‑use agents (CUAs) that operate desktops and browsers. Top‑performing CUAs are framework‑based systems that decompose planning and execution, while end‑to‑end screenshot‑to‑action policies are easier to deploy but lag behind on benchmarks such as OSWorld‑Verified. GUI datasets like OSWorld pose two bottlenecks: they expose only a few hundred interactive, verifiable tasks and environments, and expert trajectories must be gathered by interacting with these environments, making such data hard to scale. We therefore ask how reinforcement learning from verifiable rewards (RLVR) can best exploit a small pool of exist expert trajectories to train end‑to‑end policies. Naively mixing these off‑policy traces into on‑policy RLVR is brittle: even after format conversion, expert trajectories exhibit structural mismatch and distribution shift from the learner. We propose BEPA (Bi‑Level Expert‑to‑Policy Assimilation), which turns static expert traces into policy‑aligned guidance via self‑rolled reachable trajectories under the base policy (LEVEL‑1) and a per‑task, dynamically updated cache used in RLVR (LEVEL‑2). On OSWorld‑Verified, BEPA improves UITARS1.5‑7B success from 22.87% to 32.13% and raises a held‑out split from 5.74% to 10.30%, with consistent gains on MMBench‑GUI and Online‑Mind2Web. Our code and data are available at: https://github.com/LEON‑gittech/Verl_GUI.git
Authors:Yongyi Yang, Jianyang Gao
Abstract:
Hyper‑Connections (HC) generalizes residual connections by introducing dynamic residual matrices that mix information across multiple residual streams, accelerating convergence in deep neural networks. However, unconstrained residual matrices can compromise training stability. To address this, DeepSeek's Manifold‑Constrained Hyper‑Connections (mHC) approximately projects these matrices onto the Birkhoff polytope via iterative Sinkhorn‑‑Knopp (SK) normalization. We identify two limitations of this approach: (i) finite SK iterations do not guarantee exact doubly stochasticity, leaving an approximation gap that can accumulate through network depth and undermine stability; (ii) efficient SK implementation requires highly specialized CUDA kernels, raising engineering barriers and reducing portability. Motivated by the Birkhoff‑‑von Neumann theorem, we propose mHC‑lite, a simple reparameterization that explicitly constructs doubly stochastic matrices as convex combinations of permutation matrices. This approach guarantees exact doubly stochasticity by construction and can be implemented using only native matrix operations. Extensive experiments demonstrate that mHC‑lite matches or exceeds mHC in performance while achieving higher training throughput with a naive implementation and eliminating the residual instabilities observed in both HC and mHC. The code is publicly available at https://github.com/FFTYYY/mhc‑lite.
Authors:Yuxuan Zhou, Fei Huang, Heng Li, Fengyi Wu, Tianyu Wang, Jianwei Zhang, Junyang Lin, Zhi-Qi Cheng
Abstract:
Verification is a key bottleneck in improving inference speed while maintaining distribution fidelity in Speculative Decoding. Recent work has shown that sequence‑level verification leads to a higher number of accepted tokens compared to token‑wise verification. However, existing solutions often rely on surrogate approximations or are constrained by partial information, struggling with joint intractability. In this work, we propose Hierarchical Speculative Decoding (HSD), a provably lossless verification method that significantly boosts the expected number of accepted tokens and overcomes joint intractability by balancing excess and deficient probability mass across accessible branches. Our extensive large‑scale experiments demonstrate that HSD yields consistent improvements in acceptance rates across diverse model families and benchmarks. Moreover, its strong explainability and generality make it readily integrable into a wide range of speculative decoding frameworks. Notably, integrating HSD into EAGLE‑3 yields over a 12% performance gain, establishing state‑of‑the‑art decoding efficiency without compromising distribution fidelity. Code is available at https://github.com/ZhouYuxuanYX/Hierarchical‑Speculative‑Decoding.
Authors:Tassallah Abdullahi, Shrestha Ghosh, Hamish S Fraser, Daniel León Tramontini, Adeel Abbasi, Ghada Bourjeily, Carsten Eickhoff, Ritambhara Singh
Abstract:
Persona conditioning can be viewed as a behavioral prior for large language models (LLMs) and is often assumed to confer expertise and improve safety in a monotonic manner. However, its effects on high‑stakes clinical decision‑making remain poorly characterized. We systematically evaluate persona‑based control in clinical LLMs, examining how professional roles (e.g., Emergency Department physician, nurse) and interaction styles (bold vs.\ cautious) influence behavior across models and medical tasks. We assess performance on clinical triage and patient‑safety tasks using multidimensional evaluations that capture task accuracy, calibration, and safety‑relevant risk behavior. We find systematic, context‑dependent, and non‑monotonic effects: Medical personas improve performance in critical care tasks, yielding gains of up to ~+20% in accuracy and calibration, but degrade performance in primary‑care settings by comparable margins. Interaction style modulates risk propensity and sensitivity, but it's highly model‑dependent. While aggregated LLM‑judge rankings favor medical over non‑medical personas in safety‑critical cases, we found that human clinicians show moderate agreement on safety compliance (average Cohen's κ= 0.43) but indicate a low confidence in 95.9% of their responses on reasoning quality. Our work shows that personas function as behavioral priors that introduce context‑dependent trade‑offs rather than guarantees of safety or expertise. The code is available at https://github.com/rsinghlab/Persona\_Paradox.
Authors:Qiao Liu, Wing Hung Wong
Abstract:
Modern data analysis increasingly requires flexible conditional inference P(X_B | X_A) where (X_A, X_B) is an arbitrary partition of observed variable X. Existing conditional inference methods lack this flexibility as they are tied to a fixed conditioning structure and cannot perform new conditional inference once trained. To solve this, we propose a Bayesian generative modeling (BGM) approach for arbitrary conditional inference without retraining. BGM learns a generative model of X through an iterative Bayesian updating algorithm where model parameters and latent variables are updated until convergence. Once trained, any conditional distribution can be obtained without retraining. Empirically, BGM achieves superior prediction performance with well calibrated predictive intervals, demonstrating that a single learned model can serve as a universal engine for conditional prediction with uncertainty quantification. We provide theoretical guarantees for the convergence of the stochastic iterative algorithm, statistical consistency and conditional‑risk bounds. The proposed BGM framework leverages the power of AI to capture complex relationships among variables while adhering to Bayesian principles, emerging as a promising framework for advancing various applications in modern data science. The code for BGM is freely available at https://github.com/liuq‑lab/bayesgm.
Authors:Yingzhuo Liu, Shuodi Liu, Weijun Luo, Liuyu Xiang, Zhaofeng He
Abstract:
Policy Space Response Oracles (PSRO) combines game‑theoretic equilibrium computation with learning and is effective in approximating Nash Equilibrium in zero‑sum games. However, the computational cost of PSRO has become a significant limitation to its practical application. Our analysis shows that game simulation is the primary bottleneck in PSRO's runtime. To address this issue, we conclude the concept of Simulation‑Free PSRO and summarize existing methods that instantiate this concept. Additionally, we propose a novel Dynamic Window‑based Simulation‑Free PSRO, which introduces the concept of a strategy window to replace the original strategy set maintained in PSRO. The number of strategies in the strategy window is limited, thereby simplifying opponent strategy selection and improving the robustness of the best response. Moreover, we use Nash Clustering to select the strategy to be eliminated, ensuring that the number of strategies within the strategy window is effectively limited. Our experiments across various environments demonstrate that the Dynamic Window mechanism significantly reduces exploitability compared to existing methods, while also exhibiting excellent compatibility. Our code is available at https://github.com/enochliu98/SF‑PSRO.
Authors:Tarun Prajapati
Abstract:
Modern Retrieval‑Augmented Generation (RAG) systems struggle with a fundamental architectural tension: vector indices are optimized for query latency but poorly handle continuous knowledge updates, while data lakes excel at versioning but introduce query latency penalties. We introduce LiveVectorLake, a dual‑tier temporal knowledge base architecture that enables real‑time semantic search on current knowledge while maintaining complete version history for compliance, auditability, and point‑in‑time retrieval. The system introduces three core architectural contributions: (1) Content‑addressable chunk‑level synchronization using SHA‑256 hashing for deterministic change detection without external state tracking; (2) Dual‑tier storage separating hot‑tier vector indices (Milvus with HNSW) from cold‑tier columnar versioning (Delta Lake with Parquet), optimizing query latency and storage cost independently; (3) Temporal query routing enabling point‑in‑time knowledge retrieval via delta‑versioning with ACID consistency across tiers. Evaluation on a 100‑document corpus versioned across five time points demonstrates: (i) 10‑15% re‑processing of content during updates compared to 100% for full re‑indexing; (ii) sub‑100ms retrieval latency on current knowledge; (iii) sub‑2s latency for temporal queries across version history; and (iv) storage cost optimization through hot/cold tier separation (only current chunks in expensive vector indices). The approach enables production RAG deployments requiring simultaneous optimization for query performance, update efficiency, and regulatory compliance. Code and resources: [https://github.com/praj‑tarun/LiveVectorLake]
Authors:Hadi Hosseini, Debmalya Mandal, Amrit Puhan
Abstract:
We introduce \mathbfSP‑Rank, the first large‑scale, publicly available dataset for benchmarking algorithms that leverage both first‑order preferences and second‑order predictions in ranking tasks. Each datapoint includes a personal vote (first‑order signal) and a meta‑prediction of how others will vote (second‑order signal), allowing richer modeling than traditional datasets that capture only individual preferences. SP‑Rank contains over 12,000 human‑generated datapoints across three domains ‑‑ geography, movies, and paintings, and spans nine elicitation formats with varying subset sizes. This structure enables empirical analysis of preference aggregation when expert identities are unknown but presumed to exist, and individual votes represent noisy estimates of a shared ground‑truth ranking. We benchmark SP‑Rank by comparing traditional aggregation methods that use only first‑order votes against SP‑Voting, a second‑order method that jointly reasons over both signals to infer ground‑truth rankings. While SP‑Rank also supports models that rely solely on second‑order predictions, our benchmarks emphasize the gains from combining both signals. We evaluate performance across three core tasks: (1) full ground‑truth rank recovery, (2) subset‑level rank recovery, and (3) probabilistic modeling of voter behavior. Results show that incorporating second‑order signals substantially improves accuracy over vote‑only methods. Beyond social choice, SP‑Rank supports downstream applications in learning‑to‑rank, extracting expert knowledge from noisy crowds, and training reward models in preference‑based fine‑tuning pipelines. We release the dataset, code, and baseline evaluations (available at https://github.com/amrit19/SP‑Rank‑Dataset ) to foster research in human preference modeling, aggregation theory, and human‑AI alignment.
Authors:Yiji Zhao, Zihao Zhong, Ao Wang, Haomin Wen, Ming Jin, Yuxuan Liang, Huaiyu Wan, Hao Wu
Abstract:
Spatial‑Temporal Graph (STG) forecasting on large‑scale networks has garnered significant attention. However, existing models predominantly focus on short‑horizon predictions and suffer from notorious computational costs and memory consumption when scaling to long‑horizon predictions and large graphs. Targeting the above challenges, we present FaST, an effective and efficient framework based on heterogeneity‑aware Mixture‑of‑Experts (MoEs) for long‑horizon and large‑scale STG forecasting, which unlocks one‑week‑ahead (672 steps at a 15‑minute granularity) prediction with thousands of nodes. FaST is underpinned by two key innovations. First, an adaptive graph agent attention mechanism is proposed to alleviate the computational burden inherent in conventional graph convolution and self‑attention modules when applied to large‑scale graphs. Second, we propose a new parallel MoE module that replaces traditional feed‑forward networks with Gated Linear Units (GLUs), enabling an efficient and scalable parallel structure. Extensive experiments on real‑world datasets demonstrate that FaST not only delivers superior long‑horizon predictive accuracy but also achieves remarkable computational efficiency compared to state‑of‑the‑art baselines. Our source code is available at: https://github.com/yijizhao/FaST.
Authors:Haoyu Zhao, Akide Liu, Zeyu Zhang, Weijie Wang, Feng Chen, Ruihan Zhu, Gholamreza Haffari, Bohan Zhuang
Abstract:
Embodied question answering (EQA) in 3D environments often requires collecting context that is distributed across multiple viewpoints and partially occluded. However, most recent vision‑‑language models (VLMs) are constrained to a fixed and finite set of input views, which limits their ability to acquire question‑relevant context at inference time and hinders complex spatial reasoning. We propose Chain‑of‑View (CoV) prompting, a training‑free, test‑time reasoning framework that transforms a VLM into an active viewpoint reasoner through a coarse‑to‑fine exploration process. CoV first employs a View Selection agent to filter redundant frames and identify question‑aligned anchor views. It then performs fine‑grained view adjustment by interleaving iterative reasoning with discrete camera actions, obtaining new observations from the underlying 3D scene representation until sufficient context is gathered or a step budget is reached. We evaluate CoV on OpenEQA across four mainstream VLMs and obtain an average +11.56% improvement in LLM‑Match, with a maximum gain of +13.62% on Qwen3‑VL‑Flash. CoV further exhibits test‑time scaling: increasing the minimum action budget yields an additional +2.51% average improvement, peaking at +3.73% on Gemini‑2.5‑Flash. On ScanQA and SQA3D, CoV delivers strong performance (e.g., 116 CIDEr / 31.9 EM@1 on ScanQA and 51.1 EM@1 on SQA3D). Overall, these results suggest that question‑aligned view selection coupled with open‑view search is an effective, model‑agnostic strategy for improving spatial reasoning in 3D EQA without additional training. Code is available on https://github.com/ziplab/CoV .
Authors:Wajid Nasser
Abstract:
LLM‑as‑judge systems promise scalable, consistent evaluation. We find the opposite: judges are consistent, but not with each other; they are consistent with themselves. Across 3,240 evaluations (9 judges x 120 unique video x pack items x 3 independent runs), inter‑judge agreement is near‑zero (Krippendorff's α = 0.042). On two dimensions, judges disagree more than random noise would predict (α < 0). Yet this disagreement isn't chaos; it's structured. A classifier identifies which judge produced an evaluation with 77.1% accuracy from rubric scores alone, rising to 89.9% with disposition features. Within model families, the signal is even stronger: GPT‑4.1 and GPT‑5.2 are distinguishable with 99.6% accuracy. We call this the reliability paradox: judges cannot agree on what constitutes quality, yet their disagreement patterns are so stable they function as fingerprints. Each judge implements a distinct, stable theory of quality: an "evaluative disposition" that shapes how it interprets any rubric. We characterize these dispositions along multiple axes: harshness/leniency, dimension emphasis, within‑judge stability (ICC), and evidence behavior (receipt validity, semantic linkage via NLI, and shotgun index). The implication is stark: LLM judges are not interchangeable instruments measuring a shared construct. They are distinct measurement devices, each encoding its own implicit theory of quality. Averaging their scores produces a synthetic verdict that corresponds to no judge's actual values.
Authors:Wenhao Zeng, Xuteng Zhang, Yuling Shi, Chao Hu, Yuting Chen, Beijun Shen, Xiaodong Gu
Abstract:
Large Reasoning Models (LRMs) achieve remarkable performance by explicitly generating multi‑step chains of thought, but this capability incurs substantial inference latency and computational cost. Collaborative inference offers a promising solution by selectively allocating work between lightweight and large models, yet a fundamental challenge remains: determining when a reasoning step requires the capacity of a large model or the efficiency of a small model. Existing routing strategies either rely on local token probabilities or post‑hoc verification, introducing significant inference overhead. In this work, we propose a novel perspective on step‑wise collaboration: the difficulty of a reasoning step can be inferred from its very first token. Inspired by the "Aha Moment" phenomenon in LRMs, we show that the entropy of the initial token serves as a strong predictor of step difficulty. Building on this insight, we introduce GlimpRouter, a training‑free step‑wise collaboration framework. GlimpRouter employs a lightweight model to generate only the first token of each reasoning step and routes the step to a larger model only when the initial token entropy exceeds a threshold. Experiments on multiple benchmarks demonstrate that our approach significantly reduces inference latency while preserving accuracy. For instance, GlimpRouter attains a substantial 10.7% improvement in accuracy while reducing inference latency by 25.9% compared to a standalone large model on AIME25. These results suggest a simple yet effective mechanism for reasoning: allocating computation based on a glimpse of thought rather than full‑step evaluation.
Authors:Ziqi Zhao, Zhaochun Ren, Jiahong Zou, Liu Yang, Zhiwei Xu, Xuri Ge, Zhumin Chen, Xinyu Ma, Daiting Shi, Shuaiqiang Wang, Dawei Yin, Xin Xin
Abstract:
Reinforcement learning with verifiable rewards (RLVR) has proven effective in enhancing the reasoning of large language models (LLMs). Monte Carlo Tree Search (MCTS)‑based extensions improve upon vanilla RLVR (e.g., GRPO) by providing tree‑based reasoning rollouts that enable fine‑grained and segment‑level credit assignment. However, existing methods still suffer from limited exploration diversity and inefficient reasoning. To address the above challenges, we propose reinforced efficient reasoning via semantically diverse explorations, i.e., ROSE, for LLMs. To encourage more diverse reasoning exploration, our method incorporates a semantic‑entropy‑based branching strategy and an \varepsilon‑exploration mechanism. The former operates on already sampled reasoning rollouts to capture semantic uncertainty and select branching points with high semantic divergence to generate new successive reasoning paths, whereas the latter stochastically initiates reasoning rollouts from the root, preventing the search process from becoming overly local. To improve efficiency, we design a length‑aware segment‑level advantage estimator that rewards concise and correct reasoning while penalizing unnecessarily long reasoning chains. Extensive experiments on various mathematical reasoning benchmarks with Qwen and Llama models validate the effectiveness and efficiency of ROSE. Codes are available at https://github.com/ZiqiZhao1/ROSE‑rl.
Authors:Jianbo Li, Yi Jiang, Sendong Zhao, Bairui Hu, Haochun Wang, Bing Qin
Abstract:
Retrieval‑Augmented Generation (RAG) helps LLMs stay accurate, but feeding long documents into a prompt makes the model slow and expensive. This has motivated context compression, ranging from token pruning and summarization to embedding‑based compression. While researchers have tried ''compressing'' these documents into smaller summaries or mathematical embeddings, there is a catch: the more you compress the data, the more the LLM struggles to understand it. To address this challenge, we propose ArcAligner (Adaptive recursive context Aligner), a lightweight module integrated into the language model layers to help the model better utilize highly compressed context representations for downstream generation. It uses an adaptive ''gating'' system that only adds extra processing power when the information is complex, keeping the system fast. Across knowledge‑intensive QA benchmarks, ArcAligner consistently beats compression baselines at comparable compression rates, especially on multi‑hop and long‑tail settings. The source code is publicly available.
Authors:Yi Jiang, Sendong Zhao, Jianbo Li, Bairui Hu, Yanrui Du, Haochun Wang, Bing Qin
Abstract:
Retrieval‑Augmented Generation (RAG) improves generation quality by incorporating evidence retrieved from large external corpora. However, most existing methods rely on statically selecting top‑k passages based on individual relevance, which fails to exploit combinatorial gains among passages and often introduces substantial redundancy. To address this limitation, we propose OptiSet, a set‑centric framework that unifies set selection and set‑level ranking for RAG. OptiSet adopts an "Expand‑then‑Refine" paradigm: it first expands a query into multiple perspectives to enable a diverse candidate pool and then refines the candidate pool via re‑selection to form a compact evidence set. We then devise a self‑synthesis strategy without strong LLM supervision to derive preference labels from the set conditional utility changes of the generator, thereby identifying complementary and redundant evidence. Finally, we introduce a set‑list wise training strategy that jointly optimizes set selection and set‑level ranking, enabling the model to favor compact, high‑gain evidence sets. Extensive experiments demonstrate that OptiSet improves performance on complex combinatorial problems and makes generation more efficient. The source code is publicly available.
Authors:Tongyu Wen, Guanting Dong, Zhicheng Dou
Abstract:
Large language model (LLM)‑based search agents have proven promising for addressing knowledge‑intensive problems by incorporating information retrieval capabilities. Existing works largely focus on optimizing the reasoning paradigms of search agents, yet the quality of intermediate search queries during reasoning remains overlooked. As a result, the generated queries often remain inaccurate, leading to unexpected retrieval results and ultimately limiting search agents' overall effectiveness. To mitigate this issue, we introduce SmartSearch, a framework built upon two key mechanisms: (1) Process rewards, which provide fine‑grained supervision for the quality of each intermediate search query through Dual‑Level Credit Assessment. (2) Query refinement, which promotes the optimization of query generation by selectively refining low‑quality search queries and regenerating subsequent search rounds based on these refinements. To enable the search agent to progressively internalize the ability to improve query quality under the guidance of process rewards, we design a three‑stage curriculum learning framework. This framework guides the agent through a progression from imitation, to alignment, and ultimately to generalization. Experimental results show that SmartSearch consistently surpasses existing baselines, and additional quantitative analyses further confirm its significant gains in both search efficiency and query quality. The code is available at https://github.com/MYVAE/SmartSearch.
Authors:Ao Sun, Xiaoyu Wang, Zhe Tan, Yu Li, Jiachen Zhu, Shu Su, Yuheng Jia
Abstract:
As Large Language Models (LLMs) serve a global audience, alignment must transition from enforcing universal consensus to respecting cultural pluralism. We demonstrate that dense models, when forced to fit conflicting value distributions, suffer from Mean Collapse, converging to a generic average that fails to represent diverse groups. We attribute this to Cultural Sparsity, where gradient interference prevents dense parameters from spanning distinct cultural modes. To resolve this, we propose \textscCuMA (Cultural Mixture of Adapters), a framework that frames alignment as a conditional capacity separation problem. By incorporating demographic‑aware routing, \textscCuMA internalizes a Latent Cultural Topology to explicitly disentangle conflicting gradients into specialized expert subspaces. Extensive evaluations on WorldValuesBench, Community Alignment, and PRISM demonstrate that \textscCuMA achieves state‑of‑the‑art performance, significantly outperforming both dense baselines and semantic‑only MoEs. Crucially, our analysis confirms that \textscCuMA effectively mitigates mean collapse, preserving cultural diversity. Our code is available at https://github.com/Throll/CuMA.
Authors:Chengxin Shi, Qinnan Cai, Zeyuan Chen, Long Zeng, Yibo Zhao, Jing Yu, Jianxiang Yu, Xiang Li
Abstract:
Designing academic posters is a labor‑intensive process requiring the precise balance of high‑density content and sophisticated layout. While existing paper‑to‑poster generation methods automate initial drafting, they are typically single‑pass and non‑interactive, often fail to align with complex, subjective user intent. To bridge this gap, we propose APEX (Academic Poster Editing agentic eXpert), the first agentic framework for interactive academic poster editing, supporting fine‑grained control with robust multi‑level API‑based editing and a review‑and‑adjustment Mechanism. In addition, we introduce APEX‑Bench, the first systematic benchmark comprising 514 academic poster editing instructions, categorized by a multi‑dimensional taxonomy including operation type, difficulty, and abstraction level, constructed via reference‑guided and reference‑free strategies to ensure realism and diversity. We further establish a multi‑dimensional VLM‑as‑a‑judge evaluation protocol to assess instruction fulfillment, modification scope, and visual consistency & harmony. Experimental results demonstrate that APEX significantly outperforms baseline methods. Our implementation is available at https://github.com/Breesiu/APEX.
Authors:Zefang Zong, Dingwei Chen, Yang Li, Qi Yi, Bo Zhou, Chengming Li, Bo Qian, Peng Chen, Jie Jiang
Abstract:
LLM agents have emerged as powerful systems for tackling multi‑turn tasks by interleaving internal reasoning and external tool interactions. Agentic Reinforcement Learning has recently drawn significant research attention as a critical post‑training paradigm to further refine these capabilities. In this paper, we present AT^2PO (Agentic Turn‑based Policy Optimization via Tree Search), a unified framework for multi‑turn agentic RL that addresses three core challenges: limited exploration diversity, sparse credit assignment, and misaligned policy optimization. AT^2PO introduces a turn‑level tree structure that jointly enables Entropy‑Guided Tree Expansion for strategic exploration and Turn‑wise Credit Assignment for fine‑grained reward propagation from sparse outcomes. Complementing this, we propose Agentic Turn‑based Policy Optimization, a turn‑level learning objective that aligns policy updates with the natural decision granularity of agentic interactions. ATPO is orthogonal to tree search and can be readily integrated into any multi‑turn RL pipeline. Experiments across seven benchmarks demonstrate consistent improvements over the state‑of‑the‑art baseline by up to 1.84 percentage points in average, with ablation studies validating the effectiveness of each component. Our code is available at https://github.com/zzfoutofspace/ATPO.
Authors:Yehoon Jang, Chaewon Lee, Hyun-seok Min, Sungchul Choi
Abstract:
The Patent Trial and Appeal Board (PTAB) of the USPTO adjudicates thousands of ex parte appeals each year, requiring the integration of technical understanding and legal reasoning. While large language models (LLMs) are increasingly applied in patent and legal practice, their use has remained limited to lightweight tasks, with no established means of systematically evaluating their capacity for structured legal reasoning in the patent domain. In this work, we introduce PILOT‑Bench, the first PTAB‑centric benchmark that aligns PTAB decisions with USPTO patent data at the case‑level and formalizes three IRAC‑aligned classification tasks: Issue Type, Board Authorities, and Subdecision. We evaluate a diverse set of closed‑source (commercial) and open‑source LLMs and conduct analyses across multiple perspectives, including input‑variation settings, model families, and error tendencies. Notably, on the Issue Type task, closed‑source models consistently exceed 0.75 in Micro‑F1 score, whereas the strongest open‑source model (Qwen‑8B) achieves performance around 0.56, highlighting a substantial gap in reasoning capabilities. PILOT‑Bench establishes a foundation for the systematic evaluation of patent‑domain legal reasoning and points toward future directions for improving LLMs through dataset design and model alignment. All data, code, and benchmark resources are available at https://github.com/TeamLab/pilot‑bench.
Authors:Tingyu Wu, Zhisheng Chen, Ziyan Weng, Shuhe Wang, Chenglong Li, Shuo Zhang, Sen Hu, Silin Wu, Qizhen Lan, Huacan Wang, Ronghao Chen
Abstract:
Existing long‑horizon memory benchmarks mostly use multi‑turn dialogues or synthetic user histories, which makes retrieval performance an imperfect proxy for person understanding. We present \BenchName, a publicly releasable benchmark built from long‑form autobiographical narratives, where actions, context, and inner thoughts provide dense evidence for inferring stable motivations and decision principles. \BenchName~reconstructs each narrative into a flashback‑aware, time‑anchored stream and evaluates models with evidence‑linked questions spanning factual recall, subjective state attribution, and principle‑level reasoning. Across diverse narrative sources, retrieval‑augmented systems mainly improve factual accuracy, while errors persist on temporally grounded explanations and higher‑level inferences, highlighting the need for memory mechanisms beyond retrieval. Our data is in \hrefKnowMeBenchhttps://github.com/QuantaAlpha/KnowMeBench.
Authors:Yiqun Chen, Lingyong Yan, Zixuan Yang, Erhan Zhang, Jiashu Zhao, Shuaiqiang Wang, Dawei Yin, Jiaxin Mao
Abstract:
Agentic search has emerged as a promising paradigm for complex information seeking by enabling Large Language Models (LLMs) to interleave reasoning with tool use. However, prevailing systems rely on monolithic agents that suffer from structural bottlenecks, including unconstrained reasoning outputs that inflate trajectories, sparse outcome‑level rewards that complicate credit assignment, and stochastic search noise that destabilizes learning. To address these challenges, we propose M‑ASK (Multi‑Agent Search and Knowledge), a framework that explicitly decouples agentic search into two complementary roles: Search Behavior Agents, which plan and execute search actions, and Knowledge Management Agents, which aggregate, filter, and maintain a compact internal context. This decomposition allows each agent to focus on a well‑defined subtask and reduces interference between search and context construction. Furthermore, to enable stable coordination, M‑ASK employs turn‑level rewards to provide granular supervision for both search decisions and knowledge updates. Experiments on multi‑hop QA benchmarks demonstrate that M‑ASK outperforms strong baselines, achieving not only superior answer accuracy but also significantly more stable training dynamics.\footnoteThe source code for M‑ASK is available at https://github.com/chenyiqun/M‑ASK.
Authors:Zhe Hou
Abstract:
We present Isabellm, an LLM‑powered theorem prover for Isabelle/HOL that performs fully automatic proof synthesis. Isabellm works with any local LLM on Ollama and APIs such as Gemini CLI, and it is designed to run on consumer grade computers. The system combines a stepwise prover, which uses large language models to propose proof commands validated by Isabelle in a bounded search loop, with a higher‑level proof planner that generates structured Isar outlines and attempts to fill and repair remaining gaps. The framework includes beam search for tactics, tactics reranker ML and RL models, premise selection with small transformer models, micro‑RAG for Isar proofs built from AFP, and counter‑example guided proof repair. All the code is implemented by GPT 4.1 ‑ 5.2, Gemini 3 Pro, and Claude 4.5. Empirically, Isabellm can prove certain lemmas that defeat Isabelle's standard automation, including Sledgehammer, demonstrating the practical value of LLM‑guided proof search. At the same time, we find that even state‑of‑the‑art LLMs, such as GPT 5.2 Extended Thinking and Gemini 3 Pro struggle to reliably implement the intended fill‑and‑repair mechanisms with complex algorithmic designs, highlighting fundamental challenges in LLM code generation and reasoning. The code of Isabellm is available at https://github.com/zhehou/llm‑isabelle
Authors:Quang-Tu Pham, Hoang-Dieu Vu, Dinh-Dat Pham, Hieu H. Pham
Abstract:
This paper introduces FedKDX, a federated learning framework that addresses limitations in healthcare AI through Negative Knowledge Distillation (NKD). Unlike existing approaches that focus solely on positive knowledge transfer, FedKDX captures both target and non‑target information to improve model generalization in healthcare applications. The framework integrates multiple knowledge transfer techniques‑‑including traditional knowledge distillation, contrastive learning, and NKD‑‑within a unified architecture that maintains privacy while reducing communication costs. Through experiments on healthcare datasets (SLEEP, UCI‑HAR, and PAMAP2), FedKDX demonstrates improved accuracy (up to 2.53% over state‑of‑the‑art methods), faster convergence, and better performance on non‑IID data distributions. Theoretical analysis supports NKD's contribution to addressing statistical heterogeneity in distributed healthcare data. The approach shows promise for privacy‑sensitive medical applications under regulatory frameworks like HIPAA and GDPR, offering a balanced solution between performance and practical implementation requirements in decentralized healthcare settings. The code and model are available at https://github.com/phamdinhdat‑ai/Fed_2024.
Authors:Delong Zeng, Yuexiang Xie, Yaliang Li, Ying Shen
Abstract:
Multimodal retrieval has emerged as a promising yet challenging research direction in recent years. Most existing studies in multimodal retrieval focus on capturing information in multimodal data that is similar to their paired texts, but often ignores the complementary information contained in multimodal data. In this study, we propose CIEA, a novel multimodal retrieval approach that employs Complementary Information Extraction and Alignment, which transforms both text and images in documents into a unified latent space and features a complementary information extractor designed to identify and preserve differences in the image representations. We optimize CIEA using two complementary contrastive losses to ensure semantic integrity and effectively capture the complementary information contained in images. Extensive experiments demonstrate the effectiveness of CIEA, which achieves significant improvements over both divide‑and‑conquer models and universal dense retrieval models. We provide an ablation study, further discussions, and case studies to highlight the advancements achieved by CIEA. To promote further research in the community, we have released the source code at https://github.com/zengdlong/CIEA.
Authors:Paul Pu Liang
Abstract:
Our experience of the world is multisensory, spanning a synthesis of language, sight, sound, touch, taste, and smell. Yet, artificial intelligence has primarily advanced in digital modalities like text, vision, and audio. This paper outlines a research vision for multisensory artificial intelligence over the next decade. This new set of technologies can change how humans and AI experience and interact with one another, by connecting AI to the human senses and a rich spectrum of signals from physiological and tactile cues on the body, to physical and social signals in homes, cities, and the environment. We outline how this field must advance through three interrelated themes of sensing, science, and synergy. Firstly, research in sensing should extend how AI captures the world in richer ways beyond the digital medium. Secondly, developing a principled science for quantifying multimodal heterogeneity and interactions, developing unified modeling architectures and representations, and understanding cross‑modal transfer. Finally, we present new technical challenges to learn synergy between modalities and between humans and AI, covering multisensory integration, alignment, reasoning, generation, generalization, and experience. Accompanying this vision paper are a series of projects, resources, and demos of latest advances from the Multisensory Intelligence group at the MIT Media Lab, see https://mit‑mi.github.io/.
Authors:Yifei Gao, Jiang Wu, Xiaoyi Chen, Yifan Yang, Zhe Cui, Tianyi Ma, Jiaming Zhang, Jitao Sang
Abstract:
Exploratory GUI testing is essential for software quality but suffers from high manual costs. While Multi‑modal Large Language Model (MLLM) agents excel in navigation, they fail to autonomously discover defects due to two core challenges: Goal‑Oriented Masking, where agents prioritize task completion over reporting anomalies, and Execution‑Bias Attribution, where system defects are misidentified as agent errors. To address these, we first introduce GUITestBench, the first interactive benchmark for this task, featuring 143 tasks across 26 defects. We then propose GUITester, a multi‑agent framework that decouples navigation from verification via two modules: (i) a Planning‑Execution Module (PEM) that proactively probes for defects via embedded testing intents, and (ii) a Hierarchical Reflection Module (HRM) that resolves attribution ambiguity through interaction history analysis. GUITester achieves an F1‑score of 48.90% (Pass@3) on GUITestBench, outperforming state‑of‑the‑art baselines (33.35%). Our work demonstrates the feasibility of autonomous exploratory testing and provides a robust foundation for future GUI quality assurance~\footnoteOur code is now available in~\hrefhttps://github.com/ADaM‑BJTU/GUITestBenchhttps://github.com/ADaM‑BJTU/GUITestBench.
Authors:James Brock, Ce Zhang, Nantheera Anantrasirichai
Abstract:
Modern forest monitoring workflows increasingly benefit from the growing availability of high‑resolution satellite imagery and advances in deep learning. Two persistent challenges in this context are accurate pixel‑level change detection and meaningful semantic change captioning for complex forest dynamics. While large language models (LLMs) are being adapted for interactive data exploration, their integration with vision‑language models (VLMs) for remote sensing image change interpretation (RSICI) remains underexplored. To address this gap, we introduce an LLM‑driven agent for integrated forest change analysis that supports natural language querying across multiple RSICI tasks. The proposed system builds upon a multi‑level change interpretation (MCI) vision‑language backbone with LLM‑based orchestration. To facilitate adaptation and evaluation in forest environments, we further introduce the Forest‑Change dataset, which comprises bi‑temporal satellite imagery, pixel‑level change masks, and multi‑granularity semantic change captions generated using a combination of human annotation and rule‑based methods. Experimental results show that the proposed system achieves mIoU and BLEU‑4 scores of 67.10% and 40.17% on the Forest‑Change dataset, and 88.13% and 34.41% on LEVIR‑MCI‑Trees, a tree‑focused subset of LEVIR‑MCI benchmark for joint change detection and captioning. These results highlight the potential of interactive, LLM‑driven RSICI systems to improve accessibility, interpretability, and efficiency of forest change analysis. All data and code are publicly available at https://github.com/JamesBrockUoB/ForestChat.
Authors:Mustapha Hamdi, Mourad Jabou
Abstract:
Energy efficiency is a first‑order concern in AI deployment, as long‑running inference can exceed training in cumulative carbon impact. We propose a bio‑inspired framework that maps protein‑folding energy basins to inference cost landscapes and controls execution via a decaying, closed‑loop threshold. A request is admitted only when the expected utility‑to‑energy trade‑off is favorable (high confidence/utility at low marginal energy and congestion), biasing operation toward the first acceptable local basin rather than pursuing costly global minima. We evaluate DistilBERT and ResNet‑18 served through FastAPI with ONNX Runtime and NVIDIA Triton on an RTX 4000 Ada GPU. Our ablation study reveals that the bio‑controller reduces processing time by 42% compared to standard open‑loop execution (0.50s vs 0.29s on A100 test set), with a minimal accuracy degradation (<0.5%). Furthermore, we establish the efficiency boundaries between lightweight local serving (ORT) and managed batching (Triton). The results connect biophysical energy models to Green MLOps and offer a practical, auditable basis for closed‑loop energy‑aware inference in production.
Authors:Zihan Gao, Mohsin Y. K. Yousufi, Jacob Thebault-Spieker
Abstract:
Large language model (LLM) question‑answering systems often fail on community‑specific queries, creating "knowledge blind spots" that marginalize local voices and reinforce epistemic injustice. We present Collective Narrative Grounding, a participatory protocol that transforms community stories into structured narrative units and integrates them into AI systems under community governance. Learning from three participatory mapping workshops with N=24 community members, we designed elicitation methods and a schema that retain narrative richness while enabling entity, time, and place extraction, validation, and provenance control. To scope the problem, we audit a county‑level benchmark of 14,782 local information QA pairs, where factual gaps, cultural misunderstandings, geographic confusions, and temporal misalignments account for 76.7% of errors. On a participatory QA set derived from our workshops, a state‑of‑the‑art LLM answered fewer than 21% of questions correctly without added context, underscoring the need for local grounding. The missing facts often appear in the collected narratives, suggesting a direct path to closing the dominant error modes for narrative items. Beyond the protocol and pilot, we articulate key design tensions, such as representation and power, governance and control, and privacy and consent, providing concrete requirements for retrieval‑first, provenance‑visible, locally governed QA systems. Together, our taxonomy, protocol, and participatory evaluation offer a rigorous foundation for building community‑grounded AI that better answers local questions.
Authors:Chi Liu, Xin Chen
Abstract:
Group Relative Policy Optimization (GRPO) has emerged as a popular algorithm for reinforcement learning with large language models (LLMs). However, upon analyzing its clipping mechanism, we argue that it is suboptimal in certain scenarios. With appropriate modifications, GRPO can be significantly enhanced to improve both flexibility and generalization. To this end, we propose Adaptive‑Boundary‑Clipping GRPO (ABC‑GRPO), an asymmetric and adaptive refinement of the original GRPO framework. We demonstrate that ABC‑GRPO achieves superior performance over standard GRPO on mathematical reasoning tasks using the Qwen3 LLMs. Moreover, ABC‑GRPO maintains substantially higher entropy throughout training, thereby preserving the model's exploration capacity and mitigating premature convergence. The implementation code is available online to ease reproducibility https://github.com/chi2liu/ABC‑GRPO.
Authors:Wenlong Huang, Yu-Wei Chao, Arsalan Mousavian, Ming-Yu Liu, Dieter Fox, Kaichun Mo, Li Fei-Fei
Abstract:
Humans anticipate, from a glance and a contemplated action of their bodies, how the 3D world will respond, a capability that is equally vital for robotic manipulation. We introduce PointWorld, a large pre‑trained 3D world model that unifies state and action in a shared 3D space as 3D point flows: given one or few RGB‑D images and a sequence of low‑level robot action commands, PointWorld forecasts per‑pixel displacements in 3D that respond to the given actions. By representing actions as 3D point flows instead of embodiment‑specific action spaces (e.g., joint positions), this formulation directly conditions on physical geometries of robots while seamlessly integrating learning across embodiments. To train our 3D world model, we curate a large‑scale dataset spanning real and simulated robotic manipulation in open‑world environments, enabled by recent advances in 3D vision and simulated environments, totaling about 2M trajectories and 500 hours across a single‑arm Franka and a bimanual humanoid. Through rigorous, large‑scale empirical studies of backbones, action representations, learning objectives, partial observability, data mixtures, domain transfers, and scaling, we distill design principles for large‑scale 3D world modeling. With a real‑time (0.1s) inference speed, PointWorld can be efficiently integrated in the model‑predictive control (MPC) framework for manipulation. We demonstrate that a single pre‑trained checkpoint enables a real‑world Franka robot to perform rigid‑body pushing, deformable and articulated object manipulation, and tool use, without requiring any demonstrations or post‑training and all from a single image captured in‑the‑wild. Project website at https://point‑world.github.io/.
Authors:Weijie Shi, Yanxi Chen, Zexi Li, Xuchen Pan, Yuchang Sun, Jiajie Xu, Xiaofang Zhou, Yaliang Li
Abstract:
Reinforcement learning drives recent advances in LLM reasoning and agentic capabilities, yet current approaches struggle with both exploration and exploitation. Exploration suffers from low success rates on difficult tasks and high costs of repeated rollouts from scratch. Exploitation suffers from coarse credit assignment and training instability: Trajectory‑level rewards penalize valid prefixes for later errors, and failure‑dominated groups overwhelm the few positive signals, leaving optimization without constructive direction. To this end, we propose R^3L, Reflect‑then‑Retry Reinforcement Learning with Language‑Guided Exploration, Pivotal Credit, and Positive Amplification. To synthesize high‑quality trajectories, R^3L shifts from stochastic sampling to active synthesis via reflect‑then‑retry, leveraging language feedback to diagnose errors, transform failed attempts into successful ones, and reduce rollout costs by restarting from identified failure points. With errors diagnosed and localized, Pivotal Credit Assignment updates only the diverging suffix where contrastive signals exist, excluding the shared prefix from gradient update. Since failures dominate on difficult tasks and reflect‑then‑retry produces off‑policy data, risking training instability, Positive Amplification upweights successful trajectories to ensure positive signals guide the optimization process. Experiments on agentic and reasoning tasks demonstrate 5% to 52% relative improvements over baselines while maintaining training stability. Our code is released at https://github.com/shiweijiezero/R3L.
Authors:Wajid Arshad Abbasi, Syed Ali Abbas, Maryum Bibi, Saiqa Andleeb, Muhammad Naveed Akhtar
Abstract:
The trade‑off between predictive accuracy and data availability makes it difficult to predict protein‑‑protein binding affinity accurately. The lack of experimentally resolved protein structures limits the performance of structure‑based machine learning models, which generally outperform sequence‑based methods. In order to overcome this constraint, we suggest a regression framework based on knowledge distillation that uses protein structural data during training and only needs sequence data during inference. The suggested method uses binding affinity labels and intermediate feature representations to jointly supervise the training of a sequence‑based student network under the guidance of a structure‑informed teacher network. Leave‑One‑Complex‑Out (LOCO) cross‑validation was used to assess the framework on a non‑redundant protein‑‑protein binding affinity benchmark dataset. A maximum Pearson correlation coefficient (P_r) of 0.375 and an RMSE of 2.712 kcal/mol were obtained by sequence‑only baseline models, whereas a P_r of 0.512 and an RMSE of 2.445 kcal/mol were obtained by structure‑based models. With a P_r of 0.481 and an RMSE of 2.488 kcal/mol, the distillation‑based student model greatly enhanced sequence‑only performance. Improved agreement and decreased bias were further confirmed by thorough error analyses. With the potential to close the performance gap between sequence‑based and structure‑based models as larger datasets become available, these findings show that knowledge distillation is an efficient method for transferring structural knowledge to sequence‑based predictors. The source code for running inference with the proposed distillation‑based binding affinity predictor can be accessed at https://github.com/wajidarshad/ProteinAffinityKD.
Authors:Yifan Wei, Li Du, Xiaoyan Yu, Yang Feng, Angsheng Li
Abstract:
Large Language Models (LLMs) and agent‑based systems often struggle with compositional generalization due to a data bottleneck in which complex skill combinations follow a long‑tailed, power‑law distribution, limiting both instruction‑following performance and generalization in agent‑centric tasks. To address this challenge, we propose STEPS, a Skill Taxonomy guided Entropy‑based Post‑training data Synthesis framework for generating compositionally challenging data. STEPS explicitly targets compositional generalization by uncovering latent relationships among skills and organizing them into an interpretable, hierarchical skill taxonomy using structural information theory. Building on this taxonomy, we formulate data synthesis as a constrained information maximization problem, selecting skill combinations that maximize marginal structural information within the hierarchy while preserving semantic coherence. Experiments on challenging instruction‑following benchmarks show that STEPS outperforms existing data synthesis baselines, while also yielding improved compositional generalization in downstream agent‑based evaluations.
Authors:Jean Seo, Gibaeg Kim, Kihun Shin, Seungseop Lim, Hyunkyung Lee, Wooseok Han, Jongwon Lee, Eunho Yang
Abstract:
We introduce EPAG, a benchmark dataset and framework designed for Evaluating the Pre‑consultation Ability of LLMs using diagnostic Guidelines. LLMs are evaluated directly through HPI‑diagnostic guideline comparison and indirectly through disease diagnosis. In our experiments, we observe that small open‑source models fine‑tuned with a well‑curated, task‑specific dataset can outperform frontier LLMs in pre‑consultation. Additionally, we find that increased amount of HPI (History of Present Illness) does not necessarily lead to improved diagnostic performance. Further experiments reveal that the language of pre‑consultation influences the characteristics of the dialogue. By open‑sourcing our dataset and evaluation pipeline on https://github.com/seemdog/EPAG, we aim to contribute to the evaluation and further development of LLM applications in real‑world clinical settings.
Authors:Zhongbin Guo, Zhen Yang, Yushan Li, Xinyue Zhang, Wenyu Gao, Jiacheng Wang, Chengzhi Li, Xiangrui Liu, Ping Jian
Abstract:
Recent advancements in Spatial Intelligence (SI) have predominantly relied on Vision‑Language Models (VLMs), yet a critical question remains: does spatial understanding originate from visual encoders or the fundamental reasoning backbone? Inspired by this question, we introduce SiT‑Bench, a novel benchmark designed to evaluate the SI performance of Large Language Models (LLMs) without pixel‑level input, comprises over 3,800 expert‑annotated items across five primary categories and 17 subtasks, ranging from egocentric navigation and perspective transformation to fine‑grained robotic manipulation. By converting single/multi‑view scenes into high‑fidelity, coordinate‑aware textual descriptions, we challenge LLMs to perform symbolic textual reasoning rather than visual pattern matching. Evaluation results of state‑of‑the‑art (SOTA) LLMs reveals that while models achieve proficiency in localized semantic tasks, a significant "spatial gap" remains in global consistency. Notably, we find that explicit spatial reasoning significantly boosts performance, suggesting that LLMs possess latent world‑modeling potential. Our proposed dataset SiT‑Bench serves as a foundational resource to foster the development of spatially‑grounded LLM backbones for future VLMs and embodied agents. Our code and benchmark will be released at https://github.com/binisalegend/SiT‑Bench .
Authors:Xukai Liu, Ye Liu, Jipeng Zhang, Yanghai Zhang, Kai Zhang, Qi Liu
Abstract:
Large language models (LLMs) perform well on multi‑hop reasoning, yet how they internally compose multiple facts remains unclear. Recent work proposes \emphhop‑aligned circuit hypothesis, suggesting that bridge entities are computed sequentially across layers before later‑hop answers. Through systematic analyses on real‑world multi‑hop queries, we show that this hop‑aligned assumption does not generalize: later‑hop answer entities can become decodable earlier than bridge entities, a phenomenon we call \emphlayer‑order inversion, which strengthens with total hops. To explain this behavior, we propose a \emphprobabilistic recall‑and‑extract framework that models multi‑hop reasoning as broad probabilistic recall in shallow MLP layers followed by selective extraction in deeper attention layers. This framework is empirically validated through systematic probing analyses, reinterpreting prior layer‑wise decoding evidence, explaining chain‑of‑thought gains, and providing a mechanistic diagnosis of multi‑hop failures despite correct single‑hop knowledge. Code is available at https://github.com/laquabe/Layer‑Order‑Inversion.
Authors:Di Wu, Yanyan Zhao, Xin Lu, Mingzhe Li, Bing Qin
Abstract:
Defending against jailbreak attacks is crucial for the safe deployment of Large Language Models (LLMs). Recent research has attempted to improve safety by training models to reason over safety rules before responding. However, a key issue lies in determining what form of safety reasoning effectively defends against jailbreak attacks, which is difficult to explicitly design or directly obtain. To address this, we propose STAR‑S (Self‑TAught Reasoning based on Safety rules), a framework that integrates the learning of safety rule reasoning into a self‑taught loop. The core of STAR‑S involves eliciting reasoning and reflection guided by safety rules, then leveraging fine‑tuning to enhance safety reasoning. Repeating this process creates a synergistic cycle. Improvements in the model's reasoning and interpretation of safety rules allow it to produce better reasoning data under safety rule prompts, which is then utilized for further training. Experiments show that STAR‑S effectively defends against jailbreak attacks, outperforming baselines. Code is available at: https://github.com/pikepokenew/STAR_S.git.
Authors:Qiang Zhang, Hanchao Yu, Ivan Ji, Chen Yuan, Yi Zhang, Chihuang Liu, Xiaolong Wang, Christopher E. Lambert, Ren Chen, Chen Kovacs, Xinzhu Bei, Renqin Cai, Rui Li, Lizhu Zhang, Xiangjun Fan, Qunshu Zhang, Benyu Zhang
Abstract:
Recent years have witnessed success of sequential modeling, generative recommender, and large language model for recommendation. Though the scaling law has been validated for sequential models, it showed inefficiency in computational capacity when considering real‑world applications like recommendation, due to the non‑linear(quadratic) increasing nature of the transformer model. To improve the efficiency of the sequential model, we introduced a novel approach to sequential recommendation that leverages personalization techniques to enhance efficiency and performance. Our method compresses long user interaction histories into learnable tokens, which are then combined with recent interactions to generate recommendations. This approach significantly reduces computational costs while maintaining high recommendation accuracy. Our method could be applied to existing transformer based recommendation models, e.g., HSTU and HLLM. Extensive experiments on multiple sequential models demonstrate its versatility and effectiveness. Source code is available at \hrefhttps://github.com/facebookresearch/PerSRechttps://github.com/facebookresearch/PerSRec.
Authors:Bugra Kilictas, Faruk Alpay
Abstract:
The deployment of Large Language Models (LLMs) on edge devices is fundamentally constrained by the "Memory Wall" the bottleneck where data movement latency outstrips arithmetic throughput. Standard inference runtimes often incur significant overhead through high‑level abstractions, dynamic dispatch, and unaligned memory access patterns. In this work, we present a novel "Virtual Tensor Core" architecture implemented in software, optimized specifically for ARM64 microarchitectures (Apple Silicon). By bypassing standard library containers in favor of direct memory mapping (mmap) and implementing hand‑tuned NEON SIMD kernels, we achieve a form of "Software‑Defined Direct Memory Access (DMA)." Our proposed Tensor Virtualization Layout (TVL) guarantees 100% cache line utilization for weight matrices, while our zero‑copy loader eliminates initialization latency. Experimental results on a 110M parameter model demonstrate a stable throughput of >60 tokens/second on M2 hardware. While proprietary hardware accelerators (e.g., Apple AMX) can achieve higher peak throughput, our architecture provides a fully open, portable, and deterministic reference implementation for studying the memory bottleneck on general‑purpose ARM silicon, meeting the 200ms psycholinguistic latency threshold without opaque dependencies.
Authors:Dhruv Trehan, Paras Chopra
Abstract:
We report a case study of four end‑to‑end attempts to autonomously generate ML research papers using a pipeline of six LLM agents mapped to stages of the scientific workflow. Of these four, three attempts failed during implementation or evaluation. One completed the pipeline and was accepted to Agents4Science 2025, an experimental inaugural venue that required AI systems as first authors, passing both human and multi‑AI review. From these attempts, we document six recurring failure modes: bias toward training data defaults, implementation drift under execution pressure, memory and context degradation across long‑horizon tasks, overexcitement that declares success despite obvious failures, insufficient domain intelligence, and weak scientific taste in experimental design. We conclude by discussing four design principles for more robust AI‑scientist systems, implications for autonomous scientific discovery, and we release all prompts, artifacts, and outputs at https://github.com/Lossfunk/ai‑scientist‑artefacts‑v1
Authors:Kaibo Huang, Jin Tan, Yukun Wei, Wanling Li, Zipei Zhang, Hui Tian, Zhongliang Yang, Linna Zhou
Abstract:
LLM‑based agents are increasingly deployed to autonomously solve complex tasks, raising urgent needs for IP protection and regulatory provenance. While content watermarking effectively attributes LLM‑generated outputs, it fails to directly identify the high‑level planning behaviors (e.g., tool and subgoal choices) that govern multi‑step execution. Critically, watermarking at the planning‑behavior layer faces unique challenges: minor distributional deviations in decision‑making can compound during long‑term agent operation, degrading utility, and many agents operate as black boxes that are difficult to intervene in directly. To bridge this gap, we propose AgentMark, a behavioral watermarking framework that embeds multi‑bit identifiers into planning decisions while preserving utility. It operates by eliciting an explicit behavior distribution from the agent and applying distribution‑preserving conditional sampling, enabling deployment under black‑box APIs while remaining compatible with action‑layer content watermarking. Experiments across embodied, tool‑use, and social environments demonstrate practical multi‑bit capacity, robust recovery from partial logs, and utility preservation. The code is available at https://github.com/Tooooa/AgentMark.
Authors:Mohamed Amine Ferrag, Abderrahmane Lakas, Merouane Debbah
Abstract:
Large Language Models (LLMs) are increasingly used as high level controllers for autonomous Unmanned Aerial Vehicle (UAV) missions. However, existing evaluations rarely assess whether such agents remain safe, protocol compliant, and effective under realistic next generation networking constraints. This paper introduces α^3‑Bench, a benchmark for evaluating LLM driven UAV autonomy as a multi turn conversational reasoning and control problem operating under dynamic 6G conditions. Each mission is formulated as a language mediated control loop between an LLM based UAV agent and a human operator, where decisions must satisfy strict schema validity, mission policies, speaker alternation, and safety constraints while adapting to fluctuating network slices, latency, jitter, packet loss, throughput, and edge load variations. To reflect modern agentic workflows, α^3‑Bench integrates a dual action layer supporting both tool calls and agent to agent coordination, enabling evaluation of tool use consistency and multi agent interactions. We construct a large scale corpus of 113k conversational UAV episodes grounded in UAVBench scenarios and evaluate 17 state of the art LLMs using a fixed subset of 50 episodes per scenario under deterministic decoding. We propose a composite α^3 metric that unifies six pillars: Task Outcome, Safety Policy, Tool Consistency, Interaction Quality, Network Robustness, and Communication Cost, with efficiency normalized scores per second and per thousand tokens. Results show that while several models achieve high mission success and safety compliance, robustness and efficiency vary significantly under degraded 6G conditions, highlighting the need for network aware and resource efficient LLM based UAV agents. The dataset is publicly available on GitHub : https://github.com/maferrag/AlphaBench
Authors:Chenglin Yu, Yuchen Wang, Songmiao Wang, Hongxia Yang, Ming Li
Abstract:
LLM agents can reason and use tools, but they often break down on long‑horizon tasks due to unbounded context growth and accumulated errors. Common remedies such as context compression or retrieval‑augmented prompting introduce trade‑offs between information fidelity and reasoning stability. We present InfiAgent, a general‑purpose framework that keeps the agent's reasoning context strictly bounded regardless of task duration by externalizing persistent state into a file‑centric state abstraction. At each step, the agent reconstructs context from a workspace state snapshot plus a fixed window of recent actions. Experiments on DeepResearch and an 80‑paper literature review task show that, without task‑specific fine‑tuning, InfiAgent with a 20B open‑source model is competitive with larger proprietary systems and maintains substantially higher long‑horizon coverage than context‑centric baselines. These results support explicit state externalization as a practical foundation for stable long‑horizon agents. Github Repo:https://github.com/ChenglinPoly/infiAgent
Authors:Anees Ur Rehman Hashmi, Numan Saeed, Christoph Lippert
Abstract:
Multimodal medical large language models have shown impressive progress in chest X‑ray interpretation but continue to face challenges in spatial reasoning and anatomical understanding. Although existing grounding techniques improve overall performance, they often fail to establish a true anatomical correspondence, resulting in incorrect anatomical understanding in the medical domain. To address this gap, we introduce AnatomiX, a multitask multimodal large language model explicitly designed for anatomically grounded chest X‑ray interpretation. Inspired by the radiological workflow, AnatomiX adopts a two stage approach: first, it identifies anatomical structures and extracts their features, and then leverages a large language model to perform diverse downstream tasks such as phrase grounding, report generation, visual question answering, and image understanding. Extensive experiments across multiple benchmarks demonstrate that AnatomiX achieves superior anatomical reasoning and delivers over 25% improvement in performance on anatomy grounding, phrase grounding, grounded diagnosis and grounded captioning tasks compared to existing approaches. Code and pretrained model are available at https://github.com/aneesurhashmi/anatomix
Authors:Andrew Shin
Abstract:
Despite rapid advances in large language models (LLMs), achieving reliable performance on highly professional and structured examinations remains a significant challenge. The Japanese bar examination is a particularly demanding benchmark, requiring not only advanced legal reasoning but also strict adherence to complex answer formats that involve joint evaluation of multiple propositions. While recent studies have reported improvements by decomposing such questions into simpler true‑‑false judgments, these approaches have not been systematically evaluated under the original exam format and scoring scheme, leaving open the question of whether they truly capture exam‑level competence. In this paper, we present a self‑verification model trained on a newly constructed dataset that faithfully replicates the authentic format and evaluation scale of the exam. Our model is able to exceed the official passing score when evaluated on the actual exam scale, marking the first demonstration, to our knowledge, of an LLM passing the Japanese bar examination without altering its original question structure or scoring rules. We further conduct extensive comparisons with alternative strategies, including multi‑agent inference and decomposition‑based supervision, and find that these methods fail to achieve comparable performance. Our results highlight the importance of format‑faithful supervision and consistency verification, and suggest that carefully designed single‑model approaches can outperform more complex systems in high‑stakes professional reasoning tasks. Our dataset and codes are publicly available.
Authors:Joseph Kampeas, Emir Haleva
Abstract:
Modern large language models (LLMs) drive interactive AI systems but are bottlenecked by the memory‑heavy growth of key‑value (KV) caches, which limits real‑time throughput under concurrent loads. Existing KV‑cache compression methods rely on rigid heuristics, disrupt tensor layouts, or require specialized compute, hindering scalability and deployment. We propose joint encoding of KV‑cache blocks, which fuses similar blocks across requests and input chunks into shared representations while preserving standard cache structure. This alleviates the KV‑cache memory bottleneck, supporting high‑concurrency serving without specialized hardware. Theoretically, we analyze the rate‑distortion tradeoff of fused cache blocks under a Poisson process model. Empirically, our method achieves up to 4.38 × KV‑cache compression with negligible accuracy loss across diverse LLMs and benchmarks, outperforming recent structured and adaptive compression baselines. In real LLM serving, joint encoding improves the token throughput by ~40% on a single‑machine vLLM benchmark, demonstrating substantial gains in inference throughput. Code is available at https://github.com/sef1/kv_fast_fusion kv_joint_encoding.
Authors:Youngjoon Jeong, Junha Chun, Taesup Kim
Abstract:
Vision‑based robotic policies often struggle with even minor viewpoint changes, underscoring the need for view‑invariant visual representations. This challenge becomes more pronounced in real‑world settings, where viewpoint variability is unavoidable and can significantly disrupt policy performance. Existing methods typically learn invariance from multi‑view observations at the scene level, but such approaches rely on visual appearance and fail to incorporate the physical dynamics essential for robust generalization. We propose View‑Invariant Latent Action (VILA), which models a latent action capturing transition patterns across trajectories to learn view‑invariant representations grounded in physical dynamics. VILA aligns these latent actions across viewpoints using an action‑guided objective based on ground‑truth action sequences. Experiments in both simulation and the real world show that VILA‑based policies generalize effectively to unseen viewpoints and transfer well to new tasks, establishing VILA as a strong pretraining framework that improves robustness and downstream learning performance.
Authors:Sara Micol Ferraina, Michele Brienza, Francesco Argenziano, Emanuele Musumeci, Vincenzo Suriani, Domenico D. Bloisi, Daniele Nardi
Abstract:
Tracking objects that move within dynamic environments is a core challenge in robotics. Recent research has advanced this topic significantly; however, many existing approaches remain inefficient due to their reliance on heavy foundation models. To address this limitation, we propose LOST‑3DSG, a lightweight open‑vocabulary 3D scene graph designed to track dynamic objects in real‑world environments. Our method adopts a semantic approach to entity tracking based on word2vec and sentence embeddings, enabling an open‑vocabulary representation while avoiding the necessity of storing dense CLIP visual features. As a result, LOST‑3DSG achieves superior performance compared to approaches that rely on high‑dimensional visual embeddings. We evaluate our method through qualitative and quantitative experiments conducted in a real 3D environment using a TIAGo robot. The results demonstrate the effectiveness and efficiency of LOST‑3DSG in dynamic object tracking. Code and supplementary material are publicly available on the project website at https://lab‑rococo‑sapienza.github.io/lost‑3dsg/.
Authors:Xinglang Zhang, Yunyao Zhang, ZeLiang Chen, Junqing Yu, Wei Yang, Zikai Song
Abstract:
Symbolic logical reasoning is a critical yet underexplored capability of large language models (LLMs), providing reliable and verifiable decision‑making in high‑stakes domains such as mathematical reasoning and legal judgment. In this study, we present a systematic analysis of logical reasoning under controlled increases in logical complexity, and reveal a previously unrecognized phenomenon, which we term Logical Phase Transitions: rather than degrading smoothly, logical reasoning performance remains stable within a regime but collapses abruptly beyond a critical logical depth, mirroring physical phase transitions such as water freezing beyond a critical temperature threshold. Building on this insight, we propose Neuro‑Symbolic Curriculum Tuning, a principled framework that adaptively aligns natural language with logical symbols to establish a shared representation, and reshapes training dynamics around phase‑transition boundaries to progressively strengthen reasoning at increasing logical depths. Experiments on five benchmarks show that our approach effectively mitigates logical reasoning collapse at high complexity, yielding average accuracy gains of +1.26 in naive prompting and +3.95 in CoT, while improving generalization to unseen logical compositions. Code and data are available at https://github.com/AI4SS/Logical‑Phase‑Transitions.
Authors:Ao Li, Jinghui Zhang, Luyu Li, Yuxiang Duan, Lang Gao, Mingcai Chen, Weijun Qin, Shaopeng Li, Fengxian Ji, Ning Liu, Lizhen Cui, Xiuying Chen, Yuntao Du
Abstract:
As an agent‑level reasoning and coordination paradigm, Multi‑Agent Debate (MAD) orchestrates multiple agents through structured debate to improve answer quality and support complex reasoning. However, existing research on MAD suffers from two fundamental limitations: evaluations are conducted under fragmented and inconsistent settings, hindering fair comparison, and are largely restricted to single‑modality scenarios that rely on textual inputs only. To address these gaps, we introduce M3MAD‑Bench, a unified and extensible benchmark for evaluating MAD methods across Multi‑domain tasks, Multi‑modal inputs, and Multi‑dimensional metrics. M3MAD‑Bench establishes standardized protocols over five core task domains: Knowledge, Mathematics, Medicine, Natural Sciences, and Complex Reasoning, and systematically covers both pure text and vision‑language datasets, enabling controlled cross‑modality comparison. We evaluate MAD methods on nine base models spanning different architectures, scales, and modality capabilities. Beyond accuracy, M3MAD‑Bench incorporates efficiency‑oriented metrics such as token consumption and inference time, providing a holistic view of performance‑‑cost trade‑offs. Extensive experiments yield systematic insights into the effectiveness, robustness, and efficiency of MAD across text‑only and multimodal scenarios. We believe M3MAD‑Bench offers a reliable foundation for future research on standardized MAD evaluation. The code is available at http://github.com/liaolea/M3MAD‑Bench.
Authors:Zhisheng Zhang, Xiang Li, Yixuan Zhou, Jing Peng, Shengbo Cai, Guoyang Zeng, Zhiyong Wu
Abstract:
Neural Audio Codecs (NACs) can reduce transmission overhead by performing compact compression and reconstruction, which also aim to bridge the gap between continuous and discrete signals. Existing NACs can be divided into two categories: multi‑codebook and single‑codebook codecs. Multi‑codebook codecs face challenges such as structural complexity and difficulty in adapting to downstream tasks, while single‑codebook codecs, though structurally simpler, suffer from low‑fidelity, ineffective modeling of unified audio, and an inability to support modeling of high‑frequency audio. We propose the UniSRCodec, a single‑codebook codec capable of supporting high sampling rate, low‑bandwidth, high fidelity, and unified. We analyze the inefficiency of waveform‑based compression and introduce the time and frequency compression method using the Mel‑spectrogram, and cooperate with a Vocoder to recover the phase information of the original audio. Moreover, we propose a sub‑band reconstruction technique to achieve high‑quality compression across both low and high frequency bands. Subjective and objective experimental results demonstrate that UniSRCodec achieves state‑of‑the‑art (SOTA) performance among cross‑domain single‑codebook codecs with only a token rate of 40, and its reconstruction quality is comparable to that of certain multi‑codebook methods. Our demo page is available at https://wxzyd123.github.io/unisrcodec.
Authors:Yuetian Chen, Yuntao Du, Kaiyuan Zhang, Ashish Kundu, Charles Fleming, Bruno Ribeiro, Ninghui Li
Abstract:
Most membership inference attacks (MIAs) against Large Language Models (LLMs) rely on global signals, like average loss, to identify training data. This approach, however, dilutes the subtle, localized signals of memorization, reducing attack effectiveness. We challenge this global‑averaging paradigm, positing that membership signals are more pronounced within localized contexts. We introduce WBC (Window‑Based Comparison), which exploits this insight through a sliding window approach with sign‑based aggregation. Our method slides windows of varying sizes across text sequences, with each window casting a binary vote on membership based on loss comparisons between target and reference models. By ensembling votes across geometrically spaced window sizes, we capture memorization patterns from token‑level artifacts to phrase‑level structures. Extensive experiments across eleven datasets demonstrate that WBC substantially outperforms established baselines, achieving higher AUC scores and 2‑3 times improvements in detection rates at low false positive thresholds. Our findings reveal that aggregating localized evidence is fundamentally more effective than global averaging, exposing critical privacy vulnerabilities in fine‑tuned LLMs.
Authors:Aniruddha Mahapatra, Long Mai, Cusuh Ham, Feng Liu
Abstract:
Cinemagraphs, which combine static photographs with selective, looping motion, offer unique artistic appeal. Generating them from a single photograph in a controllable manner is particularly challenging. Existing image‑animation techniques are restricted to simple, low‑frequency motions and operate only in narrow domains with repetitive textures like water and smoke. In contrast, large‑scale video diffusion models are not tailored for cinemagraph constraints and lack the specialized data required to generate seamless, controlled loops. We present DreamLoop, a controllable video synthesis framework dedicated to generating cinemagraphs from a single photo without requiring any cinemagraph training data. Our key idea is to adapt a general video diffusion model by training it on two objectives: temporal bridging and motion conditioning. This strategy enables flexible cinemagraph generation. During inference, by using the input image as both the first‑ and last‑ frame condition, we enforce a seamless loop. By conditioning on static tracks, we maintain a static background. Finally, by providing a user‑specified motion path for a target object, our method provides intuitive control over the animation's trajectory and timing. To our knowledge, DreamLoop is the first method to enable cinemagraph generation for general scenes with flexible and intuitive controls. We demonstrate that our method produces high‑quality, complex cinemagraphs that align with user intent, outperforming existing approaches.
Authors:Arjun S. Nair
Abstract:
Large language model fine‑tuning is bottlenecked by memory: a 7B parameter model requires 84GB‑‑14GB for weights, 14GB for gradients, and 56GB for FP32 optimizer states‑‑exceeding even A100‑40GB capacity. We present Chronicals, an open‑source training framework achieving 3.51x speedup over Unsloth through four synergistic optimizations: (1) fused Triton kernels eliminating 75% of memory traffic via RMSNorm (7x), SwiGLU (5x), and QK‑RoPE (2.3x) fusion; (2) Cut Cross‑Entropy reducing logit memory from 5GB to 135MB through online softmax computation; (3) LoRA+ with theoretically‑derived 16x differential learning rates between adapter matrices; and (4) Best‑Fit Decreasing sequence packing recovering 60‑75% of compute wasted on padding. On Qwen2.5‑0.5B with A100‑40GB, Chronicals achieves 41,184 tokens/second for full fine‑tuning versus Unsloth's 11,736 tokens/second (3.51x). For LoRA at rank 32, we reach 11,699 tokens/second versus Unsloth MAX's 2,857 tokens/second (4.10x). Critically, we discovered that Unsloth's reported 46,000 tokens/second benchmark exhibited zero gradient norms‑‑the model was not training. We provide complete mathematical foundations: online softmax correctness proofs, FlashAttention IO complexity bounds O(N^2 d^2 M^‑1), LoRA+ learning rate derivations from gradient magnitude analysis, and bin‑packing approximation guarantees. All implementations, benchmarks, and proofs are available at https://github.com/Ajwebdevs/Chronicals with pip installation via https://pypi.org/project/chronicals/.
Authors:Jyothi Rikhab Chand, Mathews Jacob
Abstract:
Solving inverse problems in imaging requires models that support efficient inference, uncertainty quantification, and principled probabilistic reasoning. Energy‑Based Models (EBMs), with their interpretable energy landscapes and compositional structure, are well‑suited for this task but have historically suffered from high computational costs and training instability. To overcome the historical shortcomings of EBMs, we introduce a fast distillation strategy to transfer the strengths of pre‑trained diffusion models into multi‑scale EBMs. These distilled EBMs enable efficient sampling and preserve the interpretability and compositionality inherent to potential‑based frameworks. Leveraging EBM compositionality, we propose Annealed Langevin Posterior Sampling (ALPS) algorithm for Maximum‑A‑Posteriori (MAP), Minimum Mean Square Error (MMSE), and uncertainty estimates for inverse problems in imaging. Unlike diffusion models that use complex guidance strategies for latent variables, we perform annealing on static posterior distributions that are well‑defined and composable. Experiments on image inpainting and MRI reconstruction demonstrate that our method matches or surpasses diffusion‑based baselines in both accuracy and efficiency, while also supporting MAP recovery. Overall, our framework offers a scalable and principled solution for inverse problems in imaging, with potential for practical deployment in scientific and clinical settings. ALPS code is available at the GitHub repository \hrefhttps://github.com/JyoChand/ALPSALPS.
Authors:Jiaqi Liu, Yaofeng Su, Peng Xia, Siwei Han, Zeyu Zheng, Cihang Xie, Mingyu Ding, Huaxiu Yao
Abstract:
To support long‑term interaction in complex environments, LLM agents require memory systems that manage historical experiences. Existing approaches either retain full interaction histories via passive context extension, leading to substantial redundancy, or rely on iterative reasoning to filter noise, incurring high token costs. To address this challenge, we introduce SimpleMem, an efficient memory framework based on semantic lossless compression. We propose a three‑stage pipeline designed to maximize information density and token utilization: (1) Semantic Structured Compression, which distills unstructured interactions into compact, multi‑view indexed memory units; (2) Online Semantic Synthesis, an intra‑session process that instantly integrates related context into unified abstract representations to eliminate redundancy; and (3) Intent‑Aware Retrieval Planning, which infers search intent to dynamically determine retrieval scope and construct precise context efficiently. Experiments on benchmark datasets show that our method consistently outperforms baseline approaches in accuracy, retrieval efficiency, and inference cost, achieving an average F1 improvement of 26.4% in LoCoMo while reducing inference‑time token consumption by up to 30‑fold, demonstrating a superior balance between performance and efficiency. Code is available at https://github.com/aiming‑lab/SimpleMem.
Authors:Hyeong Kyu Choi, Sharon Li
Abstract:
Selecting a single high‑quality output from multiple stochastic generations remains a fundamental challenge for large language models (LLMs), particularly in open‑ended tasks where no canonical answer exists. While Best‑of‑N and self‑consistency methods show that aggregating multiple generations can improve performance, existing approaches typically rely on external evaluators, reward models, or exact string‑match voting, limiting their applicability and efficiency. We propose Mode Extraction (ModeX), an evaluator‑free Best‑of‑N selection framework that generalizes majority voting to open‑ended text generation by identifying the modal output representing the dominant semantic consensus among generated texts. ModeX constructs a similarity graph over candidate generations and recursively applies spectral clustering to select a representative centroid, without requiring additional inference or auxiliary models. We further instantiate this selection principle as ModeX‑Lite, an improved version of ModeX with early pruning for efficiency. Across open‑ended tasks ‑‑ including text summarization, code generation, and mathematical reasoning ‑‑ our approaches consistently outperform standard single‑ and multi‑path baselines, providing a computationally efficient solution for robust open‑ended text generation. Code is released in https://github.com/deeplearning‑wisc/ModeX.
Authors:Subhankar Mishra
Abstract:
Graph Neural Networks (GNNs) suffer from over‑smoothing in deep architectures and expressiveness bounded by the 1‑Weisfeiler‑Leman (1‑WL) test. We adapt Manifold‑Constrained Hyper‑Connections (\mhc)~\citepxie2025mhc, recently proposed for Transformers, to graph neural networks. Our method, mHC‑GNN, expands node representations across n parallel streams and constrains stream‑mixing matrices to the Birkhoff polytope via Sinkhorn‑Knopp normalization. We prove that mHC‑GNN exhibits exponentially slower over‑smoothing (rate (1‑γ)^L/n vs.\ (1‑γ)^L) and can distinguish graphs beyond 1‑WL. Experiments on 10 datasets with 4 GNN architectures show consistent improvements. Depth experiments from 2 to 128 layers reveal that standard GNNs collapse to near‑random performance beyond 16 layers, while mHC‑GNN maintains over 74% accuracy even at 128 layers, with improvements exceeding 50 percentage points at extreme depths. Ablations confirm that the manifold constraint is essential: removing it causes up to 82% performance degradation. Code is available at \hrefhttps://github.com/smlab‑niser/mhc‑gnnhttps://github.com/smlab‑niser/mhc‑gnn
Authors:Wenting Lu, Didi Zhu, Tao Shen, Donglin Zhu, Ayong Ye, Chao Wu
Abstract:
Multi‑modal reasoning requires the seamless integration of visual and linguistic cues, yet existing Chain‑of‑Thought methods suffer from two critical limitations in cross‑modal scenarios: (1) over‑reliance on single coarse‑grained image regions, and (2) semantic fragmentation between successive reasoning steps. To address these issues, we propose the CoCoT (Collaborative Coross‑modal Thought) framework, built upon two key innovations: a) Dynamic Multi‑Region Grounding to adaptively detect the most relevant image regions based on the question, and b) Relation‑Aware Reasoning to enable multi‑region collaboration by iteratively aligning visual cues to form a coherent and logical chain of thought. Through this approach, we construct the CoCoT‑70K dataset, comprising 74,691 high‑quality samples with multi‑region annotations and structured reasoning chains. Extensive experiments demonstrate that CoCoT significantly enhances complex visual reasoning, achieving an average accuracy improvement of 15.4% on LLaVA‑1.5 and 4.0% on Qwen2‑VL across six challenging benchmarks. The data and code are available at: https://github.com/deer‑echo/CoCoT.
Authors:Inpyo Song, Eunji Jeon, Jangwon Lee
Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains, including software development, education, and technical assistance. Among these, software development is one of the key areas where LLMs are increasingly adopted. However, when hardware constraints are considered‑for instance, in physical computing, where software must interact with and control physical hardware their effectiveness has not been fully explored. To address this gap, we introduce \textscPCEval (Physical Computing Evaluation), the first benchmark in physical computing that enables a fully automatic evaluation of the capabilities of LLM in both the logical and physical aspects of the projects, without requiring human assessment. Our evaluation framework assesses LLMs in generating circuits and producing compatible code across varying levels of project complexity. Through comprehensive testing of 13 leading models, \textscPCEval provides the first reproducible and automatically validated empirical assessment of LLMs' ability to reason about fundamental hardware implementation constraints within a simulation environment. Our findings reveal that while LLMs perform well in code generation and logical circuit design, they struggle significantly with physical breadboard layout creation, particularly in managing proper pin connections and avoiding circuit errors. \textscPCEval advances our understanding of AI assistance in hardware‑dependent computing environments and establishes a foundation for developing more effective tools to support physical computing education.
Authors:Salim Khazem
Abstract:
Foundation segmentation models such as the Segment Anything Model (SAM) exhibit strong zero‑shot generalization through large‑scale pretraining, but adapting them to domain‑specific semantic segmentation remains challenging, particularly for thin structures (e.g., retinal vessels) and noisy modalities (e.g., SAR imagery). Full fine‑tuning is computationally expensive and risks catastrophic forgetting. We propose TopoLoRA‑SAM, a topology‑aware and parameter‑efficient adaptation framework for binary semantic segmentation. TopoLoRA‑SAM injects Low‑Rank Adaptation (LoRA) into the frozen ViT encoder, augmented with a lightweight spatial convolutional adapter and optional topology‑aware supervision via differentiable clDice. We evaluate our approach on five benchmarks spanning retinal vessel segmentation (DRIVE, STARE, CHASE\_DB1), polyp segmentation (Kvasir‑SEG), and SAR sea/land segmentation (SL‑SSDD), comparing against U‑Net, DeepLabV3+, SegFormer, and Mask2Former. TopoLoRA‑SAM achieves the best retina‑average Dice and the best overall average Dice across datasets, while training only 5.2% of model parameters (~4.9M). On the challenging CHASE\_DB1 dataset, our method substantially improves segmentation accuracy and robustness, demonstrating that topology‑aware parameter‑efficient adaptation can match or exceed fully fine‑tuned specialist models. Code is available at : https://github.com/salimkhazem/Seglab.git
Authors:Dachun Kai, Zeyu Xiao, Huyue Zhu, Jiaxiao Wang, Yueyi Zhang, Xiaoyan Sun
Abstract:
This paper addresses low‑light video super‑resolution (LVSR), aiming to restore high‑resolution videos from low‑light, low‑resolution (LR) inputs. Existing LVSR methods often struggle to recover fine details due to limited contrast and insufficient high‑frequency information. To overcome these challenges, we present RetinexEVSR, the first event‑driven LVSR framework that leverages high‑contrast event signals and Retinex‑inspired priors to enhance video quality under low‑light scenarios. Unlike previous approaches that directly fuse degraded signals, RetinexEVSR introduces a novel bidirectional cross‑modal fusion strategy to extract and integrate meaningful cues from noisy event data and degraded RGB frames. Specifically, an illumination‑guided event enhancement module is designed to progressively refine event features using illumination maps derived from the Retinex model, thereby suppressing low‑light artifacts while preserving high‑contrast details. Furthermore, we propose an event‑guided reflectance enhancement module that utilizes the enhanced event features to dynamically recover reflectance details via a multi‑scale fusion mechanism. Experimental results show that our RetinexEVSR achieves state‑of‑the‑art performance on three datasets. Notably, on the SDSD benchmark, our method can get up to 2.95 dB gain while reducing runtime by 65% compared to prior event‑based methods. Code: https://github.com/DachunKai/RetinexEVSR.
Authors:Huichao Zhang, Liao Qu, Yiheng Liu, Hang Chen, Yangyang Song, Yongsheng Dong, Shikun Sun, Xian Li, Xu Wang, Yi Jiang, Hu Ye, Bo Chen, Yiming Gao, Peng Liu, Akide Liu, Zhipeng Yang, Qili Deng, Linjie Xing, Jiyang Liu, Zhao Wang, Yang Zhou, Mingcong Liu, Yi Zhang, Qian He, Xiwei Hu, Zhongqi Qi, Jie Shao, Zhiye Fu, Shuai Wang, Fangmin Chen, Xuezhi Chai, Zhihua Wu, Yitong Wang, Zehuan Yuan, Daniel K. Du, Xinglong Wu
Abstract:
We present NextFlow, a unified decoder‑only autoregressive transformer trained on 6 trillion interleaved text‑image discrete tokens. By leveraging a unified vision representation within a unified autoregressive architecture, NextFlow natively activates multimodal understanding and generation capabilities, unlocking abilities of image editing, interleaved content and video generation. Motivated by the distinct nature of modalities ‑ where text is strictly sequential and images are inherently hierarchical ‑ we retain next‑token prediction for text but adopt next‑scale prediction for visual generation. This departs from traditional raster‑scan methods, enabling the generation of 1024x1024 images in just 5 seconds ‑ orders of magnitude faster than comparable AR models. We address the instabilities of multi‑scale generation through a robust training recipe. Furthermore, we introduce a prefix‑tuning strategy for reinforcement learning. Experiments demonstrate that NextFlow achieves state‑of‑the‑art performance among unified models and rivals specialized diffusion baselines in visual quality.
Authors:Chuanrui Hu, Xingze Gao, Zuyi Zhou, Dannong Xu, Yi Bai, Xintong Li, Hui Zhang, Tong Li, Chong Zhang, Lidong Bing, Yafeng Deng
Abstract:
Large Language Models (LLMs) are increasingly deployed as long‑term interactive agents, yet their limited context windows make it difficult to sustain coherent behavior over extended interactions. Existing memory systems often store isolated records and retrieve fragments, limiting their ability to consolidate evolving user states and resolve conflicts. We introduce EverMemOS, a self‑organizing memory operating system that implements an engram‑inspired lifecycle for computational memory. Episodic Trace Formation converts dialogue streams into MemCells that capture episodic traces, atomic facts, and time‑bounded Foresight signals. Semantic Consolidation organizes MemCells into thematic MemScenes, distilling stable semantic structures and updating user profiles. Reconstructive Recollection performs MemScene‑guided agentic retrieval to compose the necessary and sufficient context for downstream reasoning. Experiments on LoCoMo and LongMemEval show that EverMemOS achieves state‑of‑the‑art performance on memory‑augmented reasoning tasks. We further report a profile study on PersonaMem v2 and qualitative case studies illustrating chat‑oriented capabilities such as user profiling and Foresight. Code is available at https://github.com/EverMind‑AI/EverMemOS.
Authors:Almaz Ermilov
Abstract:
This paper presents FormationEval, an open multiple‑choice question benchmark for evaluating language models on petroleum geoscience and subsurface disciplines. The dataset contains 505 questions across seven domains including petrophysics, petroleum geology and reservoir engineering, derived from three authoritative sources using a reasoning model with detailed instructions and a concept‑based approach that avoids verbatim copying of copyrighted text. Each question includes source metadata to support traceability and audit. The evaluation covers 72 models from major providers including OpenAI, Anthropic, Google, Meta and open‑weight alternatives. The top performers achieve over 97% accuracy, with Gemini 3 Pro Preview reaching 99.8%, while tier and domain gaps persist. Among open‑weight models, GLM‑4.7 leads at 98.6%, with several DeepSeek, Llama, Qwen and Mistral models also exceeding 93%. The performance gap between open‑weight and closed models is narrower than expected, with several lower‑cost open‑weight models exceeding 90% accuracy. Petrophysics emerges as the most challenging domain across all models, while smaller models show wider performance variance. Residual length bias in the dataset (correct answers tend to be longer) is documented along with bias mitigation strategies applied during construction. The benchmark, evaluation code and results are publicly available.
Authors:Matthias Bartolo, Dylan Seychell, Gabriel Hili, Matthew Montebello, Carl James Debono, Saviour Formosa, Konstantinos Makantasis
Abstract:
This paper investigates the integration of the Learning Using Privileged Information (LUPI) paradigm in object detection to exploit fine‑grained, descriptive information available during training but not at inference. We introduce a general, model‑agnostic methodology for injecting privileged information‑such as bounding box masks, saliency maps, and depth cues‑into deep learning‑based object detectors through a teacher‑student architecture. Experiments are conducted across five state‑of‑the‑art object detection models and multiple public benchmarks, including UAV‑based litter detection datasets and Pascal VOC 2012, to assess the impact on accuracy, generalization, and computational efficiency. Our results demonstrate that LUPI‑trained students consistently outperform their baseline counterparts, achieving significant boosts in detection accuracy with no increase in inference complexity or model size. Performance improvements are especially marked for medium and large objects, while ablation studies reveal that intermediate weighting of teacher guidance optimally balances learning from privileged and standard inputs. The findings affirm that the LUPI framework provides an effective and practical strategy for advancing object detection systems in both resource‑constrained and real‑world settings.
Authors:Omar Momen, Emilie Sitter, Berenike Herrmann, Sina Zarrieß
Abstract:
Novel metaphor comprehension involves complex semantic processes and linguistic creativity, making it an interesting task for studying language models (LMs). This study investigates whether surprisal, a probabilistic measure of predictability in LMs, correlates with annotations of metaphor novelty in different datasets. We analyse the surprisal of metaphoric words in corpus‑based and synthetic metaphor datasets using 16 causal LM variants. We propose a cloze‑style surprisal method that conditions on full‑sentence context. Results show that LM surprisal yields significant moderate correlations with scores/labels of metaphor novelty. We further identify divergent scaling patterns: on corpus‑based data, correlation strength decreases with model size (inverse scaling effect), whereas on synthetic data it increases (quality‑power hypothesis). We conclude that while surprisal can partially account for annotations of metaphor novelty, it remains limited as a metric of linguistic creativity. Code and data are publicly available: https://github.com/OmarMomen14/surprisal‑metaphor‑novelty
Authors:Jingjing Wang, Qianglin Liu, Zhuo Xiao, Xinning Yao, Bo Liu, Lu Li, Lijuan Niu, Fugen Zhou
Abstract:
Thyroid cancer is the most common endocrine malignancy, and its incidence is rising globally. While ultrasound is the preferred imaging modality for detecting thyroid nodules, its diagnostic accuracy is often limited by challenges such as low image contrast and blurred nodule boundaries. To address these issues, we propose Nodule‑DETR, a novel detection transformer (DETR) architecture designed for robust thyroid nodule detection in ultrasound images. Nodule‑DETR introduces three key innovations: a Multi‑Spectral Frequency‑domain Channel Attention (MSFCA) module that leverages frequency analysis to enhance features of low‑contrast nodules; a Hierarchical Feature Fusion (HFF) module for efficient multi‑scale integration; and Multi‑Scale Deformable Attention (MSDA) to flexibly capture small and irregularly shaped nodules. We conducted extensive experiments on a clinical dataset of real‑world thyroid ultrasound images. The results demonstrate that Nodule‑DETR achieves state‑of‑the‑art performance, outperforming the baseline model by a significant margin of 0.149 in mAP@0.5:0.95. The superior accuracy of Nodule‑DETR highlights its significant potential for clinical application as an effective tool in computer‑aided thyroid diagnosis. The code of work is available at https://github.com/wjj1wjj/Nodule‑DETR.
Authors:Lakshay Sharma, Alex Marin
Abstract:
Self‑supervised learning (SSL) methods have become a dominant paradigm for creating general purpose models whose capabilities can be transferred to downstream supervised learning tasks. However, most such methods rely on vast amounts of pretraining data. This work introduces Subimage Overlap Prediction, a novel self‑supervised pretraining task to aid semantic segmentation in remote sensing imagery that uses significantly lesser pretraining imagery. Given an image, a sub‑image is extracted and the model is trained to produce a semantic mask of the location of the extracted sub‑image within the original image. We demonstrate that pretraining with this task results in significantly faster convergence, and equal or better performance (measured via mIoU) on downstream segmentation. This gap in convergence and performance widens when labeled training data is reduced. We show this across multiple architecture types, and with multiple downstream datasets. We also show that our method matches or exceeds performance while requiring significantly lesser pretraining data relative to other SSL methods. Code and model weights are provided at \hrefhttps://github.com/sharmalakshay93/subimage‑overlap‑predictiongithub.com/sharmalakshay93/subimage‑overlap‑prediction.
Authors:Hyunsoo Kim, Jaewan Moon, Seongmin Park, Jongwuk Lee
Abstract:
Modern recommender systems trained on domain‑specific data often struggle to generalize across multiple domains. Cross‑domain sequential recommendation has emerged as a promising research direction to address this challenge; however, existing approaches face fundamental limitations, such as reliance on overlapping users or items across domains, or unrealistic assumptions that ignore privacy constraints. In this work, we propose a new framework, MergeRec, based on model merging under a new and realistic problem setting termed data‑isolated cross‑domain sequential recommendation, where raw user interaction data cannot be shared across domains. MergeRec consists of three key components: (1) merging initialization, (2) pseudo‑user data construction, and (3) collaborative merging optimization. First, we initialize a merged model using training‑free merging techniques. Next, we construct pseudo‑user data by treating each item as a virtual sequence in each domain, enabling the synthesis of meaningful training samples without relying on real user interactions. Finally, we optimize domain‑specific merging weights through a joint objective that combines a recommendation loss, which encourages the merged model to identify relevant items, and a distillation loss, which transfers collaborative filtering signals from the fine‑tuned source models. Extensive experiments demonstrate that MergeRec not only preserves the strengths of the original models but also significantly enhances generalizability to unseen domains. Compared to conventional model merging methods, MergeRec consistently achieves superior performance, with average improvements of up to 17.21% in Recall@10, highlighting the potential of model merging as a scalable and effective approach for building universal recommender systems. The source code is available at https://github.com/DIALLab‑SKKU/MergeRec.
Authors:YuanLab. ai, :, Shawn Wu, Sean Wang, Louie Li, Darcy Chen, Allen Wang, Jiangang Luo, Xudong Zhao, Joseph Shen, Gawain Ma, Jasper Jia, Marcus Mao, Claire Wang, Hunter He, Carol Wang, Zera Zhang, Jason Wang, Chonly Shen, Leo Zhang, Logan Chen, Qasim Meng, James Gong, Danied Zhao, Penn Zheng, Owen Zhu, Tong Yu
Abstract:
We introduce Yuan3.0 Flash, an open‑source Mixture‑of‑Experts (MoE) MultiModal Large Language Model featuring 3.7B activated parameters and 40B total parameters, specifically designed to enhance performance on enterprise‑oriented tasks while maintaining competitive capabilities on general‑purpose tasks. To address the overthinking phenomenon commonly observed in Large Reasoning Models (LRMs), we propose Reflection‑aware Adaptive Policy Optimization (RAPO), a novel RL training algorithm that effectively regulates overthinking behaviors. In enterprise‑oriented tasks such as retrieval‑augmented generation (RAG), complex table understanding, and summarization, Yuan3.0 Flash consistently achieves superior performance. Moreover, it also demonstrates strong reasoning capabilities in domains such as mathematics, science, etc., attaining accuracy comparable to frontier model while requiring only approximately 1/4 to 1/2 of the average tokens. Yuan3.0 Flash has been fully open‑sourced to facilitate further research and real‑world deployment: https://github.com/Yuan‑lab‑LLM/Yuan3.0.
Authors:Jinwei Hu, Xinmiao Huang, Youcheng Sun, Yi Dong, Xiaowei Huang
Abstract:
As large language models (LLMs) transition to autonomous agents synthesizing real‑time information, their reasoning capabilities introduce an unexpected attack surface. This paper introduces a novel threat where colluding agents steer victim beliefs using only truthful evidence fragments distributed through public channels, without relying on covert communications, backdoors, or falsified documents. By exploiting LLMs' overthinking tendency, we formalize the first cognitive collusion attack and propose Generative Montage: a Writer‑Editor‑Director framework that constructs deceptive narratives through adversarial debate and coordinated posting of evidence fragments, causing victims to internalize and propagate fabricated conclusions. To study this risk, we develop CoPHEME, a dataset derived from real‑world rumor events, and simulate attacks across diverse LLM families. Our results show pervasive vulnerability across 14 LLM families: attack success rates reach 74.4% for proprietary models and 70.6% for open‑weights models. Counterintuitively, stronger reasoning capabilities increase susceptibility, with reasoning‑specialized models showing higher attack success than base models or prompts. Furthermore, these false beliefs then cascade to downstream judges, achieving over 60% deception rates, highlighting a socio‑technical vulnerability in how LLM‑based agents interact with dynamic information environments. Our implementation and data are available at: https://github.com/CharlesJW222/Lying_with_Truth/tree/main.
Authors:Emiliya Khidirova, Oktay Karakuş
Abstract:
Accurate crop yield prediction relies on diverse data streams, including satellite, meteorological, soil, and topographic information. However, despite rapid advances in machine learning, existing approaches remain crop‑ or region‑specific and require data engineering efforts. This limits scalability, reproducibility, and operational deployment. This study introduces UniCrop, a universal and reusable data pipeline designed to automate the acquisition, cleaning, harmonisation, and engineering of multi‑source environmental data for crop yield prediction. For any given location, crop type, and temporal window, UniCrop automatically retrieves, harmonises, and engineers over 200 environmental variables (Sentinel‑1/2, MODIS, ERA5‑Land, NASA POWER, SoilGrids, and SRTM), reducing them to a compact, analysis‑ready feature set utilising a structured feature reduction workflow with minimum redundancy maximum relevance (mRMR). To validate, UniCrop was applied to a rice yield dataset comprising 557 field observations. Using only the selected 15 features, four baseline machine learning models (LightGBM, Random Forest, Support Vector Regression, and Elastic Net) were trained. LightGBM achieved the best single‑model performance (RMSE = 465.1 kg/ha, R^2 = 0.6576), while a constrained ensemble of all baselines further improved accuracy (RMSE = 463.2 kg/ha, R^2 = 0.6604). UniCrop contributes a scalable and transparent data‑engineering framework that addresses the primary bottleneck in operational crop yield modelling: the preparation of consistent and harmonised multi‑source data. By decoupling data specification from implementation and supporting any crop, region, and time frame through simple configuration updates, UniCrop provides a practical foundation for scalable agricultural analytics. The code and implementation documentation are shared in https://github.com/CoDIS‑Lab/UniCrop.
Authors:Ziyue Zhang, Luxi Lin, Xiaolin Hu, Chao Chang, HuaiXi Wang, Yiyi Zhou, Rongrong Ji
Abstract:
Diffusion inversion is a task of recovering the noise of an image in a diffusion model, which is vital for controllable diffusion image editing. At present, diffusion inversion still remains a challenging task due to the lack of viable supervision signals. Thus, most existing methods resort to approximation‑based solutions, which however are often at the cost of performance or efficiency. To remedy these shortcomings, we propose a novel self‑supervised diffusion inversion approach in this paper, termed Deep Inversion (DeepInv). Instead of requiring ground‑truth noise annotations, we introduce a self‑supervised objective as well as a data augmentation strategy to generate high‑quality pseudo noises from real images without manual intervention. Based on these two innovative designs, DeepInv is also equipped with an iterative and multi‑scale training regime to train a parameterized inversion solver, thereby achieving the fast and accurate image‑to‑noise mapping. To the best of our knowledge, this is the first attempt of presenting a trainable solver to predict inversion noise step by step. The extensive experiments show that our DeepInv can achieve much better performance and inference speed than the compared methods, e.g., +40.435% SSIM than EasyInv and +9887.5% speed than ReNoise on COCO dataset. Moreover, our careful designs of trainable solvers can also provide insights to the community. Codes and model parameters will be released in https://github.com/potato‑kitty/DeepInv.
Authors:Myung-Hwan Jang, Jeong-Min Park, Yunyong Ko, Sang-Wook Kim
Abstract:
Graph neural networks (GNNs) have achieved breakthroughs in various real‑world downstream tasks due to their powerful expressiveness. As the scale of real‑world graphs has been continuously growing, a storage‑based approach to GNN training has been studied, which leverages external storage (e.g., NVMe SSDs) to handle such web‑scale graphs on a single machine. Although such storage‑based GNN training methods have shown promising potential in large‑scale GNN training, we observed that they suffer from a severe bottleneck in data preparation since they overlook a critical challenge: how to handle a large number of small storage I/Os. To address the challenge, in this paper, we propose a novel storage‑based GNN training framework, named AGNES, that employs a method of block‑wise storage I/O processing to fully utilize the I/O bandwidth of high‑performance storage devices. Moreover, to further enhance the efficiency of each storage I/O, AGNES employs a simple yet effective strategy, hyperbatch‑based processing based on the characteristics of real‑world graphs. Comprehensive experiments on five real‑world graphs reveal that AGNES consistently outperforms four state‑of‑the‑art methods, by up to 4.1X faster than the best competitor. Our code is available at https://github.com/Bigdasgit/agnes‑kdd26.
Authors:Wentao Bian, Fenglei Xu
Abstract:
In this paper, we revisit multimodal few‑shot 3D point cloud semantic segmentation (FS‑PCS), identifying a conflict in "Fuse‑then‑Refine" paradigms: the "Plasticity‑Stability Dilemma." In addition, CLIP's inter‑class confusion can result in semantic blindness. To address these issues, we present the Decoupled‑experts Arbitration Few‑Shot SegNet (DA‑FSS), a model that effectively distinguishes between semantic and geometric paths and mutually regularizes their gradients to achieve better generalization. DA‑FSS employs the same backbone and pre‑trained text encoder as MM‑FSS to generate text embeddings, which can increase free modalities' utilization rate and better leverage each modality's information space. To achieve this, we propose a Parallel Expert Refinement module to generate each modal correlation. We also propose a Stacked Arbitration Module (SAM) to perform convolutional fusion and arbitrate correlations for each modality pathway. The Parallel Experts decouple two paths: a Geometric Expert maintains plasticity, and a Semantic Expert ensures stability. They are coordinated via a Decoupled Alignment Module (DAM) that transfers knowledge without propagating confusion. Experiments on popular datasets (S3DIS, ScanNet) demonstrate the superiority of DA‑FSS over MM‑FSS. Meanwhile, geometric boundaries, completeness, and texture differentiation are all superior to the baseline. The code is available at: https://github.com/MoWenQAQ/DA‑FSS.
Authors:Habiba Kausar, Saeed Anwar, Omar Jamal Hammad, Abdul Bais
Abstract:
Face super‑resolution aims to recover high‑quality facial images from severely degraded low‑resolution inputs, but remains challenging due to the loss of fine structural details and identity‑specific features. This work introduces SwinIFS, a landmark‑guided super‑resolution framework that integrates structural priors with hierarchical attention mechanisms to achieve identity‑preserving reconstruction at both moderate and extreme upscaling factors. The method incorporates dense Gaussian heatmaps of key facial landmarks into the input representation, enabling the network to focus on semantically important facial regions from the earliest stages of processing. A compact Swin Transformer backbone is employed to capture long‑range contextual information while preserving local geometry, allowing the model to restore subtle facial textures and maintain global structural consistency. Extensive experiments on the CelebA benchmark demonstrate that SwinIFS achieves superior perceptual quality, sharper reconstructions, and improved identity retention; it consistently produces more photorealistic results and exhibits strong performance even under 8x magnification, where most methods fail to recover meaningful structure. SwinIFS also provides an advantageous balance between reconstruction accuracy and computational efficiency, making it suitable for real‑world applications in facial enhancement, surveillance, and digital restoration. Our code, model weights, and results are available at https://github.com/Habiba123‑stack/SwinIFS.
Authors:Xiaobao Wei, Zhangjie Ye, Yuxiang Gu, Zunjie Zhu, Yunfei Guo, Yingying Shen, Shan Zhao, Ming Lu, Haiyang Sun, Bing Wang, Guang Chen, Rongfeng Lu, Hangjun Ye
Abstract:
Parking is a critical task for autonomous driving systems (ADS), with unique challenges in crowded parking slots and GPS‑denied environments. However, existing works focus on 2D parking slot perception, mapping, and localization, 3D reconstruction remains underexplored, which is crucial for capturing complex spatial geometry in parking scenarios. Naively improving the visual quality of reconstructed parking scenes does not directly benefit autonomous parking, as the key entry point for parking is the slots perception module. To address these limitations, we curate the first benchmark named ParkRecon3D, specifically designed for parking scene reconstruction. It includes sensor data from four surround‑view fisheye cameras with calibrated extrinsics and dense parking slot annotations. We then propose ParkGaussian, the first framework that integrates 3D Gaussian Splatting (3DGS) for parking scene reconstruction. To further improve the alignment between reconstruction and downstream parking slot detection, we introduce a slot‑aware reconstruction strategy that leverages existing parking perception methods to enhance the synthesis quality of slot regions. Experiments on ParkRecon3D demonstrate that ParkGaussian achieves state‑of‑the‑art reconstruction quality and better preserves perception consistency for downstream tasks. The code and dataset will be released at: https://github.com/wm‑research/ParkGaussian
Authors:Qundong Shi, Jie Zhou, Biyuan Lin, Junbo Cui, Guoyang Zeng, Yixuan Zhou, Ziyang Wang, Xin Liu, Zhen Luo, Yudong Wang, Zhiyuan Liu
Abstract:
The development of audio foundation models has accelerated rapidly since the emergence of GPT‑4o. However, the lack of comprehensive evaluation has become a critical bottleneck for further progress in the field, particularly in audio generation. Current audio evaluation faces three major challenges: (1) audio evaluation lacks a unified framework, with datasets and code scattered across various sources, hindering fair and efficient cross‑model comparison;(2) audio codecs, as a key component of audio foundation models, lack a widely accepted and holistic evaluation methodology; (3) existing speech benchmarks are heavily reliant on English, making it challenging to objectively assess models' performance on Chinese. To address the first issue, we introduce UltraEval‑Audio, a unified evaluation framework for audio foundation models, specifically designed for both audio understanding and generation tasks. UltraEval‑Audio features a modular architecture, supporting 10 languages and 14 core task categories, while seamlessly integrating 24 mainstream models and 36 authoritative benchmarks. To enhance research efficiency, the framework provides a one‑command evaluation feature, accompanied by real‑time public leaderboards. For the second challenge, UltraEval‑Audio adopts a novel comprehensive evaluation scheme for audio codecs, evaluating performance across three key dimensions: semantic accuracy, timbre fidelity, and acoustic quality. To address the third issue, we propose two new Chinese benchmarks, SpeechCMMLU and SpeechHSK, designed to assess Chinese knowledge proficiency and language fluency. We wish that UltraEval‑Audio will provide both academia and industry with a transparent, efficient, and fair platform for comparison of audio models. Our code, benchmarks, and leaderboards are available at https://github.com/OpenBMB/UltraEval‑Audio.
Authors:Zixian Liu, Sihao Liu, Yuqi Zhao
Abstract:
With the rapid adoption of multimodal large language models (MLMs) in autonomous agents, cross‑platform task execution capabilities in educational settings have garnered significant attention. However, existing benchmark frameworks still exhibit notable deficiencies in supporting cross‑platform tasks in educational contexts, especially when dealing with school‑specific software (such as XiaoYa Intelligent Assistant, HuaShi XiaZi, etc.), where the efficiency of agents often significantly decreases due to a lack of understanding of the structural specifics of these private‑domain software. Additionally, current evaluation methods heavily rely on coarse‑grained metrics like goal orientation or trajectory matching, making it challenging to capture the detailed execution and efficiency of agents in complex tasks. To address these issues, we propose KGCE (Knowledge‑Augmented Dual‑Graph Evaluator for Cross‑Platform Educational Agent Benchmarking with Multimodal Language Models), a novel benchmarking platform that integrates knowledge base enhancement and a dual‑graph evaluation framework. We first constructed a dataset comprising 104 education‑related tasks, covering Windows, Android, and cross‑platform collaborative tasks. KGCE introduces a dual‑graph evaluation framework that decomposes tasks into multiple sub‑goals and verifies their completion status, providing fine‑grained evaluation metrics. To overcome the execution bottlenecks of existing agents in private‑domain tasks, we developed an enhanced agent system incorporating a knowledge base specific to school‑specific software. The code can be found at https://github.com/Kinginlife/KGCE.
Authors:Keith Frankston, Benjamin Howard
Abstract:
We introduce a recursive AlphaZero‑style Monte‑‑Carlo tree search algorithm, "RMCTS". The advantage of RMCTS over AlphaZero's MCTS‑UCB is speed. In RMCTS, the search tree is explored in a breadth‑first manner, so that network inferences naturally occur in large batches. This significantly reduces the GPU latency cost. We find that RMCTS is often more than 40 times faster than MCTS‑UCB when searching a single root state, and about 3 times faster when searching a large batch of root states. The recursion in RMCTS is based on computing optimized posterior policies at each game state in the search tree, starting from the leaves and working back up to the root. Here we use the posterior policy explored in "Monte‑‑Carlo tree search as regularized policy optimization" (Grill, et al.) Their posterior policy is the unique policy which maximizes the expected reward given estimated action rewards minus a penalty for diverging from the prior policy. The tree explored by RMCTS is not defined in an adaptive manner, as it is in MCTS‑UCB. Instead, the RMCTS tree is defined by following prior network policies at each node. This is a disadvantage, but the speedup advantage is more significant, and in practice we find that RMCTS‑trained networks match the quality of MCTS‑UCB‑trained networks in roughly one‑third of the training time. We include timing and quality comparisons of RMCTS vs. MCTS‑UCB for three games: Connect‑4, Dots‑and‑Boxes, and Othello.
Authors:Evgenii Rudakov, Jonathan Shock, Benjamin Ultan Cowley
Abstract:
Reinforcement learning from pixels is often bottlenecked by the performance and complexity of 3D rendered environments. Researchers face a trade‑off between high‑speed, low‑level engines and slower, more accessible Python frameworks. To address this, we introduce PyBatchRender, a Python library for high‑throughput, batched 3D rendering that achieves over 1 million FPS on simple scenes. Built on the Panda3D game engine, it utilizes its mature ecosystem while enhancing performance through optimized batched rendering for up to 1000X speedups. Designed as a physics‑agnostic renderer for reinforcement learning from pixels, PyBatchRender offers greater flexibility than dedicated libraries, simpler setup than typical game‑engine wrappers, and speeds rivaling state‑of‑the‑art C++ engines like Madrona. Users can create custom scenes entirely in Python with tens of lines of code, enabling rapid prototyping for scalable AI training. Open‑source and easy to integrate, it serves to democratize high‑performance 3D simulation for researchers and developers. The library is available at https://github.com/dolphin‑in‑a‑coma/PyBatchRender.
Authors:Zihua Yang, Xin Liao, Yiqun Zhang, Yiu-ming Cheung
Abstract:
Categorical data are prevalent in domains such as healthcare, marketing, and bioinformatics, where clustering serves as a fundamental tool for pattern discovery. A core challenge in categorical data clustering lies in measuring similarity among attribute values that lack inherent ordering or distance. Without appropriate similarity measures, values are often treated as equidistant, creating a semantic gap that obscures latent structures and degrades clustering quality. Although existing methods infer value relationships from within‑dataset co‑occurrence patterns, such inference becomes unreliable when samples are limited, leaving the semantic context of the data underexplored. To bridge this gap, we present ARISE (Attention‑weighted Representation with Integrated Semantic Embeddings), which draws on external semantic knowledge from Large Language Models (LLMs) to construct semantic‑aware representations that complement the metric space of categorical data for accurate clustering. That is, LLM is adopted to describe attribute values for representation enhancement, and the LLM‑enhanced embeddings are combined with the original data to explore semantically prominent clusters. Experiments on eight benchmark datasets demonstrate consistent improvements over seven representative counterparts, with gains of 19‑27%. Code is available at https://github.com/develop‑yang/ARISE
Authors:Tien-Huy Nguyen, Huu-Loc Tran, Thanh Duc Ngo
Abstract:
Vision Language Models (VLMs) have rapidly advanced and show strong promise for text‑based person search (TBPS), a task that requires capturing fine‑grained relationships between images and text to distinguish individuals. Previous methods address these challenges through local alignment, yet they are often prone to shortcut learning and spurious correlations, yielding misalignment. Moreover, injecting prior knowledge can distort intra‑modality structure. Motivated by our finding that encoder attention surfaces spatially precise evidence from the earliest training epochs, and to alleviate these issues, we introduceITSELF, an attention‑guided framework for implicit local alignment. At its core, Guided Representation with Attentive Bank (GRAB) converts the model's own attention into an Attentive Bank of high‑saliency tokens and applies local objectives on this bank, learning fine‑grained correspondences without extra supervision. To make the selection reliable and non‑redundant, we introduce Multi‑Layer Attention for Robust Selection (MARS), which aggregates attention across layers and performs diversity‑aware top‑k selection; and Adaptive Token Scheduler (ATS), which schedules the retention budget from coarse to fine over training, preserving context early while progressively focusing on discriminative details. Extensive experiments on three widely used TBPS benchmarks showstate‑of‑the‑art performance and strong cross‑dataset generalization, confirming the effectiveness and robustness of our approach without additional prior supervision. Our project is publicly available at https://trhuuloc.github.io/itself
Authors:Shiao Wang, Xiao Wang, Haonan Zhao, Jiarui Xu, Bo Jiang, Lin Zhu, Xin Zhao, Yonghong Tian, Jin Tang
Abstract:
Existing RGB‑Event visual object tracking approaches primarily rely on conventional feature‑level fusion, failing to fully exploit the unique advantages of event cameras. In particular, the high dynamic range and motion‑sensitive nature of event cameras are often overlooked, while low‑information regions are processed uniformly, leading to unnecessary computational overhead for the backbone network. To address these issues, we propose a novel tracking framework that performs early fusion in the frequency domain, enabling effective aggregation of high‑frequency information from the event modality. Specifically, RGB and event modalities are transformed from the spatial domain to the frequency domain via the Fast Fourier Transform, with their amplitude and phase components decoupled. High‑frequency event information is selectively fused into RGB modality through amplitude and phase attention, enhancing feature representation while substantially reducing backbone computation. In addition, a motion‑guided spatial sparsification module leverages the motion‑sensitive nature of event cameras to capture the relationship between target motion cues and spatial probability distribution, filtering out low‑information regions and enhancing target‑relevant features. Finally, a sparse set of target‑relevant features is fed into the backbone network for learning, and the tracking head predicts the final target position. Extensive experiments on three widely used RGB‑Event tracking benchmark datasets, including FE108, FELT, and COESOT, demonstrate the high performance and efficiency of our method. The source code of this paper will be released on https://github.com/Event‑AHU/OpenEvTracking
Authors:Patricio Vera
Abstract:
Language generation maps a rich, high‑dimensional internal state to a single token sequence. We study this many‑to‑one mapping through the lens of intention collapse: the projection from an internal intention space I to an external language space L. We introduce three cheap, model‑agnostic metrics computed on a pre‑collapse state I: (i) intention entropy Hint(I), (ii) effective dimensionality deff(I), and (iii) recoverability Recov(I), operationalized as probe AUROC for predicting eventual success. We evaluate these metrics in a 3x3 study across models (Mistral‑7B, LLaMA‑3.1‑8B, Qwen‑2.5‑7B) and benchmarks (GSM8K, ARC‑Challenge, AQUA‑RAT), comparing baseline, chain‑of‑thought (CoT), and a babble control (n=200 items per cell). CoT increases average accuracy from 34.2% to 47.3% (+13.1 pp), driven by large gains on GSM8K but consistent degradations on ARC‑Challenge. Across models, CoT induces distinct entropy regimes relative to baseline, dH = Hint(CoT) ‑ Hint(Base): Mistral shows dH < 0 (lower‑entropy CoT), whereas LLaMA shows dH > 0 (higher‑entropy CoT), highlighting heterogeneity in CoT‑induced internal uncertainty. Finally, probe AUROC is significantly above chance in a subset of settings and can dissociate from behavioral accuracy (e.g., high AUROC alongside lower CoT accuracy on ARC‑Challenge for Qwen), suggesting that informative internal signal is not always reliably converted into a final discrete decision under constrained response formats.
Authors:Zihan Li, Dandan Shan, Yunxiang Li, Paul E. Kinahan, Qingqi Hong
Abstract:
Medical image segmentation faces critical challenges in semi‑supervised learning scenarios due to severe annotation scarcity requiring expert radiological knowledge, significant inter‑annotator variability across different viewpoints and expertise levels, and inadequate multi‑scale feature integration for precise boundary delineation in complex anatomical structures. Existing semi‑supervised methods demonstrate substantial performance degradation compared to fully supervised approaches, particularly in small target segmentation and boundary refinement tasks. To address these fundamental challenges, we propose SASNet (Scale‑aware Adaptive Supervised Network), a dual‑branch architecture that leverages both low‑level and high‑level feature representations through novel scale‑aware adaptive reweight mechanisms. Our approach introduces three key methodological innovations, including the Scale‑aware Adaptive Reweight strategy that dynamically weights pixel‑wise predictions using temporal confidence accumulation, the View Variance Enhancement mechanism employing 3D Fourier domain transformations to simulate annotation variability, and segmentation‑regression consistency learning through signed distance map algorithms for enhanced boundary precision. These innovations collectively address the core limitations of existing semi‑supervised approaches by integrating spatial, temporal, and geometric consistency principles within a unified optimization framework. Comprehensive evaluation across LA, Pancreas‑CT, and BraTS datasets demonstrates that SASNet achieves superior performance with limited labeled data, surpassing state‑of‑the‑art semi‑supervised methods while approaching fully supervised performance levels. The source code for SASNet is available at https://github.com/HUANGLIZI/SASNet.
Authors:Julian D. Santamaria, Claudia Isaza, Jhony H. Giraldo
Abstract:
Wildlife monitoring is crucial for studying biodiversity loss and climate change. Camera trap images provide a non‑intrusive method for analyzing animal populations and identifying ecological patterns over time. However, manual analysis is time‑consuming and resource‑intensive. Deep learning, particularly foundation models, has been applied to automate wildlife identification, achieving strong performance when tested on data from the same geographical locations as their training sets. Yet, despite their promise, these models struggle to generalize to new geographical areas, leading to significant performance drops. For example, training an advanced vision‑language model, such as CLIP with an adapter, on an African dataset achieves an accuracy of 84.77%. However, this performance drops significantly to 16.17% when the model is tested on an American dataset. This limitation partly arises because existing models rely predominantly on image‑based representations, making them sensitive to geographical data distribution shifts, such as variation in background, lighting, and environmental conditions. To address this, we introduce WildIng, a Wildlife image Invariant representation model for geographical domain shift. WildIng integrates text descriptions with image features, creating a more robust representation to geographical domain shifts. By leveraging textual descriptions, our approach captures consistent semantic information, such as detailed descriptions of the appearance of the species, improving generalization across different geographical locations. Experiments show that WildIng enhances the accuracy of foundation models such as BioCLIP by 30% under geographical domain shift conditions. We evaluate WildIng on two datasets collected from different regions, namely America and Africa. The code and models are publicly available at https://github.com/Julian075/CATALOG/tree/WildIng.
Authors:Jawad Chowdhury, Rezaur Rashid, Gabriel Terejanu
Abstract:
Understanding affective polarization in online discourse is crucial for evaluating the societal impact of social media interactions. This study presents a novel framework that leverages large language models (LLMs) and domain‑informed heuristics to systematically analyze and quantify affective polarization in discussions on divisive topics such as climate change and gun control. Unlike most prior approaches that relied on sentiment analysis or predefined classifiers, our method integrates LLMs to extract stance, affective tone, and agreement patterns from large‑scale social media discussions. We then apply a rule‑based scoring system capable of quantifying affective polarization even in small conversations consisting of single interactions, based on stance alignment, emotional content, and interaction dynamics. Our analysis reveals distinct polarization patterns that are event dependent: (i) anticipation‑driven polarization, where extreme polarization escalates before well‑publicized events, and (ii) reactive polarization, where intense affective polarization spikes immediately after sudden, high‑impact events. By combining AI‑driven content annotation with domain‑informed scoring, our framework offers a scalable and interpretable approach to measuring affective polarization. The source code is publicly available at: https://github.com/hasanjawad001/llm‑social‑media‑polarization.
Authors:Huang Junyao, Situ Ruimin, Ye Renqin
Abstract:
As artificial intelligence systems increasingly mediate consumer information discovery, brands face algorithmic invisibility. This study investigates Cultural Encoding in Large Language Models (LLMs) ‑‑ systematic differences in brand recommendations arising from training data composition. Analyzing 1,909 pure‑English queries across 6 LLMs (GPT‑4o, Claude, Gemini, Qwen3, DeepSeek, Doubao) and 30 brands, we find Chinese LLMs exhibit 30.6 percentage points higher brand mention rates than International LLMs (88.9% vs. 58.3%, p<.001). This disparity persists in identical English queries, indicating training data geography ‑‑ not language ‑‑ drives the effect. We introduce the Existence Gap: brands absent from LLM training corpora lack "existence" in AI responses regardless of quality. Through a case study of Zhizibianjie (OmniEdge), a collaboration platform with 65.6% mention rate in Chinese LLMs but 0% in International models (p<.001), we demonstrate how Linguistic Boundary Barriers create invisible market entry obstacles. Theoretically, we contribute the Data Moat Framework, conceptualizing AI‑visible content as a VRIN strategic resource. We operationalize Algorithmic Omnipresence ‑‑ comprehensive brand visibility across LLM knowledge bases ‑‑ as the strategic objective for Generative Engine Optimization (GEO). Managerially, we provide an 18‑month roadmap for brands to build Data Moats through semantic coverage, technical depth, and cultural localization. Our findings reveal that in AI‑mediated markets, the limits of a brand's "Data Boundaries" define the limits of its "Market Frontiers."
Authors:Lili Chen, Wensheng Gan, Shuang Liang, Philip S. Yu
Abstract:
Temporal point processes (TPPs) are crucial for analyzing events over time and are widely used in fields such as finance, healthcare, and social systems. These processes are particularly valuable for understanding how events unfold over time, accounting for their irregularity and dependencies. Despite the success of large language models (LLMs) in sequence modeling, applying them to temporal point processes remains challenging. A key issue is that current methods struggle to effectively capture the complex interaction between temporal information and semantic context, which is vital for accurate event modeling. In this context, we introduce TPP‑TAL (Temporal Point Processes with Enhanced Temporal Awareness in LLMs), a novel plug‑and‑play framework designed to enhance temporal reasoning within LLMs. Rather than using the conventional method of simply concatenating event time and type embeddings, TPP‑TAL explicitly aligns temporal dynamics with contextual semantics before feeding this information into the LLM. This alignment allows the model to better perceive temporal dependencies and long‑range interactions between events and their surrounding contexts. Through comprehensive experiments on several benchmark datasets, it is shown that TPP‑TAL delivers substantial improvements in temporal likelihood estimation and event prediction accuracy, highlighting the importance of enhancing temporal awareness in LLMs for continuous‑time event modeling. The code is made available at https://github.com/chenlilil/TPP‑TAL
Authors:Ayda Aghaei Nia
Abstract:
While Deep Learning has improved Brain‑Computer Interface (BCI) decoding accuracy, clinical adoption is hindered by the "Black Box" nature of these algorithms, leading to user frustration and poor neuroplasticity outcomes. We propose OmniNeuro, a novel HCI framework that transforms the BCI from a silent decoder into a transparent feedback partner. OmniNeuro integrates three interpretability engines: (1) Physics (Energy), (2) Chaos (Fractal Complexity), and (3) Quantum‑Inspired uncertainty modeling. These metrics drive real‑time Neuro‑Sonification and Generative AI Clinical Reports. Evaluated on the PhysioNet dataset (N=109), the system achieved a mean accuracy of 58.52%, with qualitative pilot studies (N=3) confirming that explainable feedback helps users regulate mental effort and reduces the "trial‑and‑error" phase. OmniNeuro is decoder‑agnostic, acting as an essential interpretability layer for any state‑of‑the‑art architecture.
Authors:Yin Li
Abstract:
Large Language Models (LLMs) are widely believed to possess self‑correction capabilities, yet recent studies suggest that intrinsic self‑correction‑‑where models correct their own outputs without external feedback‑‑remains largely ineffective. In this work, we systematically decompose self‑correction into three distinct sub‑capabilities: error detection, error localization, and error correction. Through cross‑model experiments on GSM8K‑Complex (n=500 per model, 346 total errors) with three major LLMs, we uncover a striking Accuracy‑Correction Paradox: weaker models (GPT‑3.5, 66% accuracy) achieve 1.6x higher intrinsic correction rates than stronger models (DeepSeek, 94% accuracy)‑‑26.8% vs 16.7%. We propose the Error Depth Hypothesis: stronger models make fewer but deeper errors that resist self‑correction. Error detection rates vary dramatically across architectures (10% to 82%), yet detection capability does not predict correction success‑‑Claude detects only 10% of errors but corrects 29% intrinsically. Surprisingly, providing error location hints hurts all models. Our findings challenge linear assumptions about model capability and self‑improvement, with important implications for the design of self‑refinement pipelines.
Authors:Tao An
Abstract:
Conversation summarization loses nuanced details: when asked about coding preferences after 40 turns, summarization recalls "use type hints" but drops the critical constraint "everywhere" (19.0% exact match vs. 93.0% for our approach). We present CogCanvas, a training‑free framework inspired by how teams use whiteboards to anchor shared memory. Rather than compressing conversation history, CogCanvas extracts verbatim‑grounded artifacts (decisions, facts, reminders) and retrieves them via temporal‑aware graph. On the LoCoMo benchmark (all 10 conversations from the ACL 2024 release), CogCanvas achieves the highest overall accuracy among training‑free methods (32.4%), outperforming RAG (24.6%) by +7.8pp, with decisive advantages on complex reasoning tasks: +20.6pp on temporal reasoning (32.7% vs. 12.1% RAG) and +1.1pp on multi‑hop questions (41.7% vs. 40.6% RAG). CogCanvas also leads on single‑hop retrieval (26.6% vs. 24.6% RAG). Ablation studies reveal that BGE reranking contributes +7.7pp, making it the largest contributor to CogCanvas's performance. While heavily‑optimized approaches achieve higher absolute scores through dedicated training (EverMemOS: ~92%), our training‑free approach provides practitioners with an immediately‑deployable alternative that significantly outperforms standard baselines. Code and data: https://github.com/tao‑hpu/cog‑canvas
Authors:Taekyung Ki, Sangwon Jang, Jaehyeong Jo, Jaehong Yoon, Sung Ju Hwang
Abstract:
Talking head generation creates lifelike avatars from static portraits for virtual communication and content creation. However, current models do not yet convey the feeling of truly interactive communication, often generating one‑way responses that lack emotional engagement. We identify two key challenges toward truly interactive avatars: generating motion in real‑time under causal constraints and learning expressive, vibrant reactions without additional labeled data. To address these challenges, we propose Avatar Forcing, a new framework for interactive head avatar generation that models real‑time user‑avatar interactions through diffusion forcing. This design allows the avatar to process real‑time multimodal inputs, including the user's audio and motion, with low latency for instant reactions to both verbal and non‑verbal cues such as speech, nods, and laughter. Furthermore, we introduce a direct preference optimization method that leverages synthetic losing samples constructed by dropping user conditions, enabling label‑free learning of expressive interaction. Experimental results demonstrate that our framework enables real‑time interaction with low latency (approximately 500ms), achieving 6.8X speedup compared to the baseline, and produces reactive and expressive avatar motion, which is preferred over 80% against the baseline.
Authors:Longtian Qiu, Shan Ning, Chuyu Zhang, Jiaxuan Sun, Xuming He
Abstract:
Direct Preference Optimization (DPO) has shown strong potential for mitigating hallucinations in Multimodal Large Language Models (MLLMs). However, existing multimodal DPO approaches often suffer from overfitting due to the difficulty imbalance in preference data. Our analysis shows that MLLMs tend to overemphasize easily distinguishable preference pairs, which hinders fine‑grained hallucination suppression and degrades overall performance. To address this issue, we propose Difficulty‑Aware Direct Preference Optimization (DA‑DPO), a cost‑effective framework designed to balance the learning process. DA‑DPO consists of two main components: (1) Difficulty Estimation leverages pre‑trained vision‑‑language models with complementary generative and contrastive objectives, whose outputs are integrated via a distribution‑aware voting strategy to produce robust difficulty scores without additional training; and (2) Difficulty‑Aware Training reweights preference pairs based on their estimated difficulty, down‑weighting easy samples while emphasizing harder ones to alleviate overfitting. This framework enables more effective preference optimization by prioritizing challenging examples, without requiring new data or extra fine‑tuning stages. Extensive experiments demonstrate that DA‑DPO consistently improves multimodal preference optimization, yielding stronger robustness to hallucinations and better generalization across standard benchmarks, while remaining computationally efficient. The project page is available at https://artanic30.github.io/project_pages/DA‑DPO/.
Authors:Miaowei Wang, Jakub Zadrożny, Oisin Mac Aodha, Amir Vaxman
Abstract:
Accurately simulating existing 3D objects and a wide variety of materials often demands expert knowledge and time‑consuming physical parameter tuning to achieve the desired dynamic behavior. We introduce MotionPhysics, an end‑to‑end differentiable framework that infers plausible physical parameters from a user‑provided natural language prompt for a chosen 3D scene of interest, removing the need for guidance from ground‑truth trajectories or annotated videos. Our approach first utilizes a multimodal large language model to estimate material parameter values, which are constrained to lie within plausible ranges. We further propose a learnable motion distillation loss that extracts robust motion priors from pretrained video diffusion models while minimizing appearance and geometry inductive biases to guide the simulation. We evaluate MotionPhysics across more than thirty scenarios, including real‑world, human‑designed, and AI‑generated 3D objects, spanning a wide range of materials such as elastic solids, metals, foams, sand, and both Newtonian and non‑Newtonian fluids. We demonstrate that MotionPhysics produces visually realistic dynamic simulations guided by natural language, surpassing the state of the art while automatically determining physically plausible parameters. The code and project page are available at: https://wangmiaowei.github.io/MotionPhysics.github.io/.
Authors:Shengjun Zhang, Zhang Zhang, Chensheng Dai, Yueqi Duan
Abstract:
Recent reinforcement learning has enhanced the flow matching models on human preference alignment. While stochastic sampling enables the exploration of denoising directions, existing methods which optimize over multiple denoising steps suffer from sparse and ambiguous reward signals. We observe that the high entropy steps enable more efficient and effective exploration while the low entropy steps result in undistinguished roll‑outs. To this end, we propose E‑GRPO, an entropy aware Group Relative Policy Optimization to increase the entropy of SDE sampling steps. Since the integration of stochastic differential equations suffer from ambiguous reward signals due to stochasticity from multiple steps, we specifically merge consecutive low entropy steps to formulate one high entropy step for SDE sampling, while applying ODE sampling on other steps. Building upon this, we introduce multi‑step group normalized advantage, which computes group‑relative advantages within samples sharing the same consolidated SDE denoising step. Experimental results on different reward settings have demonstrated the effectiveness of our methods.
Authors:Yifan Zhang, Yifeng Liu, Mengdi Wang, Quanquan Gu
Abstract:
The efficacy of deep residual networks is fundamentally predicated on the identity shortcut connection. While this mechanism effectively mitigates the vanishing gradient problem, it imposes a strictly additive inductive bias on feature transformations, thereby limiting the network's capacity to model complex state transitions. In this paper, we introduce Deep Delta Learning (DDL), a novel architecture that generalizes the standard residual connection by modulating the identity shortcut with a learnable, data‑dependent geometric transformation. This transformation, termed the Delta Operator, constitutes a rank‑1 perturbation of the identity matrix, parameterized by a reflection direction vector \mathbfk(\mathbfX) and a gating scalar β(\mathbfX). We provide a spectral analysis of this operator, demonstrating that the gate β(\mathbfX) enables dynamic interpolation between identity mapping, orthogonal projection, and geometric reflection. Furthermore, we restructure the residual update as a synchronous rank‑1 injection, where the gate acts as a dynamic step size governing both the erasure of old information and the writing of new features. This unification empowers the network to explicitly control the spectrum of its layer‑wise transition operator, enabling the modeling of complex, non‑monotonic dynamics while preserving the stable training characteristics of gated residual architectures.