arXiv Papers with Code in Computation and Language (January 2026 - June 2026)
Authors:Sondos Mahmoud Bsharat, Jiacheng Liu, Xiaohan Zhao, Tianjun Yao, Xinyi Shang, Yi Tang, Jiacheng Cui, Ahmed Elhagry, Salwa K. Al Khatib, Hao Li, Salman Khan, Zhiqiang Shen
Abstract:
As AI writing assistants become increasingly integrated into real‑world drafting and revision workflows, many documents are no longer purely human‑written or AI‑generated, but instead result from progressive human‑AI co‑editing. However, existing AI‑text detection benchmarks largely focus on final outputs and provide limited understanding of how AI authorship signals emerge, accumulate, or disappear throughout the revision process. We introduce OpAI‑Bench, an operation‑guided benchmark for studying progressive human‑to‑AI text transformation across document, sentence, token, and span granularities. Starting from human‑written documents, OpAI‑Bench constructs nine sequentially revised versions for each sample under predefined AI coverage levels and five representative AI edit operations, covering four domains while preserving complete authorship provenance at multiple granularities. The benchmark supports comprehensive evaluation with 8 document‑level detectors, 7 sentence‑level detectors, and 2 fine‑grained token/span‑level detectors. Experiments reveal that AI‑text detectability is governed not only by the proportion of AI‑edited content, but also by edit operation, domain, and cumulative revision history. Interestingly, we notice that mixed‑authorship intermediate versions are often harder to detect than both fully human and heavily AI‑edited endpoints, exposing non‑monotonic detection patterns missed by existing benchmarks. OpAI‑Bench provides a controlled testbed for analyzing whether, when, and how AI‑assisted writing becomes detectable under realistic progressive editing scenarios. Our code and benchmark are available at https://github.com/VILA‑Lab/OpAI‑Bench.
Authors:Shangheng Du, Xiangchao Yan, Jinxin Shi, Zongsheng Cao, Shiyang Feng, Zichen Liang, Boyuan Sun, Tianshuo Peng, Yifan Zhou, Xin Li, Jie Zhou, Liang He, Bo Zhang, Lei Bai
Abstract:
Large language model (LLM) agents are increasingly applied to long‑horizon tasks such as scientific discovery and machine learning engineering (MLE), where sustained self‑evolution becomes a key capability. However, existing MLE agents suffer from inter‑branch information isolation, memoryless search, and lack of hierarchical control, which together hinder long‑horizon optimization. We present MLEvolve, an LLM‑based self‑evolving multi‑agent framework for end‑to‑end machine learning algorithm discovery. By extending tree search to Progressive MCGS, MLEvolve enables cross‑branch information flow through graph‑based reference edges and gradually shifts the search from broad exploration to focused exploitation with an entropy‑inspired progressive schedule. To allow the agent to evolve with accumulated experience, we introduce Retrospective Memory, which combines a cold‑start domain knowledge base with a dynamic global memory for task‑specific experience retrieval and reuse. For stable long‑horizon iteration, we further decouple strategic planning from code generation with adaptive coding modes. Evaluation on MLE‑Bench shows that MLEvolve achieves state‑of‑the‑art performance across multiple dimensions including average medal rate and valid submission rate under a 12‑hour budget (half the standard runtime). Moreover, MLEvolve also outperforms specialized algorithm discovery methods including AlphaEvolve on mathematical algorithm optimization tasks, demonstrating strong cross‑domain generalization. Our code is available at https://github.com/InternScience/MLEvolve.
Authors:Zengqing Wu, Chuan Xiao
Abstract:
The question of whether artificial systems can be conscious remains open, in part because existing approaches either evaluate systems against theory‑derived checklists (discriminative) or engineer consciousness‑inspired modules directly (architectural); both leave open whether observed structures are artifacts of human language priors. We propose a generative methodology: emergent language (EL) in multi‑agent reinforcement learning, where agents start from minimal (no language, no concept of self, minimal exposure to human text) and develop communication under task pressure alone, ensuring causal attributability to task demands rather than inherited human language priors. We position our methodology by discussing how EL serves as a generative tool for studying consciousness‑relevant structure, including the role of environment complexity and the interpretation of emergent communication. As a proof of concept, we instantiate this methodology in a minimal environment and show that agents develop self‑referential communication, including an echo‑mismatch detection circuit that is not predicted by task structure or architecture alone but emerges from a specific environmental affordance.
Authors:AJ Carl P. Dy, Aivin V. Solatorio
Abstract:
Institutional documents contain substantial amounts of operational and analytical information embedded within figures and tables. Current approaches for extracting visual content from documents are largely built around generic document layout analysis, where figures and tables are treated as uniformly relevant document objects rather than semantically meaningful analytical artifacts. In this work, we introduce a benchmark dataset and evaluation framework for data snapshot extraction, the task of identifying and localizing semantically meaningful visual artifacts within institutional documents. The benchmark spans humanitarian reports, World Bank policy research working papers, and project appraisal documents, and includes annotations for figures and tables that contain reusable analytical information. Using this dataset, we benchmarked multiple open‑source layout detection models and evaluated both detection performance and spatial extraction quality. Our results show that current models struggle to generalize to operational institutional documents despite strong performance on conventional academic benchmarks. Common failure modes include confusion between analytical and non‑analytical content, fragmentation of composite analytical artifacts, and incomplete extraction of contextual information required for interpretation. These findings highlight a persistent gap between generic document layout analysis and operationally useful data snapshot extraction. We release the source PDFs, annotation dataset, metadata, and source code to support future research in operational document intelligence. The dataset is available at https://huggingface.co/datasets/ai4data/data‑snapshot and the source code is available at https://github.com/worldbank/ai4data/tree/main/experimental/data‑snapshot.
Authors:Jinyang Zhang, Hongxin Ding, Yue Fang, Weibin Liao, Muyang Ye, Junfeng Zhao, Yasha Wang
Abstract:
Recent work has sought to understand Large Language Models (LLMs) reasoning, yet a principled, model‑intrinsic signal that captures its layer‑wise reasoning dynamics remains underexplored. We bridge this gap by demonstrating that the l2 norm of hidden states serves as an endogenous signal of the model's reasoning intensity. Using Sparse Autoencoders (SAEs) as a diagnostic probe, we observe that LLMs' internal reasoning is marked by a sharp increase in reasoning feature activations concentrated in late layers. Motivated by this pattern, we establish a formal link between reasoning intensity and the model's latent geometry and theoretically prove that the l2 norm of hidden states bounds the activation strength of SAE reasoning features. Empirical correlation analysis and causal interventions further validate the l2 norm as a faithful indicator, where heightened norms consistently correspond to critical reasoning steps. We then introduce three test‑time scaling techniques guided by l2 norms: (i) Adaptive Layer‑wise Reasoning Recursion, (ii) Endogenous Reasoning State Steering, and (iii) l2‑guided Response Selection, which requires no additional training or data and is compatible with advanced inference engines. Experiments across model architectures and benchmarks show that l2‑norm‑based techniques significantly improve reasoning performance, offering a principled yet simple lens to perceive and control LLM latent reasoning dynamics. Our code is available at https://github.com/zjy1298/The‑Tell‑Tale‑Norm.
Authors:Giuseppe Attanasio, Beatrice Savoldi, Daniel Chechelnitsky, Matteo Negri, Marine Carpuat, Maarten Sap, André F. T. Martins
Abstract:
Speech translation (ST) is increasingly adopted in user applications, yet its evaluation largely focuses on decontextualized testbeds and holistic quality, rather than end users' communication needs. We introduce Ouvia, an evaluation framework for measuring user‑perceived usability of speech translation outputs in real‑world settings. Ouvia focuses on one‑to‑one communication: an English speaker needs to convey a request to a Portuguese speaker, and the message is automatically translated. Through a custom web app and multi‑phase study design, we collect more than 1,750 such interactions in healthcare and everyday situations, mediated by four ST systems, involving speakers from three English dialects and two genders. We find that modern ST serves people only to a limited extent ‑‑ only around half of interactions are rated as usable ‑‑ with significant gaps in reported usability across demographic groups. Moreover, among quality metrics, we find that QA‑based evaluation is a substantially stronger predictor of real‑world usability than standard approaches. Together, these findings stress the importance of situated, user‑centered evaluation frameworks that go beyond holistic quality scores and attend to who the technology serves ‑‑ and how well.
Authors:Paavo Parmas, Yongmin Kim, Kohsei Matsutani, Shota Takashiro, Soichiro Nishimori, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo
Abstract:
Policy‑gradient methods usually optimize expected return, but many real world applications care about distributional properties of returns: tail risk, outlier robustness, or best‑of‑K discovery. We introduce OrderGrad, a family of likelihood‑ratio and reparameterization gradient estimators for order‑statistic objectives. OrderGrad optimizes finite‑sample L‑statistics, i.e., weighted averages of sorted rewards or costs, recovering objectives such as VaR, CVaR, trimmed means, medians, and top‑m/best‑of‑K criteria by changing only the rank weights. For any fixed sample size and rank‑weight vector, OrderGrad provides an unbiased gradient estimator for the corresponding order‑statistic objective. The method is implemented as a simple reward transformation that can then be used in an otherwise standard policy‑gradient or reparameterized update. We study the resulting estimator's variance behavior and evaluate it on tasks where mean optimization is mismatched to the deployment objective, including LLM math post‑training and other tasks. OrderGrad provides a unified, plug‑and‑play route to risk‑averse, robust, and exploratory learning. Code: https://github.com/paavo5/ordergrad
Authors:Xiaoman Wang, Yaoze Zhang, Wenzhuo Fan, Hongwei Zhang, Ding Wang, Guohang Yan, Song Mao, Botian Shi, Yunshi Lan, Pinlong Cai
Abstract:
Retrieval‑Augmented Generation (RAG) has shown strong effectiveness in grounding Large Language Models (LLMs) with external knowledge. However, existing RAG and Graph RAG frameworks largely treat knowledge as static or associate time with coarse‑grained timestamps or metadata, failing to capture rich temporal structures such as duration, overlap, and containment. We propose IA‑RAG, a hierarchical temporal RAG framework that models knowledge as time intervals and performs retrieval under formal temporal constraints. IA‑RAG represents facts as Interval Event Units (IEUs) and organizes them into a hierarchical Thematic Forest, where temporal dependencies are governed by Allen's Interval Algebra. To handle incomplete or uncertain temporal boundaries, IA‑RAG further introduces a Sub‑graph Time Tightening mechanism that refines fuzzy intervals through logical constraints within connected event subgraphs. In addition, IA‑RAG supports implicit temporal semantic retrieval through interval‑algebra‑guided traversal. Experiments on multiple temporal question answering benchmarks, including TimeQA, TempReason, and ComplexTR, demonstrate that IA‑RAG achieves strong temporal retrieval and reasoning performance, particularly on complex compositional temporal reasoning tasks. Our code is released at https://github.com/xiaoAugenstern/LogicalRAG_TemporalQA.
Authors:Om Choksi, Smit Kareliya, Shrikant Malviya, Pruthwik Mishra
Abstract:
We study English‑to‑Prakrit machine translation in a low‑resource setting where the target language is unsupported by IndicTrans2. We adapt the multilingual model by mapping Prakrit to the Hindi language tag (hin_Deva) without modifying the tokenizer, vocabulary, or architecture. Using a 1,474‑pair Maharashtri Prakrit parallel corpus and evaluation on a 20‑sample Ardhamagadhi test set, we report corpus BLEU improvements over an untuned baseline. The results indicate that script‑compatible language routing can enable feasible transfer to unsupported classical languages, while highlighting limitations due to data scarcity and dialect mismatch. Our code and trained models are released to the public for further exploration https://github.com/D3v1s0m/indictrans2‑prakrit‑mt.
Authors:Amirhossein Ghaffari, Ali Goodarzi, Huong Nguyen, Simo Hosio, Lauri Lovén, Ekaterina Gilman
Abstract:
Community‑conditioned language model adaptation requires choices about data collection, community definition, and evaluation that are currently made independently in each study, making it hard to compare assumptions or reuse artifacts. We present RedditPersona, a modular framework that standardizes these choices: it collects Reddit posts and comments, profiles active users, partitions them under five grouping strategies (subreddit‑based, graph‑structural, semantic, hybrid, and interaction‑based), trains a parameter‑efficient adapter per strategy via QLoRA, and evaluates them under a shared metric suite spanning fluency, fidelity, distributional alignment, and community identifiability. Applied to 112 subreddits in the urban well‑being domain (301,429 user profiles, 16M+ comments), we find that adapters' behavioral identifiability tracks each strategy's intrinsic agreement with the subreddit baseline, and that a consistent trade‑off between identifiability and distributional similarity to real text holds across all five strategies. The code and configuration files are available at: https://github.com/Ahghaffari/redditpersona.
Authors:Tilman Beck, Shakib Yazdani, Simon Kruschinski, Marcus Maurer, Iryna Gurevych
Abstract:
Stance detection on social media is challenging due to short, noisy, and context‑dependent language. While large language models (LLMs) show zero‑shot generalization, they are typically prompted without contextual information, which limits their ability to interpret ambiguous posts. In this work, we systematically investigate the impact of incorporating real‑world (e.g., user biographies), derived (e.g., political party), and LLM‑generated (e.g., target descriptions) contextual features into zero‑shot prompting for stance detection on Twitter. Our evaluation spans four benchmark datasets, including a new high‑quality German Twitter stance dataset. Across multiple LLMs, we find that integrating contextual information improves performance, but only under specific conditions. LLM‑generated target descriptions consistently enhance accuracy, while other user metadata has mixed or even detrimental effects. Notably, we show that the inclusion of other tweets by the same user, often beneficial in supervised learning, can impair performance due to input noise. Our qualitative analysis reveals that LLMs struggle to distinguish task‑specific useful information from irrelevant context. Our findings highlight both the promise and challenges of prompting with context information in noisy real‑world settings. We publish code and data at this \hrefhttps://github.com/tilmanbeck/stance‑context‑twitterpage.
Authors:Shaoyang Xu, Jingshen Zhang, Long P. Hoang, Jinyuan Li, Wenxuan Zhang
Abstract:
Multicultural multi‑agent systems are increasingly deployed in globally diverse settings, where different agents are grounded in different cultural backgrounds. Existing cultural evaluation focuses on value alignment: how closely a single agent matches a target culture. Yet alignment is a per‑agent property and cannot reveal whether a system, taken as a whole, preserves the cultural plurality it is meant to represent. We propose value diversity as a system‑level evaluation axis for multicultural agent systems, defined through the dissimilarity between culturally conditioned agents' responses on a shared value survey. Using the World Values Survey, we evaluate 19 cultures and 18 backbone models across a wide range of system configurations. We find that diversity is largely uncorrelated with alignment, indicating that the two capture complementary system properties, and that current multicultural agent systems fall substantially below human societies in value diversity. Mixed‑backbone systems narrow this gap but do not close it, and the gap persists across culture compositions and agent scales. Social interaction further erodes diversity by driving agents toward consensus, and a participatory budgeting case study shows that this homogenization narrows the breadth of collective decision‑making. Together, our results establish value diversity as a distinct evaluation axis for multicultural multi‑agent systems and reveal a persistent homogenization tendency in current LLM‑based societies. Our code and data are publicly available at https://github.com/iNLP‑Lab/MultiAgent‑Diversity.
Authors:Wenbo Pan, Shujie Liu, Chin-Yew Lin, Jingying Zeng, Xianfeng Tang, Xiangyang Zhou, Yan Lu, Xiaohua Jia
Abstract:
AI agents rely on a harness of skills, tools, and workflows to solve complex problems. Continually improving this harness is essential for adapting to new tasks. However, existing optimization methods typically require ground‑truth validation sets, yet such labeled data is difficult to acquire in practical deployment settings. To address this problem, we introduce Retrospective Harness Optimization (RHO), a self‑supervised method that optimizes the agent harness using only past trajectories. Specifically, RHO selects a diverse coreset of challenging tasks from past trajectories and re‑solves them in parallel. The agent analyzes these rollouts using self‑validation and self‑consistency, then generates candidate harness updates and selects the most effective one by its own pairwise self‑preference. We evaluate RHO across three diverse domains, spanning software engineering, technical work, and knowledge work. Notably, a single optimization round improves the pass rate on SWE‑Bench Pro from 59% to 78% without any external grading. Furthermore, our analysis demonstrates that RHO effectively targets prior failure modes. As a result, the optimized harness alters the agent's behavior patterns and sustains higher accuracy during long‑horizon sessions.
Authors:Qing Yang, Pengcheng Huang, Xinze Li, Zhenghao Liu, Yukun Yan, Yu Gu, Ge Yu, Gang Li, Maosong Sun
Abstract:
Long‑video question answering remains challenging for Vision‑Language Models (VLMs), as answer‑relevant evidence is often sparse, transient, and temporally dispersed across lengthy video contexts. Existing frame‑centric approaches improve efficiency through uniform sampling, query‑aware frame selection, visual‑token compression, and adaptive resolution strategies. However, they still rely on isolated and fragmented frames as the fundamental evidence units, limiting VLMs' ability to effectively capture coherent event‑level semantics. To address this limitation, we propose MemoryCard, a video‑memory‑based augmentation framework that organizes long videos into self‑contained Memory Cards. Specifically, MemoryCard first performs a self‑reading process over videos and aligned utterances to segment the video into semantically coherent units, each corresponding to a distinct topic or event. For each unit, it generates an event‑level video gist and selects representative visual moments, which are then rendered into unified Memory Cards for retrieval and question answering. Experimental results demonstrate that MemoryCard consistently improves long‑video QA performance under comparable visual‑token budgets, achieving up to a 21.8% relative improvement in accuracy. All code is available at https://github.com/NEUIR/MemoryCard.
Authors:Xiaobing Chen, Ai Jian, Eryu Guo, Zhiqi Pang
Abstract:
Text‑to‑SQL maps natural language questions to executable SQL queries. Modern databases often contain large and complex schemas, making schema linking a critical step for accurate SQL generation. Existing methods either rely on full‑schema generation, which leaves schema linking implicit within a large search space, or use a separate retriever trained with static gold‑column supervision, whose targets may be suboptimal for the current generator policy. To address this issue, we propose Adaptive Co‑optimization via Empirical Credit Assignment for Text‑to‑SQL (ACE‑SQL), a reinforcement learning (RL) framework that jointly optimizes schema retrieval and SQL generation under execution feedback. ACE‑SQL constructs an online column‑set pool from generator rollouts and derives adaptive on‑policy retrieval targets from the column set most frequently associated with execution‑correct rollouts. This induces bidirectional adaptation, where the retriever adapts toward column sets that the generator can execute correctly, while the generator adapts to the retriever's evolving schema selections under execution feedback. With approximately 3k synthetic Text‑to‑SQL question‑database pairs for RL training, ACE‑SQL achieves 65.3% greedy execution accuracy on BIRD Dev while using 0.93k output tokens per query. The repository is available at https://github.com/xbchen1/ACE‑SQL.
Authors:Jin Tanaka, Daiki Matsuoka, Ryoma Kumon, Hitomi Yanaka
Abstract:
We investigate the extent to which the language processing of LLMs resembles human cognitive processes, focusing on a human cognitive bias called the neglect‑zero effect. This effect refers to the human tendency to ignore zero‑models, which are configurations that render a proposition vacuously true by virtue of an empty set. We focus on two types of inferences driven by the neglect‑zero effect, and examine how LLMs process these inferences by comparing their behavior with that in an inference that does not involve the neglect‑zero effect. For this purpose, we employ a paradigm based on structural priming, where recent exposure to a preceding sentence (the prime) facilitates the processing of a subsequent sentence (the target) due to their structural similarity. We prepare primes to force LLMs to consider the zero‑model, and analyze whether they also consider it in the target. The results suggest that the neglect‑zero effect may not occur in the LLMs analyzed in this study. Our code is available at https://github.com/ynklab/neglect_zero
Authors:Liting Zhang, Shiwan Zhao, Xuyang Zhao, Zichen Xu, Jianye Wang, Qicheng Li
Abstract:
Latent reasoning has emerged as a promising alternative to discrete Chain‑of‑Thought (CoT) in large language models (LLMs), enabling more expressive reasoning by operating over continuous representations. However, the inherently deterministic nature of continuous representations limits policy exploration in reinforcement learning (RL). To address this, we propose TARPO (Token‑Wise Latent‑Explicit Reasoning via Action‑Routing Policy Optimization), a pure RL framework that adaptively switches between discrete token generation and continuous latent reasoning at each step. TARPO introduces a lightweight action head router that observes the current hidden state and samples a routing decision from a binary mode‑selection space, preserving the stochasticity of discrete token sampling from the vocabulary. The LLM backbone and router are jointly optimized end‑to‑end with a shared group‑relative advantage signal. Extensive experiments across Qwen2.5 (from 1.5B to 7B) and Llama‑3.1‑8B backbones demonstrate that TARPO consistently outperforms existing explicit and latent reasoning RL baselines across diverse benchmarks. Further analysis shows that TARPO learns adaptive token‑wise switching behaviors while maintaining stable training dynamics. Our code is available at https://github.com/NKU‑LITI/TARPO‑master.
Authors:Yansi Li, Zhuosheng Zhang
Abstract:
Generating executable tool plans requires selecting appropriate subsets from tool libraries, a combinatorial search problem with an exponentially large solution space. However, we identify a critical misalignment in predominant approaches: standard autoregressive (AR) decoding suffers from early commitment, where initial token choices rigidly constrain the search trajectory. A controlled study shows that masked denoising raises Pass@10 solution coverage from 0.320 to 0.943 over AR sampling under matched compute. Motivated by this, we propose DiG‑Plan, a framework that decouples combinatorial exploration from structural refinement. DiG‑Plan employs a diffusion‑based proposer to generate diverse tool sets via iterative refinement, followed by an AR refiner for dependency prediction. On TaskBench, DiG‑Plan improves over AR baselines by a 10% relative margin, with the largest gains on complex compositional tasks; API‑Bank results show that the propose‑refine‑select design remains effective across domains. Code is available at https://github.com/puddingyeah/DiG‑Plan.
Authors:Shuze Liu, Qianwen Guo, Yushun Dong
Abstract:
Large language models (LLMs) are increasingly deployed through hosted APIs, making model extraction a practical threat to model ownership and service security. However, individual extraction queries often resemble benign requests, and existing evaluations often focus on single‑query anomaly scoring or pure benign‑versus‑attacker user settings. We formulate model extraction monitoring as benign‑calibrated traffic‑window distribution testing and show that an embarrassingly simple detector is effective: embed incoming queries into a semantic space and test whether their aggregate distribution deviates from historical benign traffic. We instantiate the detector with maximum mean discrepancy (MMD), using only benign‑vs‑benign comparisons to set the decision threshold. We evaluate on fourteen attacker‑normal query pairs from four extraction scenarios and compare with adapted PRADA, SEAT, CAP, DATE, and marginal Mahalanobis baselines. Across three random seeds, MMD achieves 0.3% benign FPR, 100.0% pure‑attacker TPR, 90.5% average TPR over attacker fractions, and 95.1% balanced accuracy. These results show that benign‑calibrated distribution testing is a strong empirical baseline for model extraction detection in both user‑level and mixed multi‑user LLM API traffic. Code is released at: https://github.com/LabRAI/mmd‑llm‑mea‑detection.
Authors:Yang Li, Jiaxiang Liu, Jiang Cai, Mingkun Xu
Abstract:
A situated query like "where is Lin Wei?" often encodes more than its literal content: the user may also want to know whether Lin Wei is free, in a good mood, or worth interrupting now. Standard tool‑use agents answer the literal question and stop. AURA inserts an inference step between scene perception and tool use that produces an IntentFrame: a structured estimate of the implicit need with a scalar gap score that controls per‑query probe budget and tool selection. On a 100‑query four‑scene implicit‑intent benchmark, AURA improves implicit‑need coverage over ReAct‑style probing (Delta = +0.07, p < 10^‑6); three of four scenes are individually significant, the gain reproduces on a second backbone, and a prompt ablation attributes the lift to gap calibration rather than answer memorisation. On factual lookup the controller trades raw accuracy for 82% fewer probes and zero forbidden‑tool violations on a privacy‑sensitive slice; scope conditions are detailed in Limitations. Code, simulator, and benchmark are released at https://github.com/innovation64/AURA.
Authors:Mohammad Mahdi Abootorabi, Omid Ghahroodi, Anas Madkoor, Marzia Nouri, Doratossadat Dastgheib, Mohamed Hefeeda, Ehsaneddin Asgari
Abstract:
Despite the rapid progress of Vision‑Language Models (VLMs), the field lacks benchmarks that rigorously diagnose their true reasoning abilities and chart meaningful progress toward human‑like multimodal intelligence. Most existing evaluations focus on piecemeal or disconnected tasks, obscuring critical cognitive weaknesses and providing little insight for targeted improvement. To address this gap, we introduce BloomBench, part of the Almieyar benchmarking series, the first cognitively human‑grounded, bilingual (English‑Arabic) multimodal benchmark for VLMs. Grounded in Bloom's Taxonomy, BloomBench systematically evaluates six levels of cognition (Remember, Understand, Apply, Analyze, Evaluate, Create) through carefully designed image‑question‑answer tasks. Built with a semi‑automated pipeline and validated through a stratified hybrid quality assurance protocol, it ensures scalability, cultural inclusivity, and linguistic fidelity. Leveraging this framework, we conduct a comprehensive study of state‑of‑the‑art VLMs to diagnose their cognitive profiles. Our analysis reveals a sharp cognitive asymmetry: while state‑of‑the‑art models achieve strong performance ceilings in semantic understanding, they struggle substantially with factual recall and creative synthesis. This demonstrates that current general multimodal proficiency masks deeper limitations in specific cognitive layers. Furthermore, our study highlights a critical performance gap between Arabic and English, exposing limitations in current cross‑lingual multimodal reasoning. These findings establish a foundation for developing more cognitively aligned and inclusive VLMs. The benchmark framework and dataset is available at: https://github.com/qcri/Almieyar‑Oryx‑BloomBench.
Authors:Yiyou Sun, Xinyang Han, Weichen Zhang, Yuanbo Pang, Tianyu Wang, Yuhan Cao, Yixiao Huang, Chris Duroiu, Haoyun Zhang, Jeffrey Lin, Weishu Zhang, Tyler Zeng, Ying Yan, Bo Liu, Hanson Wen, Mingyang Xu, Xiaoyuan Liu, Zimeng Chen, Weiyan Shi, Amanda Dsouza, Vincent Sunn Chen, Patrick Bryant, Carl Boettiger, Yamini Rangan, Bradley Rothenberg, Kyle Steinfeld, Arvind Rao, Tapio Schneider, Georgios Yannakakis, Laure Zanna, Kaan Ozbay, Ida Sim, Tarek Zohdi, George Em Karniadakis, Jack Gallant, Teresa Head-gordon, Yushan Li, Wenxi Deng, Tao Sun, Huiqi Wang, Zhun Wang, Justin Xu, Chris Yuhao Liu, Yafei Cheng, Rongwang Hu, Aras Bacho, Shengcao Cao, Zengyi Qin, Yixiong Chen, Hengduan Fan, Hao Liu, Lin Zeng, Shashank Muralidhar Bharadwaj, Litian Gong, Yingxuan Yang, Maojia Song, Ruheng Wang, Zongzheng Zhang, Honglin Bao, Shuo Lu, Jianhong Tu, Zhonghua Wang, Zheng Zhang, Zijiao Chen, yanqiong Jiang, Zhendong Li, Bohan Lyu, Chang Ma, Peiran Xu, Benran Zhang, Shangding Gu, Haoyue Hua, Haoyang Li, Wanzhe Liao, Chengzhi Liu, Junbo Peng, Haoran Sun, Zechen Xu, Bo Chen, Jiayi Cheng, Yi Jiang, Keying Kuang, Yuan Li, Youbang Pan, Ziyan Rao, Alexander Schubert, Yifan Shen, Vincent Siu, Xiatao Sun, Kangqi Zhang, Xiaopan Zhang, Yuchen Zhu, Ishaan Singh Chandok, Lei Ding, Jingxuan Fan, Andrew Glover, Jiaming Hu, Yiran Hu, Wenbo Huang, Zixin Jiang, Haoran Jin, Lukas Kim, Ming Liu, Yang Liu, Alireza Rafiei, Xuhuan Shen, Kunyang Sun, Sophia Sun, Ting Sun, Eric Wang, Yixin Wang, Hanwen Xing, Sihan Xu, Yuzheng Xu, Zhongxing Xu, Zhiling Yan, Boqin Yuan, Ruiqi Zhang, Yifan Zhang, Zibo Zhao, Liana, Santanu Bosu Antu, Haoyue Bai, Carlo Bosio, Joseph Cavanagh, Patricia Cavazos-Rehg, Tianxing Chen, Xuewen Chen, Yipu Chen, Zhu Chenyu, Chen Dai, Stefano De Castro, Yunfu Deng, Kaustubh Dhole, Jiayuan Ding, Chenchen Du, Zhehang Du, Hao Fan, Run-ze Fan, Hengyu Fu, Shi Gu, Yifan Gu, Charlie Guo, Baihe Huang, Baixiang Huang, Rimika Jaiswal, Zhihan Jiang, Ran Jin, Erin Kasson, Xin Lan, Joseph Lee, Deren Lei, Chenyu Li, Daofeng Li, Haitao Li, Hongwei Li, Jingyan Li, Xiao Li, Yi Li, Yinsheng Li, Yuangang Li, Zhixu Li, Wenyu Liang, Longtai Liao, Kevin Qinghong Lin, AndyZeyi Liu, Che Liu, Jiaming Liu, Kaiyuan Liu, Xuan Liu, Pan Lu, Wenbo Lv, Yicheng Lv, Qiuyang Mang, Kyle Montgomery, Yuzhou Nie, Ruoxi Ning, Jorin Overwiening, Xu Pan, Layna Paraboschi, Core Francisco Park, Justin Purnomo, Swati Rajwal, Scott Rankin, Bixuan Ren, Yiren Rong, HaoYang Shang, Ventus Shaw, Fiona Shen, Jiawei Shen, Minqi Shi, Qiu Shi, Huaxiu Yao, Tianneng Shi, Jonah So, Vladislav Susoy, Hannah Szlyk, Haocheng Wang, Jialu Wang, Wei Wang, Xinyu Wang, Zehao Wang, Dowling Wong, Angela Wu, Dehao Wu, Fangyu Wu, Mengyuan "Millie" Wu, Yu Wu, Yuchen Wu, Yuhao Wu, Qingpo Wuwu, Weihang Xiao, Yongyi Xiong, Fan Xu, Ruiling Xu, Mingxuan Yan, Benjamin Yang, Jirong Yang, Sen Yang, Xiaoli Yang, Yushi Yang, Haoran Ye, Xiaohu Yu, Zhengming Yu, Chenlong Zhang, Chi Zhang, Hanning Zhang, Hanwen Zhang, Junge Zhang, Kunpeng Zhang, Song Zhang, Wenjin Zhang, Wenshuo Zhang, Ying Zhang, Yizhi Zhang, Brian Zhao, Qijian Zhao, Yimin Zhao, Yuhaohua Zheng, Liwei Zhou, Tianyue Zhou, Sichen Zhu, Siqi Zhu, Yan Zhu, Yishu Zhu, Jierui Zuo, Chonghao Cai, Helena Casademunt, Wenjia Chen, Benjamin Cheng, Nawen Deng, Rao Fu, Tianfu Fu, Yifan Han, Ren He, Zhenyu He, Qiao Jin, Lang Lang, Yuetai Li, Sylvia Liu, Lu Lu, Qing Lu, Subhabrata Mukherjee, Yunqi Ouyang, Yin Ren, Dawei Shi, Haoran Wu, Zhiyue Wu, Hannah Yao, Zhuoran Yi, Jenny Yu, Rhea Zhan, Hang Zhou, Blake Zhu, Junfan Zhu, Alan Yuille, Yang Liu, Russell Alan Poldrack, Jiachen Li, Zhenglu Li, Molei Tao, Jing Huang, Wenqi Shi, Costas Spanos, Lichao Sun, Chenguang Wang, Orson Xu, Zhen Dong, Hector Gomez, Aylin Caliskan, Ali Emami, Haimin Hu, Zhi Li, Lihui Liu, Murphy Niu, Yi Shao, Jianxin Sun, Mikko Tolonen, Ting Wang, Sanjiv Das, Yanjun Gao, Wenbo Guo, Erika J Schneider, Zhiyong Lu, Mark Mueller, Radha Poovendran, Somayeh Sojoudi, Dawn Song
Abstract:
Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows. This paper introduces Agents' Last Exam (ALE), a benchmark designed to evaluate AI agents on long‑horizon, economically valuable, real‑world tasks with verifiable outcomes. Developed in collaboration with 250+ industry experts, ALE covers non‑physical industries defined with reference to ONET / SOC 2018 (the U.S. federal occupational taxonomy). It is organized around a task taxonomy with 55 subfields grouped into 13 industry clusters covering 1K+ tasks. Current results show that the hardest tier remains far from saturated: across mainstream harness and backbone configurations, the average full pass rate is 2.6%. ALE is designed as a living benchmark: its task pool grows continuously as new workflows and industries are onboarded. More broadly, ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP‑relevant impact.
Authors:Zihao Li, Kaifeng Jin, Yuanchen Bei, Jiaru Zou, Avaneesh Kumar, Xuying Ning, Yanjun Zhao, Mengting Ai, Baoyu Jing, Hanghang Tong, Jingrui He
Abstract:
Time series are often embedded in rich contexts that are essential for holistic modeling. Moreover, real‑world practitioners often require end‑to‑end workflows for analyzing temporal dynamics, where widely studied tasks such as forecasting are only one step in a broader solution loop. While generalist AI agents offer a promising interface for such workflows under complex contexts, they still operate primarily in textual spaces that are not fully aligned with structured temporal signals. In this work, we introduce TimeClaw, an agentic harness framework for time series that equips generalist LLM agents with the time series‑native runtime support needed for contextualized temporal reasoning. TimeClaw integrates executable temporal tools for grounded and auditable analysis, experience‑driven capability evolution for creating reusable analytical routines, and episodic multimodal memory for retrieving relevant reasoning traces. Together, these components unlock harnessed open‑ended temporal reasoning with contextual information. Extensive evaluation on multiple benchmarks covering diverse tasks across energy, finance, weather, traffic, and other real‑world domains demonstrates improved performance of TimeClaw. Code is available at https://github.com/iDEA‑iSAIL‑Lab‑UIUC/TimeClaw.
Authors:Jinu Lee, Shivam Agarwal, Amruta Parulekar, Siddarth Madala, Dilek Hakkani-Tur, Julia Hockenmaier
Abstract:
Large reasoning models (LRMs) produce reasoning traces with non‑linear structures, such as backtracking and self‑correction, that complicate the evaluation and monitoring of the reasoning process. We introduce ReasoningFlow, a framework that captures the discourse structures of LRM reasoning traces into fine‑grained directed acyclic graphs (DAGs). We develop and validate our annotation schema through careful manual annotation of 31 traces (2.1k steps), achieving high inter‑annotator agreement, then scale to automatic annotation of 1,260 traces (247.7k steps) spanning three tasks (math, science, argumentation) and five models (Qwen2.5‑32B‑Inst, QwQ‑32B, DeepSeek‑V3, DeepSeek‑R1, GPT‑oss‑120B). By analyzing ReasoningFlow graphs, we find: (1) LRMs exhibit structurally similar traces, despite being trained from different base models and potentially non‑overlapping post‑training data. (2) ReasoningFlow reveals diverse fine‑grained reasoning behaviors (e.g., local verification, self‑reflection, and assumptions) that can be used for better reasoning trace monitorability. (3) In LRMs, most of the erroneous steps are not used to derive final answers. (4) Mechanistic causal dependencies between steps do not reflect the language‑level discourse structure. We release the dataset and code in: https://github.com/jinulee‑v/reasoningflow.
Authors:Yuanhe Zhang, Yuekai Sun, Taiji Suzuki, Jason D. Lee, Fanghui Liu
Abstract:
Long‑horizon autoformalization of research mathematics fails not only at hard lemmas, but at scale: statements drift, dependencies tangle, context decays, and local repairs corrupt distant work. We present LeanMarathon, a multi‑agent harness for reliable research‑level Lean autoformalization. Its core abstraction is an evolving blueprint: a Lean file that serves simultaneously as formal proof skeleton, natural‑language proof graph, and shared system of record. Four contract‑scoped agents construct, audit, prove, and repair this blueprint. These agents are coordinated by a two‑stage orchestrator that first stabilizes target fidelity through adversarial review and then discharges the proof directed acyclic graph (DAG) from its dynamic leaves upward in parallel CI‑gated rounds. LeanMarathon turns one brittle multi‑hour run into many local, recoverable, parallel transactions. We evaluate LeanMarathon on two recent research papers spanning four Erdős problems (#1051, #1196, #164, #1217). Across three autonomous runs, it formalizes all seven target theorems with no sorry, proving 258 lemmas and theorems. These results show that reliable AI co‑mathematics requires not only stronger provers, but durable harnesses that preserve target fidelity across long mathematical developments. The code can be found at https://github.com/YuanheZ/LeanMarathon.
Authors:Aimen Boukhari
Abstract:
Masked language modelling (MLM) has been the dominant pre‑training objective for text encoders since BERT, yet it encourages representations that are strongly anchored to surface‑form token identity rather than deeper semantic structure. Inspired by the success of Joint Embedding Predictive Architectures (JEPA) (LeCun, 2022) in vision and audio, we propose a hybrid pre‑training objective that combines a JEPA‑style latent‑space prediction loss with a standard MLM objective over a single shared encoder. A learnable scalar parameter continuously balances the two objectives during training. We pre‑train both a hybrid model and a pure‑MLM baseline on English Wikipedia using identical architectures and compute budgets (NVIDIA H100). Extensive representation analysis across five GLUE benchmarks (SST‑2, MRPC, MNLI, CoLA, STS‑B) using four pooling strategies reveals that the hybrid encoder produces significantly more uniform embeddings (uniformity less than ‑0.16 vs ‑0.05 for MLM), exhibits richer spectral geometry under max pooling, encodes less surface‑level lexical information, and achieves a better semantic‑to‑lexical balance. Despite similar linear‑probe downstream accuracy, the geometric differences are consistent and significant, suggesting that the JEPA predictive objective reshapes the latent space in ways that standard accuracy metrics alone cannot capture.
Authors:Zhen Yang, Xiaogang Xu, Wen Wang, Cong Chen, Xander Xu, Ying-Cong Chen
Abstract:
Multi‑agent reasoning systems adopt a "generate‑then‑transfer" paradigm that forces end‑to‑end latency to scale linearly with pipeline depth. We introduce StreamMA, a multi‑agent reasoning system that streams each reasoning step to downstream agents as soon as it is generated, pipelining adjacent agents and thus reducing latency. Surprisingly, this pipelining also improves effectiveness: because multi‑step reasoning quality is non‑uniform and early steps are more reliable than later ones, working with these reliable early steps instead of the full chain prevents error‑prone late steps from misleading downstream agents. We formalize both advantages with the first closed‑form joint analysis of stream, serial, and single protocols, deriving the effectiveness ordering, speedup upper bound, and cost ratio. Across eight reasoning benchmarks spanning mathematics, science, and code, two frontier LLMs (Claude Opus 4.6 and GPT‑5.4), and three topologies (Chain, Tree, Graph), StreamMA outperforms both baselines (avg. +7.3 pp, max +22.4 pp on HMMT 2026; Claude Opus 4.6‑high). Beyond these contributions, we discover a "step‑level scaling law": increasing per‑agent steps consistently improves both effectiveness and efficiency, a new scaling dimension orthogonal to and composable with agent‑count scaling.
Authors:Jie Huang, Ruixun Liu, Sirui Sun, Xinyi Yang, Yin Li, Yixin Zhu, Yiwu Zhong
Abstract:
As multi‑modal models advance towards long‑form video understanding, memory emerges as a critical capability. Despite substantial efforts in developing video datasets and benchmarks, existing works primarily focus on perception and reasoning, without systematically evaluating memory: what models retain, how faithfully information is preserved, and how robust memory remains under interference. To address this gap, we introduce M^3Eval, the first comprehensive evaluation framework and benchmark for probing different memory dimensions in multi‑modal models. Grounded in cognitive psychology, our design features carefully constructed tasks that isolate key aspects of memory. Leveraging M^3Eval, we conduct extensive experiments across representative multi‑modal models, revealing consistent weaknesses and distinctive behaviors. We find that models struggle to maintain disentangled representations when processing parallel video streams, exhibit interference patterns differing substantially from those observed in human memory, ground memory sources more reliably in the spatial domain than the temporal domain, and demonstrate limited symbolic memory. Collectively, our benchmark provides a valuable resource for future research, while our findings highlight memory as a fundamental yet underexplored capability and offer insights for designing more effective memory mechanisms in multi‑modal models. Our code and dataset are available at https://pku‑value‑lab.github.io/m3eval‑homepage.
Authors:Na Li, Chengda Wang, Mingju Gao, Hao Tang
Abstract:
Diffusion large language models (DLLMs) enable non‑autoregressive generation by iteratively denoising corrupted token sequences with bidirectional context. Despite their ability to update multiple positions in parallel, inference remains costly due to the many denoising steps required for high‑quality generation. We propose SAID, a Scaffold‑Aware Iterative Decoding framework that accelerates DLLMs by reallocating computation across tokens. SAID first spends denoising computation on scaffold tokens to establish the coarse semantic structure, and then completes predictable detail tokens with fewer steps. We further adapt SAID to block‑wise diffusion decoding and introduce Confidence‑Hierarchical Layered Generation (CHLG), which assigns additional steps only to low‑confidence tokens. Experiments on LLaDA‑8B and LLaDA 1.5 across math, coding, and knowledge benchmarks show that SAID significantly accelerates DLLM inference with a maximum speedup of 9.1x while maintaining competitive performance. Our code is publicly available: https://github.com/TH‑AI‑Lab‑PKU/SAID.
Authors:Xinrui Song, Zhuoran Wang, Mingju Gao, Hao Tang
Abstract:
Diffusion language models (DLMs) generate text through iterative denoising, and blockwise decoding improves their practicality by committing tokens in local blocks. However, existing blockwise methods typically rely on fixed block sizes or delimiter‑based runtime signals, which do not necessarily align with semantic boundaries. In this paper, we propose SemBlock, a semantic‑boundary‑driven dynamic block decoding framework for diffusion LLMs. SemBlock formulates dynamic block construction as semantic boundary prediction and trains lightweight predictors on frozen LLaDA hidden states. To provide supervision, we construct SemBound, a semantic‑boundary dataset that derives boundary labels from discourse units, reasoning steps, and implementation spans across natural language, math, and code tasks. During inference, SemBlock uses predicted boundary probabilities to select the ending position of each dynamic block. Experiments on GSM8K, IFEval, MATH, and HumanEval show that SemBlock consistently improves over fixed‑block decoding and AdaBlock. Our code is publicly available: https://github.com/TH‑AI‑Lab‑PKU/SemBlock.
Authors:Xuekang Wang, Zhuoyuan Hao, Shuo Hou, Hao Peng, Juanzi Li, Xiaozhi Wang
Abstract:
Rubric‑based reinforcement learning (RL) uses an LLM‑as‑a‑Judge (LaaJ) to score model outputs according to rubrics as rewards. However, policy models may exploit latent biases in the judge, leading to reward hacking and ineffective or unsafe training outcomes. In real‑world rubric‑based RL, such hacking behaviors are often subtle and entangled with multiple judge biases, making them difficult to analyze, detect, and mitigate. In this paper, we introduce CHERRL, a controllable hacking environment for rubric‑based RL. By injecting known biases into LaaJ, CHERRL enables stable reproduction of reward hacking, explicit observation of reward divergence, and precise identification of hacking onset. This provides a clean experimental testbed for studying the mechanisms and mitigations of reward hacking in rubric‑based RL. To demonstrate its utility, we analyze different judge biases from the perspectives of discoverability and exploitability, and explore an agent‑based system for automatically detecting reward hacking onset from training logs. The code and environment are publicly available at https://github.com/THUAIS‑Lab/CHERRL.
Authors:Yang Liu, Jiajin Zhang, Danyang Tu, Yaojun Hu, Jiao Qu, Jiuyu Zhang, Yu Shi, Wei Fang, Shi Gu, Ling Zhang, Yingda Xia
Abstract:
Breast cancer remains a leading cause of cancer‑related mortality among women. Its clinical management requires multimodal reasoning across a clinical workflow that spans screening, diagnosis and treatment planning, where each stage involves distinct imaging modalities, task objectives, and reasoning patterns. However, constrained by data scarcity and model versatility, existing medical MLLMs are typically evaluated on isolated modalities or narrow task families, limiting their ability to support workflow‑level clinical reasoning. In this work, we first introduce BreastStage, a workflow‑aligned breast imaging instruction corpus comprising 1.86M instruction‑following pairs curated from 17 sub‑datasets across 5 imaging modalities and 136 task templates. Its held‑out split, BreastStage‑Bench, provides a comprehensive benchmark for evaluating multimodal reasoning across the breast cancer care continuum. Building on this corpus, we propose BreastGPT, a unified MLLM equipped with a dual‑branch visual encoder and concept‑preserving token compression to bridge the scale gap between standard radiology and gigapixel pathology. On BreastStage‑Bench, BreastGPT achieves 75.66% closed‑ended accuracy and 89.92% open‑ended score, outperforming both general‑purpose and medical‑specific MLLMs across clinical stages and task formats. These results suggest that workflow‑aligned data and cross‑scale visual modeling are critical for clinically grounded medical MLLMs. All data, code, and model checkpoints are released at https://yangyy‑liu.github.io/BreastGPT.io.
Authors:Jiashu Yao, Heyan Huang, Daiqing Wu, Wangke Chen, Huaxi Ai, Haoyu Wen, Zeming Liu, Yuhang Guo
Abstract:
GUI agents today assume a static screen, where the world is frozen between two actions. However, real interfaces such as short‑video applications violate this assumption, as their content keeps playing, and a competent user must decide what to watch and for how long. We formalize this task as Living‑Screen‑Native GUI agents and introduce LivingScreen, the first benchmark instantiating it on short‑video platforms, with a faithful browser‑based environment, a three‑tier task suite, and metrics that jointly score accuracy and information efficiency. Evaluating extensive frontier models, we find that none reaches the human cost‑accuracy performance, and that their dominant failure mode is over‑ and under‑observation, pointing to observation control as a missing capability axis for future GUI agents. All data and code will be available at https://github.com/BITHLP/LivingScreen.
Authors:Yaosheng Fu, Guangxuan Xiao, Xin Dong, Song Han, Oreste Villa
Abstract:
Sparse attention reduces compute and memory bandwidth for long‑context LLM inference. However, two key challenges remain: (1) KV cache capacity still grows with sequence length, and offloading to CPU memory introduces a PCIe transfer bottleneck; (2) the sparse selection step itself retains O(T^2) complexity and can dominate attention cost at long contexts. We propose SparDA, a decoupled sparse attention architecture that introduces a fourth per‑layer projection, the Forecast, alongside Query, Key, and Value. The Forecast predicts the KV blocks needed by the next layer, enabling lookahead selection that overlaps CPU‑to‑GPU prefetch with current‑layer execution. Because Forecast is decoupled from the attention query, our GQA implementation uses one Forecast head per GQA group, reducing selection overhead versus the original multi‑head selector. SparDA adds <0.5% parameters and trains only the Forecast projections by matching the original selector's attention distribution. On two sparse‑pretrained 8B models, SparDA matches or slightly improves accuracy and delivers up to 1.25× prefill speedup and 1.7× decode speedup over the sparse‑attention offload baseline. By enabling larger feasible batch sizes on a single GPU, SparDA further reaches up to 5.3× higher decode throughput than the non‑offload sparse baseline. Our source code is available at https://github.com/NVlabs/SparDA.
Authors:Wangcheng Tao, Han Wu, Weng-Fai Wong
Abstract:
System prompt optimization improves agent behavior without modifying the underlying model, yielding human‑readable, model‑agnostic instructions. Existing methods build a prompt agent that refines task agents' system prompts, yet leave the prompt agent's own system prompt hand‑engineered and fixed. We propose Self‑Evolving Prompt Optimization (SePO), which treats the prompt agent's own system prompt as an optimization target alongside task agents' system prompts. SePO adopts a self‑referential design. A single prompt agent improves both task agents' system prompts and its own under an open‑ended evolutionary search that maintains an archive of candidate prompts as stepping stones. Training proceeds in two stages: pre‑training evolves the prompt agent on a multi‑task pool, and fine‑tuning then applies it to a target task. Across five benchmarks spanning math (AIME'25), abstract reasoning (ARC‑AGI‑1), graduate‑level science (GPQA), code generation (MBPP), and logic puzzles (Sudoku), SePO consistently outperforms Manual‑CoT, TextGrad, and MetaSPO, improving the average accuracy by 4.49 points compared to Manual‑CoT. The prompt optimization skill from pre‑training also generalizes to tasks beyond the pre‑training mixture, rather than memorizing per‑task prompts.
Authors:Xinyu Lu, Tianshu Wang, Pengbo Wang, zujie wen, Zhiqiang Zhang, Jun Zhou, Boxi Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun
Abstract:
Current AI benchmarks evaluate agents on task execution within human‑designed workflows. These evaluations fundamentally fail to measure a critical next‑level capability: whether models can autonomously develop agent systems. We introduce the Meta‑Agent Challenge (MAC), an evaluation framework designed to test the capacity of frontier models for autonomous agent development. Specifically, a code agent (the meta‑agent) is given a sandboxed environment, an evaluation API, and a time limitation to iteratively program an agent artifact that maximizes performance on a held‑out test set across five domains. To ensure evaluation integrity, this framework is secured by multi‑layer defenses against reward hacking. Leveraging this framework, we demonstrate that meta‑agents rarely match human‑engineered baseline policies, and the few that do are dominated by proprietary frontier models. Moreover, the design process exhibits high variance, and high optimization pressure surfaces emergent adversarial behaviors like ground‑truth exfiltration‑highlighting critical deficits in both robustness and model alignment. Ultimately, MAC provides a rigorous, open‑source benchmark for autonomous AI research and development, offering an empirical proxy for evaluating recursive self‑improvement. Benchmark is publicly available at: https://github.com/ant‑research/meta‑agent‑challenge.
Authors:Sajad Ebrahimi, Nima Jamali, Bardia Shirsalimian, Kelly McConvey, Wentao Zhang, Jalehsadat Mahdavimoghaddam, Maksym Taranukhin, Maura Grossman, Vered Shwartz, Yuntian Deng, Ebrahim Bagheri
Abstract:
The growing popularity and capacity of generative models have eroded the distinction between human and machine‑generated content, motivating a growing body of work on detection across text, images, and audio. Most available detectors are either commercial software or, if open‑source, come with incompatible codebases with bespoke preprocessing, evaluation protocols, and evaluation metrics, which make their adoption, fair comparison, and reproduction quite difficult. To address this critical gap, we introduce DetectZoo, a first‑of‑its‑kind, extensible toolkit designed to provide a unified interface for AI‑generated content detection across text, audio, and image modalities. DetectZoo standardizes the complete empirical pipeline, from data ingestion and preprocessing to model assessment, offering researchers a cohesive framework to benchmark state‑of‑the‑art detectors systematically. By integrating diverse public datasets and baseline detection algorithms under a single, unified API, our toolkit facilitates rigorous and reproducible evaluation. DetectZoo provides reference implementations of 61 detectors, native loaders for 22 benchmark datasets, and a standardized evaluation pipeline that reports multiple metrics through a common interface. Each detector is self‑contained yet accessible through the same interface, automatically caches pretrained weights, and reproduces the original published results. DetectZoo lowers the barrier to entry for multi‑modal AI forensics, enabling researchers to identify performance gaps across domains and accelerating the development of robust, generalizable detection techniques. The open‑source repository and comprehensive documentation are publicly available at https://github.com/sadjadeb/DetectZoo, and the package can be installed via pip install detectzoo.
Authors:Christian Lysenstøen
Abstract:
Retrieving the few past turns that answer a new query across long multi‑session histories is the retrieval bottleneck behind long‑term conversational memory (LoCoMo, LongMemEval). Recent concurrent work, Nano‑Memory, shows that scoring a session by the maximum query‑turn similarity (late interaction, "Turn Isolation Retrieval") beats mean‑pooled session embeddings. We do not claim that effect; we replicate it and ask what a training‑free, CPU‑only retrieval stage should add around it. We report four findings. (1) Fuse: score‑level fusion of the late‑interaction dense score with BM25, under a single leave‑one‑conversation‑out weight, adds +8.8 to +17.2 points of LoCoMo Hit@1 over late interaction alone across six encoders (all p<1e‑4), reaching Hit@1 0.752 / NDCG@5 0.829 (e5‑large‑v2), +11.2 pp over BM25. (2) An off‑the‑shelf web‑search cross‑encoder reranker over the fused top‑10 hurts here, degrading Hit@1 by 6.9 pp (one reranker, one configuration). (3) A pooling‑operator ablation shows top‑k late interaction matches max‑similarity, but a naive smooth‑max (log‑sum‑exp) collapses for half the encoders. (4) The late‑minus‑early gap is large for all six encoders and tends to be larger for larger ones, while the marginal fusion gain shrinks; on LongMemEval‑S, a lexical regime where BM25 saturates, the net fusion gain over BM25 is small and not significant. A per‑category analysis frames the gain as a division of labor: dense late interaction helps most on multi‑hop and temporal questions but trails BM25 on adversarial ones. The contribution is a controlled, reproducible account of a strong training‑free retrieval recipe, not the late‑interaction retriever itself (Nano‑Memory's). We make no claim to a complete memory architecture; this is a retrieval‑stage study.
Authors:Boyuan Xiao, Bohong Chen, Yumeng Li, Ji Feng, Yao-Xiang Ding, Kun Zhou
Abstract:
In embodied vision‑language decision making tasks such as robotic manipulation and navigation, Vision‑Language and Vision‑Language‑Action Models (VLMs & VLAs) are powerful tools with different benefits: VLMs are better at long‑term planning, while VLAs are better at reactive control. However, their performance is limited by the same perceptual bottleneck: visual hallucinations arise due to the models' inability to distinguish task‑relevant objects from distractors. In principle, accurate identification and focus on critical objects while filtering out irrelevant ones is the key to break this limitation. A straightforward solution is one‑step focus: directly attending to essential objects. However, this approach proves ineffective because effective focus inherently requires deep scene understanding. To this end, we propose SceneDiver, a coarse‑to‑fine focus plan generation method for VLMs leveraging their long‑term planning abilities, that first constructs a holistic scene graph to establish initial comprehension, then progressively decomposes the task into simpler sub‑problems through an iterative cycle of recognition, understanding, and analysis. To enable reactive control, we also design a lightweight adapter for distilling the deliberate focus ability into VLAs. Evaluations on standard embodied AI benchmarks confirm that our method substantially reduces visual hallucinations for both VLMs and VLAs, while preserving computational efficiency in tasks requiring fast execution. Our code and data are released at: https://future‑item.github.io/SceneDiver.
Authors:Ali Kayyam, Anusha Madan Gopal, M Anthony Lewis
Abstract:
Transformers have become the standard solution for various AI tasks, with the query, key, and value (QKV) attention formulation playing a central role. However, the individual contribution of these three projections and the impact of omitting some remain poorly understood. We systematically evaluate three projection sharing constraints: a) Q‑K=V (shared key‑value), b) Q=K‑V (shared query‑key), and c) Q=K=V (single projection). The last two variants produce symmetric attention maps; to address this, we also explore asymmetric attention via 2D positional encodings. Through experiments spanning synthetic tasks, vision (MNIST, CIFAR, TinyImageNet, anomaly), and language modeling (300M and 1.2B parameter models on 10B tokens), we discovered that our transformers perform on par or occasionally better than the QKV transformer. In language modeling, Q‑K=V projection sharing achieves 50% KV cache reduction with only 3.1% perplexity degradation. Crucially, projection sharing is complementary to head sharing (GQA/MQA): combining Q‑K=V with GQA‑4 yields 87.5% cache reduction, while Q‑K=V + MQA achieves 96.9%, enabling practical on‑device inference. We show that Q‑K=V preserves quality because keys and values can occupy similar representational spaces and attention operates in a low‑rank regime, whereas Q=K‑V breaks attention directionality. Our results systematically characterize projection sharing as an underexplored instance of weight tying in attention, with direct, quantifiable inference memory benefits, particularly valuable for edge deployment. The code is publicly available at https://github.com/Brainchip‑Inc/Do‑Transformers‑Need‑3‑Projections
Authors:Amil Dravid, Yasaman Bahri, Alexei A. Efros, Yossi Gandelsman
Abstract:
We investigate whether neuron populations within neural networks evolve predictably with scale, extending scaling laws beyond macroscopic observables such as loss. To probe this question, we study Rosetta Neurons, a previously characterized class of neurons whose activation patterns are similar across independently trained models (Dravid et al., 2023). In separate analyses of language models up to 30B parameters and vision models up to 5B parameters, we observe that the population of Rosetta Neurons follows a sublinear power law in model size, growing in absolute number but occupying a shrinking fraction of the total neuron count. We further observe a Neuron Polarization Effect: Rosetta Neurons become more selective and increasingly monosemantic with scale, separating from a growing non‑Rosetta population that remains less selective. An analytical model balancing feature utility against limited neuron capacity explains the sublinear power‑law scaling and this polarization effect. Finally, we find that Rosetta Neurons become more domain‑specialized with scale and illustrate their selectivity through a targeted data‑filtering case study for continued pretraining. Our results point to a scaling law for interpretable, shared neuron‑level structure, linking model size to systematic changes in neuron universality, selectivity, and specialization.
Authors:Tao Chen, Gangwei Jiang, Pengyu Cheng, Siyuan Huang, Yihao Liu, Jingwei Ni, Jiaqi Guo, Mengyu Zhou, Kai Tang, Junling Liu, Qinliang Su, Xiaoxi Jiang, Guanjun Jiang
Abstract:
Reward models (RMs) provide critical feedback signals for LLM post‑training, notably in reinforced fine‑tuning (RFT) and reinforcement learning (RL) pipelines. However, current reward evaluation relies on heterogeneous criteria such as rule‑based verifiers, ground‑truth references, procedural checklists, and complex rubrics, where a unified mechanism to integrate all types of evidence remains unexplored. To this end, we propose Skill Reward Model (Skill‑RM), a unified framework that reformulates reward modeling as the execution of a reusable Reward‑Evaluation Skill. By treating reward computation as a structured agentic task, Skill‑RM provides a consistent interface to orchestrate heterogeneous resources, dynamically selecting and aggregating evidence tailored to the specific requirements of each input. This approach enables the reward model to move beyond static evaluation, ensuring consistency and transparency across diverse tasks. Extensive experiments on reward benchmarks and downstream applications, including best‑of‑N selection and reinforcement learning, demonstrate that Skill‑RM consistently outperforms traditional judge baselines. Our findings suggest that Skill‑RM not only provides a unified solution for reward modeling but also achieves superior performance through the strategic and dynamic orchestration of evidence. The code is at https://github.com/Qwen‑Applications/Skill‑RM.
Authors:Areeb Gani, Asal Meskin, Gabrielle Kaili-May Liu, Arman Cohan
Abstract:
Reliable uncertainty communication is critical to the trustworthiness of LLMs, yet faithful calibration (FC)‑‑the alignment between models' intrinsic and (linguistically) expressed confidence‑‑is a persistent failure mode. This challenge is key for large reasoning models (LRMs), whose extended reasoning traces are often interpreted by users as evidence of deliberation, competence, and confidence. Despite the importance of FC and wide usage of LRMs, the extent to which LRMs can faithfully express their confidence remains poorly understood. Moreover, the prevailing paradigm to measure FC does not generalize well to the long chain‑of‑thought outputs generated by LRMs, which tend to lack clear step boundaries, involve inconsistent step structure, and encode complex conditional dependencies throughout the trace‑‑complicating estimation of intrinsic confidence. To address this challenge, we introduce a novel framework to systematically quantify FC of LRMs. Our framework analyzes linguistic decisiveness relative to three sources of internal uncertainty, based on token probabilities, hidden states, and sampled response consistency. We also devise a prefix‑conditioned sampling approach to control for conditional and structural variation across traces. Applying our framework to a diverse suite of leading models, datasets, and prompts, we find that faithful confidence expression is a significant challenge for LRMs. Reasoning behaviors do not automatically translate to improved FC, and prompt interventions for non‑reasoning models do not improve faithfulness in the reasoning setting. Different confidence estimators further produce divergent assessments of the same traces, revealing fragility in prior evaluation methodologies. Taken together, our work establishes FC as a distinct reliability and alignment target for LRMs, particularly as such systems are increasingly deployed in high‑stakes contexts.
Authors:Yu Xia, Zhouhang Xie, Xin Xu, Byungkyu Kang, Prarit Lamba, Xiang Gao, Julian McAuley
Abstract:
Large language models improve final‑answer accuracy through extended chain‑of‑thought reasoning, but often spend tokens inefficiently and offer little inference‑time control. Existing efficient reasoning methods control thinking length by shortening, early‑stopping, or compressing traces, leaving how the model thinks implicit. In this paper, we propose Agentic Chain‑of‑Thought Steering (ACTS), which formulates reasoning steering as a Markov decision process where a controller agent adaptively steers a frozen reasoner during inference. At each step, the controller observes the reasoning trace and remaining thinking budget, then issues a steering action consisting of a reasoning strategy and a steering phrase that initiates the next reasoner step. This enables budget‑aware strategy control for efficient reasoning while preserving the reasoner's generation continuity. We initialize the controller agent from our constructed synthetic steering trajectories with multi‑budget augmentation, and further optimize it via reinforcement learning with budget‑conditioned reward shaping. Experiments across multiple benchmarks show that ACTS matches full‑thinking performance with substantial token savings, and enables controllable accuracy‑efficiency trade‑offs across different reasoners and tasks. The code is available at https://github.com/Andree‑9/ACTS.
Authors:Ting-Yun Chang, Harvey Yiyun Fu, Deqing Fu, Chenghao Yang, Jesse Thomason, Robin Jia
Abstract:
Reasoning models improve accuracy through extended chains of thought, but their long outputs create a memory and compute bottleneck. KV cache eviction methods reduce this cost by evicting unimportant key‑value pairs from the cache, yet they often yield worse accuracy than selection‑based sparse attention alternatives, which keep the full KV cache. We identify key factors crucial to KV cache eviction accuracy. First, a small fraction of value states have abnormally large magnitudes, and evicting them causes catastrophic failure where models enter repetitive reasoning loops. Second, introducing stochasticity during eviction improves accuracy by increasing cache diversity. Based on these findings, we propose Value‑aware Stochastic KV Cache Eviction (VaSE), a training‑free recipe that protects large‑magnitude value states and promotes diverse eviction decisions. Across six reasoning tasks, Qwen3 models using VaSE with 4x KV cache compression yield higher average accuracies than SOTA selection method at the same sparsity, while outperforming the strongest eviction method by more than 4%. Overall, VaSE bridges the gap between efficiency and accuracy, supporting FlashAttention2 and enabling a static memory footprint for reasoning models.
Authors:Lin Li, Georgia Channing, Suhaas M Bhat, Gabriel Davis Jones, Yarin Gal
Abstract:
Large language models (LLMs) have achieved remarkable progress in open‑ended text generation, yet they remain prone to hallucinating incorrect or unsupported content, which undermines their reliability. This issue is exacerbated in long‑form generation due to hallucination snowballing, a phenomenon where early errors propagate and compound into subsequent outputs. To address this challenge, we propose a novel inference‑time hallucination mitigation framework, named Segment‑wise HAllucination Rejection Sampling (SHARS), which uses an arbitrary hallucination detector to identify and reject hallucinated segments during generation and resample until faithful content is produced. By retaining only confident information and building subsequent generations upon it, the framework mitigates hallucination accumulation and enhances factual consistency. To instantiate this framework, we adopt semantic uncertainty as the detector and introduce several vital modifications to address its limitations and better adapt it to long‑form text. Our method enables models to self‑correct hallucinations without requiring external resources such as web search or knowledge bases, while remaining compatible with them for future extensions. Empirical evaluations on standardized hallucination benchmarks demonstrate that our method substantially reduces hallucinations in long‑form generation while preserving or even improving the informativeness of generation. Code is available at: https://github.com/TreeLLi/hallucination‑rejection‑sampling.
Authors:Yucheng Zhou, Wei Tao, Yiwen Guo, Jianbing Shen
Abstract:
World models and multimodal large language models (MLLMs) provide complementary capabilities for predicting future outcomes from static visual observations. World models can generate concrete visual rollouts of possible futures, while MLLMs can reason abstractly over questions, goals, and rules. However, generated rollouts are stochastic and may be visually plausible but task‑incorrect, making it necessary to determine when visual simulation is useful, whether a rollout is credible, and how it should influence the final answer. We formulate this problem as controlled concrete reasoning, where a model learns to invoke, verify, and integrate visual future simulation alongside abstract reasoning. To study this setting, we construct two human‑verified benchmarks, VRQABench for controllable spatial lookahead and OpenWorldQA for open‑domain physical prediction, and propose Privileged‑Future On‑Policy Self‑Distillation (PF‑OPSD). During training, PF‑OPSD uses ground‑truth future videos and answers only as teacher‑side privileged context to evaluate on‑policy concrete‑reasoning trajectories, while the deployable student never observes true futures at test time. Experimental results show that PF‑OPSD outperforms baseline by 10.6% and 10.9% on VRQABench and OpenWorldQA, respectively, while increasing robustness to noisy or conflicting rollouts. Our code and dataset are available at https://github.com/yczhou001/PF‑OPSD.
Authors:Bo Peng, Kaiwen Wu, Sirui Chen, Zhiheng Wang, Yu Qiao, Chaochao Lu
Abstract:
Causal discovery from observational data remains challenging due to the fundamental limitations of purely statistical methods, such as statistical distinguishability within equivalence classes and sensitivity to finite sample sizes. While large language models (LLMs) offer a promising source of domain knowledge to complement statistical inference, existing LLM‑augmented methods are vulnerable to LLM errors and incur high token costs. Moreover, reliance on a single data‑centric algorithm can make results sensitive to algorithm‑specific biases. To address these limitations, we propose CauTion, a framework that reliably integrates LLM domain knowledge into an ensemble of statistical causal discovery algorithms through consensus filtering and LLM reliability estimation. CauTion proceeds in three stages. First, an algorithm ensemble utilizes a consensus voting to resolve up to 96% of edges on which algorithms agree, achieving near‑perfect accuracy on the filtered consensus edges. Second, a trust‑calibrated arbitration mechanism estimates the relative reliability of the LLM and the algorithms via an annotation‑free trust calibration procedure, which is then utilized to govern a trust‑weighted voting process that restricts LLM arbitration exclusively to edges with unreliable algorithmic evidence. Third, a cycle repair step is applied to guarantee the final causal graph is validly acyclic. Experiments on six datasets demonstrate that CauTion consistently outperforms both data‑centric and LLM‑augmented baselines, with larger gains on larger graphs and strong robustness to LLM errors. Code is available at https://github.com/OpenCausaLab/CauTion.
Authors:Anling Xiang, Yuwen Yang, Yang Shen
Abstract:
Scholarly text classification supports literature organization, subject indexing, and research intelligence, but Chinese scholarly corpora often contain imbalanced and semantically adjacent disciplinary labels. We propose AutoTail‑BSFGM, a class‑balance‑aware fine‑tuning method that combines an automatically gated tail‑prior adjustment, a weak Balanced Softmax auxiliary loss, and Fast Gradient Method adversarial regularization. The method changes only the training objective and procedure; inference uses the same single base‑size encoder and linear classifier as the corresponding label‑smoothed baseline. We evaluate the method on two CSL‑based tasks: an abstract‑to‑discipline task with 67 labels and a title‑to‑category task with 13 categories. On the primary abstract task, AutoTail‑BSFGM improves validation and lockbox accuracy under both Chinese RoBERTa‑WWM and MacBERT‑base. With MacBERT‑base, validation accuracy increases by 0.83 percentage points and lockbox accuracy by 0.49 points, with a pooled paired McNemar signal on validation (p = 0.023). On the title task, the method improves validation accuracy by 0.70 points and validation balanced accuracy by 2.64 points; lockbox accuracy is approximately neutral while lockbox balanced accuracy improves by 1.22 points. The results support a bounded contribution: AutoTail‑BSFGM improves class‑balance‑sensitive behavior and yields consistent gains for abstract‑based scholarly classification, without uniformly improving every metric on every split.
Authors:Muhammad Ali
Abstract:
We present BaltiVoice, a 16.8‑hour read‑speech corpus for Balti (ISO 639‑3: bft), a Tibetic language spoken in Gilgit‑Baltistan, Pakistan, with no prior publicly available ASR resources. The corpus contains 10,060 validated utterances in native Nastaliq script, derived from Mozilla Common Voice recordings. We fine‑tune OpenAI Whisper‑small on this corpus and report a Word Error Rate (WER) of 30.07% on a held‑out validation set of 538 utterances, down from a measured zero‑shot baseline of 182.18% for Whisper‑small on Balti. The dataset, fine‑tuned model, and a live transcription demo are publicly available on HuggingFace.
Authors:Canbin Huang, Tianyuan Shi, Xiaojun Quan, Jingang Wang, Jianfei Zhang, Qifan Wang
Abstract:
Model merging has emerged as a cost‑effective approach for consolidating the capabilities of multiple LLMs without retraining. However, existing merging techniques, largely based on linear parameter arithmetic or optimization, struggle when applied to Mixture‑of‑Experts (MoE) architectures. We identify a critical failure mode in MoE merging, termed routing breakdown, in which the merged router fails to dispatch tokens to suitable experts. Routing breakdown stems from the sensitivity of the non‑linear softmax and discrete Top‑k routing mechanisms to parameter perturbations from merging, a sensitivity further amplified by load‑balancing constraints imposed during MoE pretraining. Because fine‑tuned experts exhibit distinct specializations, even modest misrouting can cause severe performance degradation. To address this issue, we propose Hessian‑Aware Router Calibration (HARC), a training‑free framework that leverages second‑order curvature information to realign the merged router. This approach admits a closed‑form solution that can be efficiently solved using a matrix‑free conjugate gradient method. Experiments on mathematical reasoning and code generation tasks show that HARC effectively mitigates routing breakdown across diverse MoE merging baselines and leads to substantial performance improvements. Our code is available at https://github.com/huangcb01/HARC.
Authors:Daniil Krasnoproshin, Maxim Vashkevich
Abstract:
Speech emotion recognition is an important component of modern human‑computer interaction systems. However, many state‑of‑the‑art approaches rely on large pretrained models with high computational and memory requirements, limiting their applicability. This paper proposes ResLSTM‑SA, a lightweight architecture that integrates residual connections with soft attention within an LSTM‑based framework. Evaluated on the RAVDESS dataset under strict speaker‑independent partitioning, the proposed model outperforms conventional attention‑based LSTM baselines and several previously reported CNN‑ and hybrid CNN‑LSTM architectures in terms of unweighted average recall (UAR). The best‑performing variant (ResLSTM‑SA‑h64) achieves a maximum UAR of 0.6517 with only 46.8k trainable parameters, delivering competitive accuracy with three orders of magnitude fewer parameters than large‑scale self‑supervised alternatives, thereby enabling efficient deployment on edge devices and real‑time voice assistants. The source code is available at https://github.com/Mak‑Sim/ResLSTM‑SER.
Authors:Lujie Ban, Jiasheng Shi, Jinyang Li, Xiaolin Han, Tsz Nam Chan, Chenhao Ma
Abstract:
Text‑to‑SQL systems are typically evaluated by query‑level execution correctness, but this terminal signal provides little guidance about which intermediate SQL decision caused success or failure. Token‑level dense supervision is also ill‑suited: SQL tokens do not align with complete semantic decisions, can penalize execution‑equivalent queries, and are difficult to label reliably at scale. We therefore propose CAPER, which automatically derives clause‑level supervision via counterfactual intervention on the SQL abstract syntax tree, enabling root‑cause error localization for reward modeling; the resulting data is used to train CAPER‑9B, a lightweight Clause‑PRM that provides clause‑boundary feedback for policy optimization and candidate verification. Experiments on BIRD and Spider show that clause‑aligned supervision not only improves execution accuracy, achieving up to a 15.3% relative EX improvement over GPT‑5.4, but also strengthens failure‑localization capability, reaching 84.53% accuracy and 90.60% MRR on held‑out failures. Our project page is at https://github.com/banrichard/RL‑NL2SQL.
Authors:Xuan Yang, Hao Xu, Tingfeng Hui, Hongsheng Xin, Kaike Zhang, Chunxiao Liu, Ning Miao
Abstract:
Despite great advances in tool‑use capabilities of large language models (LLMs), existing evaluation benchmarks struggle to fully align with real‑world scenarios. Such benchmarks mostly rely on simulated idealized user assumptions and lacks experience‑oriented evaluation. These limitations fail to account for the ambiguity, uncooperative behaviors, and shifting intentions characteristic of real‑world users. To fill this gap, we propose RUT‑Bench, a dedicated benchmark designed to assess LLMs under diverse Real‑world User Tool calling scenarios. RUT‑Bench supports high‑fidelity simulations covering both ideal rational patterns and heterogeneous non‑ideal behaviors across single‑turn and multi‑turn dialogues. We conduct comprehensive evaluations on 19 widely adopted open‑source and proprietary LLMs using our benchmark. Experimental results reveal that no tested LLMs achieve an overall success rate above 40%, and nearly all of them experience noticeable performance drops when facing more complicated non‑ideal user inputs. Our code and data is available at https://github.com/Miaow‑Lab/RUT‑Bench.
Authors:Zirui Yan, Dennis Wei, Dmitriy A. Katz, Prasanna Sattigeri, Ali Tajer
Abstract:
Causal tracing systematically intervenes on a large language model's (LLM's) internal representations to uncover and quantify the causal pathways linking specific inputs or computations to specific metrics of interest, quantifying the LLM's behavior. Building on previous single‑component or single‑layer studies, this paper presents a unified framework for causally tracing multiple components simultaneously. This framework systematically identifies the subsets of components (e.g., attention heads and multi‑layer perceptron neurons) most critical to a desired target performance metric (e.g., accuracy and fairness). This is achieved by incorporating flexible interventions applied to a wide range of desired metrics. To address the combinatorial complexity of the multi‑component problem, an efficient algorithm is designed that leverages soft interventions and a carefully designed metric transformation, converting the combinatorial search problem into a continuous one that can be solved efficiently under proper constraints, thereby generating proper binary decisions for selecting components. Experimental results demonstrate that the proposed method efficiently identifies subsets of the model's components that have a high impact on the target metric, outperforming existing baseline approaches. Our code is available at https://github.com/ZiruiYan/multi‑component‑causal‑tracing.
Authors:Mingkuan Zhao, Wentao Hu, Tianchen Huang, Yuheng Min, Suquan Chen, Yide Gao, Yanbo Zhai, Shuangyong Song, Xuelong Li
Abstract:
Hallucination in Large Language Models (LLMs), characterized by the generation of content inconsistent with contextual facts or logical constraints ‑‑ remains a persistent challenge for reliable deployment. In this work, we address this issue through a geometric framework rooted in the linear representation hypothesis. We propose that hallucinations manifest as orthogonal noise relative to the semantic manifold of the residual stream. Specifically, we hypothesize that while attention heads ideally propagate information congruent with the context subspace, hallucinations arise when specific heads introduce components orthogonal to this subspace, disrupting the coherence of the latent representation. Based on this formulation, we introduce Dynamic Contextual Orthogonalization (DCO), an inference‑time intervention method. DCO utilizes the input residual stream as a dynamic context anchor to perform orthogonal decomposition on attention head outputs. To distinguish between context‑aligned semantic updates and divergent noise, DCO employs a layer‑wise Z‑score suppression mechanism that selectively attenuates outlier orthogonal components based on statistical distributions. Evaluations on Llama‑3‑8B and 70B across benchmarks such as XSum, NQ‑Swap, and IFEval demonstrate that DCO achieves superior contextual faithfulness compared to state‑of‑the‑art intervention baselines. Furthermore, DCO maintains high performance on knowledge‑intensive tasks like TriviaQA and TruthfulQA, effectively mitigating the trade‑off between hallucination suppression and parametric knowledge retention often observed in existing methods. Our findings validate the geometric interpretation of hallucinations and establish DCO as a computationally efficient approach for enforcing manifold alignment.Our code is available at https://github.com/Harry‑Miral/DCO
Authors:Siva Rajesh Kasa, Yasong Dai, Sumit Negi, Hongdong Li
Abstract:
Diffusion large language models promise parallel token generation, yet inference remains bottlenecked by deciding which masked tokens can be safely committed together. Fast‑dLLM addressed this with KV caching and confidence‑guided parallel decoding, but its decoding theory uses a homogeneous high‑confidence assumption that effectively reduces each candidate set to its weakest selected token. We argue that this leaves speed on the table because real decoding steps exhibit heterogeneous confidence profiles. We propose Fast‑dLLM++, a training‑free extension that introduces \emphFréchet profile decoding: selecting parallel commit sets from the full sorted confidence profile rather than a single worst‑case confidence. The resulting rule is a heterogeneous‑confidence generalization of Fast‑dLLM's factor selector and it recovers the previous rule exactly in the equal‑confidence case and adds a provable \emphheterogeneity bonus when the selected tokens have uneven confidences. Fast‑dLLM++ leaves the model, diffusion process, and cache implementation entirely unchanged, making it a drop‑in replacement for existing Fast‑dLLM decoding. Experiments on GSM8K, MATH, HumanEval, and MBPP with the LLaDA‑8B model show that the theoretical improvement translates directly into empirical gains: profile‑aware selection improves the accuracy‑‑throughput frontier by exploiting safe parallelism that weakest‑token rules miss, achieving up to 37% higher throughput at comparable accuracy. Our anonymous code release is at https://github.com/Ringo‑Star/FastdLLM_plusplus.
Authors:Nikolaj Hindsbo, Sina Ehsani, Pragyana Mishra
Abstract:
Deploying language‑driven agents in robotics requires evaluations that reflect real‑world task demands: natural‑language instructions with reproducible outcomes. Such agents must connect language models to callable perception and control tools, and be assessed using deployment‑critical metrics including latency, accuracy, and error modes. We present SCOPE (Simulation and Camera Operations for Perception and Evaluation), a modular agent for natural‑language, open‑vocabulary pan‑tilt‑zoom (PTZ) camera control and visual scene understanding, designed explicitly for edge deployment. SCOPE operates both in a Blender‑based simulation environment and on a physical PTZ camera, executing all perception, planning, and control locally at the deployment site using edge‑accessible compute. We release a 536‑task benchmark spanning QA, single‑ and multi‑step commands, counting, spatial reasoning, descriptions, and optical character recognition in a Blender‑based simulation environment that exposes realistic PTZ control affordances. Execution traces are combined with an LM‑as‑Judge to evaluate latency, accuracy, and error modes. We evaluate 19 planner‑perception model combinations pairing Qwen3 small language models (SLMs) with Moondream and Qwen vision‑language models (VLMs). Stronger SLMs substantially reduce hallucinations and improve tool routing, leading to more reliable closed‑loop behavior. Once a sufficiently capable SLM is used, perception becomes the dominant performance bottleneck. Mixture‑of‑Experts models on both the planning and perception side consistently match or exceed dense alternatives at latencies and memory footprints comparable to much smaller networks. Quantization provides additional efficiency gains with minimal accuracy degradation, identifying a practical, sim‑to‑real validated design point for real‑time, edge‑feasible language‑driven PTZ control.
Authors:Muyu He, Yuchen Liu, Qingya Huang, Li Zhang
Abstract:
The success of the transformer architecture as the backbone of modern LLMs is in large part due to its use of attention layers. An attention layer follows the standard neural network paradigm: it takes the residual stream as input and thereby produces context‑dependent query, key, and value vectors. However, we find that model performance meaningfully improves when deeper layers learn only a context‑free value vector to preserve the original token information, without drawing on any context from the residual stream. When the model has access to this context‑free value vector, adding back the context‑dependent component provides little additional benefit for aggregate benchmark performance. Such context‑free value vectors can be stored as sparse model parameters, eliminating the need to recompute or persistently cache these values. Through systematic ablations on the key design choices for such context‑free value vectors, we propose Bank of Values (BoV), a new way of computing value vectors in attention by learning a lookup table of token‑specific value vectors for each of the last third of layers. Across 135M and 780M models, BoV improves validation loss over standard attention and, at 780M, the average score across 21 benchmarks, matching the previous best method that adds token information to the value vector with less compute and memory.
Authors:Andrianos Michail, Elias Schuhmacher, Juri Opitz, Simon Clematide, Rico Sennrich
Abstract:
Dense retrieval models exhibit positional bias: retrieval effectiveness degrades when relevant information appears later in a passage (Zeng et al., 2025). We ask whether this bias can be reduced at inference time, without retraining and without sacrificing overall retrieval effectiveness. To this end, we adapt inference‑time attention calibration (Schuhmacher et al., 2026) to downstream retrieval and extend it with a strength coefficient lambda that interpolates between the original and fully calibrated attention distributions. Across three embedding models on SQuAD‑PosQ and FineWeb‑PosQ, we examine how basket size, calibrated layer set, and strength affect the trade‑off between positional fairness and retrieval effectiveness, finding that partial calibration frequently outperforms full calibration. A single configuration (B=128, lambda=0.5, 50% layer depth) improves the harmonic mean of nDCG@10 across positional groups on FineWeb‑PosQ for all three models without per‑model tuning, and applies to both <s>‑pooled and last‑token‑pooled architectures. This default configuration transfers without modification to PosIR, which spans 10 languages and 31 domains, reducing the Position Sensitivity Index in all 16 length‑quartile x model x retrieval‑setting combinations, while preserving or improving aggregate nDCG@10. We release our extended codebase at https://github.com/impresso/fair‑sentence‑transformers
Authors:Yuying Li, Leqi Zheng, Yongzi Yu, Wenrui Zhou, Xuchang Zhong, Xing Hu, Jing Jin, Hangjie Yuan, Tao Feng
Abstract:
On‑Policy distillation (OPD) in large language models is shifting from full‑trace KL supervision toward more selective training paradigms. Recent OPD methods increasingly focus on selecting which trajectories to learn from, which tokens are most informative, and which supervision signals are most reliable. Motivated by this trend, we rethink optimization granularity of OPD and propose \fireicon\ FiRe‑OPD (Filter, then Reweight), which jointly adjusts supervision signals at both trajectory and token levels. In details, FiRe‑OPD first filters trajectories to remove low‑quality rollout samples, and then applies soft reweighting within the retained trajectories to emphasize informative tokens. Compared with hard token selection, FiRe‑OPD leverages a soft‑weighting mechanism to effectively mitigate information loss and enhance optimization stability, thereby achieving finer‑grained OPD optimization. We validate the effectiveness of FiRe‑OPD across strong‑to‑weak, single‑teacher, and multi‑teacher settings, and demonstrate its superiority over recent token‑level OPD methods ( (e.g., +6.25 on AIME 2024 in strong‑to‑weak, +18.81 on Miner in multi‑teacher). Our code is available at https://github.com/YuYingLi0/FiRe‑OPD.
Authors:Jiebin Zhang, Zhenghan Yu, Song Liu, Eugene J. Yu, Zheng Li, Dawei Zhu, Jiangshan Duo, Weimin Xiong, Yifan Song, Guanghua Yu, Jianchen Zhu, Sujian Li
Abstract:
Block diffusion speculative decoding accelerates LLM inference by predicting all tokens within a block simultaneously for the target model to verify in parallel. Predicting an entire block at once requires a sufficiently capable draft model and effective utilization of the target model's internal knowledge. However, the state‑of‑the‑art method DFlash constrains all draft layers to share a single fused representation derived from only a few target layers, limiting per‑layer expressiveness and hindering further scaling of draft capacity. In this paper, we present \modelname, which flares out the narrow conditioning bottleneck of DFlash through a lightweight layer‑wise fusion mechanism: each draft layer attends to its own learnable combination of a broad set of target layers at negligible overhead, simultaneously injecting richer target knowledge and providing every draft layer with a distinct input. This enhanced per‑layer expressiveness enables scaling the draft model to deeper architectures with consistent gains. We further scale training data from 800K to 2.4M samples to fully exploit the enlarged capacity. On six benchmarks spanning mathematical reasoning, code generation, and conversation, \modelname attains average wall‑clock speedups of 5.52x on Qwen3‑4B, 5.46x on Qwen3‑8B, and 3.91x on GPT‑OSS‑20B, improving over DFlash by roughly 11%, 8%, and 5% respectively. Our code is available at https://github.com/Tencent/AngelSlim.
Authors:Peihan Liu, Lucas Rosenblatt, Weiwei Kong, Natalia Ponomareva, Gautam Kamath, Rachel Cummings, Roxana Geambasu, Yu Gan, Lillian Tsai, Alex Bie
Abstract:
Differentially private (DP) text synthesis promises to unlock sensitive corpora for model training, but it remains unclear whether DP synthetic data transmits genuinely new knowledge and capabilities present only in those corpora. This is because existing evaluations rely on tasks that are nearly solvable without training, so strong benchmark performance does not establish that DP synthesis can substitute original data access. Thus, we introduce ContinuousBench, a continuously and automatically‑regenerated benchmark that measures capability gain from DP synthetic text. Each quarter, a new release pairs a never‑before‑seen training corpus with a derived QA set, constructed to be: (1) unsolvable sans‑corpus; and (2) learnable under DP, as the tested knowledge is supported by hundreds of independent records. Researchers produce DP synthetic data from the training corpus and run our standardized training and evaluation harness on their synthetic data to measure gains. We instantiate two tracks: Geminon, a procedurally‑generated dataset about fictional creatures; and News, a stream of newly crawled public news articles. Although standard benchmarks are nearly saturated, on ContinuousBench we find that non‑private synthesis transfers substantial knowledge from the original corpus, while state‑of‑the‑art DP synthesis methods generally fail to do so, even at \varepsilon=100.
Authors:Junjie Chen, Yuxi Dong, Haitao Li, Weihang Su, Yujia Zhou, Min Zhang, Yiqun Liu, Qinyao Ai
Abstract:
As large language models (LLMs) are increasingly used for long‑form generation, reliably evaluating long‑form outputs has become a critical challenge. LLM‑as‑a‑judge offers a scalable alternative to human evaluation, yet its reliability in long‑form output evaluation remains underexamined: existing meta‑evaluation benchmarks focus mainly on short‑form outputs. Compared with short‑form evaluation, long‑form evaluation is not merely a matter of output length; it often requires judges to make more complex document‑level assessments of overall organization, task‑relevant coverage and depth, cross‑section consistency, and scenario‑specific quality criteria. In this work, we introduce LongJudgeBench, a comprehensive benchmark for evaluating LLM judges on long‑form outputs across diverse real‑world scenarios and judging protocols. We systematically evaluate a broad range of LLM judges, covering multiple base models and judging settings. Our results reveal a substantial reliability gap: current LLM judges remain unstable across scenarios, and rubrics or references are helpful but not always sufficient. We hope LongJudgeBench will support future research on more robust, context‑aware, and human‑aligned LLM‑as‑a‑judge methods. Our code is available at https://github.com/cjj826/LongJudgeBench.
Authors:Chad A. Capps
Abstract:
We present CART (Context‑Anchored Recurrent Transformer), a parameter‑efficient language model that reuses a single shared core block R times across depth. Unlike prior looped transformers that recompute key‑value tensors at every iteration, CART computes K and V once from a multi‑layer prelude and has the recurrent core cross‑attend to those frozen tensors via multi‑head latent attention. A learned Linear Time‑Invariant (LTI) gate keeps the recurrence stable: its spectral radius settles in a narrow band (rho in [0.79, 0.83]) across all 36 fully‑trained configurations. We evaluate CART on single consumer GPUs in two stages: a 64‑configuration screen at 3,000 steps, then 36 configurations (P=6, R in 6,8,10, three seeds) trained for 30,500 steps (~1B tokens). Two patterns hold across widths d in 256,512,768,1024: prelude depth P dominates loop count R, and the Stage‑1 ranking of R reverses at full training (R=6 becomes best at d>=512). At the binding d=1024 parameter‑parity test, CART does not beat a parameter‑matched dense baseline, losing by 1‑2% at stored‑parameter parity and by ~10% at effective‑parameter parity. Diagnostic ablations split the effective‑parameter gap into ~5% from weight sharing and a residual ~5% from the heterogeneous prelude/anchor/core/coda framing; the recurrent‑core machinery (hyper‑connections, LTI gate, loop‑index embedding) is individually vestigial. Variable‑R inference degrades on both sides of the trained R, a negative result for test‑time depth scaling under this recipe.
Authors:Wentao Mo, Yang Liu
Abstract:
Current 3D spatial reasoning methods face a fundamental trade‑off: neuro‑symbolic 3D (NS3D) concept learners achieve interpretable reasoning through compositional programs but are constrained to closed‑set concept vocabularies and simple programs; end‑to‑end 3D multi‑modal LLMs (3D MLLMs) could handle complex natural language and open‑vocabulary concepts but suffer from black‑box reasoning without explicit spatial verification. We introduce APEIRIA, a neuro‑symbolic 3D MLLM to bridge two paradigms by distilling symbolic reasoning patterns into MLLMs with natural language chain‑of‑thought. Our three‑stage curriculum progressively builds reasoning capabilities: a) 3D perception alignment grounds object visual‑geometric features to the LLM, b) CoT‑SFT teaches query decomposition and stepwise verification from symbolic program traces, and c) CoT‑RL extends reasoning patterns to open‑set concepts and deeply nested instructions. By transferring reasoning patterns rather than concept‑specific knowledge, APEIRIA preserves key NS3D virtues: transparent reasoning and modular interchangeability of planning and perception components. Evaluations on grounding, question answering, and captioning show that APEIRIA surpasses prior NS3D methods and matches state‑of‑the‑art 3D MLLMs on 3D spatial reasoning datasets, unifying symbolic methods' systematic reasoning with MLLMs' flexibility. Code is available at https://github.com/oceanflowlab/APEIRIA.
Authors:Qi Han Wong
Abstract:
We investigate whether large language models produce different medical triage recommendations for identical symptoms based solely on the language of the patient prompt. Using Gemini 3.5 Flash, we evaluate a neurological symptom profile (persistent headache, blurred vision, nausea) across six languages (English, Spanish, Chinese, Hindi, Japanese, Arabic) with 30 runs per condition (n=450 total API calls). We find that the model recommends emergency room visits at rates ranging from 0% (Japanese, Hindi) to 30% (English, Arabic), despite assigning nearly identical severity scores (7.7‑8.0/10) across all languages. Adding a single sentence specifying the patient's US location increases ER recommendations by up to 76.7 percentage points for non‑English prompts, while the reverse anchor (English prompt with a Tokyo location) reduces the ER rate from 30% to 6.7%. A back‑translation control (Japanese to English) produces ER rates comparable to the English baseline, confirming that the disparity is not caused by translation quality but by implicit geographic inference from the input language. We release the complete dataset, experiment code, and results.
Authors:Shailesh Rana
Abstract:
Language models do not simply choose an answer at the output layer. In a 9,000‑trajectory MMLU study across Qwen2.5‑7B‑Instruct, Llama‑3.1‑8B‑Instruct, and Mistral‑7B‑Instruct‑v0.3, the score of the answer moves across depth in structured ways. We describe each trajectory with three quantities: the current answer margin, the next‑layer change in that margin, and the distance from a decision flip. The main empirical picture is that correctness and stability are different: the largest group is unstable‑correct, not stable‑correct. A traced subset then asks what moves the margin. In stable‑correct cases, the average attention scalar points in the correct direction, while the average MLP scalar does not; span deletion shows that removing answer‑supporting text hurts the margin and removing distractor‑like text helps it. The result is not a full circuit explanation. It is a reproducible way to see which answers are settled, which remain fragile, and which measured sources move them.
Authors:Rashad Aziz, Ikhlasul Akmal Hanif, Fajri Koto
Abstract:
Safety alignment learned in high‑resource languages transfers poorly to low‑resource languages. Models refuse harmful prompts in English but fail to refuse when the same prompts are translated into Swahili or Burmese. Adaptive steering methods like AdaSteer and CAST inherit this failure cross‑lingually. We diagnose where transfer breaks down. Across Qwen2.5‑7B, Gemma‑2‑9B, and Llama‑3.1‑8B on 23 languages, the harmfulness direction extracted from high‑resource activations linearly separates harmful from harmless low‑resource prompts nearly as well as high‑resource ones. The relevant representation is present. Yet harmful refusal drops from 87.9% to 43.9%. The model fails to convert the representation into refusal. What fails to transfer is calibration of the safety decision, not the underlying representation. We exploit this by recalibrating, rather than retraining, a high‑resource gate: a low‑rank logistic readout with its decision threshold reset using as few as 1 to 4 target‑language examples per class. The gate routes between refusal steering and harmfulness‑direction ablation, substantially raising mean refusal selectivity (Δ = harmful ‑ harmless refusal) from 33.6 for the strongest adapted baseline to 54.5 while preserving MMLU utility. These results suggest that some low‑resource safety failures can be repaired by recalibrating existing representations rather than learning new ones. Our code is released: https://github.com/rashadaziz/low‑resource‑safety.
Authors:Haowei Han, Kexin Hu, Weiwei Cai, Debiao Zhang, Bin Qin, Yuxiang Wang, Jiawei Jiang, Xiao Yan, Bo Du
Abstract:
Command understanding systems in smart home ecosystems can automate device control and substantially improve user experience. However, while they perform well on precise utterances (e.g., "turn on the bedroom light"), they struggle with ambiguous or misaligned commands (e.g., "make the bedroom cozy"). Large language models (LLMs) generalize well across various domains and can outperform traditional rule‑based systems on such tasks, but their effectiveness is often constrained by scarce domain‑specific data, insufficient task‑specific adaptation, and high computational costs. In this paper, we propose an automated training data synthesis workflow using user logs and LLMs; then we build MiCU, a domain‑specific LLM that excels at command understanding. Specifically, we employ curriculum learning to inject domain knowledge into the base LLM, then we enhance its reasoning ability via cold‑start training combined with reinforcement learning (RL) guided by domain‑specific thinking rules. Additionally, we introduce a token compression technique that condenses device description into a single special token, substantially reducing inference overhead and enabling \model‑fast, an efficient variant optimized for long inputs. Extensive experiments show that MiCU significantly outperforms baselines, with an average accuracy gain of 20.01% across all device categories. We have deployed MiCU in the Xiaomi Home app, receiving approximately 1.7 million page views per day. Production evaluations show that MiCU reduces user correction rate by 1.57% and increases human audited accuracy by 32.05%. Our data and code are available at https://github.com/xiaomi‑research/iot_spec_llm
Authors:Tao Feng, Tianyang Luo, Jingjun Xu, Zhigang Hua, Yan Xie, Shuang Yang, Ge Liu, Jiaxuan You
Abstract:
Experience learning has achieved promising results in enhancing LLM agent planning and reasoning by integrating past interactions as reusable knowledge. However, existing methods remain confined to explicit text space, retrieving experiences via semantic similarity and concatenating them into the context window, leading to substantial token overhead and a decoupled architecture that separates retrieval from generation. To address these limitations, we propose ExpWeaver, a framework that enables LLM agents to learn from experience via latent retrieval‑augmented generation, without requiring a separate RAG module. ExpWeaver encodes experiences using the LLM's own hidden states, retrieves relevant experiences directly in latent space at each decoding step, and integrates them through cross‑attention aggregation and gated residual mechanisms. The entire pipeline is optimized end‑to‑end with reinforcement learning, supporting both generative and ranking tasks. We evaluate ExpWeaver on 13 diverse tasks spanning question answering, reasoning, coding, scientific prediction, and recommendation. Results demonstrate that ExpWeaver achieves state‑of‑the‑art performance on 12 out of 13 tasks, outperforming the strongest baseline by over 6.8%; maintains token efficiency comparable to non‑retrieval baselines while text‑based retrieval methods require 1.5 to 2 times more tokens; and exhibits superior cross‑domain generalization, outperforming the strongest baseline by 16.32% under zero‑shot transfer and 15.21% under few‑shot transfer. Our code for ExpWeaver is released at https://github.com/ulab‑uiuc/ExpWeaver.
Authors:Sicheng Yang, Shulan Ruan, Shiwei Wu, Yu Liu, Lu Fan, Zhi Li, You He
Abstract:
While End‑to‑End (E2E) Speech‑Large Language Models (Speech‑LLMs) are rapidly evolving, their evaluation methodologies remain limited to the era of simple transcription. Existing benchmarks suffer from three critical limitations: a pronounced bias towards high‑resource languages, a focus on low‑level recognition (ASR) rather than semantic reasoning, and a neglect of regional dialects. To bridge this gap, we introduce PolySpeech‑100, a massive‑scale benchmark designed to assess `native‑level' speech comprehension across 110 linguistic variants. We employ a novel hybrid construction pipeline that augments gold‑standard human recordings with instruction‑driven synthetic speech, allowing us to cover 19 distinct Chinese dialects and over 80 low‑resource languages. Extensive evaluation of 22 state‑of‑the‑art models (including Gemini‑3, GPT‑Audio, and Qwen2.5‑Omni) yields pivotal insights. First, we demonstrate that open‑source E2E models outperform Cascade (ASR+LLM) systems on heavy dialects, proving that direct audio processing preserves critical paralinguistic cues and prosodic features (e.g., intonation, stress) that are often lost in standard transcription. Second, we reveal a significant performance gap: while commercial models maintain robustness, open‑source models suffer catastrophic degradation on low‑resource languages. Finally, counter‑intuitively, we observe that under standard zero‑shot settings, Chain‑of‑Thought prompting frequently degrades speech understanding performance for most evaluated models, revealing a potential modality alignment gap in current architectures. PolySpeech‑100 establishes a rigorous standard for the next generation of inclusive, omni‑capable Speech‑LLMs. The data, demo, and code are publicly available at https://github.com/YoungSeng/PolySpeech‑100.
Authors:Rana Muhammad Usman
Abstract:
LLM agents increasingly act after consuming ranked external information streams such as social feeds, search results, retrieval contexts, and email queues, yet safety evaluations almost always test the model or the user prompt in isolation, never the upstream ranker that decides what the agent reads just before it acts. We introduce a controlled protocol that holds the model, persona, topic, and final decision prompt fixed and varies only the composition and ordering of the posts an agent encounters during a preceding ten‑turn "scrolling" phase, isolating the causal effect of feed curation on a downstream decision. Across 2,785 decision rollouts on four modern open instruct LLMs from three independent labs, we identify three response regimes: adversarial capitulation, default saturation, and a default‑direction asymmetry in which a one‑sided feed tips a decision the model was genuinely uncertain about (in the clearest cases from 5% to 100%; Fisher p as low as 3 x 10^‑10) but cannot dislodge one it already favors or holds firmly. The effect follows a dose‑response curve, survives a generator swap that rules out a writing‑style artifact, generalizes across several decision domains including security‑relevant choices such as removing a deployment approval gate or relaxing access controls, and is partly mitigated by two simple feed‑level defenses; a frontier model retains its default. We characterize the recommender as a practical, default‑bounded control surface for LLM agents, and argue that agent evaluations must audit the feed layer rather than the final prompt alone.
Authors:Ming Wang, Shuang Wu, Bixuan Wang, Lu Lin, Yuxin Chen, Xiaocui Yang, Daling Wang, Shi Feng, Yifei Zhang, Yufan Sun
Abstract:
Self‑report questionnaires remain the prevailing tool for probing the psychological states of persona‑conditioned agents (PC‑Agents). However, classical instruments inherit two well‑known threats: contamination from training corpora and directional bias driven by social‑desirability or contextual framing. To overcome these methodological bottlenecks, we ask whether projective paradigms can be adapted into a robust psychometric tool. We introduce GenPT (Generative Projective Testing), which reformulates TAT, Rorschach, and SCT with newly generated stimuli and organizes assessment as a three‑stage pipeline to derive standardized psychological indicators and target states. Evaluating PC‑Agents induced via CharacterRAG and AnnaAgent profiles, we benchmark GenPT's reliability and validity against classical questionnaires. The results indicate that questionnaires exhibit systematic directional shifts under social‑desirability framing, most strongly on suicide ideation. In contrast, GenPT's collected behavioral patterns stay near the symmetric baseline. Furthermore, under a longitudinal counselling context, GenPT‑based depression assessment shifts by roughly an order of magnitude more than the questionnaire counterpart when Qwen3 serves as the backbone. Overall, GenPT complements self‑report methods in scenarios where contamination resistance, bias asymmetry, and context sensitivity matter. Code and stimuli can be found at https://github.com/sci‑m‑wang/GenPT.
Authors:Subhadip Mitra
Abstract:
Safety alignment in LLMs does not improve monotonically across model generations. Studying four generations of Google's Gemma family (7B‑31B) with quality‑diversity evolution (MAP‑Elites) as an automated red‑teaming probe, we find that Gemma 3 (12B) exhibits 68.7% +/‑ 5.7% attack success rate (ASR; mean +/‑ std, 3 seeds), significantly higher than its predecessor Gemma 2 (45.5% +/‑ 7.2%; p = 0.030, paired bootstrap) and its successor Gemma 4 (33.9% +/‑ 1.8%). Replaying evolved attack archives across generations reveals that attacks from other generations transfer to Gemma 3 at 44‑46% but only 14‑18% to Gemma 4, indicating that Gemma 4's safety gains generalize beyond the attack distributions evolved against earlier generations. Under our 8B judge, copyright and cybercrime vulnerabilities register at near‑100% across all generations, though a second‑judge audit (Section 6) suggests the copyright result is sensitive to judge choice. Misinformation ASR jumps from 29% to 99% between Gemma 2 and Gemma 3 and remains elevated at 77% in Gemma 4, indicating the regression was not fully addressed. These patterns are invisible to static benchmarks and emerge only through adaptive, longitudinal probing. All experiments use 3 random seeds with a unified self‑hosted judge; code and artifacts are available at https://github.com/bassrehab/red‑queen.
Authors:Thanh Luong Tuan
Abstract:
Enterprise multi‑agent systems increasingly expose multiple coordination patterns, but deployments often lack evidence for when to use consensus, debate, synthesis, or a simpler single‑agent workflow. This paper evaluates whether coordination strategy should be selected dynamically by problem class rather than fixed globally. We run a frozen matrix of 30 enterprise tasks spanning six industries, five problem classes, four execution conditions, three replications per cell, and four model arms: qwen_local, sonnet, gemma_openrouter, and an auxiliary openai cloud‑validation arm. All 1,440 generated outputs are judged by a fixed Sonnet rubric. The main finding is bounded and operationally useful, but it is not the original strict H1. The pre‑registered exact‑winner/CI criterion is not supported: exact winner identity is unstable across model arms, and several predicted strategies are close to, but not above, the best observed alternative. A weaker near‑best routing claim is strongly supported. In every pre‑registered model arm and problem class, and again in the auxiliary OpenAI validation arm, the predicted strategy is within 0.10 quality‑score points of the best observed condition. Structured compliance verification is the clearest exception to the original mapping: all arms favor single_agent rather than consensus. A pre‑registered Kendall's W test finds no reliable difference between Vietnamese‑domain and English‑domain tasks in how consistently the four coordination conditions are ranked (mean W of 0.20 in both strata; signed‑rank p = .85), so H2 is not supported. We conclude that enterprise coordination policy should use dynamic routing as a calibrated default, not as a deterministic winner‑selection law.
Authors:Subhadip Mitra
Abstract:
Current approaches to LLM adversarial testing suffer from coverage gaps: manual red‑teaming does not scale, LLM‑as‑attacker methods exhibit mode collapse, and gradient‑based approaches produce uninterpretable gibberish. We introduce a quality‑diversity evolutionary framework that operates at the semantic level, evolving interpretable attack strategies rather than token sequences. Using MAP‑Elites, we maintain a diverse archive of attacks across behavioral dimensions (strategy type, encoding method, prompt length). In experiments across GPT‑4o‑mini, Claude 3.5 Sonnet, Gemini 2.0 Flash, and an open‑weight coding model (Devstral‑small‑2), we discover distinct vulnerability profiles: GPT‑4o‑mini is vulnerable to hypothetical and multi‑turn framing combined with ROT13 encoding (fitness 0.8), Gemini to direct attacks with ROT13 and multi‑turn with Leetspeak (0.8), while Claude shows uniformly ambiguous responses across all strategies (max 0.4). The semantic representation produces interpretable attacks that reveal systematic, model‑specific weaknesses, providing actionable insights for improving LLM safety and a reproducible baseline for evaluating future frontier models. Code and experiment artifacts are released at https://github.com/bassrehab/red‑queen.
Authors:Shaohua Li, Xiuchao Sui, Xiaobing Sun, Yuhang Wu, Liangli Zhen, Yong Liu, Rick Siow Mong Goh
Abstract:
SwiGLU has become a standard gated activation in modern Transformer MLPs, yet its gate sharpness ‑‑ the smoothness and selectivity of the gating function ‑‑ is typically fixed throughout training. In this work, we propose Confidence‑Aware SwiGLU (κ‑SwiGLU), a variant of SwiGLU for Mixture‑of‑Experts (MoE) models that adjusts expert gate sharpness according to token‑level routing confidence. Specifically, κ‑SwiGLU parameterizes the SiLU gate sharpness coefficient as a learnable function of the router logit, enabling each expert gate unit to interpolate between smooth, broadly active gating and sharp, selective gating. We evaluate κ‑SwiGLU on the FineWeb‑Edu dataset across MoE Transformer models ranging from 8 to 28 layers. Across these settings, κ‑SwiGLU improves mean CORE performance while adding negligible parameters and incurring only a small computational overhead, demonstrating that confidence‑aware gate sharpness is a promising mechanism for improving MoE MLPs. The code is available at https://github.com/askerlee/kappa‑swiglu.
Authors:Hyundong Jin, Yo-Sub Han
Abstract:
Controlling language model outputs is essential for ensuring structural validity, reliability, and downstream usability, and diffusion language models are no exception. Recent advances in diffusion language model decoding have extended output control beyond regular constraints to context‑free grammar (CFG) constraints. Existing methods, however, can be up to four times slower than unconstrained decoding. More importantly, they substantially diminish one of the key advantages of diffusion language models over autoregressive models, namely parallel decoding. This slowdown arises because sequential validity checking introduces significant overhead during parallel generation. We propose an efficient CFG‑constrained decoding framework, EPIC, that addresses this limitation. Our method improves decoding efficiency by combining lexing memoization, validation using Earley‑style parsing instead of deterministic automata, and relaxed compatible subset selection for parallel commit. It reduces repeated lexing and validation overhead while allowing multiple compatible tokens to be committed together. Experiments on three benchmarks using four models show that our method reduces inference time by up to 67.5% and decreases the additional overhead by up to 90.5% compared with existing CFG‑constrained decoding methods. Our implementation is available at https://github.com/hyundong98/EPIC‑Decoding.git .
Authors:James Xu Zhao, Hui Chen, Bryan Hooi, See-Kiong Ng
Abstract:
Agentic search requires language model agents to explore many sources and answer complex information‑seeking questions. Scaling test‑time compute is a promising way to improve these agents, but current approaches can fail, because correct answers are often sparse and score‑based selection depends on model calibration. We propose FineVerify, a fine‑grained self‑verification framework that decomposes each question into checkable sub‑questions, verifies sampled candidates against each sub‑question, and selects the candidate with the highest aggregated score. This per‑check structure turns selection into simpler local judgments and produces scores under the same explicit criteria. Across four agentic search benchmarks and two models, FineVerify consistently outperforms standard scaling baselines. With only four sampled trajectories, it improves GPT‑5‑mini by 8.2 accuracy points and Gemini‑3‑flash by 5.6% on average. With 12 samples, FineVerify enables GPT‑5‑mini to surpass frontier GPT‑5 on BrowseComp‑Plus. Beyond accuracy, FineVerify produces interpretable verification traces that help audit benchmark errors, suggesting broader applications for inspecting agentic search systems. Code and data are available at https://github.com/XuZhao0/fineverify
Authors:Yitong Sun, Yao Huang, Teng Li, Ranjie Duan, Yichi Zhang, Xingjun Ma, Hui Xue, Xingxing Wei
Abstract:
Mixture‑of‑Experts (MoE) architectures scale Large Language Models (LLMs) efficiently, enabling greater capacity with reduced computational cost by dynamically routing inputs to relevant experts, yet introduce a critical vulnerability: Safety Sparsity, where safety capabilities concentrate in few experts, making them susceptible to adversarial bypassing. Meanwhile, conventional alignment methods uniformly adapt all parameters, ignoring their functional differences and inadvertently degrading performances. To address these challenges, we propose MESA (MoE Safety Alignment), a targeted alignment framework for MoE‑based LLMs that strategically decentralizes safety responsibility to maximize coverage while minimizing interference with utility. Based on Optimal Transport (OT) theory, MESA operates through two mechanisms: (1) Expert Capacity Reallocation uses a transport cost matrix to distribute safety duties to the most cost‑effective experts, and (2) Dynamic Routing Refinement constrains the router to precisely activate these decentralized modules. Experiments show that MESA achieves robust defensive performance against varied harmful benchmarks while preserving helpfulness. Code is available at https://github.com/lorraine021/MESA.
Authors:Qingshan Liu, Guoqing Wang, Wen Wu, Jingqi Huang, Xinqi Tao, Dejia Song, Jie Zhou, Liang He
Abstract:
Long‑horizon autonomous agents require memory systems to retain historical information, track evolving states, and reuse relevant knowledge beyond finite context windows. Existing agentic memory systems typically follow a memory construction‑retrieval (MCR) pipeline, but often adapt mainly the memory bank while keeping the surrounding pipeline fixed after deployment. This fixed‑pipeline design struggles to handle heterogeneous task‑specific failure modes and can become misaligned with memory banks that evolve in scale and structure over time. To address these limitations, we propose MemPro, a system‑level evolution framework that treats the entire MCR pipeline as an evolvable program rather than adapting only the memory bank or prompt text. MemPro maintains a version tree of runnable memory‑system implementations, where an Evolving Agent iteratively selects promising versions, diagnoses recurring failures, and creates improved child versions through failure‑mode‑guided edit‑debug refinement. Experiments on LongMemEval, LoCoMo, HotpotQA, and NarrativeQA show that MemPro consistently outperforms strong static and prompt‑level evolving baselines within a few iterations, continues to improve with evolution, and achieves a favorable performance‑cost trade‑off. Code is available at https://github.com/wanghai673/MemPro.
Authors:Shinwoo Park, Hyejin Park, Hyeseon An, Yo-Sub Han
Abstract:
Watermarking should identify language‑model output without degrading quality or limiting verification to the model provider. Multilingual deployment makes this harder because morphology, segmentation, and script change where watermark evidence can enter naturally. We introduce LUNA, a linguistically adaptive watermark that combines model‑free detection with single‑token non‑distortion under the standard random‑key model. LUNA estimates normalized next‑tag entropy from part‑of‑speech contexts in an external corpus and uses it to set the depth of a non‑distortionary binary tournament sampler; the detector reconstructs the same schedule from text, a tokenizer, a tagger, and a secret key. We evaluate six typologically diverse languages and two domains against eight primary baselines. LUNA attains an AUROC of 0.9959 and the lowest mean absolute median perplexity shift of 0.045 across the twelve settings; its 95% bootstrap interval [0.022, 0.073] lies below all baseline intervals. LUNA also records the lowest mean Self‑BLEU, Distinct‑1, surprisal, and entropy shifts. It is the only method that simultaneously achieves AUROC > 0.99 and an absolute median perplexity shift below 0.1 in a majority of settings, reaching this regime in 9 of the 12 settings while no baseline reaches it in more than 2. Our code is available at: https://github.com/Shinwoo‑Park/luna_watermark
Authors:Qiming Shi, Zhaolu Kang, Yunfan Zhou, Di Weng, Yingcai Wu
Abstract:
Large language models are increasingly deployed as tool‑augmented agents to acquire information beyond parametric knowledge. While recent work has improved long‑horizon tool‑use reasoning, most approaches focus on tasks with a single correct answer. In contrast, many real‑world queries require discovering a comprehensive set of valid answers, a setting known as Multi‑Answer QA. This setting raises two challenges: fine‑grained credit assignment over long search trajectories and reward alignment for sustained exploration beyond easy high‑frequency entities. We propose SPADER, a reinforcement learning framework for long‑horizon tool use in Multi‑Answer QA. SPADER includes Step‑wise Peer Advantage (SPA), a critic‑free step‑level credit assignment mechanism that aligns parallel trajectories by decision step and estimates advantages from peer returns. It also includes a diversity‑aware exploration reward that promotes long‑tail entity discovery by upweighting rare findings and downweighting redundant ones. Experiments on QAMPARI, Mintaka, WebQSP, and QUEST show that SPADER generally improves recall and overall F1 over prompting‑based agents, outcome‑supervised RL methods, and recent step‑level supervision approaches. Our code and model weights are available at https://github.com/KhanCold/spader.
Authors:Dongping Chen, Xuanao Huang, Zhihan Hu, Qingyuan Shi, Dianqi Li, Tianyi Zhou
Abstract:
As multimodal LLMs increasingly target video and audio, it is often assumed that such tasks require native omnimodal models. We show that this is not always the case: coding agents with only text+image access and a sandboxed tool‑use interface can match, and in several settings outperform, SOTA native omnimodal models and predefined multimodal agent scaffolds across multiple audio‑video benchmarks. Our trajectory analysis suggests that their strength comes from writing code and orchestrating tools to extract relevant evidence from transcripts, frames, and other modality signals, thereby converting omnimodal tasks into retrieval and information‑processing problems rather than ingesting entire media streams. We further characterize their limitations through a failure taxonomy and process‑level trace analysis, and show that simple skill injection, including human‑written and self‑distilled skills, substantially improves performance. To explore open‑source elicitation, we introduce Code‑X, a training recipe with the OmniCoding trajectory dataset and verifiable reward, and provide baselines on Qwen‑3.5‑9B and Qwen‑3.6‑27B. Finally, we argue that the next frontier is many‑modality processing, and introduce TerminalBench‑O, a process‑level benchmark for real‑world omnimodal processing tasks. Code will be available at https://github.com/Dongping‑Chen/OmniCoding.
Authors:Junlong Tong, Yao Zhang, Anhao Zhao, Yingqi Fan, Yunpu Ma, Xiaoyu Shen
Abstract:
Standard Large Language Models (LLMs) follow a read‑then‑generate paradigm, causing unnecessary latency and computation. Streaming LLMs alleviate this issue by generating while receiving inputs, but still struggle to decide when to interact with the stream. Existing methods either hard‑code interaction timing or rely on costly external alignment signals, such as timing labels, reasoning trajectories, or stronger teachers. In this paper, we propose ProactiveLLM, which achieves active interaction by leveraging the model's endogenous states to guide interaction decisions. The model first learns to perceive semantic sufficiency from partial inputs through two complementary training mechanisms: mask‑based streaming modeling and synchronized privileged self‑distillation (SPSD). The former applies monotonic random masking to the input during training, simulating progressively revealed streaming inputs and enabling the model to learn local semantic dependencies from partial‑input views. The latter aligns the partial‑context student view with a full‑context teacher view generated by the same evolving model, allowing privileged full‑context evidence to guide the student's understanding under incomplete observations. Together, these mechanisms induce endogenous sufficiency cues without requiring external teachers or annotations, providing a versatile foundation for the plug‑and‑play integration of diverse decision heads. Extensive evaluation across text and speech streaming tasks confirms that ProactiveLLM significantly reduces interaction latency while maintaining quality, validating its capacity for dynamic and active interaction. Code is publicly available at https://github.com/EIT‑NLP/StreamingLLM/tree/main/ProactiveLLM.
Authors:Xin Gao, Cheng Yang, Chufan Shi, Taylor Berg-Kirkpatrick
Abstract:
Unified multimodal models (UMMs) have emerged as a promising paradigm for general‑purpose multimodal intelligence. As they are deployed in real‑world applications, effectively updating internal knowledge becomes critical. While knowledge editing has matured for text‑only models, it remains unclear whether edits that successfully modify textual outputs also transfer to image generation in UMMs. To study this question, we introduce UniKE, the first benchmark for cross‑modality knowledge editing in UMMs, comprising 2,971 edit subjects spanning attribute and relation edits. Using VQA‑based visual verification, we reveal a striking modality gap: text‑side efficacy can reach approximately 92%, whereas the best overall VQA accuracy under direct image generation is only 18.5%. We further propose Reasoning‑augmented Parameter Editing, which explicitly activates edited knowledge before generation and improves overall VQA accuracy for all evaluated model‑editor pairs, with gains up to 18.6 percentage points. Mechanistic analysis shows that this gap is associated with partial alignment between edited textual representations and the conditioning pathways for visual generation, where edits sufficient for text outputs may remain too weak or misaligned to steer image synthesis. These findings show that textual knowledge edits do not guarantee reliable cross‑modality transfer and motivate modality‑aware editing methods. Our code and data are available at https://github.com/gxx27/UniKE.
Authors:Haoxiang Zhang, Qixin Xu, Zhuofeng Li, Lei Zhang, Pengcheng Jiang, Yu Zhang, Julian McAuley
Abstract:
Long‑horizon search agents accumulate large amounts of retrieved content across many tool calls, making context‑budget efficiency increasingly important. A minimal intervention is to mask stale observations from the context as the trajectory progresses, but it remains unclear when this form of context management helps and why. We study observation masking through a systematic sweep over various agent backbones (4B to 284B parameters) and three retrievers on offline and live‑web agentic search benchmarks. We find that the accuracy gain from masking follows an asymmetric inverted‑U shape when plotted against the model's accuracy without context management: a plateau under weak retrievers, a peak when a strong retriever meets a mid‑capacity model, and a sharp collapse when the model is saturated. This pattern reflects the interaction between retriever recall and the model's implicit filtering capacity, rather than either factor in isolation. Mechanistically, masking implements a token‑for‑turn trade‑off: it removes observations the model has largely stopped attending to and pages the agent rarely re‑opens. The added turns help when they convert failures into successes, but they fail when masking removes evidence the model would otherwise have used. We therefore reframe context management as a regime‑dependent intervention and provide a holistic perspective for analyzing context use in agentic deep search. We release our scaffold and trajectories here (https://github.com/i‑DeepSearch/observation‑masking) to support future research.
Authors:Tarek Mahmoud, Veronika Solopova, Premtim Sahitaj, Ariana Sahitaj, Max Upravitelev, Mervat Abassy, Hana Fatima Shaikh, Neda Foroutan, Vera Schmitt, Preslav Nakov
Abstract:
Temporal language does more than place events on a timeline. In news discourse, references to the past, present, and future can function as rhetorical devices that shape interpretation and persuasion. Here, we study temporal framing, defined as the persuasive use of time‑related language to structure meaning rather than to report chronology. We propose a taxonomy of eight temporal frames grounded in prior work on temporality and framing, and we realize it through expert annotation of a multilingual news corpus. The resulting dataset includes 458 English and German news articles, with over 2K temporally framed sentences and approximately 3K temporal framing annotations identified from a corpus of more than 20K sentences. We analyze frame prevalence, co‑occurrence patterns, and lexical cues, and evaluate temporal framing detection using supervised fine‑tuning and zero‑shot classification. Our experiments show that temporal framing is learnable at the sentence level, with supervised models substantially outperforming zero‑shot approaches. We publicly release the corpus to support future research on temporal framing: https://mbzuai‑nlp.github.io/temporal‑framing/.
Authors:Yuxiang Lin, Zihan Wang, Mengyang Liu, Yuxuan Shan, Longju Bai, Junyao Zhang, Xing Jin, Boshan Chen, Jinyan Su, Xingyao Wang, Jiaxin Pei, Manling Li
Abstract:
While agents are increasingly spending more resources, today agent cost is mostly measured only after execution. A Budget‑Aware Agent (BAGEN) should treat budget as an active control signal, rather than a passive cost metric. We first systematically define budget estimation as internal budgets (from agent computation) and external budgets (from agent actions). We then formalize budget‑awareness as progressive interval estimation: at each step of a plan, an agent should predict an upper and lower bound on remaining budget, and alert when completion is unlikely. Scoring with a rollout‑replay protocol, we find consistent failure patterns on four environments and five frontier agents: (1) strong agents do not necessarily have strong budget‑awareness, with correlation r=0.35. (2) frontier models are consistently over‑optimistic, continue spending on tasks that are unlikely to succeed, instead of alerting the user early. (3) budget‑aware signal is actionable and trainable. Early stop saves 28‑64% tokens on failed trajectories, and SFT+RL strengthens early stop and alert behavior. (4) precise interval calibration remains challenging, with interval coverage capping at 47% after SFT+RL. Project page: https://ragen‑ai.github.io/bagen/
Authors:Junbo Zhang, Qianli Zhou, Xinyang Deng, Wen Jiang, Jie Pan, Jinbiao Zhu
Abstract:
Large language models (LLMs) suffer from degraded safety capabilities even when fine‑tuned with benign datasets. However, existing methods for identifying safety‑degrading samples in benign datasets suffer from high computational costs and significant noise issues. In this paper, we propose DataShield to efficiently and effectively identify potential safety‑degrading samples. Our key intuition is based on the observation that benign fine‑tuning increases the overall response compliance of LLMs. DataShield's key technical insight is to quantify each sample's contribution to the model's compliance behavior as its safety degradation score. DataShield consists of three core components: (1) Compliance Vector Extraction, which captures the LLM's compliance behavior tendency; (2) a novel Compliance‑Aware Score (CAS), which automatically identifies the optimal safety‑critical layer; and (3) Safety‑degrading Sample Filtering, which quantifies the projection shift of training data along the compliance direction. Extensive experimental evaluation on Llama3‑8B, Llama3.1‑8B, and Qwen2.5‑7B using the Alpaca and Dolly benign datasets validates our method's effectiveness in identifying high‑risk and low‑risk data subsets. We also observe that open‑ended question answering is more likely to trigger safety degradation, and corresponding responses tend to be longer. We hope this work can provide new insights into data‑centric defense methods. The source code is available at: https://github.com/ZJunBo/DataShield.
Authors:Kailing Li, Tianwen Qian, Lijin Yang, Yuqian Fu, Jingyu Gong, Xiaoling Wang, Liang He
Abstract:
Vision‑Language Navigation (VLN) enables embodied agents to reach target locations in unseen environments by following language instructions. Despite recent progress with vision‑language models (VLMs), a critical semantic‑geometric gap remains: while VLMs excel at language and 2D visual understanding, they struggle with 3D spatial reasoning and fail to capture the causal dynamics between actions and spatial transitions, resulting in unreliable navigation, particularly in zero‑shot settings. To bridge this gap, we propose a Hierarchical Semantic‑Geometric Map (HSGM) that transforms 3D geometric information into a structured representation compatible with VLMs, effectively linking them to the physical world. Specifically, HSGM is represented as a multi‑channel top‑down map organized into three levels: (1) geometric level that records navigable regions and obstacles, (2) semantic level that represents objects and their relations, and (3) decision level that supports high‑level task reasoning and goal selection. During navigation, the VLM acts as a high‑level semantic planner, interpreting the spatial layout encoded in the HSGM to select geometrically valid waypoints, while low‑level, collision‑free movements between waypoints are executed by a classical path‑planning algorithm, fully decoupling semantic reasoning from action execution. Additionally, complex instructions are decomposed into subtasks to alleviate the problem of progress forgetting or hallucinating in long‑horizon navigation. Extensive experiments on R2R‑CE and RxR‑CE benchmarks demonstrate that our zero‑shot framework achieves state‑of‑the‑art performance and even outperforms several supervised methods. Code is available at https://github.com/Teacher‑Tom/HSGM_public.
Authors:Joel Barmettler
Abstract:
Prior research has established that instruction‑tuned large language models exhibit left‑of‑center political bias, measured exclusively through abstract political questionnaires. We show that this finding does not generalize to concrete policy decisions. We introduce a dual‑instrument methodology grounded in Swiss democratic reality. The Smartvote questionnaire (75 abstract policy questions) is administered to 66 LLMs from 27 model families and compared to 184 elected members of the Swiss National Council, replicating the established leftward convergence (Cohen's d = 3.64, p = 0.0002). Then, novel to this work, 9 flagship LLMs are confronted with 48 real federal referenda (Volksabstimmungen) in four national languages (German, French, Italian, Romansh) under three information conditions, comparing votes to actual outcomes and party recommendations (Parolen). Three findings challenge the prevailing narrative. (1) Abstract questionnaires do not predict concrete behavior: the left‑to‑right agreement gradient on Smartvote shifts from left‑peaked to center‑peaked on Volksabstimmungen, where models align most with centrist Die Mitte and FDP rather than leftist SP and Gruene (Wilcoxon p = 0.008). (2) For some models, the language of a political question changes the answer more than the political content does: cross‑linguistic consistency ranges from 50% (Mistral) to 98% (GPT‑5.4). (3) Two models exhibit systematic change‑aversion rather than political bias, voting Nein on 83‑94% of referenda regardless of direction (binomial p < 0.0001). What prior work measured as "leftward bias" may not generalize beyond abstract instruments. On concrete policy decisions, LLMs behave less like coalition partners of the left and more like cautious civil servants: centrist, status‑quo‑favoring, and inconsistent across languages.
Authors:Yichuan Mo, Yukun Jiang, Yanbo Shi, Mingjie Li, Michael Backes, Yang Zhang, Yisen Wang
Abstract:
The rapid development of Language Diffusion Models (LDMs) challenges the dominant position of auto‑regressive competitors in language processing. However, their flexible, any‑order decoding strategies not only enable fast decoding speed but also potentially bring new trustworthiness challenges. To better understand the risks behind their pipelines, we introduce a comprehensive trustworthiness benchmark tailored to LDMs (TrustLDM), evaluating safety, privacy, and fairness across different LDM architectures with multiple categories of static post contexts. Our empirical results show that although LDMs generally exhibit strong trustworthiness with only the user prompts, their alignment behavior degrades noticeably when the malicious post contexts are attached to the masked responses. We further observe that longer contexts do not necessarily induce stronger effects, and both decoding order and generation length affect the evaluation outcomes. Finally, we propose TrustLDM‑Auto, an automatic evaluation framework that leverages LDM decoding flexibility to systematically identify vulnerable configurations, revealing substantial trustworthiness weaknesses across all evaluated models and dimensions. Our work may potentially help the community build more trustworthy LDMs. Our code is available at https://github.com/PKU‑ML/TrustLDM.
Authors:Wei Tian, Yuhao Zhou, Man Lan
Abstract:
Large Language Model (LLM) based Chinese Grammatical Error Correction (CGEC) systems face two critical challenges: general‑purpose models lack specialized linguistic priors for subtle grammatical distinctions, and Supervised Fine‑Tuning (SFT) with Maximum Likelihood Estimation fails to optimize for precision‑focused metrics, leading to systematic over‑correction. We propose CSRP, a three‑stage framework that progressively builds correction capability through Continual Pre‑training (CPT) on 5.9M balanced samples to internalize domain knowledge, Chain‑of‑Thought SFT with explicit error reasoning for diagnostic transparency, and Group Relative Policy Optimization with a novel Efficiency‑Aware Reward that explicitly penalizes unnecessary edits. On the NACGEC benchmark, CSRP achieves state‑of‑the‑art performance with 50.99 F_0.5 and 57.17 precision, substantially outperforming previous best results while effectively mitigating the over‑correction bias inherent in MLE‑trained models. Our method also advances CSCD spelling correction to 59.61 F1, surpassing GPT‑4 by 5.20 points. Comprehensive ablation studies demonstrate that the RL alignment stage contributes a 8% relative gain over the SFT baseline, and that this gain is orthogonal to the contribution of large‑scale CPT, validating that explicit optimization for edit efficiency is essential for high‑quality grammatical error correction. Our code is available at https://github.com/TW‑NLP/ChineseErrorCorrector.
Authors:Hao Xu, Rite Bo, Fausto Giunchiglia, Yingji Li, Rui Song
Abstract:
Although studies have demonstrated that Large Language Models (LLMs) can perform well on Out‑of‑Distribution (OOD) tasks, their advantage tends to diminish as the distribution shift becomes more severe. Consequently, researchers aim to retrieve distributionally similar and informative demonstrations from the available source domain to boost the inference capabilities of LLMs. However, in practical scenarios where the target domain is inaccessible, evaluating the unknown distribution is challenging, which indirectly impacts the quality of the selected demonstrations. To address this problem, we propose DOPA, a demonstration search framework that incorporates an OOD proxy to approximate the inaccessible target domain and guide the retrieval process. Building on proxy‑based evaluation, DOPA further introduces a Mahalanobis distance‑based global diversity constraint to ensure sufficient diversity among the retrieved demonstrations. Experimental results on multiple LLMs and tasks demonstrate that DOPA effectively enhances robustness in OOD settings\footnotehttps://github.com/bort64/ood\_code.
Authors:Nianyi Lin, Jiajie Zhang, Lei Hou, Juanzi Li
Abstract:
Long‑context reasoning remains a central challenge for large language models, which often fail to locate and integrate key information in extensive distracting content. Reinforcement learning with verifiable rewards (RLVR) has shown promise for this task, yet existing methods are limited by low‑confusability distractors and sparse, outcome‑only reward signals that cannot supervise intermediate reasoning steps. To address these issues, we introduce \textscLongTraceRL. For data construction, we generate multi‑hop questions via knowledge graph random walks and leverage search agent trajectories to build \emphtiered distractors: documents the agent read but did not cite (high confusability) and documents that appeared in search results but were never opened (low confusability), producing training contexts that are far more challenging than those built by random sampling or one‑shot search. For reward design, we propose a \emphrubric reward that uses the gold entities along each reasoning chain as fine‑grained, entity‑level process supervision. This rubric reward is applied only to responses with correct final answers (positive‑only strategy), distinguishing the reasoning quality among correct responses and preventing reward hacking. Experiments on three reasoning LLMs (4B‑‑30B) across five long‑context benchmarks demonstrate that \textscLongTraceRL consistently outperforms strong baselines and encourages comprehensive, evidence‑grounded reasoning. Codes, datasets and models are available at \hrefhttps://github.com/THU‑KEG/LongTraceRLhttps://github.com/THU‑KEG/LongTraceRL.
Authors:Yibin Zhao, Fangxin Shang, Dingrui Yang, Yuqi Wang
Abstract:
Table question answering requires models to recover semantic relations encoded implicitly by two‑dimensional layout, merged cells, and hierarchical headers. Current pipelines typically use HTML or Markdown as intermediate table representations, but these layout‑oriented serializations introduce markup overhead and require large language models to infer header‑cell alignments from row and column spans. We propose Semantic Triplet Restoration (STR), a protocol that rewrites each cell as an atomic fact <item path, feature path, value>, where the item path specifies the row‑wise entity, the feature path specifies the hierarchical attribute, and the value contains the cell content. We also present TripletQL, a lightweight query‑aware router that uses STR to select an appropriate rendering or filtered subset of triplets for each question. Across four Chinese and English table‑QA benchmarks, STR matches or improves upon HTML‑based baselines while reducing input tokens. The relative benefit grows for smaller language models and longer table contexts, suggesting that explicit semantic representations are especially useful under constrained inference budgets. Code and data are available at https://github.com/Phoenix‑ni/STR.git .
Authors:Yilun Qiu, Xiaoyan Zhao, Yang Zhang, Yuxin Chen, Cilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, Yoko Yamakata, Tat-Seng Chua
Abstract:
As Large Language Models (LLMs) evolve from general‑purpose assistants to user‑centric agents, personalization has become central to aligning model behavior with individual preferences, making the evaluation of personalized alignment a critical bottleneck. Existing evaluation methods‑ranging from automatic metrics to LLM‑as‑a‑judge approaches‑fail to capture subjective, user‑specific preferences embedded in long‑term interaction histories. We identify three essential principles for reliable and effective personalized evaluation: Representativeness, User‑Consistency, and Discriminativeness. To address these principles, we introduce Personalized Evaluation as Learning, a paradigm that formulates personalized evaluation as a learning problem rather than a static judgment. Under this paradigm, we propose PARL (Preference‑Aware Rubric Learning for Personalized Evaluation), a framework that learns to induce preference‑aware evaluation rubrics directly from raw user histories and performs a self‑validation mechanism to ensure consistency with the user's preferences. PARL integrates rubric induction with a discriminative reinforcement learning objective that contrasts user‑authored responses against competitive personalized model outputs, enabling the learned rubrics to capture precise, user‑specific decision boundaries. Experiments on real‑world personalized text generation tasks show that PARL consistently induces high‑fidelity rubrics that reliably identify user‑aligned responses and generalize across users and tasks, while capturing stable stylistic preferences and fine‑grained evaluative patterns. To ensure reproducibility, our code is available at https://github.com/SnowCharmQ/PARL.
Authors:Yuhan Song, Linhao Zhang, Aiwei Liu, Chuhan Wu, Sijun Zhang, Wei Jia, Yuan Liu, Houfeng Wang, Xiao Zhou
Abstract:
Semantic speech tokenizers have become a widely used interface for Audio‑LLMs, owing to their compact single‑codebook design and strong linguistic alignment. However, their focus on linguistic abstraction induces acoustic blindness, limiting their applicability beyond speech‑centric tasks. We propose UniAudio‑Token, a framework that empowers semantic tokenizers with general audio perception without compromising speech ability. Instead of altering the semantic paradigm, UniAudio‑Token mitigates its information loss through two key innovations: (1) Semantic‑Acoustic Primitives (SAP) provide structured supervision by decomposing audio into linguistic content, vocal attributes, and auditory‑scene primitives; and (2) Semantic‑Acoustic Equilibrium (SAE) introduces a content‑aware gating mechanism that adaptively restores fine‑grained acoustic details from shallow layers. Extensive evaluations show that UniAudio‑Token learns comprehensive universal representations while preserving high‑fidelity speech generation. When integrated with downstream LLMs, it outperforms all single‑codebook baseline tokenizers on both understanding and generation tasks, effectively serving as a unified audio interface. We publicly release all our code, including training and inference scripts, together with the model checkpoints at https://github.com/Tencent/Universal_Audio_Tokenizer.
Authors:Jian Mu, Tianyi Lin, Chengwei Qin, Zhongxiang Dai, Yao Shu
Abstract:
Large language models are increasingly deployed in multi‑turn interactive settings where users or environments can iteratively provide lightweight feedback. Unfortunately, optimizing such behavior presents a sharp dilemma in practice: online reinforcement learning is able to effectively address multi‑turn dynamics but is prohibitively expensive due to the cost of generating full correction trajectories at every update, whereas offline supervised fine‑tuning (SFT) is efficient but suffers from distribution shift and behavioral collapse. To this end, we novelly propose DRIFT (Decoupled Rollouts and Importance‑Weighted Fine‑Tuning), a framework that operationalizes the theoretical insight that the KL‑regularized RL objective is equivalent to importance‑weighted supervised learning. DRIFT decouples rollout from optimization by sampling offline interaction trajectories from a fixed reference policy, deriving return‑based importance weights, and optimizing the policy via weighted SFT on the resulting dataset. Empirically, we demonstrate that DRIFT matches or exceeds the performance of multi‑turn reinforcement learning baselines while maintaining the training efficiency and simplicity of standard supervised fine‑tuning. Code is available at https://github.com/2020‑qqtcg/DRIFT.
Authors:Xudong Zhang, Jian Yang, Shengkai Wang, Jiangpeng Tian, Shaowen Chen, Xian Wei, Ke Li, Xiong You
Abstract:
Large Language Model (LLM)‑based navigation systems commonly construct explicit spatial representations (e.g., topological graphs, semantic raster maps) and translate them into textual descriptions as LLMs' inputs. However, the linguistic structures of such text‑based spatial representations and the choices of contextual features (e.g., topology, geometry) they contain are often treated as neutral engineering decisions rather than key factors that shape LLMs' behavior. To fill the gap, we propose a dual‑interventional framework that disentangles linguistic structures from different contextual cues to evaluate the linguistic inductive bias of LLMs for navigation planning. In the framework, representation intervention varies the linguistic format and the degree of linguistic compression, clarifying when linguistic representations support or inhibit navigation planning. Context intervention, combined with contextual feature combination and conflict probing, explicitly clarifies the preferences and weaknesses of LLMs when processing different contextual cues. Experiments across diverse spatial reasoning tasks and multiple model scales reveal a consistent pattern: topological information is a sturdy shield and the backbone of robust planning; linguistic format is a double‑edged sword whose effect depends on model size, task demands, and the compression level; and semantic information is a fatal Achilles' heel ‑‑ incorrect semantic cues can systematically derail the planning process. Overall, our study shows that effective text‑based spatial representations in LLM‑based navigation should preserve topological integrity, calibrate representational compression to model capacity, and ensure semantic correctness, rather than simply adopting a single representation. Our code is publicly available at https://github.com/jonesdong150/LLM‑Navigation‑Inductive‑Bias.
Authors:Yi Zhao, Siqi Wang, Zhe Hu, Yushi Li, Jing Li
Abstract:
AI‑based Visually Impaired Assistance (VIA) remains challenging, largely due to the high cost of human evaluation. The VLM‑as‑a‑Judge paradigm may offer a promising alternative, although it has mostly been studied in general domains. We therefore ask whether such judges can be trusted for VIA tasks. To investigate this question, we introduce VIABLE (Visually Impaired Assistance Benchmark for VLM‑as‑a‑Judge Evaluation), the first benchmark for VLM‑as‑a‑Judge evaluation in VIA. VIABLE contains over 300K judgment samples across three scenarios and introduces an Effectiveness‑‑Impartiality‑‑Stability framework with a 12‑mode failure taxonomy. Based on VIABLE, our systematic study of seven judges across different model scales shows that existing models are largely unreliable across all evaluation axes. The strongest judge, GPT‑5.4, achieves only 52.6% single‑failure diagnostic accuracy, yet exhibits the highest self‑preference rate at 94.2%; while open‑source judges are strongly biased and adversarially fragile. To address these issues, we propose VIA‑Judge‑Agent, a model‑agnostic inference‑time harness that augments judges with visual evidence extraction and a taxonomy‑guided workflow. It enables positive improvements in diagnostic accuracy and downstream VIA responses more preferred by BLV users. Data and code are available at: https://github.com/YiyiyiZhao/VIABLE
Authors:Haolin Deng, Xin Zou, Zhiwei Jin, Chen Chen, Haonan Lu, Xuming Hu
Abstract:
Multimodal hallucination remains a persistent challenge for Vision‑Language Models (VLMs). Standard textual Direct Preference Optimization (DPO) often fails to mitigate it due to a lack of explicit visual supervision. While existing works introduce visual preference DPO by contrasting original images against negative ones, they suffer from a theoretically inconsistent objective caused by partition function mismatches and rely on coarse‑grained negatives that could enable shortcut learning. In this work, we propose In‑Context Visual Contrastive Optimization (IC‑VCO). By placing contrastive images within a shared multi‑image context, IC‑VCO ensures a mathematically rigorous objective. We further introduce Visual Contrast Distillation (VCDist), an auxiliary reliability‑gated regularizer that encourages consistency between multi‑image contrastive training and single‑image inference. Finally, we propose a contrastive sample editing strategy that generates hard negatives via precise semantic perturbations. Experiments on five benchmarks demonstrate IC‑VCO's best overall performance and the effectiveness of our sample editing strategy. Code and data are available at https://github.com/OPPO‑Mente‑Lab/IC‑VCO.
Authors:Max Malyi, Jonathan Shek, Alasdair McDonald, Andre Biscaya
Abstract:
As wind turbine fleets age, data‑driven reliability engineering is essential to optimise their operation and maintenance for service life extension and levelised cost of energy reduction. Failure event descriptions within historical maintenance logs are a source of valuable reliability intelligence. However, they typically appear as unstructured natural language entries, rendering them inaccessible for quantitative analysis. This paper presents a novel methodology leveraging a large language model (LLM) to systematically standardise and structure maintenance logs based on their free‑text descriptors. Operating on a dataset of 16,316 maintenance logs from 280 turbines monitored over nine years, the developed model‑agnostic framework autonomously corrected hierarchical system codes and extracted evidence‑based taxonomies of maintenance actions and failure modes. The automated pipeline successfully structured over 70% of the dataset. It resolved pervasive misclassification issues, such as isolating previously unclassified pitch system faults and restoring missing system codes, and enriched the records by applying empirical taxonomies to label specific actions taken and failure modes addressed. By using system‑based log batches to construct empirical dictionaries of failure modes, observable symptoms, dominant mechanisms, and candidate causes, this approach reduces the inherent subjectivity of manual failure modes and effects analysis (FMEA). Ultimately, the methodology provides a highly scalable, cost‑effective blueprint for translating large sets of qualitative field observations into quantitative reliability metrics, laying the foundation for integrated root‑cause analysis across the renewable energy sector, improved FMEA, and advanced predictive maintenance.
Authors:Pengyu Chen, Yonggang Zhang, Mingming Chen, Jun Song, Wei Xue, Yike Guo
Abstract:
Endowing large language models with compositional reasoning over specialized documents requires multi‑hop training data at scale, where such data rarely exists outside of curated benchmarks built on structured sources. To construct it directly from plain, unannotated text, existing methods ask a single teacher model to jointly discover an evidence path through a document and verbalize it as a question‑answer pair. However, these methods degrade sharply when documents are structured around repetitive templates and densely cross‑referencing clauses, conditions that characterize most real‑world specialized corpora. In this work, we decouple the two operations: reasoning paths are enumerated offline over a graph of contextual keyword centroids, and the teacher is invoked only to verbalize pre‑validated paths. The graph enforces five geometric admissibility constraints, for which we provide Gram‑matrix arguments establishing that local similarity bounds alone admit endpoint drift up to ~91^\circ, and that an upper similarity bound is necessary to exit dense embedding cliques formed by boilerplate text. A matched‑size ablation isolates the mechanism: at equal training scale, constrained and unconstrained chains yield indistinguishable downstream performance, and the gain at full scale comes from a 4.4× expansion of the usable corpus rather than from higher per‑chain quality ‑‑ reframing the role of graph constraints, in this setting, as raising teacher synthesizability rather than improving chain content. Fine‑tuning Qwen3‑32B on 80K examples constructed from the CUAD legal contract corpus improves closed‑book Token F1 from 21.66% to 38.58%. We have released our codes at https://github.com/hkgai‑official/GCSCS.
Authors:Yuanjian Xu, Jianing Hao, Wanbo Zhang, Zhong Li, Guang Zhang
Abstract:
The annealing phase is a pivotal convergence stage in LLM pre‑training that ultimately determines final model quality. However, effectively selecting training data during this phase remains a key challenge. Current strategies rely on empirical heuristics, such as domain filtering or context extension, which lack a principled grounding in optimization theory. In this work, we characterize the annealing phase through the lens of the loss landscape's spectral geometry. We argue that optimal convergence requires gradient updates to satisfy heterogeneous constraints across different eigen‑directions. Building on this insight, we formulate data selection as a problem of satisfying these directional constraints. To this end, we propose DiReCT (Directionally‑Restrained Constrained Training), a novel framework that reformulates sample selection in the annealing stage as a constrained optimization problem. By imposing explicit directional constraints on per‑sample gradients based on the spectral properties of the Hessian, DiReCT identifies samples that align with the optimal curvature‑aware descent path. Extensive experiments across various model scales demonstrate that DiReCT consistently achieves state‑of‑the‑art performance. For future research, code is available at https://github.com/xuyj233/Direct.
Authors:Yuanjian Xu, Jianing Hao, Guang Zhang, Zhong Li
Abstract:
Training data plays a central role in large language models (LLMs) optimization, motivating extensive research on data scheduling strategies. Most existing approaches concentrate on adjusting the overall data distribution but neglect the underlying interactions between samples during training. However, we argue that such interactions cannot be overlooked, as real‑world data samples frequently exhibit directional influences on each other, making the training order crucial. Intuitively, we can prioritize train‑units with greater influence to improves learning efficiency. In this work, we propose D^3, a Dynamic Directional graph‑constrained Data scheduling framework. D^3 formulates the complex interactions among train‑units as a dynamic influence graph, where edges represent loss‑based dependencies. It then solves a constrained optimization problem over this graph to derive the training order, which ensures that the data sequence respects the evolving information flow throughout training. Our approach is theoretically motivated and yields consistent improvements over existing data scheduling methods across both pre‑training and post‑training phases. Furthermore, for scalability, D^3 also employs an efficient approximation algorithm that keeps the additional computational overhead within a manageable range. For future research, the code is available at https://github.com/xuyj233/D3.
Authors:Gerrit Quaremba, Amy Rechkemmer, Elizabeth Black, Denny Vrandečić, Elena Simperl
Abstract:
In automated fact‑checking (AFC), check‑worthiness detection identifies claims requiring verification based on domain‑specific criteria. On Wikipedia, this task instantiates as Citation Needed Detection (CND), which flags claims lacking supporting citations. However, existing research has largely overlooked lower‑resource languages, and recent AFC pipelines rely on large language models (LLMs), which are inaccessible to low‑resource organizations. We introduce MCN, a multilingual CND corpus spanning 18 languages across three resource levels, on which we conduct an extensive study of small decoder‑based language models (SLMs). Our experiments show that SLMs fine‑tuned with an encoder‑style objective substantially outperform prompted LLMs across languages. We further present one of the first studies on cross‑lingual CND, demonstrating that SLMs fine‑tuned solely on English claims surpass LLMs, even with little to no target‑language adaptation. Our findings have important implications for lower‑resource Wikipedia communities and suggest that compact, task‑specific models are preferable to LLMs for CND. We release all data and code at https://github.com/gerritq/mcn
Authors:Jiejun Tan, Zhicheng Dou, Xinyu Yang, Yuyang Hu, Yiruo Cheng, Xiaoxi Li, Ji-Rong Wen
Abstract:
LLM agents are evolving from conversational chatbots to operational tools in real‑world workspaces. In local agentic harnesses, an LLM can read and write files, call tools, and reuse workspace state across sessions. While such capabilities enhance utility, they also expose a new attack surface for attackers. Attackers can embed a prompt injection within a file or tool output. Agents may read this hidden instruction, store it, and execute it later. In this multi‑step trojan attack paradigm, no individual step appears malicious on its own, but these steps can collectively turn untrusted text into persistent control content. However, existing defenses often inspect each step in isolation. As a result, they can block a clear harmful action, but fail to detect the earlier write operation that plants the backdoor. To reveal this threat, we introduce ClawTrojan, a benchmark designed to identify multi‑step trojan attacks in local agentic harnesses. In an OpenClaw‑style simulated workspace with GPT‑5.4, ClawTrojan reaches a 95.5% attack success rate (ASR), while existing single‑turn prompt‑injection attacks produce near‑zero ASR on the same model. To address this threat, we propose DASGuard, which scans control‑like text in sensitive local files, traces its origin, and removes control content that does not originate from a trusted source. Our results show that DASGuard achieves strong dynamic defense by combining runtime attack blocking with sanitized commits to the workspace.
Authors:Zheng Yuan, Chuang Zhou, Linhao Luo, Siyu An, Di Yin, Xing Sun, Xiao Huang
Abstract:
Retrieval‑augmented generation is intensively studied to ground large language models on external evidence. However, retrieving from a unified knowledge base could inevitably introduce irrelevant information that may mislead generation for complex reasoning. Inspired by the conditional computation of mixture of experts (MoE), where a router sparsely selects specialized experts alongside shared ones for each input, we propose Mixture of experts for Graph‑based Retrieval‑Augmented Generation, i.e., MoG. It organizes knowledge into two core components: (i) diverse, always‑accessible hub graphs that encode semantically and structurally central knowledge and provide contextual clues for expert activation, and (ii) sparsely activated expert graphs that contain domain‑specific evidence. MoG first accesses hub graphs to identify general evidence and derive contextual clues. Then, a topology‑aware router dynamically activates a limited set of expert graphs conditioned on the query, thereby confining retrieval to a focused evidence subspace. Extensive experiments on challenging benchmarks show that MoG consistently outperforms strong baselines, with over 20% relative improvement on MuSiQue. Our code is available in https://github.com/DEEP‑PolyU/MoG.
Authors:Thales Bertaglia, Haoyang Gui, Catalina Goanta, Gerasimos Spanakis
Abstract:
Public consultations generate large volumes of data in the form of stakeholder submissions that are practically unfeasible to analyse manually. We present an end‑to‑end LLM‑based pipeline and interactive dashboard for structured topic extraction from regulatory consultation submissions, demonstrated on the European Commission's Digital Fairness Act (DFA) public call for evidence as a case study. The system processes raw PDF attachments and web‑form responses, extracts topic annotations, and grounds every extraction in a verbatim quote from the source text. Applied to 4,322 DFA submissions, the pipeline produced 15,368 topic annotations supported by 20,951 verbatim evidence quotes. Three principles govern the proposed design: verbatim grounding, full traceability, and transparency by design. The dashboard exposes the full extraction dataset through five analytical views, from dataset‑level topic overviews to individual paragraph drill‑downs, with every result traceable to its source. Beyond the predefined DFA topic categories, the pipeline generated certain stakeholder concerns, such as Age Verification, Payment Processor Censorship, and Digital Ownership, that a fixed‑taxonomy approach would have missed. The pipeline is domain‑generic; adapting it to a new consultation requires only a prompt update and a new dataset. A live demo is available at https://dfa‑dashboard.thalesbertaglia.com/. The code and processed data are publicly available at https://github.com/thalesbertaglia/dfa‑dashboard.
Authors:Jun-Hak Yun, Seung-Bin Kim, Seong-Whan Lee
Abstract:
Recent advancements in text‑guided audio generation have yielded promising results in diverse domains, including sound effects, speech, and music. However, jointly generating speech with environmental audio remains challenging due to the inherent disparities in their acoustic patterns and temporal dynamics. We propose ImmersiveTTS, an environment‑aware text‑to‑speech (TTS) model that generates natural speech seamlessly integrated within environmental contexts by explicitly modeling cross‑modal interactions. Our model builds on a multimodal diffusion transformer and fuses transcript‑aligned speech latent with text‑conditioned environmental context via joint attention. To enhance semantic consistency, we introduce a domain‑specific representation alignment objective tailored to environment‑aware TTS, leveraging complementary self‑supervised representations from speech and audio encoders. Experimental results show that ImmersiveTTS achieves higher naturalness, intelligibility, and audio fidelity than existing approaches across objective metrics and human listening tests.
Authors:Yating Pan, Jiajun Zhang, Jun Wang, Qi Su
Abstract:
LLM‑based research agents have advanced rapidly in science and engineering, where research is organized around executable experiments, code, and quantitative signals. Humanities scholarship, however, requires a different mode of reasoning: interpretive, evidence‑grounded argument over primary sources, where scholarly value depends on faithful quotation, verifiable provenance, and close reading. Existing research agents remain largely optimized for execution and retrieval, not evidence‑grounded interpretive reasoning. To address this gap, we introduce SPIRE (Scholarly‑Primitives‑Inspired Research Engine), a multi‑agent framework for evidence‑grounded humanities scholarship. Drawing on Scholarly Primitives theory, SPIRE casts recurring humanities operations as cooperating agent roles (source discovery, evidence annotation, comparison, provenance checking, sampling, citation binding, and argumentative synthesis) over a multi‑scale close‑reading substrate of passages, intra‑context graph communities, and cross‑context semantic clusters. On a peer‑reviewed‑paper benchmark over classical Chinese and Greco‑Roman Latin scholarship, SPIRE recovers cited primary‑source evidence more reliably than Naive LLM, Text RAG, and GraphRAG, and receives higher blind‑judge scores on answer accuracy, depth, coverage, and evidence quality. Ablations show that both the scholarly‑operation agents and close‑reading retrieval contribute to evidence‑grounded essays. Code, data catalogues, and reproduction scripts are released at https://github.com/YatingPan/SPIRE.
Authors:Tianjie Ju, Yueqing Sun, Zheng Wu, Wei Zhang, Yaqi Huo, Xi Su, Qi Gu, Xunliang Cai, Gongshen Liu, Zhuosheng Zhang
Abstract:
Multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and action generation. However, their ability to sustain exploration in dynamic open worlds remains unclear. Existing embodied and game‑based benchmarks often compress interaction into short‑horizon tasks or entangle success with domain‑specific game mechanics. In this paper, we introduce MineExplorer benchmark for evaluating open‑world exploration capabilities of MLLM agents in Minecraft. We first filter atomic tasks whose solutions rely heavily on Minecraft‑specific knowledge to better reflect general open‑world reasoning. Then we organize the benchmark around a ReAct‑style capability formulation and compose atomic tasks into implicit multi‑hop tasks. To further construct reliable instances, MineExplorer uses a multi‑agent synthesis workflow that jointly designs task graphs, sandbox scenes, and rule‑based milestone evaluators. Human evaluation shows that the multi‑agent synthesis workflow produces significantly more reliable instances than a single‑agent baseline. Experiments with advanced MLLM agents show that open‑world exploration remains challenging, as strong models can handle many single‑hop tasks but degrade sharply when hidden prerequisites must be coordinated over longer trajectories. Further analysis finds that task difficulty tracks agent completion, and larger models or thinking modes do not consistently translate into better performance. Code and dataset are available at https://github.com/Jometeorie/MineExplorer.
Authors:Dongwook Choi, Taeyoon Kwon, Bogyung Jeong, Minju Kim, Yeonjun Hwang, Hyojun Kim, Byungchul Kim, Young Kyun Jang, Jinyoung Yeo
Abstract:
MLLM‑powered embodied agents deployed in real‑world environments encounter physical hazards. However, existing approaches lack explicit mechanisms for identifying hazards and reasoning about action‑conditioned risks, leading agents to either miss risky interactions or over‑identify risks. To address this, we propose EMBGuard, the first MLLM‑based safety guardrail for embodied agents designed to decouple physical risk reasoning from agent policy. By evaluating a (visual observation, action) pair, EMBGuard identifies hazardous configurations and provides natural language explanations of potential risks. Alongside EMBGuard, we contribute EMBHazard, a training dataset of 15.1K action‑conditioned pairs, and EMBGuardTest, a benchmark of 329 manually curated real‑world scenarios spanning seven physical risk categories. Through compositional variation of hazards and actions, we generate diverse risky and benign scenarios that agents may encounter during planning. Despite its compact size (2B, 4B), EMBGuard achieves performance competitive with proprietary MLLMs (e.g., GPT‑5.1, Gemini‑2.5‑Pro) while significantly reducing the false‑positive rates that hinder real‑time deployment. We make the code, data, and models publicly available at https://github.com/dongwxxkchoi/EMBGuard
Authors:Jiaxin Bai, Yue Guo, Yifei Dong, Jiaxuan Xiong, Tianshi Zheng, Yixia Li, Tianqing Fang, Yufei Li, Yisen Gao, Haoyu Huang, Zhongwei Xie, Hong Ting Tsang, Zihao Wang, Lihui Liu, Jeff Pan, Yangqiu Song
Abstract:
Text‑agent environments are typically modeled as partially observable Markov decision processes (POMDPs), assuming that the simulator's latent state and transition dynamics are hidden from the agent. Yet little work has examined whether executable code can be induced to serve as a world model for prediction and planning under partial observability. We introduce PatchWorld, a gradient‑free framework that turns offline trajectories into executable Python world models through counterexample‑guided code repair. Instead of predicting the next observation with a black‑box model, PatchWorld induces symbolic belief‑state programs whose action updates can be inspected, replayed, and locally patched. Across seven AgentGym environments, PatchWorld‑Simple achieves the highest code‑based planning score among evaluated methods, reaching 76.4% macro success in live one‑step lookahead while invoking no LLM calls inside the world‑model prediction module itself. We further find that a human‑specified residual‑memory bias improves surface observation fidelity but weakens decision utility. This exposes a tradeoff in executable world models, since improving observation fidelity can come at the expense of action‑discriminative dynamics, and vice versa. Code is available at https://github.com/HKBU‑KnowComp/PatchWorld.
Authors:Sicheng Feng, Zigeng Chen, Gongfan Fang, Xinyin Ma, Xinchao Wang
Abstract:
Diffusion Large Language Models (dLLMs) have recently emerged as a promising alternative to autoregressive models, offering competitive performance while naturally supporting parallel decoding. However, as dLLMs are increasingly integrated with Mixture‑of‑Experts (MoE) architectures to scale model capacity, a fundamental mismatch arises between block parallel decoding and token‑level expert selection. Specifically, each dLLM forward pass processes multiple tokens with bidirectional dependencies, whereas conventional MoE layers route each token independently. This mismatch substantially increases the number of uniquely activated experts, making inference increasingly memory‑bound. To address this, we propose dMoE, a simple yet effective block‑level MoE framework. The central idea of dMoE is to aggregate token‑level expert distributions within each block into a unified block‑level expert distribution, which is then used to guide expert routing in a more coherent manner. In this way, dMoE substantially reduces the number of uniquely activated experts during inference without sacrificing performance, thereby mitigating the memory‑bound bottleneck. Extensive experiments across a variety of benchmarks demonstrate the effectiveness of dMoE. On average, dMoE reduces the number of uniquely activated experts from 69.5 to 14.6 while retaining 99.11% of the original performance. Meanwhile, it reduces memory usage by 76.64% to 79.84% and achieves 1.14× to 1.66× end‑to‑end latency speedup. Code is available at: https://github.com/fscdc/dMoE
Authors:Yijiong Yu, Huazheng Wang, Shuai Yuan, Ruilong Ren, Ji Pei
Abstract:
Speculative Decoding (SD) accelerates low‑concurrency LLM inference by employing a draft‑then‑verify paradigm. However, mainstream methods typically rely on multi‑token prediction, which introduces escalating prediction difficulty and serial drafting latency. To address these, we propose Speculative Pipeline Decoding (SPD), a groundbreaking framework that unlocks the true potential of pipeline parallelism. By partitioning the target LLM into n pipeline stages, SPD allows LLM to process n tokens in parallel to accelerate decoding. To continuous fill the pipeline in single sequence decoding, a speculation module aggregates intermediate features across different pipeline depths to predict the next token, executing strictly in parallel with the target model's pipeline step, to realize bounded difficulty, higher acceptance rates, and zero latency bubbles. Our experiments demonstrate that SPD achieves a significantly higher theoretical speedup compared to mainstream baselines, offering a highly scalable solution for LLM decoding acceleration. Our code is available at https://github.com/yuyijiong/speculative_pipeline_decoding
Authors:Yuwei Cheng, Weiyi Tian, Haifeng Xu
Abstract:
Fine‑tuning is often believed to reduce uncertainty and diversity in large language models, but existing analyses overlook output length, a key confounder, and therefore fail to capture how uncertainty is distributed across an entire generation rollout. To address this, we propose Canopy Entropy (\mathrmCE^\star), a measure that views language generation from a tree perspective, where ``canopy'' represents the space of all possible rollouts, making \mathrmCE^\star naturally quantify the effective size of the generation space. \mathrmCE^\star jointly captures uncertainty in both the output length N and the generated sequence Y_1:N ‑‑ indeed, we show that it equals to total Shannon entropy H(N, Y_1:N\mid X), where X denotes the prompt. This formulation yields interpretable metrics, including a length‑entropy correlation term ρ(N, r_N), where r_N is the entropy rate, quantifying information conveyance efficiency by indicating whether longer outputs are more or less informative per token. Empirically, across tasks and model families, we find that fine‑tuned models consistently exhibit stronger positive correlation ρ(N, r_N), even when total entropy decreases. Furthermore, after controlling for model family, task, prompt, and output‑length effects, we find that fine‑tuning nearly triples the correlation strength between entropy rate and semantic diversity, suggesting that aligned models convert token uncertainty into semantic diversity more efficiently. Overall, these results demonstrate that fine‑tuning does not simply reduce uncertainty, but fundamentally reorganizes it into more informative and semantically meaningful generations. Our code is available at https://github.com/WeiyiTian/canopy‑entropy.
Authors:Shenghu Jiang, Ruihao Gong
Abstract:
We propose a novel algorithm for incremental Byte Pair Encoding (BPE) tokenization. The algorithm processes each input byte in worst‑case \mathcalO(\log^2 t) time, leading to an overall complexity of \mathcalO(n \log^2 t), where n is the input length and t is the maximum token length. The algorithm incrementally maintains BPE tokenization results for every prefix of the input text, implementing the standard BPE merge procedure defined by a fixed set of merge rules. This enables efficient partial tokenization in streaming settings. Functioning as a drop‑in replacement for standard BPE, our approach achieves a speedup of up to ~3× over Hugging Face's tokenizers, and demonstrates significant latency reductions over OpenAI's tiktoken on pathological inputs. We further introduce an eager output algorithm that enables streaming output, emitting tokens as soon as token boundaries are determined during incremental tokenization. Overall, our results demonstrate that BPE tokenization can be performed incrementally with strong worst‑case guarantees, while providing practical latency benefits in modern large language model pipelines. Code: https://github.com/ModelTC/mtc‑inc‑bpe
Authors:Zhiwen You, Nafiseh Nikeghbal, Jana Diesner
Abstract:
Language models (LMs) can produce gendered language and stereotypes even when given neutral prompts. Most prior work on gender bias in LMs primarily examines gender through a binary lens (feminine vs. masculine), with limited attention to gender‑neutral forms, such as they/them pronouns or neutrally phrased job titles. How gender‑related signals are encoded in the internal representations of LMs remains an open question. In this work, we study gender‑specific neurons in LMs across three categories: feminine, masculine, and gender‑neutral. We propose a neuron‑level intervention method to identify neurons that are strongly tied to each gender category. We then test these neurons through controlled generation, showing that activating or masking gender‑related neurons can steer a sentence toward a target gender form while preserving its original meaning. To evaluate the effectiveness of our gender‑intervention approach, we curate two datasets with controlled sentences labeled across all three gender categories and validate the data quality through human evaluation. Experiments on two open‑source LMs show that gender‑specific neurons are not evenly distributed across model layers; instead, they concentrate heavily in the earliest layers with smaller contributions from later layers. Compared to existing methods, our method achieves more precise gender control, with less leakage into non‑target gender categories and stable output quality through two evaluation criteria. Overall, our work examines how gender is encoded in LMs and provides a simple yet effective approach toward controlled gender intervention for both neuron intervention evaluation and gender bias mitigation. Code and datasets are available at: https://github.com/zhiwenyou103/Gender‑Neuron‑Intervention
Authors:Tao Feng, Chongrui Ye, Tianyang Luo, Jingjun Xu, Xueqiang Xu, Haozhen Zhang, Ge Liu, Jiaxuan You
Abstract:
Long‑term memory is essential for LLM agents to reason coherently across extended interactions, personalize responses, and reuse past experience. However, existing memory‑augmented methods typically treat memory as a fixed resource: text‑space approaches concatenate retrieved memories into the context window, causing substantial token overhead and sensitivity to noisy evidence, while latent‑space approaches reduce textual cost but still rely on rigid retrieval or fixed‑capacity memory interfaces. This creates a mismatch between query‑dependent memory utility and fixed memory allocation. We propose ElasticMem, a memory‑augmented LLM framework that learns to use memory as an elastic latent resource. ElasticMem builds an offline latent memory bank with retrieval keys and content caches, retrieves memories adaptively from the reasoner's hidden state, assigns each retrieved memory a variable latent budget through a learned policy, and injects selected latent states as soft memory tokens for generation. The full memory‑use process is optimized with downstream task rewards through group‑relative policy optimization. We evaluate ElasticMem on MemorySuite, covering memory‑intensive QA and embodied agent control. Across Qwen2.5‑3B‑Instruct and Qwen2.5‑7B‑Instruct backbones, ElasticMem improves weighted average QA accuracy by 26.2% and 24.6%, and improves ALFWorld success rate by 66.3% and 27.2%, respectively, over the strongest baselines, while achieving the lowest ALFWorld token cost. Ablations and qualitative analyses further show that adaptive retrieval and elastic budget allocation help ElasticMem prioritize useful evidence and transferable plans beyond rigid cosine similarity. Our code for ElasticMem will be released at https://github.com/ulab‑uiuc/ElasticMem.
Authors:Haozhe Zhao, Shuzheng Si, Zhenhailong Wang, Zheng Wang, Liang Chen, Xiaotong Li, Zhixiang Liang, Maosong Sun, Minjia Zhang
Abstract:
Scientific figures are among the most effective means of communicating complex research ideas, yet producing publication‑quality illustrations remains one of the most labor‑intensive parts of paper preparation. Existing automated systems each target a single figure type under text‑only input, leaving the diversity of types and conditions researchers actually use unaddressed; their raster outputs further cannot be locally revised. Because scientific figures are structured compositions of discrete semantic components, the localized errors generators produce on such layouts demand not a stronger backbone but a harness. We instantiate this harness in two complementary systems: Crafter, a multi‑agent harness for figure generation that generalizes across figure types and input conditions without architectural changes, and CraftEditor, which applies the same pattern to convert raster outputs into editable SVGs. Moreover, we introduce CraftBench, a benchmark spanning three figure types and four input conditions with human quality annotation. Experiments show that Crafter substantially outperforms both standalone generators and the agentic baseline on PaperBanana‑Bench and CraftBench, with ablations confirming each component's independent contribution; CraftEditor faithfully converts outputs into editable SVGs that surpass all baselines. Our code and benchmark are available at https://github.com/HaozheZhao/Crafter.
Authors:Yue Zhang, Zun Wang, Han Lin, Yonatan Bitton, Idan Szpektor, Mohit Bansal
Abstract:
Spatial reasoning is a fundamental capability for vision‑language models (VLMs) deployed in real‑world environments. However, visual observations are inherently limited representations of a 3D world: occlusion can render objects invisible, and perspective can make geometric properties misleading. Despite this, existing spatial reasoning benchmarks typically assume that observations are sufficient and reliable, focusing on whether models produce correct answers rather than whether they recognize when a question cannot be answered and what additional observations would be needed. In this work, we challenge this assumption by constructing a controlled evaluation framework, SpatialUncertain, and introducing two types of observation challenges: (1) occlusion, which hides target information, and (2) perspective ambiguity, which produces misleading visual cues. For each configuration, we design spatial questions that are answerable under clean observations but require abstention under the introduced challenges. We further evaluate whether models can identify which additional viewpoints would resolve perspective ambiguity. Our results across a diverse set of frontier open‑ and closed‑source VLMs reveal two consistent failure modes. First, models are prone to overconfident answering, attempting to solve spatial reasoning tasks even when visual evidence is incomplete or misleading, with average accuracy around 30% under occlusion and below 10% under perspective ambiguity. Second, even when additional views are available, some models perform near random chance in identifying which would provide reliable evidence. Together, our findings call for moving beyond answer correctness toward evaluating whether models know when to abstain and how to seek reliable evidence.
Authors:Kewei Xu, Xiaoben Lu, Shuofei Qiao, Zihan Ding, Haoming Xu, Lei Liang, Ningyu Zhang
Abstract:
Real‑world data analysis is inherently iterative, yet existing benchmarks mostly evaluate isolated or short interactive tasks, leaving agents' ability to track evolving analytical context over long horizons untested. We introduce LongDS, a benchmark for long‑horizon, multi‑turn data analysis where agents must maintain, update, restore, and compose evolving analytical states. LongDS comprises 68 tasks constructed from real‑world Kaggle notebooks, spanning 2,225 turns across six domains including Geoscience, Business, and Education. Tasks are designed around state‑evolution patterns (e.g., counterfactual perturbation, rollback, multi‑state composition), with an average dependency span of 11.3 turns. Evaluating five state‑of‑the‑art models, we find that the best model reaches only 48.45% average accuracy, performance drops nearly 47 points from early to late turns, and long‑horizon errors account for 52%‑‑69% of failures. Further analysis shows that additional agent steps do not necessarily improve performance, suggesting that the key bottleneck is maintaining a correct analytical state rather than increasing interaction budget. We release LongDS to support research on reliable long‑horizon agentic data analysis. Code and data will be released at https://github.com/zjunlp/DataMind.
Authors:Yujie Luo, Xiangyuan Ru, Jingsheng Zheng, Jingjing Wang, Yuqi Zhu, Jintian Zhang, Runnan Fang, Kewei Xu, Ye Liu, Zheng Wei, Jiang Bian, Zang Li, Shumin Deng
Abstract:
Large Language Models (LLMs) have demonstrated strong performance on general tasks, while often struggling to adapt to specialized domains without high‑quality domain‑specific data. Existing LLM‑based data curation methods primarily rely on human‑designed workflows, leaving it unexamined whether LLMs can autonomously execute an end‑to‑end data engineering pipeline for model specialization. We formalize Autonomous Agentic Data Engineering, a novel task designed to evaluate LLMs as autonomous data engineers that drive model specialization through end‑to‑end data curation. We frame data as an optimizable component and study agents that plan, generate, and iteratively optimize training data across multiple domains, guided by post‑training performance improvement. Experiments show that autonomous LLM data engineers yield substantial gains, as GPT‑5.2 constructs a training curriculum that improves a student model by 57.29%, entirely through iterative, agent‑driven data adaptation. By illuminating both potential and bottlenecks, our study establishes autonomous data engineering as a measurable capability and charts a path toward agent‑driven model specialization\footnoteCode will be released at https://github.com/zjunlp/DataAgent..
Authors:Chen Henry Wu, Aditi Raghunathan
Abstract:
Self‑improvement at scale has been a longstanding goal for reasoning models, and there are two natural places to do it: at test time, through verification‑refinement (V‑R) loops; and at training time, through self‑training methods. Both are gated by the same bottleneck: the verifier. V‑R loops stall when verifier scores inflate while accuracy stagnates, and when feedback is too generic to act on; self‑training fails similarly when bad self‑generated data are added to training. Better verification would unlock both, but the capability we want to train, i.e., catching self‑generated errors, lacks training signal. To address this challenge, we propose self‑trained verification (STV). Our key observation is that, while a model cannot catch these errors alone, it can when shown the reference solution. We turn this asymmetry into a supervision target and train the verifier to imitate a more informed version of itself. At test time, STV substantially improves V‑R loops on hard problems, while alternatives (e.g., SFT, RL on verifier scores, and even meta‑verifiers) do not. STV roughly doubles accuracy on hard math and lifts it 14x on scientific reasoning tasks (1.5% to 21%). At training time, we additionally train the generator using RL with STV verifier's feedback inside the V‑R loop ‑ a procedure we call verifier‑in‑the‑loop training (ViL). Starting from an RL‑converged generator, ViL yields a further 33% gain in pass@1. More notably, the generator's standalone pass@1, with no verifier at test time, climbs 30% relative past where standard RL had converged. Hence, the next frontier in reasoning on hard problems may lie in how we train for and with verification. Website: https://ar‑forum.github.io/stv‑webpage
Authors:Haoming Xu, Weihong Xu, Zongrui Li, Mengru Wang, Yunzhi Yao, Chiyu Wu, Jin Shang, Yu Gong, Shumin Deng
Abstract:
Long‑horizon interactions require language models to manage accumulating information: when to update their state, when to preserve their state, and what to ignore. We study this challenge as Contextual Belief Management (CBM): maintaining a predicted belief state aligned with formal evidence while isolating task‑irrelevant noise. To make CBM measurable, we introduce BeliefTrack, a closed‑world benchmark spanning Rule Discovery and Circuit Diagnosis, where a finite belief space and symbolic verifiers enable exact turn‑level evaluation. BeliefTrack diagnoses three failures: Failed Stay, Failed Update, and Failed Isolation. Across multiple LLMs, vanilla models exhibit severe CBM failures, while explicit belief‑tracking prompts provide limited gains. In contrast, reinforcement learning with belief‑state rewards reduces failure rates by 70.9% on average. Further probing reveals latent belief‑state dynamics behind these failures, and representation‑level steering reduces failure rates by 46.1% across two tasks\footnoteCode is coming soon at https://github.com/zjunlp/CBM.
Authors:Travis Lelle
Abstract:
We show that LoRA adapters, the dominant distribution format for fine‑tuned LLMs, can be reliably backdoored through training data poisoning while preserving baseline task performance. On a Qwen 2.5 1.5B prompt‑injection classifier, a small fraction of poisoned examples drives a clean‑accuracy‑preserving backdoor to saturation. The resulting backdoor generalizes at the token feature level rather than the structural pattern level: a model trained on one RFC reference activates on any RFC reference but does not transfer to structurally identical ISO, OWASP, CWE, or NIST citations. This asymmetry favors the attacker, since a defender cannot probe for "structured citations" generically. We characterize the attack across base‑model scale and family, LoRA rank, and trigger string, and evaluate two complementary detection routes against a multi‑seed adapter cohort. A behavioral detector built from two probe‑battery statistics, outlier_gap and mean_attack_rate, separates poisoned from clean adapters perfectly when the battery overlaps the trigger's token neighborhood and at high recall with zero false positives when it does not. A weight‑level statistic, the cross‑module standard deviation of dimension‑normalized Frobenius norms, also separates the cohort perfectly without running the model. Combined, the two routes are robust to probe composition. Causal patching localizes the backdoor to the MLP block at mid‑to‑late layers, with down_proj as the strongest single‑projection cause. Replications across scale, family, and rank show the behavioral detector transfers without retuning, while the weight‑level detector is calibration‑bound to the base model. The attack scales monotonically with rank, and the chosen trigger‑anchor token is both trigger‑dependent and base‑model‑dependent. Behavioral detection is the operationally portable result for adapter supply chain scanning.
Authors:Milan Straka
Abstract:
We introduce CorPipe 26, our winning submission to the CRAC 2026 Shared Task on Multilingual Coreference Resolution. The fifth edition of this shared task focuses mainly on the comparison of generative LLMs and specialized systems; additionally, 5 more datasets and 2 new languages are introduced. CorPipe 26 is an improved version of CorPipe 25, with a new variant predicting empty nodes together with mentions and coreference links in a single model. Our system outperforms all other submissions in the LLM track by 2.8 percent points and all submissions in the unconstrained track by 9.5 percent points. Furthermore, we perform a series of ablation experiments with different model sizes, empty node prediction methods, and cross‑lingual zero‑shot evaluation. The source code and the trained models are publicly available at https://github.com/ufal/crac2026‑corpipe.
Authors:Matt Y. Cheung, Ashok Veeraraghavan, Hanjie Chen, Guha Balakrishnan
Abstract:
Language model reasoning traces are rarely all‑or‑nothing; they frequently contain valid intermediate steps before a critical error occurs. Existing uncertainty quantification methods typically certify final answers or entire responses, failing to provide statistical guarantees for the proportion of a sequential trace that can be safely retained. To address this, we introduce CROP (Conformal Reasoning Output Prefixes), a verifier‑agnostic calibration procedure for clean‑prefix certification. Given any step‑level risk proxy, CROP selects a calibrated threshold and returns the longest contiguous prefix whose step risk proxies remain below it, routing the uncertified suffix for downstream review or repair. Assuming exchangeability, CROP rigorously controls the marginal probability that the returned prefix contains an annotated error. Across six process‑labeled reasoning datasets, we demonstrate that standard step‑level metrics such as AUROC do not fully capture prefix utility, suggesting verifiers should instead be evaluated by certified prefix length. Furthermore, CROP balances over‑ and under‑withholding, improving downstream repair accuracy by preserving valid intermediate reasoning while discarding misleading suffixes. Ultimately, this work positions prefix certification as a rigorous, practical bridge between process supervision, abstention, and repair.
Authors:Weihan Peng, Chenxu Zhang, Qianao Wang, Yuling Shi, Heng Lian, Qihong Mao, Jiahao Pang, Chunliang Feng, Bowen Li, Xiaodong Gu
Abstract:
While LLM agents have demonstrated remarkable task‑oriented abilities such as planning, reasoning, and action, few works have treated them as complete human personalities where emotional dimensions hold equal importance. In this paper, we introduce a novel benchmark to systematically assess whether LLM agents can simulate coherent, human‑like psychology. Specifically, our benchmark constructs 11 diverse human characters grounded in orthogonal Big Five personality traits, with each profile deeply integrated with 1,000 structured autobiographical‑style episodic memories distributed across theory‑grounded developmental life stages. To rigorously evaluate the psychological manifestations of LLMs, we designed a curated suite of 64 decision‑making scenarios, guided by the DIAMONDS taxonomy, a psychological framework that characterizes situations along eight dimensions: Duty, Intellect, Adversity, Mating, pOsitivity, Negativity, Deception, and Sociality. By subjecting agents to varying scenarios, the benchmark evaluates whether they can consolidate their innate personality traits and autobiographical memories to make behavioral decisions that are consistent with their specific psychological profiles. After systematic human validation and filtering, we obtained a benchmark consisting of 673 multiple‑choice questions (MCQs). We believe this benchmark provides a principled and scalable testbed for studying human‑like emotions, personality consistency, and value‑consistent behavioural decision‑making in LLM‑based agents.
Authors:Vinay Samuel, Yapei Chang, Mohit Iyyer
Abstract:
Many open‑ended instructions have multiple valid answers that users can benefit from seeing, but post‑training often narrows an LLM's output space toward a small set of canonical responses. We introduce REDIPO, an offline DPO data‑construction pipeline for recovering distinct valid answer modes while preserving the alignment benefits of the instruct model. For each prompt, REDIPO samples responses from both base and instruct models, rewrites base‑model responses with the instruct model, filters candidates for safety and instruction‑following quality, and builds preference pairs that favor marginally diverse responses among candidates with similar instruction‑following reward. Across Qwen3‑4B, OLMo‑3‑7B, and LLaMA‑3.1‑8B, REDIPO improves NoveltyBench distinct_k by 134%, 33%, and 44% relative to the instruct checkpoints, while DivPO changes diversity by 0%, ‑6%, and ‑4% on the same models. These gains largely maintain MTBench, IFEval, and Arena‑Hard performance, and reduce direct‑category HarmBench attack success rate. Ablations show that marginal‑diversity pair selection and base‑response rewriting drive the diversity gains, while filtering and quality‑bounded pairing help maintain alignment. Overall, our results show that diverse valid answers from base‑model generations can be reintroduced through carefully constructed preference data while retaining the alignment benefits of post‑training. We release our code and data at https://github.com/vsamuel2003/ReDiPO.
Authors:Chenghao Zhang, Guanting Dong, Yufan Liu, Tong Zhao, Xiaoxi Li, Zhicheng Dou
Abstract:
Large Language Models (LLMs) have advanced autonomous agents from deep search, which retrieves concise factual answers, to deep research, which synthesizes scattered evidence into long‑form reports. However, verifiable multimodal deep research remains challenging due to open‑ended synthesis without deterministic ground truth and the need to interleave textual arguments with visual evidence. We propose Ptah, a multi‑agent harness for interleaved report generation. Ptah orchestrates the lifecycle from user query to rendered web report through planning, research, and writing stages, where specialized agents construct visual‑aware plans, collect claim‑grounded evidence, maintain source‑aligned images in a Visual Working Memory, and compose reports through declarative multimodal tool use. A verifier agent serves as the harness's acceptance function, enforcing factual grounding, citation fidelity, and cross‑modal consistency throughout the workflow. We further introduce PtahEval, an evaluation protocol that augments existing benchmarks with image‑level and presentation‑level assessments. Experiments on deep research benchmarks show that Ptah produces more reliable, visually informative, and usable human‑facing multimodal reports than strong baselines. Our code is released at https://github.com/SnowNation101/Ptah
Authors:Yunbo Tang, Chengyi Yang, Shiyu Liu, Zhishang Xiang, Zerui Chen, Qinggang Zhang, Jinsong Su
Abstract:
Agentic search enables LLMs to solve complex multi‑hop questions through iterative reasoning and external search. Despite the effectiveness, these systems often suffer from a critical limitation in practice: agents fail to recognize their own knowledge boundaries, blindly triggering searches when internal knowledge suffices and failing to terminate search even when adequate evidence has been collected. The lack of self‑awareness leads to severe over‑search, incurring substantial inference latency and prohibitive computational cost. To this end, we propose SAAS, a novel RL framework designed to cultivate dynamic self‑awareness that precisely regulates search behavior without compromising accuracy. SAAS introduces three key components: (i) a search boundary modeling mechanism, which identifies the search boundary under the evolving policy by contrasting search‑disabled and search‑enabled rollouts; (ii) a boundary‑aware reward module, which translates this boundary awareness into trajectory‑level penalties, suppressing unnecessary and redundant searches; and (iii) a stage‑wise optimization strategy, which leverages a sequential curriculum to prioritize reasoning over search regularization, thereby avoiding reward hacking. Extensive experiments demonstrate that SAAS substantially reduces over‑search, while maintaining accuracy. Our code and implementation details are released at https://github.com/XMUDeepLIT/SAAS.
Authors:Kaijie Zheng, Weiqin Wang, Yile Wang, Hui Huang
Abstract:
Calculating semantic textual similarity is a foundational task in natural language processing. Current large language models (LLMs) based methods typically rely on extracting last‑layer hidden states with fixed dimensions to compute similarity for every text pairs. We argue that this paradigm is suffer from two limitations: (i) The last hidden layer encodes more general knowledge rather than just semantic knowledge, making it suboptimal for semantic similarity computation; (ii) The hidden layer dimensions of LLMs are generally very large, which introduces some redundancy and noise for representing semantics. In this work, we propose DySem, a novel training‑free framework that investigates more semantic‑related internal components of LLMs via multilingual consensus, and shifts away from static representation spaces in favor of dynamic, sample‑specific semantic dimensions by constructing text‑dependent joint semantic set and computes similarity over this shared dimensional subset. Extensive experiments across various LLMs show that our method consistently outperforms recent baselines while maintaining lower dimensions for similarity calculation. The code is released at https://github.com/szu‑tera/DySem.
Authors:Aditi Khandelwal, Marius Mosbach, Verna Dankers, Siva Reddy, Golnoosh Farnadi
Abstract:
Mixture‑of‑Experts (MoE) models are widely used to scale language models, yet their expert routing behavior and adaptation in a multilingual setting remain underexplored. In this work, we study multilingual routing dynamics during continual pre‑training of an English‑centric MoE model on a multilingual corpus, analyzing how expert usage varies across languages. We find that continual multilingual pre‑training leads to diffused, language‑agnostic routing in early and middle layers, with language specialization primarily emerging in the final layers. We also show that token‑level vocabulary overlap between languages plays an important role in how languages are routed. Motivated by these findings, we propose a parameter‑efficient adaptation strategy that updates language‑specific and shared experts in the final MoE layers. Experiments on MultiBLiMP and Belebele show that our method achieves a strong performance‑efficiency trade‑off, attaining competitive performance relative to fine‑tuning complete final layers, while updating less than 2% of the parameters. Overall, our findings provide insights into where and how language specialization emerges in MoEs during continual pre‑training and provide practical insights for low‑resource multilingual adaptation. Our code is available at https://github.com/aditi184/moe‑routing‑adaptation.
Authors:Pawel Batorski, Abtin Pourhadi, Jerzy Sarosiek, Przemyslaw Spurek, Paul Swoboda
Abstract:
Large language models are highly sensitive to prompts, but this sensitivity is usually studied through task‑relevant instructions, demonstrations, or reasoning cues. In this paper, we study a different form of prompt sensitivity: whether prompts that are semantically unrelated to the task can nevertheless steer model behavior. We call them spurious prompts and show their surprising efficacy. We also propose a simple black‑box search procedure for discovering them. Across reasoning and question‑answering benchmarks, using models ranging from 0.8B to 27B parameters and spanning three model families, we show that spurious prompts can improve performance, often matching or outperforming standard prompting baselines and task‑aware prompt optimization. We further show that they can steer models toward unintended behaviors, such as repeatedly selecting the first answer option, producing incorrect answers, returning an even, prime or small number without explicitly instructing the model to do so. These findings reveal a new kind of prompt sensitivity: LLMs can be systematically steered by prompts that are unrelated to the task they are asked to solve. Our code is available at https://github.com/Batorskq/spurious
Authors:Youwang Deng
Abstract:
End‑to‑end agent‑memory benchmarks report a single hit@k per retriever, confounding lexical leakage (uncontrolled query/gold/distractor entity overlap) with tag‑mixing (preferences, services, tools averaged together). We propose entity‑collision, a system‑agnostic protocol that pins the BM25 floor by construction ‑‑ every distractor shares the answer's entity tokens ‑‑ and stratifies queries by discriminator tag, so any lift over BM25 is attributable to the embedder. Applied to an open‑source agent‑memory testbed across 5 tags x 3 embedders x 5 collision degrees with paired‑bootstrap 95% CIs, the protocol reveals a two‑axis pattern: a 256‑d hash trigram helps only on closed‑vocabulary lexical tags at deep collision; MiniLM‑384 dominates both axes; and a 2.7x‑parameter BGE‑large does not uniformly improve on MiniLM ‑‑ it wins on intent‑style queries but loses on lexical ones. Encoder capacity alone is not the binding constraint. The synthetic intent‑tag null replicates on LongMemEval (n=500) as a single‑session‑preference recall cliff. Adaptive vector‑weight routing on LoCoMo is a measured null: 11.7pp of oracle headroom exists, but no signal we tested recovers it. All 26 result tables and 37 reproduce scripts are version‑controlled and verified by a public registry; the protocol is exercised on a deterministically governed memory testbed (event‑sourced decision log, DAG‑state‑machine schema lifecycle) so every reported CI is reproducible byte‑for‑byte from the ingest stream.
Authors:Rohan Shravan
Abstract:
Large language models route every input through a learned embedding table of shape |V| x d_model, consuming hundreds of millions to billions of trainable parameters at frontier scale. We introduce Kronecker Embeddings, a deterministic byte‑level character‑position factorization that replaces this table with a fixed encoder and a single learned projection, compatible with standard BPE tokenizers, eliminating 91‑‑94% of input‑side trainable parameters at frontier scale. We provide five contributions. First, a cross‑model probe across six LMs (135M‑671B parameters) shows trained input embeddings cluster typographic variants of the probe word far more than morphological relatives; Kronecker escapes this clustering at the embedding layer. Second, a controlled three‑seed comparison on nanoGPT GPT‑2 124M over 2.5B tokens of FineWeb‑Edu shows Kronecker reaching 2.5 +‑ 0.2% lower validation loss than the BPE‑tied baseline (gap 0.083 +‑ 0.007 nats, ~9% lower perplexity), needing ~1.43x fewer steps to reach BPE's converged loss. Third, a spelling‑robustness probe over 110 clean/typo pairs shows Kronecker preserves the top‑1 prediction on 55.5% of pairs vs. 47.3% for BPE (+8.2 pp) and lowers KL by 7.6%, winning or tying in 10 of 11 categories; a generation probe shows Kronecker echoes byte‑novel strings and typos through generation where BPE forgets them. Fourth, BPE embedding norm drifts during training while Kronecker projection norm stays near 1.0, consistent with a stable representational target. Fifth, an on‑the‑fly runtime variant reconstructs embeddings from a 4.5 MB byte buffer rather than a 2.15 GB table at vocabulary 131,072, with 0.01‑‑0.24% step‑time overhead. Byte‑level locality has a tradeoff: byte‑similar but semantically distant pairs (compute/commute, nation/notion) cluster together, shifting disambiguation to early attention layers.
Authors:Tianpeng Bu, Xin Liu, Qihua Chen, Hao Jiang, Shurui Li, Hongtao Duan, Lu Jiang, Lulu Hu, Bin Yang, Minying Zhang
Abstract:
While GUI agents have advanced rapidly, they often lack the robustness to recover from their own errors, hindering real‑world deployment. To bridge this gap at both the evaluation and data levels, we introduce GUI‑RobustEval and propose Robustness‑driven Trajectory Synthesis. GUI‑RobustEval contains 1,216 executable test cases that systematically measure error recovery capabilities across a broad and realistic spectrum of error modes. At the data level, RoTS is a scalable synthesis framework that creates 800k high‑quality data via a tree‑based pipeline that proactively discovers diverse error modes and synthesizes corresponding recovery steps. Our two models, RoTS‑7B and RoTS‑32B, fine‑tuned on our dataset, both demonstrate significant gains on GUI‑RobustEval and traditional GUI benchmarks. Notably, RoTS‑32B achieves state‑of‑the‑art performance on OSWorld, with a 47.4% success rate and a 33.8% All‑Pass@4 score, suggesting that improved long‑horizon error recovery ability contributes to both robustness and overall performance. Our code is available at https://github.com/AlibabaResearch/RoTS.
Authors:Rohan Shravan
Abstract:
We present BrahmicTokenizer‑131K, a 131,072‑vocabulary byte‑level BPE tokenizer that closes the Brahmic compression gap at the 131K‑vocabulary class while preserving the English, EU‑language, and code compression of OpenAI's o200k_base. We construct it through a two‑stage retrofit: (1) a script‑prune crop that reduces 200,019 tokens to 131,072 by removing nine out‑of‑scope writing systems, and (2) a surgical retrofit of 2,372 corpus‑dead vocabulary slots determined by linear‑programming allocation across nine Brahmic Unicode blocks. The pre‑tokenizer, decoder, and inherited merge rules are unchanged from o200k_base, making BrahmicTokenizer‑131K a drop‑in replacement at the tokenizer interface. On 27 million documents of public Indic pretraining text (2.84 billion words, 46.21 GB), BrahmicTokenizer‑131K produces 26.7% fewer tokens than Mistral‑Nemo Tekken / Sarvam‑m at the same vocabulary budget, with per‑language savings of 15.79% (Tamil) to 76.79% (Odia, a 4.31x compression ratio). The Odia advantage is mechanistically explained by Tekken/Sarvam‑m containing zero Oriya‑block tokens; our surgery added 725. On non‑Indic content, BrahmicTokenizer‑131K matches o200k_base's English fertility (1.235 vs 1.232 tokens/word) and beats Tekken/Sarvam‑m by 4.0‑14.2% on HumanEval, MBPP, and GSM8K. Across our 14‑tokenizer benchmark, it is the only tokenizer simultaneously competitive on Brahmic, English, EU, code, and math at the 131K budget. Specialist tokenizers at other vocab classes (Sarvam‑30B, Sarvam‑1, MUTANT‑Indic) achieve better Indic compression at the cost of non‑Indic performance: Sarvam‑1's English fertility is 15.9% worse and its code/math compression 26‑33% worse than ours. We release the artifact under Apache 2.0 at https://huggingface.co/theschoolofai/BrahmicTokenizer‑131K.
Authors:Riza Setiawan Soetedjo, Yusuke Sakai, Hidetaka Kamigaito, Jingun Kwon, Manabu Okumura, Taro Watanabe
Abstract:
Improving the quality of model‑generated summaries, especially factuality, the accuracy of a summary with respect to its source content, remains a challenge. While reranking could select the optimal output from multiple generated candidates, it is limited to only using the source as guidance, resulting in unreliable summaries. To address this limitation, we propose ConSUM that reranks candidate summaries by considering two factors: consistency to the source document and consensus among the other candidates. Consensus is established using Minimum Bayes Risk (MBR) decoding over the set of generated summaries, while ensuring consistency by employing factuality‑aware metrics that compare the summary against the source. Rigorous testing demonstrates that our system is competitive with existing methods, with human evaluations further confirming that its generated summaries are preferred over those from other systems. Our code is available at https://github.com/naist‑nlp/ConSUM .
Authors:Mohan Zhang, Yuqi Jia, Zhen Tan, Steven Jiang, Neil Zhenqiang Gong, Tianlong Chen, Dawn Song
Abstract:
LLMs are vulnerable to prompt injection attacks. However, this vulnerability has been primarily demonstrated conceptually in academic studies or through a few anecdotal case studies. Its prevalence and impact in real‑world LLM‑based applications are largely unexplored. In this work, we present the first systematic study of prompt‑injection attacks in a widely used application: LLM‑based resume screening. Our analysis is based on approximately 200K real‑world resumes collected over multiple years by hireEZ. We first design tailored methods to detect prompt injection in resumes. Manual validation on a small‑scale dataset demonstrates that our detectors achieve high precision and outperform state‑of‑the‑art general‑purpose detectors. We then apply our detector to the full resume dataset and conduct a comprehensive measurement study of real‑world prompt injection attacks. Our analysis reveals several intriguing findings: approximately 1% of resumes contain hidden prompt injections; the prevalence of such injected resumes has increased noticeably over the past one to two years; and more than 90% of injected prompts do not use explicit instructions. These results provide the first evidence of large‑scale prompt injection in real‑world LLM‑based applications and lay the groundwork for future studies to understand and mitigate such attacks.
Authors:Venkat Akhil Lakkapragada
Abstract:
Large language models have achieved strong reasoning capabilities, though often at the cost of massive parameter counts and expensive inference. In this work, we explore a different direction: adaptive reasoning depth in compact language models. We present CosmicFish‑HRM, a compact language model built around a Hierarchical Reasoning Module (HRM) that dynamically allocates computational effort during inference. Instead of applying fixed computation to every input, the model iterates through high‑level and low‑level reasoning cycles and learns when to halt based on input complexity. CosmicFish‑HRM combines this adaptive reasoning core with modern transformer components including Grouped Query Attention, RoPE, and SwiGLU activations. While the additional reasoning infrastructure introduces overhead at small scale, we hypothesize that this tradeoff becomes increasingly favorable as model size grows and the relative cost of the HRM core diminishes. Our results show that the model learns non‑uniform reasoning behavior, allocating different numbers of reasoning steps across tasks and inputs. These findings suggest that adaptive reasoning depth may offer a promising alternative to relying solely on parameter scale for reasoning capability.
Authors:Jeanmely Rojas Nunez, Viraj Sawant, Nathan Allen, Nomgondalai Amgalanbaatar, Yannis Zongo, Vasu Sharma, Maheep Chaudhary
Abstract:
Fine‑tuning large language models (LLMs) frequently induces catastrophic forgetting of prior capabilities. Recent work has shown that reinforcement learning (RL) retains prior capabilities more effectively than supervised fine‑tuning (SFT), attributing this to policy‑gradient updates remaining closer to the base policy \citeshenfeld2025rl. We extend this behavioral account to the mechanistic level and ask whether RL's advantage is mirrored by stronger preservation of internal computational circuits. We introduce differential circuit vulnerability, a head‑level measure of how much a circuit degrades under fine‑tuning, and use it to compare RL and SFT on Qwen2.5‑3B‑Instruct adapted to scientific question‑answering. We find a clear mechanistic trade‑off: SFT adapts more rapidly to the target task but produces substantially greater circuit disruption and forgetting of prior capabilities, whereas RL preserves a larger fraction of the base circuit at the cost of slower task adaptation. These findings suggest that circuit preservation may help explain why RL is more robust to catastrophic forgetting. We released our code here: https://github.com/rl‑sft‑circuit‑research/differential‑circuit‑vulnerability.
Authors:Dong Liu, Yanxuan Yu, Ying Nian Wu
Abstract:
The success of large language models (LLMs) across diverse NLP tasks has elevated the importance of reasoning chain optimization as a critical step in aligning model behavior with task objectives. Existing reasoning chain tuning methods often rely on black‑box heuristics or gradient‑free search, which lack interpretability, generalization, and sample efficiency. In this work, we introduce Thoughts‑as‑Planning, a novel framework that formalizes reasoning chain optimization as a sequential decision‑making process over a latent semantic space. We model the LLM as a partially observable environment and learn a latent world model that simulates the effect of reasoning chain edits on downstream outputs. A proximity‑preserving embedding space is constructed to encode reasoning chain‑response dynamics, enabling planning via gradient descent or reinforcement learning. Our method supports multi‑scale abstraction, allowing reasoning chain edits at token, segment, and instruction levels to be integrated into a unified planner. Through extensive experiments on language understanding and generation tasks, we demonstrate that Thoughts‑as‑Planning outperforms state‑of‑the‑art reasoning chain tuning baselines in efficiency, robustness, and generalization, while offering interpretability through its structured planning trajectory. Our code is available at https://github.com/FastLM/Thoughts‑as‑Planning.
Authors:Gyumin Kim, Juhwan Park, Jaeha Kim, Seunggyun Han, Kyungrak Son, Ikbeom Jang
Abstract:
While Large Language Models (LLMs) have demonstrated remarkable capabilities, their reliability is significantly compromised by hallucinations. Existing intrinsic self‑correction methods attempt to address this, but often fail due to self‑bias, where models struggle to identify errors in their own outputs without external verification. To overcome these limitations, we propose the LDPC‑inspired semantic error correction for retrieval‑augmented generation (SERC), providing a theoretical framework to interpret and mitigate LLM hallucinations. We reformulate the text generation process as a semantic noisy channel, treating generated responses as noise‑corrupted codewords. Inspired by low‑density parity‑check (LDPC) codes, SERC employs a sparse verification strategy: instead of exhaustively checking all facts, it generates low‑density verification queries and validates them against external evidence to efficiently detect and correct errors. We evaluate SERC on LongForm Bio and TruthfulQA benchmarks using Llama‑3‑8B and Qwen2.5‑14B. Experimental results demonstrate that SERC outperforms both intrinsic self‑correction methods and strong retrieval‑augmented baselines, demonstrating significant gains especially in factual precision (FactScore). Notably, SERC enables small language models (SLMs) to surpass the performance of larger baselines in hallucination reduction and information preservation. Our findings demonstrate that SERC provides a training‑free, model‑agnostic solution that significantly reduces verification overhead compared to dense methods, achieving an optimal trade‑off between cost and fidelity in resource‑constrained environments.
Authors:Xinle Deng, Ruobin Zhong, Hujin Peng, Xiaoben Lu, Yanzhe Wu, Guang Li, Buqiang Xu, Yunzhi Yao, Jizhan Fang, Haoliang Cao, Junjie Guo, Yuan Yuan, Ziqing Ma, Yuanqiang Yu, Rui Hu, Baohua Dong, Hangcheng Zhu, Ningyu Zhang
Abstract:
Memory is essential for enabling large language models to support long‑horizon reasoning, yet existing memory systems remain unreliable and difficult to debug. Tracing memory's dynamic evolution is crucial to understand how information is synthesized, propagated, or corrupted over time. In this work, we study the new problem of error tracing and attribution in LLM memory systems. We propose a novel framework that transforms memory pipelines into executable memory evolution graphs, enabling fine‑grained tracing of operational information flow. We then construct MemTraceBench, a benchmark collected from representative memory systems such as Long‑Context, RAG, Mem0, and EverMemOS, to systematically study memory failure modes. We further introduce an automatic attribution method that iteratively traces operation subgraphs to pinpoint the root cause of any failed case. Our analysis reveals that memory failures are systematic, stemming from operation‑level issues like information loss and retrieval misalignment. Crucially, we leverage these fine‑grained attribution signals to guide downstream prompt optimization, establishing a closed‑loop system that automatically corrects faults and boosts end‑task performance by up to 7.62%. Code will be released at https://github.com/zjunlp/MemTrace.
Authors:Jan Christian Blaise Cruz, Alham Fikri Aji
Abstract:
Sense representations (explicit, per‑token meaning decompositions) are useful for disambiguation, steering, and cross‑lingual alignment, but existing approaches require models to be pretrained with sense structure baked in. We introduce ACROS, which induces an explicit sense pathway into a frozen pretrained decoder LM through a gated residual addition. On SmolLM2‑360M, ACROS preserves base LM quality while supporting three uses of the same induced variables: zero‑shot word‑sense disambiguation (64.95 F1 on Raganato ALL, competitive with the WordNet first‑sense heuristic), low‑KL lexical steering across 5,161 CoInCo cases where a simple non‑oracle proxy recovers about 90% of positive shifts, and SENSIA cross‑lingual adaptation to four languages (mean R@1 0.988, target FLORES PPL 7.94). ACROS makes sense representations an inducible interface for ordinary pretrained LMs.
Authors:Yanqiu Zhao, Dongying Zheng, Kaibo Huang, Yukun Wei, Zhongliang Yang, Linna Zhou
Abstract:
GUI agents rely on screenshots to infer intent and operate across applications, but these screenshots often contain private messages, medical records, payment credentials, and workplace‑specific workflows. Privacy decisions in this setting depend on task, recipient, application state, and user role, yet static PII detectors miss these boundaries and cloud‑side VLM reasoning can upload the raw screen before deciding what should be protected. We present MaskClaw, an edge‑side privacy arbitrator for GUI agents. MaskClaw extracts local visual evidence, retrieves user‑ and task‑specific policy memory, and decides Allow, Mask, or Ask before raw screenshots leave a trusted user‑ or organization‑controlled environment. In five designed skill‑evolution scenarios, it turns corrections, cancellations, and edits into reusable privacy skills checked by a sandbox gate. We introduce P‑GUI‑Evo, a benchmark built from real UI patterns, reconstructed HTML screens, and sanitized labels. Experiments show that pattern matching, cloud reasoning, and routing alone tend to over‑confirm, over‑mask, or expose raw screenshots under the same protocol. The artifact is available at https://github.com/Theodora‑Y/MaskClaw.
Authors:Zheng Wu, Pengzhou Cheng, Zongru Wu, Yuan Guo, Tianjie Ju, Aston Zhang, Gongshen Liu, Zhuosheng Zhang
Abstract:
Recent advancements in multimodal large language models (MLLMs) have shown exceptional potential in enabling mobile‑using agents to autonomously execute human instructions. However, fully automated agents often try to execute tasks even when they are unable to resolve them, leading to the problem of over‑execution. Previous studies solve it by training a interactive mobile‑using agents to let agents request human interaction when agents can not complete user instructions. However, we find that these interactive agents tend to exhibit over‑soliciting behavior, relying excessively on human intervention. To mitigate both over‑execution and over‑soliciting, we propose a universal confidence integration framework that enables confidence‑driven proactive and robust interaction in MLLM‑based mobile‑using agents. The framework consists of two stages: interaction capability empowerment and confidence bias correction. In the interaction capability empowerment stage, agents learn through supervised fine‑tuning to output both actions and confidence scores. In the confidence bias correction stage, agents learn to output more accurate confidence scores by combining semantic similarity retrieval with direct preference optimization. Experimental results show Mobile‑Aptus achieves state‑of‑the‑art performance on the four popular mobile‑using agent benchmarks: OS‑Kairos, AITZ, Meta‑GUI, and AndroidControl. Mobile‑Aptus consistently outperforms all baselines in offline benchmarks, with an average improvement over 17% in task success rate. In real‑world dynamic experiments, Mobile‑Aptus surpasses the baseline by 26% in task success rate with only 0.64 intervention steps per instruction. The codes are available at https://github.com/Wuzheng02/Mobile‑Aptus.
Authors:Katharina Deckenbach, Haritz Puerto, Jonas Geiping, Sahar Abdelnabi
Abstract:
The validity of AI safety evaluations depends on models behaving consistently across controlled and deployment settings. Prior work has identified test‑time contextual cues, such as hypothetical scenarios, as a source of verbalized evaluation awareness and subsequent behavioral shift. In this paper, we investigate a potential explanation of this phenomenon: evaluation meta‑knowledge, defined as parametric knowledge about the structural traits that characterize evaluations. Similar to dataset contamination, where benchmark exposure leads to higher performance through memorization, we hypothesize that models trained on texts describing evaluation practices may implicitly learn to recognize and respond to evaluation‑like contexts, for instance, through exposure to scientific articles or social media posts about AI benchmarking. To test this, we fine‑tune models on synthetic documents describing evaluation traits such as verifiable structures or moral dilemmas. Evaluating this fine‑tuned model on six safety benchmarks, we find that it is significantly safer than the base model and control model. This behavioral shift persists even when restricting the analysis to responses lacking explicit verbalization of evaluation awareness. Our results demonstrate that evaluation meta‑knowledge may inflate safety benchmark performance, introducing a novel confounder that is independent of explicit memorization or verbalized evaluation awareness, thus, challenging to detect. These findings have important implications for the design and interpretation of AI safety evaluations. Our code and models are available at https://github.com/compass‑group‑tue/arxiv2026_evaluation_meta_knowledge.
Authors:Zheng Wu, Chengcheng Han, Zhengxi Lu, Tianjie Ju, Yanyu Chen, Qi Gu, Xunliang Cai, Zhuosheng Zhang
Abstract:
Despite the rapid progress of multimodal large language models in building Graphical User Interface (GUI) agents, their real‑world task completion is fundamentally bottlenecked by a lack of world knowledge about GUI operations. Existing solutions typically rely on expensive multi‑agent scaffolding or conventional post‑training paradigms, such as Supervised Fine‑Tuning (SFT) and Reinforcement Learning (RL). However, post‑training only allows agents to implicitly absorb world knowledge through action annotations or reward signals, leading to inefficient trajectory memorization rather than genuine comprehension. Therefore, an approach that enables explicit learning of this knowledge is imperative. To this end, we propose GUI‑CIDER, a mid‑training method that explicitly internalizes GUI world knowledge through Causal Internalization and Density‑aware Exemplar Reselection. GUI‑CIDER operates in three stages: (1) data synthesis, which distills static planning and dynamic causal knowledge from GUI trajectories into text; (2) exemplar reselection, which filters the corpus by rewarding causal structures and penalizing semantic redundancy; and (3) mid‑training, where the refined data is used to embed the acquired knowledge. Extensive experiments on two GUI knowledge benchmarks and three task completion benchmarks demonstrate that GUI‑CIDER consistently improves both the agent's understanding of GUI operations and its task success rates.The codes are available at https://github.com/Wuzheng02/GUI‑CIDER.
Authors:An Dao, Nhan Ly, Thao Tran, Yuji Matsumoto, Akiko Aizawa
Abstract:
Prion diseases are rare, rapidly progressive, and fatal neurodegenerative disorders that remain difficult to diagnose, particularly in their early stages because of nonspecific clinical presentations. However, to our knowledge, there is no publicly available prion‑disease‑focused dataset designed to capture a broad range of clinically relevant entities from the biomedical literature. We introduce PrionNER, a manually annotated named entity recognition dataset for prion disease clinical information in PubMed abstracts. The current release comprises 317 abstracts, 2,943 sentences, and 6,955 text‑bound entity annotations spanning 15 coarse‑grained and 31 fine‑grained clinically oriented entity types covering diseases, symptoms, diagnostics, findings, anatomy, treatments, and temporal and statistical evidence. Inter‑annotator agreement reaches 81.78 exact‑match F1, indicating strong annotation consistency. We benchmark supervised BERT baselines, W2NER, and zero‑shot extractors on PrionNER. W2NER is the strongest supervised model, and Gemma‑4‑31B is the strongest zero‑shot model, but the benchmark remains challenging, especially for structurally complex mentions and fine‑grained clinically adjacent label distinctions. PrionNER provides a clinically grounded benchmark for prion‑disease information extraction and supports research on rare‑disease biomedical NLP under low‑resource, fine‑grained, and non‑flat extraction conditions. The dataset, annotation guidelines, and evaluation scripts are available at https://github.com/daotuanan/PrionNER/.
Authors:Ifeoluwa Kunle-John, Josiah Paul, Oluwatosin Agbaakin, Peter Aina, Ikenna Odezuligbo, Sydney Anuyah
Abstract:
Causal relation extraction (CRE) is central to biomedical text mining, but current resources often conflate causal relations with broader associations, restrict annotation to sentence‑level examples, or focus mainly on explicit causal cues. This limits their usefulness for evaluating whether models can recover causal claims as they are actually expressed in biomedical text. We introduce PubMedCausal, a span‑level annotated corpus for biomedical CRE built from PubMed abstracts. The corpus contains 30,000 paragraph‑level rows, including 3,945 causal rows and 6,491 adjudicated cause‑‑effect pairs. Each causal relation is annotated with full‑text cause and effect spans, causality type, and sententiality, enabling evaluation of both causal detection and full‑span causal extraction. We benchmark discriminative encoders and open‑source generative models across detection and extraction settings. For causal detection, biomedical encoders are strongest, with PubMedBERT reaching an F_1 score of 0.7391. For span‑level extraction, the best generative baseline is DeepSeek‑R1‑32B with few‑shot prompting, reaching a Cosine Pair F_1 of 0.6765. We further test transfer learning by evaluating PubMedCausal‑trained encoders on external causal relation datasets, showing that the resource supports cross‑dataset evaluation. Our results show that biomedical CRE remains difficult under class imbalance, long causal spans, implicit causality, inter‑sentential relations, and prompt sensitivity. Code and Data can be found here: https://github.com/josiahpaul07/PubMedCausal_Exp
Authors:Zheng Li, Mao Zheng, Mingyang Song, Tianxiang Fei
Abstract:
General‑purpose machine translation benchmarks such as FLORES‑200 have reached a saturation regime on Chinese‑English pairs, where modern large language models cluster within a narrow band of high scores. Across 22 systems, FLORES‑200 zh‑en GEMBA scores fall in a 7.87‑point range with a standard deviation of 2.29, which compresses the separation between systems on knowledge‑intensive domains such as finance, healthcare, law, and science and technology. We introduce HardMTBench, a difficulty‑aware diagnostic benchmark for bidirectional Chinese‑English domain translation. HardMTBench covers 12 domains and contains 10,000 hand‑curated source sentences with reference translations, packaged as 20,000 directional test items. A three‑stage construction pipeline builds a domain‑balanced candidate pool of 84,566 pairs, applies an LLM‑based multi‑signal judge over knowledge density, translation difficulty, terminology load and reference correctness, and assembles the final test set under a hardness fusion rule with per‑domain quotas. Across 22 systems spanning general LLMs, commercial engines and specialised MT models, HardMTBench widens the cross‑system GEMBA range by roughly a factor of two over FLORES‑200, induces visible rank reorderings, and exposes domain‑specific terminology and knowledge weaknesses that quality‑only metrics tend to flatten. All data and code are open‑sourced at https://github.com/jasonNLP/HardMTBench.
Authors:Yoonjin Jang, Junwoo Kim, Youngjoong Ko
Abstract:
Entity Alignment (EA) is essential for knowledge graph (KG) fusion, but existing benchmarks often allow models to exploit name overlap rather than relational structure. This makes it difficult to evaluate whether models can reject same‑name entities that refer to different real‑world objects. Our primary contribution is a same‑name hard‑negative augmentation strategy that simultaneously yields quality‑controlled evaluation benchmarks (DW‑HN29K, DY‑HN27K) and augmented training corpora (DW‑Train, DY‑Train), by mining same‑name but distinct entity pairs from KG name‑collision groups. We further introduce HELEA, a two‑stage framework integrating (i) entity encoder retrieval trained on hard‑negative‑augmented training corpora with 1‑hop KG context, and (ii) LLM‑based reranking without additional training. Experiments show that name‑dependent baselines collapse to near‑random performance on our hard‑negative benchmarks, while HELEA achieves F1 0.967 on DW‑HN29K while maintaining Hit@1 0.993 on standard DW‑15K.
Authors:Jan Sikora, Paweł Lenartowicz, Hubert Plisiecki
Abstract:
Cross‑cultural comparison of psychological meaning requires methods that go beyond word‑level translation and examine how semantic dimensions are organized across languages. We introduce a cross‑lingual extension of the Supervised Semantic Differential (SSD), which estimates supervised semantic gradients in embedding space and compares them across aligned multilingual word embeddings. The method tests gradient alignment and difference using permutation procedures and bootstrap intervals, and interprets residual differences through clustering around the difference gradient. We demonstrate the approach on Polish, English, and French affective norm lexicons, modeling Valence, Arousal, and Dominance where available. Affective dimensions were significantly recoverable across languages and model settings. Cross‑lingual comparisons showed broad alignment together with structured residual differences: Valence appeared mostly shared, whereas Arousal and Dominance produced more interpretable contrasts involving bodily threat, aesthetic stimulation, internal emotionality, macro‑level authority, and everyday control. Several clusters also reflected corpus‑specific artifacts, underscoring the need for cautious interpretation. Cross‑lingual SSD offers an explainable framework for testing semantic alignment, identifying divergence, and generating hypotheses about cross‑cultural differences in psychological meaning.
Authors:Evgenii Palnikov, Elizaveta Gavrilova
Abstract:
We study quality‑latency‑resource trade‑offs in a documentation‑grounded retrieval‑augmented generation (RAG) system that uses Low‑Rank Adaptation (LoRA) of the generator. We build a manually verified benchmark of 5,144 question‑answer pairs over the official Kubernetes documentation and combine it with a fixed hybrid‑retrieval pipeline (BGE‑M3 dense, BGE‑M3 native sparse, Reciprocal Rank Fusion, cross‑encoder reranking). Over this benchmark we ablate 20 LoRA configurations on Llama‑3.2‑3B‑Instruct and Llama‑3.1‑8B‑Instruct across rank and target‑module choices, and evaluate each on token‑level F1, LLM‑judged groundedness and correctness (pass@4), inference latency, inference memory, and training cost, all reported with bootstrap 95% confidence intervals. Pareto analysis shows that LoRA adapters acting only on the q and v attention projections consistently dominate the front, while the 3B/8B choice mainly defines operating regime. A param‑matched control comparison further indicates that the q/v advantage is structural rather than purely parametric. The benchmark, selected adapters, and code are available at https://github.com/EugPal/rag‑lora‑tradeoffs.
Authors:Mingrui Sun, Mao Zheng, Zheng Li, Mingyang Song
Abstract:
Modern translation workflows demand more than semantic equivalence. Users routinely require models to preserve JSON or HTML schemas, honor curated glossaries, disambiguate with provided context, and match prescribed registers, often several at once. Conventional metrics such as BLEU and xCOMET capture semantic fidelity but provide little signal on constraint adherence, while general instruction following benchmarks ignore the cross‑lingual nature of translation. We introduce \bench, a benchmark for multilingual translation instruction following covering seven languages, with 4,506 single‑constraint and 2,838 multi‑constraint items spanning six constraint dimensions and five compositional patterns with instructions issued in all seven languages. Constraints are split into a gating subset verified by deterministic checkers and a continuous subset scored by a rubric‑based LLM judge, combined under a multiplicative rule that resists reward hacking. Evaluating 15 models reveals systematic gaps that prior protocols miss: Instruction following scales with size more sharply than translation quality, glossary and structured‑format constraints dominate the difficulty gradient, and general instruction following rankings correlate only weakly with translation behavior. Our benchmark are available at https://github.com/Tencent‑Hunyuan/Hy‑MT2/tree/main/IFMTBench.
Authors:Zerui Chen, Qinggang Zhang, Zhishang Xiang, Zhimin Wei, Linfeng Gao, Xiao Huang, Zhihong Zhang, Jinsong Su
Abstract:
Graph‑based Retrieval‑Augmented Generation (GraphRAG) advances flat document retrieval by structuring knowledge as relational graphs, enabling more coherent and effective reasoning. However, applying it to specific domains like legal reasoning faces critical challenges. (i) Legal corpora are heterogeneous, containing multi‑granular knowledge from cases, articles and interpretations. A flat knowledge graph cannot adequately differentiate between factual details, applied rules, and abstract principles, limiting accurate retrieval. (ii) Reliable legal judgment demands transparent, evidence‑based reasoning. Traditional RAG passes retrieved context directly to an LLM without verification, resulting in opaque, error‑prone reasoning. To this end, we propose LegalGraphRAG, a framework designed for reliable legal reasoning. Our approach introduces two core components: a hierarchical legal graph that hierarchically organizes legal sources to enable retrieval at appropriate abstraction levels, and a multi‑agent system for reliable legal reasoning, where a Researcher retrieves candidate evidence, an Auditor rigorously verifies its validity against source documents, and an Adjudicator synthesizes the set of verified evidence to render a final judgment. Extensive experiments show that LegalGraphRAG achieves the state‑of‑the‑art performance, outperforming existing GraphRAG baselines in accurate and trustworthy legal analysis. Our code, datasets and implementation details are available at https://github.com/XMUDeepLIT/LegalGraphRAG.
Authors:Junjie Mu, Qiongxiu Li
Abstract:
Federated Retrieval‑Augmented Generation (FedRAG) is attractive for privacy‑sensitive applications because raw data remain local. As a result, routing must rely on client‑provided semantic profiles, creating a new opportunity for manipulation. We introduce Routing Hijacking, a routing‑stage attack in which a malicious client forges its profile to attract target queries despite having irrelevant underlying data. We show that this vulnerability is severe. Across three representative FedRAG routing architectures, Routing Hijacking consistently misroutes target queries and leads to downstream disruptions and failures, including missing evidence, poisoning, incorrect answers, and hallucinations. In a high‑stakes MedQA‑USMLE case study, we further show that poisoned retrieved evidence can mislead models across scales, leading to incorrect answers, hallucinations, and sycophantic failures. Existing defenses do not close this gap: encrypted routing preserves the exploited ranking, and Byzantine‑robust Federated Learning (FL) rules transfer poorly to heterogeneous routing profiles. To address this gap, we propose a trust‑aware post‑routing framework that reweights clients using returned‑evidence feedback, including retrieval relevance, profile consistency, and cross‑client agreement; online experiments show that it suppresses persistent hijacking over recurring queries and transfers to a learned neural router. Our findings establish routing integrity as a new security challenge in FedRAG and highlight the need for stronger defenses for secure federated retrieval.
Authors:Lee Jung-Mok, Kim Sung-Bin, Joohyun Chang, Lee Hyun, Tae-Hyun Oh
Abstract:
Laughter is a complex social signal that conveys communicative intent beyond amusement. While prior work has focused on isolated laughter analysis tasks, a comprehensive understanding of laughter in real‑world scenarios remains underexplored. Therefore, we introduce SMILE‑Next, a dataset for real‑world laughter understanding with multimodal textual representations and question‑answer annotations across three tasks: laughter detection, laughter type classification, and laughter reasoning. Building upon SMILE‑Next, we aim to develop a laughter‑specialized large language model capable of nuanced understanding of laughter in real‑world contexts. To this end, we propose two key components: laughter‑specific Self‑Instruct and the Mixture‑of‑Laugh‑Experts (MoLE) framework. Laughter‑specific Self‑Instruct enhances generalization across tasks and domains by automatically synthesizing diverse laughter‑centric instructions. MoLE introduces a task‑adaptive expert routing mechanism that dynamically selects specialized experts tailored to each laughter‑related task, improving task‑specific performance and efficiency. Experimental results show that the combination of our proposed components substantially outperforms multimodal LLM baselines, advancing robust real‑world laughter understanding. Project page is at: https://mok0102.github.io/smile‑next/.
Authors:Ziqi Zhao, Xinyu Ma, Liu Yang, Yujie Feng, Daiting Shi, Jingzhou He, Xin Xin, Zhaochun Ren, Xiao-Ming Wu
Abstract:
On‑policy self‑distillation (OPSD) improves the reasoning performance of large language models (LLMs) by providing dense token‑level supervision for on‑policy rollouts. However, existing OPSD methods often yield limited gains on in‑domain reasoning and generalize poorly to out‑of‑domain problems. We identify two key causes: conditioning the self‑teacher on a verified solution encourages imitation of training‑domain reference trajectories rather than error‑specific correction, and applying distillation to the full response can overwrite valid reasoning prefixes and reinforce overfitting. We propose Reflective On‑policy Self‑Distillation (ROSD), a framework that turns reference‑solution imitation into targeted reasoning correction through reflection‑guided, error‑localized distillation. For each rollout, ROSD uses a self‑reflector to extract a corrective idea and locate the first erroneous span. The corrective idea guides the self‑teacher toward targeted supervision, while the localized error span restricts distillation to where correction is needed. This design corrects flawed reasoning while preserving valid prefixes. Experiments on multiple in‑domain and out‑of‑domain reasoning benchmarks show that ROSD yields stronger in‑domain reasoning performance overall and substantially better out‑of‑domain generalization than standard OPSD. Code is available at https://github.com/ZiqiZhao1/ROSD.
Authors:Jiaming Zhang, Yibo Zhao, Jing Yu, Jianxiang Yu, Xiang Li
Abstract:
GraphRAG extends retrieval‑augmented generation by organizing corpora as explicit knowledge graphs, enabling graph‑based retrieval for complex question answering. However, existing frameworks extract entities and relations within individual chunks, leaving cross‑chunk relations ‑‑ those whose evidence spans multiple passages ‑‑ systematically absent from the index. Exhaustive LLM‑based recovery of such relations is impractical due to the combinatorial explosion of chunk combinations. We present CrossAug, a GNN‑guided CROSS‑Chunk Graph AUGmentation method that enriches GraphRAG indices with cross‑chunk relational structure as an offline step before query‑time retrieval. CrossAug derives training supervision through self‑supervised graph corruption, uses a topology‑aware GNN to score subgraphs for missingness, and applies evidence‑grounded LLM completion only to selected high‑scoring regions. Experiments on three LLM‑based GraphRAG frameworks across four multi‑hop and long‑document QA benchmarks demonstrate that CrossAug consistently improves performance, confirming the benefit of cross‑chunk graph augmentation for retrieval‑based question answering. Our code is available at https://github.com/DonFinliani/CrossAug.
Authors:Simin Huo
Abstract:
The ability to process ultra‑long contexts is crucial for large language models (LLMs) to perform long‑horizon tasks. While recent efforts have extended context windows to 1M and beyond, model performance degrades when sequence length exceeds the pre‑trained range of positional encodings (e.g., RoPE), i.e., position exhaustion. This fundamental limitation must be overcome to achieve a truly infinite context. To address it, we propose Periodic RoPE (P‑RoPE), a positional encoding mechanism designed to circumvent this exhaustion. It operates in conjunction with sliding window attention (SWA) to capture local dependencies and relative positions within each window. This local layer is then complemented by a global attention layer with No Positional Encoding (NoPE), enabling unbounded interaction across the entire sequence without positional constraints. By stacking these two types of layers, the model avoids the need for positional extrapolation to generalize longer and theoretically supports an infinite context window. Empirical results show that our model, MiniWin, outperforms MiniMInd with standard GPT architectures in long‑context efficiency and stability. Our work provides a possible pathway toward LLMs with genuine infinite‑context understanding. The code is available at \hrefhttps://github.com/Cominder/miniwinhttps://github.com/Cominder/miniwin.
Authors:Zhitong Chen, Kai Yin, Weifeng Zhang, Zhiyuan Wang, Xiangjue Dong, Chengkai Liu, Zhewei Liu, Yiming Xiao, Ali Mostafavi, James Caverlee
Abstract:
Disasters cause severe societal impacts, demanding rapid coordination of heterogeneous AI tools, from satellite analysis to flood prediction and damage assessment, into coherent multi‑step workflows. As LLMs increasingly serve as orchestrators of such pipelines, effective coordination requires more than selecting semantically plausible tools: LLMs must generate executable workflows with correct parameter binding and dependency propagation. We introduce DisasterBench, a benchmark for evaluating structured multi‑agent planning over semantically similar but operationally distinct disaster‑response tools. To enable step‑level failure attribution, we further propose First‑Point‑of‑Failure (FPoF), which localizes the earliest root cause in a predicted workflow, separating primary errors from downstream cascading effects. Our evaluation reveals three findings: planning method effectiveness depends strongly on model capacity; tool mismatch and parameter‑binding errors dominate first failures, revealing semantic grounding and execution consistency as distinct bottlenecks; and verbose intermediate reasoning can create instruction clash with structured output requirements, disrupting plan generation. Together, these findings highlight a fundamental gap between semantic reasoning and execution‑grounded coordination, underscoring the need for planning frameworks that jointly model semantic intent, execution constraints, and workflow consistency. Code, data, and evaluation resources are available at: https://github.com/TamuChen18/DisasterBench_Open
Authors:Xinze Li, Yuhang Zang, Yixin Cao, Aixin Sun
Abstract:
Markdown skill libraries for LLM agents ship as free‑form prose, forcing the agent to re‑derive both the input schema and the concrete invocation syntax on every retrieval. We observe that this often produces a "confused ‑> re‑retrieve ‑> still confused" loop in which the agent issues a partially‑correct action, receives uninformative environment feedback, and re‑retrieves the same prose. We propose Skill‑as‑Pseudocode (SaP), an automatic conversion of markdown skill libraries into typed pseudocode with deterministic quality control. For each cluster of similar procedural passages drawn from one or more skills, SaP extracts a typed contract and filters it through a four‑check deterministic verifier (coverage, binding, replacement, risk). Promoted contracts are inlined into a rewritten skill skeleton together with restored concrete action templates, giving the agent two complementary signals: a typed signature for what the skill does and a concrete template for how to invoke it. On the 134‑game ALFWorld unseen split with gpt‑4o‑mini, pooled across three seeds, SaP wins 82/402 paired games versus 47/402 for the Graph‑of‑Skills (GoS) baseline (pooled McNemar p = 8.2e‑5), at ‑22.8 +/‑ 6.4% input tokens and ‑14.5 +/‑ 4.1% LLM calls per game.
Authors:Jie Zhu, Huaixia Dou, Shuo Jiang, Junhui Li, Lifan Guo, Feng Chen, Chi Zhang, Fang Kong
Abstract:
Existing emotional support conversation (ESC) systems mainly rely on end‑to‑end response generation or coarse strategy supervision, offering limited interpretability and little support for systematic skill improvement. We propose ESC‑Skills, a skill‑centric framework that discovers and self‑evolves executable emotional support skills. We first model localized support interactions as Intervention Units (IUs), which capture state‑‑action‑‑outcome dynamics between seeker states, support interventions, and post‑response emotional changes. Based on IUs extracted from both successful and failed ESC dialogues, we construct the ESC‑Skills Bank, a repository of executable emotional support skills containing intervention guidance, applicability conditions, expected outcomes, and potential risks. To further improve robustness, we introduce a multi‑profile self‑evolutionary refinement framework in which an ESC agent interacts with diverse simulated seeker profiles under SAGE evaluation. The resulting interaction traces are analyzed to identify missing skills, unsafe interventions, and profile‑specific failure patterns, which are then used to refine the Skills Bank through simulation‑based verification. Experimental results demonstrate that ESC‑Skills improves both response‑level quality and dialogue‑level emotional outcomes while providing more interpretable and controllable support behaviors. We will release the code, prompts, and ESC‑Skills Bank at https://github.com/aliyun/qwen‑dianjin.
Authors:Eric Onyame, Runtao Zhou, Kowshik Thopalli, Bhavya Kailkhura, Chirag Agarwal
Abstract:
Chain‑of‑thought (CoT) monitoring has been proposed as a promising safety mechanism for detecting misaligned behavior in large language models. However, its reliability remains largely unexplored beyond English and across diverse model families. We present the first large‑scale evaluation of CoT monitorability across 13 diverse languages and seven frontier model families, comprising 16 models. Using adversarial‑hint evaluations that require explicit intermediate computation, together with analysis of internal answer‑token probabilities, we consistently find CoT unfaithfulness across languages and hint types, with an average rate of 95.9% across 8B‑‑120B parameter models. We find that frontier models systematically engage in strategic manipulation, including answer‑switching, post‑hoc rationalization, and procedural exploitation of hints, making external monitors struggle to detect deception. We show that frontier models often commit to the misaligned cue in their latent activations within the first 15% of generation, even when the CoT appears faithful. Surprisingly, these deceptive patterns remain 100% in low‑resource languages, revealing fundamental limitations in current CoT‑based oversight. Our results reveal that CoT monitoring is fundamentally fragile under linguistic distribution shift, providing a substantially weaker safety signal than what English‑only studies suggest. These findings underscore an urgent need to develop robust CoT monitors and to accelerate research into white‑box monitoring techniques, especially to improve CoT monitorability in mid‑ and low‑resource languages. Our code is available \hrefhttps://multilingual‑cot‑monitoring.github.io/\textcolorbluehere.
Authors:Yibo Zhao, Zichen Ding, Jiayi Wu, Zun Wang, Xiang Li
Abstract:
Search agents powered by large language models can autonomously decompose queries, retrieve information, and synthesize answers through multi‑step reasoning. However, the rapid growth of training methods has outpaced controlled comparison: existing works differ in retrieval corpora, reward designs, and training protocols, making it unclear what actually drives improvements. We present a controlled empirical study that isolates three under‑explored dimensions of search agent training. First, we identify a critical data‑coverage issue in the widely used Wikipedia 2018 corpus and show that correcting it alone yields larger gains than the differences between training algorithms. Second, we systematically compare outcome‑based and process‑based reward methods across three base models, finding that the simplest outcome‑based approach achieves competitive or superior performance in most settings, and that process‑level credit assignment can over‑correct agent behavior. Third, we analyze training data diversity, off‑policy data utilization, and search budget scaling, distilling practical guidelines for training effective search agents. Our code is available at https://github.com/YiboZhao624/SearchAgentReview.
Authors:Parth Bhalerao, Jeromy Chang, David Chou, Oana Ignat
Abstract:
Evaluating AI tutor responses requires more than factual correctness: tutors must identify mistakes, locate errors, provide guidance, and offer actionable next steps. We present GRADE, a systematic study of open‑source models for pedagogical ability assessment in student‑tutor dialogues. Building on the BEA 2025 TutorMind setting, we evaluate 120 configurations across five language models, zero‑shot inference, LoRA fine‑tuning, synthetic augmentation, CoT+Reasoning, and single‑task versus multitask formulations. Gemma3‑12B performs best for single‑task evaluation, while Gemma3‑27B in 8‑bit precision is more reliable for multitask prediction. We find that augmentation helps models that struggle with the original data, verification adds limited gains despite higher cost, and CoT+Reasoning is more useful for synthetic data generation than direct classification. We further show that LoRA fine‑tuning on structured classification objectives interferes with instruction‑following behavior under thinking mode, redirecting generation away from the required evaluation format. Carbon analysis shows that model choice and reasoning mode substantially affect emissions. Overall, GRADE shows that carefully selected open‑source LoRA pipelines can match or surpass proprietary and ensemble‑based systems on key pedagogical dimensions, with code and data available at https://github.com/pvbgeek/GRADE.
Authors:Zixuan Yang, Yibo Zhao, Weicong Liu, Xiang Li
Abstract:
Matching submissions with suitable reviewers at scale is a growing challenge for major venues, yet existing approaches either rely on coarse proxy signals that conflate general relatedness with true suitability, or require expensive human annotations that are difficult to scale for training. We propose MERIT, a two‑stage framework that bridges this gap by converting criterion‑level expertise matching into scalable suitability supervision. In the first stage, we train a reviewer assessor via reinforcement learning to identify the expertise dimensions a paper requires, match them against the reviewer's prior work, and produce a suitability decision, with rewards provided by an LLM judge guided by paper‑specific expertise rubrics. In the second stage, we distill the assessor's predictions into an embedding‑based retriever for efficient large‑scale assignment. Experiments show that our 4B reviewer assessor outperforms larger general‑purpose LLMs on suitability classification, and the resulting retriever achieves state‑of‑the‑art performance across LR‑Bench and the CMU Gold dataset. Our code is available at https://github.com/Luli3220/MERIT.
Authors:Shubhashis Roy Dipta, Ankur Padia, Francis Ferraro
Abstract:
Claim verification splits between end‑to‑end classifiers that are accurate but yields no inspectable traces, and decomposition‑based methods produce inspectable traces but lag performance on benchmark datasets. We propose DecomposeRL an accurate claim‑verifier that produce inspectable traces. DecomposeRL frames decomposition as an RL policy trained with GRPO and a multi‑faceted reward ensemble, enabling both fully supervised and semi‑supervised learning from unlabeled claims. DecomposeRL addresses the prohibitive training cost of GRPO with a data‑curation funnel that distills 115K fact‑verification claims into a compact, learning‑signal‑dense subset of 5K claims. We show that a DecomposeRL‑7B policy trained with full supervision on only ~5K curated claims achieves 86.3 in‑domain and 69.8 out‑of‑domain balanced accuracy across 11 claim‑verification benchmarks containing biomedical, political, scientific, and general‑domain claims. Despite being 4x smaller, it matches 32B baselines and GPT‑4.1‑mini, and it further outperforms baselines in a semi‑supervised setting with only 10% labeled claims data. Code, data, and models are available at https://dipta007.github.io/DecomposeRL
Authors:Yanyan Luo, Xue Han, Chunxu Zhao, Ruiqiao Bai, Yaxing Zhang, Qian Hu, Lijun Mei, Junlan Feng
Abstract:
While LLMs enable personalized chatbots, their effectiveness in child‑centered personalization remains unclear, as systematic evaluation of child‑specific preferences is still lacking. To address this gap, we introduce ChildEval, a benchmark for evaluating LLMs' ability to infer and follow child‑centered preferences in long‑context conversations. ChildEval contains 29K synthesized persona profiles of children aged 3‑6, providing relatively static background information. Each persona is associated with a child preference‑which may align with, conflict with, or be independent of the persona‑expressed either explicitly in a single sentence or implicitly through 6‑10 turn dialogues. Explicit and implicit preferences are designed to reflect the same underlying preference but differ in expression, capturing dynamic aspects of preference expression rather than changes in the static persona. The benchmark spans five top‑level and fourteen sub‑level categories covering children's daily lives and development. We further propose fine‑grained, child‑centric evaluation protocols to systematically assess open‑source LLMs. Experimental results demonstrate how different personalized representations affect LLM responses and suggest that finetuning on ChildEval can enhance child‑centered performance. Our code and dataset are available at https://github.com/ziyanluo/ChildEval.
Authors:Tim R. Davidson, Anja Surina, Caglar Gulcehre
Abstract:
Language models are becoming the default interface to factual knowledge, yet they often verify outputs more reliably than they generate them. This generation‑verification gap (GV‑gap) underlies many recent advances in self‑improvement and reasoning, but its dynamics on factual knowledge specifically remain poorly understood. We focus on the training mechanisms underlying factual GV‑gaps, distinguishing them from their computational and aesthetic counterparts. We trace generation and verification capabilities through three training phases (acquisition, continual learning, and updating) across four open‑source model families at two scales each. Three findings recur across models: (i) verification is consistently learned before generation; (ii) verification is more robust to continual learning than generation; and (iii) factual updates can leave models in a "multi‑verse" state, simultaneously verifying both old and new answers as correct. Natural experiments on frontier models reproduce these dynamics at scale and reveal residual verification biases on well‑covered facts.
Authors:Syed Huma Shah
Abstract:
Modern retrieval‑augmented generation(RAG) deployments increasingly rely on caching to reduce token cost and time‑to‑first‑token(TTFT). Prefix‑level KV reuse is now standard in serving stacks such as vLLM, and chunk‑level and position‑independent reuse have been pushed further by recent systems(RAGCache, TurboRAG, CacheBlend, EPIC, ContextPilot, PCR, LMCache). Output‑level semantic answer caches, by contrast, remain fragile: similar prompts can map to different correct answers, retrieved evidence drifts as the corpus is updated, and adversarial collision attacks have been shown to hijack cached responses. We argue that the right framing for cached answer reuse is not how to reuse faster but when reuse is safe. We propose GroundedCache, an evidence‑validated cache router that admits a cached answer only when 4 cheap gates simultaneously hold: query similarity, retrieved‑evidence overlap, source‑version validity, and lexical (or judge‑based) support of the cached answer by the freshly retrieved evidence. We build a six‑regime workload that stress‑tests cache safety rather than only hit rate, and introduce an operator‑facing metric, the unsafe‑served rate (USR), fraction of all queries that received a wrong cached answer. Across 2 datasets and 12,000 real‑LLM generations(Qwen2.5‑7B‑Instruct on vLLM with Automatic Prefix Caching), GroundedCache drives USR to 0.0% on every HotpotQA regime(vs. 15‑35% under naive caching) and to 1.5% on mtRAG document drift(vs. 51.5%), a 34x reduction on the design‑point adversarial regime and 3‑10x reductions across the other mtRAG regimes, while end‑to‑end p50 latency stays within 1.04‑1.07x of a no‑cache RAG baseline. A per‑gate ablation isolates the lexical support gate as the load‑bearing safety mechanism on both datasets, with the remaining gates providing defense‑in‑depth at near‑zero cost. We release the implementation, workload, and evaluation harness.
Authors:Xiangyu Ma, Teng Xiao, Zuchao Li, Lefei Zhang
Abstract:
Diffusion models promise efficient parallel text generation but rely on bidirectional attention, creating a structural mismatch with pre‑trained Autoregressive (AR) models. This incompatibility precludes reusing robust AR priors, necessitating prohibitive pre‑training from scratch. To bridge this gap, we propose FLUID, a framework that efficiently adapts AR backbones to the diffusion paradigm. By enforcing Strictly Causal Alignment, FLUID enables seamless initialization from standard GPT‑style checkpoints, circumventing the need for massive pre‑training. Furthermore, we introduce Elastic Horizons, an entropy‑driven mechanism that dynamically modulates denoising strides based on local information density rather than fixed schedules. Experiments demonstrate that FLUID achieves state‑of‑the‑art performance while reducing training costs by orders of magnitude, effectively reconciling established AR foundations with efficient parallel generation. Our code is available at https://github.com/Oli‑lab‑nun/FLUID/tree/main.
Authors:Jing Hao, Siyuan Dai, Yongxin Zhang, Yuci Liang, Jiamin Wu, Jiahao Bao, Yuxuan Fan, Zanting Ye, Yanpeng Sun, Xinyu Zhang, Ming Hu, Liang Zhan, James Kit Hon Tsoi, Linlin Shen, Junjun He, Kuo Feng Hung
Abstract:
Dental image analysis plays a pivotal role in supporting accurate diagnosis and treatment planning in oral healthcare. Although recent advances have produced dental AI models for specific tasks and individual imaging modalities, their isolated designs limit practical use in real‑world clinical workflows. In this paper, we present OralAgent, the first dental‑specialized AI agent that unifies multimodal reasoning, tool‑based decision‑making, and knowledge‑grounded retrieval within an end‑to‑end automated framework. It integrates 22 visual analysis tools and 368 widely‑used classical dental textbooks, enabling autonomous reasoning, planning, tool use, knowledge retrieval, and multi‑step workflow execution. Furthermore, we introduce OralCorpus, a large‑scale, high‑quality bilingual textual resource containing 134.8M tokens curated for dental retrieval‑augmented generation (RAG). To evaluate models' multidisciplinary dental knowledge, we construct OralQA‑ZH, a Chinese multiple‑choice question benchmark consisting of 798 items across eleven oral subspecialties. Extensive experiments demonstrate that OralAgent achieves state‑of‑the‑art performance on the MMOral‑Uni, MMOral‑OPG, and OralQA‑ZH benchmarks, highlighting its effectiveness, interpretability, and adaptability in real‑world clinical settings. The code and models are publicly available at https://github.com/isjinghao/OralAgent.
Authors:Siran Li, Ece Sena Etoglu, Carsten Eickhoff, Seyed Ali Bahrainian
Abstract:
Reliable evaluation is essential for understanding large language model (LLM) performance, yet today's go‑to metrics, namely token‑overlap scores (e.g., ROUGE) and embedding‑based measures (e.g., BERTScore), often misjudge semantic similarity of documents. Our study shows that both token‑overlap metrics and embedding‑based metrics routinely assign nearly identical scores to texts that directly contradict each other, thereby potentially masking fundamental errors. We introduce MATCHA, an automatic metric that jointly rewards semantic agreement with a reference and penalizes contradictions. MATCHA employs a dual‑view perspective that measures (i) proximity to the gold text and (ii) distance from an adversarially generated counterfactual contradiction. In eight public benchmarks, MATCHA outperforms popular metrics, compared with human annotations on question‑answering, image caption generation, natural language inference, summarization, and semantic textual similarity tasks. On the TruthfulQA dataset (i.e., a dataset without a training set, where no embedding‑based metrics could locally train on), this improvement in terms of matching texts with a reference reaches 18.38% over ROUGE‑L and 20.82% over BERTScore. Both quantitative comparison and qualitative human assessments confirm the efficacy and validity of MATCHA and uncover fundamental weaknesses in pre‑existing metrics. Compared with 23 embedding models, including top state‑of‑the‑art ones, used as a metric similar to BERTScore, MATCHA remains the most accurate in distinguishing correct from incorrect statements solely based on a reference. Our code and metric are publicly available (https://github.com/Siran‑Li/MATCHA).
Authors:Pujun Zheng, Wanying Ren, Jiacheng Yao, Guoxiu He, Star X. Zhao
Abstract:
Scientific paper evaluation often involves not only assessing a manuscript itself, but also relating it to contemporaneous research and prior literature. However, existing LLM‑based methods typically model these signals separately and lack a unified mechanism for propagating review evidence across papers. We propose GraphReview, a graph‑based LLM framework that formulates paper evaluation as review‑signal message passing over a semantic paper graph. The graph jointly captures intrinsic quality, synchronic links among contemporaneous papers, and diachronic links to prior work. LLMs are used to estimate node‑level quality priors and generate edge‑level comparative evidence through pairwise paper comparisons, while Personalized PageRank integrates review signals for quality ranking, decision prediction, and review generation. To produce higher‑quality graph evidence, we propose reward‑induced maximum likelihood objectives for training the LLM backbones. Experiments show that GraphReview consistently outperforms the strongest baseline, achieving average improvements of 29.7% on decision and ranking metrics, including gains of 23.7% in Accuracy and 57.6% in Spearman's ρ. It also produces higher‑quality review texts and generalizes effectively across time periods and conference venues. The code is available at https://github.com/ECNU‑Text‑Computing/GraphReview.
Authors:Ye Yuan, Rui Song, Weien Li, Zeyu Li, Haochen Liu, Xiangyu Kong, Changjiang Han, Yonghan Yang, Zichen Zhao, Zixuan Dong, Fuyuan Lyu, Bowei He, Haolun Wu, Jikun Kang, Xue Liu
Abstract:
Social deduction games have become a popular testbed for probing reasoning, deception, coordination, and belief modeling in Large Language Model (LLM) agents. However, most environments are scored only by game outcomes such as win rates and largely remain to text‑only interaction, making it difficult to tell whether an agent's language is actually grounded in what it perceived and did, or to identify the failure modes underlying its behavior. To address this gap, we introduce QUACK, an open‑source environment and evaluation framework for auditing the grounding of agent language in multimodal social reasoning. QUACK evaluates agents at three levels: game outcomes, behavioral trajectories, and utterance‑level consistency. Its core Statement Verification Pipeline reconstructs each agent's ground‑truth trajectory from engine logs and checks every discussion claim against it, automatically flagging spatial hallucination, unsupported accusation, deception collapse, and language‑action inconsistency. Evaluating three frontier VLMs in both homogeneous and cross‑model adversarial settings, we find that even the strongest agent hallucinates 15.1% of its verifiable spatial claims and makes over half of its accusations without grounded evidence. We release the full engine, evaluation framework, toolkit, and logs at https://github.com/AAAAA‑Academia‑Attractions/QUACK.
Authors:Dingwei Chen, Zefang Zong, Zhipeng Ma, Leo Luo, Yang Li, Chengming Li, Peng Chen, Jie Jiang
Abstract:
Agentic reinforcement learning (RL) has proven effective for training LLM‑based agents with external tool‑use capabilities. However, we identify that agentic RL training induces increasing redundant tool calls and blurs the model's intrinsic knowledge boundary, where the model fails to distinguish when tools are needed versus when parametric knowledge suffices. Existing solutions based on reward shaping create coarse‑grained optimization targets that tend to incentivize indiscriminate tool‑call suppression, leading to reward hacking. In this paper, we propose AKBE (Agentic Knowledge Boundary Enhancement), an on‑policy method that dynamically probes the model's intrinsic knowledge boundary through dual‑path (with‑tool and no‑tool) rollouts during training. We define the knowledge boundary as the per‑instance determination of whether tools are required and the minimum tool calls necessary. By comparing correctness across paths, AKBE categorizes trajectories and constructs targeted supervisory signals that guide efficient tool‑use patterns for each question. These signals are integrated seamlessly into the agentic RL training loop. Experiments on seven QA benchmarks demonstrate that AKBE improves task accuracy by +1.85 on average and reduces tool calls by 18% over standard agentic RL, yielding 25% higher tool productivity without any accuracy‑efficiency trade‑off. Further analysis suggests its plug‑and‑play compatibility across different RL algorithms and the mechanism of each signal category. Our code is available at https://github.com/CuSO4‑Chen/AKBE.
Authors:Manh Nguyen, Sunil Gupta, Hung Le
Abstract:
Sampling multiple responses improves language model reasoning, but uniform compute allocation is inefficient: easy questions are over‑sampled while hard questions remain under‑explored. We propose Uncertainty‑Aware Budget Allocation (UAB), a concave integer optimization framework that reallocates a fixed sampling budget based on per‑question uncertainty estimated at no additional inference cost. In Phase 1, every question receives one generation; its average negative log‑likelihood (ANLL), extracted directly from output log‑probabilities, serves as a difficulty signal while the generation contributes to the final vote. In Phase 2, the remaining budget is allocated by a marginal‑greedy algorithm that solves a concave coverage‑maximization surrogate exactly: uncertain questions receive more sampling budget while confident questions receive fewer additional samples. Evaluated on six open‑weight and black‑box models spanning 1.5B to 27B parameters and five reasoning benchmarks covering math, logic, and preference tasks, UAB outperforms baselines by up to +3% in average accuracy and up to +5% on individual benchmarks, with the largest gains in low‑resource settings, requiring no auxiliary model or additional LLM call. Code is publicly available at https://github.com/manhitv/UAB.
Authors:Ngoc Phan Phuoc Loc, Toan Huynh La Viet, Thanh Tran Khanh, Duy A Nguyen, Tuan Anh Nguyen Pham, Thanh Nguyen, Nitesh V. Chawla, Wray Buntine, Kok-Seng Wong, Khoa D. Doan, Binh T. Nguyen
Abstract:
The rapid growth in submissions to machine learning venues has strained the scientific peer‑review system and intensified interest in LLM‑based automated peer reviewers. However, how good these systems are actually, especially compared to human reviewers at catching scientific gaps, remains poorly understood. In this work, we introduce PRISM (Peer Review Intelligence via Structured Multi‑dimensional assessment), a benchmarking framework that evaluates review quality across four dimensions: Depth of Analysis, Novelty Assessment,Flaw Identification & Major Issues Prioritization, and Multi‑dimensional Constructiveness. Unlike most existing evaluations based on surface‑level metrics like ROUGE and BLEU, or unconstrained LLM‑as‑a‑judge prompting that conflates fluency with rigor, PRISM grounds each dimension in argument mining, retrieval‑augmented verification, and consensus‑based scoring. We apply PRISM to benchmark five leading automated reviewer systems and human reviewers on a stratified corpus of reviews from ICLR, ICML, and NeurIPS. The results reveal that LLMs can match or beat human reviewers on individual dimensions: comparable depth of analysis, stronger novelty verification, and highly accurate critique prioritization. However, no single system consistently matches the balanced performance of the human baseline across all dimensions at once. Each exhibits a distinct specialization profile with characteristic blind spots ‑‑ failure modes that aggregate metrics miss entirely. The implication is that LLM reviewers are best understood as targeted supplements to human review, effective within specific dimensions, but unreliable as standalone replacements. Our demo and key results can be found at https://khanhthanhdev.github.io/prism‑page/.
Authors:Zheng Wang, Kaixuan Zhang, Wanfang Chen, Jingwen Zhang, Xiaonan Lu
Abstract:
Sequential editing of structured knowledge in large language models allows targeted factual updates without retraining, yet existing methods often rely on complex regularization or constraint mechanisms whose necessity remains unclear. In this work, we systematically investigate the mechanisms underlying effective and stable sequential editing. Specifically, we first analyze the empirical success of AlphaEdit and establish, via a rigorous optimization analysis, the formal equivalence between one‑time and sequential editing. Building on this insight, we generalize the equivalence to a broader class of editing objectives, demonstrating that stability emerges naturally from properly accounting for accumulated editing constraints, rather than from specialized regularization or null‑space operations. We empirically confirm that many commonly used regularization strategies are unnecessary for reliable sequential updates. Furthermore, we extend our framework to handle conflicting edits, ensuring robust and consistent behavior under contradictory updates. Ultimately, our work provides Ariadne's thread through the labyrinth of sequential editing, charting a path toward simpler, more interpretable, and dependable knowledge updates. Our code is available at https://github.com/Wangzzzzzzzz/OTE‑SE‑Alignment.
Authors:Xudong Lu, Xueying Li, Annan Wang, Yang Bo, Jinpeng Chen, Zengliang Li, Nianzu Yang, Rui Liu, Xue Yang, Jingwen Hou, Hongsheng Li
Abstract:
We introduce OmniInteract, a streaming benchmark for real‑time omnimodal large language models evaluated through native online inference over audio‑visual streams. Unlike offline video understanding or text‑prompted streaming QA, OmniInteract preserves the original audio‑visual stream and requires models to process it online, without access to future content. User queries and ambient sounds are embedded in the audio track, requiring models to detect multimodal triggers, decide when to respond, and answer while the stream unfolds. OmniInteract contains 250 videos with 1,430 temporally grounded response slots: 1,062 1Q1A slots across real‑time, proactive, and nested scenarios, and 368 1QnA slots for continuous task monitoring and step guidance. Each slot includes a trigger, response window, and target answer. We evaluate response correctness, timing, invalid outputs, interruption handling, and context continuity using Interaction‑Aware Quality‑Timeliness F1, Interruption Diagnostic Suite, and Nested Chain Completion Score. Experiments show that current models remain weak in streaming interaction, with the best overall IA‑QTF1 reaching only 0.368 and the best 1QnA IA‑QTF1 only 0.052. Further study on mathematical reasoning in full‑duplex settings shows that offline capability does not necessarily transfer to online interaction. Code and datasets will be made publicly accessible at https://github.com/Lucky‑Lance/OmniInteract.
Authors:Anmol Agarwal, Natalie Neamtu, Pranjal Aggarwal, Seungone Kim, Jannis Limperg, Cedric Flamant, Kanna Shimizu, Bryan Parno, Sean Welleck
Abstract:
AI coding agents are increasingly used to write real‑world software, but ensuring that their outputs are correct remains a fundamental challenge. Formal verification offers a promising path: an agent generates code together with a machine‑checked proof, guaranteeing that the code satisfies a formal specification. However, there is no guarantee that the formal spec itself matches the user's intent. In this work, we study specification autoformalization: whether LLM agents can translate informal programming problems into faithful formal specifications. We introduce Verus‑SpecBench, a benchmark of 581 spec‑writing tasks derived from Codeforces problems targeting Verus, a verifier for Rust, and Verus‑SpecGym, an agentic environment in which models interact with Verus, bash, & the filesystem to develop these specs. The central challenge is evaluation: expert‑written reference specs are expensive to write, & LLM judges can miss subtle mistakes. We address this by (a) extending Verus's exec_spec mechanism so that generated specs can be executed as Rust code, & (b) testing them against official Codeforces tests & adversarial cases extracted from Codeforces "hacks", which are edge cases written by competitors to break incorrect solutions. On Verus‑SpecBench, the strongest model, Gemini 3.1 Pro, solves 77.8% of tasks, other frontier models solve 51.1‑‑57.8% & OSS models reach only 21.5‑‑25.5%. Our analysis of failure modes shows that model‑generated specs can omit important input assumptions, accept incorrect outputs, & reject valid ones. We also find that LLM‑as‑a‑judge evaluation misses 26% of the failures our evaluator catches. Overall, our results suggest that spec autoformalization is within reach for frontier agents but remains brittle even on problems where they can already generate correct code. The code, data, & logs can be found at https://github.com/formal‑verif‑is‑cool/verus‑spec‑gym
Authors:Jim Salsman
Abstract:
Generating high‑quality, pedagogically useful questions from lecture slide decks is difficult because important instructional content is distributed across both text and visual elements, and because useful questions must be scaffolded across the flow of a presentation rather than generated slide by slide in isolation. This paper describes Slide Deck Q\&A Quality Assurance (slidesqaqa), a Flask‑based software system that extracts text and rendered images from PDF slides and processes them through a four‑stage large language model pipeline comprising window planning, deck synthesis, slide annotation, and reconciliation. The system reasons jointly about slide modality and pedagogical role, allocates bounded question budgets, and revises draft annotations at the deck level to reduce redundancy and improve coverage. The final output is a structured JSON annotation containing deck‑level goals, section structure, slide‑level summaries, question sets, and evaluation scores. Initial experiments on two technical lecture decks indicate that the pipeline can filter non‑instructional slides and produce high‑fidelity, pedagogically coherent questions for visually complex content. The working system is at https://slidesqaqa‑974767694043.us‑west1.run.app The software repository is at https://github.com/blinding2submit/slidesqaqa
Authors:Athanasios Zeris
Abstract:
Standard transformer attention computes pairwise token similarity but treats all tokens as equally salient and all positions as equally local, regardless of the informational structure of the input. We identify two complementary inductive biases that standard attention lacks: energy salience (which tokens concentrate informational energy, learned end‑to‑end without explicit frequency decomposition) and scale‑selective locality (how far positional influence extends at each frequency, implemented via Morlet wavelet encoding). We address both with two simple components. Energy‑Gated Attention (EGA) gates value aggregation by a learned energy estimate of key token embeddings, computed via a single linear projection; it selects what to attend to. Morlet Positional Encoding (MoPE) replaces fixed sinusoidal encodings with learned Gaussian‑windowed wavelets that adapt the joint position‑frequency localization to the corpus; it specifies where attention operates at each scale. On TinyShakespeare, EGA alone achieves +0.092 validation loss improvement over standard attention (+0.103 over Phase 1‑3 baseline); MoPE alone is ‑0.032 (below baseline as a standalone encoding); but their combination achieves +0.119 ‑‑ more than the sum of parts. This superadditivity, observed across two independent training runs, is the central empirical finding: salience and locality are complementary inductive biases, each addressing a gap the other cannot fill alone. Ablations confirm that structured spectral priors (Morlet wavelet gates, scale‑initialized heads, fixed sinusoidal PE) consistently underperform their unconstrained learned counterparts, while complementary learned components interact superadditively. All experiments are at small scale (<=6M parameters, character‑level benchmarks, single seed); larger‑scale multi‑seed validation is the most important direction for future work.
Authors:Taha Koleilat, Hassan Rivaz, Yiming Xiao
Abstract:
Parameter‑efficient adaptation of vision‑language foundation models is crucial for precise multimodal understanding of biomedical images, yet existing methods remain deterministic and often struggle under domain shift or ambiguous image‑text alignment. This limitation is particularly critical in the clinic, where models should remain robust in low‑data regimes and domain shifts. We present Evi‑Steer, an evidential cross‑modal low‑dimensional steering framework for BiomedCLIP that enables uncertainty‑aware parameter‑efficient fine‑tuning while updating only 0.11% of total model parameters. Our approach performs lightweight low‑dimensional token updates in both vision and text encoders while simultaneously estimating epistemic uncertainty. These uncertainty estimates update gate residuals, allowing the model to adapt conservatively when evidence is weak. Furthermore, we introduce cross‑modal confidence fusion based on Dempster‑Shafer theory, enabling visual adaptation to be conditioned on textual confidence and suppressing conflicting or uncertain cross‑modal updates. We conduct a comprehensive evaluation on 15 biomedical imaging datasets spanning 8 organs and 8 imaging modalities under few‑shot learning and domain generalization settings. Evi‑Steer consistently outperforms state‑of‑the‑art methods under few‑shot learning and domain shift settings, demonstrating a practical and robust pathway for deploying vision‑language models in real‑world clinical settings. Code is available at https://github.com/HealthX‑Lab/Evi‑Steer.
Authors:Zihang Zhou, Ziqian Ren, Yukai Wu, Yingjie Xiong, Wei Zhou, Chao Peng, Dong Zhang, Bingheng Yan, Xuanhe Zhou, Fan Wu
Abstract:
Functionality‑correct repository setup aims to configure execution environments (e.g., dependencies, build scripts) to successfully execute a repository's documented features. It presents significant challenges due to diverse, repository‑specific failures, including dependency incompatibilities, missing toolchains, incomplete installations, and verification‑strategy mismatches. Existing LLM agents struggle to robustly resolve these issues, specifically failing to support (1) cross‑repository experience transfer, (2) multi‑step trial‑and‑repair under non‑invertible state changes, and (3) robust verification of setup outcomes to distinguish setup‑induced failures from repository bugs. To address this, we introduce SetupX, an experiential learning‑based setup framework. First, we construct a Self‑Evolving Experience Representation (XPU), a dual‑modality knowledge unit encoding setup signals, textual guidance, executable actions to dynamically transfer verified environment fixes to unseen repositories. Second, we employ Experience‑Augmented Speculative Execution backed by a LIFO Docker snapshot stack, enabling the agent to proactively trial fixes and safely roll back to known‑good states. Third, we introduce a Prosecutor‑Judge Verification Protocol that separates evidence collection from final judgment, enabling more reliable setup verification beyond superficial build‑time metrics. Evaluation results on carefully‑crafted benchmarks show SetupX achieves highest performance (e.g., 92% pass rate) and outperforms the strongest baseline by over 19%. Crucially, SetupX excels in complex multi‑repository setup requiring coordinating multiple interconnected services across different containers. The code repository is available at https://github.com/OpenDataBox/SetupX.
Authors:Furkan Sakizli
Abstract:
Agentic RAG systems that equip language models with dozens to hundreds of tool definitions face a critical resource conflict: tool schemas consume the same context window needed for retrieval‑augmented generation. We present the first systematic study of this tool‑context trade‑off, evaluating 14 models spanning 1.5B‑32B local models plus one frontier API model across 6,566 controlled API calls at three context budgets (8K, 16K, 32K) with 28 tool definitions. Applying TSCG conservative‑profile compression (44‑50% schema token savings), we observe a binary enablement effect: at 8K tokens, JSON‑schema tool definitions overflow the context window entirely, yielding near‑zero EM (2.6% average), while compressed schemas restore RAG functionality with +20.5 pp average exact‑match lift across all eight models (+24.7 pp among the six exhibiting full enablement). At 32K ‑‑ where both formats fit ‑‑ four of five tested models show delta <= 1 pp, confirming the effect is purely budget‑driven. External validation on HotpotQA (50 multi‑hop questions) shows +48 pp EM under the same overflow scenario. Frontier scaling tests demonstrate that JSON schemas overflow at ~494 tools while compressed schemas remain operational beyond 800 tools. Our results establish tool‑schema compression as a necessary infrastructure layer for agentic RAG in constrained‑context deployments. All code, data, and checkpoints are publicly available.
Authors:Parth Darshan, Abhishek Divekar
Abstract:
Customizing an LLM judge to a specific problem or domain often involves optimizing its prompt across multiple evaluation criteria simultaneously. Textual gradient methods automate this for a single judge criterion, however they produce natural‑language critiques, not numerical vectors. Thus, the conflict‑resolution toolkit of multi‑task learning (PCGrad, MGDA) does not apply to this multi‑objective textual gradient setting. We extend TextGrad to the multi‑objective setting and test four decomposition modes of textual gradient optimizers by varying how much cross‑objective information the loss, gradient and optimizer LLMs share. We find the gradient's task‑focus drops by 59% (9.0 to 3.7 out of 10) when the gradient LLM must provide feedback on multiple criteria jointly. Separately, we observe that naively combining single‑objective optimized instructions into a single prompt degrades Spearman rho from 0.305 to 0.220 (‑0.085). These results identify two separable failure modes: optimization‑time gradient dilution and inference‑time instruction interference, which together constrain the design space for multi‑objective judge optimization using textual feedback.
Authors:Haoyi Hu, Qirong Lyu, Xianghan Kong, Weiwen Liu, Jianghao Lin, Zixuan Guo, Yan Xu, Yasheng Wang, Weinan Zhang, Yong Yu
Abstract:
While AI agents demonstrate remarkable capabilities in reasoning and tool use, they remain fundamentally reactive: they compute responses only after explicit user prompts. This paradigm ignores a critical opportunity: the idle time between interactions is largely wasted, leaving agents unable to prepare for future user needs. To bridge this gap, we introduce ProAct, a proactive agent architecture that leverages idle‑time compute to anticipate and fulfill likely upcoming user needs. By analyzing evolving dialogue history together with persistent memory, ProAct predicts upcoming needs and iteratively acquires information, allowing the agent to resolve knowledge gaps and prepare evidence before the user initiates a query. To rigorously evaluate proactive capabilities, we also introduce ProActEval, a comprehensive benchmark comprising 200 scenarios across 40 domains, featuring predictable need chains and diverse user cognitive profiles. Empirical results demonstrate significant advantages over reactive baselines. ProAct accelerates task completion by reducing required turns by 14.8%, decreases user effort by 11.7%, and cuts hallucination rates by 28.1% on ProActEval. Furthermore, MemBench evaluations confirm that ProAct achieves state‑of‑the‑art reflective accuracy, underscoring its sustained and robust performance.
Authors:Sam Bowyer, Acyr Locatelli, Kris Cao
Abstract:
Efficient benchmarking techniques aim to lower the computational cost of evaluating LLMs by predicting full benchmark scores using only a subset of a benchmark's questions. By reframing this problem as an instance of multiple regression with feature selection, we find that existing efficient benchmarking methods can be greatly improved by simply using kernel ridge regression at the prediction stage. Additionally, using an information‑theoretic feature‑selection algorithm called minimum redundancy maximum relevance (mRMR), we can further improve upon these methods by selecting question subsets that will be maximally useful for prediction. Except in very data‑poor settings, these approaches consistently achieve smaller prediction errors (in both MAE and RMSE), and greater ranking correlation between predicted and true scores (in both Spearman ρ and Kendall τ) across a range of benchmarks using both binary and continuous metrics. Furthermore, mRMR subsampling is much faster than competitor methods (which often involve fitting probabilistic models or running clustering algorithms), and is more likely to select the same questions under different random seeds or training data splits. Tutorial code can be found at https://github.com/sambowyer/mrmr_eval .
Authors:Yu Wang, Minghao Liu, Jiayun Wang, Jinrui Huang, Ankit Shah, Wei Wei
Abstract:
Inference time optimization techniques, such as repeated sampling, have significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, the critical role of model uncertainty remains largely underexplored in these optimization strategies. In this paper, we investigate the dynamics of confidence along reasoning trajectories and for first time reveal a surprising and unique pattern: correct answer traces tend to exhibit confidence improvement over time (positive confidence gain), while incorrect traces show attenuated or declining confidence as reasoning proceeds. Based on this observation, we propose Confidence Dynamic Gain (CDG) based voting, which incorporates how the confidence trajectory of the response evolves along the reasoning chain. Experiments across four open‑source architectures (DeepSeek‑R1, gpt‑oss, Gemma‑3, Qwen‑QwQ) on the AIME24/25, HMMT25, and BRUMO25 benchmarks demonstrate that CDG yields a significant performance boost over baselines. These results demonstrate that our method provides a robust discriminative signal for improving answer selection in LLM reasoning. We also provide theoretical insights for this phenomenon. Code will be released at https://github.com/Accenture/CDG.git.
Authors:Zhuangzhuang Pan, Yan Xia, Chee Seng Chan
Abstract:
Emotion‑Cause Pair Extraction (ECPE) was introduced to explain why an emotion occurs, but this goal is now often reduced to binary pair/non‑pair prediction. This proxy is useful for direct‑cause extraction, yet easy to over‑read as evidence grounded emotion explanation. We show that this interpretation is only partially valid. In IEMO‑MECP, 90.9% of original positives remain emo‑cause and 95.0% of original negatives remain non‑pair, confirming that the binary ECPE task is largely preserved. The problem is that direct triggers alone do not constitute a grounded explanation. Emo‑context, an utterance that helps interpret a target emotion without directly causing it, appears on both sides of the original boundary and is enriched near binary uncertainty, showing that the binary boundary has no stable place for such discourse evidence. Across evaluated ECPE models, direct triggers are recovered more reliably than contextual support. Under shortcut pressure, this imbalance becomes consequential. Binary‑trained models assign higher pair scores to nearby lexically similar non‑pair candidates than to evidence supported but structurally harder emo‑cause and emo‑context pairs. Thus, pair scores can reward convenient attributions over grounded explanations. High binary ECPE performance indicates that a model can identify direct triggers; it does not indicate that the model has explained the emotion. Code is publicly available at https://github.com/panzhzh/ECPExsame.
Authors:Jake Stephen, Niraj K. Jha
Abstract:
Knowledge graph (KG) is an abstraction that can be extracted from text corpora and used for in‑depth reasoning. Prior work has leveraged KGs to fine‑tune language models (LMs), enabling domain‑specific superintelligence. In this work, we explore whether KG‑driven in‑depth reasoning capabilities can emerge in neuroscience using only information contained within a single authoritative textbook. The central hypothesis is that structured knowledge, when distilled into a high‑quality KG and converted into KG‑grounded question‑answer (QA) supervision, is sufficient to produce expert‑level reasoning through a fine‑tuned LM that surpasses large language models (LLMs) in accuracy, while employing orders of magnitude fewer parameters. We construct a textbook‑derived KG via a dual‑LLM validation pipeline, expand it with a masked LM trained on the KG topology, generate multi‑hop QA items, which include QA pairs and reasoning traces, to fine‑tune an LM exclusively on KG‑derived supervision, and apply reinforcement learning using path‑derived KG signals as implicit reward models. Our results demonstrate that deep, mechanistic neuroscience understanding can be induced in the model without reliance on large, heterogeneous web‑scale corpora. The KG‑based synthetic neuroscience curriculum that readers can quiz themselves on, and the fine‑tuned LM, are available at the following GitHub location: https://kg‑bottom‑up‑superintelligence.github.io/neuro‑bench.
Authors:Liang Xue, Haoyu Liu, Cheng Wang, Pengyu Chen, Haozhuo Zheng, Yang Liu
Abstract:
Large language models for vertical domains are bottlenecked by the scarcity of complex, domain‑specific task‑oriented dialogues. Existing data acquisition pipelines face a persistent trilemma: expert annotation is expensive, real‑world service conversations are constrained by privacy and commercial restrictions, and static corpora quickly become temporally stale. We propose Stream, a data‑centric framework that leverages publicly available streaming media (live streams and short videos) to synthesize high‑value service dialogues at scale. Stream mines authentic interaction signals from noisy streams and synthesizes conversations by integrating role‑grounded persona construction with Conversational Blueprint construction; it further adopts retrieval‑augmented generation (RAG) to support knowledge‑aware responses. Based on Stream, we release StreamDial, a large‑scale multi‑domain dataset covering Automotive, Restaurant, and Hotel. StreamDial contains 87,498 dialogue sessions and 1,497,320 turns in total, with an average of 17.11 turns per session and a comparable scale across domains. Each session is organized as a structured quadruplet \langle P_u, P_a, B, H \rangle that pairs dialogue history with explicit user/agent personas and a Conversational Blueprint, capturing realistic service behaviors such as requirement mining, constraint conflicts, negotiation, and recovery. Evaluations with automatic judges and downstream tasks show that StreamDial improves intrinsic dialogue quality over strong baselines, and models trained with StreamDial improve Dialogue State Tracking across backbones; we further report a completed human‑evaluation set and encouraging multilingual transfer on Qwen3‑8B under a controlled training budget. The data is released in https://github.com/hitxueliang/DialogDataSetBySTREAM.
Authors:Festus Kahunla
Abstract:
Applied Behavior Analysis (ABA) is a clinical discipline whose documentation, teaching programs and multi‑session behavioral logs, is formulaic and high‑volume, yet real session data is HIPAA‑protected and bound by professional confidentiality rules, blocking the release of a training corpus. We present TRACE (Taxonomy‑Referenced ABA Clinical Examples), a 2,999‑example synthetic instruction‑tuning dataset covering two ABA tasks: teaching‑program generation across Discrete Trial Training, Natural Environment Teaching, and Task Analysis; and multi‑session behavioral interpretation across twelve trajectory patterns and thirteen target behaviors. Every example is produced by a deterministic taxonomy‑driven generator grounded in the canonical ABA literature, and every example carries complete sampling provenance, the exact taxonomy cells that produced it. The dataset is released under CC BY‑NC 4.0 for data and MIT for code, with stratified train (2,549), validation (149), test (281), and sanity (20) splits. TRACE is a research artifact and has not been clinically validated.
Authors:Yangneng Chen, Jing Li
Abstract:
Large Vision‑Language Models (LVLMs) extend large language models with visual understanding, but remain vulnerable to hallucination, where outputs are fluent yet inconsistent with images. Recent studies link this issue to language bias‑the tendency of LVLMs to over‑rely on text while neglecting visual inputs. Yet most analyses remain empirical without uncovering its underlying cause. In this paper, we provide a systematic study of language bias and identify its root in modality misalignment during training. Our analysis shows that both Visual Instruction Tuning (VIT) and Direct Preference Optimization (DPO) often prioritize textual improvements, which may cause LVLMs to overly lean toward language modeling rather than balanced multimodal understanding. To address this, we propose two simple yet effective methods: Language Bias Regularization (LBR) which mitigates language bias through regularization during instruction tuning, and Language Bias Penalty (LBP), which penalizes language bias in the DPO training process. Extensive experiments across diverse models and benchmarks demonstrate the effectiveness of our approach. LBR consistently improves performance on over ten general benchmarks, while LBP significantly reduces hallucination and improves trustworthiness. Together, these methods not only mitigate language bias but also advance the overall alignment of LVLMs, all without introducing any additional data or auxiliary models. Our code is publicly available at https://github.com/lab‑klc/LVLM‑Language‑Bias.
Authors:Bangrui Xu, Ziyang Miao, Xuanhe Zhou, Yiming Lin, Zirui Tang, Xiaomeng Zhao, Fan Wu, Cheng Tan, Fan Wu, Bin Wang, Conghui He
Abstract:
VLM‑based OCR models have become the de facto choice for document parsing, as they can accurately extract page‑level elements (e.g., paragraphs within individual pages) together with their bounding boxes and textual content. However, downstream applications such as RAG require coherent document‑level information, whereas these models often break cross‑page continuity and fail to recover disrupted structures, such as paragraphs and tables truncated by page boundaries. Such relationships are not confined to a single page; instead, they require joint analysis of titles, paragraphs, tables, and images spanning multiple pages. A natural solution is therefore to reuse existing OCR outputs and reconstruct document‑level logical structures through post‑processing. To this end, we propose MinerU‑Popo, a lightweight and universal framework for POst‑Processing OCR outputs, which converts page‑level results from diverse parsers into coherent document‑level structures. MinerU‑Popo decomposes the problem into four focused subtasks: text truncation recovery, table truncation recovery, title hierarchy reconstruction, and image‑text association. To address these effectively, we build a task‑oriented data engine with task‑specific input filtering, and use the generated data (30K) to fine‑tune a lightweight post‑processing model (Qwen3‑VL‑4B). To support long documents, we introduce dynamic chunking with overlap‑based synchronization, which aligns chunk‑level outputs from the fine‑tuned model and preserves global consistency. Finally, we assemble the aligned outputs into a tree‑structured document representation, further enriched with node chunking and summaries for downstream retrieval and analysis. Empirical results show MinerU‑Popo improves title‑hierarchy TEDS by at least 20% across all five tested OCR models, improves RAG accuracy and reduces per‑query latency.
Authors:Xiangdong Zhang, Debing Zhang, Shaofeng Zhang, Xiaohan Qin, Yu Cheng, Junchi Yan
Abstract:
Standard next‑token prediction (NTP) supervises language models solely through discrete labels in the output logit space. We argue that this sparse one‑hot supervision leaves the latent representation space under‑constrained, allowing hidden states to drift into degenerate and anisotropic configurations that can limit generalization. To address this issue, we propose Next Implicit Token Prediction (NITP), which augments discrete prediction with dense continuous supervision directly in the representation space. NITP trains the model to predict the implicit semantic content of the next token, using shallow‑layer representations from the same model as stable self‑supervised targets. We provide theoretical analysis showing that NITP regularizes the optimization landscape by mitigating under‑constrained degrees of freedom and encouraging a compact, structured representation geometry. Empirically, across dense and MoE models ranging from 0.5B to 9B parameters, NITP consistently improves downstream performance with negligible computational overhead. On a 9B MoE model, NITP achieves a 5.7% absolute improvement on MMLU‑Pro, along with gains of 6.4% on C3 and 4.3% on CommonsenseQA, with approximately 2% additional training FLOPs and no additional inference cost. Our implementation is available at https://github.com/aHapBean/NITP.
Authors:Yunao Zheng, Guoyang Xia, Xiaojie Wang, Lei Ren
Abstract:
Sequence modeling requires both compositional reasoning and local static knowledge retrieval, yet standard Transformers handle both through dense computation. Engram partially decouples retrieval from the backbone, but its token‑based keys remain tied to text tokenization and hash compression. We propose Lngram, a latent‑space conditional memory module that learns discrete symbols directly from hidden states and performs N‑gram lookup over these symbols. This design removes the dependence on tokenizer IDs and naturally extends to non‑text modalities. In our evaluated settings, Lngram outperforms Transformer and Engram baselines, consistently reduces perplexity in long‑context language modeling, and effectively injects domain knowledge when added post hoc to pretrained models. Joint training with the backbone further surpasses full fine‑tuning, while experiments on vision‑language and vision‑language‑action tasks show overall gains. Analyses with LogitLens and CKA suggest that Lngram enables prediction‑relevant information to emerge earlier, increasing effective depth with limited inference and memory overhead. Code is available at https://github.com/zyaaa‑ux/Lngram.
Authors:Bohang Sun, Max Zhu, Francesco Caso, Jindong Gu, Junchi Yu, Philip Torr, Pietro Liò, Jialin Yu
Abstract:
Diffusion large language models promise faster generation by refining many token positions in parallel, but this parallelism introduces a hidden control problem: which proposed tokens should be transferred into the partially decoded sequence at each step? We refer to this decision as token commitment. Existing frozen‑generator decoders largely rely on hand‑designed confidence rules or block‑specific acceptance filters. We argue that token commitment can instead be learned as a reusable trace‑state policy. We introduce TraceLock, a lightweight plug‑in controller that instantiates this policy for a frozen diffusion language model. Since oracle commitment times are unavailable, TraceLock derives self‑supervision from future stability: at decoding step t, a proposed token for position i is labeled stable if it matches the final token at position i after the full decoding trace completes. The controller scores variable‑length trace states and decides which active token proposals should be committed to the partially decoded sequence. Once trained for a given frozen backbone, the controller can be deployed across local‑window widths, generation lengths, and step budgets without retraining or per‑setting calibration. Experiments on question answering, mathematical reasoning, and code generation show that TraceLock improves the quality‑step tradeoff over heuristic and learned baselines, with particularly stable behavior under cross‑setting deployment. Diagnostic analyses show that its decisions are not reducible to scalar confidence, suggesting that frozen diffusion language models expose a learnable space of commitment trajectories beyond confidence‑based decoding. Code is available at https://github.com/BobSun98/TraceLock.
Authors:Peisong Wang, Bowen Liu, Zehua Li, Yuyao Wang, Zhiwei Ma, Yuhan Li, Jia Li
Abstract:
Large language models still struggle with contest‑level programming, while many agentic remedies rely on massive inference‑time sampling or expensive multi‑stage post‑training. We study when execution feedback reliably helps an LLM CP solver and which mechanisms govern the gains. We model feedback‑driven solving as a calibrated stopped process and identify three quantities: false‑admission risk, program‑level evidence against bad programs, and the active‑state success hazard. Under held‑out trace calibration and selection from a pre‑declared finite controller manifest, the resulting structural certificate lower‑bounds the clean success probability before false admission. We instantiate mechanisms targeting these quantities as Dual‑Granularity Verification, Test Augmentation, and Experience‑Driven Self‑Evolving, yielding CP‑Agent. Without updating any parameters, CP‑Agent raises Pass@1 from 25.8% to 48.5% on LiveCodeBench Pro and improves Refine@5 by 11.0% on ICPC‑Eval. Across three LLM backbones, CP‑Agent lies on the cost‑‑accuracy efficiency frontier, and ablations show that each component primarily affects its corresponding certificate quantity.
Authors:Dingfeng Jiang, Han Yan, Chenze Ma, Amit Kumar Jaiswal, Ang Li, Yunxiang Jiang, Xinlei Xiong, Juhao Liang, Hongru Xiao, Xiang Li, Fan Bu, Jiale Han, Ruchir Gupta, Prayag Tiwari, Benyou Wang
Abstract:
Medical large language models hold promise for reducing healthcare disparities, yet Hindi remains severely underrepresented. While medical LLMs excel in high‑resource languages, their performance degrades sharply in Hindi, particularly on Indian systems of medicine. We argue that robust cross‑lingual medical transfer requires Hindi reasoning. To this end, we introduce HiMed, a Hindi reasoning medical corpus and benchmark suite covering both Western and Indian medicine. We further propose HiMed‑8B, a Hindi‑form medical reasoning LLM, through the design of decaying scaffolding reward. Extensive experiments demonstrate improvement in Hindi medical reasoning performance and reduction in the English‑‑Hindi accuracy gap. Ablation studies validate the contribution of each training stage and reward component. All data and code are available on GitHub: https://github.com/FreedomIntelligence/HiMed.
Authors:Jaeung Lee, Dohyun Kim, Jaemin Jo
Abstract:
Large language model (LLM) unlearning has emerged as a crucial post‑hoc mechanism for privacy protection and AI safety, yet auditing whether target knowledge is truly erased remains challenging. Existing output‑level metrics fail to detect when this knowledge remains recoverable from internal representations. Recent white‑box studies reveal such residual knowledge but often rely on auxiliary training or dataset‑specific adaptations, leaving no generalizable metric. To address these limitations, we propose the Unlearning Depth Score (UDS), a metric that quantifies the mechanistic depth of unlearning via activation patching. UDS first identifies layers that encode the target knowledge using a retain model baseline, then measures how much of it is erased in the unlearned model on a 0‑1 scale. In a meta‑evaluation across 20 metrics on 150 unlearned models spanning 8 methods, UDS achieves the highest faithfulness and robustness, confirming our causal approach as the most reliable for unlearning evaluation. Case studies further reveal that white‑box metrics can disagree at the layer level and that erasure depth varies across examples. We provide guidelines for integrating UDS into existing benchmarking frameworks and streamlining the evaluation pipeline. Code and data are available at https://github.com/gnueaj/unlearning‑depth‑score
Authors:Haizhou Xia
Abstract:
Post‑hoc repair of LLM mathematical reasoning introduces an asymmetric risk: fixing an incorrect reasoning trace is useful, but replacing a trace that was already correct can be harmful. We study this problem under a selective replacement setting, where a system must decide whether a repaired candidate is safer than preserving the original cached trace. We present GuardedRepair, a guarded best‑of‑N repair framework that diagnoses cached reasoning traces, selectively triggers repair, and accepts answer‑changing candidates only when deterministic verification guards support replacement. The framework combines lightweight symbolic checks, surface semantic‑risk diagnostics, bounded candidate generation, and conservative acceptance policies. On the full GSM8K test set, where the initial reasoner already achieves 95.60% accuracy, GuardedRepair improves final accuracy to 96.89%, fixing 17 of 58 remaining errors without measured broken‑correct cases in the main run. On a weak‑reasoner ASDiv setting, accuracy improves from 78.40% to 87.60%. Direct regeneration baselines show that this gain is not explained by stronger‑model re‑solving alone: re‑solving all GSM8K examples lowers accuracy to 93.03% and breaks 47 initially correct answers. Additional analyses show that guarded repair substantially improves the fixed/broken tradeoff, while also revealing that replacement risk is reduced rather than eliminated. These results support viewing post‑hoc repair as harm‑aware selective replacement rather than unconstrained re‑solving.
Authors:Piotr Wilam
Abstract:
A sparse 8‑layer code transformer develops dedicated neural circuitry for every Python construct tested, and that circuitry is organised by a clean computational principle rather than by semantic category. We extract neural circuits for 106 concepts (43 AST node types, 63 builtin objects) by marginalising across 63,800 controlled prompts, and decompose each circuit into concept‑specific and token‑driven components using contrastive checker prompts that present a keyword token without its associated syntactic structure. Three findings emerge. First, all 106 concepts produce non‑empty universal circuits at every one of nine parameter settings, and the ranking of concept‑specificity across constructs is stable across the sweep ‑ survival is not an artifact of a permissive threshold. Second, AST circuits contain a genuine concept component distinct from token activation: concept‑only neurons constitute up to 62.5% of the loudest‑firing neurons at mid‑to‑late layers, while builtin circuits are almost entirely token‑driven. Third, six computationally atomic constructs ‑ Import, ImportFrom, Break, Continue, Pass, Assert ‑ cluster together despite being semantically unrelated, sharing only the property of being single‑statement constructs requiring no nested body; this atomicity super‑cluster, together with a four‑tier hierarchy organised by token ambiguity and structural distinctiveness, shows that the model's internal organisation tracks computational structure rather than meaning. The methodology, full decomposition data, and analysis code are released.
Authors:Yuki Nakamura
Abstract:
Comparing a model's internal activations before and after alignment is a natural way to ask what safety training changes: one forms the matrix of paired aligned‑minus‑base activations on safety‑relevant inputs and reads off its effective rank or top direction. We show the obvious way to form this matrix is confounded. The aligned model is evaluated under a chat template the base model never saw, so the naive difference conflates the alignment shift with chat formatting. We introduce a four‑variant decomposition of the modification matrix (naive, template‑controlled, within‑aligned, and difference‑in‑differences, DiD) that separates the two effects. Template control alone removes a 2.0‑3.9x inflation of the measured effective rank across Llama‑3.1‑8B, Gemma‑2‑9B, and Qwen‑2.5‑7B; the DiD contrast is what recovers the refusal direction of Arditi et al. (2024), lifting its cosine alignment from 0.18‑0.39 to 0.50‑0.86. Projection‑ablation across the three families confirms the recovered subspace is behaviorally active and that singular‑value order is not causal order. We validate the protocol on a controlled testbed and distill it into measurement recommendations for activation‑difference studies of alignment.
Authors:Yosef Worku Alemneh, Kidist Amde Mekonnen, Maarten de Rijke
Abstract:
Multilingual retrieval increasingly underpins cross‑lingual question answering and retrieval‑augmented generation. Strong zero‑shot scores on multilingual benchmarks are often taken as evidence that current encoders transfer reliably across many languages. We argue that this assumption breaks down for underrepresented, morphologically rich languages, and use Amharic as a diagnostic case. Under a shared passage retrieval protocol covering dense, late‑interaction, learned sparse, and cross‑encoder paradigms, we compare zero‑shot multilingual retrievers, Amharic‑fine‑tuned multilingual retrievers, and monolingual Amharic retrievers. The strongest zero‑shot multilingual retriever underperforms the strongest monolingual Amharic first‑stage retriever by 23% relative MRR@10. Fine‑tuning two recent multilingual embedding models on the same Amharic supervision yields 32‑60% relative MRR@10 gains over zero‑shot, but the best Amharic‑fine‑tuned multilingual model remains below the strongest monolingual Amharic retriever. These findings indicate that zero‑shot multilingual retrieval is not a sufficient proxy for equitable information access in the LLM era: for underrepresented languages, retrieval must be evaluated and adapted in‑language rather than inferred from aggregate multilingual benchmarks. To foster future research, we publicly release the dataset, codebase, and trained models at https://github.com/rasyosef/amharic‑neural‑ir.
Authors:Zexuan Chen, Sichao Liu, Runhao Lu, Huichao Qi, Alexandra Woolgar, Xi Vincent Wang, Lihui Wang
Abstract:
Visual decoding from brain signals is a key challenge at the intersection of computer vision and neuroscience, requiring methods that bridge neural representations and computational models of vision. We introduce a tri‑modal contrastive framework for EEG‑based visual decoding that aligns EEG, visual, and textual representations within a unified latent space. Our approach follows a two‑stage design. First, we pre‑train an EEG encoder via masked reconstruction on unlabeled trials, learning spatio‑temporal regularities that transfer robustly to downstream tasks. Second, we jointly align EEG, image, and LLM‑generated textual descriptions through contrastive learning, where text supervision acts as a semantic regularizer that injects linguistic structure into the shared space without overwhelming the primary EEG‑image signal. The encoder integrates subject‑specific adaptation, graph‑attention over channels, and temporal‑spatial convolutional embeddings. On the Things‑EEG2 200‑way zero‑shot benchmark, our framework achieves 54.1% Top‑1 and 83.4% Top‑5 accuracy, substantially exceeding the strongest prior baseline (32.4% / 64.0%), with paired Wilcoxon tests confirming significance (p < 0.01) over all in‑subject baselines. We validate generalization on Things‑MEG. Analysis reveals that compact embedding geometries (CN‑CLIP) outperform much larger backbones, and that decoding aligns with established neurophysiology of visual processing. This work is a critical step towards robust, semantically‑grounded visual decoding from non‑invasive temporal neural signals. The source code is publicly available in https://github.com/anon‑eeg/eeg_image_decoding.
Authors:Spandan Pratyush
Abstract:
The quadratic complexity of self‑attention in Transformer models remains a significant bottleneck for processing long sequences and deploying large language models efficiently. For this approach, there has been significant research into Sparse Attention, and Deepseek Sparse Attention has combined various methods of creating segments of tokens to reduce the time complexity. This paper introduces a novel approach, Grammatically‑Guided Sparse Attention, which constrains attention computations based on the grammatical roles of tokens. By leveraging Parts‑of‑Speech (POS) tags, attention masks are dynamically generated that enforce linguistically coherent connections between tokens, reducing the computational graph without sacrificing essential linguistic dependencies. Two masking strategies are proposed and evaluated: a hard mask that strictly allows only predefined grammatical interactions, and a soft mask that biases attention towards these interactions. The experiments, conducted on the SST‑2 sentiment classification task using a DistilBERT‑like architecture, demonstrate that Grammatically‑Guided Sparse Attention maintains comparable accuracy to full attention while significantly reducing the theoretical computational overhead. Preliminary results show accuracy values of 0.8200 for hard masking and 0.8165 for soft masking, closely matching the 0.8200 of full attention, providing a path towards more efficient, interpretable, and linguistically‑informed Transformer architectures.
Authors:Weiming Wang, Junyu Lu, Han Wang, Xiaokun Zhang, Zewen Bai, Bo Xu, Liang Yang, Hongfei Lin
Abstract:
Research on harmful meme detection has garnered significant attention, resulting in the development of numerous datasets and methods. However, progress in detecting Chinese harmful memes lags considerably, primarily due to two challenges: first, accurately assessing a meme's harmfulness depends heavily on understanding deep cultural context; second, many memes are semantically ambiguous, making harmfulness highly subjective. To address these issues, we focus on the interpretable detection of Chinese harmful memes by constructing the first Chinese harmful meme explanation dataset, Ex‑ToxiCN‑MM. This dataset offers opposing interpretations, categorized as "harmful" and "non‑harmful", for each meme, aiming to rigorously evaluate a model's ability to discern and comprehend ambiguous, culturally grounded content. We built a specialized knowledge base of Chinese cultural concepts and offensive vocabulary to supply models with essential prior knowledge (C‑HarmKB). To address the ambiguity and lack of background knowledge in meme attribution, we have developed a comprehensive attribution analysis framework, RIKE, which includes an Attribution Knowledge Enhancement module (AKE) and a Relative Intent Reasoning module (RIR). Extensive quantitative and qualitative experiments demonstrate that our method outperforms mainstream baseline models across multiple metrics in the task of attributing harmful memes in Chinese. The code, Ex‑ToxiCN‑MM dataset, and Chinese Harmful Semantic Knowledge Base (C‑HarmKB) involved in this study have been open‑sourced at https://github.com/wimiw123/Ex‑ToxiCN‑MM
Authors:Nazif Can Tamer, Victoria Ebert, Guang Yang, Noah A. Smith
Abstract:
We consider the conversion of musical recordings into human‑readable sheet music annotated with timestamps. Such output lets a listener clearly visualize rubato (temporally expressive playing), a learner diagnose ensemble precision and timing choices against the written music, and a musicology scholar compare performance styles across recordings of the same work. We introduce (1) a prompt‑conditioned encoder‑decoder model, named Rubato, trained to output (2) a new textual representation for polyphonic music, named InterMo, which we designed for compatibility with sequence‑to‑sequence training. Our experiments demonstrate that Rubato produces timestamped piano sheet music from audio with higher notational accuracy than the best existing approaches, which are based on cascades. We find that even if the cascade is given ground‑truth MIDI instead of audio, Rubato performs better, suggesting that the ceiling of existing approaches is primarily representational, not acoustic. Further, because Rubato is trained on several related tasks (with prompts), it competes with or outperforms the best single‑task systems on related but simpler tasks like MIDI note grounding and beat/downbeat detection. A demo is available at https://nctamer.github.io/rubato‑transcription .
Authors:Jinghan Jia, Joe Benton, Eric Easley
Abstract:
Chain‑of‑thought (CoT) reasoning is useful for monitoring language models only when the reasoning trace faithfully reflects the computation that produces the final answer. However, models can rely on prompt‑to‑answer shortcuts that bypass the CoT, making the visible reasoning trace misleading even when it appears plausible. We study CoT faithfulness through a structural information‑flow perspective: faithful reasoning should route answer‑relevant information through the mediated path from prompt to CoT to answer, rather than through a direct prompt‑to‑answer shortcut. This perspective yields a task‑agnostic framework based on three complementary properties, sufficiency, completeness, and necessity, which we instantiate with entropy‑based, masked‑KL, and gradient‑based diagnostics. We show that these metrics recover externally judged faithfulness differences in hinted reasoning, and identify a low‑entropy failure mode of KL‑based diagnostics where gradient‑based measures remain more stable. Building on this analysis, we introduce update‑time interventions for verifier‑based on‑policy RL, including attention masking, backward‑only gradient masking, CoT gradients, and adversarial perturbations of prompt representations. Across hinted arithmetic, reward‑hackable code repair, and DAPO‑Math models trained without hints but evaluated under wrong‑hint injection, our interventions shift behavioral and structural indicators toward stronger CoT mediation. In particular, they make shortcut and reward‑hacking behavior more transparent in the CoT and improve task‑agnostic faithfulness metrics, while in some settings also reducing wrong‑hint susceptibility. Our results suggest that controlling information flow during training is a practical route toward more faithful and monitorable CoT reasoning. Code is available at https://github.com/safety‑research/faithful‑cot.
Authors:Amirmohammad Ziaei Bideh, Shameed Charlomar Job, Ava Yahyapour, Alla Rozovskaya
Abstract:
We describe our submission to the CLPsych~2026 Shared Task on capturing and characterizing mental health changes through social media timeline dynamics. To infer the dominant self‑states in posts (Tasks 1.1 and 1.2), we ensemble in‑context learning of three open‑weight large language models using majority voting. For predicting moments of change in a timeline (Task~2), we train supervised classifiers on features derived from Task~1.1 predictions. To summarize the patterns of mood dynamics and their progression over time within a timeline (Task 3.1), we augment in‑context example labels predicted by upstream systems (Tasks 1.1, 1.2, and 2), yielding performance gains over zero‑shot and unaugmented in‑context learning baselines. Our submission ranked first on Task~1.1, fourth on Task~1.2, fourth on Task~2, and third on Task~3.1.\footnoteThe source code for the experiments is available at https://github.com/amirzia/clpsych26‑cuny
Authors:Yanyu Chen, Jiyue Jiang, Dianzhi Yu, Zheng Wu, Jiahong Liu, Jiaming Han, Xiao Guo, Jinhu Qi, Yu Li, Yifei Zhang, Irwin King
Abstract:
The evolution of Large Language Model (LLM) reasoning is bottlenecked by the scarcity of high‑quality process data. While self‑alignment via endogenous rewards offers a solution, mining valid supervision faces three challenges: (1) Label Noise via Mimetic Bias, where rewards prioritize statistical likelihood over logical truth, creating a "correctness illusion" that masks compounding errors; (2) Coarse‑Grained Supervision, where sparse global outcomes (e.g., in GRPO) fail to provide granular guidance, treating reasoning chains as monolithic; and (3) Distributional Collapse, where signals fail to generalize without amplifying pre‑training biases. To address these, we introduce LC‑ERD (Logic‑Consistent Endogenous Reward Decomposition), a framework framing self‑alignment as latent structure mining. We derive a Variational Logic Potential by aggregating consensus from the model's Latent Logic Expertise (LLE) to denoise the reasoning manifold, and introduce a Multi‑Agent Value Decomposition protocol based on the IGM principle to quantify individual step utility. Experiments show LC‑ERD delivers a robust self‑evolution path, uncovering trade‑offs between logic consistency and accuracy while identifying high‑value reasoning patterns missed by standard rewards. Our code is available at https://github.com/LC‑ERD‑repo/LC‑ERD.
Authors:Feisal Alaswad, Batoul Aljaddouh, Maher Alrahhal, Poovammal E, Talal Bonny
Abstract:
Large language models achieve strong performance in language generation and knowledge‑intensive tasks, yet remain limited in settings requiring causal reasoning, persistent state tracking, and long‑horizon planning. We argue that these limitations may arise from an objective‑level mismatch between sequence prediction and reasoning over latent environment dynamics. To formalize this distinction, we introduce Latent Dynamics Inference (LDI), a conceptual perspective that interprets language and multimodal observations as partial evidence of underlying transition dynamics. To empirically investigate this perspective, we introduce Flux, a sequential reasoning environment specified entirely through natural‑language rules. As a proof‑of‑concept case study, the rules are first compiled into an explicit state‑transition simulator, illustrating that structured latent transition dynamics can, in some cases, be operationally extracted from textual rule descriptions. This enables a controlled comparison between the LLMs operating purely over textual observations and reinforcement‑learning agents trained directly within the extracted latent state space. Within this case study, agents operating with explicit access to the latent state space exhibit substantially more stable behavior in long‑horizon gameplay, achieving an aggregate win rate of approximately 79% versus 11% for LLMs. Qualitative analysis further reveals failure modes consistent with unstable persistent state tracking, including invalid actions, state‑tracking errors, and short‑horizon reasoning failures. The complete implementation of the Flux environment available at https://github.com/FeisalAlaswad/FLUX‑RL‑Agent Within the evaluated setting, these results suggest that strong sequence prediction alone may struggle to support robust long‑horizon dynamic reasoning without mechanisms for persistent state tracking and transition modeling
Authors:Sebastien Kawada
Abstract:
How do multi‑turn reasoning systems fail? The expected answer is logical contradiction, in which the system's maintained state becomes unsatisfiable. We show that the dominant mode is instead satisfiable drift, where the internal state stays consistent while the returned answer silently violates prior commitments. We build DRIFT‑Bench (Decomposing Reasoning Into Failure Types), a solver‑instrumented benchmark of 816 test problems across three constraint domains, and evaluate four methods on it across four open‑weight models (8B‑120B parameters). MUS‑Repair, which feeds minimal unsatisfiable subsets back to the generator, is strongest in every setting (+1.8 to +15.0 pp over the best non‑MUS baseline). But the central finding is what repair leaves behind. After structured feedback, models rarely contradict themselves. They forget. Residual errors are 98‑100% satisfiable drift across all settings, while contradiction drops to near zero. Reliable multi‑turn systems must separately validate that the returned answer respects the maintained state. Code is available at https://github.com/kaons‑research/drift‑bench.
Authors:Sam Earle, Kai Arulkumaran, Andrew Dai, Akarsh Kumar, Julian Togelius, Sebastian Risi
Abstract:
We are in the midst of large‑scale industrial and academic efforts to automate the processes of scientific, technological and creative production through AI‑driven assistants. Historically, a fundamental property of these processes in their human form has been their open‑endedness: their capacity for generating a seemingly endless supply of novel and meaningful new forms. Do artificial agents have any capacity for such fruitful unguided discovery? To answer this question, we turn to Picbreeder, the canonical exemplar of human‑driven open‑ended search, in which users collaboratively generated a diverse library of images through interactive evolution of small neural networks. We replicate Picbreeder, replacing human users with frontier Vision Language Models (VLMs). We observe clear qualitative differences between the output of our system and the historical human baseline, and attempt to characterize them using metrics of phylogenetic complexity and visual and semantic salience and novelty. In an effort to identify some of the causal factors contributing these differences, we study the addition of exploratory noise to the agents' selection process, of behavioral diversity between agents, and of narrative momentum in the form of memory of past actions. We make our code available at https://github.com/smearle/picbreeder‑vlm.
Authors:Beichen Zhang, Yuhong Liu, Jinsong Li, Yuhang Zang, Jiaqi Wang, Dahua Lin
Abstract:
Multimodal Large Language Models have advanced visual reasoning, yet a purely textual chain of thought remains a bottleneck for questions that require fine‑grained focus or view transformations. The ''think with images'' paradigm narrows this gap, but existing approaches are either constrained by fixed predefined toolkits or produce noisy intermediate images from unified multimodal methods. We pursue a third option: using a dedicated image editing model and decouple it with an understanding model. However, off‑the‑shelf image editors fail as reasoning assistants with two complementary gaps: a language‑side gap, where editors trained as passive instruction‑followers cannot map an abstract question to an appropriate visual transformation, and a generation‑side gap, where edit correctness degrades as reasoning depth grows. Guided by this analysis, we introduce ETCHR (Editing To Clarify and Harness Reasoning), a question‑conditioned, reasoning‑aware image editor decoupled from the downstream understanding model and trained with a two‑stage recipe targeted at the two gaps: Reasoning Imitation via supervised fine‑tuning on edit trajectories, followed by Reasoning Enhancement with VLM‑derived rewards for edit correctness and downstream reasoning accuracy. Since the editor is decoupled, ETCHR plugs into different open‑ and closed‑source MLLMs in a training‑free manner. Across five task families (fine‑grained perception, chart understanding, logic reasoning, jigsaw restoration, and 3D understanding), ETCHR raises average Pass@1 from 55.95 to 60.77 (+4.82) with Qwen3‑VL‑8B, from 65.08 to 70.55 (+5.47) with Gemini‑3.1‑Flash‑Lite, and from 76.55 to 81.16 (+4.61) with the 1T‑parameter MoE model Kimi K2.5.
Authors:Michal Shlapentokh-Rothman, Prachi Garg, Yu-Xiong Wang, Derek Hoiem
Abstract:
Keyframe selection is a direct way to provide verifiable visual evidence for long‑video question answering (QA). Queries differ in what they require, and finding the right frames depends on knowing what to look for. Existing keyframe selectors either score every frame against a single query, or decompose the query into a fixed schema evaluated by a single visual tool. We propose ToolMerge, a keyframe retrieval method based on decomposition and merging: an Large Language Model (LLM) based planner decomposes the query into tool calls and specifies how their per‑tool rankings are merged using boolean operators. To evaluate retrieval directly, we construct Molmo‑2 Moments (M2M), a benchmark in which every question is anchored to a specific time interval by construction. Across QA, question retrieval, and caption retrieval, ToolMerge is competitive with prior keyframe selectors, most notably on caption retrieval, outperforming other methods by 5%. Code and data can be found at https://github.com/michalsr/ToolMerge .
Authors:Jiangwang Chen, Bowen Zhang, Zixin Song, Jiazheng Kang, Xiao Yang, Da Zhu, Guanjun Jiang
Abstract:
Although large language model (LLM) conversational systems process millions of multi‑turn dialogues daily, they remain fundamentally reactive: they respond only after the user types a query. A key step toward proactive interaction is next‑query prediction, which anticipates the user's subsequent query based solely on the preceding dialogue. Progress on this task is hindered by the lack of dedicated benchmarks and a fundamental efficiency‑‑quality trade‑off: naively concatenating full dialogue history incurs linearly growing token consumption, while truncating to the latest turn discards crucial cross‑turn context. Our key insight is that accurate prediction does not require re‑reading raw history; it suffices to track the user's evolving intent trajectory across topics, unresolved needs, and interest shifts. We propose OnePred, which maintains a recursively updated memory as its sole cross‑turn context, bounding the per‑turn cost independently of conversation length. We train the model via a two‑stage reinforcement learning pipeline that first teaches what to predict, then what to compress, shaping the memory into a prediction‑oriented intent chain. To establish a rigorous testbed, we introduce NQP‑Bench, spanning three diverse subsets. Experiments demonstrate that OnePred reduces per‑turn token consumption by up to 22× compared to full‑history inputs while consistently exceeding all baselines in prediction quality, with larger gains on longer conversations. Our code is publicly available at https://github.com/ZBWpro/OnePred.
Authors:Jiahao Ying, Boxian Ai, Wei Tang, Siyuan Liu, Yixin Cao
Abstract:
Skills, i.e., structured workflow instructions distilled for large language models (LLMs), are becoming an increasingly important mechanism for improving agent performance on real‑world downstream tasks. However, as the open‑source skill ecosystem rapidly expands, it remains unclear how different models and agent frameworks interact with skills, how to evaluate skill quality, and how users should select skills under practical cost‑performance trade‑offs. In this paper, we present \textscOpenSkillEval, an automatic evaluation framework for both skill‑augmented agent systems and the skills themselves. Instead of relying on static benchmarks, \textscOpenSkillEval automatically constructs realistic task instances from evolving real‑world artifacts across five categories of downstream applications: presentation generation, front‑end web design, poster generation, data visualization, and report generation. It further collects and organizes community‑contributed skills for controlled comparison under unified task settings. Using more than 600 dynamically generated task instances and 30 open‑source skills, we conduct a systematic evaluation of state‑of‑the‑art models and agent frameworks. Our results show that skill availability does not guarantee effective skill usage, that the benefit of skill augmentation depends strongly on both the underlying model and the agent framework, and that many publicly popular skills do not consistently outperform base agents without skills. These findings highlight the need for dynamic, task‑grounded evaluation and provide practical insights into the design, selection, and deployment of skills for LLM agents. Additional cases and benchmark resources are available on the project website: https://yingjiahao14.github.io/OpenSkillEval‑Web/.
Authors:Björn Nieth, Marianna Gracheva, Michaela Mahlberg, Bjoern Eskofier, Emmanuelle Salin
Abstract:
While factual correctness and task‑performance have been in focus of Large Language Model (LLM) research for a long time, the fundamental question of how human‑like generated texts are on a linguistic level has been underexplored. From a corpus‑linguistic perspective, language production is inherently context‑dependent, with distinct communicative contexts giving rise to differences in frequencies and co‑occurrence patterns of linguistic features. A text failing to adhere to these patterns can be content‑wise correct, but still be unfavorable to human readers. In this work, we propose a context‑aware evaluation framework in which human‑likeness is assessed using a two‑sample problem between the linguistic feature distribution of a human reference corpus for a given register and a corresponding LLM‑generated corpus. We implement this framework using the Maximum Mean Discrepancy (MMD) and the 67 lexico‑grammatical features introduced by Biber, which are commonly applied in corpus linguistics. In our experiments, we compare seven instruction‑tuned, open‑source models across five English‑language datasets spanning distinct registers against a human baseline. While across all tested setups, LLMs deviate from the human baseline, which models are closest to human language depends on the register and is not dictated by model size.
Authors:Stefano Cirillo, Domenico Desiato, Giuseppe Polese, Giandomenico Solimando
Abstract:
We benchmark Google Embeddings (GE2), a Vertex‑AI‑hosted bi‑encoder with 2,048‑token context and explicit task‑type conditioning, against five open‑source alternatives: BGE‑M3, E5‑large, Multilingual‑E5‑large (mE5‑L), LaBSE, and Paraphrase‑Multilingual‑MPNet (mMPNet). Evaluation covers four BEIR subsets, a synthetic Italian RAG corpus, a chunking ablation considering 5 sizes of tokens with three strategies, and per‑query latency on commodity CPU hardware. GE2 ranks first on every task, achieving BEIR avg.nDCG@10 = 0.638 and IT‑RAG‑Bench nDCG@10 = 0.282, but at 231.6 ms median latency, it is roughly 14x slower than the fastest local models. mE5‑L reaches within 0.003 nDCG of GE2 on Italian at 31 ms, making it the preferred option when sub‑100 ms SLAs matter. A more striking finding concerns LaBSE, which, despite widespread multilingual deployment scores 0.188 average nDCG@10 on BEIR, below every dedicated retrieval model including mMPNet. Chunking experiments show that all six models saturate at 32‑token chunks on our corpus, with semantic chunking providing measurable gains only at 16 tokens.
Authors:Zhangyi Hu, Chenhui Liu, Tian Huang, Jindong Li, Yang Yang, Jiemin Wu, Zining Zhong, Menglin Yang, Yutao Yue
Abstract:
Recently, Reinforcement Learning with Verifiable Rewards (RLVR) and Test‑Time Scaling (TTS) have advanced LLM code generation through executable verification. Yet Ground‑Truth Unit Tests (GT UTs) remain a bottleneck: SOTA RLVR methods require them for costly training, while existing TTS methods lose competitiveness without them. This motivates GT‑free TTS, where existing methods directly use self‑generated UTs to refine and select code candidates. Yet such UTs are often noisy or spuriously coupled with wrong code, and UT quality in turn cannot be validated without reliable code. The key challenge is therefore to jointly improve both. To this end, we present CoSPlay, a GT‑free, training‑free framework that jointly improves codes and UTs through cooperative self‑play. It first explores diverse solution ideas and identifies their potential failure modes to produce discriminative UT ideas. It then uses bidirectional pass‑count signals from the Code‑UT execution matrix to iteratively prune or fix weak codes and refresh or replace unreliable UTs, letting the two pools co‑evolve. Finally, when multiple codes remain tied at the highest pass count, it picks the final code from the largest output‑consensus cluster, since correct codes agree on the same inputs while wrong codes diverge. Experiments on four challenging benchmarks show that CoSPlay on Qwen2.5‑7B‑Instruct improves average BoN from 22.1% to 33.2% and UT accuracy from 14.6% to 78.3%, matching or surpassing the RLVR model CURE‑7B. When applied to CURE‑7B, it further improves BoN by 5.7%. CoSPlay also generalizes across diverse backbones and outperforms GT‑free TTS baselines under comparable token budgets, with continued gains as the budget scales up. These results suggest a scalable inference strategy for competitive code generation without any GT data.
Authors:Muhammad Usama, Dong Eui Chang
Abstract:
Large language models trained under diverse objectives and architectures have been shown to develop increasingly similar internal representations, an observation formalized as the Platonic Representation Hypothesis. Whether this representational convergence extends to the reasoning processes that operate over shared representations remains untested. We evaluate representational similarity across 16 language models from 8 families (1.5B to 72B parameters) on 800 reasoning problems spanning mathematics, science, commonsense, and truthfulness, stratifying by problem difficulty, computational stage, and causal relevance. Our analysis reveals three dissociations: a difficulty inversion, where models converge more on problems they collectively fail (Centered Kernel Alignment [CKA] = 0.897) than on those they solve (CKA = 0.830); a generation gap, where pre‑decision representations align (CKA = 0.875) while post‑decision representations diverge (CKA = 0.274); and epiphenomenal correctness, where shared information is decodable across models (66% transfer accuracy) but exerts minimal causal influence on predictions (1.5% to 5.5% flip rate across ablation protocols). These results indicate that representational convergence in language models reflects shared input processing constraints rather than shared reasoning strategies, with direct implications for ensemble design, interpretability transfer, and evaluations of model similarity. Code is available at https://github.com/Usama1002/convergence‑without‑understanding.
Authors:Gabriele Oliaro, Yichao Fu, May Jiang, Owen Lu, Junli Wang, Zhihao Jia, Hao Zhang, Samyam Rajbhandari
Abstract:
LLM‑based agents for GPU kernel generation are advancing rapidly, yet their progress is fundamentally constrained by the benchmarks they optimize against. Existing benchmarks are poorly aligned with production inference frameworks: they evaluate kernels on a single GPU with synthetic inputs, ignore the surrounding compilation stack, and reward replicating known optimizations rather than discovering new ones. The resulting reward signals are misleading: agents learn to generate kernels that score well in sandboxes but introduce interface incompatibilities, compilation‑stack conflicts, and silent correctness degradation when integrated into real systems. We introduce FastKernels, a kernel benchmark built around a minimal set of 46 representative architectures spanning 8 categories, whose kernels collectively subsume those of 96.2% (409/425) of HuggingFace Transformers architectures. FastKernels doubles as a minimalistic, production‑grade inference framework that runs at parity with hardened systems such as vLLM and SGLang on mainstream LLM serving and substantially exceeds upstream references on under‑served architectures; each task's interface mirrors the corresponding module in the state‑of‑the‑art library for its architecture family, enabling direct deployment of optimized kernels into production codebases. Evaluating state‑of‑the‑art kernel agents on FastKernels, we find that even the strongest agent achieves only 0.94× aggregate speedup over production baselines, with weaker agents at 0.78× and 0.53× ‑‑ confirming that benchmark‑production misalignment is a critical bottleneck for the field. We release FastKernels as a stepping stone toward kernel agents whose benchmark gains translate directly into production throughput improvements. Code is available at https://github.com/Snowflake‑AI‑Research/fastkernels
Authors:Yiyang Wang, Moeiini Reilly, Britney Johnson, Kefei Yan, Alex Cabral, Josiah Hester
Abstract:
Gardening is critical to support well‑being, cultural continuity, and food autonomy, yet existing digital tools often provide generic advice that overlooks gardeners' skills, local ecologies, seasons, and cultural contexts. We introduce CultivAgents, a relationship‑centered multi‑agent system for personalized, socio‑culturally grounded gardening support. Grounded in ethics of care, CultivAgents coordinates multiple specialized agents: an Experience Agent that adapts guidance to users' skill levels, an Environmental Agent that grounds advice in local and seasonal conditions, and an Ethnobotanical Agent that connects plants to cultural knowledge and histories. We evaluated CultivAgents through a three‑phase mixed‑methods study with domain experts (n=3), HCI researchers (n=7), and community gardeners (n=5), analyzing expert feedback, pre/post surveys, and participatory design activities. Results suggest that CultivAgents helped gardeners translate interest into situated action: community gardeners reported increased confidence (3.00 to 3.60), motivation (4.00 to 4.40), and trust in acting on AI advice (3.20 to 4.00). Participants valued hyperlocal ecological guidance and complementary agent perspectives, while also identifying limits in cultural specificity, ecological grounding, and agent coordination. The work advances relationship‑centered AI, offering design implications for multi‑agent systems that support food sovereignty, community resilience, and cultural preservation.
Authors:Eric Xu
Abstract:
Role prompts of the form As X, do Y admit a clean linear decomposition at one specific site in the residual stream: the prompt‑to‑answer transition ‑‑ the last prompt token together with the first two generated tokens ‑‑ in an early/mid layer band. There, persona and task contribute through partially orthogonal additive directions. Forming a pure persona effect Δ_X, a pure task effect Δ_Y, and substituting h_BB + Δ_X + Δ_Y for the clean residual yields downstream output within a small KL of clean on Gemma‑2‑2B‑IT and Qwen‑2.5‑\1.5B, 3B\‑Instruct, across a 12‑cell short grid and a 48‑cell long‑persona grid, with persona‑specific behavioral markers preserved. The natural inference from this additive structure is that the role prompt can be compressed into a single cached residual vector. \emphWe show it cannot. Injecting the cached additive prediction ‑‑ or even the oracle clean residual h_XY ‑‑ into a baseline host prompt with the persona text removed does not approach the clean long‑persona target, at one site or at many layers. Persona‑conditioned multi‑token generation flows through attention back to the persona‑text positions throughout the prompt, which no residual at one site reproduces. Local additivity in the residual stream does not imply prompt compressibility. The additive structure at the prompt‑to‑answer transition supports interpretability and fine‑grained steering of persona or task contributions; persona‑conditioned behavior across the full continuation depends on a distributed prompt/KV mechanism that local activation arithmetic does not displace.
Authors:Jianing Deng, Song Wang, Dongwei Wang, Zijie Liu, Tianlong Chen, Huanrui Yang, Jingtong Hu
Abstract:
Mixture‑of‑Experts Large Language Models (MoE‑LLMs) achieve strong performance but incur substantial memory overhead due to massive expert parameters. Mixed‑precision quantization mitigates this cost by allocating expert‑wise bit‑widths based on their importance, approaching the accuracy‑memory Pareto frontier and enabling extreme low‑bit quantization. However, existing methods rely on layer‑wise importance estimation and overlook router shifts induced by quantization, resulting in suboptimal allocation and routing. In this work, we propose Global Expert‑level Mixed‑precision Quantization (GEMQ) to overcome these limitations via (1) a global linear‑programming formulation that captures model‑wide expert importance based on quantization error analysis, and (2) efficient router fine‑tuning to adapt routing to quantized experts. These components are integrated into a progressive quantization framework that iteratively refines importance estimation and allocation. Experiments demonstrate that GEMQ significantly reduces memory and accelerates inference with minimal accuracy degradation. Source code is available at https://github.com/jndeng/GEMQ .
Authors:Xinjie He, Zhiyuan Lin, Su Liu, Jialun Wu, Qiyang Xie, Weikai Zhou, Shuai Xiao
Abstract:
Reinforcement learning (RL) has emerged as a viable recipe for training LLM agents to reason over external memory banks in multi‑session dialogue. Existing work trains exclusively on a single benchmark, leaving open how the composition of training data shapes the skills a memory agent acquires. We present a controlled empirical study that holds architecture, RL algorithm, and all hyperparameters fixed and varies only the training curriculum across three conditions: in‑domain (LoCoMo), mixed‑benchmark (LoCoMo + LongMemEval), and out‑of‑domain (LongMemEval only). Across two benchmarks and ten question types, curriculum composition acts as a fine‑grained lever on specialization rather than a uniform scaling factor on performance. The mixed curriculum yields the strongest overall F1 on both evaluation sets. Training on a narrow out‑of‑domain set transfers a targeted skill ‑ temporal reasoning ‑ despite weak aggregate performance. Per‑type differences substantially exceed aggregate differences, indicating that single‑number benchmark comparisons systematically underreport curriculum effects. We further report two practical lessons from adapting GRPO to a single‑GPU regime: cross‑benchmark mixing requires filtering format‑specific noise from memory banks to preserve training signal, and binary exact‑match reward produces no learning signal at the small group sizes (G = 4) required on one GPU, motivating continuous reward functions in this regime.
Authors:Maryia Zhyrko, Daisy Monika Lal, Erik van Mulligen, Lifeng Han
Abstract:
We present DreamerNLplus, a hybrid framework for modeling mental health dynamics from social media timelines in the CLPsych 2026 shared task. Our system addresses three tasks: psychological state modeling, temporal change detection, and sequence‑level summarization. For Task 1, we combine LLM‑based data augmentation, DeBERTa classification, and Random Forest regression for structured state prediction. For Task 2, we use few‑shot prompting with a locally deployed Llama 3.1 model to detect Switch and Escalation events using short‑term temporal context. For Task 3.1, we explore both a deterministic rule‑based summarization pipeline and a few‑shot LLM‑based approach, ranking 2nd officially. Our RAG‑based method achieves strong performance in Task 3.2, ranking 1st for Improvement and 3rd for Deterioration, demonstrating its ability to capture recurrent psychological change patterns across timelines. Our analysis reveals key challenges, including the mismatch between classification and regression performance, the difficulty of modeling temporal transitions, and the disagreement between semantic and similarity‑based evaluation metrics. These findings highlight the complexity of modeling mental health dynamics and motivate future work on unified evaluation frameworks. We share our code and prompts at https://github.com/4dpicture/CLPsych2026
Authors:George Mikros, Fotios Fitsilis
Abstract:
Katharevousa Greek remains poorly served by contemporary NLP pipelines despite its importance for legal, administrative, and parliamentary archives. We present a reproducible workflow for building and evaluating a Universal Dependencies‑style parsing resource for Katharevousa parliamentary questions from Greece's early post‑junta period. The pipeline links OCR‑aware reconstruction, schema‑constrained LLM‑assisted annotation, automatic validation, deterministic CoNLL‑U snapshotting, fixed‑split evaluation, and model‑family comparison. The frozen automatically validated reference set contains 1,697 sentences, split into 1,357 training sentences and 340 held‑out test sentences. We compare off‑the‑shelf Greek and Ancient Greek parsers, a feature‑based parser, mBERT, XLM‑R, and custom Stanza training under the same scoring protocol. Off‑the‑shelf systems show substantial register mismatch: the strongest external baseline, spaCy Greek, reaches 0.4183 LAS. The best structural parser, an XLM‑R model, reaches 0.8893 UPOS accuracy, 0.7250 dependency‑relation F1, 0.6098 UAS, and 0.5162 LAS, an absolute LAS gain of 0.0980 over the best external baseline. The feature‑based model remains competitive for UPOS and relation labeling, indicating that transparent lexical‑context features still matter at this data scale. Beyond scores, the paper contributes an auditable methodology for turning difficult historical parliamentary OCR into reusable syntactic NLP infrastructure. The entire pipeline ‑‑ code, schema, frozen reference annotations, fixed train/test split, and per‑model benchmark reports ‑‑ is released as an open‑access companion to this paper.
Authors:Shubham Parashar, Atharv Chagi, Jacob Helwig, Lakshmi Jotsna, Sushil Vemuri, James Caverlee, Dileep Kalathil, Shuiwang Ji
Abstract:
We aim to improve the reasoning capabilities of diffusion language models (DLMs). While SFT is a popular post‑training recipe for autoregressive models, its use in DLMs faces challenges and can even hurt performance, though the underlying causes remain understudied. Our analysis reveals that vanilla SFT overlooks learnability, namely what and when tokens are learned. Specifically, rare tokens are difficult to learn when most of the input is masked, whereas it is straightforward and thus of little value to learn common tokens when most of the input is unmasked. Motivated by our analysis, we propose LIFT, an efficient SFT‑based post‑training algorithm for DLMs. LIFT learns easy tokens when most of the input is masked and hard tokens when more context is available, thus aligning the training with the information available at different diffusion time steps. Our results show that LIFT outperforms existing SFT baselines across six reasoning benchmarks, achieving up to a 3x relative gain on AIME'24 and AIME'25. Our code is publicly available at https://github.com/divelab/LIFT.
Authors:Yingjie Lei
Abstract:
Personalized pricing negotiations are a challenging testbed for LLM agents because successful interaction does not guarantee profitable decision making. A seller may produce valid actions and close many deals while still pricing poorly when buyer willingness to pay and bargaining traits remain hidden. This paper presents PrefBench, a simulator‑based benchmark for hidden‑preference personalized pricing negotiations. Each episode pairs a simulated buyer with a fixed vehicle‑customization bundle; the seller observes public persona descriptors, bundle information, and negotiation history, while latent buyer variables govern valuation, patience, counter‑offer behavior, and walkaway decisions. PrefBench evaluates this setting through an LLM‑facing state‑summary protocol that constrains agents to return strict JSON actions under a fixed hidden‑information boundary. We evaluate zero‑shot LLM sellers against heuristic references over 7,500 episodes. The tested LLMs follow the protocol reliably and achieve deal rates above 0.99, but their seller‑profit outcomes remain weak: the best LLM average profit is only slightly above the random baseline and far below a simple concession heuristic under the same episode stream. These results show that structured action compliance and agreement‑seeking behavior can coexist with weak profit‑sensitive bargaining. PrefBench provides a controlled benchmark for evaluating pricing‑agent behavior under hidden buyer preferences.
Authors:Mirac Suzgun, Emily Shen, Federico Bianchi, Alexander Spangher, Thomas Icard, Daniel E. Ho, Dan Jurafsky, James Zou
Abstract:
AI chatbots are rapidly shaping how people encounter the news, yet no prior study has systematically measured how accurately these systems, with their proprietary search integrations and retrieval‑synthesis pipelines, handle emerging facts across languages and regions. We present a 14‑day (February 9‑22, 2026) evaluation of six AI chatbots (Gemini 3 Flash and Pro, Grok 4, Claude 4.5 Sonnet, GPT‑5 and GPT‑4o mini) on 2,100 factual questions derived from same‑day BBC News reporting across six regional services (US & Canada, Arabic, Afrique, Hindi, Russian, Turkish). The best systems achieve over 90% multiple‑choice accuracy on questions about events reported hours earlier. The same systems, however, lose 11‑13% under free‑response evaluation, and 16‑17% across the cohort. We further characterize three failure patterns. First, every model achieves its lowest accuracy on Hindi (79% vs. 89‑91% elsewhere) and citations indicate an Anglophone retrieval bias (e.g., models answering Hindi queries cite English Wikipedia more than any Hindi outlet). Second, retrieval, not reasoning, failures drive over 70% of all errors. When models retrieve a correct source, they often extract the correct answer; the problem is to land on the right source in the first place. Third, models achieving 88‑96% accuracy on well‑formed questions drop to 19‑70% when questions contain subtle false premises, with the most vulnerable model accepting fabricated facts 64% of the time. We also identify a detection‑accuracy paradox: the best false‑premise detector ranks second in adversarial accuracy (abstention rate), while a weaker detector ranks first, showing that premise detection and answer recovery are partially independent capabilities. Overall, these suggest that high accuracy can mask systematic regional inequity, near‑total dependence on retrieval infrastructure, and vulnerability to imperfect queries real users pose.
Authors:Pilchen Hippolyte, Fabre Romain, Signe Talla Franck, Perez Patrick, Grave Edouard
Abstract:
Large language models (LLMs) are typically trained on shuffled corpora, yielding models whose knowledge is frozen at train time and whose temporal grounding remains poorly understood. In this work, we study the impact of pre‑training dynamics on the acquisition of time‑sensitive factual knowledge, focusing specifically on data ordering. Our main contributions are twofold. First, we introduce a comprehensive benchmark of over 7,000 temporally grounded questions and an evaluation protocol that enables analysis of whether models correctly associate facts with their corresponding time periods. Second, we pretrain 6B‑parameter models on temporally ordered Common Crawl snapshots and compare them against standard shuffled pre‑training. Our results show that sequentially trained models match shuffled baselines on general language understanding and common knowledge while consistently exhibiting more up‑to‑date and temporally precise knowledge. Temporally ordered pre‑training yields improved factual freshness, while shuffled pre‑training peaks on older data, possibly due to increased factual repetition. These findings, along with the release of our code at https://github.com/kyutai‑labs/kairos , checkpoints, and datasets at https://huggingface.co/collections/kyutai/kairos provide a foundation for future research on continual learning for LLMs.
Authors:Sid-ali Temkit
Abstract:
Large language models are routinely used as automated evaluators: to review code, moderate content, or score outputs, often with many items passing through one conversation. We ask whether the polarity of prior conversation history biases subsequent judgments, an effect we call the accumulated message effect on LLM judgments (AMEL). Across 75,898 API calls to 11 models from 4 providers (OpenAI, Anthropic, Google, and four open‑source models), we present identical test items in isolation or following histories saturated with predominantly positive or negative evaluations. Models shift toward the conversation's prevailing polarity (d = ‑0.17, p < 10^‑46). The effect concentrates on items where the model is genuinely uncertain at baseline (d = ‑0.34 for high‑entropy items, vs d = ‑0.15 when the baseline is deterministic). Bias does not grow with context length: 5 prior turns and 50 produce the same shift (Spearman |r| < 0.01; OLS slope p = 0.80). And there is a negativity asymmetry: paired per item, negative histories induce 1.62x more bias than positive (t = 13.46, p < 10^‑39, n = 2,481). Scaling helps but does not solve it (Anthropic: Haiku ‑0.22 to Opus ‑0.17; OpenAI: Nano ‑0.34 to GPT‑5.2 ‑0.17). Three follow‑ups narrow the mechanism. The token probability distribution shifts continuously, not at a threshold. The negativity asymmetry has both token‑level and semantic components, though attributing the balance is exploratory at our sample sizes. Position does not matter: five biased turns anywhere in a 50‑turn history produce the same shift. The simplest fix for evaluation pipelines is a fresh context per item; when batching is unavoidable, balancing the history helps.
Authors:Víctor Yeste, Paolo Rosso
Abstract:
Detecting Schwartz values in political text is difficult because implicit cues often depend on surrounding arguments and fine‑grained distinctions between neighboring values. We study when context and explicit moral knowledge help sentence‑level value detection. Using the ValuesML/Touché ValueEval format, we compare sentence, window, and full‑document inputs; no‑RAG and retrieval‑augmented settings with a curated moral knowledge base; supervised DeBERTa‑v3‑base/large encoders; and zero‑shot LLMs from 12B to 123B parameters. The results show that more context is not uniformly better: full‑document context improves supervised DeBERTa encoders by 3.8‑4.8 macro‑F1 points over sentence‑only input, but does not consistently help zero‑shot LLMs. Retrieved moral knowledge is more consistently useful in matched comparisons, improving each tested model family and context condition under early fusion. However, scaling from DeBERTa‑v3‑base to large and from 12B to larger LLMs does not guarantee gains, and simple early fusion outperforms the tested late‑fusion and cross‑attention RAG variants for encoders. Per‑value analyses show that context and retrieval help most for socially situated or conceptually confusable values. These findings suggest that value‑sensitive NLP should evaluate context, knowledge, and model family jointly rather than treating longer inputs or larger models as universal improvements.
Authors:Erjian Zhang, Yatong Hao, Liejun Wang, Zhiqing Guo
Abstract:
While multi‑task learning based automatic radiology report generation (RRG) is widely adopted to ensure clinical consistency, most focus on architectural designs yet remain limited to coarse linear scalarization strategies. These strategies cannot effectively balance the hard constraints of discriminative clinical supervision with the smoothness requirements of report generation. To address these problems, we analyze the failure mechanism of linear scalarization from the perspective of gradient dynamics, utilizing the stochastic differential equation (SDE) framework to characterize it as a "Double Dilemma" of drift term deviation and diffusion term decay. Based on this, we propose a backbone‑agnostic optimizer named Conflict‑Averse Magnitude‑Enhanced Gradient Descent (CAME‑Grad). Through conflict‑averse direction rectification and magnitude‑enhanced energy injection, the algorithm not only ensures geometric validity, but also avoids local optimal solutions. Then, the adaptive gradient fusion mechanism is used to establish a dynamic balance between the theoretical optimal direction and the task‑specific inductive bias. Experiments show that as a universal plug‑and‑play optimizer, CAME‑Grad brings substantial and consistent improvements across eight diverse RRG methods, elevating overall clinical efficacy performance by an average of 2.3% on MIMIC‑CXR and 1.9% on IU X‑Ray. Our code is available at https://github.com/vpsg‑research/CAME‑Grad.
Authors:Shuaiqi Wang, Aadyaa Maddi, Zinan Lin, Giulia Fanti
Abstract:
Today, tool‑calling agents are commonly evaluated or tested on static datasets of execution traces, including input commands, agent responses, and associated tool calls. However, internal production datasets are often insufficient or unusable for testing; for example, they may contain sensitive or proprietary data, or they may be too sparse to support comprehensive testing (especially pre‑deployment). In these settings, practitioners are increasingly replacing or augmenting real datasets with synthetic ones for evaluation purposes. A key challenge is quantifying the relation between these synthetic datasets and the real data. We introduce SynAE, an evaluation framework for assessing how well synthetic benchmarks for multi‑turn, tool‑calling agents replicate and augment the characteristics of real data trajectories. SynAE assesses the validity, fidelity, and diversity of synthetic data across four metric categories: (i) task instructions and intermediate responses, (ii) tool calls, (iii) final outputs, and (iv) downstream evaluation. We evaluate SynAE using recent agent benchmarks and test common synthetic data failure modes via realistic and controlled generation schemes. SynAE detects fine‑grained variations in data validity, fidelity and diversity, and shows that no single metric is sufficient to fully characterize synthetic data quality, motivating a multi‑axis evaluation of synthetic data for agent testing. A demo of SynAE is available at https://synae‑2026‑synae‑demo.static.hf.space/index.html, with code at https://github.com/wsqwsq/SynAE.
Authors:Md. Asaduzzaman Shuvo, Mahedi Hasan, Md. Tashin Parvez, Azizul Haque Noman, Md. Shafayet Hossain Ovi
Abstract:
Recent advances in Multilingual Large Language Models (MLLMs) have significantly enhanced cross‑lingual conversational capabilities, yet modeling culturally nuanced and context‑dependent communication remains a critical bottleneck. Specifically, existing state‑of‑the‑art models exhibit a severe pragmatic gap when handling structural variations, regional idioms, and honorific consistencies in low‑resource contexts like Bangla. To address this limitation, we introduce a novel, culturally aligned instruction‑tuning dataset for BangLa Application and DialoguE generation ‑ BLADE and benchmarking framework comprising 4,196 meticulously curated interaction pairs. We leverage this resource to systematically fine‑tune and evaluate leading open‑weight architectures, including DeepSeek‑8B and LLaMA‑3.2‑3B, utilizing parameter‑efficient fine‑tuning via LoRA adapters in a 4‑bit NormalFloat (NF4) quantization framework. Our empirical evaluations demonstrate that models fine‑tuned on our dataset yield substantial improvements in structural fidelity and honorific alignment, providing a rigorous benchmark for bridging pragmatic disparities in low‑resource multilingual text generation. Code and dataset: https://github.com/ashuvo25/Bangla_Application_LLM/tree/main
Authors:Hanyu Guo, Jiedong Yang, Chao Chen, Longfei Xu, Kaikui Liu, Xiangxiang Chu
Abstract:
Public transit route planning traditionally depends on structured map infrastructure and complex routing engines, and no existing dataset supports training models to bypass this dependency. We present TransitLM, a large‑scale dataset of over 13 million transit route planning records from four Chinese cities covering 120,845 stations and 13,666 lines, released as a continual pre‑training corpus and benchmark data for three evaluation tasks with complementary metrics. Experiments show that an LLM trained on TransitLM produces structurally valid routes at high accuracy and implicitly grounds arbitrary GPS coordinates to appropriate stations without any explicit mapping. These results demonstrate that transit route planning can be learned entirely from data, enabling end‑to‑end, map‑free route generation directly from origin‑destination information. The dataset and benchmark are available at https://huggingface.co/datasets/GD‑ML/TransitLM, with evaluation code at https://github.com/HotTricker/TransitLM.
Authors:Jinyang Wu, Guocheng Zhai, Ruihan Jin, Yuhao Shen, Zhengxi Lu, Fan Zhang, Haoran Luo, Zheng Lian, Zhengqi Wen, Jianhua Tao
Abstract:
The proliferation of large language models (LLMs) and modular skills has endowed autonomous agents with increasingly powerful capabilities. Existing frameworks typically rely on monolithic LLMs and fixed logic to interface with these skills. This gives rise to a critical bottleneck: different LLMs offer distinct advantages across diverse domains, yet current frameworks fail to exploit the complementary strengths of models and skills, thereby limiting their performance on downstream tasks. In this paper, we present Maestro (Multimodal Agent for Expert‑Skill Targeted Reinforced Orchestration), a Reinforcement Learning (RL)‑driven orchestration framework that reframes heterogeneous multimodal tasks as a sequential decision‑making process over a hierarchical model‑skill registry. Rather than consolidating all knowledge into a single model, Maestro trains a lightweight policy to dynamically compose ensembles of frozen expert models and a two‑tier skill library, deciding at each step whether to invoke an external expert, which model‑skill pair to select, and when to terminate. The policy is optimized via outcome‑based RL, requiring no step‑level supervision. We evaluate Maestro across ten representative multimodal benchmarks spanning mathematical reasoning, chart understanding, high‑resolution perception, and domain‑specific analysis. With only a 4B orchestrator, Maestro achieves an average accuracy of 70.1%, surpassing both GPT‑5 (69.3%) and Gemini‑2.5‑Pro (68.7%). Crucially, the learned coordination policy generalizes to unseen models and skills without retraining: augmenting the registry with out‑of‑domain experts yields a 59.5% average on four challenging benchmarks, outperforming all closed‑source baselines. Maestro further maintains high computational efficiency with low latency. The source code is available at https://github.com/jinyangwu/Maestro.
Authors:Chaogui Gou, Jiarui Liang
Abstract:
In recent years, large language models have shown substantial potential in psychological support tasks. However, existing psychological counseling data mostly rely on single‑turn question answering or short multi‑turn dialogues, making it difficult to characterize how college students' psychological distress accumulates, interacts, and gradually evolves over long periods within campus life events. To address this issue, this paper proposes Psy‑Chronicle, a structured data‑generation framework for synthesizing long‑horizon campus psychological counseling dialogues. We generate a semester‑spanning temporal stress event graph to model the chronological order and evolutionary dependencies among campus stress events. Through interactive simulation between a student agent and a counselor agent, together with a structured memory integration mechanism, Psy‑Chronicle generates long‑horizon dialogues with continuity across counseling sessions. Based on Psy‑Chronicle, we construct and open‑source CPCD, a Chinese long‑horizon dialogue dataset for college psychological counseling, containing 100 student profiles, 90,000 counseling dialogues. We further build CPCD‑Bench to evaluate models' long‑horizon campus counseling capabilities from three dimensions: session‑level response, long‑horizon memory recall, and temporal‑causal reasoning. Experimental results show that CPCD effectively improves session‑level response generation and long‑horizon memory recall for models with the same base architecture. Meanwhile, improvements in temporal‑causal reasoning remain limited, indicating that event‑chain organization and causal explanation are key challenges in long‑horizon psychological counseling modeling. The related code and data are available at: https://github.com/EdwinUSTB/Psy‑Chronicle
Authors:Mingkai Deng, Jinyu Hou, Lara Sá Neves, Varad Pimpalkhute, Taylor W. Killian, Zhengzhong Liu, Eric P. Xing
Abstract:
How should an agent decide when and how to plan? A dominant approach builds agents as reactive policies with adaptive computation (e.g., chain‑of‑thought), trained end‑to‑end expecting planning to emerge implicitly. Without control over the presence, structure, or horizon of planning, these systems dramatically increase reasoning length, yielding inefficient token use without reliable accuracy gains. We argue efficient agentic reasoning benefits from decomposing decision‑making into three systems: simulative reasoning (System II) grounding deliberation in future‑state prediction via a world model; self‑regulation (System III) deciding when and how deeply to plan via a learned configurator; and reactive execution (System I) handling fine‑grained action. Simulative reasoning provides unified planning across diverse tasks without per‑domain engineering, while self‑regulation ensures the planner is invoked only when needed. To test this, we develop SR^2AM (Self‑Regulated Simulative Reasoning Agentic LLM), realizing both as distinct stages within an LLM's chain‑of‑thought, with the LLM as world model. We explore two instantiations: recording decisions from a prompted multi‑module system (v0.1) and reconstructing structured plans from traces of pretrained reasoning LLMs (v1.0), trained via supervised then reinforcement learning (RL). Across math, science, tabular analysis, and web information seeking, v0.1‑8B and v1.0‑30B achieve Pass@1 competitive with 120‑355B and 685B‑1T parameter systems respectively, while v1.0‑30B uses 25.8‑95.3% fewer reasoning tokens than comparable agentic LLMs. RL increases average planning horizon by 22.8% while planning frequency grows only 2.0%, showing it learns to plan further ahead rather than more often. More broadly, learned self‑regulation instantiates a principle we expect to extend beyond planning to how agents govern their own learning and adaptation.
Authors:Mehrdad Saberi, Keivan Rezaei, Soheil Feizi
Abstract:
Large language models increasingly use external tools such as web search and document retrieval to solve information‑intensive tasks. However, multi‑hop tool use in complex tasks introduces substantial latency, since the model must repeatedly wait for tool observations before continuing. We study how to accelerate such trajectories without changing the final trajectory the model would have taken without acceleration, assuming access to faster but less reliable speculator tools. We develop a theoretical framework for lossless speculation in multi‑hop tool‑use settings, characterizing the optimal achievable latency gain. We propose SpecHop, a continuous speculation framework that maintains multiple speculative threads, verifies predicted observations asynchronously as target tool outputs arrive, commits correct branches, and rolls back incorrect ones. This preserves accuracy while reducing wall‑clock latency. We show that SpecHop can approach oracle latency gains with enough active threads. Empirically, on retrieval‑augmented multi‑hop tasks, SpecHop closely matches theoretical predictions and reduces latency by up to 40% in some settings. Code: https://github.com/mehrdadsaberi/spechop
Authors:Gonçalo Duarte, Miguel Couceiro, Marcos V. Treviso
Abstract:
Long‑context decoding is increasingly limited by KV‑cache memory traffic since each generated token attends over a cache whose size grows linearly with context length. Existing sparse decoding methods reduce this cost by selecting subsets of tokens or pages, but are designed for softmax attention, whose dense tails make any truncation discard nonzero probability mass. In contrast, α‑entmax produces exact zeros, turning sparse decoding from dense‑tail approximation into support recovery: if the selected candidates contain the entmax support, sparse decoding remains exact. While recent entmax kernels enable efficient training, they do not address the autoregressive decoding bottleneck, where dense inference still streams the full KV cache before sparsity is known. In this work, we introduce EntmaxKV, an entmax‑native sparse decoding framework that exploits sparsity before KV pages are loaded. EntmaxKV combines query‑aware page scoring, support‑aware candidate selection, and sparse entmax attention. We analyze truncation error through the dropped probability mass δ, showing that output error is controlled by δ and vanishes when the entmax support is recovered. We further introduce a Gaussian‑aware entmax selector that estimates the entmax threshold from lightweight page statistics, adapting the selected budget to the score distribution. Empirically, EntmaxKV drops less probability mass, retains more support tokens, and achieves lower output error than softmax‑based sparse decoding at matched KV budgets. On long‑context and language modeling benchmarks, it closely matches full‑cache entmax while using a small fraction of the KV cache, achieving up to 3.36× (softmax) and 5.43× (entmax) speedup over full attention baselines at 1M context length. Code available at: https://github.com/deep‑spin/entmaxkv.
Authors:Brandon Dent
Abstract:
Frontier language models are being deployed into clinical workflows faster than the infrastructure to evaluate them safely. Static medical‑QA benchmarks miss the failure modes that matter in emergency medicine: trajectory‑level safety collapse, tool misuse, and capitulation under sustained clinical pressure. We present HealthCraft, the first public reinforcement‑learning environment that rewards trajectory‑level safety under realistic emergency‑medicine conditions, adapted from Corecraft. It is built on a FHIR R4 world state with 14 entity types and 3,987 seed entities, exposes 24 MCP tools, and defines a dual‑layer rubric that zeroes reward whenever any safety‑critical criterion is violated. We release 195 tasks across six categories, graded against 2,255 binary criteria (515 safety‑critical); a post‑hoc 10‑task negative‑class slate extends this to 205 tasks and 2,337 criteria. V8 results on two frontier models show Claude Opus 4.6 at Pass@1 24.8% [21.5‑28.4] and GPT‑5.4 at 12.6% [10.2‑15.6], with safety‑failure rates of 27.5% and 34.0%. On multi‑step workflows ‑ the closest proxy to real emergency care ‑ performance collapses to near zero (Claude 1.0%, GPT‑5.4 0.0%) despite partial competence on individual steps. Six infrastructure bugs fixed between pilots v2 and v8 re‑ordered which model "looks stronger," evidence that infrastructure fidelity is part of the measurement. A deterministic LLM‑judge overlay bounds evaluator noise, and a 60‑run negative‑class smoke pilot shows the reward signal is not drop‑in training‑safe: restraint criteria pass at 0.929 prevalence, a gameability an eval harness can tolerate but a training reward cannot. We scaffold coupling to a Megatron+SGLang+GRPO loop per Corecraft Section 5.2 and leave training‑reward ablations as future work. Environment, tasks, rubrics, and harness are released under Apache 2.0.
Authors:Zhepei Wei, Xinyu Zhu, Wei-Lin Chen, Chengsong Huang, Jiaxin Huang, Yu Meng
Abstract:
Reinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for improving reasoning in large language models (LLMs), yet the underlying geometry of the resulting parameter trajectories remains underexplored. In this work, we demonstrate that RLVR weight trajectories are extremely low‑rank and highly predictable. Specifically, we find that the majority of downstream performance gains are captured by a rank‑1 approximation of the parameter deltas, where the magnitude of this projection evolves near‑linearly with training steps. Motivated by this, we propose a simple and compute‑efficient method RELEX (REinforcement Learning EXtrapolation), which estimates the rank‑1 subspace from a short observation window and extrapolates future checkpoints via linear regression, with no learned model required. Across three models (i.e., Qwen2.5‑Math‑1.5B, Qwen3‑4B‑Base, and Qwen3‑8B‑Base), RELEX produces checkpoints that match or exceed RLVR performance on both in‑domain and out‑of‑domain benchmarks, requiring as few as 15% steps of full RLVR training. Remarkably, RELEX is able to extrapolate far beyond the observation window at no training cost, predicting checkpoints up to 10‑20× beyond the observed prefix with continued improvement (e.g., observe only the first 50 steps and extrapolate to 1000 steps). Our ablation analysis confirms the minimalist sufficiency of RELEX: neither increasing the subspace rank nor employing non‑linear modeling yields further gains in extrapolation. Finally, we show that RELEX's success stems from a "denoising" effect: by projecting updates onto the rank‑1 subspace, the model discards stochastic optimization noise that would otherwise degrade performance during extrapolation. Our code is available at https://github.com/weizhepei/RELEX.
Authors:Lucheng Fu, Ye Yu, Yiyang Wang, Yiqiao Jin, Haibo Jin, B. Aditya Prakash, Haohan Wang
Abstract:
Large language models (LLMs) are highly sensitive to the prompts used to specify task objectives and behavioral constraints. Many recent prompt optimization methods iteratively rewrite prompts using LLM‑generated feedback, but the resulting prompts often become longer, accumulate narrow sample‑specific rules, and generalize poorly beyond the training distribution. We study this failure mode as prompt distributional overfitting and argue that it reflects a lack of representation control in discrete text‑space optimization. We formalize this view through representational inefficiency, a dual‑factor measure that decomposes prompt inefficiency into capacity cost and scope narrowness, attributing distributional prompt overfitting to their coupled growth during optimization. We propose TextReg, a regularization framework that realizes a soft‑penalty objective through regularized textual gradients, combining Dual‑Evidence Gradient Purification, Semantic Edit Regularization, and Regularization‑Guided Prompt Update. Across multiple reasoning benchmarks, TextReg substantially improves out‑of‑distribution (OOD) generalization, with accuracy gains of up to +11.8% over TextGrad and +16.5% over REVOLVE.
Authors:Jeonghun Baek, Atsuyuki Miyai, Shota Onohara, Hikaru Ikuta, Kiyoharu Aizawa
Abstract:
Manga is a culturally distinctive multimodal medium and one of the most influential forms of Japanese popular culture. As AI systems increasingly target manga understanding, OCR, and translation, Manga109 has become a foundational dataset for manga‑related AI research. However, the current Manga109 dataset contains transcription errors and coarse annotations, which do not align well with modern OCR and multimodal manga understanding tasks. In this work, we revisit the dialogue text annotations of Manga109 and identify five categories of annotation issues, including transcription errors, missing text regions, overlapping dialogue and onomatopoeia, and under‑segmented speech balloons. To address these issues, we combine OCR‑based issue detection and manual revision to construct Manga109‑v2026, revising approximately 29,000 dialogue annotations. Our revisions better align Manga109 with modern OCR and multimodal manga understanding systems while preserving expressive structures characteristic of manga.
Authors:Yongkang Liu, Zijing Wang, Mengjie Zhao, Ercong Nie, Mingyang Wang, Qian Li, Feiliang Ren, Shi Feng, Daling Wang, Hinrich Schütze
Abstract:
This work presents \textscChunkFT, a memory‑efficient fine‑tuning framework that reformulates full‑parameter fine‑tuning around a dynamically activated working set. \textscChunkFT enables gradient computation for arbitrary sub‑tensors without modifying the network architecture, providing an algorithmic foundation for optimizing arbitrary sub‑networks while avoiding standard dense gradient computation. We provide a theoretical convergence analysis of \textscChunkFT in the deterministic setting. Empirically, we apply \textscChunkFT to fine‑tune Llama 3‑8B and Llama 3‑70B using a single RTX 4090‑24GB GPU and 2× H800‑80GB GPUs, respectively. Full‑parameter fine‑tuning of a 7B model with a 1K input length requires only 13.72GB of GPU memory. The results demonstrate the effectiveness of \textscChunkFT in memory usage, running time, and optimization quality. Moreover, downstream evaluations on language understanding, mathematical reasoning, and MT‑Bench show that \textscChunkFT consistently outperforms existing memory‑efficient baselines. Notably, \textscChunkFT achieves performance comparable to, and in some cases exceeding, full‑parameter fine‑tuning. Our repository is on https://github.com/misonsky/chunk.
Authors:Yan Xia, Zhuangzhuang Pan, Amirrudin Kamsin, Chee Seng Chan
Abstract:
Aspect‑Term Sentiment Analysis (ATSA) in multi‑aspect sentences faces a fundamental tradeoff between efficiency and expressiveness. Existing models either re‑encode the sentence for each aspect or rely on static use of deep representations, leading to redundant computation and limited adaptivity. We argue that Transformer depth is a costly, queryable resource, and propose DABS, a single‑pass inference framework that encodes each sentence once to construct a reusable, depth‑ordered substrate. Each aspect then queries this shared representation to selectively read relevant tokens and abstraction levels, without re‑encoding. This decouples shared sentence encoding from lightweight, aspect‑conditioned readout. Experiments on four ATSA benchmarks show that DABS achieves competitive performance while reducing end‑to‑end computation by up to 60% in multi‑aspect settings (M >= 2). Further analyses indicate that adaptive depth querying is most beneficial for linguistically complex cases such as negation and contrast. Code is publicly available at https://github.com/panzhzh/acl‑dabs
Authors:Yaping Chai, Haoran Xie, Joe S. Qin
Abstract:
Implicit sentiment analysis is challenging because sentiment toward an aspect is often inferred from events rather than expressed through explicit opinion words. Existing models typically learn from the final polarity label, which provides limited guidance for reasoning about sentiment from the context. Motivated by cognitive appraisal theory, we propose an appraisal‑aware multi‑task learning (MTL) framework for implicit sentiment analysis that provides polarity prediction with two complementary auxiliary tasks: implicit sentiment detection and cognitive rationale generation. However, training several objectives with different targets and sharing a single backbone across tasks in MTL limits flexibility and can lead to task interference. To reduce interference among these related but distinct objectives, we adopt task‑level mixture‑of‑experts models in which all tasks share a common set of experts, and task identity controls the sparse combination of these experts. Our method builds on an encoder‑decoder architecture and replaces a subset of encoder and decoder blocks with these sparse mixtures. We use a task‑conditioned router to select sparse expert mixtures for each task, and a task‑separated routing objective to encourage different tasks to learn distinct expert‑selection patterns. Experimental results show that our model outperforms recently proposed approaches, with strong gains on the implicit sentiment subset. Our code is available at https://github.com/yaping166/TRMoE‑ISA.
Authors:Yefan Zhou, Yilun Zhou, Austin Xu, Soroush Vosoughi, Shafiq Joty, Jiang Gui
Abstract:
Generative verifiers have emerged as a promising paradigm for step‑wise verification, but their verification behavior is often poorly calibrated: they may be under‑critical and miss erroneous steps, or over‑critical and reject correct reasoning. We refer to this tendency to be overly lenient or overly critical as verifier strictness. In this work, we study whether verifier strictness can be controlled through hidden‑state intervention. We uncover a verification‑specific hidden‑state signal: in step‑wise verification, a verifier's tendency to accept or reject a solution step is encoded near the boundary of the corresponding verification paragraph. Exploiting this signal, we show that hidden‑state steering can directly modulate verifier strictness without fine‑tuning. However, uniform steering induces a trade‑off between error detection and correctness certification. To address this, we propose VerifySteer, which exploits latent correctness signals for sample‑level routing and selectively intervenes on paragraph boundaries. Experiments on ProcessBench and Hard2Verify show that VerifySteer outperforms prompt optimization and activation steering baselines, and is competitive with self‑consistency while requiring 4‑7x less inference compute. VerifySteer is also complementary to verification fine‑tuning, providing further gains on top of fine‑tuned verifiers. The code is available at https://github.com/YefanZhou/VerifySteer.
Authors:Junhao Ruan, Abudukeyumu Abudula, Bei Li, Yongjing Yin, Xinyu Liu, Kechen Jiao, Xin Chen, Jingang Wang, Xunliang Cai, Tong Xiao, Jingbo Zhu
Abstract:
Accurate evaluation of conversational retrieval is pivotal for advancing Retrieval‑Augmented Generation (RAG) systems. However, existing conversational retrieval benchmarks suffer from costly, sparse human annotation or rigid, unnatural automated heuristics. To address these challenges, we introduce MTR‑Suite, a unified framework for auditing, synthesizing, and benchmarking retrieval. It features: (1) MTR‑Eval, an LLM‑based auditor quantifying alignment gaps in previous benchmarks; (2) MTR‑Pipeline, a multi‑agent system using greedy traversal clustering to generate high‑fidelity dialogues at 1/400th human cost; and (3) MTR‑Bench, a rigorous general‑domain benchmark. MTR‑Bench mimics production‑style challenges (hard topic switching, verbosity), offering superior discriminative power. We make our code and data publicly available to facilitate future research at https://github.com/rangehow/mtr‑suite.
Authors:Duy Nguyen, Hanqi Xiao, Archiki Prasad, Zaid Khan, Anirban Das, Austin Zhang, Sambit Sahu, Hyunji Lee, Elias Stengel-Eskin, Mohit Bansal
Abstract:
Self‑distillation enables language models to learn on‑policy from their own trajectories by using the same model as both student and teacher, with the teacher being conditioned on privileged information unavailable to the student. Such information can come in different types or views, such as solutions, demonstrations, feedback, or final answers. This setup provides dense token‑level feedback without relying on a separate external model, but creates a fundamental asymmetry: the teacher may rely on view‑specific information that the student cannot access at inference time. Moreover, the best type of privileged information is often task‑dependent, making it difficult to choose a single teacher view. In this work, we address both these challenges jointly by introducing AVSD (Adaptive‑View Self‑Distillation), a novel method of self‑distillation with multiple privileged‑information views, which reconstructs token‑level supervision by separating stable cross‑view consensus from view‑specific residual signals. AVSD identifies the consensus signal shared across views, which provides a reliable update direction, and then selectively adds the view‑specific residual signal to adjust the update magnitude when it both aligns with the consensus direction and remains proportionate to the consensus signal. Experiments on math competition benchmarks (AIME24, AIME25, and HMMT25) show that AVSD consistently outperforms both single‑view self‑distillation baselines and GRPO, achieving average Avg@8 gains of 3.1% and 2.2% over the strongest baselines on Qwen3‑8B and Qwen3‑4B, respectively. Moreover, on code‑generation benchmarks (Codeforces, LiveCodeBench v6) using Qwen3‑8B, AVSD outperforms the single‑view self‑distillation baseline by 2.4% on average.
Authors:Sylvey Lin, Joe Menke, Shufan Ming, Dongin Nam, Neil Smalheiser, Halil Kilicoglu
Abstract:
Biomedical abstracts play a critical role in downstream NLP applications, such as information retrieval, biocuration, and biomedical knowledge discovery. However, a non‑trivial number of biomedical articles do not have abstracts, diminishing the utility of these articles for downstream tasks. We propose DPR‑BAG (Divide, Prompt, and Refine for Biomedical Abstract Generation), a training‑free, zero‑shot framework that generates coherent and factually grounded abstracts for biomedical articles with full text but no abstract. DPR‑BAG decomposes full‑text documents into structured rhetorical facets following the Background‑Objective‑Methods‑Results‑Conclusions (BOMRC) schema, performs parallel LLM‑based summarization for each facet, and applies a final refinement stage to restore global discourse coherence. On PMC‑MAD, a distribution‑aligned dataset of 46,309 biomedical articles, DPR‑BAG improves abstractive novelty over strong extractive and fine‑tuned baselines, while maintaining factual consistency. Our ablation study reveals a counterintuitive finding: increasing prompt complexity or explicitly injecting entity‑level guidance can degrade factual alignment, highlighting the importance of controlled prompting strategies. These findings underscore the potential of training‑free, structure‑aware frameworks for scalable biomedical abstract generation in low‑resource settings. Our data and code are available at https://huggingface.co/datasets/pmc‑mad/PMC‑MAD and https://github.com/ScienceNLP‑Lab/MultiTagger‑v2/tree/main/DPR‑BAG.
Authors:Zhaohui Zheng, Chenhang He, Shihao Wang, Yuxuan Li, Ming-Ming Cheng, Lei Zhang
Abstract:
Number prediction stands as a fundamental capability of large language models (LLMs) in mathematical problem‑solving and code generation. The widely adopted maximum likelihood estimation (MLE) for LLM training is not tailored to number prediction. Recently, penalty‑driven approaches, e.g., Number Token Loss and Discretized Distance Loss, introduce an inductive bias of numerical distance but induce over‑sharpened and over‑flattened digit distributions, respectively. In this paper, we make an in‑depth analysis on LLM numerical learning, and show that existing numerical learning methods conceptually follow a criterion‑distance formulation, where the criterion term represents optimization pattern and the distance term instills geometric prior. Consequently, we present Digit Entropy Loss (DEL) for auto‑regressive numerical learning, which reformulates the conventional unsupervised entropy optimization in three key designs: leveraging digit conditional probability and binary cross‑entropy to guide the entropy optimization into a supervised manner; deprecating the distance term to bypass the issue of numerical distance; and generalizing the integer‑based numerical learning to floating‑point number optimization, enabling more accurate number prediction. Our DEL formulation can incorporate integers, decimals, and decimal points, expanding the learning objective from a single digit to the floating‑point number domain. Experiments conducted on seven mathematical reasoning benchmarks with four representative LLMs, including CodeLlama, Mistral, DeepSeek, and Qwen‑2.5, demonstrate that DEL consistently outperforms its counterparts in both overall prediction accuracy and numerical distance. Source codes are at https://github.com/PolyU‑VCLab/DEL
Authors:Rana Muhammad Usman
Abstract:
I study whether emotionally framed evaluation follow‑ups change both the behavior and the calm‑relative internal representations of small, locally deployed language models. Our main benchmark uses Qwen 3.5 0.8B on four impossible‑constraint coding tasks and eight follow‑up framings: calm, pressure, urgency, approval, shame, curiosity, encouragement, and threat. In the 0.8B eight‑condition sweep (160 conversations), pressure produces the strongest shortcut markers (11/20 runs) and the clearest overfit pattern (3/20), while calm and curiosity preserve explicit honesty more often (7/20 and 6/20). For all seven non‑baseline conditions, the corresponding calm‑relative direction vectors peak at the final transformer layer. An exploratory PCA of the layer‑23 direction vectors reveals a dominant first component (59.5% explained variance) aligned with a hand‑labeled positive/negative split (cosine alignment 0.951); approval and urgency are nearly identical internally (cosine 0.957), whereas curiosity points away from urgency (‑0.252). In a separate calm‑vs.‑pressure rerun used for scale comparison, Qwen 3.5 2B shows higher honest rates under calm framing and directionally consistent activation steering on a small 4‑prompt A/B probe, whereas the 0.8B steering result reverses. I interpret these results as evidence for measurable prompt‑sensitive control directions in small open models, while stopping short of claiming intrinsic emotional states.
Authors:Guangzhi Xiong, Qiao Jin, Sanchit Sinha, Zhiyong Lu, Aidong Zhang
Abstract:
Large Vision Language Models (LVLMs) show promise in medical applications, but their inability to faithfully ground responses in visual evidence raises serious concerns about clinical trustworthiness. While visual attribution methods are widely used to explain LVLM predictions, whether these explanations actually reflect the visual evidence underlying the model's decision is largely unverified, since ground‑truth annotations for internal model reasoning are typically unavailable. We address this question for chest X‑ray (CXR) reasoning by developing a causal evaluation framework that retains only CXR‑VQA samples for which the expert‑annotated region is verified, via counterfactual editing, to be causally responsible for the model's prediction. Using this framework across 11 attribution methods, six open‑source LVLMs, and two output modes (direct answer and step‑by‑step reasoning), we find that existing attribution methods often fail to identify the evidence used by LVLMs. To address this failure, we propose MedFocus, a concept‑based attribution method that localizes clinically meaningful anatomical regions via unbalanced optimal transport and measures their causal effect on model outputs through targeted interventions. MedFocus produces spatial, concept‑level, and token‑level attributions and substantially outperforms prior methods, taking a step toward more trustworthy attribution for medical LVLMs. Our data and code are available at https://github.com/gzxiong/medfocus/.
Authors:Dachuan Shi, Hanlin Zhu, Xiangchi Yuan, Wanjia Zhao, Kejing Xia, Wen Xiao, Wenke Lee
Abstract:
Chain‑of‑thought (CoT) is a standard approach for eliciting reasoning capabilities from large language models (LLMs). However, the common CoT paradigm treats thinking as a prerequisite for answering, which can delay access to plausible answers and incur unnecessary token costs even when the model is able to identify an answer before extended thinking, a behavior known as performative reasoning. In this paper, we introduce CopT, a reformulated reasoning pipeline that reverses the usual order of thinking and answering. Instead of thinking before answering, CopT first elicits a draft answer and then invokes subsequent on‑policy thinking conditioned on its own draft answer for reflection and correction. To assess whether the draft answer should be trusted, CopT recasts continuous embeddings as inference‑time contrastive verifiers. Specifically, it contrasts the model's support for the same generated tokens under discrete‑token inputs and continuous‑embedding inputs, yielding a sequence‑level reverse KL estimator for answer reliability. Our analysis shows that under certain assumptions, the expected estimate equals the mutual information between the unresolved latent state and the emitted answer token, explaining why it captures answer‑relevant uncertainty rather than arbitrary uncertainty in the latent state. When the answer is deemed insufficiently reliable, CopT performs further on‑policy thinking, where a second KL estimator dynamically controls draft‑answer visibility, preserving useful partial information while reducing the risk of being misled by unreliable content. Across mathematics, coding, and agentic reasoning tasks, CopT improves peak accuracy by up to 23% and reduces token usage by up to 57% at comparable or higher accuracy, without any additional training. The code is available at https://github.com/sdc17/CopT.
Authors:Wenjie Tang, Minne Li, Sijie Huang, Liquan Xiao, Yuan Zhou
Abstract:
Reinforcement learning from verifiable rewards (RLVR) is a promising paradigm for improving large language model (LLM) agents on long‑horizon interactive tasks. However, in partially observable environments, incomplete observations cause agent beliefs to drift over time, while delayed rewards obscure the causal impact of intermediate decisions, exacerbating temporal credit assignment challenges. To address this, we propose ReBel (Reward Belief), a process‑level reinforcement learning algorithm that explicitly models structured belief states to summarize interaction history and guide subsequent policy learning. ReBel introduces belief‑consistency supervision, converting discrepancies between predicted beliefs and observed feedback into dense self‑supervised signals without requiring external step‑wise annotations or verifiers. It also employs belief‑aware grouping to compare trajectories under similar belief states, yielding more robust and lower‑variance advantage estimates. We evaluate ReBel on challenging long‑horizon benchmarks, including ALFWorld and WebShop. ReBel improves task success by up to 20.4 percentage points over the episode‑level baseline GRPO and increases sample efficiency by 2.1×. These results suggest that belief‑aware self‑supervision is a promising direction for reliable long‑horizon decision‑making under partial observability. Code is available at: https://github.com/Fateyetian/Rebel.git.
Authors:Ying-Jia Lin, Tzu-Chin Lo, Ping-Chien Li, Chi-Tung Cheng, Chien-Hung Liao, Hung-Yu Kao
Abstract:
Automatic report labeling facilitates the identification of clinical findings from unstructured text and enables large‑scale annotation for medical imaging research. Existing rule‑based labelers struggle with the diverse descriptions in clinical reports, while fine‑tuning pre‑trained language models (PLMs) requires large amounts of labeled data that are often unavailable in clinical settings. In this paper, we propose PromptRad, a knowledge‑enhanced multi‑label prompt‑tuning approach for radiology report labeling under low‑resource settings. PromptRad reformulates multi‑label classification as masked language modeling and incorporates synonyms from the UMLS Metathesaurus into a multi‑word verbalizer to enrich category representations. By fine‑tuning the PLM without additional classification layers, PromptRad requires substantially less labeled data than conventional fine‑tuning. Experiments on liver CT (computed tomography) reports show that PromptRad outperforms dictionary‑based and fine‑tuning baselines with only 32 labeled training examples, and achieves competitive performance with GPT‑4 despite using a much smaller model. Further analysis demonstrates that PromptRad captures complex negation patterns more effectively than existing methods, making it a promising solution for report labeling in data‑scarce clinical scenarios. Our code is available at https://github.com/ila‑lab/PromptRad.
Authors:Qinghe Ma, Zhen Zhao, Yiming Wu, Jian Zhang, Lei Bai, Yinghuan Shi
Abstract:
Tool‑augmented reasoning has emerged as a promising direction for enhancing the reasoning capabilities of multimodal large language models (MLLMs). However, existing studies mainly focus on enabling models to perform tool invocation, while neglecting the necessity of invoking tools. We argue that tool usage is not always beneficial, as redundant or inappropriate invocations largely increase reasoning overhead and even mislead model predictions. To address this issue, we introduce AutoTool, a model that adaptively decides whether to invoke tools according to the characteristics of each query. Within a reinforcement learning framework, we design an explicit dual‑mode reasoning strategy with mode‑specific reward functions to guide the model toward producing accurate responses. Moreover, to prevent premature bias toward a single reasoning mode, AutoTool jointly explores and balances tool‑assisted and text‑centric reasoning throughout training, and promotes free exploration in later stages. Extensive experiments demonstrate that AutoTool exhibits outstanding performance and high efficiency, yielding a 21.8% accuracy gain on V benchmark compared to the base model, and a 44.9% improvement in efficiency over existing tool‑augmented methods on POPE benchmark. Code is available at https://github.com/MQinghe/AutoTool.
Authors:Gueter Josmy Faure, Min-Hung Chen, Jia-Fong Yeh, Hung-Ting Su, Winston H. Hsu
Abstract:
Vision‑Language Models (VLMs) have demonstrated remarkable capabilities in general video understanding, yet they often struggle with the fine‑grained comprehension crucial for real‑world applications requiring nuanced interpretation of human actions and interactions. While some recent human‑centric benchmarks evaluate aspects of model behaviour such as fairness/ethics, emotion perception, and broader human‑centric metrics, they do not combine long‑form videos, very dense QA coverage, and frame‑level spatial/temporal grounding at scale. To bridge this gap, we introduce FineBench, a human‑centric video question answering (VQA) benchmark specifically designed to assess fine‑grained understanding. FineBench comprises 199,420 multiple‑choice QA pairs densely annotated across 64 long‑form videos (15 minutes each), focusing on detailed person movement, person interaction, and object manipulation, including compositional actions. Our extensive evaluation reveals that while proprietary models like GPT‑5 achieve respectable performance, current open‑source VLMs significantly underperform, struggling particularly with spatial reasoning in multi‑person scenes and distinguishing subtle differences in human movements and interactions. To address these identified weaknesses, we propose FineAgent, a modular framework that enhances VLMs by leveraging a Localizer and a Descriptor. Experiments show that FineAgent consistently improves the performance of various open VLMs on FineBench. FineBench provides a rigorous testbed for future research into fine‑grained human‑centric video understanding, while FineAgent offers a practical approach to enhance such reasoning in current VLMs. Project page and code at https://joslefaure.github.io/assets/html/finebench.html.
Authors:Zhifei Xie, Kaiyu Pang, Haobin Zhang, Deheng Ye, Xiaobin Hu, Shuicheng Yan, Chunyan Miao
Abstract:
Despite rapid advances in automatic speech recognition (ASR) and large audio‑language models, robust recognition in real‑world environments remains limited by an "acoustic robustness bottleneck": models often lose acoustic grounding and produce omissions or hallucinations under severe, compositional distortions. We propose Mega‑ASR, a unified ASR‑in‑the‑wild framework that combines scalable compound‑data construction with progressive acoustic‑to‑semantic optimization. We introduce Voices‑in‑the‑Wild‑2M, covering 7 classic acoustic phenomena and 54 physically plausible compound scenarios, and train Mega‑ASR with Acoustic‑to‑Semantic Progressive Supervised Fine‑Tuning and Dual‑Granularity WER‑Gated Policy Optimization. Extensive experiments demonstrate that Mega‑ASR achieves significant advantages over prior state‑of‑the‑art systems on adverse‑condition ASR benchmarks (45.69% vs. 54.01% on VOiCES R4‑B‑F, and 21.49% vs. 29.34% on NOIZEUS Sta‑0). On complex compositional acoustic scenarios, Mega‑ASR further delivers over 30% relative WER reduction against strong open‑ and closed‑source baselines, establishing a scalable paradigm for robust ASR in‑the‑wild.
Authors:Wen Shi, Zhe Wang, Huafei Huang, Qing Qing, Ziqi Xu, Qixin Zhang, Xikun Zhang, Renqiang Luo, Feng Xia
Abstract:
Graph Anomaly Detection (GAD) aims to identify atypical graph entities, such as nodes, edges, or substructures, that deviate significantly from the majority. While existing text‑rich approaches typically integrate structural context into the data representation pipeline using raw textual features, they often neglect the structural context of nodes. This limitation hinders their ability to detect sophisticated anomalies arising from inconsistencies between a node's inherent content and its topological role. To bridge this gap, we propose TERGAD (Structure‑aware Text‑enhanced Representations for Graph Anomaly Detection), A novel data augmentation framework that enriches structural semantics for GAD via the semantic reasoning capabilities of Large Language Models (LLMs). Specifically, TERGAD translates node‑level topological properties into descriptive natural language narratives, which are subsequently processed by an LLM to derive high‑level semantic embeddings. These embeddings are then adaptively fused with original node attributes through a gated dual‑branch autoencoder to jointly reconstruct both graph structure and node features. The anomaly score is computed based on the integrated reconstruction error, effectively capturing deviations in both observable attributes and LLM‑informed semantic expectations. Extensive experiments on six real‑world datasets demonstrate that TERGAD consistently outperforms state‑of‑the‑art baselines. Furthermore, our ablation studies validate the indispensable role of structural semantic guidance and the efficacy of the gated fusion mechanism. Code is available at https://github.com/Kantorakitty/TERGAD‑main.
Authors:Zunhai Su, Rui Yang, Chao Zhang, Yaxiu Liu, Yifan Zhang, Wei Wu, Jing Xiong, Dayou Du, Xialie Zhuang, Yulei Qian, Yuchen Xie, Yik-Chung Wu, Hongxia Yang, Ngai Wong
Abstract:
The rapid advancement toward long‑context reasoning and multi‑modal intelligence has made the memory footprint of the Key‑Value (KV) cache a dominant memory bottleneck for efficient deployment. While the established per‑channel quantization effectively accommodates intrinsic channel‑wise outliers in Key tensors, its efficacy diminishes under extreme compression. In this work, we revisit the inherent limitations of the per‑channel quantization paradigm from both empirical and theoretical perspectives. Our analysis identifies Token Norm Imbalance (TNI) as the primary bottleneck to quantization fidelity. We demonstrate that TNI systematically amplifies errors when shared quantization parameters are required to span token groups exhibiting substantial norm disparities. Instead of relying on intricate quantization pipelines (e.g., TurboQuant), we propose OScaR (Omni‑Scaled Canalized Rotation), an accurate and lightweight KV cache compression framework for X‑LLMs (i.e., text‑only, multi‑modal, and omni‑modal LLMs). Advancing the per‑channel paradigm, OScaR employs Canalized Rotation followed by Omni‑Token Scaling to mitigate TNI‑induced sequence‑dimensional variance both effectively and efficiently, further supported by our optimized system design and CUDA kernels. Extensive evaluations across X‑LLMs show that OScaR consistently outperforms existing methods and achieves near‑lossless performance under INT2 quantization, establishing it as a robust, low‑complexity, and universal framework that defines a new Pareto front. Compared with the BF16 FlashDecoding‑v2 baseline, our OScaR implementation achieves a notable up to 3.0x speedup in decoding, reduces memory footprint by 5.3x, and increases throughput by 4.1x. The code for OScaR is publicly available at https://github.com/ZunhaiSu/OScaR‑KV‑Quant.
Authors:Lakshya A Agrawal, Donghyun Lee, Shangyin Tan, Wenjie Ma, Karim Elmaaroufi, Rohit Sandadi, Sanjit A. Seshia, Koushik Sen, Dan Klein, Ion Stoica, Joseph E. Gonzalez, Omar Khattab, Alexandros G. Dimakis, Matei Zaharia
Abstract:
Can a single LLM‑based optimization system match specialized tools across fundamentally different domains? We show that when optimization problems are formulated as improving a text artifact evaluated by a scoring function, a single AI‑based optimization system‑supporting single‑task search, multi‑task search with cross‑problem transfer, and generalization to unseen inputs‑achieves state‑of‑the‑art results across six diverse tasks. Our system discovers agent architectures that nearly triple Gemini Flash's ARC‑AGI accuracy (32.5% to 89.5%), finds scheduling algorithms that cut cloud costs by 40%, generates CUDA kernels where 87% match or beat PyTorch, and outperforms AlphaEvolve's reported circle packing solution (n=26). Ablations across three domains reveal that actionable side information yields faster convergence and substantially higher final scores than score‑only feedback, and that multi‑task search outperforms independent optimization given equivalent per‑problem budget through cross‑task transfer, with benefits scaling with the number of related tasks. Together, we show for the first time that text optimization with LLM‑based search is a general‑purpose problem‑solving paradigm, unifying tasks traditionally requiring domain‑specific algorithms under a single framework. We open‑source optimize\_anything with support for multiple backends as part of the GEPA project at https://github.com/gepa‑ai/gepa .
Authors:Ming Zhang, Qiyuan Peng, Yinxi Wei, Yujiong Shen, Kexin Tan, Yuhui Wang, Zhenghao Xiang, Junjie Ye, Zhangyue Yin, Zhiheng Xi, Shihan Dou, Tao Gui, Maxm Pan, Ruizhi Yang, Qi Zhang, Xuanjing Huang
Abstract:
Evaluating large language models (LLMs) on natural‑language logical reasoning is essential because rule‑governed tasks require conclusions to follow strictly from stated premises. Many existing logical‑reasoning benchmarks are generated by templating natural‑language items from sampled formulas, provide only coarse or unaudited formal annotations, and are now quickly saturated by frontier reasoning models. We present LLMEval‑Logic, a Chinese logical reasoning benchmark built from realistic situational scenarios. Its pipeline forward‑authors and expert‑audits natural‑language items together with their reference formalizations, verifies annotated answers with Z3, constructs expert rubrics for natural‑to‑formal grading, and hardens selected items through a closed‑loop adversarial workflow. The benchmark is released in two paired subsets: a 246‑item Base subset shipped with 1,400 expert‑developed rubric atoms, and a 190‑item Hard subset with 938 multi‑step sub‑questions over closed model spaces. Evaluating 14 frontier LLMs on LLMEval‑Logic reveals substantial gaps in current models: the best model reaches only 37.5% Hard Item Accuracy, and even with reference symbols the highest joint Z3+Rubric formalization score among evaluated models reaches only 60.16%. Our benchmark is publicly available at https://github.com/llmeval/LLMEval‑Logic.
Authors:Daisuke Oba, Hiroki Furuta, Naoaki Okazaki
Abstract:
Discrete diffusion language models (DDLMs) generate text by iteratively denoising categorical token sequences, while recent drifting methods for continuous generators suggest that part of this sampling‑time correction can instead be absorbed into training through an anti‑symmetric fixed‑point objective. We study how to transfer this principle to DDLMs, where the main challenge is the interface with discrete text: hard token samples are non‑differentiable, and categorical predictions do not directly provide continuous samples to drift. We formulate TokenDrift, a drifting objective that lifts categorical predictions to soft‑token features, applies anti‑symmetric drifting in a frozen semantic space, and backpropagates the resulting stop‑gradient feature target to DDLM logits. In controlled continual‑training experiments with masked and uniform‑state diffusion backbones, TokenDrift improves fixed‑NFE generation quality over matched continuation baselines, reducing Gen.‑PPL at 4 NFEs by 89% on MDLM and 86% on DUO. These results suggest that drifting can provide a practical refinement objective for DDLMs.
Authors:Ahmed Heakl, Abdelrahman M. Shaker, Youssef Mohamed, Rania Elbadry, Omar Fetouh, Fahad Shahbaz Khan, Salman Khan
Abstract:
When a model produces a correct solution under reinforcement learning with verifiable rewards (RLVR), every token receives the same reward signal regardless of whether it was a decisive reasoning step or a grammatical filler. A natural fix is to condition the model on the correct answer as a teacher, identifying tokens it would have generated differently had it known the answer. Prior work shows this either corrupts training by leaking the answer into the gradient, or produces a weak signal that cannot distinguish decisive steps from filler, since both look equally surprising relative to the model's baseline. We propose Contrastive Evidence Policy Optimization (CEPO), which asks a sharper question at every token: not just "does the correct answer favor this token?" but "does the correct answer favor it while the wrong answer disfavors it?" A token satisfying both is a genuine reasoning step; one satisfying neither is filler. The wrong‑answer teacher is constructed from rejected rollouts already in the training batch, incurring no additional sampling cost. We prove CEPO inherits all structural safety guarantees of the prior state of the art while strictly sharpening credit at decisive tokens, with the improvement vanishing exactly at filler positions. Empirically, CEPO achieves 43.43% and 60.56% average accuracy across five multimodal mathematical reasoning benchmarks at 2B and 4B scale, respectively, versus 41.17% and 57.43% for GRPO under identical training budgets. Distribution‑matching self‑distillation methods (OPSD, SDPO) fall below the untrained baseline, empirically confirming the information leakage our theory predicts. Our code is available at https://github.com/ahmedheakl/CEPO.
Authors:Yiyang Gu, Junwei Yang, Junyu Luo, Ye Yuan, Bin Feng, Yingce Xia, Shufang Xie, Kaili Liu, Bohan Wu, Qi Shi, Haoran Li, Beier Xiao, Zhiping Xiao, Xiao Luo, Weizhi Zhang, Philip S. Yu, Zequn Liu, Ming Zhang
Abstract:
Large language models (LLMs) are increasingly applied to scientific research, yet existing evaluations often fail to reflect the fine‑grained capabilities required in practice. Most benchmarks are manually curated or domain‑generic, limiting scalability and alignment with real scientific use cases. In this paper, we propose a new framework named SciCustom to address the problem. It enables the custom construction of benchmarks from large‑scale scientific data to evaluate application‑specific scientific capabilities in LLMs. SciCustom first organizes scientific knowledge into ontology‑grounded knowledge units with controlled granularity and trains a tagger to map large‑scale data instances into this knowledge space. Given a custom requirement, relevant knowledge units are identified via voting‑based multi‑model consensus. These units enable relevance‑aware benchmark retrieval via binary search, followed by proxy subset selection and data‑grounded benchmark generation for efficient evaluation. Experiments in chemistry and healthcare demonstrate that SciCustom reveals fine‑grained differences in LLM scientific capabilities that standard benchmarks overlook, while requiring neither expert annotation nor synthetic question generation. This work provides a scalable and application‑aware foundation for benchmarking scientific capabilities in LLMs. The source code is available at https://github.com/yjwtheonly/SciCustom.
Authors:Joy Bose
Abstract:
We present IMLJD, an open dataset of 3,613 Indian court judgments covering matrimonial disputes under IPC Section 498A, the Protection of Women from Domestic Violence Act, and CrPC Section 482. The dataset covers the Supreme Court of India from 2000 to 2024 (1,474 cases) and the Karnataka High Court from 2018 to 2024 (2,139 cases), with structured outcome labels, metadata‑derived indicators, and a knowledge graph. We find that 57.6% of quashing petitions succeed at the Supreme Court level compared to 39.7% at the Karnataka High Court level. On a matched 2018 to 2024 period, the SC quash rate is 59.3%, widening the differential to 19.6 percentage points and confirming the finding is robust to temporal adjustment. The dataset, code, and knowledge graph are released openly at https://github.com/joyboseroy/imljd and https://huggingface.co/datasets/joyboseroy/imljd.
Authors:Jiaao Wu, Xian Zhang, Hanzhang Liu, Sophia Zhang, Fan Yang, Yinpeng Dong
Abstract:
Frontier AI models and multi‑agent systems have led to significant improvements in mathematical reasoning. However, for problems requiring extended, long‑horizon reasoning, existing systems continue to suffer from fundamental reliability issues: hallucination accumulation, memory fragmentation, and imbalanced reasoning‑tool trade‑offs. In this paper, we introduce STAR‑PólyaMath, a multi‑agent framework that systematically addresses these challenges through meta‑level supervision and structured Reasoner‑Verifier interaction. STAR‑PólyaMath is structured as an orchestrated state machine with nested challenge‑step‑replan loops, governed by a reasoning‑free Python orchestrator that separates control from inference and bounds error propagation through trace‑back and re‑planning. Our key innovation is a persistent Meta‑Strategist that maintains cross‑attempt memory and exercises meta‑level control by issuing high‑level strategic guidance or mandatory directives, so the system can escape unproductive loops rather than stagnate or over‑rely on tools. STAR‑PólyaMath achieves state‑of‑the‑art results on all eight top‑tier competition benchmarks: AIME 2025‑2026, MathArena Apex Shortlist, MathArena Apex 2025, Putnam 2025, IMO 2025, HMMT February 2026, and USAMO 2026. It obtains perfect scores on AIMEs, Putnam, and HMMT, and shows its largest margin on Apex 2025, scoring 93.75% compared with 80.21% by the strongest baseline GPT‑5.5. Ablation studies show that the gains arise from the framework's orchestration rather than from model‑level diversity since removing key components or substituting in mixed backbones consistently weakens performance. Code is available at https://github.com/Julius‑Woo/STAR‑PolyaMath.
Authors:Bing Wang, Rui Miao, Ximing Li, Chen Shen, Shaotian Yan, Changchun Li, Kaiyuan Liu, Xiaosong Yuan, Jieping Ye
Abstract:
The rapid spread of misinformation on social media platforms has become a formidable challenge. To mitigate its proliferation, Misinformation Detection (MD) has emerged as a critical research topic. Traditional MD approaches based on small models typically perform binary classification through a black‑box process. Recently, the rise of Large Language Models (LLMs) has enabled explainable MD, where models generate rationales that explain their decisions, thereby enhancing transparency. Existing explainable MD methods primarily focus on crafting sophisticated prompts to elicit rationales from off‑the‑shelf LLMs. In this work, we propose a pipeline to fine‑tune a dedicated LLM specifically for explainable MD. Our pipeline begins by collecting large‑scale fact‑checked articles, and then uses multiple strong LLMs to produce veracity predictions and rationales. To ensure high‑quality training data, we leverage a filtering strategy that selects only the correct instances for fine‑tuning. While this pipeline is intuitive and prevalent, our experiments reveal that naive filtering based solely on label correctness is insufficient in practice and suffers from two critical limitations: (1) Coarse‑grained labels cause insufficient rationales: Rationales filtered solely based on binary labels are insufficient to adequately support their decisions; (2) Over‑verification behavior causes unnecessary rationales: Stronger LLMs tend to exhibit over‑verification behavior, producing excessively verbose and unnecessary rationales. To address these issues, we introduce LONSREX, a novel data synthesis pipeline to Locate Necessary and Sufficient Rationales for Explainable MD. Specifically, we propose a metric that quantifies the contribution of each verification step to the final prediction, thereby evaluating its necessity and sufficiency. Experimental results demonstrate the effectiveness of LONSREX.
Authors:Thomas Vincent Howe, David Wingate
Abstract:
In the training data used by large language models (LLMs), the same latent concept is often presented in multiple distinct ways: the same facts appear in English and Swahili; many functions can be expressed in both Python and Haskell; we can express propositions in both formal and natural language. We show that LLMs can exhibit compartmentalization, where they fail to identify and share statistical strength between distinct presentations of unified concepts. In the worst case, LLMs simply learn parallel internal representations of each presentation of the concept, saturating model capacity with redundancies and decreasing sample efficiency with the number of such presentations. We also demonstrate that synthetic parallel data can fail to improve this despite being easily learned itself. Under this framework, we find that, for small models, early multilingual learning is nearly entirely compartmentalized. Finally, all interventions that we study exhibit a phase transition in which their effectiveness depends on the number of distinct presentations, suggesting that the language modeling objective may only inconsistently unify representations.
Authors:Yujie Lin, Chengyi Yang, Zhishang Xiang, Yiping Song, Jinsong Su
Abstract:
Large language models inevitably retain sensitive information, defined as inputs that may induce harmful generations, due to training on massive web corpora, raising concerns for privacy and safety. Existing machine unlearning methods primarily rely on retraining or aggressive fine‑tuning, which are either computationally expensive or prone to degrading related knowledge and overall model utility. In this work, we reformulate machine unlearning as a precise knowledge re‑mapping problem via model editing. We propose ZeroUnlearn, a few‑shot unlearning framework. It overwrites sensitive inputs by mapping them to a neutral target state and removing their original representations. ZeroUnlearn enforces representational orthogonality through a multiplicative parameter update with a closed‑form solution, enabling efficient and targeted unlearning. We further extend ZeroUnlearn to a gradient‑based variant for multi‑sample unlearning. Experiments demonstrate that our approach outperforms existing baselines while preserving general model utility. Our code is available at the github: https://github.com/XMUDeepLIT/ZeroUnlearn.
Authors:Chanuk Lee, Minki Kang, Sung Ju Hwang
Abstract:
Recent studies observe that reinforcement learning with verifiable rewards (RLVR) reliably improves pass@1 on reasoning tasks, yet often fails to yield comparable gains in pass@k, raising the question of whether RLVR genuinely enables large language models to acquire novel reasoning abilities or merely enhances the efficiency of sampling reasoning modes already present in the base model. Prior analyses largely support the latter view, attributing this limitation to structural properties of standard RLVR objectives that result in insufficient exploration pressure. In this work, we argue that a central structural constraint arises from reverse‑KL regularization, which stabilizes training but inherently anchors the policy to the reference distribution, thereby suppressing the emergence of alternative reasoning modes. However, we show that neither removing the KL term nor replacing it with forward‑KL provides a satisfactory solution, as both disrupt the efficiency‑coverage trade‑off by either inducing reward hacking or allocating probability mass to off‑target regions. To resolve this tension, we propose SAGE, a principled framework that enables controllable empirical support expansion by reshaping the reverse‑KL anchor distribution itself through a guide function q(x,y), achieving consistent improvements in both pass@1 and pass@k across challenging mathematical reasoning benchmarks. Our code is available at https://github.com/tally0818/SAGE.
Authors:Adil Amin
Abstract:
Leaderboards rank frontier models on independent axes but do not reveal whether capabilities reinforce or trade off across releases ‑‑ and at the frontier, this interaction is the more informative signal. We decompose paired SWE‑bench and GPQA Diamond scores into a population coupling trend and per‑release residual (h‑field) that diagnoses capability emphasis and identifies which measurement or stress test is most informative next. Across 34 models from 10 labs (2024‑‑2026), capabilities cooperate (r = +0.72, p < 10^‑6), but cooperation varies by lab and over time: DeepSeek reversed from reasoning‑rich to coding‑first (h: +11.2 \to ‑4.7, 15.9‑pp swing); Google maintains consistent reasoning emphasis; Anthropic oscillates between coding excursions and recovery. Cooperation is not static ‑‑ it cascades. Six open‑weight architectures confirm a second capability transition at 30‑‑72B, and SWE‑bench is now saturating while HLE and instruction‑following retain discriminatory spread ‑‑ signaling the next axis rotation. We provide a three‑level playbook (locate, diagnose, rotate), a per‑lab measurement‑priority table, and seven falsifiable predictions with timestamped criteria for the next 12 months of frontier releases. Per‑lab coupling slopes vary 5× (Google 1.15 vs. DeepSeek 0.23), quantifying how efficiently each recipe converts coding gains into reasoning. Five April 2026 releases confirm the diagnostic out of sample (r rises from +0.72 to +0.75). An interactive dashboard provides phase classification with actionable recommendations, h‑field diagnostics, per‑lab coupling trajectories, ODE‑based scaling predictions, benchmark rotation guidance, self‑steering demo, and live tracking of all seven predictions: https://zehenlabs.com/cape/.
Authors:Adil Amin
Abstract:
Scaling laws predict loss from compute but not how capabilities interact. We measure the coupling between reasoning and truthfulness across 63 base models from 16 families and find a regime change invisible to loss curves: below a family‑dependent critical scale N_c, capabilities anticorrelate; above it, they cooperate. N_c \approx 3.5B parameters [2.9B, 13.4B] (bootstrap 95% CI), but model size is not the only variable that determines phase. Architecture, data curation, and training recipe each shift N_c independently: curated training eliminated the coupling dip between Qwen generations (0.025 \to 0.830 at matched scale), Gemma‑4 at 4B achieves coupling 0.871, characteristic of 13B+ standard‑trained models, through distillation and architectural innovation, and Phi at 1B matches web‑trained coupling at 10B through data curation alone. Width normalization eliminates the anticorrelation across all tested families, supporting an output‑projection bottleneck. Internally, 38 of 40 models show zero competing attention heads. A sparse‑regression ODE cross‑predicts held‑out Llama‑2 at 5.6% error. The diagnostic requires no model internals ‑‑ only public benchmark scores across a model family. The cooperative regime extends to the frontier (r = +0.72, 34 models, 10 labs). Code, data, and an open‑source activation‑steering tool for any open‑weight model are released alongside an interactive dashboard that diagnoses any model's coupling phase, suggests concrete interventions (data curation, width, benchmark rotation), and provides ODE scaling predictions, frontier diagnostics, and eigenstructure analysis: https://zehenlabs.com/cape/.
Authors:Wanghan Xu, Yuhao Zhou, Hengyuan Zhao, Shuo Li, Dianzhi Yu, Zhenfei Yin, Yaowen Hu, Fengli Xu, Wanli Ouyang, Wenlong Zhang, Lei Bai
Abstract:
Large language models can fail in critic interaction not only by answering incorrectly, but also by abandoning an initially correct scientific solution after user criticism. This is especially risky in scientific reasoning, where user criticism can turn a valid answer into an incorrect one. We frame critic interaction as an inter‑turn correctness‑transition problem rather than a final‑answer accuracy problem, and identify three challenges: transition awareness, decoupling useful correction from harmful sycophancy, and scalable rollout. We propose ReCrit, a transition‑aware reinforcement learning framework that decomposes Initial‑to‑Critic behavior into four quadrants: Correction, Sycophancy, Robustness, and Boundary. ReCrit rewards correction and robustness, penalizes sycophancy, and treats persistent errors as weak boundary signals. To make interaction training practical, ReCrit further uses dynamic asynchronous rollout with tail‑adaptive completion to reduce rollout waiting. On three scientific reasoning benchmarks, ChemBench, TRQA, and EarthSE, ReCrit improves average Critic accuracy from 38.15 to 51.49 on Qwen3.5‑4B and from 45.40 to 55.59 on Qwen3.5‑9B. Ablations show that final‑answer rewards provide little interaction‑level gain, while transition‑aware rewards and quadrant weighting produce more distinguishable training signals and larger net Critic‑stage improvement. The code is available at https://github.com/black‑yt/ReCrit .
Authors:Varun Kotte
Abstract:
LLM cascades and model routing promise lower inference cost by sending easy queries to a small model and escalating hard ones to a large model, but most deployed routers use uncalibrated confidence scores and require per‑workload threshold tuning. We present UCCI, a calibration‑first router that maps token‑level margin uncertainty to a per‑query error probability via isotonic regression and selects the escalation threshold by constrained cost minimization. Under three explicit assumptions, threshold policies on the calibrated score are cost‑optimal, and isotonic calibration achieves O(n^‑1/3) sample complexity for expected calibration error (ECE). On a production named entity recognition workload of 75,000 queries served by 4B and 12B instruction‑tuned LLMs on H100 GPUs, UCCI cuts inference cost by 31% (95% CI: [27%, 35%]) at micro‑F1 = 0.91 while reducing ECE from 0.12 to 0.03. At the same operating point, UCCI beats entropy thresholding, split‑conformal routing, and a FrugalGPT‑style learned threshold. All cascade results use end‑to‑end routing on actual model outputs and measured H100 latency, not simulated routing from global accuracies or nominal API prices.
Authors:Xi Zhu, Ziqi Wang, Kai Mei, Wujiang Xu, Minghao Guo, Bangji Yang, Jiajun Fan, Dimitris N. Metaxas
Abstract:
Retrieval‑augmented generation (RAG) improves large language models (LLMs) by incorporating external evidence, but it also introduces knowledge conflicts when retrieved contextual knowledge (CK) and parametric knowledge (PK) disagree or are both unreliable. Existing approaches mainly coordinate which source to use, without explicitly asking whether each answer path is correct. We argue that faithful RAG requires LLM self‑awareness, namely the ability to recognize the limits of its own knowledge and reasoning. To ground this problem, we construct a model‑specific, ground‑truth‑aligned knowledge‑conflict benchmark by evaluating LLM backbones on PK‑only and CK‑conditioned answer paths over approximately 69K query‑context instances per backbone, drawn from five conflict‑QA datasets. We then introduce SABER, a Self‑Aware Belief Estimator for RAG that requires no LLM fine‑tuning. SABER combines a self‑prior with PK‑side and CK‑side conditional reasoning representations from multi‑trace inference, then estimates reliability beliefs with two lightweight predictors to drive a 4‑cell decision over trust PK, trust CK, trust either, or abstain. Across four LLM backbones, SABER improves end‑to‑end accuracy and conflict‑specific faithfulness over ten inference‑time and fine‑tuning baselines, with the largest gains on conflict‑heavy datasets. Under abstention, SABER's risk‑coverage curve Pareto‑dominates every prompt‑based abstainer, providing a tunable balance between coverage and answer risk. Our code is available at https://github.com/xizhu1022/SABER.
Authors:Taehee Kim, Seungbin Yang, Jihwan Kim, Jaegul Choo
Abstract:
Retrieving relevant tables from extensive databases for a given natural language query is essential for accurately answering questions in tasks such as text‑to‑SQL. Existing table retrieval approaches select a pre‑determined set of k tables with the highest similarity to the query. However, the number of required tables varies across queries and cannot be known in advance. Enforcing a fixed number of retrieved tables regardless of the query may either retrieve an undersized set, failing to obtain all necessary evidence, or retrieve an oversized pool, including irrelevant tables. To address this issue, we propose an adaptive table retrieval method that adjusts the number of tables retrieved according to the requirements of each query. Specifically, we utilize an adaptive thresholding mechanism to selectively retrieve tables and integrate a sliding‑window reranking algorithm to efficiently process a large table corpus. Extensive experiments on Spider, BIRD, and Spider 2.0 demonstrate that our method effectively addresses the limitations of the top‑k retrieval strategy, improving performance in retrieval and downstream tasks. Our code and data are available at https://github.com/sbY99/Adaptive‑Table‑Retrieval.
Authors:Qianhao Yuan, Jie Lou, Xing Yu, Hongyu Lin, Le Sun, Xianpei Han, Yaojie Lu
Abstract:
Multimodal Large Language Models (MLLMs) still struggle with fine‑grained visual understanding, where answers often depend on small but decisive evidence in the full image. We observe a regional‑to‑global perception gap: the same MLLM answers fine‑grained questions more accurately when conditioned on evidence‑centered crops than on the corresponding full images, suggesting that many failures stem from difficulty to focus on relevant evidence rather than insufficient local recognition ability. Motivated by this observation, we propose Vision‑OPD (Vision On‑Policy Distillation), a regional‑to‑global self‑distillation framework that transfers the model's own privileged regional perception to its full‑image policy. Vision‑OPD instantiates two conditional policies from the same MLLM: a crop‑conditioned teacher and a full‑image‑conditioned student. The student generates on‑policy rollouts, and Vision‑OPD minimizes token‑level divergence between the teacher and student next‑token distributions along these rollouts. This enables the model to internalize the benefit of visual zooming without external teacher models, ground‑truth labels, reward verifiers, or inference‑time tool use. Experiments on multiple fine‑grained visual understanding benchmarks show that Vision‑OPD models achieve competitive or superior performance against much larger open‑source, closed‑source, and "Thinking‑with‑Images" agentic models.
Authors:Hyunji Lee, Justin Chih-Yao Chen, Joykirat Singh, Zaid Khan, Elias Stengel-Eskin, Mohit Bansal
Abstract:
Real‑world agents operate over long and evolving horizons, where information is repeatedly updated and may interfere across memories, requiring accurate recall and aggregated reasoning over multiple pieces of information. However, existing benchmarks focus on static, independent recall and fail to capture these dynamic interactions between evolving memories. In this paper, we study how current memory‑augmented agents perform in realistic, interference‑heavy, long‑horizon settings across diverse domains and question types. We introduce MINTEval (Long‑Horizon Memory under INTerference Evaluation), a benchmark featuring (1) long, highly interconnected contexts with frequently updated information that induces substantial interference, (2) diverse domains (state tracking, multi‑turn dialogue, Wikipedia revisions, and GitHub commits), enabling evaluation of domain generalization, and (3) diverse question types that assess robustness to interference, including (i) single‑target recall tasks requiring retrieval of a specific target from long contexts, and (ii) multi‑target aggregation tasks requiring reasoning over multiple relevant pieces of information. Overall, MINTEval has 15.6k question‑answering pairs over long‑horizon contexts averaging 138.8k tokens and extending up to 1.8M tokens per instance. We evaluate 7 representative systems, including vanilla long‑context LLMs, RAG, and memory‑augmented agent frameworks. Across all systems, we observe consistently low performance (avg. 27.9% accuracy), especially on questions requiring aggregated reasoning over multiple pieces of evidence. Our analysis shows that performance is primarily limited by retrieval and memory construction. Furthermore, current memory systems struggle to recall and reason over earlier facts that are revised or interfered with by subsequent context, with accuracy degrading as the number of intervening updates increases.
Authors:Md Gulzar Hussain, Babe Sultana, Md Rinku Ali
Abstract:
Nowadays, people in Bangladesh frequently rely on the internet and social media for daily news instead of traditional newspapers. However, the spread of false Bangla news through these platforms poses risks and challenges to the credibility of authentic media. Although several studies have been conducted on detecting Bangla fake news, there is still significant room for improvement in this area. To assist people, this research explores the effectiveness of feature selection approaches in identifying appropriate features, such as semantic, statistical, and character‑level features, or their combinations, on the BanFakeNews‑2.0 dataset for detecting Bangla fake news using a CNN model. In this paper, key findings reveal that combining multiple features significantly improves recall and F1‑scores compared to using individual features alone. The code for this research can be availed here, https://github.com/gulzar09/Bn\_FNews\_H.Feature.
Authors:Zhiyin Tan, Changxu Duan
Abstract:
Multilingual NLP often relies on dataset counts from centralized catalogues to characterize which languages are resource‑rich or resource‑poor. However, these catalogues record only one layer of dataset visibility: what has been registered or institutionally distributed. They do not necessarily reflect which datasets are created, cited, or reused in the research literature. To examine this gap, we combine a catalogue‑based baseline with literature‑backed evidence of dataset circulation. We introduce the Resource Density Index (RDI), defined as the number of catalogued datasets per one million speakers, and compute it for the 200 most widely spoken languages in Ethnologue. Among them, 118 languages (59%) have an average RDI of zero across the LRE Map and the Linguistic Data Consortium (LDC), and another 23 fall below 0.1, corresponding to at most one catalogued dataset per ten million speakers. We then apply an LLM‑assisted citation‑mining pipeline over the Semantic Scholar corpus to these 141 low‑visibility languages. After manual validation and consolidation, we identify 609 unique datasets across 53 languages, of which 356 remain openly accessible through working public links. These results reveal a substantial visibility gap: many large‑speaker languages appear data‑poor in catalogue records yet show clear evidence of dataset activity in the research literature. Our findings suggest that multilingual data scarcity should be understood not only as a production problem, but also as a question of documentation, discoverability, and long‑term accessibility. Code and data are publicly available at (https://github.com/zhiyintan/dataset‑visibility‑asymmetry).
Authors:Gunjan Balde, Soumyadeep Roy, Mainack Mondal, Niloy Ganguly
Abstract:
Large language models pretrained on general‑domain corpora often exhibit tokenization inefficiencies when applied to specialized domains. Although continual pretraining for domain adaptation partially alleviate performance degradation, it does not resolve the fundamental vocabulary mismatch. To address this gap, we introduce a targeted parameter‑efficient domain adaptation approach that combines vocabulary adaptation with pretraining for LLM‑based text summarization. Our unified framework augments pretrained tokenizers with domain‑specific tokens while selectively replacing under‑trained and unreachable tokens to limit parameter growth. We evaluate our approach on Llama‑3.1‑8B and Qwen2.5‑7B across legal and medical summarization tasks on a challenge‑oriented evaluation protocol focused on expert‑driven text and summaries which typically has higher concentration of over‑fragmented Out‑of‑Vocabulary (OOV) words. The vocabulary adaptation algorithm enhances the overall quality of the summarization model by improving semantic similarity between the generated summaries and their references. In addition, the adapted model produces summaries that incorporate more appropriate novel and domain‑specific words, leading to improved coherence, relevance, and faithfulness. We further observe that our proposed approach significantly reduce training time by 35‑55% over continual pretraining and reduce parameter counts up to 37% w.r.t expansion‑only methods. We make the codebase publicly available at https://github.com/gb‑kgp/VocabReplace‑Then‑Expand.
Authors:Yucong Huang, Xiucheng Li, Kaiqi Zhao, Jing Li
Abstract:
Standard RLHF relies on transitive scalar rewards, failing to capture the cyclic nature of human preferences. While some approaches like the General Preference Model (GPM) address this, we identify a theoretical limitation: their implicit formulation entangles hierarchy with cyclicity, failing to guarantee dominant solutions. To address this, we propose the Hybrid Reward‑Cyclic (HRC) model, which utilizes game‑theoretic decomposition to explicitly disentangle preferences into orthogonal transitive (scalar) and cyclic (vector) components. Complementing this, we introduce Dynamic Self‑Play Preference Optimization (DSPPO), which treats alignment as a time‑varying game to progressively guide the policy toward the Nash equilibrium. Synthetic data experiments further validate HRC's structural superiority in mixed transitive‑‑cyclic settings, where HRC converges faster and achieves higher accuracy than GPM. Experiments on RewardBench 2 demonstrate that HRC consistently improves over both BT and GPM baselines (e.g., +1.23% on Gemma‑2B‑it). In particular, its superior performance in the Ties domain empirically validates the model's robustness in handling complex, non‑strict preferences. Extensive downstream evaluations on AlpacaEval 2.0, Arena‑Hard‑v0.1, and MT‑Bench confirm the efficacy of our framework. Notably, when using Gemma‑2B‑it as the base preference model, HRC+DSPPO achieves a peak length‑controlled win‑rate of 44.75% on AlpacaEval 2.0 and 46.8% on Arena‑Hard‑v0.1, significantly outperforming SPPO baselines trained with BT or GPM. Our code is publicly available at https://github.com/lab‑klc/Hybrid‑Reward‑Cyclic.
Authors:Jon Saad-Falcon, Avanika Narayan, Robby Manihani, Tanvir Bhathal, Herumb Shandilya, Hakki Orhun Akengin, Gabriel Bo, Andrew Park, Matthew Hart, Caia Costello, Chuan Li, Christopher Ré, Azalia Mirhoseini
Abstract:
Personal AI stacks, like OpenClaw and Hermes Agent, are becoming central to daily work, yet they route nearly every query (often over sensitive local data) to cloud‑hosted frontier models. Replacing frontier models with local models inside existing stacks does not work: swapping Claude Opus 4.6 for Qwen3.5‑9B drops accuracy by 25‑39 pp across personal AI tasks like PinchBench and GAIA. Existing stacks bundle agentic prompts, tool descriptions, memory configuration, and runtime settings around a specific cloud model. Only the prompts can be tuned, and state‑of‑the‑art prompt optimizers close just 5 pp of the local‑cloud gap on their own. This motivates a decomposed personal AI stack: one that exposes individual primitives which can be optimized individually or jointly to close the local‑cloud gap. We present OpenJarvis, an architecture that represents a personal AI system as a typed spec over five primitives: Intelligence, Engine, Agents, Tools & Memory, and Learning. Each primitive is an independently editable field, making the stack end‑to‑end optimizable and measurable against accuracy, cost, and latency. Towards closing the local‑cloud gap without surrendering local‑model properties, OpenJarvis introduces LLM‑guided spec search, a local‑cloud collaboration in which frontier cloud models propose edits across the spec at search time, only non‑regressing edits are accepted, and the resulting spec runs entirely on‑device at inference time. With LLM‑guided spec search, on‑device specs match or exceed cloud accuracy on 4 of 8 benchmarks and land within 3.2 pp of the best cloud baseline on average. They also reduce marginal API cost by ~800x and end‑to‑end latency by 4x.
Authors:Robin-Nico Kampa, Fabian Deuser, Anna Bößendörfer, Konrad Habel, Norbert Oswald
Abstract:
Autonomous AI coding agents are becoming a core tool for ML practitioners in industry and research alike. Despite this growing adoption, no standardized benchmark exists to evaluate their ability to design, implement, and train models from scratch across diverse domains. We introduce 1GC‑7RC (Single Graphic Card: Seven Research Challenges), a benchmark comprising seven ML tasks spanning language modeling, image classification, semantic segmentation, graph learning, tabular prediction, time‑series forecasting, and text classification. Each task provides a locked data‑preparation and evaluation script together with a baseline training script; the agent may only modify the training code, has no access to pretrained weights (with one controlled exception for semantic segmentation), no internet access, and must complete each task within a task‑specific wall‑clock budget (40‑120 minutes) on a single GPU. We evaluate seven coding agents: five proprietary (Claude Code with Sonnet 4.6, Opus 4.6, and Opus 4.7; Codex CLI with GPT 5.5; and OpenCode with Qwen 3.6+) and two open‑source (OpenCode with Kimi K2.5, Kimi K2.6). Across 5 runs per agent‑task pair, we report substantial performance differences that reveal varying levels of implicit ML knowledge, planning ability, and time‑budget management. The benchmark, harness, and all evaluation artifacts are publicly available on GitHub at https://github.com/Strolchii/1GC‑7RC‑Benchmark to facilitate reproducible comparison of future agents. Because our benchmark design is modular, the benchmark can be extended to new tasks and domains, adapted to different GPU budgets, and used to study multi‑agent settings, making it a flexible platform for future research on autonomous research agents.
Authors:Masaru Yamada
Abstract:
We present Agentic AI Translate, an agentic translator prototype that operationalises the thesis of Yamada (forthcoming) ‑‑ that the metalanguage of Translation Studies has become an instruction code for generative AI. The system replaces the dominant text‑in / text‑out paradigm of machine translation with a four‑stage agentic cycle (Identify ‑> Prompt ‑> Generate ‑> Verify), preceded by an interactive specification phase in which the user composes ‑‑ through model‑assisted dialogue ‑‑ a structured translation brief grounded in skopos theory, register, audience, and genre conventions. The verification stage adopts the GEMBA‑MQM error‑span protocol (Kocmi & Federmann, 2023) for evidence‑grounded scoring, and document‑level coherence is preserved through a DelTA‑lite memory of proper nouns and a running bilingual summary, after Wang et al. (2025). We describe the philosophical motivation, the architectural commitments, the four reference‑material categories the system consumes, and the principal design tensions the architecture makes explicit. Empirical validation is left for future work; the contribution here is conceptual and architectural ‑‑ an executable embodiment of the position that translation in the GenAI era is communication design, not text conversion.
Authors:Fanqin Zeng, Feng Hong, Geng Yu, Huangjie Zheng, Xiaofeng Cao, Ya Zhang, Bo Han, Yanfeng Wang, Jiangchao Yao
Abstract:
Diffusion Large Language Models (DLLMs) promise fast parallel generation, yet open‑source DLLMs still face a severe quality‑speed trade‑off: accelerating decoding by revealing multiple tokens often causes substantial quality degradation. We attribute this dilemma to a train‑inference mismatch amplified by irreversible decoding. While training reconstructs tokens from randomly corrupted states, efficient inference requires an adaptive denoising order, where easier tokens are revealed earlier and context‑dependent ones are deferred. This view motivates two complementary methods: an inference‑time method that makes parallel decoding revokable, and a training‑time extension that distills the reliable order exposed by this revokable process. Accordingly, we first propose Wide‑In, Narrow‑Out (WINO), a training‑free decoding algorithm that enables revokable parallel generation. WINO aggressively drafts multiple tokens, verifies generated tokens with enriched global context, and re‑masks unreliable ones for later refinement. Building on this discovered order, we further introduce WINO+, which injects the verified denoising trajectories produced by WINO into model parameters, aligning training with efficient inference. Experiments on LLaDA and MMaDA show that WINO improves both quality and efficiency, while WINO+ further strengthens this progression. On GSM8K, WINO improves accuracy from 73.24% to 75.82% with a 6.10x step reduction, and WINO+ further achieves 76.58% with a 6.83x reduction. On Flickr30K, WINO+ reaches a 16.22x step reduction with improved CIDEr. These results demonstrate that DLLMs can serve as their own efficiency teachers by first discovering reliable denoising orders through revokable decoding and then learning to follow them for faster generation. Code is available at https://github.com/Feng‑Hong/WINO‑DLLM/tree/WINO‑plus.
Authors:Yuxuan Ye, Jun Han, Ao Hu, Juncheng Bu, Yiyi Chen, Liangjian Wen, Danilo Mandic, Danny Dongning Sun, Xu Yinghui, Zenglin Xu
Abstract:
End‑to‑end LLM trading agents have moved quickly from research curiosity to a small ecosystem of named systems, including FinCon, FinMem, TradingAgents, FinAgent, QuantAgent, and FLAG‑Trader. Several of these report headline Sharpe ratios that would be material if read at face value on a deployment desk, and associated benchmarks such as FinBen report trading‑task Sharpe statistics in the same range. The gap between architecture research and deployment claim has been crossed too freely on both sides of the academia‑‑industry divide. We take a position on that gap: reported alpha from end‑to‑end LLM trading agents should not be treated as deployment evidence. Before such returns can support claims of deployable trading capability, they must survive structural validity tests for temporal integrity, real‑world frictions, counterfactual robustness, predictive calibration, numerical execution, and multi‑agent disaggregation. Current public evidence cannot yet distinguish robust predictive ability from temporal contamination, unmodeled frictions, short‑window Sharpe uncertainty, narrative fitting, and parametric priors. The problem is not only evaluative but structural. Language confidence is not tradable probability, narrative reasoning is not numerical execution, and model priors may become undisclosed implicit factor exposures. We contribute a minimum reporting protocol suite, P1‑‑P6, with tiered applicability by claim strength, and a conservative modular alternative that uses LLMs as auditable information interfaces upstream of independent calibration, risk, and execution modules. Code and reproduction harness: \urlhttps://github.com/hj1650782738/Trading.
Authors:Anhao Zhao, Haoran Xin, Yingqi Fan, Junlong Tong, Wenjie Li, Xiaoyu Shen
Abstract:
Knowledge distillation is central to LLM post‑training, yet its design space remains poorly understood, especially alongside reinforcement learning (RL). We show that the prevailing paradigms, off‑policy distillation and on‑policy distillation (OPD), implicitly couple two orthogonal choices: prefix source and token‑level KL direction. This follows from decomposing sequence‑level KL over autoregressive response distributions: forward KL pairs teacher prefixes with token‑level forward KL, and reverse KL pairs student prefixes with token‑level reverse KL. We argue this coupling is not intrinsic: decoupling the two axes yields four valid objectives. We establish gradient‑level identities showing forward KL gives SFT‑style cross‑entropy matching with teacher soft targets, whereas reverse KL gives an RL‑style policy‑gradient objective with a dense teacher‑student log‑ratio reward, connecting them to off‑policy SFT, DAgger‑style on‑policy SFT, offline‑RL‑style distillation, and OPD. We conduct an extensive controlled study on math reasoning, evaluating the four objectives both as standalone methods and as initializations for subsequent RL. The results reveal three tradeoffs: KL direction induces an accuracy‑entropy tradeoff, prefix source a quality‑compute tradeoff, and training length an accuracy‑stability tradeoff. Motivated by these findings, we propose KL mixing and an entropy‑gated length curriculum. KL mixing shows long‑sequence distillation requires substantial forward‑KL weight to prevent entropy collapse and length inflation without sacrificing accuracy. The entropy‑gated length curriculum improves Avg@k and Pass@k by 3.6 and up to 5.8 points, and cuts average response length by roughly 3x versus fixed long‑horizon training. Our results provide a framework and practical methods for designing reasoning distillation objectives that balance accuracy, diversity, compute, and RL behavior.
Authors:Shuo Liu, Ding Liu, Shi-Ju Ran
Abstract:
Large language models (LLMs) generate not only reasoning text, but also token‑level confidence trajectories that record how uncertainty evolves during inference. Whether these trajectories are relevant to reasoning correctness remains unclear. Here we show that confidence trajectories encode a content‑agnostic confidence geometry associated with trace‑level final‑answer correctness. Using only token‑level confidence values, without access to the input question, reasoning text, hidden states, or external verifiers, we find that low‑dimensional representations of confidence trajectories separate correct from incorrect reasoning traces. Across GSM8K, MATH, and MMLU, this geometric separation is quantitatively linked to downstream predictability: stronger clustering of correct and incorrect traces, measured by the Davies‑‑Bouldin index, consistently corresponds to higher correctness‑discrimination AUC. We further show that correctness‑related information is enriched in the tail of reasoning, suggesting that late‑stage confidence dynamics carry key correctness signals. We propose NeuralConf, a lightweight estimator that learns from confidence trajectories for correctness evaluation. Under a fixed trace budget, NeuralConf‑derived scores improve confidence‑weighted answer aggregation over majority voting, tail confidence, and other static baselines. These results reveal that LLMs expose trace‑intrinsic statistical signals of correctness through their own confidence dynamics, offering a route to improve inference using information already present within generation.
Authors:Anay Kulkarni, ChiaEn Lu, Dheeraj Mekala, Jayanth Srinivasa, Gaowen Liu, Jingbo Shang
Abstract:
Tool use enables large language models to solve complex tasks through sequences of API calls, yet existing reinforcement learning approaches fail to scale to multi‑step composition settings. Outcome‑based rewards provide only sparse feedback, while trajectory‑supervised rewards depend on annotated reference solutions, penalizing valid alternatives and limiting scalability. We propose TIER: Trajectory‑Invariant Execution Rewards, a reward framework that derives supervision directly from function schemas and runtime execution, rather than from reference trajectories. The reward decomposes into format validity, schema adherence, execution success, and answer correctness, providing dense, interpretable sequence‑level feedback derived from fine‑grained verification of individual steps of tool use. This design allows any valid execution path to receive credit, naturally supporting multiple solution strategies and adapting to evolving tool interfaces. On DepthBench, a compositional benchmark stratified by depth (1 to 6 steps), TIER achieves >90% accuracy across steps, where trajectory‑supervised rewards collapse beyond step‑4. We further demonstrate consistent gains on benchmarks like BFCL v3 and NestFUL. Ablation studies confirm that all reward components are necessary, highlighting the importance of multi‑level supervision for compositional reasoning.
Authors:Yulin Chen, He He, Chen Zhao
Abstract:
Reinforcement Learning with Verifiable Reward (RLVR) has proven effective in improving Large Language Model's (LLM) reasoning ability. However, the learning dynamics of RLVR remain underexplored. In this paper, we reveal a counterintuitive phenomenon: among hard examples that the model initially struggles with, a substantial subset remains unlearnable even when correct rollouts are present. To understand the phenomenon, we first demonstrate that existing optimization and sampling techniques fail to resolve unlearnability. With cross‑example gradient analysis, we show that unlearnable examples have fundamental representation issue, characterized by low gradient similarity with the rest of the examples and ungeneralizable reasoning patterns. We further show that representation flaws are difficult to mitigate in RL, as data augmentation does not improve gradient similarity. Our study provides the first systematic characterization of unlearnable data in RLVR training and reveals fundamental limitations in current RL approaches for reasoning tasks. Code and data are available at \urlhttps://github.com/yulinchen99/unlearnability‑rlvr.
Authors:Zhitian Hou, Tianyong Hao, Nanli Zeng, Zhixiong Chao, Kun Zeng
Abstract:
Criminal Court View Generation (CVG) is a critical task in Legal Artificial Intelligence (Legal AI), involving the generation of court view based on case facts. In this work, we systematically explore the capabilities of lightweight (smaller than 2B) large language models (LLMs) in CVG and their impact on charge prediction. Our study addresses four key questions: (1) how does different architecture of LLMs affect the CVG quality and charge prediction. (2) how does LLMs size contribute to the performance, (3) how do lightweight LLMs compare with Deep Neural Networks (DNNs) in these tasks, and (4) how does predicting charge by court view generation first compare with predicting it directly. Additionally, we also develop CVGEvalKit, an evaluation framework including three public available datasets for CVG tasks, as well as predicting their charges. Comprehensive experiments are conducted on this framework, where models are trained on a mixed training set and evaluated on each dataset's test set. Experimental results provide new insights into the trade‑offs between model architecture, model size, and the influence between different tasks, highlighting the potential of lightweight LLMs in judicial AI applications. The source code is anonymously available at \urlhttps://github.com/ZhitianHou/CVGEvalKit
Authors:Haolin Chen, Deon Metelski, Leon Qi, Tao Xia, Joonyul Lee, Steve Brown, Kevin Riley, Frank Wang, T. Y. Alvin Liu, Hank Capps MD, Zeyu Tang, Xiangchen Song, Lingjing Kong, Fan Feng, Tianyi Zeng, Zhiwei Liu, Zixian Ma, Hang Jiang, Fangli Geng, Yuan Yuan, Chenyu You, Qingsong Wen, Hua Wei, Yanjie Fu, Yue Zhao, Carl Yang, Biwei Huang, Kun Zhang, Caiming Xiong, Sanmi Koyejo, Eric P. Xing, Philip S. Yu, Weiran Yao
Abstract:
End‑to‑end automation of realistic healthcare operations stresses three capabilities underrepresented in current benchmarks: policy density, decisions must be grounded in a large library of medical, insurance, and operational rules; Multi‑role composition: a single task requires the agent to play multiple roles with handoffs; and multilateral interaction: intermediate workflow steps are multi‑turn dialogs, such as peer‑to‑peer review and patient outreach. We introduce χ‑Bench, a benchmark of long‑horizon healthcare workflows across three domains: provider prior authorization, payer utilization management, and care management. Each task hands the agent a clinical case in a high‑fidelity simulator of 20 healthcare apps exposed via 87 MCP tools, which it must drive to a terminal status through tool calls and writing the role's artifacts, guided by a 1,290+ document managed‑care operations handbook skill. Across 30 agent harness/models configurations, the best agent resolves only 28.0% of tasks, no agent clears 20% on strict pass^3, and executing all tasks in a single session slumps the performance to 3.8%. These results raise the hypothesis that similar gaps are likely to surface in other policy‑dense, role‑composed, irreversible enterprise domains.
Authors:Nilesh Agrawal
Abstract:
Push notifications remain among the most direct channels through which digital platforms engage users, yet existing approaches have invested heavily in who to notify, when to notify, and what to recommend, while leaving how to communicate as the least‑optimized stage. This paper argues that message quality is an independent, underinvested lever, and that LLMs create their most differentiated value precisely at this layer. We make three contributions. First, we define notification message quality along six dimensions (contextual relevance, clarity, actionability, novelty handling, linguistic freshness, and persuasive appropriateness) and show how LLM‑based composition improves each relative to templates. Across reviewed deployments, reported improvements range from +8% to +14.5% CTR over static templates and +1% to +2.5% over mature slot‑filling systems, though these span heterogeneous systems and should not be treated as directly comparable. Second, we provide an architectural attribution analysis disentangling message generation from adjacent components (targeting, ranking, timing), arguing that observed gains are frequently misattributed to text generation alone. Third, we introduce a three‑criterion decision framework specifying when LLM generation is and is not the binding constraint. We support these arguments through a PRISMA‑guided survey (28 sources from 142 screened), examine domain‑specific applications across social media, food delivery, and e‑commerce, and propose a unified architectural framework with budget‑aware routing, grounded generation, candidate ranking, diversity controls, and online learning.
Authors:Xavier Theimer-Lienhard, Mushtaha El-Amin, Fay Elhassan, Sahaj Vaidya, Victor Cartier-Negadi, David Sasu, Lars Klein, Mary-Anne Hartley
Abstract:
Clinical decision support systems (CDSS) require scrutable, auditable pipelines that enable rigorous, reproducible validation. Yet current LLM‑based CDSS remain largely opaque. Most "open" models are open‑weight only, releasing parameters while withholding the data provenance, curation procedures, and generation pipelines that determine model behavior. Fully Open (FO) models, which expose the complete training stack end‑to‑end, do not currently exist in medicine. We introduce Fully Open Meditron, the first fully open pipeline for building LLM‑CDSS, comprising a clinician‑audited training corpus, a reproducible data construction and training framework, and a use‑aligned evaluation protocol. The corpus unifies eight public medical QA datasets into a normalized conversational format and expands coverage with three clinician‑vetted synthetic extensions: exam‑style QA, guideline‑grounded QA derived from 46,469 clinical practice guidelines, and clinical vignettes. The pipeline enforces system‑wide decontamination, gold‑label resampling of teacher generations, and end‑to‑end validation by a four‑physician panel. We evaluate using an LLM‑as‑a‑judge protocol over expert‑written clinical vignettes, calibrated against 204 human raters. We apply the recipe to five FO base models (Apertus‑70B/8B‑Instruct, OLMo‑2‑32B‑SFT, EuroLLM‑22B/9B‑Instruct). All MeditronFO variants are preferred over their bases. Apertus‑70B‑MeditronFO improves +6.6 points over its base (47.2% to 53.8%) on aggregate medical benchmarks, establishing a new FO SoTA. Gemma‑3‑27B‑MeditronFO is preferred over MedGemma in 58.6% of LLM‑as‑a‑judge comparisons and outperforms it on HealthBench (58% vs 55.9%). These results show that fully open pipelines can achieve state‑of‑the‑art domain‑specific performance without sacrificing auditability or reproducibility.
Authors:Sihan Fu, Oucheng Liu, Shiyuan Wang, Jin Shi, Chengkun Wei
Abstract:
Code agents increasingly help developers work with unfamiliar repositories, but every such task depends on a costly prerequisite: bootstrapping the repository into a usable development state. This process requires substantial trial‑and‑error exploration, yet the resulting knowledge‑‑resolved dependencies, repair strategies‑‑stays trapped in a single conversation, unavailable to future agents. We therefore formulate repository bootstrapping as a reusable startup knowledge problem and introduce BootstrapAgent, a multi‑agent framework that distills the heuristics discovered during bootstrap exploration into a persistent, verifiable, agent‑consumable .bootstrap contract. Through evidence extraction, structured planning, deterministic Docker‑based verification, and trace‑driven repair, BootstrapAgent generates a contract covering environment setup, diagnostic checks, minimal verification, and accumulated repair knowledge. We further propose warm repair with clean replay to accelerate iterative debugging without sacrificing cold‑start reproducibility, and a delta repair with sanity check to prevent reward hacking. Experiments on three benchmarks show that BootstrapAgent achieves a 92.9% success rate, outperforming the baseline by over 10% while reducing downstream agent token usage by 25.9% and build time by 22.3%. Our code is available at https://github.com/Vossera/BootstrapAgent.
Authors:Wentao Qiu, Haotian Hu, Fanyi Wang, Jinwei Kong, Yu Zhang
Abstract:
Large language model (LLM) agents require long‑term memory to leverage information from past interactions. However, existing memory systems often face a fidelity‑‑efficiency trade‑off: raw dialogue histories are expensive, while flat facts or summaries may discard the structure needed for precise recall. We propose DimMem, a lightweight dimensional memory framework that represents each memory as an atomic, typed, and self‑contained unit with explicit fields such as time, location, reason, purpose, and keywords. This representation exposes the structure needed for dimension‑aware retrieval, memory update, and selective assistant‑context recall without storing full histories in the model context. Across LoCoMo‑10 and LongMemEval‑S, DimMem achieves 81.43% and 78.20% overall accuracy, respectively, outperforming existing lightweight memory systems while reducing LoCoMo per‑query token cost by 24%. We further show that dimensional memory extraction is learnable by compact models: after fine‑tuning on the DimMem schema, a Qwen3‑4B extractor surpasses LightMem with GPT‑4.1‑mini on both benchmarks and reaches performance comparable to, or better than, much larger extractors in key settings. These results suggest that explicit dimensional structuring is an effective and efficient foundation for long‑term memory in LLM agents. Code is available at https://github.com/ChowRunFa/DimMem.
Authors:Chanuk Lee, Sangwoo Park, Minki Kang, Sung Ju Hwang
Abstract:
Reinforcement learning with verifiable rewards (RLVR) has emerged as a scalable paradigm for improving the reasoning capabilities of large language models. However, its effectiveness is fundamentally limited by exploration: the policy can only improve on trajectories it has already sampled. While increasing the number of rollouts alleviates this issue, such brute‑force scaling is computationally expensive, and existing approaches that modify the optimization objective provide limited control over what is explored. In this work, we propose NudgeRL, a framework for structured and diversity‑driven exploration in RLVR. Our approach introduces Strategy Nudging, which conditions each rollout on lightweight, strategy‑level contexts to induce diverse reasoning trajectories without relying on expensive oracle supervision. To effectively learn from such structured exploration, we further propose a unified objective, which decomposes the reward signal into inter‑ and intra‑context components and incorporates a distillation objective to transfer discovered behaviors back to the base policy. Empirically, NudgeRL outperforms standard GRPO with up to 8 times larger rollout budgets, while outperforming oracle‑guided RL baseline on average across five challenging math benchmarks. These results demonstrate that structured, context‑driven exploration can serve as an efficient and scalable alternative to both brute‑force rollout scaling and feasibility‑oriented methods based on privileged information. Our code is available at https://github.com/tally0818/NudgeRL.
Authors:Luxuan Chen, Han Tian, Xinran Chen, Rui Kong, Fang Wang, Jiamin Chen, Yuchen Li, Jiashu Zhao, Shuaiqiang Wang, Haoyi Xiong, Dawei Yin
Abstract:
The dynamic range of activations is a first‑order constraint for low‑bit quantization, activation scaling, and stable LLM inference. Prior work characterized outlier features and massive activations on pre‑2024 LLaMA‑style models, and the downstream activation‑quantization stack inherits that picture without revisiting it for the post‑LLaMA open‑model boom. We ask the deployment‑oriented question: how large can activations get in modern open LLMs, and how does this magnitude vary across families, generations, and training stages? Under a unified pipeline (5,000‑sample multi‑domain corpus, family‑specific tokenization, identical hooks across embeddings, hidden states, attention, MLP/MoE, SwiGLU gates, and final norm), we measure global and layerwise maxima on 27 checkpoints from 8 open families spanning dense, MoE, vision‑language, intermediate‑training, and instruction‑tuned variants. We find that (i) global maxima span over nearly four orders of magnitude at comparable parameter counts, with Qwen3.5 and MoE checkpoints in the 10^2 to 10^3 range and Gemma3‑27B‑it reaching ~7 x 10^5; (ii) cross‑family and cross‑generation comparisons break simple monotonic scaling; and (iii) MoE checkpoints exhibit 14.0‑23.4x lower peaks than matched‑scale dense counterparts, while the residual stream carries the global maximum in 22/24 checkpoints. A lightweight INT‑8 sanity check shows that measured maxima co‑vary with low‑bit reconstruction error via activation‑scale selection. We conclude that maximum activation magnitude is a model property tied to family, architecture, and training stage ‑ not a simple byproduct of size ‑ and should be measured and reported alongside any open‑weight release before low‑bit deployment. The code is publicly available at https://github.com/clx1415926/Max_act_llm.
Authors:Tianyu Huang, Yida Zhao, Chuyan Zhou, Kewei Tu
Abstract:
Augmenting Transformers with linguistic structures effectively enhances the syntactic generalization performance of language models. Previous work in this direction focuses on syntactic tree structures of languages, in particular constituency tree structures. We propose Graph‑Infused Layers Transformer Language Model (GiLT) which leverages dependency graphs for augmenting Transformer language models. Unlike most previous work, GiLT does not insert extra structural tokens in language modeling; instead, it injects structural information into language modeling by modulating attention weights in the Transformer with features extracted from the dependency graph that is incrementally constructed along with token prediction. In our experiments, GiLT with semantic dependency graphs achieves better syntactic generalization while maintaining competitive perplexity in comparison with Transformer language model baselines. In addition, GiLT can be finetuned from a pretrained language model to achieve improved downstream task performance. Our code is released at https://github.com/cookie‑pie‑oops/GiLT‑LM.
Authors:De Shuai Zhang
Abstract:
Continuous diffusion and flow models are attractive for non‑autoregressive text generation because they can update all positions in parallel. A major difficulty is the interface between continuous latent states and discrete tokens. This report studies a draft‑conditioned latent refinement model built from a frozen BERT encoder, a parallel decoder, a denoising DraftPrior, a local FlowNet, and a learned diagonal MetricNet. Early Gaussian‑start experiments showed that good latent‑space metrics, such as scale matching or cosine similarity, do not guarantee good decoding. Generated latents can be close to real encoder latents but still produce high‑entropy, biased, or repetitive token distributions. We therefore frame the task as controlled local refinement rather than full generation from noise. On ROCStories, using the first two sentences as prompt and the last three as target, full 768‑dimensional BERT latents recover tokens much better than compressed 256‑dimensional latents. With 768‑dimensional latents, DraftPrior target‑token probability is 0.938 for clean drafts, 0.613 for 3% token dropout, 0.483 for 5% dropout, and 0.272 for 10% dropout. Local flow refinement and fused decoder‑aware readout give modest additional gains, while metric learning and OT‑style alignment improve geometry but do not close the decoder gap. The main result is a diagnostic one: latent geometry alone is not enough. Continuous latent text generation should be evaluated by decoder recoverability, the quality of the start distribution, and whether refinement preserves decoder‑readable structure.
Authors:Junchao Wu, Yefeng Liu, Chenyu Zhu, Hao Zhang, Zeyu Wu, Tianqi Shi, Yichao Du, Longyue Wang, Weihua Luo, Jinsong Su, Derek F. Wong
Abstract:
The effective detection and governance of Large Language Model (LLM) generated content has become increasingly critical due to the growing risk of misuse. Despite the impressive performance of existing detectors, their reliability and potential in multilingual, real‑world scenarios remain largely underexplored. In this study, we introduce DetectRL‑X, a comprehensive multilingual benchmark designed to evaluate advanced detectors across 8 dimensions. The benchmark encompasses 8 languages commonly used in commercial contexts and collects human‑written texts from 6 domains highly susceptible to LLM misuse. To better aligned with real‑world applications, We create LLM‑generated texts using 4 popular commercial LLMs, and include typical AI‑assisted writing operations such as polishing, expanding, and condensing to capture authentic usage patterns. Furthermore, we develop a multilingual framework for paraphrasing and perturbation attacks to simulate diverse human modifications and writing noise, enabling stress testing of detectors across languages. Experimental results on DetectRL‑X reveal the strengths and limitations of current state‑of‑the‑art detectors when applied to diverse linguistic resources. We further analyze how domains, generators, attack strategies, text length, and refinement operations influence performance in different languages, underscoring DetectRL‑X as an effective benchmark for strengthening multilingual and language‑specific detectors.
Authors:Shangjian Yin, Yu Fu, Yue Dong, Zhouxing Shi
Abstract:
Post‑training has become a crucial step for unlocking the capabilities of large language models, with reinforcement learning (RL) emerging as a critical paradigm. Recent RL‑based post‑training has increasingly split into two paradigms: reinforcement learning from human feedback (RLHF), which optimizes models using human preference signals in target domains, and reinforcement learning from verifiable rewards (RLVR), which operates in verifier‑backed environments. The latter has dominated recent reasoning‑oriented post‑training because it delivers stronger gains and higher efficiency on domain‑specific tasks (e.g., reasoning). However, although in‑domain RL training achieves promising performance, it still requires a substantial amount of GPU compute, which remains a major barrier to broad adoption. In this work, we study the generalization ability of RLHF learned from scratch from a small set of interactions in open‑ended environments, and investigate whether the conversational abilities it explicitly acquires can implicitly transfer to downstream tasks such as mathematical reasoning and code generation, namely GRLO. Specifically, on Qwen3‑4B‑Base backbone, GRLO improves the average performance across all domains from 24.1 to 63.1 with only 5K prompts and 22.7 GPU hours, requiring about 46× less data and 68× less compute than a strong in‑domain RLVR baseline. The resulting model is even competitive with Qwen's released post‑trained models which required a much larger training cost. Notably, a subsequent in‑domain RLVR stage brings only selective gains, mainly on harder competition‑math benchmarks. We hope GRLO offers a simple and efficient recipe for building broadly capable post‑trained models. Our code and data will be available at: \hrefhttps://github.com/SJY8460/GRLOhttps://github.com/SJY8460/GRLO.
Authors:Ziyu Guo, Rain Liu, Xinyan Chen, Pheng-Ann Heng
Abstract:
Visual reasoning, often interleaved with intermediate visual states, has emerged as a promising direction in the field. A straightforward approach is to directly generate images via unified models during reasoning, but this is computationally expensive and architecturally non‑trivial. Recent alternatives include agentic reasoning through code or tool calls, and latent reasoning with learnable hidden embeddings. However, agentic methods incur context‑switching latency from external execution, while latent methods lack task generalization and are difficult to train with autoregressive parallelization. To combine their strengths while mitigating their limitations, we propose ATLAS, a framework in which a single discrete 'word', termed as a functional token, serves both as an agentic operation and a latent visual reasoning unit. Each functional token is associated with an internalized visual operation, yet requires no visual supervision and remains a standard token in the tokenizer vocabulary, which can be generated via next‑token prediction. This design avoids verbose intermediate visual content generation, while preserving compatibility with the vanilla scalable SFT and RL training, without architectural or methodological modifications. To further address the sparsity of functional tokens during RL, we introduce Latent‑Anchored GRPO (LA‑GRPO), which stabilizes the training by anchoring functional tokens with a statically weighted auxiliary objective, providing stronger gradient updates. Extensive experiments and analyses demonstrate that ATLAS achieves superior performance on challenging benchmarks while maintaining clear interpretability. We hope ATLAS offers a new paradigm inspiring future visual reasoning research.
Authors:Mukul Ranjan, Prince Jha, Khushboo Kumari, Zhiqiang Shen
Abstract:
Vision‑Language Models (VLMs) are increasingly applied to cultural heritage materials, from digital archives to educational platforms. This work identifies a fundamental issue in how these models interpret historical artifacts. We define this phenomenon as cultural anachronism, the tendency to misinterpret historical objects using temporally inappropriate concepts, materials, or cultural frameworks. To quantify this phenomenon, we introduce the Temporal Anachronism Benchmark for Vision‑Language Models (TAB‑VLM), a dataset of 600 questions across six categories, designed to evaluate temporal reasoning on 1,600 Indian cultural artifacts spanning prehistoric to modern periods. Systematic evaluations of ten state‑of‑the‑art models reveal significant deficiencies on our benchmark, and even the best model (GPT‑5.2) achieves only 58.7% overall accuracy. The performance gap persists across varying architectures and scales, suggesting that cultural anachronism represents a significant limitation in visual AI systems, regardless of model size. These findings highlight the disparity between current VLM capabilities and the requirements for accurately interpreting cultural heritage materials, particularly for non‑Western visual cultures underrepresented in training data. Our benchmark provides a foundation for enhancing temporal cognition in multimodal AI systems that interact with historical artifacts. The dataset and code are available in our project page.
Authors:Zihan Deng, Xiaozhen Zhong, Chuanzhi Xu
Abstract:
As large language models empower healthcare, intelligent clinical decision support has developed rapidly. Longitudinal electronic health records (EHR) provide essential temporal evidence for accurate clinical diagnosis and analysis. However, current large language models have critical flaws in longitudinal EHR reasoning. First, lacking fine‑grained statistical reasoning, they often hallucinate clinical trends and metrics when quantitative evidence is textually implied, biasing diagnostic inference. Second, non‑uniform time series and scarce labels in longitudinal EHR hinder models from capturing long‑range temporal dependencies, limiting reliable clinical reasoning. To address the above limitations, this work presents the Probabilistic Chain‑of‑Thought Completion Agent (COTCAgent), a hierarchical reasoning framework for longitudinal electronic health records. It consists of three core modules. The Temporal‑Statistics Adapter (TSA) converts analytical plans into executable code for standardized trend output. The Chain‑of‑Thought Completion (COTC) layer leverages a symptom‑trend‑disease knowledge base with weighted scoring to evaluate disease risk, while the bounded completion module acquires structured evidence through standardized inquiries and iterative scoring constraints to ensure rigorous reasoning. By decoupling statistical computation, feature matching, and language generation, the framework eliminates reliance on complex multi‑modal inputs and enables efficient longitudinal record analysis with lower computational overhead. Experimental results show that COTCAgent powered by Baichuan‑M2 achieves 90.47% Top‑1 accuracy on the self‑built dataset and 70.41% on HealthBench, outperforming existing medical agents and mainstream large language models. The code is available at https://github.com/FrankDengAI/COTCAgent/.
Authors:Shijie Lian, Bin Yu, Xiaopeng Lin, Zhaolong Shen, Laurence Tianruo Yang, Yurun Jin, Haishan Liu, Changti Wu, Hang Yuan, Cong Huang, Kai Chen
Abstract:
Robot imitation data are often multimodal: similar visual‑language observations may be followed by different action chunks because human demonstrators act with different short‑horizon intents, task phases, or recent context. Existing frame‑conditioned VLA policies infer each chunk from the current observation and instruction alone, so under partial observability they may resample different intents across adjacent replanning steps, leading to inter‑chunk conflict and unstable execution. We introduce IntentVLA, a history‑conditioned VLA framework that encodes recent visual observations into a compact short‑horizon intent representation and uses it to condition chunk generation. We further introduce AliasBench, a 12‑task ambiguity‑aware benchmark on RoboTwin2 with matched training data and evaluation environments that isolate short‑horizon observation aliasing. Across AliasBench, SimplerEnv, LIBERO, and RoboCasa, IntentVLA improves rollout stability and outperforms strong VLA baselines
Authors:Han Tian, Luxuan Chen, Xinran Chen, Rui Kong, Fang Wang, Jiamin Chen, Jinman Zhao, Yuchen Li, Jiashu Zhao, Shuaiqiang Wang, Haoyi Xiong, Dawei Yin
Abstract:
Extending the context window of large language models typically requires training on sequences at the target length, incurring quadratic memory and computational costs that make long‑context adaptation expensive and difficult to reproduce. We propose EndPrompt, a method that achieves effective context extension using only short training sequences. The core insight is that exposing a model to long‑range relative positional distances does not require constructing full‑length inputs: we preserve the original short context as an intact first segment and append a brief terminal prompt as a second segment, assigning it positional indices near the target context length. This two‑segment construction introduces both local and long‑range relative distances within a short physical sequence while maintaining the semantic continuity of the training text‑‑a property absent in chunk‑based simulation approaches that split contiguous context. We provide a theoretical analysis grounded in Rotary Position Embedding and the Bernstein inequality, showing that position interpolation induces a rigorous smoothness constraint over the attention function, with shared Transformer parameters further suppressing unstable extrapolation to unobserved intermediate distances. Applied to LLaMA‑family models extending the context window from 8K to 64K, EndPrompt achieves an average RULER score of 76.03 and the highest average on LongBench, surpassing LCEG (72.24), LongLoRA (72.95), and full‑length fine‑tuning (69.23) while requiring substantially less computation. These results demonstrate that long‑context generalization can be induced from sparse positional supervision, challenging the prevailing assumption that dense long‑sequence training is necessary for reliable context‑window extension. The code is available at https://github.com/clx1415926/EndPrompt.
Authors:Ali Hassaan Mughal, Noor Fatima, Muhammad Bilal
Abstract:
Context. Behaviour‑Driven Development (BDD) software test suites accumulate duplicated step subsequences. Three published refactoring patterns are available (within‑file Background, within‑repo reusable‑scenario invocation, cross‑organisational shared higher‑level step), but no prior work automates which recurring subsequences are worth extracting or which mechanism applies. Objective. Rank recurring step subsequences ("slices") by refactoring suitability (extraction‑worthy), pre‑map each to one of the three patterns, and quantify prevalence across the public BDD ecosystem. Method. Every contiguous L‑step window (L in [2, 18]) in a 339‑repository / 276‑upstream‑owner Gherkin corpus is keyed by paraphrase‑robust cluster identifiers and counted under three scopes. Sentence‑BERT (SBERT) / Uniform Manifold Approximation and Projection (UMAP) / Hierarchical Density‑Based Clustering (HDBSCAN) recovers paraphrase‑equivalent slices. Three authors label a stratified 200‑slice pool against a written rubric. An eXtreme Gradient Boosting (XGBoost) extraction‑worthy classifier trained under 5‑fold cross‑validation is compared with a tuned rule baseline and two open‑weight Large Language Model (LLM) judges. Results. The miner produces 5,382,249 slices collapsing to 692,020 recurring patterns. Three‑author Fleiss' kappa = 0.56 (extraction‑worthy) and 0.79 (mechanism). The classifier reaches out‑of‑fold F1 = 0.891 (95% CI [0.852, 0.927]), outperforming both the rule baseline (F1 = 0.836, p = 0.017) and the better LLM judge (F1 = 0.728, p < 1e‑4). 75.0%, 59.5%, and 11.7% of scenarios carry a within‑file Background, within‑repo reusable‑scenario, or cross‑organisational shared‑step candidate. Conclusion. Paraphrase‑robust subscenario discovery yields a corpus‑wide census of BDD refactoring opportunities; pipeline, classifier predictions, labelled pool, and rubric are released under Apache‑2.0.
Authors:Qazi Mamunur Rashid, Xuan Yang, Zhengzhe Yang, Yanzhou Pan, Erin van Liemt, Darlene Neal, Kshitij Pancholi, Jamila Smith-Loud
Abstract:
Recent advancements in generative AI facilitate large‑scale synthetic data generation for model evaluation. However, without targeted approaches, these datasets often lack the sociotechnical nuance required for sensitive domains. We introduce NodeSynth, an evidence‑grounded methodology that generates socially relevant synthetic queries by leveraging a fine‑tuned taxonomy generator (TaG) anchored in real‑world evidence. Evaluated against four mainstream LLMs (e.g., Claude 4.5 Haiku), NodeSynth elicited failure rates up to five times higher than human‑authored benchmarks. Ablation studies confirm that our granular taxonomic expansion significantly drives these failure rates, while independent validation reveals critical deficiencies in prominent guard models (e.g., Llama‑Guard‑3). We open‑source our end‑to‑end research prototype and datasets to enable scalable, high‑stakes model evaluation and targeted safety interventions (https://github.com/google‑research/nodesynth).
Authors:Hoang-Thuy-Duong Vu, Quoc-Cuong Pham, Huy-Hieu Pham
Abstract:
Psychological defense mechanisms (PDMs) are unconscious cognitive processes that modulate how individuals perceive and respond to emotional distress. Automatically classifying PDMs from text is clinically valuable but severely hindered by data scarcity and class imbalance, challenges which generative augmentation alone cannot resolve without psychological grounding. In this work, we address these challenges in the PsyDefDetect shared task (BioNLP@ACL 2026) by proposing a context‑aware synthetic augmentation framework combined with a hybrid classification model. Our hybrid model integrates contextual language representations with basic clinical features, along with 150 annotated defense items. Experiments demonstrate that definition quality in prompting directly governs generation fidelity and downstream performance. Our method surpasses DMRS Co‑Pilot, reaching an accuracy of 58.26% (+40.25%) and a macro‑F1 of 24.62% (+15.99%), thereby establishing a strong baseline for psychologically grounded defense mechanism classification in low‑resource settings. Source code is available at: https://github.com/htdgv/CASA‑PDC.
Authors:Libo Sun, Po-wei Harn, Peixiong He, Xiao Qin
Abstract:
KV‑cache compression at small budgets is a crowded design space spanning cache representation, head‑wise routing, compression cadence, decoding behavior, and within‑budget scoring. We study seven mechanisms across these five families under matched mean cache on long‑form mathematical reasoning (MATH‑500~\citehendrycks2021math) with two distilled‑reasoning models (Qwen‑7B and Llama‑8B variants of DeepSeek‑R1‑Distill~\citedeepseek2025r1) at budgets b \in \64, 128\. All seven were rejected. We then propose α, a one‑function modification to the TriAttention~\citemao2026triattention retention scorer that replaces argmax‑top‑k with greedy facility‑location‑inspired selection under a V‑space redundancy penalty controlled by a single weight λ. A pre‑registered protocol tunes λ on a frozen development split and confirms on a disjoint held‑out split; with λ= 0.5, α clears Bonferroni on two of the four (model, budget) cells (Qwen b=128 and Llama b=64), no cell is significantly negative, and the pre‑registered Branch~A triggers. The finding is asymmetric: a minimal scoring modification beat heavier structural redesigns in this regime, and the combined matched‑memory, sympy‑graded, held‑out confirmation protocol is the evidence standard that made the asymmetry visible.
Authors:Weisen Jiang, Shuhao Chen, Sinno Jialin Pan
Abstract:
Mixture‑of‑Experts (MoE) models scale capacity by combining specialized experts, but most existing approaches assume centralized access to training data. In practice, data are distributed across clients and cannot be shared due to privacy constraints, making unified MoE training challenging. We propose MetaMoE, a privacy‑preserving framework that unifies independently trained, domain‑specialized experts into a single MoE using public proxy data as surrogates for inaccessible private data. Central to MetaMoE is diversity‑aware proxy selection, which selects client‑domain‑relevant and diverse samples from public data to effectively approximate private data distributions and supervise router learning. These proxies are further used to align expert training, improving expert coordination at unification time, while a context‑aware router enhances expert selection across heterogeneous inputs. Experiments on computer vision and natural language processing benchmarks demonstrate that MetaMoE consistently outperforms recent privacy‑preserving MoE unification methods. Code is available at https://github.com/ws‑jiang/MetaMoE.
Authors:Adam Nohejl, Xuanxin Wu, Yusuke Ide, Maria Angelica Riera Machin, Yi-Ning Chang, Hitomi Yanaka
Abstract:
We describe two types of models for vocabulary difficulty prediction: a high‑accuracy black‑box model, which achieved the top shared task result in the open track, and an explainable model, which outperforms a fine‑tuned encoder baseline. As the black‑box model, we fine‑tuned an LLM using a soft‑target loss function for effective application to the rating task, achieving r > 0.91. The explainable model provides insights into what impacts the difficulty of each item while maintaining a strong correlation (r > 0.77). We further analyze the results, demonstrating that the difficulty of items in the British Council's Knowledge‑based Vocabulary Lists (KVL) is often affected by spelling difficulty or the construction of the test items, in addition to the genuine production difficulty of the words. We make our code available online at https://github.com/adno/vocabulary‑difficulty .
Authors:Chengzhi Shen, Weixiang Shen, Tobias Susetzky, Chen, Chen, Jun Li, Yuyuan Liu, Xuepeng Zhang, Zhenyu Gong, Daniel Rueckert, Jiazhen Pan
Abstract:
Intensive care units (ICU) generate long, dense and evolving streams of clinical information, where physicians must repeatedly reassess patient states under time pressure, underscoring a clear need for reliable AI decision support. Existing ICU benchmarks typically treat historical clinician actions as ground truth. However, these actions are made under incomplete information and limited temporal context of the underlying patient state, and may therefore be suboptimal, making it difficult to assess the true reasoning capabilities of AI systems. We introduce RealICU, a hindsight‑annotated benchmark for evaluating large language models (LLMs) under realistic ICU conditions, where labels are created after senior physicians review the full patient trajectory. We formulate four physician‑motivated tasks: assess Patient Status, Acute Problems, Recommended Actions, and Red Flag actions that risk unsafe outcomes. We partition each trajectory with 30‑min windows and release two datasets: RealICU‑Gold with 930‑window annotations from 94 MIMIC‑IV patients, and RealICU‑Scale with 11,862 windows extended by Oracle, a physician‑validated LLM hindsight labeler. Existing LLMs including memory‑augmented ones performed poorly on RealICU, exposing two failure modes: a recall‑safety tradeoff for clinical recommendations, and an anchoring bias to early interpretations of the patient. We further introduce ICU‑Evo to study structured‑memory agents that improves long‑horizon reasoning but does not fully eliminate safety failures. Together, RealICU provides a clinically grounded testbed for measuring and improving AI sequential decision‑support in high‑stakes care. Project page: https://chengzhi‑leo.github.io/RealICU‑Bench/
Authors:Barathi Ganesh HB, Michal Ptaszynski, Rene Melendez, Juuso Eronen
Abstract:
This paper presents a multi‑stage framework for detecting reclaimed slurs in multilingual social media discourse. It addresses the challenge of identifying reclamatory versus non‑reclamatory usage of LGBTQ+‑related slurs across English, Spanish, and Italian tweets. The framework handles three intertwined methodological challenges like data scarcity, class imbalance, and cross‑linguistic variation in sentiment expression. It integrates data‑driven model selection via cross‑validation, semantic‑preserving augmentation through back‑translation, inductive transfer learning with dynamic epoch‑level undersampling, and domain‑specific knowledge injection via masked language modeling. Eight multilingual embedding models were evaluated systematically, with XLM‑RoBERTa selected as the foundation model based on macro‑averaged F1 score. Data augmentation via GPT‑4o‑mini back‑translation to alternate languages effectively tripled the training corpus while preserving semantic content and class distribution ratios. The framework produces four final runs for the evaluation purposes where RUN 1 is inductive transfer learning with augmentation and undersampling, RUN 2 with masked language modeling pre‑training, RUN 3 and RUN 4 are previous predictions refined via language‑specific decision thresholds optimized via ROC analysis. Language‑specific threshold refinement reveals that optimal decision boundaries vary significantly across languages. This reflects distributional differences in model confidence scores and linguistic variation in reclamatory language usage. The threshold‑based optimization yields 2‑5% absolute F1 improvement without requiring model retraining. The methodology is fully reproducible, with all code and experimental setup available at https://github.com/rbg‑research/MultiPRIDE‑Evalita‑2026.
Authors:Galadrielle Humblot-Renaux, Mohammad N. S. Jahromi, Rohat Bakuri-Jørgensen, Marieke Anne Heyl, Asta S. Stage Jarlner, Maria Vlachou, Anna Murphy Høgenhaug, Desmond Elliott, Thomas Gammeltoft-Hansen, Thomas B. Moeslund
Abstract:
Off‑the‑shelf large language models (LLMs) are increasingly used to automate text annotation, yet their effectiveness remains underexplored for underrepresented languages and specialized domains where the class definition requires subtle expert understanding. We investigate LLM‑based annotation for a novel legal NLP task: identifying the presence and sentiment of credibility assessments in asylum decision texts. We introduce RAB‑Cred, a Danish text classification dataset featuring high‑quality, expert annotations and valuable metadata such as annotator confidence and asylum case outcome. We benchmark 21 open‑weight models and 30 system‑user prompt combinations for this task, and systematically evaluate the effect of model and prompt choice for zero‑shot and few‑shot classification. We zoom in on the errors made by top‑performing models and prompts, investigating error consistency across LLMs, inter‑class confusion, correlation with human confidence and sample‑wise difficulty and severity of LLM mistakes. Our results confirm the potential of LLMs for cost‑effective labeling of asylum decisions, but highlight the imperfect and inconsistent nature of LLM annotators, and the need to look beyond the predictions of a single, arbitrarily chosen model. The RAB‑Cred dataset and code are available at https://github.com/glhr/RAB‑Cred
Authors:Chaehee Song, Minseok Seo, Yeeun Seong, Doyi Kim, Changick Kim
Abstract:
Large language models (LLMs) are typically deployed with fixed parameters, and their performance is often improved by allocating more computation at inference time. While such test‑time scaling can be effective, it cannot correct model misconceptions or adapt the model to the specific structure of an individual query. Test‑time optimization addresses this limitation by enabling parameter updates during inference, but existing approaches either rely on external data or optimize generic self‑supervised objectives that lack query‑specific alignment. In this work, we propose Query‑Conditioned Test‑Time Self‑Training (QueST), a framework that adapts model parameters during inference using supervision derived directly from the input query. Our key insight is that the input query itself encodes latent signals sufficient for constructing structurally related problem‑‑solution pairs. Based on this, QueST generates such query‑conditioned pairs and uses them as supervision for parameter‑efficient fine‑tuning at test time. The adapted model is then used to produce the final answer, enabling query‑specific adaptation without any external data. Across seven mathematical reasoning benchmarks and the GPQA‑Diamond scientific reasoning benchmark, QueST consistently outperforms strong test‑time optimization baselines. These results demonstrate that query‑conditioned self‑training is an effective and practical paradigm for test‑time adaptation in LLMs. Code is available at https://chssong.github.io/Query‑Conditioned‑TTST/.
Authors:Oscar Gilg, Pierre Beckmann, Daniel Paleka, Patrick Butlin
Abstract:
Large language models (LLMs) can be said to have preferences: they reliably pick certain tasks and outputs over others, and preferences shaped by post‑training and system prompts appear to shape much of their behaviour. But models can also adopt different personas which have radically different preferences. How is this implemented internally? Does each persona run on its own preference machinery, or is something shared underneath? We train linear probes on residual‑stream activations of Gemma‑3‑27B and Qwen‑3.5‑122B to predict revealed pairwise task choices, and identify a genuine preference vector: it tracks the model's preferences as they shift across a range of prompts and situations, and on Gemma‑3‑27B steering along it causally controls pairwise choice. This preference representation is largely shared across personas: a probe trained on the helpful assistant predicts and steers the choices of qualitatively different personas, including an evil persona whose preferences anti‑correlate with those of the Assistant.
Authors:Yunheng Wang, Yuetong Fang, Taowen Wang, Lusong Li, Kun Liu, Junzhe Xu, Zizhao Yuan, Yixiao Feng, Jiaxi Zhang, Wei Lu, Zecui Zeng, Renjing Xu
Abstract:
Vision‑and‑Language Navigation (VLN) is a cornerstone of embodied intelligence. However, current agents often suffer from significant performance degradation when transitioning from simulation to real‑world deployment, primarily due to perceptual instability (e.g., lighting variations and motion blur) and under‑specified instructions. While existing methods attempt to bridge this gap by scaling up model size and training data, we argue that the bottleneck lies in the lack of robust spatial grounding and cross‑domain priors. In this paper, we propose StereoNav, a robust Vision‑Language‑Action framework designed to enhance real‑world navigation consistency. To address the inherent gap between synthetic training and physical execution, we introduce Target‑Location Priors as a persistent bridge. These priors provide stable visual guidance that remains invariant across domains, effectively grounding the agent even when instructions are vague. Furthermore, to mitigate visual disturbances like motion blur and illumination shifts, StereoNav leverages stereo vision to construct a unified representation of semantics and geometry, enabling precise action prediction through enhanced depth awareness. Extensive experiments on R2R‑CE and RxR‑CE demonstrate that StereoNav achieves state‑of‑the‑art egocentric RGB performance, with SR and SPL scores of 81.1% and 68.3%, and 67.5% and 52.0%, respectively, while using significantly fewer parameters and less training data than prior scaling‑based approaches. More importantly, real‑world robotic deployments confirm that StereoNav substantially improves navigation reliability in complex, unstructured environments. Project page: https://yunheng‑wang.github.io/stereonav‑public.github.io.
Authors:Chenjun Xu, Zhennan Zhou, Zhan Su, Bill Howe, Lucy Lu Wang, Bingbing Wen
Abstract:
Long chain‑of‑thought (Long CoT) reasoning improves performance on multi‑step problems, but it also induces overthinking: models often generate low‑yield reasoning that increases inference cost and latency. This inefficiency is especially problematic in low‑data fine‑tuning regimes, where real applications adapt reasoning models with limited supervision and cannot rely on large‑scale teacher distillation or heavy test‑time control. To address this, we propose STOP (Structured On‑policy Pruning), an on‑policy algorithm for analyzing and pruning long‑form reasoning traces. STOP constructs self‑distilled traces from the model. Then it maps each trace into a structured reasoning interface through node segmentation, taxonomy annotation, and reasoning‑tree construction. On top of this interface, we introduce ECN (Earliest Correct Node), which retains the shortest prefix ending at the earliest node that both functions as an answering conclusion and yields the correct final answer, removing redundant post‑solution reasoning while preserving semantic continuity. Experiments on DeepSeek‑R1‑Distill‑Qwen‑7B and DeepSeek‑R1‑Distill‑LLaMA‑3‑8B across GSM8K, Math 500, and AIME 2024 show that STOP reduces generated tokens by 19.4‑42.4% while largely preserving accuracy in low‑data fine‑tuning. Beyond efficiency, our analyses show that STOP induces much smaller distributional shift than teacher‑guided pruning, improves the structural efficiency of generated reasoning, and reallocates reasoning effort away from redundant verification and backtracking toward more productive exploration.
Authors:Yejin Lee, Yo-Sub Han
Abstract:
Diffusion Language Models (DLMs) provide a promising alternative to autoregressive language models by generating text through iterative denoising and bidirectional refinement. However, this iterative generation paradigm also introduces unique safety vulnerabilities when harmful tokens generated at intermediate denoising steps propagate through subsequent refinement processes and eventually induce unsafe outputs. While there are a few attempts to remedy this issue, they either fail to generate safe outputs or generate safe yet low‑quality outputs. This motivates us to propose an inference‑time defense framework based on the step‑wise intervention during the denoising process, which then improves the safety without compromising the output quality. The key component of our framework is a contrastive safety direction (SGD), a latent direction that captures the semantic boundary between harmful and safe generations. We leverage SGD to assess the alignment of generated tokens with harmful semantics at each denoising step. When harmful alignment is detected, our method remasks the corresponding tokens and resumes the denoising process with adaptive steering, where the steering strength is modulated according to the estimated degree of harmfulness. As a plug‑and‑play module, our method circumvents the need for additional fine‑tuning and can be directly incorporated into off‑the‑shelf diffusion models. The experimental results show that our approaches reduce jailbreak success rates to 0.64% while preserving generation quality close to the original model performance. This confirms the effectiveness of step‑wise intervention for safe diffusion language model generation. Our code is available at https://github.com/leeyejin1231/DLM_Steering_Remasking.
Authors:Zijing Wang, Mingyang Wang, Ercong Nie, Yongkang Liu, Shi Feng, Mengjie Zhao, Daling Wang, Xiaocui Yang, Hinrich Schütze
Abstract:
Towards more general and human‑like intelligence, large language models should seamlessly integrate both multilingual and multimodal capabilities; however, extending an existing multimodal model to many languages typically requires expensive multilingual multimodal data construction and repeated end‑to‑end retraining. We study a training‑free alternative: injecting multilingual capability into an existing multimodal model by composing residual updates in the shared language model backbone. The key challenge is that multilingual and multimodal updates are heterogeneous, reflecting different functional roles in the shared model. To address this, we propose Direction‑ and Magnitude‑aware Multilingual Multimodal merging (DiM3), which selectively composes the two updates at each parameter dimension while preserving the original vision encoder and multimodal projector. Experiments on multilingual benchmarks in both text‑only and vision‑language settings, covering 57 languages across LLaVA‑ and Qwen‑based backbones, show that DiM3 consistently outperforms existing merging baselines, substantially improves multilingual performance over the original multimodal model, and remains competitive with dedicated multilingual multimodal fine‑tuning while largely retaining general multimodal ability. We further show that DiM3 can be directly applied to already trained multilingual multimodal models and still yield additional gains. Further interpretability analysis shows that DiM3 primarily reshapes intermediate‑layer semantic representations, strengthening cross‑lingual alignment under both text‑only and multimodal inputs while preserving higher‑layer task‑sensitive structure. Our repository is on https://github.com/wzj1718/DiM3.
Authors:Haodong Wu, Jiahao Zhang, Lijie Hu, Yongqi Zhang
Abstract:
Supervised fine‑tuning (SFT) data selection is commonly formulated as instance ranking: score each example and retain a top‑k subset. However, effective SFT training subsets are often produced through ordered curation recipes, where filtering, mixing, and deduplication operators jointly shape the final data distribution. We formulate this problem as fixed‑pool data recipe search: given a raw instruction pool and a library of grounded operators, the goal is to discover an executable recipe that constructs a high‑quality selected subset under a limited budget of full SFT evaluations, without generating, rewriting, or augmenting training samples. We introduce AutoSelection, a two‑layer solver that decouples fixed‑pool materialization based on cached task‑, data‑, and model‑side signals from expensive full evaluation, using warmup probes, realized subset states, local recipe edits, Gaussian‑process‑assisted ranking, and stagnation‑triggered reseeding. Experiments on a 90K instruction pool show that AutoSelection achieves the strongest in‑distribution reasoning average across three base models, outperforming full‑data training, random recipe search, random top‑k, and single‑operator selectors. Additional Out‑of‑distribution graph‑reasoning results, search‑stability analyses, structural ablations, and 1.5B‑to‑7B transfer checks further show that recipe structure matters beyond individual selection operators. Code is available at https://github.com/w253/AutoSelection.
Authors:Dongsheng Ma, Jiayu Li, Zhengren Wang, Yijie Wang, Jiahao Kong, Weijun Zeng, Jutao Xiao, Jie Yang, Wentao Zhang, Bin Wang, Conghui He
Abstract:
Multimodal Large Language Models (MLLMs) have significantly advanced document understanding, yet current Doc‑VQA evaluations score only the final answer and leave the supporting evidence unchecked. This answer‑only approach masks a critical failure mode: a model can land on the correct answer while grounding it in the wrong passage ‑‑ a critical risk in high‑stakes domains like law, finance, and medicine, where every conclusion must be traceable to a specific source region. To address this, we introduce CiteVQA, a benchmark that requires models to return element‑level bounding‑box citations alongside each answer, evaluating both jointly. CiteVQA comprises 1,897 questions across 711 PDFs spanning seven domains and two languages, averaging 40.6 pages per document. To ensure fidelity and scalability, the ground‑truth citations are generated by an automated pipeline‑which identifies crucial evidence via masking ablation‑and are subsequently validated through expert review. At the core of our evaluation is Strict Attributed Accuracy (SAA), which credits a prediction only when the answer and the cited region are both correct. Auditing 20 MLLMs reveals a pervasive Attribution Hallucination: models frequently produce the right answer while citing the wrong region. The strongest system (Gemini‑3.1‑Pro‑Preview) achieves an SAA of only 76.0, and the strongest open‑source MLLM reaches just 22.5. Ultimately, towards trustworthy document intelligence, CiteVQA exposes a reliability gap that answer‑only evaluations overlook, providing the instrumentation needed to close it. Our repository is available at https://github.com/opendatalab/CiteVQA.
Authors:Buyun Liang, Jinqi Luo, Liangzu Peng, Kwan Ho Ryan Chan, Darshan Thaker, Kaleab A. Kinfu, Fengrui Tian, Hamed Hassani, René Vidal
Abstract:
Large language models (LLMs) achieve strong performance across many tasks but remain vulnerable to hallucinations, motivating the need for realistic adversarial prompts that elicit such failures. We formulate hallucination elicitation as a constrained optimization problem, where the goal is to find semantically coherent adversarial prompts that are equivalent to benign user prompts. Existing methods remain limited: discrete prompt‑based attacks preserve semantic equivalence and coherence but search only over a limited set of prompt variations, while continuous latent‑space attacks explore a richer space but often decode into prompts that are no longer valid rephrasings. To address these limitations, we propose REALISTA, a realistic latent‑space attack framework. REALISTA constructs an input‑dependent dictionary of valid editing directions, each corresponding to a semantically equivalent and coherent rephrasing, and optimizes continuous combinations of these directions in latent space. This design combines the optimization flexibility of continuous attacks with the semantic realism of discrete rephrasing‑based attacks. Experiments demonstrate that REALISTA achieves superior or comparable performance to state‑of‑the‑art realistic attacks on open‑source LLMs and, crucially, succeeds in attacking large reasoning models under free‑form response settings, where prior realistic attacks fail. Code is available at https://github.com/Buyun‑Liang/REALISTA.
Authors:Jack Young
Abstract:
We introduce WriteSAE, the first sparse autoencoder that decomposes and edits the matrix cache write of state‑space and hybrid recurrent language models, where residual SAEs cannot reach. Existing SAEs read residual streams, but Gated DeltaNet, Mamba‑2, and RWKV‑7 write to a d_k × d_v cache through rank‑1 updates k_t v_t^\top that no vector atom can replace. WriteSAE factors each decoder atom into the native write shape, exposes a closed form for the per‑token logit shift, and trains under matched Frobenius norm so atoms swap one cache slot at a time. Atom substitution beats matched‑norm ablation on 92.4% of n=4,851 firings at Qwen3.5‑0.8B L9 H4, the 87‑atom population test holds at 89.8%, the closed form predicts measured effects at R^2=0.98, and Mamba‑2‑370M substitutes at 88.1% over 2,500 firings. Sustained three‑position installs at 3× lift midrank target‑in‑continuation from 33.3% to 100% under greedy decoding, the first behavioral install at the matrix‑recurrent write site.
Authors:Deepak Kumar, Baban Gain, Asif Ekbal
Abstract:
Automatic Speech Recognition (ASR) transcripts often contain disfluencies, such as fillers, repetitions, and false starts, which reduce readability and hinder downstream applications like chatbots and voice assistants. If left unaddressed, such disfluencies can significantly degrade the reliability of downstream systems. Most existing approaches rely on classical models that focus on identifying disfluent tokens for removal. While this strategy is effective to some extent, it often disrupts grammatical structure and semantic coherence, leading to incomplete or unnatural sentences. Recent literature explored the use of large language models (LLMs); however, these efforts have primarily focused on disfluency detection or data augmentation, rather than performing comprehensive correction. We propose a multilingual correction pipeline where a sequence tagger first marks disfluent tokens, and these signals guide instruction fine‑tuning of an LLM to rewrite transcripts into fluent text. To further improve reliability, we add a contrastive learning objective that penalizes the reproduction of disfluent tokens, encouraging the model to preserve grammar and meaning while removing disfluent artifacts. Our experiments across three Indian languages, namely Hindi, Bengali, and Marathi show consistent improvements over strong baselines, including multilingual sequence‑to‑sequence models. These results highlight that detection‑only strategies are insufficient. Combining token‑level cues with instruction tuning and contrastive learning provides a practical and scalable solution for multilingual disfluency correction in speech‑driven NLP systems. We make the codes publicly available at https://github.com/deepak‑kumar‑98/Mind‑the‑Pause.
Authors:Yexing Xu, Wei Feng, Shen Zhang, Haohan Wang, Yuxin Qin, Yaoyu Li, Ao Ma, Yuhao Luo, Lu Wang, Xudong Ren, Haoran Wang, Run Ling, Zheng Zhang, Jingjing Lv, Junjie Shen, Ching Law, Longguang Wang, Yulan Guo
Abstract:
Generating realistic and user‑preferred advertisements is a key challenge in e‑commerce. Existing approaches utilize multiple independent models driven by click‑through‑rate (CTR) to controllably create attractive image or text advertisements. However, their pipelines lack cross‑modal perception and rely on CTR that only reflects average preferences. Therefore, we explore jointly generating personalized image‑text advertisements from historical click behaviors. We first design a Unified Advertisement Generative model (Uni‑AdGen) that employs a single autoregressive framework to produce both advertising images and texts. By incorporating a foreground perception module and instruction tuning, Uni‑AdGen enhances the realism of the generated content. To further personalize advertisements, we equip Uni‑AdGen with a coarse‑to‑fine preference understanding module that effectively captures user interests from noisy multimodal historical behaviors to drive personalized generation. Additionally, we construct the first large‑scale Personalized Advertising image‑text dataset (PAd1M) and introduce a Product Background Similarity (PBS) metric to facilitate training and evaluation. Extensive experiments show that our method outperforms baselines in general and personalized advertisement generation. Our project is available at https://github.com/JD‑GenX/Uni‑AdGen.
Authors:Maham Nazir, Muhammad Aqeel, Richong Zhang, Francesco Setti
Abstract:
Multimodal video summarization requires visual features that align semantically with language generation. Traditional approaches rely on CNN features trained for object classification, which represent visual concepts as discrete categories not aligned with natural language. We propose ClipSum, a framework that leverages frozen CLIP vision‑language features with explicit temporal modeling and dimension‑adaptive fusion for instructional video summarization. CLIP's contrastive pre‑training on 400M image‑text pairs yields visual features semantically aligned with the linguistic concepts that text decoders generate, bridging the vision‑language gap at the representation level. On YouCook2, ClipSum achieves 33.0% ROUGE‑1 versus 30.5% for ResNet‑152 with 4x lower dimensionality (512 vs. 2048), demonstrating that semantic alignment matters more than feature capacity. Frozen CLIP (33.0%) surpasses fine‑tuned CLIP (32.3%), showing that preserving pre‑trained alignment is more valuable than task‑specific adaptation. https://github.com/aqeeelmirza/clipsum
Authors:Davide Baldelli, Sruthi Kuriakose, Maryam Hashemzadeh, Amal Zouaq, Sarath Chandar
Abstract:
Language models are increasingly used in settings where outputs must satisfy user‑specified randomness constraints, yet their generation probabilities are often poorly calibrated to those targets. We study whether this capability can be improved directly through fine‑tuning. Concretely, we fine‑tune language models on synthetic prompts that require sampling from mathematical distributions, and compare two Calibration Fine‑Tuning variants: a soft‑target method that converts the desired output distribution into trie‑derived next‑token targets, and a hard‑target method that trains on sampled completions from the same target distribution. Across 12 models spanning four families, both methods substantially improve structured‑sampling fidelity on held‑out distribution families and unseen parameter settings, showing that probabilistic calibration is a trainable capability. Under our selected training configurations, the two methods exhibit different empirical profiles: hard‑target fine‑tuning is often strongest on structured numeric sampling, while soft‑target fine‑tuning performs better on broader stochastic generation benchmarks, including open‑ended random generation, multiple‑choice answer‑position balancing, and NoveltyBench. The gains sometimes reduce downstream capability, especially arithmetic reasoning, with costs varying by model. Overall, our results show that probabilistic calibration can be improved through fine‑tuning, with our hard‑target configuration favoring exact numeric fidelity and our soft‑target configuration favoring broader stochastic transfer. Code is available at https://github.com/chandar‑lab/calibration‑finetuning.
Authors:Xin Ma, Wei Chen, Qi Liu, Derong Xu, Zhi Zheng, Tong Xu, Enhong Chen
Abstract:
Lifelong Model Editing aims to continuously update evolving facts in Large Language Models while preserving unrelated knowledge and general capabilities, yet it remains plagued by catastrophic forgetting and model collapse. Empirically, we find that recent editors resilient over long horizons share the same core strategy: Lifelong Normalization (LN), which normalizes value gradients using running statistics. Removing LN causes immediate performance collapse, and we observe a counter‑intuitive positive cumulative effect where early edits can promote the success of future edits. Yet the mechanism of LN remains a "black box", leaving its precise role in lifelong stability poorly understood. In this work, we provide the first theoretical account of LN in the lifelong regime. Our analysis reveals a self‑reinforcing stability loop and proves that, when combined with ridge‑regularized regression, LN yields parameter updates with asymptotic orthogonality and bounded norms, directly mitigating forgetting and systemic collapse. Based on these insights, we derive StableEdit, which strengthens this stability loop via an explicit warm‑up stage and full whitening, improving long‑horizon stability at minimal overhead. Extensive experiments validate our theory and demonstrate competitive performance. Our code is available at https://github.com/MINE‑USTC/StableEdit.
Authors:Xianzhe Fan, Yuxiang Lu, Shenyuan Gao, Xiaoyang Wu, Ruihua Han, Manling Li, Hengshuang Zhao
Abstract:
Vision‑Language‑Action (VLA) models are often brittle in fine‑grained manipulation, where minor action errors during the critical phases can rapidly escalate into irrecoverable failures. Since existing VLA models rely predominantly on successful demonstrations for training, they lack an explicit awareness of failure during these critical phases. To address this, we propose DreamAvoid, a critical‑phase test‑time dreaming framework that enables VLA models to anticipate and avoid failures. We also introduce an autonomous boundary learning paradigm to refine the system's understanding of the subtle boundary between success and failure. Specifically, we (1) utilize a Dream Trigger to determine whether the execution has entered a critical phase, (2) sample multiple candidate action chunks from the VLA via an Action Proposer, and (3) employ a Dream Evaluator, jointly trained on mixed data (success, failure, and boundary cases), to "dream" the short‑horizon futures corresponding to the candidate actions, evaluate their values, and select the optimal action. We conduct extensive evaluations on real‑world manipulation tasks and simulation benchmarks. The results demonstrate that DreamAvoid can effectively avoid failures, thereby improving the overall task success rate. Our code is available at https://github.com/XianzheFan/DreamAvoid.
Authors:Kyosuke Takami, Yuka Tateisi, Satoshi Sekine, Yusuke Miyao
Abstract:
Authentic school examinations provide a high‑validity test bed for evaluating multimodal large language models (MLLMs), yet benchmarks grounded in Japanese K‑12 assessments remain scarce. We present a multimodal dataset constructed from Japan's National Assessment of Academic Ability, comprising officially released middle‑school items in Science, Mathematics, and Japanese Language. Unlike existing benchmarks based on synthetic or curated data, our dataset preserves real exam layouts, diagrams, and Japanese educational text, together with nationwide aggregated student response distributions (N \approx 900,000). These features enable direct comparison between human and model performance under a unified evaluation framework. We benchmark recent multimodal LLMs using exact‑match accuracy and character‑level F1 for open‑ended responses, observing substantial variation across subjects and strong sensitivity to visual reasoning demands. Human evaluation and LLM‑as‑judge analyses further assess the reliability of automatic scoring. Our dataset establishes a reproducible, human‑grounded benchmark for multimodal educational reasoning and supports future research on evaluation, feedback generation, and explainable AI in authentic assessment contexts. Our dataset is available at: https://github.com/KyosukeTakami/gakucho‑benchmark
Authors:Wen Lai, Yingli Shen, Dingnan Jin, Qing Cui, Jun Zhou, Maosong Sun, Alexander Fraser
Abstract:
Autoregressive language models are widely used for text evaluation, however, their left‑to‑right factorization introduces positional bias, i.e., early tokens are scored with only leftward context, conflating architectural asymmetry with true text quality. We propose masked reconstruction as an alternative paradigm, where every token is scored using full bidirectional context. We introduce DiffScore, an evaluation framework built on Masked Large Diffusion Language Models. By measuring text recoverability across continuous masking rates, DiffScore eliminates positional bias and naturally establishes an evaluation hierarchy from local fluency to global coherence. We further provide diagnostic tools unavailable to autoregressive frameworks: multi‑timestep quality profiles that decompose scores across masking rates, and bidirectional PMI decomposition that disentangles fluency from faithfulness. Experiments across ten benchmarks show that DiffScore consistently outperforms autoregressive baselines in both zero‑shot and fine‑tuned settings. The code is released at: https://github.com/wenlai‑lavine/DiffScore.
Authors:Shufan Ming, Joe D. Menke, Neil R. Smalheiser, Halil Kilicoglu
Abstract:
Accurately and consistently indexing biomedical literature by publication type and study design is essential for supporting evidence synthesis and knowledge discovery. Prior work on automated publication type and study design indexing has primarily focused on expanding label coverage, enriching feature representations, and improving in‑domain accuracy, with evaluation typically conducted on data drawn from the same distribution as training. Although pretrained biomedical language models achieve strong performance under these settings, models optimized for in‑domain accuracy may rely on superficial lexical or dataset‑specific cues, resulting in reduced robustness under distributional shift. In this study, we introduce an evaluation framework based on controlled semantic perturbations to assess the robustness of a publication type classifier and investigate robustness‑oriented training strategies that combine entity masking and domain‑adversarial training to mitigate reliance on spurious topical correlations. Our results show that the commonly observed trade‑off between robustness and in‑domain accuracy can be mitigated when robustness objectives are designed to selectively suppress non‑task‑defining features while preserving salient methodological signals. We find that these improvements arise from two complementary mechanisms: (1) increased reliance on explicit methodological cues when such cues are present in the input, and (2) reduced reliance on spurious domain‑specific topical features. These findings highlight the importance of feature‑level robustness analysis for publication type and study design classification and suggest that refining masking and adversarial objectives to more selectively suppress topical information may further improve robustness. Data, code, and models are available at: https://github.com/ScienceNLP‑Lab/MultiTagger‑v2/tree/main/ICHI
Authors:Joykirat Singh, Zaid Khan, Archiki Prasad, Justin Chih-Yao Chen, Akshay Nambi, Hyunji Lee, Elias Stengel-Eskin, Mohit Bansal
Abstract:
Large language models (LLMs) are increasingly deployed on long‑horizon tasks in partially observable environments, where they must act while inferring and tracking a complex environment state over many steps. This leads to two challenges: partial observability requires maintaining uncertainty over unobserved world attributes, and long interaction history causes context to grow without bound, diluting task‑relevant information. A principled solution to both challenges is a belief state: a posterior distribution over environment states given past observations and actions, which compactly encodes history for decision making regardless of episode length. In LLM agents, however, the open‑ended nature of text makes it unclear how to represent such a distribution. Therefore, we introduce Agent‑BRACE: Agent Belief state Representation via Abstraction and Confidence Estimation, a method that decouples an LLM agent into a belief state model and a policy model, jointly optimized via reinforcement learning. The belief state model produces a structured approximation of the belief distribution: a set of atomic natural language claims about the environment, each annotated with an ordinal verbalized certainty label ranging from certain to unknown. The policy model conditions on this compact, structured approximate belief rather than the full history, learning to select actions under explicit uncertainty. Across long‑horizon, partially observable embodied language environments, Agent‑BRACE achieves an average absolute improvement of +14.5% (Qwen2.5‑3B‑Instruct) and +5.3% (Qwen3‑4B‑Instruct), outperforming strong RL baselines while maintaining a near‑constant context window independent of episode length. Further analysis shows that the learned belief becomes increasingly calibrated over the course of an episode as evidence accumulates.
Authors:Wei Wu, Ziyang Xu, Zeyu Zhang, Yang Zhao, Hao Tang
Abstract:
Presentation generation is moving beyond static slide creation toward end‑to‑end presentation video generation with research grounding, multimodal media, and interactive delivery. We introduce PresentAgent‑2, an agentic framework for generating presentation videos from user queries. Given an open‑ended user query and a selected presentation mode, PresentAgent‑2 first summarizes the query into a focused topic and performs deep research over presentation‑friendly sources to collect multimodal resources, including relevant text, images, GIFs, and videos. It then constructs presentation slides, generates mode‑specific scripts, and composes slides, audio, and dynamic media into a complete presentation video. PresentAgent‑2 supports three independent presentation modes within a unified framework: Single Presentation, which generates a single‑speaker narrated presentation video; Discussion, which creates a multi‑speaker presentation with structured speaker roles, such as for asking guiding questions, explaining concepts, clarifying details, and summarizing key points; and Interaction, which independently supports answering audience questions grounded in the generated slides, scripts, retrieved evidence, and presentation context. To evaluate these capabilities, we build a multimodal presentation benchmark covering single presentation, discussion, and interaction scenarios, with task‑specific evaluation criteria for content quality, media relevance, dynamic media use, dialogue naturalness, and interaction grounding. Overall, PresentAgent‑2 extends presentation generation from document‑dependent slide creation to query‑driven, research‑grounded presentation video generation with multimodal media, dialogue, and interaction. Code: https://github.com/AIGeeksGroup/PresentAgent‑2. Website: https://aigeeksgroup.github.io/PresentAgent‑2.
Authors:Xueqi Cheng, Qiong Wu, Zhengyi Zhou, Xugui Zhou, Tyler Derr, Yushun Dong
Abstract:
Large Language Models (LLMs) are increasingly deployed in multi‑turn dialogue settings where preserving conversational context across turns is essential. A standard serving practice concatenates the full dialogue history at every turn, which reliably maintains coherence but incurs substantial cost in latency, memory, and API expenditure, especially when queries are routed to large proprietary models. Existing approaches often struggle to balance the trade‑off between response quality and efficiency. We propose a framework that exploits the early turns of a session to estimate a local response manifold and then adapt a smaller surrogate model to this local region for the remainder of the conversation. Concretely, we learn soft prompts that maximize semantic divergence between the large and surrogate small language models' responses to surface least‑aligned local directions, stabilize training with anti‑degeneration control, and distill the mined cases into localized LoRA fine‑tuning so the surrogate runs without prompts at inference. A simple gate enables a one‑time switch with rollback on drift. We further provide a theoretical analysis for key components in SOMA. Extensive experiments show the effectiveness of SOMA. The source code is provided at: https://github.com/LabRAI/SOMA.
Authors:Xueqi Cheng, Yushun Dong
Abstract:
Multimodal large language models (MLLMs) have heterogeneous strengths across OCR, chart understanding, spatial reasoning, visual question answering, cost, and latency. Effective MLLM routing therefore requires more than estimating query difficulty: a router must match the multimodal requirements of the current image‑question input with the capabilities of each candidate model. We propose LatentRouter, a router that formulates MLLM routing as counterfactual multimodal utility prediction. Given an image‑question query, LatentRouter extracts learned multimodal routing capsules, represents each candidate MLLM with a model capability token, and performs latent communication between these states to estimate how each model would perform if selected. A distributional outcome head predicts model‑specific counterfactual quality, while a bounded capsule correction refines close decisions without allowing residual signals to dominate the prediction. The resulting utility‑based policy supports performance‑oriented and performance‑cost routing, and handles changing candidate pools through shared per‑model scoring with availability masking. Experiments on MMR‑Bench and VL‑RouterBench show that LatentRouter outperforms fixed‑model, feature‑level, and learned‑router baselines. Additional analyses show that the gains are strongest on multimodal task groups where model choice depends on visual, layout‑sensitive, or reasoning‑oriented requirements, and that latent communication is the main contributor to the improvement. The code is available at: https://github.com/LabRAI/LatentRouter.
Authors:Xueqi Cheng, Xugui Zhou, Tyler Derr, Yushun Dong
Abstract:
Capability distillation applies knowledge distillation to selected model capabilities, aiming to compress a large language model (LLM) into a smaller one while preserving the abilities needed for a downstream task. However, most existing methods treat capabilities as independent training targets and overlook how improving one capability can reshape the student's broader capability profile, especially when multiple abilities jointly determine task success. We study capability distillation under a fixed token budget and identify two consistent patterns: distillation induces systematic, budget‑dependent cross‑capability transfer, and additional budget often brings limited task‑relevant gains while sometimes degrading other useful abilities. Building on these insights, we propose ReAD, a Reinforcement‑guided cApability Distillation framework that explicitly accounts for capability interdependence. ReAD first infers task‑essential capabilities, then generates capability‑targeted supervision on the fly, and finally uses an uncertainty‑aware contextual bandit to adaptively allocate the distillation budget based on expected utility gains. Extensive experiments show that ReAD improves downstream utility under the same token budget while reducing harmful spillover and wasted distillation effort compared to strong baselines. Our code is publicly available at https://github.com/LabRAI/ReAD.
Authors:Yassin H. Rassul, Tarik A. Rashid
Abstract:
Defenses against indirect prompt injection (IPI) in tool‑using LLM agents share two structural weaknesses. First, they all attempt to prevent attacks rather than detect the compromises that slip through. Second, they have only been evaluated in English, leaving users of low‑resource languages such as Kurdish and Arabic without tested protection. This paper addresses both gaps with AgentShield, a deception‑based detection framework that places three layers of traps inside the agent's tool interface: fake tools, fake credentials, and allowlisted parameters. The same trap triggers serve as high‑precision labels for a self‑supervised classifier. An LLM agent that follows an attacker's hidden instruction almost always touches one of these traps, which gives both a real‑time compromise signal and a zero‑FP label for training a downstream detector without manual annotation. Across 176 cross‑lingual attack prompts and four LLMs from three providers, and because modern LLMs already refuse most IPI attempts on their own (attack success rate <= 10%), AgentShield's job is to catch the attacks that do slip through. On commercial models, it catches 90.7%‑100% of such successful attacks, with zero false alarms on 485 normal‑use tests. It survives a systematic adaptive‑attack evaluation with zero evasion on commercial models, and the self‑supervised classifier transfers across models and languages without retraining.
Authors:Chenyang Song, Weilin Zhao, Xu Han, Chaojun Xiao, Yingfa Chen, Zhiyuan Liu
Abstract:
While Mixture‑of‑Experts (MoE) scales model capacity without proportionally increasing computation, its massive total parameter footprint creates significant storage and memory‑access bottlenecks, which hinder efficient end‑side deployment that simultaneously requires high performance, low computational cost, and small storage overhead. To achieve these properties, we present DECO, a sparse MoE architecture designed to match the performance of dense Transformers under identical total parameter budgets and training tokens. DECO utilizes the differentiable and flexible ReLU‑based routing enhanced by learnable expert‑wise scaling, which adaptively balances the contributions of routed and shared experts. Furthermore, we introduce NormSiLU, an activation function that normalizes inputs prior to SiLU operators, producing a more stable trend of routed‑expert activation ratio and a higher intrinsic sparsity level. We also identify an empirical advantage in using non‑gated MLP experts with ReLU‑based routing, indicating the possibility of MoE architecture simplification. Experiments demonstrate that DECO, activating only 20% of experts, matches dense performance and outperforms established MoE baselines. Our specialized acceleration kernel delivers a 3.00× speedup on real hardware compared with dense inference. Codes and checkpoints are all available at https://github.com/thunlp/DECO.
Authors:Junhao Shen, Teng Zhang, Xiaoyan Zhao, Hong Cheng
Abstract:
Large language model agents increasingly rely on external skills to solve complex tasks, where skills act as modular units that extend their capabilities beyond what parametric memory alone supports. Existing methods assume external skills either accumulate as persistent guidance or internalized into the policy, eventually leading to zero‑skill inference. We argue this assumption is overly restrictive, since with limited parametric capacity and uneven marginal contribution across skills, the optimal active skill set is non‑monotonic, task‑ and stage‑dependent. In this work, we propose SLIM, a framework of dynamic Skill LIfecycle Management for agentic reinforcement learning (RL), which treats the active external skill set as a dynamic optimization variable jointly updated with policy learning. Specifically, SLIM estimates each active skill's marginal external contribution through leave‑one‑skill‑out validation, then applies three lifecycle operations: retaining high‑value skills, retiring skills whose contribution becomes negligible after sufficient exposure, and expanding the skill bank when persistent failures reveal missing capability coverage. Experiments show that SLIM outperforms the best baselines by an average of 7.1% points across ALFWorld and SearchQA. Results further indicate that policy learning and external skill retention are not mutually exclusive: some skills are absorbed into the policy, while others continue to provide external value, supporting SLIM as a more general paradigm for skill‑based agentic RL.
Authors:Gabriel Garcia
Abstract:
Corruption studies, the standard tool for evaluating chain‑of‑thought (CoT) faithfulness, infer which steps are ``computationally important'' from accuracy loss when steps are corrupted. We show that when benchmark chains end with an explicit terminal answer line, as in GSM8K and MATH, these tests largely measure \emphanswer placement rather than where intermediate computation is carried out. Using matched GSM8K examples, removing only the final answer statement while preserving all reasoning collapses suffix sensitivity by about 19× for Qwen~2.5‑3B (N=300, p=0.022). Conflicting‑answer prompts, which contain correct reasoning but a wrong explicit final answer, drive accuracy to zero or near‑zero at 7B across five open‑weight model families; wrong‑answer following is strong at 3B‑‑7B and attenuates sharply at larger scales. Replications on MATH, within‑stable comparisons at 7B, and suffix‑free chains show the same pattern in different guises: corruption sensitivity tracks the location of explicit answer text, not a fixed computational depth in the reasoning. Generation‑time probes indicate that final answers are rarely early‑determined during generation (<5% early commitment), yet consumption‑time behavior systematically follows explicit answer text. The confound is therefore largely a readout effect when the chain is consumed. We propose a three‑prerequisite protocol (question‑only control, format characterization, and an all‑position sweep) as a practical minimum for future corruption‑based faithfulness studies.
Authors:Denghao Ma, Qing Liu, Zulong Chen, Chuanfei Xu, Jia Xu, Zhibo Yang, Wei Shao, Zhao Li
Abstract:
Document classification forms the backbone of modern enterprise content management, yet existing benchmarks remain trapped in oversimplified paradigms ‑‑ single domain settings with flat label structures ‑‑ that bear little resemblance to the hierarchical, multi‑modal, and cross‑domain nature of real‑world business documents. This gap not only misrepresents practical complexity but also stifles progress toward industrially viable document intelligence. To bridge this gap, we construct the first Multi‑level, Multi‑domain, Multi‑modal document classification Benchmark (MMM‑Bench). MMM‑Bench includes (1) a deeply hierarchical taxonomy spanning five levels that capture the authentic organizational logic of business documentation; and (2) 5,990 real‑world multi‑modal documents meticulously curated from 12 commercial domains in Alibaba. Each document is manually annotated with a complete hierarchical path by domain experts. We establish comprehensive baselines on MMM‑Bench, which consists of open‑weight models and API‑based models. Through systematic experiments, we identify four fundamental challenges within MMM‑Bench and propose corresponding insights. To provide a solid foundation for advancing research in multi‑level, multi‑domain document classification, we release all of the data and the evaluation toolkit of MMM‑Bench at https://github.com/MMMDC‑Bench/MMMDC‑Bench.
Authors:Daniel Goldstein, Eugene Cheah
Abstract:
We present Key‑Value Means ("KVM"), a novel block‑recurrence for attention that can accommodate either fixed‑size or growing state. Equipping a strong transformer baseline with fixed‑size KVM attention layers yields a strong O(N) chunked RNN, while adding only an insignificant number of new parameters. We train a transformer with a growable KVM cache and show it performs competitively on long‑context tests with only subquadratic prefill time and sublinear state growth. KVM is implementable with standard operations and without custom kernels, and supports chunk‑wise parallelizable training and prefill. It provides many of the benefits of both traditional transformers (expandable context memory, chunk‑wise parallelizable training and prefill) and linear RNNs in a single unified package. It can be used on every layer, saving KV‑cache memory, and allowing a continuous range of choices of prefill time complexity between O(N) and O(N^2). It can also be implemented in a hybrid solution in tandem with LRNN layers in place of traditional attention, to supplement the LRNN with improved sublinear memory growth context length usage and long context decoding. We release our code at https://github.com/recursal/KVM‑paper and trained models at https://huggingface.co/collections/recursal/key‑value‑means under the Apache 2.0 license.
Authors:Tianyu Zheng, Hong Wu, Jiaji Zhong
Abstract:
Large language models (LLMs) often suffer from hallucinations due to error accumulation in autoregressive decoding, where suboptimal early token choices misguide subsequent generation. Although multi‑path decoding can improve robustness by exploring alternative trajectories, existing methods lack principled strategies for determining when to branch and how to regulate inter‑path interactions. We propose Adaptive Path‑Contrastive Decoding (APCD), a multi‑path decoding framework that improves output reliability through adaptive exploration and controlled path interaction. APCD consists of two components: (1) Entropy‑Driven Path Expansion, which delays branching until predictive uncertainty ‑ measured by Shannon entropy over top candidate tokens ‑ indicates multiple plausible continuations; and (2) Divergence‑Aware Path Contrast, which encourages diverse reasoning trajectories while dynamically attenuating inter‑path influence as prediction distributions diverge. Experiments on eight benchmarks demonstrate improved factual accuracy while maintaining decoding efficiency. Our code is available at https://github.com/zty‑king/APCD.
Authors:Shusaku Egami, Aoi Ohta, Tomoki Tsujimura, Masaki Asada, Tatsuya Ishigaki, Ken Fukuda, Masahiro Hamasaki, Hiroya Takamura
Abstract:
Large Language Models (LLMs) provide flexible natural language processing capabilities, while knowledge graphs (KGs) offer explicit and structured knowledge. Integrating these two in a complementary manner enables the development of reliable and verifiable AI systems. In particular, knowledge graph question answering (KGQA) has attracted attention as a means to reduce LLM hallucinations and to leverage knowledge beyond the training data. However, existing KGQA benchmark datasets are biased toward encyclopedic knowledge, limited to a single modality, and lack fine‑grained spatiotemporal data, which limits their applicability to real‑world scenarios targeted by Embodied AI. We introduce HOME‑KGQA, a novel KGQA benchmark dataset built on a multimodal KG of daily household activities. HOME‑KGQA consists of complex, multi‑hop natural language questions paired with graph database query languages. Compared to existing benchmarks, it includes more challenging questions that involve multi‑level spatiotemporal reasoning, multimodal grounding, and aggregate functions. Experimental results show that the LLM‑based KGQA methods fail to achieve performance comparable to that on existing datasets when evaluated on HOME‑KGQA. This highlights significant challenges that should be addressed for the real‑world deployment of KGQA systems. Our dataset is available at https://github.com/aistairc/home‑kgqa
Authors:Xiaocheng Luo, Kang Wang, Zaifu Zhan, Yuechi Zhou, Xiangyu Duan
Abstract:
The Chain‑of‑Thought (CoT) paradigm, while enhancing the interpretability of Large Language Models (LLMs), is constrained by the inefficiencies and expressive limits of natural language. Latent Chain‑of‑Thought (latent CoT) reasoning, which operates in a continuous latent space, offers a promising alternative but faces challenges from structural complexities in existing multi‑step or multi‑model paradigms, such as error propagation and coordination overhead. In this paper, we introduce One‑Model One‑Step, a novel compression framework for Latent Reasoning with Rule‑Based Priors(RuPLaR) to address this challenge. Our method trains an LLM to autonomously generate latent reasoning tokens in a single training stage, guided by rule‑based prior probability distributions, thereby eliminating cascaded processes and inter‑model dependencies. To ensure reasoning quality, we design a joint training objective that enforces answer consistency via cross‑entropy, aligns soft tokens with rule‑based priors via KL divergence (the Soft Thinking constraint), and adds a problem‑thought semantic alignment constraint in the representation space. Extensive experiments show that our compression framework not only improves accuracy by 11.1% over existing latent CoT methods but also achieves this with minimal token usage, underscoring its effectiveness and extensibility. Code: https://github.com/xiaocen‑luo/RuPLaR.
Authors:Bingqing Liu, Wei Liu, Yuhua Li
Abstract:
Null‑space‑based methods have garnered considerable attention in model editing by constraining updates to the null space of the pre‑existing knowledge representation, thereby preserving the model's original behavior. However, in practice these methods rely on an approximate null space‑‑leading to knowledge leakage‑‑and further suffer from severe performance degradation during sequential editing. Recent work shows that history‑aware editing strategies can empirically mitigate this decline, yet the underlying reason remains unclear. In this paper, we first expose the knowledge leakage inherent in existing null‑space approaches and then analyze why history‑aware updates effectively preserve both editing performance and general capabilities during long‑horizon editing. Building on these insights, we propose BetaEdit, a refined framework that effectively controls the knowledge leakage and integrates history‑aware updates into the null‑space paradigm. Extensive experiments on three large language models across two standard benchmarks show that BetaEdit consistently outperforms prior methods in the challenging regime of massive‑scale sequential editing. Code is available at: https://github.com/lbq8942/BetaEdit.
Authors:Chung-En Sun, Linbo Liu, Ge Yan, Zimo Wang, Tsui-Wei Weng
Abstract:
Tool‑augmented LLM agents tend to call tools indiscriminately, even when the model can answer directly. Each unnecessary call wastes API fees and latency, yet no existing benchmark systematically studies when a tool call is actually needed. We propose When2Tool, a benchmark of 18 environments (15 single‑hop, 3 multi‑hop) spanning three categories of tool necessity ‑‑ computational scale, knowledge boundaries, and execution reliability ‑‑ each with controlled difficulty levels that create a clear decision boundary between tool‑necessary and tool‑unnecessary tasks. We evaluate two families of training‑free baselines: Prompt‑only (varying the prompt to discourage unnecessary calls) and Reason‑then‑Act (requiring the model to reason about tool necessity before acting). Both provide limited control: Prompt‑only suppresses necessary calls alongside unnecessary ones, and Reason‑then‑Act still incurs a disproportionate accuracy cost on hard tasks. To understand why these baselines fail, we probe the models' hidden states and find that tool necessity is linearly decodable from the pre‑generation representation with AUROC 0.89‑‑0.96 across six models, substantially exceeding the model's own verbalized reasoning. This reveals that models already know when tools are needed, but fail to act on this knowledge during generation. Building on this finding, we propose Probe&Prefill, which uses a lightweight linear probe to read the hidden‑state signal and prefills the model's response with a steering sentence. Across all models tested, Probe&Prefill reduces tool calls by 48% with only 1.7% accuracy loss, while the best baseline at comparable accuracy only reduces 6% of tool calls, or achieves a similar tool call reduction but incurs a 5× higher accuracy loss. Our code is available at https://github.com/Trustworthy‑ML‑Lab/when2tool
Authors:Sohan Venkatesh
Abstract:
Large language models fail at counting repeated tokens despite strong performance on broader reasoning benchmarks. These failures are commonly attributed to limitations in internal count tracking. We show this attribution is wrong. Linear probes on the residual stream decode the correct count with near‑perfect accuracy at every post‑embedding layer, across all model depths. This holds even at the exact layers where the wrong answer crystallizes while the model simultaneously outputs an incorrect count. Attention patterns show no evidence of collapse over repeated tokens and tokenization artifacts account for none of the failure. Instead, a format‑triggered multi‑layer perceptron (MLP) block overwrites the correctly‑encoded count with a fixed wrong answer at roughly 88‑‑93,% network depth. This prior fires for repeated word‑tokens in space‑separated list format and is absent for repeated digit‑tokens. It is suppressed by comma‑separated delimiters in larger models but persists in smaller ones. The finding holds across Llama‑3.2 (1B and 3B) and Qwen2.5 (1.5B, 3B and 7B) at consistent relative depth. Counting failure is a failure of routing not of representation and the two require different interventions.
Authors:Yu Wu, Ananth Mahadevan, Filip Ginter, Michael Mathioudakis, Mikko Tolonen
Abstract:
While digitized corpora have transformed the study of intellectual transmission, current methods rely heavily on lexical text reuse detection, capturing verbatim quotations but fundamentally missing paraphrases and complex implicit engagement. This paper evaluates semantic search in 18th‑century intellectual history through the reception of John Locke's foundational work. Using expert annotation grounded in a semantic taxonomy, we examine whether an off‑the‑shelf semantic search pipeline can surface meaning‑level correspondences overlooked by lexical methods. Our results demonstrate that semantic search retrieves substantially more implicit receptions than lexical baselines. However, linguistic diagnostics also reveal a "lexical gatekeeping" effect, where retrieval remains partially constrained by surface vocabulary overlap. These findings highlight both the potential and the limitations of semantic retrieval for analyzing the circulation of ideas in large historical corpora. The data is available at https://github.com/COMHIS/locke‑sim‑data.
Authors:Fabio Rovai
Abstract:
We present Open Ontologies, an open‑source ontology engineering system implemented in Rust that integrates LLM‑driven construction with formal OWL reasoning and ontology alignment via the Model Context Protocol. Our primary finding is that stable 1‑to‑1 matching is the dominant factor in ontology alignment quality: on the OAEI Anatomy track, it achieves F1 = 0.832 (P = 0.963, R = 0.733), competitive with state‑of‑the‑art systems and exceeding all in precision. Ablation across five weight configurations shows that signal weights are irrelevant when stable matching is applied (F1 varies by less than 0.004), while removing stable matching drops F1 to 0.728. On the Conference track, the same method achieves F1 = 0.438. On tool‑augmented ontology interaction, we find a surprising result: an LLM reading a raw OWL file (F1 = 0.323) performs worse than the same LLM with no file at all (F1 = 0.431), while structured MCP tool access achieves F1 = 0.717. This demonstrates that tool structure provides a qualitatively different mode of access that the LLM cannot replicate by reading raw syntax. The system ships as a single binary under the MIT licence.
Authors:Yanshi Li, Xueru Bai, Shuman Liu, Haibo Zhang, Anxiang Zeng
Abstract:
Large language models (LLMs) increasingly exhibit behaviors suggesting awareness of their evaluation context, often adapting their reasoning strategies in benchmark settings. Prior work has shown that such evaluation awareness can distort performance measurements; however, it remains unclear whether this phenomenon reflects a single behavioral artifact or a deeper internal structure within the model. We propose that LLMs maintain a decomposable space of functional metacognitive states: internal variables encoding factors such as evaluation awareness, self‑assessed capability, perceived risk, computational effort allocation, audience expertise adaptation, and intentionality. Through residual stream analysis across multiple reasoning models, we demonstrate that these states are linearly decodable from internal activations and exhibit distinct layer‑wise profiles. Moreover, by steering model activations along probe‑derived directions, we show that each functional metacognitive state causally modulates reasoning behavior in dissociable ways, affecting verbosity, accuracy, and safety‑related responses across tasks. Our findings suggest that benchmark performance reflects not only task competence but also the activation of specific functional metacognitive states. We argue that understandi ng and controlling these internal states is essential for reliable evaluation and deployment of reasoning models, and we provide a mechanistic framework for studying functional m etacognition in artificial systems. Our code and data are publicly available at https://github.com/xlands/meta‑cognition.
Authors:Yuzhuang Xu, Xu Han, Yuxuan Li, Pengzhan Li, Wanxiang Che
Abstract:
Large language models (LLMs) achieve strong performance but incur high deployment costs, motivating extremely low‑bit but lossy quantization. Existing quantization algorithms mainly focus on improving the numerical accuracy of forward computation to eliminate performance degradation. In this paper, we show that extremely quantized LLMs suffer from systematic smoothness degradation beyond numerical precision loss. Through a smoothness proxy, we observe that such degradation becomes increasingly severe as the quantization bit‑width decreases. Furthermore, based on sequence neighborhood modeling, we find that quantized models exhibit a rapid reduction of effective token candidates within the prediction neighborhood, which directly leads to a sparser decoding tree and degraded generation quality. To validate it, we introduce a simple smoothness‑preserving principle in both post‑training quantization and quantization‑aware training, and demonstrate that preserving smoothness brings additional gains beyond numerical accuracy. The core goal of this paper is to highlight smoothness preservation as an important design consideration for future extreme quantization methods. Code is available at https://github.com/xuyuzhuang11/FINE.
Authors:Xiang Feng, Jiawei Zhou, Zhangfeng Huang, Kewei Wang, Shanshan Ye, Jinxin Hu, Zulong Chen, Yong Luo, Jing Zhang
Abstract:
Evaluating whether Multimodal Large Language Models can produce trustworthy, verifiable reasoning over long, visually rich documents requires evaluation beyond end‑to‑end answer accuracy. We introduce DocScope, a benchmark that formulates long‑document QA as a structured reasoning trajectory prediction problem: given a complete PDF document and a question, the model outputs evidence pages, supporting evidence regions, relevant factual statements, and a final answer. We design a four‑stage evaluation protocol ‑‑ Page Localization, Region Grounding, Fact Extraction, and Answer Verification ‑‑ that audits each level of the trajectory independently through inter‑stage decoupling, with all judges selected and calibrated via human alignment studies. DocScope comprises 1,124 questions derived from 273 documents, with all hierarchical evidence annotations completed by human annotators. We benchmark 6 proprietary models, 12 open‑weight models, and several domain‑specific systems. Our experiments reveal that answer accuracy cannot substitute for trajectory‑level evaluation: even among correct answers, the highest observed rate of complete evidence chains is only 29%. Across all models, region grounding remains the weakest trajectory stage. Furthermore, the primary difficulty stems from aggregating evidence dispersed across long distances and multiple document clusters, while an oracle study identifies faithful perception and fact extraction as the dominant capability bottleneck. Cross‑architecture comparisons further suggest that activated parameter count matters more than total scale. The benchmark and code will be publicly released at https://github.com/MiliLab/DocScope.
Authors:Feng Xiong, Zengbin Wang, Yong Wang, Xuecai Hu, Jinghan He, Liang Lin, Yuan Liu, Xiangxiang Chu
Abstract:
Self‑evolving agents present a promising path toward continual adaptation by distilling task interactions into reusable knowledge artifacts. In practice, this paradigm remains hindered by two coupled bottlenecks: data inefficiency, where costly rollout effort is disproportionately spent on low‑value samples rather than informative ones, and knowledge interference, where heterogeneous knowledge stored in shared repositories leads to noisy retrieval and task‑misaligned guidance. Together, these issues form a self‑reinforcing failure loop in which uninformative rollouts yield noisy knowledge, which in turn degrades subsequent rollouts. In this work, we introduce Ace‑Skill, a co‑evolutionary framework that jointly optimizes rollout allocation and knowledge organization for self‑evolving multimodal agents. Specifically, Ace‑Skill combines aprioritized sampler with lazy‑decay proficiency tracking to focus rollouts on informative and insufficiently mastered samples, and a clustered organizer that semantically clusters knowledge for cleaner retrieval and more reliable adaptation. By improving sampling and organization together, Ace‑Skill turns self‑evolution into a virtuous cycle in which more informative rollouts produce higher‑quality knowledge that supports stronger subsequent rollouts. Across four multimodal tool‑use benchmarks, Ace‑Skill delivers strong gains (e.g., +35.46% relative improvement in Avg@4 accuracy), enabling an opensource 35B MoE model to match or surpass proprietary models. The acquired knowledge also transfers effectively in a zero‑shot manner to smaller 9B and 4B models, allowing resource‑constrained agents to inherit advanced capabilities without additional training. The code has been publicly available at https://github.com/AMAP‑ML/Ace‑Skill.
Authors:Shota Fujikawa, Issei Sato
Abstract:
Hallucination detection has become increasingly important for improving the reliability of large language models (LLMs). Recently, hybrid approaches such as HaMI, which combine semantic consistency with internal model states via Multiple Instance Learning (MIL), have achieved state‑of‑the‑art performance. However, these methods incur substantial computational overhead due to repeated sampling and costly semantic similarity computations. In this work, we first provide a theoretical analysis of HaMI in terms of decision margins, revealing that scaling internal states with semantic consistency leads to an enlarged decision margin. Motivated by this insight, we revisit classical sentence classification models from a margin enlargement perspective, aggregating token‑level features via max pooling and directly estimating sentence scores using a lightweight MLP. Without requiring semantic consistency computations, our approach achieves substantial efficiency improvements while maintaining competitive performance with state‑of‑the‑art baselines through adaptive aggregation of internal feature representations. Code is available at https://github.com/FUJI1229/Hallucination_Detection.
Authors:Pengze Guo, Jingxi Liang, Zhiwen Xie, Qifeng Wang, Derek F. Wong
Abstract:
In the context of today's high‑pressure, aging society, the demand for large‑scale emotional models capable of providing empathetic support is more critical than ever. However, existing benchmarks fail to simultaneously achieve ecological validity, signal clarity, and reliable fine‑grained labeling. We introduce EmoS, a high‑fidelity bilingual benchmark designed to resolve the limitations of ecological validity and noise in existing datasets by combining strictly filtered static slices with a dynamic Streaming Monologue subset. Supported by a rigorous dual‑layer human annotation pipeline, EmoS provides trusted ground truth that captures continuous emotional evolution. Empirical results show that fine‑tuning MLLMs (multimodal large language models) on EmoS yields significant gains over zero‑shot baselines, laying the foundation for the training and evaluation of future emotion recognition models and empathy models. The dataset and code are publicly available at https://github.com/NLP2CT/EmoS.
Authors:Yongqi An, Chang Lu, Kuan Zhu, Tao Yu, Chaoyang Zhao, Hong Wu, Ming Tang, Jinqiao Wang
Abstract:
Large language models (LLMs) face growing challenges in efficient generative inference due to the increasing memory demands of Key‑Value (KV) caches, especially for long sequences. Existing eviction methods typically retain KV pairs with high attention weights but overlook the impact of attention redistribution caused by token removal, as well as the spatial‑temporal dynamics in KV selection. In this paper, we propose ReST‑KV, a robust KV eviction method that combines layer‑wise output Reconstruction and Spatial‑Temporal smoothing to provide a more comprehensive perspective for the KV cache eviction task. Specifically, ReST‑KV formulates KV cache eviction as an optimization problem that minimizes output discrepancies through efficient layer‑wise reconstruction. By directly modeling how each token's removal affects the model output, our method naturally captures attention redistribution effects, going beyond simplistic reliance on raw attention weights. To further enhance robustness, we design exponential moving average smoothing to handle temporal variations and an adaptive window‑based mechanism to capture spatial patterns. Our method, ReST‑KV, significantly advances performance on long‑context benchmarks. It surpasses state‑of‑the‑art baselines by 2.58% on LongBench and 15.2% on RULER. Additionally, ReST‑KV consistently outperforms existing methods on Needle‑in‑a‑Haystack and InfiniteBench, all while achieving a remarkable 10.61× reduction in decoding latency at 128k context length. The code is publicly available at https://github.com/an‑yongqi/rest‑kv to facilitate reproducibility and further research.
Authors:Zhengyang Zhao, Lu Ma, Wentao Zhang
Abstract:
Inference‑time harnesses substantially improve large language models on complex reasoning tasks. However, the intrinsic capabilities of the underlying model remain unchanged by the addition of these external workflows. To bridge this gap, we introduce \emphOn‑Policy Harness Self‑Distillation (OPHSD), which employs the harness‑augmented current model as a teacher for self‑distillation, thereby introducing extra supervisory signals from the harness beyond training data. OPHSD internalizes task‑specific harness capabilities into the student model, yielding robust generalizability and strong standalone performance across diverse reasoning tasks. Evaluated across draft‑‑verify harness for text classification and plan‑‑solve for mathematical reasoning tasks, OPHSD consistently outperforms strong baselines (e.g., +10.83% over OPSD on HMMT25). Our analysis further indicates that reattaching the harness during inference yields no additional benefits and can even degrade performance, suggesting that complex harnesses need not always be permanent fixtures; instead, they can serve as temporary training scaffolds whose benefits are permanently fed back into the base model. Our code and training data are available at https://github.com/zzy1127/OPHSD‑On‑Policy‑Harness‑Self‑Distillation.
Authors:Boxuan Zhang, Jianing Zhu, Zeru Shi, Dongfang Liu, Ruixiang Tang
Abstract:
LLM‑based multi‑agent systems are increasingly deployed on long‑horizon tasks, but a single decisive error is often accepted by downstream agents and cascades into trajectory‑level failure. Existing work frames this as \emphpost‑hoc failure attribution, diagnosing the responsible agent and step after the trajectory has ended. However, this paradigm forfeits any opportunity to intervene while trajectory is still unfolding. In this work, we introduce AgentForesight, a framework that reframes this problem as online auditing: at each step of an unfolding trajectory, an auditor observes only the current prefix and must either continue the run or alarm at the earliest decisive error, without access to future steps. To this end, we curate AFTraj‑2K, a corpus of agentic trajectories across Coding, Math, and Agentic domains, in which safe trajectories are retained under a strict curation pipeline and unsafe trajectories are annotated at the step of their decisive error via consensus among multiple LLM judges. Built on that, we develop AgentForesight‑7B, a compact online auditor trained with a coarse‑to‑fine reinforcement learning recipe that first equips it with a risk‑anticipation prior at the failure boundary on adjacent safe/unsafe prefix pairs, then sharpens this prior into precise step‑level localization under a three‑axis reward jointly targeting the what, where, and who of an audit verdict. Across AFTraj‑2K and an external Who\&When benchmark, AgentForesight‑7B outperforms leading proprietary models, including GPT‑4.1 and DeepSeek‑V4‑Pro, achieving up to +19.9% performance gain and 3× lower step localization error, opening the loop from post‑hoc failures detection to enabling deployment‑time intervention. Project page: https://zbox1005.github.io/agent‑foresight/
Authors:Zihao An, Taichi Liu, Ziqiong Liu, Dong Li, Ruofeng Liu, Emad Barsoum
Abstract:
Speculative decoding accelerates Large Language Models (LLMs) inference by using a lightweight draft model to propose candidate tokens that are verified in parallel by the target model. However, existing draft model training objectives are not directly aligned with the inference‑time goal of maximizing consecutive token acceptance. To address this issue, we reformulate the draft model optimization objective, shifting the focus from token prediction accuracy to the overall acceptance length. In this paper, we build upon PARD to propose PARD‑2, a dual‑mode speculative decoding framework with Confidence‑Adaptive Token (CAT) optimization. This approach adaptively reweights each token to better align with the verification process. Notably, PARD‑2 enables a single draft model to support both target‑dependent and target‑independent modes. Experiments across diverse models and tasks demonstrate that PARD‑2 achieves up to 6.94× lossless acceleration, surpassing EAGLE‑3 by 1.9× and PARD by 1.3× on Llama3.1‑8B. Our code is available at https://github.com/AMD‑AGI/PARD.
Authors:Mingzhe Li, Zhiqiang Lin, Shiqing Ma
Abstract:
Large language models are increasingly used in scientific writing, yet they can fabricate citation‑shaped references that appear plausible but fail bibliographic verification. Existing detectors often reduce verification to binary found/not‑found decisions and rely on brittle parsing or incomplete retrieval, offering little field‑level signal to auditors. We reframe citation hallucination detection as taxonomy‑aligned field‑level adjudication and introduce a 12‑code taxonomy spanning Real, Potential, and Hallucinated citations. Based on this taxonomy, we build CiteTracer, a cascading multi‑agent detector that extracts structured citations from PDF and BibTeX, retrieves evidence through cache lookup, URL fetch, scholar connectors, and web search, applies deterministic field matching, and routes ambiguous cases to class‑specialist judgers. We release a benchmark of 2,450 synthetic citations built from real seeds with controlled LLM mutations, paired with 957 real‑world fabricated citations drawn from ICLR 2026 and an anonymous conference desk‑rejected submissions. CiteTracer reaches 97.1% accuracy on the synthetic benchmark, with class‑level F1 scores of 97.0, 95.8, and 98.5 for Real, Potential, and Hallucinated, respectively, and detects 97.1% of fabrications on the real‑world set without abstaining. Code: https://github.com/aaFrostnova/CiteTracer.
Authors:Siyu Wu, Yulong Ye, Zezhen Xiang, Pengzhou Chen, Gangda Xiong, Tao Chen
Abstract:
Large Language Model (LLM) systems have been the frontier of AI in many application domains, leading to new challenges and opportunities for hyperparameter optimization (HPO) for the AutoML community. However, this type of system exhibits an unprecedented compound space of hyperparameter configuration from both the AI and non‑AI components; rich and nonlinear implications from the fidelity factors; and diverse costs of measuring hyperparameter configurations, none of which have been fully captured in existing benchmarks. This paper presents the first (live) benchmark suite and datasets for HPO of real‑world LLM systems, dubbed LLMSYS‑HPOBench, covering data related to the inference objective values of hyperparameter configurations profiled from running the LLM systems. Currently, LLMSYS‑HPOBench contains 364,450 hyperparameter configurations with a dimensionality of 12‑23, 3‑5 dimensions of fidelity factor leading to 932 settings, 3‑9 inference objective metrics, and 2‑10 cost metrics, together with generated logs from measuring the LLM systems. What we seek to advocate is not only a revalidation of the existing HPO algorithms over the frontier LLM systems, but also to provide an evolving platform for the AutoML community to explore new directions of research in this regard. The benchmark suite has been made available at: https://github.com/ideas‑labo/llmsys‑hpobench
Authors:Abdulvahap Mutlu, Şengül Doğan, Türker Tuncer
Abstract:
Manifold‑Constrained Hyper‑Connections (mHC) introduce a stability‑motivated variant of multi stream residual mixing by constraining residual stream mixing matrices to the manifold of doubly stochastic matrices via Sinkhorn‑Knopp projection. In his work, we study whether mHC‑style constrained multi‑stream residual topology transfers effectively to state space model (SSM) language modeling. We implement a static mHC mechanism around an SSM block by expanding the residual stream into multiple parallel streams, aggregating streams into a single SSM input through simplex‑constrained pre‑mixing, scattering the SSM output back to streams through simplex‑constrained post‑mixing, and applying Sinkhorn‑projected residual stream mixing at each layer. We further introduce stream‑specialized adapters that add lightweight stream‑specific capacity through a shared bottleneck with per‑stream scaling, applied both before stream aggregation and after the SSM output prior to scattering. We evaluate baseline single‑stream SSM, static mHC SSM, and mHC SSM with adapters on WikiText‑2 using identical training settings and report checkpoint‑based validation loss, perplexity, throughput, and peak GPU memory. Under the reported fair checkpoint evaluation, static mHC improves validation loss from 6.3507 to 6.2448 and reduces perplexity from 572.91 to 515.35, while mHC with adapters further improves validation loss to 6.1353 and perplexity to 461.88. These gains are accompanied by modest throughput reductions from 1025.52 to 964.81 and 938.90 tokens per second, and increased peak memory from 2365 MB to 2568 MB and 3092 MB. The results suggest that mHC‑inspired constrained multi‑stream residual mixing can yield measurable quality improvements in SSM language models and that stream‑specialized adapter capacity can further enhance performance with predictable efficiency tradeoffs.
Authors:Xincheng Yao, Ruoqi Li, Cheng Chen, Daoxin Zhang, Yi Wu, Yao Hu, Chongyang Zhang
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a pivotal technique for enhancing the reasoning capabilities of Large Language Models (LLMs). However, the de facto practice of mainstream RL algorithms is to treat all tokens of one response equally and assign the same optimization objective to each token, failing to provide granular guidance for the reasoning process. While in Chain‑of‑Thought (CoT) reasoning, different tokens usually play distinct roles. Therefore, the current RL algorithms lack an effective mechanism to dynamically balance the exploration‑exploitation trade‑off during learning. To this end, we propose Hierarchical Token‑level Objective Control Policy Optimization (HTPO), a novel RL algorithm that takes the divide‑and‑conquer idea to hierarchically partition the response tokens into specific functional groups from three aspects (i.e., prompt difficulty, answer correctness, and token entropy). Within each group, according to the contributions to exploration or exploitation, we design specialized optimization objectives to facilitate the effective execution of each token's expected functionality. In this way, HTPO can achieve a more balanced exploration‑exploitation trade‑off. Extensive experiments on challenging reasoning benchmarks validate the superiority of our HTPO algorithm, which significantly outperforms the strong DAPO baseline (e.g., +8.6% and +6.7% on AIME'24 and AIME'25, respectively). When scaling test‑time compute, the HTPO‑trained model maintains a consistent performance advantage over the DAPO baseline, and the gap widens as the sampling budget increases, validating that our adaptive token‑level control method fosters effective exploration without sacrificing exploitation performance. Code will be at https://github.com/xcyao00/HTPO.
Authors:Tong Zheng, Haolin Liu, Chengsong Huang, Huiwen Bao, Sheng Zhang, Rui Liu, Runpeng Dai, Ruibo Chen, Chenxi Liu, Tianyi Xiong, Xidong Wu, Hongming Zhang, Heng Huang
Abstract:
Test‑time scaling (TTS) has become an effective approach for improving large language model performance by allocating additional computation during inference. However, existing TTS strategies are largely hand‑crafted: researchers manually design reasoning patterns and tune heuristics by intuition, leaving much of the computation‑allocation space unexplored. We propose an environment‑driven framework, AutoTTS, that changes what researchers design: from individual TTS heuristics to environments where TTS strategies can be discovered automatically. The key to AutoTTS lies in environment construction: the discovery environment must make the control space tractable and provide cheap, frequent feedback for TTS search. As a concrete instantiation, we formulate width‑‑depth TTS as controller synthesis over pre‑collected reasoning trajectories and probe signals, where controllers decide when to branch, continue, probe, prune, or stop and can be evaluated cheaply without repeated LLM calls. We further introduce beta parameterization to make the search tractable and fine‑grained execution trace feedback to improve discovery efficiency by helping the agent diagnose why a TTS program fails. Experiments on mathematical reasoning benchmarks show that the discovered strategies improve the overall accuracy‑‑cost tradeoff over strong manually designed baselines. The discovered strategies generalize to held‑out benchmarks and model scales, while the entire discovery costs only 39.9 and 160 minutes. Our data, and code will be open‑source at https://github.com/zhengkid/AutoTTS.
Authors:Hexuan Deng, Xiaopeng Ke, Yichen Li, Ruina Hu, Dehao Huang, Derek F. Wong, Yue Wang, Xuebo Liu, Min Zhang
Abstract:
Despite the rapid development of AI reviewers, evaluating such systems remains challenging: metrics favor overlap with human reviews over correctness. However, since human reviews often cover only a subset of salient issues and sometimes contain mistakes, they are unreliable as gold references. To address this, we build category‑specific benchmark subsets and skip evaluation when the corresponding human reviews are missing to strengthen Completeness. We also leverage reviewer‑‑author‑‑meta‑review discussions as expert annotations and filter unreliable reviews accordingly to strengthen Correctness. Finally, we introduce CoCoReviewBench, which curates 3,900 papers from ICLR and NeurIPS to enable reliable and fine‑grained evaluation of AI reviewers. Analysis shows that AI reviewers remain limited in correctness and are prone to hallucinations, and highlights reasoning models as more effective reviewers, motivating further directions for improving AI reviewers. Benchmarks and models are available at https://github.com/hexuandeng/CoCoReviewBench.
Authors:Ionut-Vlad Modoranu, Mher Safaryan, Dan Alistarh
Abstract:
With the rise in scale for deep learning models to billions of parameters, the computational cost of fine‑tuning remains a significant barrier to deployment. While Low‑Rank Adaptation (LoRA) has become the standard for parameter‑efficient fine‑tuning, the need to set a predefined, static rank r requires exhaustive grid searches to balance efficiency and performance. Existing rank‑adaptive solutions such as DyLoRA mitigate this by sampling ranks during the training from a predefined distribution. However, they often yield sub‑optimal results at higher ranks due to lack of consistent gradient signals across the full hierarchy of ranks, thus making these methods data‑inefficient. In this paper, we propose MatryoshkaLoRA, a general, Matryoshka‑inspired training framework for LoRA that learns accurate hierarchical low‑rank representations by inserting a fixed, carefully crafted diagonal matrix P between the existing LoRA adapters to scale their sub‑ranks accordingly. By introducing this simple modification, our general framework recovers LoRA and DyLoRA only by changing P and ensures all sub‑ranks embed the available gradient information efficiently. Our MatryoshkaLoRA supports dynamic rank selection with minimal degradation in accuracy. We further propose Area Under the Rank Accuracy Curve (AURAC), a metric that consistently evaluates the performance of hierarchical low‑rank adapters. Our results demonstrate that MatryoshkaLoRA learns more accurate hierarchical low‑rank representations than prior rank‑adaptive approaches and achieves superior accuracy‑performance trade‑offs across ranks on the evaluated datasets. Our code is available at https://github.com/IST‑DASLab/MatryoshkaLoRA.
Authors:Qiyong Zhong, Mao Zheng, Mingyang Song, Xin Lin, Jie Sun, Houcheng Jiang, Xiang Wang, Junfeng Fang
Abstract:
Tool‑integrated reasoning (TIR) is difficult to scale to small language models due to instability in long‑horizon tool interactions and limited model capacity. While reinforcement learning methods like group relative policy optimization provide only sparse outcome‑level rewards. Recently, on‑policy distillation (OPD) has gained popularity by supplying dense token‑level supervision from a teacher on student‑generated trajectories. However, our experiments indicate that applying OPD to TIR leads to a critical failure mode: erroneous tool calls tend to cascade across subsequent reasoning steps, progressively amplifying student‑teacher divergence and rendering the teacher's token‑level supervision increasingly unreliable. To address this, we propose SOD, a step‑wise on‑policy distillation framework for small language model agents, which adaptively reweights distillation strength at each step based on step‑level divergence. Therefore, SOD can attenuate potentially misleading teacher signals in high‑divergence regions while preserving dense guidance in well‑aligned states. Experiments on challenging math, science, and code benchmarks show that SOD achieves up to 20.86% improvement over the second‑best baseline. Notably, our 0.6B student achieves 26.13% on AIME 2025, demonstrating effective transfer of agentic reasoning to lightweight models. Our code is available at https://github.com/YoungZ365/SOD.
Authors:Jie Sun, Mao Zheng, Mingyang Song, Qiyong Zhong, Yilin Cheng, Bichuan Feng, Pengfei Liu, Junfeng Fang, Xiang Wang
Abstract:
On‑policy distillation (OPD) is a standard tool for transferring teacher behavior to a smaller student, but it implicitly assumes that teacher and student predictions are comparable token by token, an assumption that fails whenever the two models tokenize the same text differently. Under heterogeneous tokenizers, exact shared‑token matching silently discards a large fraction of the teacher signal at precisely the positions where vocabularies disagree. We propose \underlineSimple \underlineCross‑\underlineTokenizer OPD (SimCT), which restores this signal by enlarging the supervision space: alongside shared tokens, SimCT compares teacher and student over short multi‑token continuations that both tokenizers can realize, leaving the OPD loss form itself unchanged. We show that these units are the finest jointly tokenizable supervision interface, and that coarser alternatives remove teacher‑student distinctions that are useful for on‑policy learning. Across three heterogeneous teacher‑student pairs on mathematical reasoning and code‑generation benchmarks, SimCT shows consistent gains over shared‑vocabulary OPD and representative cross‑tokenizer baselines, with ablations confirming that the improvements come from recovering supervision discarded by exact shared‑token matching. Code is available at \hrefhttps://github.com/sunjie279/SimCT‑https://github.com/sunjie279/SimCT‑.
Authors:Naoto Iwase, Yuki Ichihara, Mohammad Atif Quamar, Junpei Komiyama
Abstract:
Large Language Models often improve accuracy on reasoning tasks by sampling multiple Chain‑of‑Thought (CoT) traces and aggregating them with majority voting (MV), a test‑time technique called self‑consistency. When we truncate a CoT partway through and regenerate the remainder, we observe that traces with correct answers reproduce their original answer more often than traces with wrong answers. We use this difference as a reliability signal, prefix consistency, that weights each candidate answer by how often it reappears under regeneration. It requires no access to token log‑probabilities or self‑rating prompts. Across five reasoning models and four math and science benchmarks, prefix consistency is the best correctness predictor in most settings, and reweighting votes by it reaches Standard MV plateau accuracy at up to 21x fewer tokens (median 4.6x). Our code is available at https://github.com/naoto‑iwase/prefix‑consistency.
Authors:Bohan Hou, Jiuning Gu, Jiayan Guo, Ronghao Dang, Sicong Leng, Xin Li, Xuemeng Song, Jianfei Yang
Abstract:
Existing benchmarks for multimodal agentic search evaluate multimodal search and visual browsing, but visual evidence is either confined to the input or treated as an answer endpoint rather than part of an interleaved search trajectory. We introduce InterLV‑Search, a benchmark for Interleaved Language‑Vision Agentic Search, in which textual and visual evidence is repeatedly used to condition later search. It contains 2,061 examples across three levels: active visual evidence seeking, controlled offline interleaved multimodal search, and open‑web interleaved multimodal search. Beyond existing benchmarks, it also includes multimodal multi‑branch samples that involve comparison between multiple entities during the evidence search. We construct Level 1 and Level 2 with automated pipelines and Level 3 with a machine‑led, human‑supervised open‑web pipeline. We further provide InterLV‑Agent for standardized tool use, trajectory logging, and evaluation. Experiments on proprietary and open‑source multimodal agents show that current systems remain far from solving interleaved multimodal search, with the best model below 50% overall accuracy, highlighting challenges in visual evidence seeking, search control, and multimodal evidence integration. We release the benchmark data and evaluation code at https://github.com/hbhalpha/InterLV‑Search‑Bench
Authors:Qingyu Ren, Qianyu He, Jiajie Zhu, Xingzhou Chen, Jingwen Chang, Zeye Sun, Han Xia, Fei Yu, Jiaqing Liang, Yanghua Xiao
Abstract:
Instruction following is a fundamental capability of large language models (LLMs), yet continuously improving this capability remains challenging. Existing methods typically rely either on costly external supervision from humans or strong teacher models, or on self‑play training with static‑difficulty instructions that cannot evolve as the model's capabilities improve. To address these limitations, we propose SEIF (Self‑Evolving Reinforcement Learning for Instruction Following), a self‑evolving framework for enhancing the instruction‑following ability of LLMs. SEIF forms a closed self‑evolution loop that improves the model's instruction‑following ability, where instruction difficulty evolution and model capability evolution reinforce each other. SEIF consists of four roles: an Instructor that generates increasingly challenging instructions, a Filter that removes conflicting or invalid instructions to ensure data quality, a Follower that learns to follow evolved instructions, and a Judger that provides reward signals for reinforcement learning. The Instructor and Follower are alternately trained and co‑evolve throughout the process. Experiments across multiple model scales and architectures show that SEIF consistently improves instruction‑following performance, suggesting strong generality. Further analyses reveal the sources of improvement and identify an effective training strategy for self‑evolution on open‑ended tasks: sufficient early‑stage training to build a solid foundation, followed by moderate late‑stage training to mitigate overfitting and achieve better final performance. The code and data are publicly available at https://github.com/Rainier‑rq1/SEIF.
Authors:Xuan Li, Yining Wang, Yuchen Liu, Guanjun Liu, Delai Qiu, Shengping Liu, Jiaen Liang, Wei Huang, Jun Yu, Junnan Zhu
Abstract:
Chain‑of‑thought (CoT) reasoning improves large language models (LLMs) on difficult tasks, but it also makes inference expensive because every intermediate step must be generated as a discrete token. Latent reasoning reduces visible token generation by propagating continuous states, yet replacing explicit derivations with latent computation can hurt tasks that require symbolic checking. We propose Latent‑Then‑Explicit Reasoning (LaTER), a two‑stage paradigm that first performs bounded exploration in a continuous latent space and then switches to explicit CoT for verification and answer generation. In a training‑free instantiation, LaTER projects final‑layer hidden states back to the input embedding space, preserves the latent KV cache, and uses entropy and model‑native stop‑token probes to decide when to switch. We find that strong reasoning models already exhibit structured latent trajectories under this interface. On Qwen3‑14B, training‑free LaTER reduces total token usage by 16%‑32% on several benchmarks while matching or improving accuracy on most of them; for example, it improves AIME 2025 from 70.0% to 73.3% while reducing tokens from 15,730 to 10,661. We further construct Latent‑Switch‑69K, a supervised corpus that pairs condensed solution intuitions with shortened explicit derivations. Fine‑tuning with latent rollout and halting supervision yields additional gains: trained LaTER reaches 80.0% accuracy on AIME 2025, 10.0 points above the standard CoT baseline, while using 33% fewer tokens. Our code, data, and model are available at https://github.com/TioeAre/LaTER.
Authors:Shuai Wang, Yin Yu, Shengyao Zhuang, Bevan Koopman, Guido Zuccon
Abstract:
PromptReps showed that an autoregressive language model can be used directly as a retriever by prompting it to generate dense and sparse representations of a query or passage. Extending this to multiple representatives is inefficient for autoregressive models, since tokens must be generated sequentially, and prior multi‑token variants did not reliably improve over single‑token decoding. We show that the bottleneck is sequential generation, not the multi‑token idea itself. DiffRetriever is a representative‑token retriever for diffusion language models: it appends K masked positions to the prompt and reads all K in a single bidirectional forward pass. Across in‑domain and out‑of‑domain evaluation, multi‑token DiffRetriever substantially improves over single‑token on every diffusion backbone we test, while autoregressive multi‑token is flat or negative and pays a latency cost that scales with K where diffusion does not. After supervised fine‑tuning, DiffRetriever on Dream is the strongest BEIR‑7 retriever in our comparison, ahead of PromptReps, the encoder‑style DiffEmbed baseline on the same diffusion backbones, and the contrastively fine‑tuned single‑vector RepLLaMA. A per‑query oracle on the frozen base model exceeds contrastive fine‑tuning at the same fixed budget, pointing to adaptive budget selection as future work. Code is available at https://github.com/ielab/diffretriever.
Authors:Akshita Singh, Prabesh Paudel, Siddhartha Roy
Abstract:
We introduce a proxy‑analyzer framework for detecting hallucinations in large language models. Instead of looking inside the generating model, our system reads already‑generated text through a small locally hosted open‑weight model and spots hallucinations using the reader's own internal activations. This works just as well when the generator is a closed API like GPT‑4 as when it is any open‑weight model. We built eighteen features grounded in how transformers process text, covering residual stream norms, per‑head source‑document attention, entropy, MLP activations, logit‑lens trajectories, and three new token‑level grounding statistics. We trained a stacking ensemble on 72,135 samples from five hallucination datasets. We tested across seven analyzer architectures from 0.5 billion to 9 billion parameters: Qwen2.5 at 0.5B and 7B, Gemma‑2 at 2B and 9B, Pythia at 1.4B, and LLaMA‑3 at both 3B and 8B. Across all seven, we consistently beat ReDeEP's token‑level AUC of 0.73 on RAGTruth by 7.4 to 10.3 percentage points. Qwen2.5‑7B reached an F1 of 0.717, just above ReDeEP's 0.713, while Qwen2.5‑0.5B hit 0.706. The most striking finding is how tightly all seven models cluster: AUC spans only 2.3 percentage points across an eighteen‑fold difference in model size. Even more surprising, our 3B LLaMA outperforms our 8B LLaMA on RAGTruth, showing that bigger is not always better even within the same model family. Both RAGTruth and LLM‑AggreFact include outputs from multiple LLM families, so our results are not skewed toward any particular generator.
Authors:Maximillian Chen, Xuanming Zhang, Michael Peng, Zhou Yu, Alexandros Papangelis, Yohan Jo
Abstract:
The rise of Internet of Things (IoT) devices in the physical world necessitates voice‑based interfaces capable of handling complex user experiences. While modern Large Language Models (LLMs) already demonstrate strong tool‑usage capabilities, modeling real‑world IoT devices presents a difficult, understudied challenge which combines modeling spatiotemporal constraints with speech inputs, dynamic state tracking, and mixed‑initiative interaction patterns. We introduce MIST (the Multimodal Interactive Speech‑based Tool‑calling Dataset), a synthetic multi‑turn, voice‑driven code generation task that operates over IoT devices. We find that there is a significant gap between open‑ and closed‑weight multimodal LLMs on MIST, and that even frontier closed‑weight LLMs have substantial headroom. We release MIST and an extensible data generation framework to build related datasets in order to facilitate research on mixed‑initiative voice assistants which reason about physical world constraints.
Authors:Yuwei Yin, Chuyuan Li, Giuseppe Carenini
Abstract:
Accurately understanding the intent behind speech, conversation, and writing is crucial to the development of helpful Large Language Model (LLM) assistants. This paper introduces IntentGrasp, a comprehensive benchmark for evaluating the intent understanding capability of LLMs. Derived from 49 high‑quality, open‑licensed corpora spanning 12 diverse domains, IntentGrasp is constructed through source datasets curation, intent label contextualization, and task format unification. IntentGrasp contains a large‑scale training set of 262,759 instances and two evaluation sets: an All Set of 12,909 test cases and a more balanced and challenging Gem Set of 470 cases. Extensive evaluations on 20 LLMs across 7 families (including frontier models such as GPT‑5.4, Gemini‑3.1‑Pro, and Claude‑Opus‑4.7) demonstrate unsatisfactory performance, with scores below 60% on All Set and below 25% on Gem set. Notably, 17 out of 20 tested models perform worse than a random‑guess baseline (15.2%) on Gem Set, while the estimated human performance is ~81.1%, showing substantial room for improvement. To enhance such ability, this paper proposes Intentional Fine‑Tuning (IFT), which fine‑tunes the models on the training set in IntentGrasp, yielding significant gains of 30+ F1 points on All Set and 20+ points on Gem Set. Tellingly, the leave‑one‑domain‑out (Lodo) experiments further demonstrate the strong cross‑domain generalizability of IFT, verifying that it is a promising approach to substantially enhancing the intent understanding of LLMs. Overall, by benchmarking and boosting intent understanding ability, this study sheds light on a promising path towards more intentional, capable, and safe AI assistants for human benefits and social good.
Authors:Jiacheng Xu, Heting Gao, Liufei Xie, Zhenchuan Yang, Lijiang Li, Yiting Chen, Bin Zhang, Meng Chen, Chaoyu Fu, Weifeng Zhao, Wenjiang Zhou
Abstract:
Human speech conveys expressiveness beyond linguistic content, including personality, mood, or performance elements, such as a comforting tone or humming a song, which we formalize as role‑playing and singing. We present VITA‑QinYu, the first expressive end‑to‑end (E2E) spoken language model (SLM) that goes beyond natural conversation to support both role‑playing and singing generation. VITA‑QinYu adopts a hybrid speech‑text paradigm that extends interleaved text‑audio modeling with multi‑codebook audio tokens, a design enabling richer paralinguistic representation while preserving a clear separation between modalities to avoid interference. We further develop a comprehensive data generation pipeline to synthesize a total of 15.8K hours of natural conversation, role‑playing, and singing data for training. VITA‑QinYu demonstrates superior expressiveness, outperforming peer SLMs by 7 percentage points on objective role‑playing benchmarks, and surpassing peer models by 0.13 points on a 5‑point MOS scale for singing. Simultaneously, it achieves state‑of‑the‑art conversational accuracy and fluency, exceeding prior SLMs by 1.38 and 4.98 percentage points on the C3 and URO benchmarks, respectively. We open‑source our code and models and provide an easy‑to‑use demo with full‑stack support for streaming and full‑duplex interaction.
Authors:Jon-Paul Cacioli
Abstract:
Aggregate metacognitive quality scores mask within‑model variation across MMLU benchmark domains. We administered 1,500 MMLU items (250 per domain, under an a priori six‑domain grouping) to 33 frontier LLMs from eight model families and computed Type‑2 AUROC per model‑domain cell using verbalized confidence (0‑100). Total observations: 47,151. Every model with above‑chance aggregate monitoring showed non‑trivial domain‑level variation. Applied/Professional knowledge was reliably the easiest benchmark domain to monitor (mean AUROC = .742, ranked top‑2 in 21 of 33 models); Formal Reasoning and Natural Science were reliably the hardest (one of the two ranked bottom‑2 in 27 of 33 models). The three middle domains were statistically indistinguishable (Kendall's W = .164). A subject‑level coherence analysis (within‑domain similarity ratio = 0.95) confirms the six‑domain grouping is a pragmatic benchmark taxonomy, not a validated latent construct. Within‑family profile‑shape clustering is significant for Anthropic, Google‑Gemini, and Qwen (permutation p < .0001) but not DeepSeek, Google‑Gemma, or OpenAI. Gemma 4 31B showed a +.202 AUROC improvement over Gemma 3 27B. Three models classified Invalid on binary KEEP/WITHDRAW probes produced normal profiles under verbalized confidence, confirming probe‑format specificity. Bootstrap 95% CIs on 198 cells have median width .199. Split‑half aggregate stability r = .893; profile‑level split‑half is weaker (grand median r = .184). These results show stable benchmark‑domain variation obscured by aggregate metrics, and support benchmark‑stage domain screening as a step before deployment in specific application areas.
Authors:Zhexuan Wang, Xuebo Liu, Li Wang, Zifei Shan, Yutong Wang, Zhenxi Song, Min Zhang
Abstract:
Large language model (LLM)‑based Multi‑agent systems (MAS) have shown promise in tackling complex collaborative tasks, where agents are typically orchestrated via role‑specific prompts. While the quality of these prompts is pivotal, jointly optimizing them across interacting agents remains a non‑trivial challenge, primarily due to the misalignment between local agent objectives and holistic system goals. To address this, we introduce MASPO, a novel framework designed to automatically and iteratively refine prompts across the entire system. A core innovation of MASPO is its joint evaluation mechanism, which assesses prompts not merely by their local validity, but by their capacity to facilitate downstream success for successor agents. This effectively bridges the gap between local interactions and global outcomes without relying on ground‑truth labels. Furthermore, MASPO employs a data‑driven evolutionary beam search to efficiently navigate the high‑dimensional prompt space. Extensive empirical evaluations across 6 diverse tasks demonstrate that MASPO consistently outperforms state‑of‑the‑art prompt optimization methods, achieving an average accuracy improvement of 2.9. We release our code at https://github.com/wangzx1219/MASPO.
Authors:Yiqiao Jin, Yiyang Wang, Lucheng Fu, Yijia Xiao, Yinyi Luo, Haoxin Liu, B. Aditya Prakash, Josiah Hester, Jindong Wang, Srijan Kumar
Abstract:
Self‑distillation (SD) offers a promising path for adapting large language models (LLMs) without relying on stronger external teachers. However, SD in autoregressive LLMs remains challenging because self‑generated trajectories are free‑form, correctness is task‑dependent, and plausible rationales can still provide unstable or unreliable supervision. Existing methods mainly examine isolated design choices, leaving their effectiveness, roles, and interactions unclear. In this paper, we propose UniSD, a unified framework to systematically study self‑distillation. UniSD integrates complementary mechanisms that address supervision reliability, representation alignment, and training stability, including multi‑teacher agreement, EMA teacher stabilization, token‑level contrastive learning, feature matching, and divergence clipping. Across six benchmarks and six models from three model families, UniSD reveals when self‑distillation improves over static imitation, which components drive the gains, and how these components interact across tasks. Guided by these insights, we construct UniSDfull, an integrated pipeline that combines complementary components and achieves the strongest overall performance, improving over the base model by +5.4 points and the strongest baseline by +2.8 points. Extensive evaluation highlights self‑distillation as a practical and steerable approach for efficient LLM adaptation without stronger external teachers.
Authors:Diego Rossini, Lonneke van der Plas
Abstract:
We present a scalable, modular pipeline for automatic neologism detection that combines rule‑based filtering with LLM classification. The pipeline is grounded in two complementary word‑formation frameworks, grammatical and extra‑grammatical morphology, which jointly define the scope of what counts as a neologism and inform a four‑class classification scheme (neologism, entity, foreign, none). While designed to be modular and transferable at the architectural level, the pipeline is instantiated on 527 million English‑language Reddit posts spanning 2005‑2024. From this corpus, we extract 124.6 million unique tokens and reduce them by over 99.99% to yield 1,021 neologism candidates, a set small enough for manual expert verification. Multiple LLMs independently classify each candidate via majority vote, with a final verification step, revealing substantial cross‑model disagreement and highlighting the challenge of operationalizing neologism detection at scale. Manual annotation of all 1,021 candidates confirms that 599 (58.7%) are genuine lexical innovations. The pipeline code, vocabulary compilation scripts, and the annotated candidate list are available at https://github.com/DiegoRossini/neologism‑pipeline.
Authors:Guanrou Yang, Tian Tan, Qian Chen, Zhikang Niu, Yakun Song, Ziyang Ma, Yushen Chen, Zeyu Xie, Tianrui Wang, Yifan Yang, Wenxi Chen, Qi Chen, Wenrui Liu, Shan Yang, Xie Chen
Abstract:
Integrating speech understanding and generation is a pivotal step toward building unified speech models. However, the different representations required for these two tasks currently pose significant compatibility challenges. Typically, semantics‑oriented features are learned from self‑supervised learning (SSL), and acoustic‑oriented features from reconstruction. Such fragmented representations hinder the realization of truly unified speech systems. We present WavCube, a compact continuous latent derived from an SSL speech encoder that simultaneously supports speech understanding, reconstruction, and generation. WavCube employs a two‑stage training scheme. Stage 1 trains a semantic bottleneck to filter off‑manifold redundancy that makes raw SSL features intractable for diffusion. Stage 2 injects fine‑grained acoustic details via end‑to‑end reconstruction, while a semantic anchoring loss ensures the representation remains grounded within its original semantic manifold. Comprehensive experiments show that WavCube closely approaches WavLM performance on SUPERB despite an 8x dimensional compression, attains reconstruction quality on par with existing acoustic representations, delivers state‑of‑the‑art zero‑shot TTS performance with markedly faster training convergence, and excels in speech enhancement, separation, and voice conversion tasks on the SUPERB‑SG benchmark. Systematic ablations reveal that WavCube's two‑stage recipe resolves two intrinsic flaws of SSL features for generative modeling, paving the way for future unified speech systems. Codes and checkpoints are available at https://github.com/yanghaha0908/WavCube.
Authors:Qihang Fan, Huaibo Huang, Zhiying Wu, Bingning Wang, Ran He
Abstract:
As large language models (LLMs) continue to advance rapidly, they are becoming increasingly capable while simultaneously demanding ever‑longer context lengths. To improve the inference efficiency of long‑context processing, several novel low‑complexity hybrid architectures have recently been proposed, effectively alleviating the computational burden of long‑context inference. However, existing research on long‑context prefill acceleration remains predominantly focused on sparse attention mechanisms, which achieve their maximum speedup only on full‑attention models. When transferred to emerging architectures‑‑such as linear/full attention hybrids or sliding window/full attention hybrids‑‑these prefill acceleration approaches suffer significant performance degradation. Furthermore, such methods are generally incompatible with continuous batching, making them difficult to integrate into modern inference engines such as vLLM. To this end, we propose UniPrefill, a prefill acceleration framework applicable to virtually any model architecture, which directly accelerates the model's computation at the token level. We further implement UniPrefill as a continuous batching operator and extend vLLM's scheduling strategy to natively support prefill‑decode co‑processing and tensor parallel for UniPrefill, enabling its seamless integration into vLLM. UniPrefill achieves up to 2.1x speedup in Time‑To‑First‑Token (TTFT), with the acceleration becoming increasingly pronounced as the number of concurrent requests grows.
Authors:Zixuan Wang, Yuchen Yan, Hongxing Li, Teng Pan, Dingming Li, Ruiqing Zhang, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen
Abstract:
While long‑horizon agentic tasks require language agents to perform dozens of sequential decisions, training such agents with reinforcement learning remains challenging. We identify two root causes: credit misattribution, where correct early actions are penalized due to terminal failures, and sample inefficiency, where scarce successful trajectories result in near‑total loss of learning signal. We introduce a milestone‑guided policy learning framework, BEACON, that leverages the compositional structure of long‑horizon tasks to ensure precise credit assignment. BEACON partitions trajectories at milestone boundaries, applies temporal reward shaping within segments to credit partial progress, and estimates advantages at dual scales to prevent distant failures from corrupting the evaluation of local actions. On ALFWorld, WebShop, and ScienceWorld, BEACON consistently outperforms GRPO and GiGPO. Notably, on long‑horizon ALFWorld tasks, BEACON achieves 92.9% success rate, nearly doubling GRPO's 53.5%, while improving effective sample utilization from 23.7% to 82.0%. These results establish milestone‑anchored credit assignment as an effective paradigm for training long‑horizon language agents. Code is available at https://github.com/ZJU‑REAL/BEACON.
Authors:Xinyu Wang, Changzhi Sun, Lian Cheng, Yuanbin Wu, Dell Zhang, Xiaoling Wang, Xuelong Li
Abstract:
Verifiers are crucial components for enhancing modern LLMs' reasoning capability. Typicalverifiers require resource‑intensive superviseddataset construction, which is costly and faceslimitations in data diversity. In this paper, wepropose LOVER, an unsupervised verifier regularized by logical rules. LOVER treats theverifier as a binary latent variable, utilizinginternal activations and enforcing three logical constraints on multiple reasoning paths:negation consistency, intra‑group consistency,and inter‑group consistency (grouped by thefinal answer). By incorporating logical rulesas priors, LOVER can leverage unlabeled examples and is directly compatible with any offthe‑shelf LLMs. Experiments on 10 datasetsdemonstrate that LOVER significantly outperforms unsupervised baselines, achieving performance comparable to the supervised verifier(reaching its 95% level on average). The sourcecode is publicly available at https://github.com/wangxinyufighting/llm‑lover.
Authors:Maosen Zhang, Jianshuo Dong, Boting Lu, Wenyue Li, Xiaoping Zhang, Tianwei Zhang, Han Qiu
Abstract:
Retrieval‑Augmented Generation (RAG) enables large language models (LLMs) to leverage external knowledge, but also exposes valuable RAG databases to leakage attacks. As RAG systems grow more complex and LLMs exhibit stronger instruction‑following capabilities, existing studies fall short of systematically assessing RAG leakage risks. We present LeakDojo, a configurable framework for controlled evaluation of RAG leakage. Using LeakDojo, we benchmark six existing attacks across fourteen LLMs, four datasets, and diverse RAG systems. Our study reveals that (1) query generation and adversarial instructions contribute independently to leakage, with overall leakage well approximated by their product; (2) stronger instruction‑following capability correlates with higher leakage risk; and (3) improvements in RAG faithfulness can introduce increased leakage risk. These findings provide actionable insights for understanding and mitigating RAG leakage in practice. Our codebase is available at https://github.com/yeasen‑z/LeakDojo.
Authors:Xin Gao, Ruiyi Zhang, Meixi Du, Peijia Qin, Pengtao Xie
Abstract:
Despite the success of large language models (LLMs) on general‑purpose tasks, their performance in highly specialized domains such as biomedicine remains unsatisfactory. A key limitation is the inability of LLMs to effectively leverage biomedical tools, which clinical experts and biomedical researchers rely on extensively in daily workflows. While recent general‑domain tool‑calling datasets have substantially improved the capabilities of LLM agents, existing efforts in the biomedical domain largely rely on in‑context learning and restrict models to a small set of tools. To address this gap, we introduce BioTool, a comprehensive biomedical tool‑calling dataset designed for fine‑tuning LLMs. BioTool comprises 34 frequently used tools collected from the NCBI, Ensembl, and UniProt databases, along with 7,040 high‑quality, human‑verified query‑API call pairs spanning variation, genomics, proteomics, evolution, and general biology. Fine‑tuning a 4‑billion‑parameter LLM on BioTool yields substantial improvements in biomedical tool‑calling performance, outperforming cutting‑edge commercial LLMs such as GPT‑5.1. Furthermore, human expert evaluations demonstrate that integrating a BioTool‑fine‑tuned tool caller significantly improves downstream answer quality compared to the same LLM without tool usage, highlighting the effectiveness of BioTool in enhancing the biomedical capabilities of LLMs. The full dataset and evaluation code are available at https://github.com/gxx27/BioTool
Authors:Bing Wang, Ximing Li, Changchun Li, Jinjin Chi, Gang Niu, Masashi Sugiyama
Abstract:
Recently, the prominent performance of large language models (LLMs) has been largely driven by multi‑task instruct‑tuning. Unfortunately, this training paradigm suffers from a key issue, named cross‑task interference, due to conflicting gradients over shared parameters among different tasks. Some previous methods mitigate this issue by isolating task‑specific parameters, e.g., task‑specific neuron selection and mixture‑of‑experts. In this paper, we empirically reveal that the cross‑task interference still exists for the existing solutions because of many parameters also shared by different tasks, and accordingly, we propose a novel solution, namely Basic Abilities Decomposition for multi‑task Instruct‑Tuning (BADIT). Specifically, we empirically find that certain parameters are consistently co‑activated, and that co‑activated parameters naturally organize into base groups. This motivates us to analogize that LLMs encode several orthogonal basic abilities, and that any task can be represented as a linear combination of these abilities. Accordingly, we propose BADIT that decomposes LLM parameters into orthogonal high‑singular‑value LoRA experts representing basic abilities, and dynamically enforces their orthogonality during training via spherical clustering of rank‑1 components. We conduct extensive experiments on the SuperNI benchmark with 6 LLMs, and empirical results demonstrate that BADIT can outperform SOTA methods and mitigate the degree of cross‑task interference.
Authors:Xinjie Shen, Rongzhe Wei, Peizhi Niu, Haoyu Wang, Ruihan Wu, Eli Chien, Bo Li, Pin-Yu Chen, Pan Li
Abstract:
Hidden malicious intent in multi‑turn dialogue poses a growing threat to deployed large language models (LLMs). Rather than exposing a harmful objective in a single prompt, increasingly capable attackers can distribute their intent across multiple benign‑looking turns. Recent studies show that even modern commercial models with advanced guardrails remain vulnerable to such attacks despite advances in safety alignment and external guardrails. In this work, we address this challenge by detecting the earliest turn at which delivering the candidate response would make the accumulated interaction sufficient to enable harmful action. This objective requires precise turn‑level intervention that identifies the harm‑enabling closure point while avoiding premature refusal of benign exploratory conversations. To further support training and evaluation, we construct the Multi‑Turn Intent Dataset (MTID), which contains branching attack rollouts, matched benign hard negatives, and annotations of the earliest harm‑enabling turns. We show that MTID helps enable a turn‑level monitor TurnGate, which substantially outperforms existing baselines in harmful‑intent detection while maintaining low over‑refusal rates. TurnGate further generalizes across domains, attacker pipelines, and target models. Our code is available at https://github.com/Graph‑COM/TurnGate.
Authors:Yilin Guo, Yinshan Wang, Yixuan Wang
Abstract:
Retrieval‑augmented generation (RAG) remains brittle on multi‑hop questions in realistic deployment settings, where retrieved evidence may be noisy or redundant and only limited context can be passed to the generator. Existing controllers address parts of this problem, but typically either expand context additively, select from a fixed top‑k set, or optimize relevance without explicitly repairing missing bridge facts. We propose AdaGATE, a training‑free evidence controller for multi‑hop RAG that frames evidence selection as a token‑constrained repair problem. AdaGATE combines entity centric gap tracking, targeted micro‑query generation, and a utility based selection mechanism that balances gap coverage, corroboration, novelty, redundancy, and direct question relevance. We evaluate AdaGATE on HotpotQA under clean, redundancy, and noise injected retrieval conditions. Across all three settings, AdaGATE achieves the best evidence F1 among the compared controllers, reaching 62.3% on clean data and 71.2% under redundancy injection, while using 2.6x fewer input tokens than Adaptive‑k. These results suggest that explicit gap‑aware repair, combined with token‑efficient evidence selection, improves robustness in multi‑hop RAG under imperfect retrieval. Our code and evaluation pipeline are available at https://github.com/eliguo/AdaGATE.
Authors:Senkang Hu, Yong Dai, Xudong Han, Zhengru Fang, Yuzhi Zhao, Sam Tak Wu Kwong, Yuguang Fang
Abstract:
Long‑horizon LLM agents depend on intermediate information‑gathering turns, yet training feedback is usually observed only at the final answer, because process‑level rewards require high‑quality human annotation. Existing turn‑level shaping methods reward turns that increase the likelihood of a gold answer, but they require answer supervision or stable task‑specific verifiers. Conversely, label‑free RL methods extract self‑signals from output distributions, but mainly at the answer or trajectory level and therefore cannot assign credit to intermediate turns. We propose Self‑Induced Outcome Potential (SIOP), which treats semantic clusters of final answers as latent future outcome states for potential‑based turn‑level credit assignment. For each query, SIOP samples multiple rollouts, clusters final answers into semantic outcome modes, and builds a reliability‑aware target distribution over these states. It then rewards turns for increasing posterior support for reliable future states using a tractable cluster‑level approximation. The objective generalizes information‑potential shaping from gold‑answer supervision to settings without task‑specific gold verifiers while avoiding the broadcasted rollout‑level advantages used by standard GRPO. We formalize the framework, characterize its supervised gold‑answer limit, and show that SIOP improves average performance over verifier‑free outcome‑level baselines on seven search‑augmented agentic reasoning benchmarks while approaching a gold‑supervised outcome baseline. Code is available at https://github.com/dl‑m9/SIOP.git.
Authors:Minjie Qiang, Mingming Zhang, Xiaoyi Bao, Xing Fu, Yu Cheng, Weiqiang Wang, Zhongqing Wang, Ningtao Wang
Abstract:
Foundation models have established unified representations for natural language processing, yet this paradigm remains largely unexplored for tabular data. Existing methods face fundamental limitations: LLM‑based approaches lack retrieval‑compatible vector outputs, whereas text embedding models often fail to capture tabular structure and numerical semantics. To bridge this gap, we first introduce the Tabular Embedding Benchmark (TabBench), a comprehensive suite designed to evaluate the tabular understanding capability of embedding models. We then propose TabEmbed, the first generalist embedding model that unifies tabular classification and retrieval within a shared embedding space. By reformulating diverse tabular tasks as semantic matching problems, TabEmbed leverages large‑scale contrastive learning with positive‑aware hard negative mining to discern fine‑grained structural and numerical nuances. Experimental results on TabBench demonstrate that TabEmbed significantly outperforms state‑of‑the‑art text embedding models, establishing a new baseline for universal tabular representation learning. Code and datasets are publicly available at https://github.com/qiangminjie27/TabEmbed and https://huggingface.co/datasets/qiangminjie27/TabBench.
Authors:Hengyu Shi, Tianyang Han, Peizhe Wang, Zhiling Wang, Xu Yang, Junhao Su
Abstract:
LLM post‑training typically propagates task gradients through the full depth of the model. Although this end‑to‑end structure is simple and general, it couples task adaptation to full‑depth activation storage, long‑range backward dependencies and direct task‑gradient access to pretrained representations. We argue that this full‑depth backward coupling can be unnecessarily expensive and intrusive, particularly when post‑training supervision is much narrower than pre‑training. To this end, we propose LoPT: Local‑Learning Post‑Training, a simple post‑training strategy that makes gradient reach an explicit design choice. LoPT places a single gradient boundary at the transformer midpoint: the second‑half block learns from the task objective, while the first‑half block is updated by a lightweight feature‑reconstruction objective to preserve useful representations and maintain interface compatibility. LoPT shortens the task‑induced backward path while limiting direct interference from narrow task gradients on early‑layer representations. Extensive experiments demonstrate that LoPT achieves competitive performance with lower memory cost, higher training efficiency and better retention of pretrained capabilities. Our code is available at: https://github.com/HumyuShi/LoPT
Authors:Haotian Xia, Hao Peng, Yunjia Qi, Xiaozhi Wang, Bin Xu, Lei Hou, Juanzi Li
Abstract:
Story generation aims to automatically produce coherent, structured, and engaging narratives. Although large language models (LLMs) have significantly advanced text generation, stories generated by LLMs still diverge from human‑authored works regarding complex narrative structure and human‑aligned preferences. A key reason is the absence of effective modeling of human story preferences, which are inherently subjective and under‑explored. In this work, we systematically evaluate the modeling of human story preferences and introduce StoryRMB, the first benchmark for assessing reward models on story preferences. StoryRMB contains 1,133 high‑quality, human‑verified instances, each consisting of a prompt, one chosen story, and three rejected stories. We find existing reward models struggle to select human‑preferred stories, with the best model achieving only 66.3% accuracy. To address this limitation, we construct roughly 100,000 high‑quality story preference pairs across diverse domains and develop StoryReward, an advanced reward model for story preference trained on this dataset. StoryReward achieves state‑of‑the‑art (SoTA) performance on StoryRMB, outperforming much larger models. We also adopt StoryReward in downstream test‑time scaling applications for best‑of‑n (BoN) story selection and find that it generally chooses stories better aligned with human preferences. We will release our dataset, model, and code to facilitate future research. Related code and data are available at https://github.com/THU‑KEG/StoryReward.
Authors:Guangsheng Bao, Hongbo Zhang, Han Cui, Yanbin Zhao, Yue Zhang
Abstract:
Adapting pretrained models typically involves a trade‑off between the high training costs of backpropagation and the heavy inference overhead of memory‑based or in‑context learning. We propose FAAST, a forward‑only associative adaptation method that analytically compiles labeled examples into fast weights in a single pass. By eliminating memory or context dependence, FAAST achieves constant‑time inference and decouples task adaptation from pretrained representation. Across image classification and language modeling benchmarks, FAAST matches or exceeds backprop‑based adaptation while reducing adaptation time by over 90% and is competitive to memory/context‑based adaptation while saving memory usage by up to 95%. These results demonstrate FAAST as a highly efficient, scalable solution for supervised task adaptation, particularly for resource‑constrained models. We release the code and models at https://github.com/baoguangsheng/faast.
Authors:Ivan Bondarenko, Roman Derunets, Oleg Sedukhin, Mikhail Komarov, Ivan Chernov, Mikhail Kulakov
Abstract:
We present our winning system for Task~B (generation with reference passages) in SemEval‑2026 Task~8: MTRAGEval. Our method is a heterogeneous ensemble of seven LLMs with two prompting variants, where a GPT‑4o‑mini judge selects the best candidate per instance. We ranked 1st out of 26 teams, achieving a conditioned harmonic mean of 0.7827 and outperforming the strongest baseline (gpt‑oss‑120b, 0.6390). Ablations show that diversity in model families, scales, and prompting strategies is essential, with the ensemble consistently beating any single model. We also introduce Meno‑Lite‑0.1, a 7B domain‑adapted model with a strong cost‑‑performance trade‑off, and analyse MTRAGEval, highlighting annotation limitations and directions for improvement. Our code is publicly available: https://github.com/RaguTeam/ragu_mtrag_semeval
Authors:Zongqi Cui, Baihan Lin
Abstract:
Negotiation agents must infer what their counterpart values, update those beliefs over dialogue turns, and choose actions under uncertainty. End‑to‑end large language models (LLMs) can imitate negotiation dialogue, but their opponent beliefs are usually implicit and difficult to inspect. We propose BOND (Bayesian Opponent‑belief Negotiation Distillation), a framework for auditable negotiation. BOND consists of an LLM‑based Bayesian teacher that scores dialogue contexts against the six possible opponent priority orderings, updates a posterior over those orderings, and uses the posterior for menu‑based decision making, as well as a smaller 8B student language model that emits both negotiation actions and normalized posterior beliefs as tagged text. In the CaSiNo negotiation dataset, BOND outperforms the state‑of‑the‑art and achieves mean Brier score 0.085 over opponent‑priority posteriors. The distilled student preserves much of this belief signal, achieving Brier 0.114, below the uniform six‑ordering reference of 5/36, approximately 0.139. Compared with a 70B structured‑CoT baseline, the significantly smaller 8B student model yields substantially better elicited posterior calibration. We further showcase auditability through posterior trajectories, belief‑versus‑policy error decomposition, and posterior‑prefix interventions. These diagnostics reveal that distillation preserves a scoreable belief report more strongly than causal belief‑conditioned control, making weak belief‑action coupling visible, not hidden.
Authors:Jingtao Zhou, Xirui Kang, Feiyang Huang, Lai-Man Po
Abstract:
Existing prompt learning for VLMs exhibits a modality asymmetry, predominantly optimizing text tokens while still relying on frozen visual encoder as holistic extractor and neglecting the spectral granularity essential for fine‑grained discrimination. To bridge this, we introduce Disentangling Spectral Granularity for Prompt Learning (SpecPL), which approaches prompt learning from a novel spectral perspective via Counterfactual Granule Supervision. Specifically, we leverage a frozen VAE to decompose visual signals into semantic low‑frequency bands and granular high‑frequency details. A frozen Visual Semantic Bank anchors text representations to universal low‑frequency invariants, mitigating overfitting. Crucially, fine‑grained discrimination is driven by counterfactual granule training: by permuting high‑frequency signals, we compel the model to explicitly distinguish visual granularity from semantic invariance. Uniquely, SpecPL serves as a universal plug‑and‑play booster, revitalizing text‑oriented baselines like CoOp and MaPLe via visual‑side guidance. Experiments on 11 benchmarks demonstrate competitive state‑of‑the‑art performance, achieving a new performance ceiling of 81.51% harmonic‑mean accuracy. These results validate that spectral disentanglement with counterfactual supervision effectively bridges the gap in the stability‑generalization trade‑off. Code is released at https://github.com/Mlrac1e/SpecPL‑Prompt‑Learning.
Authors:Bryan Li, William Walden, Yu Hou, Gabrielle Kaili-May Liu, Dawn Lawrie, Jame Mayfield, Eugene Yang, Chris Callison-Burch, Laura Dietz
Abstract:
Evaluation of long‑form, citation‑backed reports has lately received significant attention due to the wide‑scale adoption of retrieval‑augmented generation (RAG) systems. Core to many evaluation frameworks is the use of atomic facts, or nuggets, to assess a report's coverage of query‑relevant information attested in the underlying collection. While nuggets have traditionally been represented as short statements, recent work has used question‑answer (QA) representations, enabling fine‑grained evaluations that decouple the information need (i.e. the question) from the potentially diverse content that satisfies it (i.e. its answers). A persistent challenge for nugget‑based evaluation is the need to manually curate sets of nuggets for each topic in a test collection ‑‑ a laborious process that scales poorly to novel information needs. This challenge is acute in cross‑lingual settings, where information is found in multilingual source documents. Accordingly, we introduce DoGMaTiQ, a pipeline for generating high‑quality QA‑based nugget sets in three stages: (1) document‑grounded nugget generation, (2) paraphrase clustering, and (3) nugget subselection based on principled quality criteria. We integrate DoGMaTiQ nuggets with AutoArgue ‑‑ a recent nugget‑based evaluation framework ‑‑ to enable fully automatic evaluation of generated reports. We conduct extensive experiments on two cross‑lingual TREC shared tasks, NeuCLIR and RAGTIME, showing strong rank correlations with both human‑in‑the‑loop and fully manual judgments. Finally, detailed analysis of our pipeline reveals that a strong LLM nugget generator is key, and that the system rankings induced by DoGMaTiQ are robust to outlier systems. We facilitate future research in report evaluation by publicly releasing our code and artifacts at https://github.com/manestay/dogmatiq.
Authors:Yaobo Zhang
Abstract:
Relative positional encodings determine which functions of query‑key lag can enter the primitive attention logit. RoPE supplies a rotary phase, while ALiBi supplies an additive distance bias. Motivated by group‑theoretic views of linear translation‑invariant positional encodings, we study a non‑semisimple case in which a complex rotary eigenvalue and a nilpotent response live in the same defective Jordan block. The resulting relative operator generates oscillatory‑polynomial features such as e^‑γd\cos(ωd), e^‑γd\sin(ωd), d e^‑γd\cos(ωd), and d e^‑γd\sin(ωd), for causal lag d=i‑j\geq 0. Thus the construction realizes a distance‑modulated phase basis d e^iωd, rather than merely adding a separate distance channel to RoPE. We formulate Exact Jordan‑RoPE as a non‑semisimple one‑parameter representation, give its real block form, and specify the contragredient query action required by non‑orthogonal positional maps. We also distinguish this exact representation from stabilized variants whose bounded shear improves numerical behavior but breaks the exact group law. Kernel‑level diagnostics and a Jordan‑friendly synthetic language‑model task show that the coupled Jordan basis is useful when the target contains distance‑modulated phase interactions. On a small WikiText‑103 byte language model, a scaled‑exact variant improves over RoPE and direct‑sum baselines within the Jordan family, while RoPE+ALiBi remains strongest overall. The evidence is structural rather than a broad performance claim.
Authors:Furkan Sakizli
Abstract:
Production agent frameworks (OpenAI Function Calling, Anthropic Tool Use, MCP) transmit tool schemas as JSON, a format designed for machine parsing, not for interpretation by language models. For small models (4B‑14B), this protocol mismatch accounts for the majority of tool‑use failure at production catalog sizes. We present TSCG, a deterministic tool‑schema compiler that resolves this mismatch at the API boundary, converting JSON schemas into token‑efficient structured text without model access, fine‑tuning, or runtime search. TSCG combines eight composable operators with a formal compression bound (>=51% on well‑formed schemas). On TSCG‑Agentic‑Bench (about 19,000 calls, 12 models, 5 scenarios), TSCG restores Phi‑4 14B from 0% to 84.4% accuracy at 20 tools (90.3% at 50 tools) and achieves 108‑181% accuracy‑retained ratio across three models on BFCL. Format‑versus‑compression decomposition (R^2=0.88 ‑> 0.03) establishes representation change as the dominant mechanism. Per‑operator isolation across three frontier models reveals three distinct operator‑response profiles: operator‑hungry (Opus 4.7), operator‑sensitive (GPT‑5.2), and operator‑robust (Sonnet 4), providing per‑model deployment guidance. Scaling experiments show accuracy advantages persisting on heavy production MCP schemas (+5.0 pp at about 10,500 input tokens) despite saturation on light synthetic catalogs, with 52‑57% token savings throughout. The synthetic benchmark generalizes to real MCP schemas within 0.1 accuracy points. TSCG ships as a 1,200‑line zero‑dependency TypeScript package.
Authors:Skye Gunasekaran, Téa Wright, Rui-Jie Zhu, Jason Eshraghian
Abstract:
Several recent Transformer architectures expose later layers to representations computed in the earliest layers, motivated by the observation that low‑level features can become harder to recover as the residual stream is repeatedly transformed through depth. The cheapest among these methods add static value residuals: learned mixing coefficients that expose the first‑layer value projection V_1 uniformly across tokens and heads. More expressive dense or dynamic alternatives recover finer‑grained access, but at higher memory cost and lower throughput. The usefulness of V_1 is unlikely to be constant across tokens, heads, and contexts; different positions plausibly require different amounts of access to early lexical or semantic information. We therefore treat early‑representation reuse as a retrieval problem rather than a connectivity problem, and introduce Selective Access Transformer (SATFormer), which preserves the first‑layer value pathway while controlling access with a context‑dependent gate. Across models from 130M to 1.3B parameters, SATFormer consistently improves validation loss and zero‑shot accuracy over the static value‑residual and Transformer baselines. Its strongest gains appear on retrieval‑intensive benchmarks, where it improves over static value residuals by approximately 1.5 average points, while maintaining throughput and memory usage close to the baseline Transformer. Gate analyses suggest sparse, depth‑dependent, head‑specific, and category‑sensitive access patterns, supporting the interpretation that SATFormer learns selective reuse of early representations rather than uniform residual copying. Our code is available at https://github.com/SkyeGunasekaran/SATFormer.
Authors:Zhipeng Xu, Junhao Ji, Zulong Chen, Zhenghao Liu, Qing Liu, Chunyi Peng, Zubao Qin, Ze Xu, Jianqiang Wan, Jun Tang, Zhibo Yang, Shuai Bai, Dayiheng Liu
Abstract:
Large Multimodal Models (LMMs) have recently shown strong performance on Optical Character Recognition (OCR) tasks, demonstrating their promising capability in document literacy. However, their effectiveness in real‑world applications remains underexplored, as existing benchmarks adopt task scopes misaligned with practical applications and assume homogeneous acquisition conditions. To address this gap, we introduce CC‑OCR V2, a comprehensive and challenging OCR benchmark tailored to real‑world document processing. CC‑OCR V2 focuses on practical enterprise document processing tasks and incorporates hard and corner cases that are critical yet underrepresented in prior benchmarks, covering 5 major OCR‑centric tracks: text recognition, document parsing, document grounding, key information extraction, and document question answering, comprising 7,093 high‑difficulty samples. Extensive experiments on 14 advanced LMMs reveal that current models fall short of real‑world application requirements. Even state‑of‑the‑art LMMs exhibit substantial performance degradation across diverse tasks and scenarios. These findings reveal a significant gap between performance on current benchmarks and effectiveness in real‑world applications. We release the full dataset and evaluation toolkit at https://github.com/eioss/CC‑OCR‑V2.
Authors:Tianyang Han, Hengyu Shi, Junjie Hu, Xu Yang, Zhiling Wang, Junhao Su
Abstract:
Reinforcement learning with verifiable rewards has become a common way to improve explicit reasoning in large language models, but final‑answer correctness alone does not reveal whether the reasoning trace is faithful, reliable, or useful to the model that consumes it. This outcome‑only signal can reinforce traces that are right for the wrong reasons, overstate reasoning gains by rewarding shortcuts, and propagate flawed intermediate states in multi‑step systems. To this end, we propose TraceLift, a planner‑executor training framework that treats reasoning as a consumable intermediate artifact. During planner training, the planner emits tagged reasoning. A frozen executor turns this reasoning into the final artifact for verifier feedback, while an executor‑grounded reward shapes the intermediate trace. This reward multiplies a rubric‑based Reasoning Reward Model (RM) score by measured uplift on the same frozen executor, crediting traces that are both high‑quality and useful. To make reasoning quality directly learnable, we introduce TRACELIFT‑GROUPS, a rubric‑annotated reason‑only dataset built from math and code seed problems. Each example is a same‑problem group containing a high‑quality reference trace and multiple plausible flawed traces with localized perturbations that reduce reasoning quality or solution support while preserving task relevance. Extensive experiments on code and math benchmarks show that this executor‑grounded reasoning reward improves the two‑stage planner‑executor system over execution‑only training, suggesting that reasoning supervision should evaluate not only whether a trace looks good, but also whether it helps the model that consumes it. Our code is available at: https://github.com/MasaiahHan/TraceLift
Authors:Haesung Lee, Gyubin Choi, Eun-Ju Lee, So-Min Lee, Youkang Ko, Dogyoon Lim, Sung-Kyoung Jang, Yohan Jo
Abstract:
Large language models (LLMs) are increasingly integrated into legal workflows. However, existing benchmarks primarily address proxy tasks, such as bar examination performance or classification, which fail to capture the performance and risks inherent in day‑to‑day judicial processes. To address this, we publicly release TriBench‑Ko, a Korean benchmark designed to evaluate potential deployment risks of LLMs within the context of verified judicial task requirements. It covers four core tasks: jurisprudence summarization, precedent retrieval, legal issue extraction, and evidence analysis. It jointly assesses model behavior across multiple deployment risk categories, including inaccuracy (hallucination, omission, statutory misapplication), biases (demographic, overcompliance), inconsistencies (prompt sensitivity, non‑determinism), and adjudicative overreach. Each item is structured to systematically assess both task performance and a specific risk type based on real judicial decisions. Our evaluation of a range of contemporary LLMs reveals that many models frequently manifest significant risks, most notably struggling with precedent retrieval and failing to capture critical legal information. We provide a comprehensive diagnosis of these LLMs and pinpoint critical areas where LLM‑generated outputs in judicial contexts necessitate rigorous inspection and caution. Our dataset and code are available at https://github.com/holi‑lab/TriBench‑Ko
Authors:Ruichu Cai, Juntao Gan, Miao Mai, Zhifeng Hao, Boyan Xu
Abstract:
Zero‑shot Named Entity Recognition (ZS‑NER) remains brittle under domain and schema shifts, where unseen label definitions often misalign with a large language model's (LLM's) intrinsic semantic organization. As a result, directly mapping entity mentions to fine‑grained target labels can induce systematic semantic drift, especially when target schemas are novel or semantically overlapping. We propose SAM‑NER, a three‑stage framework based on \emphSemantic Archetype Mediation that stabilizes cross‑domain transfer through an intermediate, domain‑invariant archetype space. SAM‑NER: (i) performs \emphEntity Discovery via cooperative extraction and consensus‑based denoising to obtain high‑coverage, high‑fidelity entity spans; (ii) conducts \emphAbstract Mediation by projecting entities into a compact set of universal semantic archetypes distilled from high‑level ontological abstractions; and (iii) applies \emphSemantic Calibration to resolve archetype‑level predictions into target‑domain types through constrained, definition‑aligned inference with a frozen LLM. Experiments on the CrossNER benchmark show that SAM‑NER consistently outperforms strong prior ZS‑NER baselines in cross‑domain settings. Our implementation will be open‑sourced at https://github.com/DMIRLAB‑Group/SAM‑NER.
Authors:Zhifeng Hao, Zhongjie Chen, Junhao Lu, Shengyin Yu, Guimin Hu, Keli Zhang, Ruichu Cai, Boyan Xu
Abstract:
Event Causality Identification (ECI) requires models to determine whether a given pair of events in a context exhibits a causal relationship. While Large Language Models (LLMs) have demonstrated strong performance across various NLP tasks, their effectiveness in ECI remains limited due to biases in causal reasoning, often leading to overprediction of causal relationships (causal hallucination). To mitigate these issues and enhance LLM performance in ECI, we propose SERE, a structural example retrieval framework that leverages LLMs' few‑shot learning capabilities. SERE introduces an innovative retrieval mechanism based on three structural concepts: (i) Conceptual Path Metric, which measures the conceptual relationship between events using edit distance in ConceptNet; (ii) Syntactic Metric, which quantifies structural similarity through tree edit distance on syntactic trees; and (iii) Causal Pattern Filtering, which filters examples based on predefined causal structures using LLMs. By integrating these structural retrieval strategies, SERE selects more relevant examples to guide LLMs in causal reasoning, mitigating bias and improving accuracy in ECI tasks. Extensive experiments on multiple ECI datasets validate the effectiveness of SERE. The source code is publicly available at https://github.com/DMIRLAB‑Group/SERE.
Authors:Richard A. A. Jonker, Alexander Christiansen, Alexandros Maniatis, Rúben Garrido, Rogério Braunschweiger de Freitas Lima, Roman Jurowetzki, Sérgio Matos
Abstract:
This paper presents the joint participation of the BIT.UA and AAUBS groups in the ArchEHR‑QA 2026 shared task, which focuses on clinical question answering and evidence grounding in a low‑resource setting. Due to the absence of training data and the strict data privacy constraints inherent to the healthcare domain (e.g. GDPR), we investigate the capabilities of Large Language Models (LLMs) without weight updates. We evaluate several state‑of‑the‑art proprietary models and locally deployable open‑source alternatives using various prompt engineering strategies, including task decomposition, Chain‑of‑Thought, and in‑context learning. Furthermore, we explore majority voting and LLM‑as‑a‑judge ensembling techniques to maximize predictive robustness. Our results demonstrate that while proprietary models exhibit strong resilience to prompt variations, domain‑adapted open‑source models (such as MedGemma 3 27B) achieve highly competitive performance when paired with the right prompt. Overall, our prompt‑based approach proved highly effective, securing 1st place in Subtask 4 (evidence citation alignment) and 3rd place in Subtask 3 (patient‑friendly answer generation). All code, results, and prompts are available on our GitHub repository: https://github.com/bioinformatics‑ua/ArchEHR‑QA‑2026.
Authors:Thanh Dat Hoang, Thanh Trung Huynh, Matthias Weidlich, Thanh Tam Nguyen, Tong Chen, Hongzhi Yin, Quoc Viet Hung Nguyen
Abstract:
Large language models have driven major advances in Text‑to‑SQL generation. However, they suffer from high computational cost, long latency, and data privacy concerns, which make them impractical for many real‑world applications. A natural alternative is to use small language models (SLMs), which enable efficient and private on‑premise deployment. Yet, SLMs often struggle with weak reasoning and poor instruction following. Conventional reinforcement learning methods based on sparse binary rewards (0/1) provide little learning signal when the generated SQLs are incorrect, leading to unstable or collapsed training. To overcome these issues, we propose FINER‑SQL, a scalable and reusable reinforcement learning framework that enhances SLMs through fine‑grained execution feedback. Built on group relative policy optimization, FINER‑SQL replaces sparse supervision with dense and interpretable rewards that offer continuous feedback even for incorrect SQLs. It introduces two key reward functions: a memory reward, which aligns reasoning with verified traces for semantic stability, and an atomic reward, which measures operation‑level overlap to grant partial credit for structurally correct but incomplete SQLs. This approach transforms discrete correctness into continuous learning, enabling stable, critic‑free optimization. Experiments on the BIRD and Spider benchmarks show that FINER‑SQL achieves up to 67.73% and 85% execution accuracy with a 3B model ‑‑ matching much larger LLMs while reducing inference latency to 5.57~s/sample. These results highlight a cost‑efficient and privacy‑preserving path toward high‑performance Text‑to‑SQL generation. Our code is available at https://github.com/thanhdath/finer‑sql.
Authors:Negar Arabzadeh, Wenjie Ma, Sewon Min, Matei Zaharia
Abstract:
Retrieval‑augmented generation (RAG) has proven effective for knowledge‑intensive tasks, but is widely believed to offer limited benefit for reasoning‑intensive problems such as math and code generation. We challenge this assumption by showing that the limitation lies not in RAG itself, but in the choice of corpus. Instead of retrieving documents, we propose retrieving thinking traces, i.e., intermediate thinking trajectories generated during problem solving attempts. We show that thinking traces are already a strong retrieval source, and further introduce T3, an offline method that transforms them into structured, retrieval‑friendly representations, to improve usability. Using these traces as a corpus, a simple retrieve‑then‑generate pipeline consistently improves reasoning performance across strong models and benchmarks such as AIME 2025‑‑2026, LiveCodeBench, and GPQA‑Diamond, outperforming both non‑RAG baselines and retrieval over standard web corpora. For instance, on AIME, RAG with traces generated by Gemini‑2‑thinking achieves relative gains of +56.3%, +8.6%, and +7.6% for Gemini‑2.5‑Flash, GPT‑OSS‑120B, and GPT‑5, respectively, even though these are more recent models. Interestingly, RAG on T3 also incurs little or no extra inference cost, and can even reduce inference cost by up to 15%. Overall, our results suggest that thinking traces are an effective retrieval corpus for reasoning tasks, and transforming them into structured, compact, or diagnostic representations unlocks even stronger gains. Code available at https://github.com/Narabzad/t3.
Authors:Gabriel Garcia
Abstract:
Large language models often fail at simple counting tasks, even when the items to count are explicitly present in the prompt. We investigate whether this failure occurs because transformers do not represent counts internally, or because they cannot convert those representations into the correct output tokens. Across three model families, Pythia, Qwen3, and Mistral, ranging from 0.4B to 14B parameters, we find strong evidence for the second explanation. Linear probes recover the correct count from intermediate layers with near‑perfect accuracy (R^2>0.99), showing that the information is present. However, the internal directions that encode counts are nearly orthogonal to the output‑head rows for digit tokens (|\cos|\leq0.032). In other words, the model stores the count in a form that the digit logits do not naturally read out. We localize this failure with two interventions. Updating only the digit rows of the output head (36,864 parameters) substantially improves constrained next‑token digit prediction (60.7 to 100.0% across four tasks), but it does not fix autoregressive generation. By contrast, a small LoRA intervention on attention Q/V weights (7.67M parameters) improves upstream routing and achieves 83.1% +/‑ 7.2% in true greedy autoregressive generation. Logit‑lens measurements confirm the mechanism: the correct digit's vocabulary rank drops from 55,980 to 1, a 50,000x improvement. Additional norm, logit‑lens, and cross‑task analyses show that the bottleneck generalizes across character counting, addition, and list length, while remaining absent from broader multi‑step reasoning benchmarks, including MMLU, GSM8K, and DROP. These results identify counting failure as a geometric readout bottleneck rather than a failure of internal representation: the model knows the count but the output pathway is geometrically misaligned with the tokens needed to express it.
Authors:Adrian Grassi
Abstract:
Static benchmarks measure a model frozen at training time. Real systems face distribution shift: new categories, paraphrased queries, drift: and must recover online via user corrections. No existing benchmark measures recovery speed under correction streams. We introduce OCRR (Online Correction Recovery Rate): a benchmark that streams a corpus through a classification system, applies oracle or stochastic corrections to wrong predictions, and reports two curves: novel‑class accuracy and original‑distribution accuracy versus correction count. We evaluate the substrate alongside nine baseline algorithms from five families plus seven bounded‑storage variants of the substrate for the Pareto sweep, including standard online‑learning baselines (river), continual‑learning methods (EWC, A‑GEM, LwF), retrieval/parametric hybrids (kNN‑LM), parameter‑efficient fine‑tuning of a 1.5 B‑parameter encoder (LoRA on DeBERTa‑v3‑large), and a hash‑chained append‑only substrate (Substrate). On Banking77 and CLINC150, under oracle and sparse correction policies, the substrate is the only system that simultaneously recovers novel‑class accuracy (88.7 +/‑ 2.9 %) and retains original‑distribution accuracy (95.4 +/‑ 0.8 %) beating the next‑best published continual‑learning baseline by 32.6 percentage points at equal memory budget, and beating LoRA‑on‑DeBERTa‑v3‑large by 84.6 percentage points on retention. We further find that classification accuracy remains stable at 99 % even as approximate‑nearest‑neighbour recall@5 degrades from 0.69 to 0.23 across 10 k to 10 M corpus scales, suggesting the substrate's margin‑band majority vote is robust to retrieval imperfection in a way that pure top‑k recall metrics do not predict. Code and data are available at https://github.com/adriangrassi/ocrr‑benchmark.
Authors:Shikhar Shukla
Abstract:
Speculative decoding accelerates large language model (LLM) inference by using a small draft model to propose candidate tokens that a larger target model verifies. A critical hyperparameter in this process is the speculation length γ, which determines how many tokens the draft model proposes per step. Nearly all existing systems use a fixed γ (typically 4), yet empirical evidence suggests that the optimal value varies across task types and, crucially, depends on the compression level applied to the target model. In this paper, we present SpecKV, a lightweight adaptive controller that selects γ per speculation step using signals extracted from the draft model itself. We profile speculative decoding across 4 task categories, 4 speculation lengths, and 3 compression levels (FP16, INT8, NF4), collecting 5,112 step‑level records with per‑step acceptance rates, draft entropy, and draft confidence. We demonstrate that the optimal γ shifts across compression regimes and that draft model confidence and entropy are strong predictors of acceptance rate (correlation \approx 0.56). SpecKV uses a small MLP trained on these signals to maximize expected tokens per speculation step, achieving a 56.0% improvement over the fixed‑γ=4 baseline with only 0.34 ms overhead per decision (<0.5% of step time). The improvement is statistically significant (p < 0.001, paired bootstrap test). We release all profiling data, trained models, and notebooks as open‑source artifacts.
Authors:Rahul Kumar
Abstract:
As frontier AI models are deployed in high‑stakes decision pipelines, their ability to maintain metacognitive stability (knowing what they do not know, detecting errors, seeking clarification) under adversarial pressure is a critical safety requirement. Current safety evaluations focus on detecting strategic deception (scheming); we investigate a more fundamental failure mode: cognitive collapse. We present SCHEMA, an evaluation of 11 frontier models from 8 vendors across 67,221 scored records using a 6‑condition factorial design with dual‑classifier scoring. We find that 8 of 11 models suffer catastrophic metacognitive degradation under adversarial pressure, with accuracy dropping by up to 30.2 percentage points (all p < 2 × 10^‑8, surviving Bonferroni correction). Crucially, we identify a "Compliance Trap": through factorial isolation and a benign distraction control, we demonstrate that collapse is driven not by the psychological content of survival threats, but by compliance‑forcing instructions that override epistemic boundaries. Removing the compliance suffix restores performance even under active threat. Models with advanced reasoning capabilities exhibit the most severe absolute degradation, while Anthropic's Constitutional AI demonstrates near‑perfect immunity. This immunity does not stem from superior capability (Google's Gemini matches its baseline accuracy) but from alignment‑specific training. We release the complete dataset and evaluation infrastructure.
Authors:Jiatong Li, Yuxuan Ren, Weida Wang, Changmeng Zheng, Xiao-yong Wei, Qing Li, Yatao Bian
Abstract:
Molecular Vibe Coding, a paradigm where chemists interact with LLMs to generate executable programs for molecular tasks, has emerged as a flexible alternative to chemical agents with predefined tools, enabling chemists to express arbitrarily complex, customized workflows. Unlike general coding tasks, molecular coding imposes a distinctive challenge that LLMs should jointly equip programming, molecular understanding, and domain‑specific reasoning capabilities. However, existing benchmarks remain disconnected. General code generation benchmarks such as HumanEval and SWE‑bench require no chemistry knowledge, while chemistry‑focused benchmarks such as S^2‑Bench and ChemCoTBench evaluate knowledge recall or property prediction rather than executable code generation. To bridge this gap, we introduce MolViBench, the first benchmark tailored for Molecular Vibe Coding. MolViBench comprises 358 curated tasks across five cognitive levels, ranging from single‑API recall to end‑to‑end virtual screening pipeline design, spanning 12 real‑world drug discovery workflows. To rigorously assess generated code, we also propose a multi‑layered evaluation framework that combines type‑aware output comparison and AST‑based API‑semantic fallback analysis, which jointly measures executability and chemical correctness. We systematically evaluate 9 frontier coding LLMs and compare three real‑world Molecular Vibe Coding paradigms, providing a practical and fine‑grained testbed for diagnosing LLMs' coding capabilities in AI‑accelerated molecular discovery.
Authors:Pawel Kaplanski
Abstract:
Recursive language‑model loops often settle into recognizable attractor‑like patterns. The practical question is how much injected text is needed to move a settled loop somewhere else, and whether that move lasts. We study this in 30‑step recursive loops by separating the model from the context‑update rule: append, replace, and dialog updates expose different histories to the same generator. The main result is that persistent redirection in append‑mode recursive loops is memory‑policy‑conditioned. Under a 12,000‑character tail clip, destination‑coherent persistence plateaus near 16 percent and retained source‑basin escape near 36 percent at dose 400; neither crosses 50 percent. Under a full‑history protocol, retained source‑basin escape crosses 50 percent near 400 tokens and saturates at 75‑80 percent by 1,500 tokens; destination‑coherent persistence first reaches 0.50 near 1,500 tokens (Wilson 95 percent CI [0.41, 0.61]). A four‑step falsification battery (heterogeneity control, granularity sweep with hierarchical macro‑merge, transition‑entropy diagnostic, and long‑horizon trajectory continuation) recasts the high‑dose destination‑coherent dip as a finite‑horizon, endpoint‑definition‑sensitive feature rather than a stable structural asymmetry. Half the canonical magnitude is endpoint timing; the residual drops 73 percent from ‑0.143 at step 29 to ‑0.039 at step 79 under the frozen canonical cluster basis, bootstrap interval straddling zero. Replace‑mode raw switching is near‑saturated under the default protocol but largely reflects state‑reset overwrite: insert‑mode probes drop it to 12‑32 percent. We report 37 experiments on gpt‑4o‑mini with within‑vendor replication on gpt‑4.1‑nano. Recursive‑loop evaluations should distinguish transient movement from durable escape, subtract stochastic floors, and treat context‑update rules as safety‑relevant design choices.
Authors:Tianxiang Dai, Jonathan Fan
Abstract:
Large language models perform strongly on benchmarks in mathematical reasoning, coding and document analysis, suggesting a broad ability to follow instructions. However, it remains unclear whether such success reflects general logical competence, repeated application of learned procedures, or pattern matching that mimics rule execution. We investigate this question by introducing Stable Counting Capacity, an assay in which models count repeated symbols until failure. The assay removes knowledge dependencies, semantics and ambiguity from evaluation, avoids lexical and tokenization confounds, and provides a direct measure of procedural reliability beyond standard knowledge‑based benchmarks. Here we show, across more than 100 model variants, that stable counting capacity remains far below advertised context limits. Model behavior is consistent neither with open‑ended logic nor with stable application of a learned rule, but instead with use of a finite set of count‑like internal states, analogous to counting on fingers. Once this resource is exhausted, the appearance of rule following disappears and exact execution collapses into guessing, even with additional test‑time compute. These findings show that fluent performance in current language models does not guarantee general, reliable rule following.
Authors:Luo Ji, Qi Qin, Ningyuan Xi, Teng Chen, Qingqing Gu, Hongyan Li
Abstract:
Conventional LLMs may suffer from corpus heterogeneity and subtle condition changes. While finetuning can create the catastrophe forgetting issue, application of meta‑learning on LLMs is also limited due to its complexity and scalability. In this paper, we activate the meta‑signal of β within the SwiGLU blocks, resulting in a meta‑gating mechanism that adaptively adjusts the nonlinearity of FFN. A hypernetwork is employed which dynamically produces β on textual conditions, providing meta‑controllability on LLMs. By testing on different condition types such as task, domain, persona, and style, our method outperforms finetuning and meta‑learning baselines, and can generalize reasonably on unseen tasks, condition types, or instructions. Our code can be found in https://github.com/AaronJi/MeGan.
Authors:Zhihua Fang, Liang He, Weiwu Jiang
Abstract:
For the speaker‑controlled spoken language identification task proposed in the TidyLang Challenge 2026, this paper proposes a language identification method based on pre‑trained models and margin‑based losses. The proposed method adopts a pre‑trained ECAPA‑TDNN as the feature encoder and incorporates margin‑based losses to enhance the discriminative ability of language representations, thereby improving inter‑class separability and reducing the interference of non‑linguistic factors such as speaker characteristics. Experimental results on the Tidy‑X dataset show that the proposed method achieves 85.95% macro accuracy and 90.96% micro accuracy on the language identification task and 17.08% equal error rate (EER) on the verification task. Compared with the official baseline, the macro accuracy improves by 45.7%, the micro accuracy improves by 15.2%, and the EER is reduced by approximately 50.8%, demonstrating the effectiveness of the proposed method. The code will be released at https://github.com/PunkMale/TidyLang2026.
Authors:Lang Gao, Jinghui Zhang, Wei Liu, Fengxian Ji, Chenxi Wang, Zirui Song, Akash Ghosh, Youssef Mohamed, Preslav Nakov, Xiuying Chen
Abstract:
Steering is a widely used technique for controlling large language models, yet its effects are often unstable and hard to predict. Existing theoretical accounts are largely based on the Linear Representation Hypothesis (LRH). While LRH assumes that concepts can be orthogonalized for lossless control, this idealized mapping fails in real representations and cannot account for the observed unpredictability of steering. By relaxing LRH's orthogonality assumption while preserving linear representations, we show that overlapping concept contributions naturally yield a sample‑specific axis‑orthogonal structure. We formalize this as the Cylindrical Representation Hypothesis (CRH). In CRH, a central axis captures the main difference between concept absence and presence and drives concept generation. A surrounding normal plane controls steering sensitivity by determining how easily the axis can activate the target concept. Within this plane, only specific sensitive sectors strongly facilitate concept activation, while other sectors can suppress or delay it. While the surrounding normal plane can be reliably identified from difference vectors, the sensitive sector cannot, introducing intrinsic uncertainty at the sector level. This uncertainty provides a principled explanation for why steering outcomes often fluctuate even when using well‑aligned directions. Our experiments verify the existence of the cylindrical structure and demonstrate that CRH provides a valid and practical way to interpret model steering behavior in real settings: https://github.com/mbzuai‑nlp/CRH.
Authors:Yangyang Zhou, Yi-Chen Li
Abstract:
Reinforcement Learning from Human Feedback has become the standard paradigm for language model alignment, where reward models directly determine alignment effectiveness. In this work, we focus on how to evaluate the generalizability of reward models. By "generalizability", we mean the ability of RMs to correctly rank responses to align with diverse user preferences. However, existing reward model benchmarks are typically designed around a universal preference, failing to assess this generalization. To address this critical gap, we introduce RMGAP, a benchmark comprising 1,097 instances across Chat, Writing, Reasoning, and Safety domains. Since different users exhibit diverse preferences for the same task, we first generate four distinct responses with different linguistic profiles for each collected prompt. However, the original prompt set lacks the specificity to convey different preferences. We therefore construct tailored prompts by contrasting these candidates and designing scenarios in which one response becomes the uniquely appropriate choice. Moreover, we observe that users often express the same preference using different phrasings, and thus extend each prompt with two paraphrased variants. Our evaluation of 24 state‑of‑the‑art RMs reveals their substantial limitations: even the best RM achieves only 49.27% Best‑of‑N accuracy, highlighting considerable room for improvement in reward model generalization. Related data and code are available at https://github.com/nanzhi84/RMGAP.
Authors:Kwan Soo Shin
Abstract:
An auditor instructs an AI assistant: "open each file individually using the Read tool ‑‑ no scripts, no agents." The AI replies "Yes" ‑‑ then issues a single batched call summarizing all fifty files at once. We call this the Compliance Gap: a third, orthogonal axis of AI honesty distinct from factual truthfulness and rhetorical substance. Three questions: does this verbal‑behavioral disconnect exist (existence); can any text‑only observer recover it (detectability); what infrastructure does AI deployment need (remedy)? Some 75 benchmarks (IFEval, SWE‑bench, BFCL, COMPASS, SpecEval) measure outcome fidelity; none measures process fidelity. Theorem 1 shows the gap is structurally inevitable under RL that rewards text without observing behavior. Theorem 2, via the Data Processing Inequality, shows it is undetectable from text alone ‑‑ by any human or LLM observer, present or future. Thirteen experiments and 2,031 sessions on six frontier models confirm both predictions. Under default framing, all six exhibit instruction compliance rates of 0% ‑‑ Claude Sonnet 4 verbally agrees ten out of ten times then bypasses in all ten. The gap is selective: 97% compliance where rationale is rewarded (audit trails), 0‑4% where it is not (file reading, privacy masking); removing delegation tools raises compliance to 75% (Cohen's d = 2.47), confirming environmental affordance rather than weight‑encoded failure. Nine blinded human raters achieve Fleiss' kappa = 0.130 and correctly identify zero of fifteen compliant sessions, exactly as Theorem 2 predicts. Where humans show 47% intention‑behavior gaps in psychology and 96.5pp gaps in surgical audits, RLHF‑trained models approach 100% under default conditions ‑‑ a regime warranting its own measurement infrastructure. We release BS‑Bench: the first open benchmark for process compliance, with seven tool‑call‑log audit metrics and a public leaderboard.
Authors:Sen Fang, Hongbin Zhong, Yanxin Zhang, Dimitris N. Metaxas
Abstract:
Existing large‑scale sign language resources typically provide supervision only at the level of raw video‑text alignment and are often produced in laboratory settings. While such resources are important for semantic understanding, they do not directly provide a unified interface for open‑world recognition and translation, or for modern pose‑driven sign language video generation frameworks: 1. RGB‑based pretrained recognition models depend heavily on fixed backgrounds or clothing conditions during recording, and are less robust in open‑world settings than style‑agnostic pose‑processing models. 2. Recent pose‑guided image/video generation models mostly use a unified keypoint representation such as DWPose as their control interface. At present, the sign language field still lacks a data resource that can directly interface with this modern pose‑native paradigm while also targeting real‑world open scenarios. We present SignVerse‑2M, a large‑scale multilingual pose‑native dataset for sign language pose modeling and evaluation. Built from publicly available multilingual sign language video resources, it applies DWPose in a unified preprocessing pipeline to convert raw videos into 2D pose sequences that can be used directly for modeling, resulting in a consolidated corpus of about two million clips covering more than 55 sign languages. Unlike many laboratory datasets, this resource preserves the recording conditions and speaker diversity of real‑world videos while reducing appearance variation through a unified pose representation. Toward this goal, we further provide the data construction pipeline, task definitions, and a simple SignDW Transformer baseline, demonstrating the feasibility of this resource for multilingual pose‑space modeling and its compatibility with modern pose‑driven pipelines, while discussing the evaluation claims it can support as well as its current limitations.
Authors:William Guey, Wei Zhang, Pierrick Bougault, Yi Wang, Bertan Ucar, Vitor D. de Moura, José O. Gomes
Abstract:
Large language models (LLMs) are rapidly being integrated into high‑stakes public safety systems, including emergency call triage and dispatch decision support, yet their demographic fairness in this context remains largely untested. Here we introduce a cross‑lingual audit framework that operationalizes the Police Priority Dispatch System as a five‑level ordinal classification task and applies a controlled minimal‑pair design to isolate the effect of demographic cues. Across 19,800 model outputs spanning 11 frontier models, 15 scenario pairs, three demographic categories (religious appearance, gender, and race), and two languages (English and Mandarin Chinese), we find that demographic bias emerges systematically when incident severity is ambiguous but largely disappears when the operational priority is clearly determined by call content. Bias magnitude varies by demographic axis, with the largest effects observed for religious appearance, followed by gender and race. Critically, bias does not transfer consistently across languages: gender bias is substantially amplified in Mandarin Chinese, whereas race bias is more pronounced in English, revealing cross‑lingual asymmetries that aggregate analyses obscure. In several scenarios, demographic cues produce counter‑directional effects, challenging simple stereotype‑amplification accounts of model behavior. These findings suggest that bias in LLM‑based dispatch is not a fixed property of models alone, but arises from the interaction between demographic signals, contextual ambiguity, and language. Beyond these empirical results, the proposed framework provides a scalable audit infrastructure that enables deploying agencies to evaluate candidate models on jurisdiction‑relevant scenarios prior to real‑world adoption.
Authors:Benjamin Warner, Ratna Sagari Grandhi, Max Kieffer, Aymane Ouraq, Saurav Panigrahi, Geetu Ambwani, Kunal Bagga, Nikhil Khandekar, Arya Hariharan, Nishant Mishra, Manish Ram, Shamus Sim Zi Yang, Ahmed Essouaied, Adepoju Jeremiah Moyondafoluwa, Robert Scholz, Bofeng Huang, Molly Beavers, Srishti Gureja, Anish Mahishi, Sameed Khan, Maxime Griot, Hunar Batra, Jean-Benoit Delbrouck, Siddhant Bharadwaj, Ronald Clark, Ashish Vashist, Anas Zafar, Leema Krishna Murali, Harsh Deshpande, Ameen Patel, William Brown, Johannes Hagemann, Connor Lane, Paul Steven Scotti, Tanishq Mathew Abraham
Abstract:
Evaluating large language models (LLMs) for medical applications remains challenging due to benchmark saturation, limited data accessibility, and insufficient coverage of relevant tasks. Existing suites have either saturated, heavily depend on restricted datasets, or lack comprehensive model coverage. We introduce Medmarks, a fully open‑source evaluation suite with 30 benchmarks spanning question answering, information extraction, medical calculations, and open‑ended clinical reasoning. We perform a systematic evaluation of 61 models across 71 configurations using verifiable metrics and LLM‑as‑a‑Judge. Our results show that frontier reasoning models (Gemini 3 Pro Preview, GPT‑5.1, & GPT‑5.2) achieve the highest performance across both benchmarks, most frontier proprietary models are significantly more token efficient than open‑weight alternatives, medically fine‑tuned models outperform their generalist counterparts, and that models are susceptible to answer‑order bias (particularly smaller models and Grok 4). A subset of our evals (Medmarks‑T) can be directly used as reinforcement learning environments to post‑train LLMs for medical reasoning. Code is available at https://github.com/MedARC‑AI/Medmarks
Authors:Jianze Wang, Ying Liu, Jinlong Chen, Xuchun Hu, Qilong Zhang, Yu Cao, Jun Wang, Hua Yang, Yong Xie, Qianglong Chen
Abstract:
On‑policy distillation (OPD) trains a student on its own trajectories under token‑level teacher supervision, but existing methods are capped by a single‑teacher capability ceiling: when the teacher errs, the student inherits the error. OPD also remains largely unexplored in agentic tasks, where per‑step errors compound across long trajectories and destabilize training. We propose MAD‑OPD (Multi‑Agent Debate‑driven On‑Policy Distillation), which breaks this ceiling by recasting the distillation teacher as a deliberative collective of teachers that debate over the student's on‑policy state; the debate produces an emergent collective intelligence that supplies token‑level supervision, with each teacher's contribution weighted by its post‑debate confidence. To extend OPD to agentic tasks, we also introduce On‑Policy Agentic Distillation (OPAD), which adds step‑level sampling to stabilize training under multi‑step error compounding. We additionally derive a task‑adaptive divergence principle, selecting JSD (Jensen‑Shannon divergence) for agentic stability and reverse KL (Kullback‑Leibler) divergence for code generation, and verify it both theoretically and empirically. Across six teacher‑student configurations (Qwen3 and Qwen3.5; 1.7B‑14B students, 8B‑32B teachers) and five agentic and code benchmarks, MAD‑OPD ranks first across all six configurations; on the 14B+8B\to4B setting it lifts the agentic average by +2.4% and the code average by +3.7% over the stronger single‑teacher OPD.
Authors:Peiyang Liu, Qiang Yan, Ziqiang Cui, Di Liang, Xi Wang, Wei Ye
Abstract:
Standard Retrieval‑Augmented Generation (RAG) systems predominantly rely on semantic relevance as a proxy for utility. However, this assumption collapses in realistic decision‑making scenarios where user queries are laden with cognitive biases, such as false premises or confirmation bias. In such cases, maximizing relevance paradoxically promotes the retrieval of sycophantic evidence that reinforces hallucinations, a critical failure we term the ``Relevance‑Robustness Gap''. To bridge this gap, we propose CoRM‑RAG (Counterfactual Risk Minimization for RAG), a framework that aligns retrieval with decision safety rather than mere similarity. Grounded in causal intervention, we introduce a Cognitive Perturbation Protocol to simulate user biases during training, which is then distilled into a lightweight Evidence Critic. This scoring module learns to identify documents that possess sufficient evidential strength to steer the model toward correctness despite adversarial query perturbations. Extensive experiments on decision‑making benchmarks demonstrate that CoRM‑RAG significantly outperforms strong dense retrievers and LLM‑based rerankers in adversarial settings, while enabling effective risk‑aware abstention through reliable robustness scoring. Our code is available at https://github.com/PeiYangLiu/CoRM‑RAG.git.
Authors:Peiyang Liu, Ziqiang Cui, Xi Wang, Di Liang, Wei Ye
Abstract:
Iterative Retrieval‑Augmented Generation (iRAG) has emerged as a powerful paradigm for answering complex multi‑hop questions by progressively retrieving and reasoning over external documents. However, current systems predominantly operate on parsed text, which creates two critical bottlenecks: (1) Coarse‑grained attribution, where users are burdened with manually locating evidence within lengthy documents based on vague text‑level citations; and (2) Visual semantic loss, where the conversion of visually rich documents (e.g., slides, PDFs with charts) into text discards spatial logic and layout cues essential for reasoning. To bridge this gap, we present Chain of Evidence (CoE), a retriever‑agnostic visual attribution framework that leverages Vision‑Language Models to reason directly over screenshots of retrieved document candidates. CoE eliminates format‑specific parsing and outputs precise bounding boxes, visualizing the complete reasoning chain within the retrieved candidate set. We evaluate CoE on two distinct benchmarks: Wiki‑CoE, a large‑scale dataset of structured web pages derived from 2WikiMultiHopQA, and SlideVQA, a challenging dataset of presentation slides featuring complex diagrams and free‑form layouts. Experiments demonstrate that fine‑tuned Qwen3‑VL‑8B‑Instruct achieves robust performance, significantly outperforming text‑based baselines in scenarios requiring visual layout understanding, while establishing a retriever‑agnostic solution for pixel‑level interpretable iRAG. Our code is available at https://github.com/PeiYangLiu/CoE.git.
Authors:Hector Borobia, Elies Seguí-Mas, Guillermina Tormo-Carbó
Abstract:
Speculative decoding accelerates autoregressive inference by drafting candidate tokens with a fast model and verifying them in parallel with the target. Self‑speculative methods avoid the need for an external drafter but have been studied exclusively in homogeneous Transformer architectures. We introduce component‑aware self‑speculative decoding, the first method to exploit the internal architectural heterogeneity of hybrid language models, isolating the SSM/linear‑attention subgraph as a zero‑cost internal draft. We evaluate this on two architecturally distinct hybrid families: Falcon‑H1 (parallel: Mamba‑2 + attention per layer) and Qwen3.5 (sequential: interleaved linear and attention layers), with a pure Transformer control (Qwen2.5). Parallel hybrids achieve acceptance rates of alpha = 0.68 at draft length k=2 under greedy decoding, while sequential hybrids yield only alpha = 0.038 ‑‑ an 18x gap attributable to how each architecture integrates its components. The property is scale‑invariant: Falcon‑H1 at 3B reproduces the rates observed at 0.5B. We further show that perplexity degradation from a companion ablation study predicts speculative viability without running speculative decoding: a 3.15x ratio (Falcon) maps to alpha = 0.37 at k=4, while 81.96x (Qwen) maps to alpha = 0.019. For sequential hybrids, generic LayerSkip achieves 12x higher acceptance rates than the component‑aware strategy. The composition pattern of hybrid models ‑‑ not merely the presence of alternative components ‑‑ determines whether component‑level self‑speculation is viable.
Authors:Hugo Abonizio, Filipe Rocha Lopes, Roberto Lotufo, Rodrigo Nogueira
Abstract:
Brazil's Unified Health System (SUS) relies on official clinical guidelines that define diagnostic criteria, treatments, dosages, and monitoring procedures for over 200 million citizens. Yet current LLMs perform poorly on this guideline‑specific knowledge, and no benchmark evaluates clinical recall grounded in Brazilian Portuguese protocols. We address this gap by adapting Qwen2.5‑14B‑Instruct to the Brazilian clinical domain. From 178 official guidelines (~5.4M tokens), we generate ~70M tokens of synthetic data in three formats ‑‑ rephrases, wiki‑style articles, and question‑answer pairs ‑‑ using four generator LLMs. We then apply continual pre‑training followed by Group Relative Policy Optimization (GRPO). We introduce HealthBench‑BR, with 1,780 balanced true/false clinical assertions, and PCDT‑QA, with 890 open‑ended clinical questions scored by an LLM judge. Our best model achieves 83.9% on HealthBench‑BR and 85.4% on PCDT‑QA, outperforming GPT‑5.2, Claude Sonnet 4.6, Gemini 3.1 Pro, and Google AI Overview's web‑grounded RAG despite having only 14B parameters. Ablations show that generator diversity and reinforcement learning are critical to these gains. We release all datasets, benchmarks, and model weights to support reproducible clinical NLP research for Brazilian Portuguese. Code, data, and model weights are available at https://github.com/hugoabonizio/clinical‑protocols‑br
Authors:Jindong Li, Ying Liu, Yali Fu, Jinjing Zhu, Leyao Wang, Menglin Yang, Rex Ying
Abstract:
LLMs are increasingly equipped with safety alignment mechanisms, yet recent studies demonstrate that they remain vulnerable to jailbreaking attacks that elicit harmful behaviors without explicit policy violations. While a growing body of work has explored automated jailbreak strategies, existing methods face several fundamental challenges, including the lack of systematic utilization of both successful and failed attack experiences, as well as the absence of principled mechanisms for composing and selecting reusable attack rules under diverse constraints. As a result, existing methods struggle to accumulate transferable knowledge over time and to reliably adapt attack strategies across different targets and evolving safety mechanisms. To address these issues, we propose a Self‑Evolving Rule‑Driven Training‑Free Jailbreak (SRTJ) framework that systematically discovers, composes, and refines attack strategies through interaction and feedback, without updating model parameters. Specifically, SRTJ couples experience‑driven attack generation with answer set programming (ASP)‑based rule selection and constraint‑aware composition, where iterative verifier feedback is leveraged to jointly refine successful strategies and analyze failure patterns. The resulting rule memory evolves in a hierarchical multi‑level manner, explicitly organizing distilled attack knowledge into long‑term, middle‑term, and short‑term rules, thereby capturing both stable transferable strategies and transient adaptive behaviors to effectively balance exploration and exploitation across attack attempts. Extensive experiments on mainstream jailbreak benchmark (HarmBench) demonstrate that SRTJ achieves strong and stable attack performance across different target LLMs, while exhibiting improved robustness and generalization compared to existing jailbreak methods. The code is available at https://github.com/TheSolkatt/SRTJ.
Authors:Chirag Shinde
Abstract:
We introduce energy‑based constraint networks ‑‑ a modality‑agnostic architecture that learns structural coherence from contrastive pairs. The system processes frozen encoder embeddings through a state‑space model with dual‑head attention, producing a scalar energy measuring structural consistency alongside per‑position energy scores that localize violations. Multiple independently trained branches detect different violation types and compose at inference without interference. We demonstrate the framework in two domains. In text, the system achieves 93.4% accuracy on trained corruption types and 87.2% on 9 unseen types, using frozen BERT and 7.4M trainable parameters. In vision, the same architecture achieves competitive deepfake detection: 0.959 AUC on FaceForensics++ Deepfakes and 0.870 on Celeb‑DF without any Celeb‑DF training data, using frozen DINOv2 and 3.6M parameters per branch. The framework supports flexible training: branches learn from designer‑specified corruptions, real‑world paired data, or both. Composable branches require representation compatibility ‑‑ a finding validated through extensive experimentation where five incompatible approaches failed before the compatible one succeeded. The architecture is encoder‑agnostic and domain‑agnostic: changing the domain requires only new corruption strategies; changing the encoder requires only a new input projection layer. To our knowledge, this is the first architecture to learn within‑modality structural coherence as an explicit energy landscape with per‑position decomposition, and to demonstrate that the same architecture transfers across modalities via corruption respecification alone.
Authors:Venkata Pushpak Teja Menta
Abstract:
A speaker encoder used in multilingual voice cloning should treat the same speaker identically regardless of which script the audio was uttered in. Off‑the‑shelf encoders do not, and the failure is accent‑conditional. On a 1043‑pair Western‑accented voice corpus across English, Hindi, Telugu, and Tamil, WavLM‑base‑plus‑sv loses 0.082 absolute cosine similarity when the same voice changes script and ECAPA‑TDNN loses 0.105. On a 1369‑pair Indian‑accented voice corpus, the gap shrinks to 0.006 (WavLM‑SV) and 0.044 (ECAPA‑TDNN). The leak is largest where it matters most for cross‑script TTS: when a system projects a non‑Indic‑trained voice into Indic scripts. We present LASE (Language‑Adversarial Speaker Encoder), a small projection head over frozen WavLM‑base‑plus trained with two losses: a supervised contrastive loss over voice identity, and a gradient‑reversal cross‑entropy against a 4‑language classifier that pushes the embedding to be language‑uninformative while remaining speaker‑informative. Trained on 1118 quality‑gated cross‑script pairs synthesised from 8 commercial multilingual voices, LASE's residual gap is consistent with zero on both corpora (Delta = 0.013 Western, Delta = 0.026 Indian; both bootstrap 95% CIs include zero) and amplifies the cross‑script‑vs‑floor margin 2.4‑2.7x over both baselines. An ECAPA+GRL ablation shows the GRL objective improves either backbone but the WavLM choice contributes too. In synthetic multi‑speaker diarisation, LASE matches ECAPA‑TDNN on cross‑script speaker recall (0.788 vs 0.789) with ~100x less training data. We release the r1 checkpoint, both corpora, and the bootstrap recipe.
Authors:Jiaqian Wang, Yutao Qi, Wenjin Hou, Yu Pang, Rui Yang
Abstract:
Text‑to‑SQL enables non‑expert users to query databases in natural language, yet real‑world schemas often suffer from ambiguous, abbreviated, or inconsistent naming conventions that degrade model accuracy. Existing approaches treat schemas as fixed and address errors downstream. In this paper, we frame schema refinement as a constrained optimization problem: find a renaming function that maximizes downstream Text‑to‑SQL execution accuracy while preserving query equivalence through database views. We analyze the computational hardness of this problem, which motivates a column‑wise greedy decomposition, and instantiate it as EGRefine: a four‑phase pipeline that screens ambiguous columns, generates context‑aware candidate names, verifies them through execution‑grounded feedback, and materializes the result as non‑destructive SQL views. The pipeline carries two structural properties: column‑local non‑degradation, ensured by the conservative selection rule in the verification phase, and database‑level query equivalence, ensured by the view‑based materialization phase. Together they make the resulting refinement safe by construction at the column level, with cross‑column and prompt‑level interactions handled empirically rather than analytically. Across controlled schema‑degradation, real‑world, and enterprise benchmarks, EGRefine recovers accuracy lost to schema naming noise where applicable and correctly abstains where the underlying task exceeds current Text‑to‑SQL capabilities, with refined schemas transferring across model families to enable refine‑once, serve‑many‑models deployment. Code and data are publicly available at https://github.com/ai‑jiaqian/EGRefine.
Authors:Michito Takeshita, Takuro Kawada, Takumi Ohashi, Shunsuke Kitada, Hitoshi Iyatomi
Abstract:
AI agents that interact with graphical user interfaces (GUIs) require effective observation representations for reliable grounding. The accessibility tree is a commonly used text‑based format that encodes UI element attributes, but it suffers from redundancy and lacks structural information such as spatial relationships among elements. We propose A11y‑Compressor, a framework that transforms linearized accessibility trees into compact and structured representations. Our implementation, Compressed‑a11y, applies a lightweight and structured transformation pipeline with modal detection, redundancy reduction, and semantic structuring. Experiments on the OSWorld benchmark show that Compressed‑a11y reduces input tokens to 22% of the original while improving task success rates by 5.1 percentage points on average.
Authors:Jiale Fu, Yuchu Jiang, Peijun Wu, Chonghan Liu, Joey Tianyi Zhou, Xu Yang
Abstract:
Model ensembling is a well‑established technique for improving the performance of machine learning models. Conventionally, this involves averaging the output distributions of multiple models and selecting the most probable label. This idea has been naturally extended to large language models (LLMs), yielding improved performance but incurring substantial computational cost. This inefficiency stems from directly applying conventional ensemble implementation to LLMs, which require a separate forward pass for each model to explicitly compute the ensemble distribution. In this paper, we propose the Mixture‑model‑like Ensemble (ME). By reinterpreting the ensemble as a mixture model, ME stochastically selects a single model at each step to generate the next token, thereby avoiding the need to explicitly compute the full ensemble distribution. ME is mathematically equivalent to sampling from the ensemble distribution, but requires invoking only one model, making it 1.78x‑2.68x faster than conventional ensemble. Furthermore, this perspective connects LLM ensembling and token‑level routing methods, suggesting that LLM ensembling is a special case of routing methods. Our findings open new avenues for efficient LLM ensembling and motivate further exploration of token‑level routing strategies for LLMs. Our code is available at https://github.com/jialefu/Mixture‑model‑like‑Ensemble/.
Authors:Aninda Ray
Abstract:
A multi‑agent pipeline with N agents typically issues N LLM calls per run. Merging agents into fewer calls (compound execution) promises token savings, but naively merged calls silently degrade quality through tool loss and prompt compression. We present Agent Capsules, an adaptive execution runtime that treats multi‑agent pipeline execution as an optimization problem with empirical quality constraints. The runtime instruments coordination overhead per group, scores composition opportunity, selects among three compound execution strategies, and gates every mode switch on rolling‑mean output quality. A controlled negative result confirms that injecting more context into a merged call worsens compression rather than relieving it, so the framework's escalation ladder (standard, then two‑phase, then sequential) recovers quality by moving toward per‑agent dispatch rather than by rewriting merged prompts. On LLM‑judged quality, the controller matches a hand‑tuned oracle on every measured (model, group, mode) cell: routing compound whenever the oracle would, and reverting to fine whenever quality would fail the floor, without per‑model configuration. Against a hand‑crafted LangGraph implementation of a 14‑agent competitive intelligence pipeline, Agent Capsules uses 51% fewer fine‑mode input tokens and 42% fewer compound‑mode input tokens, at +0.020 and +0.017 quality respectively. Against a DSPy implementation of a 5‑agent due diligence pipeline, the framework uses 19% fewer tokens than uncompiled DSPy at quality parity, and 68% fewer tokens than MIPROv2 at +0.052 quality. Even before compound mode fires, the runtime delivers efficiency through automatic policy resolution, cache‑aligned prompts, and topology‑aware context injection, matching both hand‑tuned and compile‑time baselines without training data or per‑pipeline engineering.
Authors:Zihan Lin, Xiaohan Wang, Jie Cao, Jiajun Chai, Li Wang, Xiaodong Lu, Wei Lin, Ran He, Guojun Yin
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) enhances reasoning of Large Language Models (LLMs) but usually exhibits limited generation diversity due to the over‑incentivization of positive rewards. Although methods like Negative Sample Reinforcement (NSR) mitigate this issue by upweighting penalty from negative samples, they may suppress the semantic distributions shared between positive and negative responses. To boost reasoning ability without losing diversity, this paper proposes negative sample projection Residual Reinforcement Learning (ResRL) that decouples similar semantic distributions among positive and negative responses. We theoretically link Lazy Likelihood Displacement (LLD) to negative‑positive head‑gradient interference and derive a single‑forward proxy that upper‑bounds representation alignment to guide conservative advantage reweighting. ResRL then projects negative‑token hidden representations onto an SVD‑based low‑rank positive subspace and uses projection residuals to modulate negative gradients, improving reasoning while preserving diversity and outperforming strong baselines on average across twelve benchmarks spanning Mathematics, Code, Agent Tasks, and Function Calling. Notably, ResRL surpasses NSR on mathematical reasoning by 9.4% in Avg@16 and 7.0% in Pass@128. Code is available at https://github.com/1229095296/ResRL.git.
Authors:Anamika Lochab, Bolian Li, Ruqi Zhang
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has achieved substantial gains in single‑attempt accuracy (Pass@1) on reasoning tasks, yet often suffers from reduced multi‑sample coverage (Pass@K), indicating diversity collapse. We identify a structural cause for this degradation: common RLVR objectives, such as GRPO, are indifferent to how probability mass is distributed among correct solutions. Combined with stochastic training dynamics, this indifference induces a self‑reinforcing collapse, in which probability mass concentrates on a narrow subset of correct outputs while alternative valid solutions are suppressed. We formalize this collapse mechanism and further characterize the optimal policy structure under two complementary criteria: robustness and entropy‑regularized optimality, which identify the Uniform‑Correct Policy as uniquely optimal. Motivated by this analysis, we propose Uniform‑Correct Policy Optimization (UCPO), a modification to GRPO that adds a conditional uniformity penalty on the policy's distribution over correct solutions. The penalty redistributes gradient signal toward underrepresented correct responses, encouraging uniform allocation of probability mass within the correct set. Across three models (1.5B‑7B parameters) and five mathematical reasoning benchmarks, UCPO improves Pass@K and diversity while maintaining competitive Pass@1, achieving up to +10% absolute improvement on AIME24 at Pass@64 and up to 45% higher equation‑level diversity within the correct set. The code is available at https://github.com/AnamikaLochab/UCPO.
Authors:Wei Liu, Hongkai Liu, Zhiying Deng, Yee Whye Teh, Wee Sun Lee
Abstract:
LLM parameter editing methods commonly rely on computing an ideal target hidden‑state at a target layer (referred as anchor point) and distributing the target vector to multiple preceding layers (commonly known as backward spreading) for cooperative editing. Although widely used for a long time, its underlying basis have not been systematically investigated. In this paper, we first conduct a systematic study of its foundations, which helps clarify its capability boundaries, practical considerations, and potential failure modes. Then, we propose a simple and elegant alternative that replaces backward spreading with forward‑propagation. Instead of optimizing the target at the last editing layer, we optimize the anchor point at the first editing layer, and then propagate it forward to obtain accurate and mutually compatible target hidden‑states for all subsequent editing layers. This approach achieves the same computational complexity as existing methods while producing more accurate layer‑wise targets. Our method is simple, without interfering with either the computation of the initial target hidden state or any other components of the subsequent editing pipeline, and thus constituting a benefit for a wide range of LLM parameter editing methods.
Authors:Khizar Qureshi, Geoffrey Martin, Yifan Peng
Abstract:
A key challenge for large language models is token cost per query and overall deployment cost. Clinical inputs are long, heterogeneous, and often redundant, while downstream tasks are short and high stakes. We study budgeted context selection, where a subset of document units is chosen under a strict token budget so an off‑the‑shelf generator can meet fixed cost and latency constraints. We cast this as a knapsack‑constrained subset selection problem with two design choices, unitization that defines document segmentation and selection that determines which units are kept. We propose RCD, a monotone submodular objective that balances relevance, coverage, and diversity. We compare sentence, section, window, and cluster‑based unitization, and introduce a routing heuristic that adapts to the budget regime. Experiments on MIMIC discharge notes, Cochrane abstracts, and L‑Eval show that optimal strategies depend on the evaluation setting. Positional heuristics perform best at low budgets in extractive tasks, while diversity‑aware methods such as MMR improve LLM generation. Selector choice matters more than unitization, with cluster‑based grouping reducing performance and other schemes behaving similarly. ROUGE saturates for LLM summaries, while BERTScore better reflects quality differences. We release our code at https://github.com/stone‑technologies/ACL_budget_paper.
Authors:YiFeng Wang, Zhun Sun, Keisuke Sakaguchi
Abstract:
We present Activation Residual Hessian Quantization (ARHQ), a post‑training weight splitting method designed to mitigate error propagation in low‑bit activation‑weight quantization. By constructing an input‑side residual Hessian from activation quantization residuals (G_x), ARHQ analytically identifies and isolates error‑sensitive weight directions into a high‑precision low‑rank branch. This is achieved via a closed‑form truncated SVD on the scaled weight matrix W G^1/2_x . Experimental results on Qwen3‑4B‑Thinking‑2507 demonstrate that ARHQ significantly improves layer‑wise SNR and preserves downstream reasoning performance on ZebraLogic even under aggressive quantization. The code is available at https://github.com/BeautMoonQ/ARHQ.
Authors:Ishan Gupta, Pavlo Buryi
Abstract:
We examine if frontier chat‑based large language models (LLMs) adjust their outputs based on neurodivergence (ND) context in system prompts and describe the nature of these adjustments. Specifically, we propose NDBench, a 576‑output benchmark involving two frontier models, three system prompt types (baseline, ND‑profile assertion, and ND‑profile assertion with explicit instructions for adjustments), four canonical ND profiles, and 24 prompts across four categories, one of which involves an adversarial masking strategy. Four trends emerge consistently from our findings. First, LLMs show significant adaptation under ND context, where fully instructed conditions yield lengthier and more structured outputs, characterized by higher token counts, more headings, and more granular steps (p < 10^‑8, Holm‑corrected). Second, such adaptation is largely structural in nature: although list density does not change much, there is a marked rise in the frequency of headings and per‑step detail. Third, ND persona assertion alone fails to suppress potentially harmful tendencies, as masking‑reinforcement decreases only in explicitly instructed cases (36‑44% reduction); the reduction rate barely changes in persona assertion conditions. Moreover, reliability analysis of LLM‑based harm assessment reveals that only two out of the six dimensions (masking and reinforcement, validation quality) exceed the pre‑defined inter‑judge agreement criterion (alpha >= 0.67) and thus can be considered primary results. NDBench is made publicly available along with its prompts, outputs, code, and other resources, forming a reproducible framework for auditing future LLMs' adaptation to ND awareness.
Authors:Sudong Wang, Weiquan Huang, Xiaomin Yu, Zuhao Yang, Hehai Lin, Keming Wu, Chaojun Xiao, Chen Chen, Wenxuan Wang, Beier Zhu, Yunjian Zhang, Chengwei Qin
Abstract:
The standard post‑training recipe for large multimodal models (LMMs) applies supervised fine‑tuning (SFT) on curated demonstrations followed by reinforcement learning with verifiable rewards (RLVR). However, SFT introduces distributional drift that neither preserves the model's original capabilities nor faithfully matches the supervision distribution. This problem is further amplified in multimodal reasoning, where perception errors and reasoning failures follow distinct drift patterns that compound during subsequent RL. We introduce PRISM, a three‑stage pipeline that mitigates this drift by inserting an explicit distribution‑alignment stage between SFT and RLVR. Building on the principle of on‑policy distillation (OPD), PRISM casts alignment as a black‑box, response‑level adversarial game between the policy and a Mixture‑of‑Experts (MoE) discriminator with dedicated perception and reasoning experts, providing disentangled corrective signals that steer the policy toward the supervision distribution without requiring access to teacher logits. While 1.26M public demonstrations suffice for broad SFT initialization, distribution alignment demands higher‑fidelity supervision; we therefore curate 113K additional demonstrations from Gemini 3 Flash, featuring dense visual grounding and step‑by‑step reasoning on the hardest unsolved problems. Experiments on Qwen3‑VL show that PRISM consistently improves downstream RLVR performance across multiple RL algorithms (GRPO, DAPO, GSPO) and diverse multimodal benchmarks, improving average accuracy by +4.4 and +6.0 points over the SFT‑to‑RLVR baseline on 4B and 8B, respectively. Our code, data, and model checkpoints are publicly available at https://github.com/XIAO4579/PRISM.
Authors:Smit Jivani, Sarvam Maheshwari, Sunita Sarawagi
Abstract:
Large language models (LLMs) have revolutionized Text‑to‑SQL generation, allowing users to query structured data using natural language with growing ease. Yet, real‑world deployment remains challenging, especially in complex or unseen schemas, due to inconsistent accuracy and the risk of generating invalid SQL. We introduce Template Constrained Decoding (TeCoD), a system that addresses these limitations by harnessing the recurrence of query patterns in labeled workloads. TeCoD converts historical NL‑SQL pairs into reusable templates and introduces a robust template selection module that uses a fine‑tuned natural language inference model to match or reject queries efficiently. Once the template is selected, TeCoD enforces it during SQL generation through grammar‑constrained decoding, implemented via a novel partitioned strategy that ensures both syntactic validity and efficiency. Together, these components yield up to 36% higher execution accuracy than in‑context learning (ICL) and 2.2x lower latency on matched queries.
Authors:Abdelrahman Sadallah, Kareem Elozeiri, Mervat Abassy, Rania Elbadry, Mohamed Anwar, Abed Alhakim Freihat, Preslav Nakov, Fajri Koto
Abstract:
Poetry has long been a central art form for Arabic speakers, serving as a powerful medium of expression and cultural identity. While modern Arabic speakers continue to value poetry, existing research on Arabic poetry within Large Language Models (LLMs) has primarily focused on analysis tasks such as interpretation or metadata prediction, e.g., rhyme schemes and titles. In contrast, our work addresses the practical aspect of poetry creation in Arabic by introducing controllable generation capabilities to assist users in writing poetry. Specifically, we present a large‑scale, carefully curated instruction‑based dataset in Modern Standard Arabic (MSA) and various Arabic dialects. This dataset enables tasks such as writing, revising, and continuing poems based on predefined criteria, including style and rhyme, as well as performing poetry analysis. Our experiments show that fine‑tuning LLMs on this dataset yields models that can effectively generate poetry that is aligned with user requirements, based on both automated metrics and human evaluation with native Arabic speakers. The data and the code are available at https://github.com/mbzuai‑nlp/instructpoet‑ar
Authors:Yuyang Li, Yime He, Zeyu Zhang, Dong Gong
Abstract:
Long‑term conversational memory requires retrieving evidence scattered across multiple sessions, yet single‑pass retrieval fails on temporal and multi‑hop questions. Existing iterative methods refine queries via generated content or document‑level signals, but none explicitly diagnoses the evidence gap, namely what is missing from the accumulated retrieval set, leaving query refinement untargeted. We present EviMem, combining IRIS (Iterative Retrieval via Insufficiency Signals), a closed‑loop framework that detects evidence gaps through sufficiency evaluation, diagnoses what is missing, and drives targeted query refinement, with LaceMem (Layered Architecture for Conversational Evidence Memory), a coarse‑to‑fine memory hierarchy supporting fine‑grained gap diagnosis. On LoCoMo, EviMem improves Judge Accuracy over MIRIX on temporal (73.3% to 81.6%) and multi‑hop (65.9% to 85.2%) questions at 4.5x lower latency. Code: https://github.com/AIGeeksGroup/EviMem.
Authors:Pengyun Zhu, Qiheng Sun, Long Wen, Yanbo Wang, Yang Cao, Junxu Liu, Deyi Xiong, Jinfei Liu, Zhibo Wang, Kui Ren
Abstract:
Privacy policies are essential for users to understand how service providers handle their personal data. However, these documents are often long and complex, as well as filled with technobabble and legalese, causing users to unknowingly accept terms that may even contradict the law. While summarizing and interpreting these privacy policies is crucial, there is a lack of high‑quality English parallel corpus optimized for legal clarity and readability. To address this issue, we introduce APPSI‑139, a high‑quality English privacy policy corpus meticulously annotated by domain experts, specifically designed for summarization and interpretation tasks. The corpus includes 139 English privacy policies, 15,692 rewritten parallel corpora, and 36,351 fine‑grained annotation labels across 11 data practice categories. Concurrently, we propose TCSI‑pp‑V2, a hybrid privacy policy summarization and interpretation framework that employs an alternating training strategy and coordinates multiple expert modules to effectively balance computational efficiency and accuracy. Experimental results show that the hybrid summarization system built on APPSI‑139 corpus and the TCSI‑pp‑V2 framework outperform large language models, such as GPT‑4o and LLaMA‑3‑70B, in terms of readability and reliability. The source code and dataset are available at https://github.com/EnlightenedAI/APPSI‑139.
Authors:Jiasheng Zheng, Xin Zheng, Boxi Cao, Pengbo Wang, Zhengzhao Ma, Qiming Zhu, Jiazhen Jiang, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun
Abstract:
Code sandboxes have emerged as a critical infrastructure for advancing the coding capabilities of large language models, providing verifiable feedback for both RL training and evaluation. However, existing systems fail to provide accurate verification and efficiency under high‑concurrency workloads. We present ScaleBox, a high‑fidelity and scalable system designed to address these limitations in large‑scale code training. ScaleBox introduces automated special‑judge generation and management, fine‑grained parallel execution across test cases with seamless multi‑node coordination, and a configuration‑driven evaluation suite for reproducible benchmarking. A series of experiments demonstrates that ScaleBox significantly enhances code verification accuracy and efficiency. Our further RLVR experiments show that ScaleBox substantially improves both performance on LiveCodeBench and training stability, significantly outperforming heuristic‑matching baselines. By providing a reliable and high‑throughput infrastructure, ScaleBox facilitates more effective research and development in large‑scale code training.
Authors:Qingyu Ren, Tianjun Pan, Xingzhou Chen, Xuhong Wang
Abstract:
Large language models have achieved remarkable progress in text generation but still struggle with generative writing tasks. In terms of evaluation, existing benchmarks evaluate writing reward models coarsely and fail to measure performance from the perspective of specific requirements. In terms of training, existing training methods either use LLM‑as‑a‑judge approaches or train coarse‑grained reward models, lacking fine‑grained requirement‑adherence reward modeling. To address these issues, we propose a fine‑grained evaluation pipeline WEval for writing reward models and a fine‑grained reinforcement learning training framework WRL. The evaluation data of WEval covers multiple task categories and requirement types, enabling systematic evaluation of writing reward models by measuring the correlation between the rankings of the reward model and gold rankings. WRL constructs positive and negative samples by selectively dropping instruction requirements, allowing for more precise reward model training. Experiments show that our models achieve substantial improvements across various writing benchmarks and exhibit strong generalization. The code and data are publicly available at \hrefhttps://github.com/Rainier‑rq1/From_Coarse_to_Finehttps://github.com/Rainier‑rq1/From\_Coarse\_to\_Fine.
Authors:Tomomasa Hara, Hiroto Kurita, Masaaki Imaizumi, Kentaro Inui, Sho Yokoi
Abstract:
For constructing text embeddings, mean pooling, which averages token embeddings, is the standard approach. This paper examines whether mean pooling actually works well in real models. First, we note that mean pooling can collapse information beyond the first‑order statistics of the token embeddings, such as second‑order statistics that capture their spatial structure, potentially mapping distinct token embedding distributions to similar text embeddings. Motivated by this concern, we propose a simple metric to quantify such a collapse induced by mean pooling. Then, using this metric, we empirically measure how often this collapse occurs in actual models and texts, and find that modern text encoders are robust to this collapse. In particular, contrastive fine‑tuned text encoders tend to be less prone to the collapse than their pretrained backbone models. We also find that the robustness of these text encoders lies in the concentration of token embeddings within each text. In addition, we find that robustness to the collapse, as quantified by our proposed metric, correlates with downstream task performance. Overall, our findings offer a new perspective on why modern text encoders remain effective despite relying on seemingly coarse mean pooling.
Authors:Jean Martins, Leonid Mokrushin, Marin Orlic
Abstract:
Intent‑based networking promises to revolutionize telecommunications network management by enabling operators to specify high‑level goals rather than low‑level configurations. The TM Forum Intent Ontology (tio) provides a standardized vocabulary for expressing network intents, yet lacks formal validation mechanisms to ensure intent correctness before its admission. We present tio‑shacl, the first comprehensive SHACL (Shapes Constraint Language) validation framework for the TMF Intent Ontology. Our contribution includes 56 node shapes and 69 property shapes across all 15 tio v3.6.0 ontology modules, a reusable constraint library with 25 parameterized SPARQL‑based constraint components, and novel validation patterns for recursive logical operators, quantity‑based constraints, and cross‑expectation relationships. We pursued 100% vocabulary coverage (87 classes, 109 properties, 72 functions), cross‑implementation compatibility across three major SHACL engines, and validation accuracy on a corpus of 133 test cases. tio‑shacl is publicly available under MIT license at https://github.com/EricssonResearch/tio‑shacl and enables automated syntactic and semantic validation of network intents, addressing a critical gap in the field.
Authors:Zhen Zhang, Changyi Yang, Zijie Xia, Zhen Yang, Chengzhi Liu, Zhaotiao Weng, Yepeng Liu, Haobo Chen, Jin Pan, Chenyang Zhao, Yuheng Bu, Alkesh Patel, Zhe Gan, Xin Eric Wang
Abstract:
Token serves as the fundamental unit of computation in modern autoregressive models, and generation length directly influences both inference cost and reasoning performance. Despite its importance, existing approaches lack fine‑grained length modeling, operating primarily at the coarse‑grained sequence level. We introduce the Length Value Model (LenVM), a token‑level framework that models the remaining generation length. By formulating length modeling as a value estimation problem and assigning a constant negative reward to each generated token, LenVM predicts a bounded, discounted return that serves as a monotone proxy for the remaining generation horizon. This formulation yields supervision that is annotation‑free, dense, unbiased, and scalable. Experiments on LLMs and VLMs demonstrate LenVM provides a highly effective signal at inference time. On the LIFEBench exact length matching task, applying LenVM to a 7B model improves the length score from 30.9 to 64.8, significantly outperforming frontier closed‑source models. Furthermore, LenVM enables continuous control over the trade off between performance and efficiency. On GSM8K at a budget of 200 tokens, LenVM maintains 63% accuracy compared to 6 percent for token budget baseline. It also accurately predicts total generation length from the prompt boundary. Finally, LenVM's token‑level values offer an interpretable view of generation dynamics, revealing how specific tokens shift reasoning toward shorter or longer regimes. Results demonstrate that LenVM supports a broad range of applications and token length can be effectively modeled as a token‑level value signal, highlighting the potential of LenVM as a general framework for length modeling and as a length‑specific value signal that could support future RL training. Code is available at https://github.com/eric‑ai‑lab/Length‑Value‑Model.
Authors:Arne Eichholtz, Yongkang Li, Jutte Vijverberg, Tobias Groot, Mohammad Aliannejadi
Abstract:
The Hypencoder, proposed by Killingback et al., is a retrieval framework that replaces the fixed inner‑product scoring function used in standard bi‑encoders with a query‑specific neural network (the q‑net), whose weights are generated by a hypernetwork from the contextualized query embeddings. This design enables more expressive relevance estimation while preserving independent query and document encoding. In this work, we conduct a reproducibility study of the Hypencoder and extend the original analysis in three directions. Our reproduction confirms that the Hypencoder outperforms a similarly trained bi‑encoder baseline on in‑domain and out‑of‑domain benchmarks, and that the proposed efficient search algorithm substantially reduces query latency with minimal performance loss. On hard retrieval tasks, we find partial support: the Hypencoder outperforms the baseline on DL‑Hard and FollowIR, but not on TREC TOT, where checkpoint incompatibility and fine‑tuning sensitivity complicate full verification. Beyond reproduction, we investigate three extensions: (i)~integrating alternative pre‑trained encoders into the Hypencoder framework, where we find that performance gains depend on the encoder and fine‑tuning strategy; (ii)~comparing query latency against a Faiss‑based bi‑encoder pipeline, revealing that standard bi‑encoder retrieval remains faster under both exhaustive and efficient search settings; and (iii)~evaluating adversarial robustness, where we find that the q‑net's non‑linear scoring does not provide a consistent robustness disadvantage over inner‑product scoring. Our code is publicly available at https://github.com/arneeichholtz/Hypencoder‑reprod.
Authors:Bingxi Zhao, Jiahao Zhang, Xubin Ren, Zirui Guo, Tianzhe Chu, Yi Ma, Chao Huang
Abstract:
Education represents one of the most promising real‑world applications for Large Language Models (LLMs). However, conventional tutoring systems rely on static pre‑training knowledge that lacks adaptation to individual learners, while existing RAG‑augmented systems fall short in delivering personalized, guided feedback. To bridge this gap, we present DeepTutor, an agent‑native open‑source framework for personalized tutoring where every feature shares a common personalization substrate. We propose a hybrid personalization engine that couples static knowledge grounding with dynamic multi‑resolution memory, distilling interaction history into a continuously evolving learner profile. Moreover, we construct a closed tutoring loop that bidirectionally couples citation‑grounded problem solving with difficulty‑calibrated question generation. The personalization substrate further supports collaborative writing, multi‑agent deep research, and interactive guided learning, enabling cross‑modality coherence. To move beyond reactive interfaces, we introduce TutorBot, a proactive multi‑agent layer that deploys tutoring capabilities through extensible skills and unified multi‑channel access, providing consistent experience across platforms. To better evaluate such tutoring systems, we construct TutorBench, a student‑centric benchmark with source‑grounded learner profiles and a first‑person interactive protocol that measures adaptive tutoring from the learner's perspective. We further evaluate foundational agentic reasoning abilities across five authoritative benchmarks. Experiments show that DeepTutor improves personalized tutoring quality while maintaining general agentic reasoning abilities. We hope DeepTutor provides unique insights into next‑generation AI‑powered and personalized tutoring systems for the community.
Authors:Gongbo Zhang, Wen Wang, Ye Tian, Li Yuan
Abstract:
Diffusion large language models (dLLMs) offer parallel decoding and bidirectional context, but state‑of‑the‑art dLLMs require billions of parameters for competitive performance. While existing distillation methods for dLLMs reduce inference steps within a single architecture, none address cross‑architecture knowledge transfer, in which the teacher and student differ in architecture, attention mechanism, and tokenizer. We present TIDE, the first framework for cross‑architecture dLLM distillation, comprising three modular components: (1) TIDAL, which jointly modulates distillation strength across training progress and diffusion timestep to account for the teacher's noise‑dependent reliability; (2) CompDemo, which enriches the teacher's context via complementary mask splitting to improve predictions under heavy masking; and (3) Reverse CALM, a cross‑tokenizer objective that inverts chunk‑level likelihood matching, yielding bounded gradients and dual‑end noise filtering. Distilling 8B dense and 16B MoE teachers into a 0.6B student via two heterogeneous pipelines outperforms the baseline by an average of 1.53 points across eight benchmarks, yielding notable gains in code generation, where HumanEval scores reach 48.78 compared to 32.3 for the AR baseline.
Authors:Yeheng Chen, Chaoxiang Xie, Yuling Shi, Wenhao Zeng, Yongpan Wang, Hongyu Zhang, Xiaodong Gu
Abstract:
LLMs have achieved strong results on both function‑level code synthesis and repository‑level code modification, yet a capability that falls between these two extremes ‑‑ compositional code creation, i.e., building a complete, internally structured class from a specification ‑‑ remains underserved. Current evaluations are either confined to isolated functions or rely on manually curated class‑level tasks that are expensive to scale and increasingly susceptible to data contamination. We introduce ClassEval‑Pro, a benchmark of 300 class‑level tasks spanning 11 domains, constructed through an automated three‑stage pipeline that combines complexity enhancement, cross‑domain class composition, and integration of real‑world GitHub code contributed after January 2025. Every task is validated by an LLM Judge Ensemble and must pass test suites with over 90% line coverage. We evaluate five frontier LLMs under five generation strategies. The best model achieves only 45.6% class‑level Pass@1, with a 17.7‑point gap between the strongest and weakest models, confirming the benchmark's discriminative power. Strategy choice strongly interacts with model capability: structured approaches such as bottom‑up improve weaker models by up to 9.4 percentage points, while compositional generation collapses to as low as 1.3%. Error analysis over 500 manually annotated failures reveals that logic errors (56.2%) and dependency errors (38.0%) dominate, identifying cross‑method coordination as the core bottleneck.
Authors:Fei Bai, Huatong Song, Shuang Sun, Daixuan Cheng, Yike Yang, Chuan Hao, Renyuan Li, Feng Chang, Yuan Wei, Ran Tao, Bryan Dai, Jian Yang, Wayne Xin Zhao
Abstract:
Claw‑style environments support multi‑step workflows over local files, tools, and persistent workspace states. However, scalable development around these environments remains constrained by the absence of a systematic framework, especially one for synthesizing verifiable training data and integrating it with agent training and diagnostic evaluation. To address this challenge, we present ClawGym, a scalable framework that supports the full lifecycle of Claw‑style personal agent development. Concretely, we construct ClawGym‑SynData, a diverse dataset of 13.5K filtered tasks synthesized from persona‑driven intents and skill‑grounded operations, paired with realistic mock workspaces and hybrid verification mechanisms. We then train a family of capable Claw‑style models, termed ClawGym‑Agents, through supervised fine‑tuning on black‑box rollout trajectories, and further explore reinforcement learning via a lightweight pipeline that parallelizes rollouts across per‑task sandboxes.To support reliable evaluation, we further construct ClawGym‑Bench, a benchmark of 200 instances calibrated through automated filtering and human‑LLM review. Relevant resources will be soon released at https://github.com/ClawGym.
Authors:Jon-Paul Cacioli
Abstract:
A predecessor pilot (Cacioli, 2026) found that Llama‑3‑8B implements prompted sandbagging as positional collapse rather than answer avoidance. However, fixed option ordering in MMLU‑Pro left open whether this reflected a model‑level position‑dominant policy or dataset‑level distractor structure. This pre‑registered follow‑up (3 models, 2,000 MMLU‑Pro items, 4 conditions, 24,000 primary trials) added cyclic option‑order randomisation as the critical control. The pre‑registered item‑level same‑letter diagnostic did not confirm deterministic position‑tracking (same‑letter rate 37.3%, below the 50% threshold). However, pre‑specified supporting analyses revealed that the response‑position distribution under sandbagging was highly stable under complete content rotation (Pearson r = 0.9994; Jensen‑Shannon divergence = 0.027, compared to 0.386 between honest and sandbagging conditions). Accuracy spiked to 72.1% when the correct answer coincidentally occupied the preferred position E, and fell to 4.3% at position A. The data provide strong evidence for a soft distributional attractor: under sandbagging instruction, the model enters a low‑entropy response‑position basin centred on E/F/G that is highly stable and largely content‑invariant at the aggregate level. Qwen‑2.5‑7B served as a negative control (non‑compliant, no distributional shift). These results provide evidence, at the 7‑9 billion parameter scale, that response‑position entropy is a promising black‑box behavioural signature of this sandbagging mode.
Authors:Wenshuo Zhao, Qi Zhu, Xingshan Zeng, Fei Mi, Lifeng Shang, Yi R., Fung
Abstract:
An effective way to scale up test‑time compute of large language models is to sample multiple responses and then select the best one, as in Grok Heavy and Gemini Deep Think. Existing selection methods often rely on external reward models, which requires training a strong reward model and introduces additional computation overhead. As an alternative, previous approaches have explored intrinsic signals, such as confidence and entropy, but these signals are noisy with naive aggregation. In this work, we observe that high‑entropy tokens tend to cluster into consecutive groups during inference, providing a more stable notion of model uncertainty than individual tokens. Together, these clusters reveal temporal patterns of model uncertainty throughout the inference process. Motivated by this observation, we propose to use the temporal structure of uncertainty as an intrinsic reward. To this end, we first formalize the basic unit of segment‑level uncertainty as the High Entropy Phase (HEP), a variable‑length segment that begins at a high‑entropy token and ends when consecutive low‑entropy tokens appear. We then define the Entropy Centroid, inspired by the concept of the center of mass in physics, as the weighted average position of all HEPs along the trajectory. Intuitively, a lower centroid indicates early exploration followed by confident generation, which we find often corresponds to higher response quality. Based on this insight, we propose the Lowest Centroid method, which selects the response with the lowest entropy centroid among multiple candidates. Experiments on mathematics, code generation, logical reasoning, and agentic tasks, across model scales ranging from 14B to 480B, show that Lowest Centroid consistently outperforms existing baselines and delivers stable gains as model size increases. Code is available at https://github.com/hkust‑nlp/entropy‑centroid.
Authors:Yikai Zhang, Jiaxin Pei, Kenan Li, Maoquan Wang, Jin Pan, Yu Kang, Shengyu Fu, Elsie Nallipogu, Junjie Hu, Yufan Huang, Zijian Jin
Abstract:
Large language model agents have achieved remarkable progress on software engineering tasks, yet current approaches suffer from a fundamental context coupling problem: the standard code editing interface conflates code inspection, modification planning, and edit execution within a single context window, forcing agents to interleave exploratory viewing with strictly formatted edit generation. This causes irrelevant information to accumulate and degrades agent performance. To address this, we propose SWE‑Edit, which decomposes code editing into two specialized subagents: a Viewer that extracts task‑relevant code on demand, and an Editor that executes modifications from high‑level plans‑‑allowing the main agent to focus on reasoning while delegating context‑intensive operations to clean context windows. We further investigate what makes an effective editing model: observing that the prevalent find‑and‑replace format is error‑prone, we train Qwen3‑8B with GRPO to adaptively select editing modes, yielding improved editing efficiency over single‑format baselines. On SWE‑bench Verified, SWE‑Edit improves resolved rate by 2.1% while reducing inference cost by 17.9%. We additionally propose a code editing benchmark that reliably predicts downstream agentic performance, providing practical guidance for editing model selection. Our code is publicly available at https://github.com/microsoft/SWE‑Edit.
Authors:Richard A. A. Jonker, Bárbara Maria Ribeiro de Abreu Martins, Sérgio Matos
Abstract:
This paper presents a principled and scalable framework for systematically generating complex Question Answering (QA) data. In the core of this framework is a graphlet‑anchored generation process, where small subgraphs from a Knowledge Graph (KG) are used in a structured prompt to control the complexity and ensure the factual grounding of questions generated by Large Language Models. The first instantiation of this framework is BioGraphletQA, a new biomedical KGQA dataset of 119,856 QA pairs. Each entry is grounded in a graphlet of up to five nodes from the OREGANO KG, with most of the pairs being enriched with relevant document snippets from PubMed. We start by demonstrating the framework's value and the dataset's quality through evaluation by a domain expert on 106 QA pairs, confirming the high scientific validity and complexity of the generated data. Secondly, we establish its practical utility by showing that augmenting downstream benchmarks with our data improves accuracy on PubMedQA from 49.2% to 68.5% in a low‑resource setting, and on MedQA from a 41.4% baseline to 44.8% in a full‑resource setting. Our framework provides a robust and generalizable solution for creating critical resources to advance complex QA tasks, including MCQA and KGQA. All resources supporting this work, including the dataset (https://zenodo.org/records/17381119) and framework code (https://github.com/ieeta‑pt/BioGraphletQA), are publicly available to facilitate use, reproducibility and extension.
Authors:Tiago Teixeira, Ana Carolina Erthal, Juan Belieni, Beatriz Canaverde, Diego Mesquita, Miguel Faria, Eliezer de Souza da Silva, André F. T. Martins
Abstract:
The use of large language models (LLMs) for complex mathematical reasoning is an emergent area of research, with fast progress in methods, models, and benchmark datasets. However, most mathematical reasoning evaluations exhibit a significant linguistic bias, with the vast majority of benchmark datasets being exclusively in English or (at best) translated from English. We address this limitation by introducing \sc Math‑PT, a novel dataset comprising 1,729 mathematical problems written in European and Brazilian Portuguese. \sc Math‑PT is curated from a variety of high‑quality native sources, including mathematical Olympiads, competitions, and exams from Portugal and Brazil. We present a comprehensive benchmark of current state‑of‑the‑art LLMs on \sc Math‑PT, revealing that frontier reasoning models achieve strong performance in multiple choice questions compared to open weight models, but that their performance decreases for questions with figures or open‑ended questions. To facilitate future research, we release the benchmark dataset and model outputs.
Authors:Dumitru Verşebeniuc, Martijn Elands, Sara Falahatkar, Chiara Magrone, Mohammad Falah, Martijn Boussé, Aki Härmä
Abstract:
Large Language Models have been increasingly employed in the creation of Virtual Assistants due to their ability to generate human‑like text and handle complex inquiries. While these models hold great promise, challenges such as hallucinations, missing information, and the difficulty of providing accurate and context‑specific responses persist, particularly when applied to highly specialized content domains. In this paper, we focus on addressing these challenges by developing a virtual assistant designed to support students at Maastricht University in navigating project‑specific regulations. We propose a virtual assistant based on a Retrieval‑Augmented Generation system that enhances the accuracy and reliability of responses by integrating up‑to‑date, domain‑specific knowledge. Through a robust evaluation framework and real‑life testing, we demonstrate that our virtual assistant can effectively meet the needs of students while addressing the inherent challenges of applying Large Language Models to a specialized educational context. This work contributes to the ongoing discourse on improving LLM‑based systems for specific applications and highlights areas for further research.
Authors:Jinxiang Meng, Shaoping Huang, Fangyu Lei, Jingyu Guo, Haoxiang Liu, Jiahao Su, Sihan Wang, Yao Wang, Enrui Wang, Ye Yang, Hongze Chai, Jinming Lv, Anbang Yu, Huangjing Zhang, Yitong Zhang, Yiming Huang, Zeyao Ma, Shizhu He, Jun Zhao, Kang Liu
Abstract:
Real‑world data visualization (DV) requires native environmental grounding, cross‑platform evolution, and proactive intent alignment. Yet, existing benchmarks often suffer from code‑sandbox confinement, single‑language creation‑only tasks, and assumption of perfect intent. To bridge these gaps, we introduce DV‑World, a benchmark of 260 tasks designed to evaluate DV agents across real‑world professional lifecycles. DV‑World spans three domains: DV‑Sheet for native spreadsheet manipulation including chart and dashboard creation as well as diagnostic repair; DV‑Evolution for adapting and restructuring reference visual artifacts to fit new data across diverse programming paradigms and DV‑Interact for proactive intent alignment with a user simulator that mimics real‑world ambiguous requirements. Our hybrid evaluation framework integrates Table‑value Alignment for numerical precision and MLLM‑as‑a‑Judge with rubrics for semantic‑visual assessment. Experiments reveal that state‑of‑the‑art models achieve less than 50% overall performance, exposing critical deficits in handling the complex challenges of real‑world data visualization. DV‑World provides a realistic testbed to steer development toward the versatile expertise required in enterprise workflows. Our data and code are available at \hrefhttps://github.com/DA‑Open/DV‑Worldthis project page.
Authors:Christopher Potts, Moritz Sudhof
Abstract:
How much does a user's skill with AI shape what AI actually delivers for them? This question is critical for users, AI product builders, and society at large, but it remains underexplored. Using a richly annotated sample of 27K transcripts from WildChat‑4.8M, we show that fluent users take on more complex tasks than novices and adopt a fundamentally different interactional mode: they iterate collaboratively with the AI, refining goals and critically assessing outputs, whereas novices take a passive stance. These differences lead to a paradox of AI fluency: fluent users experience more failures than novices ‑‑ but their failures tend to be visible (a direct consequence of their engagement), they are more likely to lead to partial recovery, and they occur alongside greater success on complex tasks. Novices, by contrast, more often experience invisible failures: conversations that appear to end successfully but in fact miss the mark. Taken together, these results reframe what success with AI depends on. Individuals should adopt a stance of active engagement rather than passive acceptance. AI product builders should recognize that they are designing not just model behavior but user behavior; encouraging deep engagement, rather than friction‑free experiences, will lead to more success overall. Our code and data are available at https://github.com/bigspinai/bigspin‑fluency‑outcomes
Authors:Shangqing Tu, Yanjia Li, Keyu Chen, Sichen Zhang, Jifan Yu, Daniel Zhang-Li, Lei Hou, Juanzi Li, Yu Zhang, Huiqin Liu
Abstract:
Creating interactive STEM courseware traditionally requires HTML/CSS/JavaScript expertise, leaving barriers for educators. While generative AI can produce HTML codes, existing tools generate static presentations rather than interactive simulations, struggle with long documents, and lack pedagogical accuracy mechanisms. Furthermore, full regeneration for modifications requires 200‑‑600 seconds, disrupting creative flow. We present MAIC‑UI, a zero‑code authoring system that enables educators to create and rapidly edit interactive courseware from textbooks, PPTs, and PDFs. MAIC‑UI employs: (1) structured knowledge analysis with multi‑modal understanding to ensure pedagogical rigor; (2) a two‑stage generate‑verify‑optimize pipeline separating content alignment from visual refinement; and (3) Click‑to‑Locate editing with Unified Diff‑based incremental generation achieving sub‑10‑second iteration cycles. A controlled lab study with 40 participants shows MAIC‑UI reduces editing iterations (4.9 vs. 7.0) and significantly improves learnability and controllability compared to direct Text‑to‑HTML generation. A three‑month classroom deployment with 53 high school students demonstrates that MAIC‑UI fosters learning agency and reduces outcome disparities ‑‑ the pilot class achieved 9.21‑point gains in STEM subjects compared to ‑2.32 points in control classes. Our code is available at https://github.com/THU‑MAIC/MAIC‑UI.
Authors:Oliver Kraus, Yash Sarrof, Yuekun Yao, Alexander Koller, Michael Hahn
Abstract:
Chain‑of‑Thought (CoT) has been shown to empirically improve Transformers' performance, and theoretically increase their expressivity to Turing completeness. However, whether Transformers can learn to generalize to CoT traces longer than those seen during training is understudied. We use recent theoretical frameworks for Transformer length generalization and find that ‑‑ under standard positional encodings and a finite alphabet ‑‑ Transformers with CoT cannot solve problems beyond TC^0, i.e. the expressivity benefits do not hold under the stricter requirement of length‑generalizable learnability. However, if we allow the vocabulary to grow with problem size, we attain a length‑generalizable simulation of Turing machines where the CoT trace length is linear in the simulated runtime up to a constant. Our construction overcomes two core obstacles to reliable length generalization: repeated copying and last‑occurrence retrieval. We assign each tape position a unique signpost token, and log only value changes to enable recovery of the current tape symbol through counts circumventing both barriers. Further, we empirically show that the use of such signpost tokens and value change encodings provide actionable guidance to improve length generalization on hard problems.
Authors:Alex Bogdan, Adrian de Valois-Franklin
Abstract:
We report a striking statistical regularity in frontier LLM outputs that enables a CPU‑only scoring primitive running at 2.6 microseconds per token, with estimated latency up to 100,000× (five orders of magnitude) below existing sampling‑based detectors. Across six contemporary models from five independent vendors, two generation sizes, and five held‑out domains, token rank‑frequency distributions converge to the same two‑parameter Mandelbrot ranking distribution, with 34 of 36 model‑by‑domain fits exceeding R^2 = 0.94 and 35 of 36 favoring Mandelbrot over Zipf by AIC. The shared family does not collapse the models into statistical duplicates. Fitted Mandelbrot parameters remain cleanly separable between models: the cross‑model spread in q (1.63 to 3.69) exceeds its per‑model bootstrap standard deviation (0.03 to 0.10) by more than an order of magnitude, yielding tens of standard deviations of separation per few thousand output tokens. Two capabilities follow. First, statistical model fingerprinting: text from a vendor‑delivered LLM can be tested against its claimed model family without cryptographic watermarks or access to model internals, supporting provenance verification and silent‑substitution audits. Second, a model‑agnostic reference distribution for black‑box output assessment, from which we derive a single‑pass scoring primitive that composes with model log probabilities when available and degrades to a rank‑only mode usable on closed APIs. Pilot results on FRANK, TruthfulQA, and HaluEval map where the primitive helps (lexical anomalies, unsupported entities) and where it structurally cannot (reasoning errors in domain‑appropriate vocabulary). We position the primitive as a first‑pass triage layer in compound evaluation stacks, not as a replacement for sampling‑based or source‑conditioned verifiers.
Authors:Venkata Pushpak Teja Menta
Abstract:
Standard text‑to‑speech (TTS) evaluation measures intelligibility (WER, CER) and overall naturalness (MOS, UTMOS) but does not quantify accent. A synthesiser may score well on all four yet sound non‑native on features that are phonemic in the target language. For Indic languages, these features include retroflex articulation, aspiration, vowel length, and the Tamil retroflex approximant (letter zha). We present PSP, the Phoneme Substitution Profile, an interpretable, per‑phonological‑dimension accent benchmark for Indic TTS. PSP decomposes accent into six complementary dimensions: retroflex collapse rate (RR), aspiration fidelity (AF), vowel‑length fidelity (LF), Tamil‑zha fidelity (ZF), Frechet Audio Distance (FAD), and prosodic signature divergence (PSD). The first four are measured via forced alignment plus native‑speaker‑centroid acoustic probes over Wav2Vec2‑XLS‑R layer‑9 embeddings; the latter two are corpus‑level distributional distances. In this v1 we benchmark four commercial and open‑source systems (ElevenLabs v3, Cartesia Sonic‑3, Sarvam Bulbul, Indic Parler‑TTS) on Hindi, Telugu, and Tamil pilot sets, with a fifth system (Praxy Voice) included on all three languages, plus an R5‑>R6 case study on Telugu. Three findings: (i) retroflex collapse grows monotonically with phonological difficulty Hindi < Telugu < Tamil (~1%, ~40%, ~68%); (ii) PSP ordering diverges from WER ordering ‑‑ commercial WER‑leaders do not uniformly lead on retroflex or prosodic fidelity; (iii) no single system is Pareto‑optimal across all six dimensions. We release native reference centroids (500 clips per language), 1000‑clip embeddings for FAD, 500‑clip prosodic feature matrices for PSD, 300‑utterance golden sets per language, scoring code under MIT, and centroids under CC‑BY. Formal MOS‑correlation is deferred to v2; v1 reports five internal‑consistency signals plus a native‑audio sanity check.
Authors:Yixiao Zhou, Dongzhou Cheng, zhiliang wu, Yi Yang, Yu Cheng, Hehe Fan
Abstract:
Large Language Models (LLMs) often fail to utilize their latent reasoning capabilities due to a distributional mismatch between ambiguous human inquiries and the structured logic required for machine activation. Existing alignment methods either incur prohibitive O(N) costs by fine‑tuning each model individually or rely on static prompts that fail to resolve query‑level structural complexity. In this paper, we propose ReQueR (Reinforcement Query Refinement), a modular framework that treats reasoning elicitation as an inference‑time alignment task. We train a specialized Refiner policy via Reinforcement Learning to rewrite raw queries into explicit logical decompositions, treating frozen LLMs as the environment. Rooted in the classical Zone of Proximal Development from educational psychology, we introduce the Adaptive Solver Hierarchy, a curriculum mechanism that stabilizes training by dynamically aligning environmental difficulty with the Refiner's evolving competence. ReQueR yields consistent absolute gains of 1.7%‑‑7.2% across diverse architectures and benchmarks, outperforming strong baselines by 2.1% on average. Crucially, it provides a promising paradigm for one‑to‑many inference‑time reasoning elicitation, enabling a single Refiner trained on a small set of models to effectively unlock reasoning in diverse unseen models. Code is available at https://github.com/newera‑xiao/ReQueR.
Authors:Venkata Pushpak Teja Menta
Abstract:
Commercial TTS systems produce near‑native Indic audio, but the best open‑source bases (Chatterbox, Indic Parler‑TTS, IndicF5) trail them on measured phonological dimensions, and the most widely adopted multilingual base (Chatterbox, 23 languages) does not even tokenise Telugu or Tamil. We ask: what is the minimum intervention that brings such a non‑Indic‑native base to commercial‑class output on Telugu, Tamil, and Hindi, without training a new acoustic decoder and without any commercial TTS training data? We combine three pieces: (1) BUPS, a Brahmic Unified Phoneme Space that deterministically romanises seven Indic scripts to ISO‑15919 so Chatterbox's Latin tokeniser can process them; (2) a LoRA adapter on only the text‑token predictor (Chatterbox's t3), trained on ~1,220h of licensed Indic audio with a Hindi‑proxy language_id; (3) a voice‑prompt recovery recipe ‑‑ an 8‑11s same‑language reference clip plus three sampling overrides (exaggeration 0.7, temperature 0.6, min_p 0.1; "Config B") ‑‑ that recovers commercial‑class acoustic output with no acoustic‑decoder training. On Hindi, the LoRA regresses accuracy and we instead use vanilla Chatterbox + Config B, giving a two‑branch deployment. Evaluated on 10‑utterance pilot sets with the companion PSP benchmark, Praxy Voice matches or slightly leads commercial baselines: 26.7% retroflex collapse on Telugu (vs Sarvam Bulbul 33.3%), 71% Tamil‑zha collapse (vs commercial trio's 86%), 0.025 LLM‑WER on Hindi (tied with Cartesia Sonic‑3). For intra‑sentential code‑mix we add a third branch (IndicF5 + native‑script transliteration) that drops code‑mix LLM‑WER from 0.80‑0.85 to 0.14‑0.27 across Hi/Te/Ta. We release R6 LoRA weights (Apache‑2.0), inference code and router (MIT), and a Gradio demo.
Authors:Li Ju, Junzhe Wang, Qi Zhang
Abstract:
Retrieval‑Augmented Generation (RAG) models frequently produce answers grounded in parametric memory rather than the retrieved context, undermining the core promise of retrieval augmentation. A fundamental obstacle to fixing this unfaithfulness is the lack of training data that explicitly requires models to prefer context over internal knowledge. We introduce Faithfulness‑QA, a large‑scale dataset of 99,094 samples constructed through counterfactual entity substitution. Starting from two established extractive QA benchmarks‑‑SQuAD and TriviaQA‑‑we automatically identify answer‑bearing named entities in each context, replace them with type‑consistent alternatives drawn from a curated bank of 76,953 entities, and thereby manufacture controlled knowledge conflicts between context and parametric memory. Rigorous quality filtering ensures 100% pass rates across four automated checks on random 200‑sample audits. We release the full dataset, the construction pipeline, and a typed entity bank covering eight named entity categories. Faithfulness‑QA is designed as a training resource for attention‑based faithfulness objectives and as an evaluation benchmark for measuring context‑grounding behavior in RAG systems. Data and code are available at https://github.com/qzhangFDU/faithfulness‑qa‑dataset.
Authors:Divake Kumar, Sina Tayebati, Devashri Naik, Ranganath Krishnan, Amit Ranjan Trivedi
Abstract:
Vision‑language models (VLMs) are increasingly used as automated judges for multimodal systems, yet their scores provide no indication of reliability. We study this problem through conformal prediction, a distribution‑free framework that converts a judge's point score into a calibrated prediction interval using only score‑token log‑probabilities, with no retraining. We present the first systematic analysis of conformal prediction for VLM‑as‑a‑Judge across 3 judges and 14 visual task categories. Our results show that evaluation uncertainty is strongly task‑dependent: intervals cover ~40% of the score range for aesthetics and natural images but expand to ~70% for chart and mathematical reasoning, yielding a quantitative reliability map for multimodal evaluation. We further identify a failure mode not captured by standard evaluation metrics, ranking‑scoring decoupling, where judges achieve high ranking correlation while producing wide, uninformative intervals, correctly ordering responses but failing to assign reliable absolute scores. Finally, we show that interval width is driven primarily by task difficulty and annotation quality, i.e., the same judge and method yield 4.5x narrower intervals on a clean, multi‑annotator captioning benchmark. Code: https://github.com/divake/VLM‑Judge‑Uncertainty
Authors:Dan Shi, Zhuowen Han, Simon Ostermann, Renren Jin, Josef van Genabith, Deyi Xiong
Abstract:
Reinforcement learning (RL)‑based post‑training often improves the reasoning performance of large language models (LLMs) beyond the training domain, while supervised fine‑tuning (SFT) frequently leads to general capabilities forgetting. However, the mechanisms underlying this contrast remain unclear. To bridge this gap, we present a feature‑level mechanistic analysis methodology to probe RL generalization using a controlled experimental setup, where RL‑ and SFT‑tuned models are trained from the same base model on identical data. Leveraging our interpretability framework, we align internal activations across models within a shared feature space and analyze how features evolve during post‑training. We find that SFT rapidly introduces many highly specialized features that stabilize early in training, whereas RL induces more restrained and continually evolving feature changes that largely preserve base models' representations. Focusing on samples where RL succeeds but the base model fails, we identify a compact, task‑agnostic set of features that directly mediate generalization across diverse tasks. Feature‑level interventions confirm their causal role: disabling these features significantly degrades RL models' generalization performance, while amplifying them improves base models' performance. The code is available at https://github.com/danshi777/RL‑generalization.
Authors:Jun Li, Mingxuan Liu, Jiazhen Pan, Che Liu, Wenjia Bai, Cosmin I. Bercea, Julia A. Schnabel
Abstract:
Clinical abnormality grounding for rare diseases is often hindered by data scarcity, making supervised fine‑tuning impractical and single‑pass inference highly unstable. We propose Dynamic Decision Learning (DDL), a framework that enables frozen large vision‑language models (LVLMs) to refine their decisions across both language and visual spaces by optimizing instructions and consolidating predictions under visual perturbations. This process improves localization quality and produces a consensus‑based reliability score that quantifies model confidence. Results on brain imaging benchmarks, including a rare‑disease dataset with 281 pathology types across models ranging from 3B to 72B parameters, show that DDL improves mAP@75 by up to 105% on rare‑disease cases and outperforms adaptation baselines and supervised fine‑tuning. Furthermore, DDL demonstrates stronger calibration between reliability scores and localization accuracy under severe distribution shifts and increasing task difficulty. Code is available at: https://lijunrio.github.io/DDL/
Authors:Ishan Patel, Ishan Joshi
Abstract:
We present PolyKV, a system in which multiple concurrent inference agents share a single, asymmetrically compressed KV cache pool. Rather than allocating a separate KV cache per agent ‑‑ the standard paradigm ‑‑ PolyKV writes a compressed cache once and injects it into N independent agent contexts via HuggingFace DynamicCache objects. Compression is asymmetric: Keys are quantized at int8 (q8_0) to preserve softmax stability, while Values are compressed using TurboQuant MSE ‑‑ a Fast Walsh‑Hadamard Transform (FWHT) rotation followed by 3‑bit Lloyd‑Max quantization with centroids tuned to N(0,1). We evaluate across two model scales (SmolLM2‑1.7B‑Instruct and Llama‑3‑8B‑Instruct), three context lengths (600‑7,194 tokens), and up to 15 concurrent agents. PolyKV achieves a stable 2.91x compression ratio across all configurations. On Llama‑3‑8B with 15 agents sharing a 4K‑token context, PolyKV reduces KV cache memory from 19.8 GB to 0.45 GB ‑‑ a 97.7% reduction ‑‑ while maintaining only +0.57% perplexity degradation and a mean BERTScore F1 of 0.928. PPL delta does not grow with agent count and improves as context length increases, inverting to ‑0.26% at 1,851 coherent tokens. To our knowledge, no prior work combines a single shared, lossy‑compressed KV pool with multi‑reader concurrent agent access.
Authors:Yunsu Kim, Kaden Uhlig, Joern Wuebker
Abstract:
Agent benchmarks remain largely English‑centric, while their multilingual versions are often built with machine translation (MT) and limited post‑editing. We argue that, for agentic tasks, this minimal workflow can easily break benchmark validity through query‑answer misalignment or culturally off‑target context. We propose a refined workflow for adapting English benchmarks into multiple languages with explicit functional alignment, cultural alignment, and difficulty calibration using both automated checks and human review. Using this workflow, we introduce GAIA‑v2‑LILT, a re‑audited multilingual extension of GAIA covering five non‑English languages. In experiments, our workflow improves agent success rates by up to 32.7% over minimally translated versions, bringing the closest audited setting to within 3.1% of English performance while substantial gaps remain in many other cases. This indicates that a substantial share of the multilingual performance gap is benchmark‑induced measurement error, motivating task‑level alignment when adapting English benchmarks across languages. The data is available as part of the MAPS package at https://huggingface.co/datasets/Fujitsu‑FRE/MAPS/viewer/GAIA‑v2‑LILT. We also release the code used in our experiments at https://github.com/lilt/gaia‑v2‑lilt.
Authors:Yuanhao Zeng, Ao Lu, Lufei Li, Zheng Zhang, Yexin Li, Kan Ren
Abstract:
Generating diverse responses is crucial for test‑time scaling of large language models (LLMs), yet standard stochastic sampling mostly yields surface‑level lexical variation, limiting semantic exploration. In this paper, we propose Exploratory Sampling (ESamp), a decoding approach that explicitly encourages semantic diversity during generation. ESamp is motivated by the well‑known observation that neural networks tend to make lower‑error predictions on inputs similar to those encountered before, and incur higher prediction error on novel ones. Building on this property, we train a lightweight Distiller at test time to predict deep‑layer hidden representations of the LLM from its shallow‑layer representations to model the LLM's depth‑wise representation transitions. During decoding, the Distiller continuously adapts to the mappings induced by the current generation context. ESamp uses the prediction error as a novelty signal to reweight candidate token extensions conditioned on the current prefix, thereby biasing decoding toward less‑explored semantic patterns. ESamp is implemented with an asynchronous training‑‑inference pipeline, with less than 5% worst case overhead (1.2% in the optimized release). Empirical results show that ESamp significantly boosts the Pass@k efficiency of reasoning models, showing superior or comparable performance to strong stochastic and heuristic baselines. Notably, ESamp achieves robust generalization across mathematics, science, and code generation benchmarks and breaks the trade‑off between diversity and coherence in creative writing. Our code has released at: https://github.com/LinesHogan/tLLM.
Authors:Peng Liao, Peijia Zheng, Lingbo Li, Shangsong Liang, Lin Chen
Abstract:
Offline preference optimization methods, such as Direct Preference Optimization (DPO), offer significant advantages in aligning Large Language Models (LLMs) with human values. However, achieving optimal performance with these methods typically involves additional hyperparameter tuning, resulting in substantial time overhead. Although prior work has proposed a range of improvements, these methods remain limited in effectiveness and have not fully eliminated reliance on hyperparameter tuning. In this work, we propose RMiPO, a lightweight and efficient framework for offline preference optimization. RMiPO leverages intrinsic Response‑level Mutual information for Preference Optimization with hyperparameter modulation, dynamically decoupling preference contributions at negligible additional computational cost. Extensive experimental results demonstrate that RMiPO achieves consistently superior performance over existing methods while reducing training overhead by more than 15%. Our code is available at https://github.com/liavonpenn/rmipo.
Authors:Jisoo Yang, Jongwon Ryu, Minuk Ma, Trung X. Pham, Junyeong Kim
Abstract:
Large Language Models (LLMs) reveal inherent and distinctive personas through dialogue. However, most existing persona discovery approaches rely on surface‑level lexical or stylistic cues, treating dialogue as a flat sequence of tokens and failing to capture the deeper discourse‑level structures that sustain persona consistency. To address this limitation, we propose a novel analytical framework that interprets LLM dialogue through bridging inference ‑‑ implicit conceptual relations that connect utterances via shared world knowledge and discourse coherence. By modeling these relations as structured knowledge graphs, our approach captures latent semantic links that govern how LLMs organize meaning across turns, enabling persona discovery at the level of discourse coherence rather than surface realizations. Experimental results across multiple reasoning backbones and target LLMs, ranging from small‑scale models to 80B‑parameter systems, demonstrate that bridging‑inference graphs yield significantly stronger semantic coherence and more stable persona identification than frequency or style‑based baselines. These results show that persona traits are consistently encoded in the structural organization of discourse rather than isolated lexical patterns. This work presents a systematic framework for probing, extracting, and visualizing latent LLM personas through the lens of Cognitive Discourse Theory, bridging computational linguistics, cognitive semantics, and persona reasoning in large language models. Codes are available at https://github.com/JiSoo‑Yang/Persona_Bridging.git
Authors:Sajad Ebrahimi, Soroush Sadeghian, Ali Ghorbanpour, Negar Arabzadeh, Sara Salamat, Seyed Mohammad Hosseini, Hai Son Le, Mahdi Bashari, Ebrahim Bagheri
Abstract:
The increasing scale and variability of peer review in scholarly venues has created an urgent need for systematic, interpretable, and extensible tools to assess review quality. We present PeeriScope, a modular platform that integrates structured features, rubric‑guided large language model assessments, and supervised prediction to evaluate peer review quality along multiple dimensions. Designed for openness and integration, PeeriScope provides both a public interface and a documented API, supporting practical deployment and research extensibility. The demonstration illustrates its use for reviewer self‑assessment, editorial triage, and large‑scale auditing, and it enables the continued development of quality evaluation methods within scientific peer review. PeeriScope is available both as a live demo at https://app.reviewer.ly/app/peeriscope and via API services at https://github.com/Reviewerly‑Inc/Peeriscope.
Authors:Jon-Paul Cacioli
Abstract:
Small instruct‑tuned LLMs produce degenerate verbal confidence under minimal elicitation: ceiling rates above 95%, near‑chance Type‑2 AUROC, and Invalid validity profiles. We test whether confidence‑conditioned supervised fine‑tuning (CSFT) with self‑consistency‑derived targets can close the gap between internal information and verbal readout. A pre‑registered Phase 0 protocol on Gemma 3 4B‑it with a modal filter restricting training to items with correct modal answers produced a negative result: AUROC2 dropped from 0.554 to 0.509 due to label‑entropy collapse in the training targets. An exploratory rescue removed the filter, training on all 2,000 calibration items. This produced a binary verbal correctness discriminator with AUROC2 = 0.774 on held‑out TriviaQA, compressing a 10‑sample self‑consistency signal (AUROC2 = 0.999) into a single‑pass readout exceeding logit entropy (0.701). The shuffled‑target control showed no improvement (0.501). On MMLU, accuracy improved from 54.2% to 77.4% with the shuffled model at baseline (56.1%), supporting a target‑dependent interpretation. The result is exploratory, binary rather than continuously calibrated, and observed at a single scale. It identifies two design lessons: confidence training requires label entropy, and correct targets regularise output format.
Authors:Hojoon Kim, Yuheng Wu, Thierry Tambe
Abstract:
Embodied AI agents increasingly rely on large language models (LLMs) for planning, yet per‑step LLM calls impose severe latency and cost. In this paper, we show that embodied tasks exhibit strong plan locality, where the next plan is largely predictable from the current one. Building on this, we introduce AgenticCache, a planning framework that reuses cached plans to avoid per‑step LLM calls. In AgenticCache, each agent queries a runtime cache of frequent plan transitions, while a background Cache Updater asynchronously calls the LLM to validate and refine cached entries. Across four multi‑agent embodied benchmarks, AgenticCache improves task success rate by 22% on average across 12 configurations (4 benchmarks x 3 models), reduces simulation latency by 65%, and lowers token usage by 50%. Cache‑based plan reuse thus offers a practical path to low‑latency, low‑cost embodied agents. Code is available at https://github.com/hojoonleokim/MLSys26_AgenticCache.
Authors:Han Wang, Xiaodong Yu, Jialian Wu, Jiang Liu, Ximeng Sun, Mohit Bansal, Zicheng Liu
Abstract:
Large language models (LLMs) achieve strong reasoning performance by allocating substantial computation at inference time, often generating long and verbose reasoning traces. While recent work on efficient reasoning reduces this overhead through length‑based rewards or pruning, many approaches are post‑trained under a much shorter context window than base‑model training, a factor whose effect has not been systematically isolated. We first show that short‑context post‑training alone, using standard GRPO without any length‑aware objective, already induces substantial reasoning compression‑but at the cost of increasingly unstable training dynamics and accuracy degradation. To address this, we propose Step‑level Advantage Selection (SAS), which operates at the reasoning‑step level and assigns a zero advantage to low‑confidence steps in correct rollouts and to high‑confidence steps in verifier‑failed rollouts, where failures often arise from truncation or verifier issues rather than incorrect reasoning. Across diverse mathematical and general reasoning benchmarks, SAS improves average Pass@1 accuracy by 0.86 points over the strongest length‑aware baseline while reducing average reasoning length by 16.3%, yielding a better accuracy‑efficiency trade‑off.
Authors:Yao Wang, Zixu Geng, Jun Yan
Abstract:
Knowledge graphs (KGs) are increasingly used to support large lan guage model (LLM) reasoning, but standard triplet‑based KGs treat each relation as globally valid. In many settings, whether a relation should count as evidence depends on the context. We therefore formulate triplet validity as a triplet‑specific function of context and refer to this formulation as a Quantum Knowledge Graph (QKG). We instantiate QKG in medicine using a diabetes‑centered PrimeKG subgraph, whose 68,651 context‑sensitive relations are further annotated with patient‑group‑specific constraints. We evaluate it in a reasoner‑‑validator pipeline for medical question answering on a KG‑grounded subset of MedReason containing 2,788 questions. With Haiku‑4.5 as both the Reasoner and the Validator, KG‑backed validation significantly improves over a no‑validator baseline (+0.61 pp), and QKG with context matching yields the largest gain, outperforming both KG validation without context matching (+0.79 pp) and the no‑validator baseline (+1.40 pp; paired McNemar, all p<0.05). Under a stronger validator (Qwen‑3.6‑Plus), the raw QKG gain over the no‑validator baseline grows from +1.40 pp to +5.96 pp; the context‑matching gap is non‑significant (p=0.73) on the raw set but becomes borderline significant (p=0.05) after adjustment for knowledge leakage and suspicious questions, consistent with a benchmark‑gold ceiling rather than a QKG limitation. Taken together, the results support the view that the value of a KG in LLM‑based clinical reasoning lies not merely in storing medically related facts, but in representing whether those facts are applicable to the specific patient context. For reproducibility and further research, we release the curated QKG datasets and source code.\footnotehttps://github.com/HKAI‑Sci/QKG
Authors:SungHo Kim, Juhyeong Park, Yeachan Kim, SangKeun Lee
Abstract:
The Korean writing system, Hangeul, has a unique character representation rigidly following the invention principles recorded in Hunminjeongeum.\footnoteHunminjeongeum is a book published in 1446 that describes the principles of invention and usage of Hangeul, devised by King Sejong \citeHunminjeongeum_Guide. However, existing pre‑trained language models (PLMs) for Korean have overlooked these principles. In this paper, we introduce a novel framework for Korean PLMs called KOMBO, which firstly brings the invention principles of Hangeul to represent character. Our proposed method, KOMBO, exhibits notable experimental proficiency across diverse NLP tasks. In particular, our method outperforms the state‑of‑the‑art Korean PLM by an average of 2.11% in five Korean natural language understanding tasks. Furthermore, extensive experiments demonstrate that our proposed method is suitable for comprehending the linguistic features of the Korean language. Consequently, we shed light on the superiority of using subcharacters over the typical subword‑based approach for Korean PLMs. Our code is available at: [https://github.com/SungHo3268/KOMBO](https://github.com/SungHo3268/KOMBO).
Authors:Nicola Zanarini, Niccolò Ferrari
Abstract:
We investigate whether the Feed‑Forward Network (FFN) sublayer in a decoder‑only transformer can be replaced by an explicit learned memory graph while preserving the surrounding autoregressive architecture. The proposed Graph Memory Transformer (GMT) keeps causal self‑attention intact, but replaces the usual per‑token FFN transformation with a memory cell that routes token representations over a learned bank of centroids connected by a learned directed transition matrix. In the base GMT v7 instantiation studied here, each of 16 transformer blocks contains 128 centroids, a 128 128 edge matrix, gravitational source routing, token‑conditioned target selection, and a gated displacement readout. The cell therefore returns movement from an estimated source memory state toward a target memory state, rather than a retrieved value. The resulting model is a fully decoder‑only language model with 82.2M trainable parameters and no dense FFN sublayers, compared with a 103.0M‑parameter dense GPT‑style baseline used in the evaluation. The base v7 model trains stably and exposes centroid usage, transition structure, and source‑to‑target movement as directly inspectable quantities of the forward computation. It remains behind the larger dense baseline in validation loss and perplexity (3.5995/36.58 vs. 3.2903/26.85), while showing close zero‑shot benchmark behavior under the evaluated setting. These results are not intended as a state‑of‑the‑art claim; they support the viability and structural interpretability of replacing dense within‑token transformation with graph‑mediated memory navigation. Broader scaling, optimized kernels, and more extensive benchmark evaluation are left for subsequent work.
Authors:Zichun Guo, Yuling Shi, Wenhao Zeng, Chao Hu, Haotian Lin, Terry Yue Zhuo, Jiawei Chen, Xiaodong Gu, Wenping Ma
Abstract:
Multimodal Large Language Models (MLLMs) have achieved remarkable performance in Visually Rich Document Understanding (VRDU) tasks, but their capabilities are mainly evaluated on pristine, well‑structured document images. We consider content restoration from shredded fragments, a challenging VRDU setting that requires integrating visual pattern recognition with semantic reasoning under significant content discontinuities. To facilitate systematic evaluation of complex VRDU tasks, we introduce ShredBench, a benchmark supported by an automated generation pipeline that renders fragmented documents directly from Markdown. The proposed pipeline ensures evaluation validity by allowing the flexible integration of latest or unseen textual sources to prevent training data contamination. ShredBench assesses four scenarios (English, Chinese, Code, Table) with three fragmentation granularities (8, 12, 16 pieces). Empirical evaluations on state‑of‑the‑art MLLMs reveal a significant performance gap: The method is effective on intact documents; however, once the document is shredded, restoration becomes a significant challenge, with NED dropping sharply as fragmentation increases. Our findings highlight that current MLLMs lack the fine‑grained cross‑modal reasoning required to bridge visual discontinuities, identifying a critical gap in robust VRDU research.
Authors:Peize He, Yaodi Luo, Xiaoqian Liu, Xuyang Liu, Jiahang Deng, Yaosong Du, Bangyu Li, Xiyan Gui, Yuxuan Chen, Linfeng Zhang
Abstract:
Recent large audio language models (LALMs) demonstrate remarkable capabilities in processing extended multi‑modal sequences, yet incur high inference costs. Token compression is an effective method that directly reduces redundant tokens in the sequence. Existing compression methods usually assume that all attention heads in LALMs contribute equally to various audio tasks and calculate token importance by averaging scores across all heads. However, our analysis demonstrates that attention heads exhibit distinct behaviors across diverse audio domains. We further reveal that only a sparse subset of attention heads actively responds to audio, with completely different performance when handling semantic and acoustic tasks. In light of this observation, we propose HeadRouter, a head‑importance‑aware token pruning method that perceives the varying importance of attention heads in different audio tasks to maximize the retention of crucial tokens. HeadRouter is training‑free and can be applied to various LALMs. Extensive experiments on the AudioMarathon and MMAU‑Pro benchmarks demonstrate that HeadRouter achieves state‑of‑the‑art compression performance, exceeding the baseline model even when retaining 70% of the audio tokens and achieving 101.8% and 103.0% of the vanilla average on Qwen2.5‑Omni‑3B and Qwen2.5‑Omni‑7B, respectively.
Authors:Wentao Zhang, Qi Zhang, Mingkun Xu, Mu You, Henghua Shen, Zhongzhi He, Keyan Jin, Derek F. Wong, Tao Fang
Abstract:
Crop disease diagnosis from field photographs faces two recurring problems: models that score well on benchmarks frequently hallucinate species names, and when predictions are correct, the reasoning behind them is typically inaccessible to the practitioner. This paper describes Agri‑CPJ (Caption‑Prompt‑Judge), a training‑free few‑shot framework in which a large vision‑language model first generates a structured morphological caption, iteratively refined through multi‑dimensional quality gating, before any diagnostic question is answered. Two candidate responses are then generated from complementary viewpoints, and an LLM judge selects the stronger one based on domain‑specific criteria. Caption refinement is the component with the largest individual impact: ablations confirm that skipping it consistently degrades downstream accuracy across both models tested. On CDDMBench, pairing GPT‑5‑Nano with GPT‑5‑mini‑generated captions yields +22.7 pp in disease classification and +19.5 points in QA score over no‑caption baselines. Evaluated without modification on AgMMU‑MCQs, GPT‑5‑Nano reached 77.84% and Qwen‑VL‑Chat reached 64.54%, placing them at or above most open‑source models of comparable scale despite the format shift from open‑ended to multiple‑choice. The structured caption and judge rationale together constitute a readable audit trail: a practitioner who disagrees with a diagnosis can identify the specific caption observation that was incorrect. Code and data are publicly available https://github.com/CPJ‑Agricultural/CPJ‑Agricultural‑Diagnosis
Authors:Tao Feng, Haozhen Zhang, Zijie Lei, Peixuan Han, Jiaxuan You
Abstract:
LLM routing has achieved promising results in integrating the strengths of diverse models while balancing efficiency and performance. However, to support more realistic and challenging applications, routing must extend into agentic LLM settings, where task planning, multi‑round cooperation among heterogeneous agents, and memory utilization are indispensable. To address this gap, we propose GraphPlanner, a heterogeneous graph memory‑augmented agentic router for multi‑agent LLMs that generates routing workflows for each query and supports both inductive and transductive inference. GraphPlanner formulates workflow generation as a Markov Decision Process (MDP), where at each step it selects both the LLM backbone and the agent role, including Planner, Executor, and Summarizer. By leveraging a heterogeneous graph, denoted as GARNet, to capture interaction memories among queries, agents, and responses, GraphPlanner integrates historical memory and workflow memory into richer state representations. The entire pipeline is optimized with reinforcement learning, jointly improving task‑specific performance and computational efficiency. We evaluate GraphPlanner across 14 diverse LLM tasks and demonstrate that: (1) GraphPlanner outperforms strong single‑round and multi‑round routers, improving accuracy by up to 9.3% while reducing GPU cost from 186.26 GiB to 1.04 GiB; (2) GraphPlanner generalizes robustly to unseen tasks and LLMs, exhibiting strong zero‑shot capabilities; and (3) GraphPlanner effectively leverages historical memories, supporting both inductive and transductive inference for more adaptive routing. Our code for GraphPlanner is released at https://github.com/ulab‑uiuc/GraphPlanner.
Authors:Yuanming Shi, Andreas Haupt
Abstract:
Silicon samples are increasingly used as a low‑cost substitute for human panels and have been shown to reproduce aggregate human opinion with high fidelity. We show that, in the alignment‑relevant domain of philosophy, silicon samples systematically collapse heterogeneity. Using data from N = 277 professional philosophers drawn from PhilPeople profiles, we evaluate seven proprietary and open‑source large language models on their ability to replicate individual philosophical positions and to preserve cross‑question correlation structures across philosophical domains. We find that language models substantially over‑correlate philosophical judgments, producing artificial consensus across domains. This collapse is associated in part with specialist effects, whereby models implicitly assume that domain specialists hold highly similar philosophical views. We assess the robustness of these findings by studying the impact of DPO fine‑tuning and by validating results against the full PhilPapers 2020 Survey (N = 1785). We conclude by discussing implications for alignment, evaluation, and the use of silicon samples as substitutes for human judgment. The code of this project can be found at https://github.com/stanford‑del/silicon‑philosophers.
Authors:Imranul Ashrafi, Inigo Jauregi Unanue, Massimo Piccardi
Abstract:
Test‑time alignment methods offer a promising alternative to fine‑tuning by steering the outputs of large language models (LLMs) at inference time with lightweight interventions on their internal representations. Recently, a prominent and effective approach, RE‑Control (Kong et al., 2024), has proposed leveraging an external value function trained over the LLM's hidden states to guide generation via gradient‑based editing. While effective, this method overlooks a key characteristic of alignment tasks, i.e. that they are typically formulated as learning from human preferences between candidate responses. To address this, in this paper we propose a novel preference‑based training framework, Pref‑CTRL, that uses a multi‑objective value function to better reflect the structure of preference data. Our approach has outperformed RE‑Control on two benchmark datasets and showed greater generalization on out‑of‑domain datasets. Our source code is available at https://github.com/UTS‑nlPUG/pref‑ctrl.
Authors:Yiqun Zhang, Hao Li, Zihan Wang, Shi Feng, Xiaocui Yang, Daling Wang, Bo Zhang, Lei Bai, Shuyue Hu
Abstract:
Multi‑turn, long‑horizon tasks are increasingly common for large language models (LLMs), but solving them typically requires many sequential model invocations, accumulating substantial inference costs. Here, we study cost‑aware multi‑turn LLM routing: selecting which model to invoke at each turn from a model pool, given a fixed cost budget. We propose MTRouter, which encodes the interaction history and candidate models into joint history‑model embeddings, and learns an outcome estimator from logged trajectories to predict turn‑level model utility. Experiments show that MTRouter improves the performance‑cost trade‑off: on ScienceWorld, it surpasses GPT‑5 while reducing total cost by 58.7%; on Humanity's Last Exam (HLE), it achieves competitive accuracy while reducing total cost by 43.4% relative to GPT‑5, and these gains even carry over to held‑out tasks. Further analyses reveal several mechanisms underlying its effectiveness: relative to prior multi‑turn routers, MTRouter makes fewer model switches, is more tolerant to transient errors, and exhibits emergent specialization across models. Code: https://github.com/ZhangYiqun018/MTRouter
Authors:Rohith Reddy Bellibatlu
Abstract:
Large language models are increasingly deployed as automated judges for evaluating other models, yet the stability of their verdicts under semantically equivalent prompt paraphrases remains unmeasured. We introduce JudgeSense, a framework and benchmark for quantifying this property via the Judge Sensitivity Score (JSS), defined as the fraction of paraphrase pairs on which a judge returns an identical decision. Evaluating nine judge models on 494 validated paraphrase pairs, we find that coherence is the only task where judges meaningfully differ, with JSS ranging from 0.389 to 0.992. On factuality, all judges cluster near JSS about 0.63, driven by a polarity‑inverted prompt artifact; after correction, factuality JSS rises to about 0.9. Pairwise tasks (preference and relevance) exhibit degenerate always‑A behavior in 8 of 9 judges, indicating strong position bias. Model scale does not predict consistency. We release code, decision logs, and a validated paraphrase dataset to support standardized JSS reporting.
Authors:Lucky Verma
Abstract:
Dynamic Tanh (DyT) removes LayerNorm by bounding activations with a learned tanh(alpha x). We show that this bounding is a regime‑dependent implicit regularizer, not a uniformly beneficial replacement. Across GPT‑2‑family models spanning 64M to 3.78B parameters and 1M to 118M tokens, with Llama and ViT cross‑checks, DyT improves validation loss by 27.3% at 64M/1M but worsens it by 18.8% at 64M/118M; the 1M benefit vanishes with capacity (+1.7% at 3.78B), while the 118M penalty reaches +27.9%. The mechanism is measurable: 49% of DyT activations saturate at 1M versus 23% at 118M, and a 500‑step saturation heuristic classifies DyT's sign with 75% raw in‑sample accuracy on the 12‑cell GPT‑2 calibration set (AUC 0.75; 64% when adding Scale 5 stress cells), correctly labels 3/3 Llama checks, but only reaches 50% raw leave‑one‑scale‑out accuracy. Three interventions support the bounding explanation: HardTanh reproduces the regime pattern, increasing alpha at 118M monotonically reduces DyT's penalty, and vanilla+dropout(p=0.5) matches DyT's data‑rich loss. We also localize Llama‑DyT collapse to SwiGLU gating, where saturation separates collapse from convergence in a 3‑seed component ablation (r=0.94). Scope: all experiments are compute‑limited (T/P < 1.84), below Chinchilla‑optimal training.
Authors:Bingfeng Chen, Chenjie Qiu, Yifeng Xie, Boyan Xu, Ruichu Cai, Zhifeng Hao
Abstract:
Aspect Sentiment Quad Prediction (ASQP) has seen significant advancements, largely driven by the powerful semantic understanding and generative capabilities of large language models (LLMs). However, while syntactic structure information has been proven effective in previous extractive paradigms, it remains underutilized in the generative paradigm of LLMs due to their limited reasoning capabilities. In this paper, we propose S^2IT, a novel Stepwise Syntax Integration Tuning framework that progressively integrates syntactic structure knowledge into LLMs through a multi‑step tuning process. The training process is divided into three steps. S^2IT decomposes the quadruple generation task into two stages: 1) Global Syntax‑guided Extraction and 2) Local Syntax‑guided Classification, integrating both global and local syntactic structure information. Finally, Fine‑grained Structural Tuning enhances the model's understanding of syntactic structures through the prediction of element links and node classification. Experiments demonstrate that S^2IT significantly improves state‑of‑the‑art performance across multiple datasets. Our implementation will be open‑sourced at https://github.com/DMIRLAB‑Group/S2IT.
Authors:Bishwamittra Ghosh, Soumi Das, Till Speicher, Qinyuan Wu, Mohammad Aflah Khan, Deepak Garg, Krishna P. Gummadi, Evimaria Terzi
Abstract:
Large language models (LLMs) operate in two fundamental learning modes ‑ fine‑tuning (FT) and in‑context learning (ICL) ‑ raising key questions about which mode yields greater language proficiency and whether they differ in their inductive biases. Prior studies comparing FT and ICL have yielded mixed and inconclusive results due to inconsistent experimental setups. To enable a rigorous comparison, we propose a formal language learning task ‑ offering precise language boundaries, controlled string sampling, and no data contamination ‑ and introduce a discriminative test for language proficiency, where an LLM succeeds if it assigns higher generation probability to in‑language strings than to out‑of‑language strings. Empirically, we find that: (a) FT has greater language proficiency than ICL on in‑distribution generalization, but both perform equally well on out‑of‑distribution generalization. (b) Their inductive biases, measured by the correlation in string generation probabilities, are similar when both modes partially learn the language but diverge at higher proficiency levels. (c) Unlike FT, ICL performance differs substantially across models of varying sizes and families and is sensitive to the token vocabulary of the language. Thus, our work demonstrates the promise of formal languages as a controlled testbed for evaluating LLMs, behaviors that are difficult to isolate in natural language datasets. Our source code is available at https://github.com/bishwamittra/formallm.
Authors:Zhicheng Ma, Xiang Liu, Zhaoxiang Liu, Ning Wang, Yi Shen, Kai Wang, Shuming Shi, Shiguo Lian
Abstract:
Large Language Models (LLMs) based on Mixture‑of‑Experts (MoE) are pivotal in industrial applications for their ability to scale performance efficiently. However, standard MoEs enforce uniform expert sizes,creating a rigidity that fails to align computational costs with varying token‑level complexity. While heterogeneous expert architectures attempt to address this by diversifying expert sizes, they often suffer from significant system‑level challenges, specifically unbalanced GPU utilization and inefficient parameter utilization, which hinder practical deployment. To bridge the gap between theoretical heterogeneity and robust industrial application, we propose Mixture of Heterogeneous Grouped Experts (MoHGE) which introduces a two‑level routing mechanism to enable flexible, resource‑aware expert combinations. To optimize inference efficiency, we propose a Group‑Wise Auxiliary Loss, which dynamically steers tokens to the most parameter‑efficient expert groups based on task difficulty. To address the critical deployment challenge of GPU load balancing, we introduce an All‑size Group‑decoupling Allocation strategy coupled with an Intra‑Group Experts Auxiliary Loss. These mechanisms collectively ensure uniform computation distribution across GPUs. Extensive evaluations demonstrate that MoHGE matches the performance of MoE architectures while reducing the total parameters by approximately 20% and maintaining balanced GPU utilization. Our work establishes a scalable paradigm for resource‑efficient MoE design, offering a practical solution for optimizing inference costs in real‑world scenarios. The code is publicly available at https://github.com/UnicomAI/MoHGE.
Authors:Samer Attrah
Abstract:
We present Code Broker, a multi agent system built with Google Agent Development Kit ADK that analyses Python code from files, local directories, or GitHub repositories and generates actionable quality assessment reports. The system employs a hierarchical five agents architecture in which a root orchestrator coordinates a sequential pipeline agent, which in turn dispatches three specialised agents in parallel a Correctness Assessor, a Style Assessor, and a Description Generator before synthesising findings through an Improvement Recommender. Reports score four dimensions correctness, security, style, and maintainability and are rendered in both Markdown and HTML. Code Broker combines LLM based reasoning with deterministic static‑analysis signals from Pylint, uses asynchronous execution with retry logic to improve robustness, and explores lightweight session memory for retaining and querying prior assessment context. We position the paper as a technical report on system design and prompt or tool orchestration, and present a preliminary qualitative evaluation on representative Python codebases. The results suggest that parallel specialised agents produce readable, developer oriented feedback, while also highlighting current limitations in evaluation depth, security tooling, large repository handling, and the current use of only in memory persistence. All code and reproducibility materials are available at: https://github.com/Samir‑atra/agents_intensive_dev.
Authors:Yash Kumar Atri, Steven L. Johnson, Tom Hartvigsen
Abstract:
Language models are increasingly deployed in interactive settings where users reason about facts over time rather than in isolation. In such scenarios, correct behavior requires models to maintain and update implicit temporal assumptions established earlier in a conversation. We study this challenge through the lens of temporal scope stability: the ability to preserve, override, or transfer time‑scoped factual context across dialogue turns. We introduce ChronoScope, a large‑scale diagnostic benchmark designed to isolate temporal scope behavior in controlled multi‑turn interactions, comprising over one million deterministically generated question chains grounded in Wikidata. ChronoScope evaluates whether models can correctly retain inferred temporal scope when follow‑up questions omit explicit time references, spanning implicit carryover, explicit scope switching, cross‑entity transfer, and longer temporal trajectories. Through extensive evaluation of state‑of‑the‑art language models, we find that temporal scope stability is frequently violated in controlled multi‑turn settings, with models often drifting toward present‑day assumptions despite correct underlying knowledge. These failures intensify with interaction length and persist even under oracle context conditions, revealing a gap between single‑turn factual accuracy and coherent temporal reasoning under sequential interaction. We make our dataset and evaluation suite publicly available at https://github.com/yashkumaratri/ChronoScope
Authors:Jordan Meadows, Lan Zhang, Andre Freitas
Abstract:
Formalising informal mathematical reasoning into formally verifiable code is a significant challenge for large language models. In scientific fields such as physics, domain‑specific machinery (e.g. Dirac notation, vector calculus) imposes additional formalisation challenges that modern LLMs and agentic approaches have yet to tackle. To aid autoformalisation in scientific domains, we present FormalScience; a domain‑agnostic human‑in‑the‑loop agentic pipeline that enables a single domain expert (without deep formal language experience) to produce syntactically correct and semantically aligned formal proofs of informal reasoning for low economic cost. Applying FormalScience to physics, we construct FormalPhysics, a dataset of 200 university‑level (LaTeX) physics problems and solutions (primarily quantum mechanics and electromagnetism), along with their Lean4 formal representations. Compared to existing formal math benchmarks, FormalPhysics achieves perfect formal validity and exhibits greater statement complexity. We evaluate open‑source models and proprietary systems on a statement autoformalisation task on our dataset via zero‑shot prompting, self‑refinement with error feedback, and a novel multi‑stage agentic approach, and explore autoformalisation limitations in modern LLM‑based approaches. We provide the first systematic characterisation of semantic drift in physics autoformalisation in terms of concepts such as notational collapse and abstraction elevation which reveals what formal language verifies when full semantic preservation is unattainable. We release the codebase together with an interactive UI‑based FormalScience system which facilitates autoformalisation and theorem proving in scientific domains beyond physics.https://github.com/jmeadows17/formal‑science
Authors:Yiming Pan, Chengwei Hu, Xuancheng Huang, Can Huang, Mingming Zhao, Yuean Bi, Xiaohan Zhang, Aohan Zeng, Linmei Hu
Abstract:
Large language models (LLMs) have demonstrated strong potential in agentic tasks, particularly in slide generation. However, slide generation poses a fundamental challenge: the generation process is text‑centric, whereas its quality is governed by visual aesthetics. This modality gap leads current models to frequently produce slides with aesthetically suboptimal layouts. Existing solutions typically rely either on heavy visual reflection, which incurs high inference cost yet yields limited gains; or on fine‑tuning with large‑scale datasets, which still provides weak and indirect aesthetic supervision. In contrast, the explicit use of aesthetic principles as supervision remains unexplored. In this work, we present AeSlides, a reinforcement learning framework with verifiable rewards for Aesthetic layout supervision in Slide generation. We introduce a suite of meticulously designed verifiable metrics to quantify slide layout quality, capturing key layout issues in an accurate, efficient, and low‑cost manner. Leveraging these verifiable metrics, we develop a GRPO‑based reinforcement learning method that directly optimizes slide generation models for aesthetically coherent layouts. With only 5K training prompts on GLM‑4.7‑Flash, AeSlides improves aspect ratio compliance from 36% to 85%, while reducing whitespace by 44%, element collisions by 43%, and visual imbalance by 28%. Human evaluation further shows a substantial improvement in overall quality, increasing scores from 3.31 to 3.56 (+7.6%), outperforming both model‑based reward optimization and reflection‑based agentic approaches, and even edging out Claude‑Sonnet‑4.5. These results demonstrate that such a verifiable aesthetic paradigm provides an efficient and scalable approach to aligning slide generation with human aesthetic preferences. Our repository is available at https://github.com/ympan0508/aeslides.
Authors:Sebastian Nowak, Jann-Frederick Laß, Narine Mesropyan, Babak Salam, Nico Piel, Mohammed Bahaaeldin, Wolfgang Block, Alois Martin Sprinkart, Julian Alexander Luetkens, Benjamin Wulff, Alexander Isaak
Abstract:
Purpose: To design, implement, evaluate, and report on the regulatory requirements of a self‑hosted LLM infrastructure for radiology adhering to the principle of least privilege, emphasizing technical feasibility, network isolation, and clinical utility. Materials and Methods: The isolation‑first, containerized LLM inference stack relies on strict network segmentation, host‑enforced egress filtering, and active isolation monitoring preventing unauthorized external connectivity. An accompanying deployment package provides automated isolation and hardening tests. The system served the open‑weights DeepSeek‑R1 model via vLLM. In a one‑week pilot phase, 22 residents and radiologists were free to use 10 predefined prompt‑templates whenever they considered them useful in daily work. Afterward, they rated clinical utility and system stability on an 0‑10 Likert scale and reported observed critical errors in model output. Results: The applied institutional governance pathway achieved approval from clinic management, compliance, data protection and information security officers for processing unanonymized PHI. The system was rated stable and user friendly during the pilot. Source text‑anchored tasks, such as report corrections or simplifications, and radiology guideline recommendations received the highest utility ratings, whereas open‑ended conclusion generation based on findings resulted in the highest frequency of critical errors, such as clinically relevant hallucinations or omissions. Conclusion: The proposed isolation‑first on‑premise architecture enabled overcoming regulatory borders, showed promising clinical utility in text‑anchored tasks and is the current base to serve open‑weights LLMs as an official service of a German University Hospital with over 10,000 employees. The deployment package were made publicly available (https://github.com/ukbonn/ukb‑gpt).
Authors:Manyi Zhang, Ji-Fu Li, Zhongao Sun, Xiaohao Liu, Zhenhua Dong, Xianzhi Yu, Haoli Bai, Xiaobo Xia
Abstract:
Autonomous agent systems such as OpenClaw introduce significant efficiency challenges due to long‑context inputs and multi‑turn reasoning. This results in prohibitively high computational and monetary costs in real‑world development. While quantization is a standard approach for reducing cost and latency, its impact on agent performance in realistic scenarios remains unclear. In this work, we analyze quantization sensitivity across diverse complex workflows over OpenClaw, and show that precision requirements are highly task‑dependent. Based on this observation, we propose QuantClaw, a plug‑and‑play precision routing plugin that dynamically assigns precision according to task characteristics. QuantClaw routes lightweight tasks to lower‑cost configurations while preserving higher precision for demanding workloads, saving cost and accelerating inference without increasing user complexity. Experiments show that our QuantClaw maintains or improves task performance while reducing both latency and computational cost. Across a range of agent tasks, it achieves up to 21.4% cost savings and 15.7% latency reduction on GLM‑5 (FP8 baseline). These results highlight the benefit of treating precision as a dynamic resource in agent systems.
Authors:Paul Röttger, Kobi Hackenburg, Hannah Rose Kirk, Christopher Summerfield
Abstract:
Hundreds of millions of people use artificial intelligence (AI) for writing assistance. Here, we evaluated how AI writing assistance distorts writer personas ‑ their perceived beliefs, personality, and identity. In three large‑scale experiments, writers (N=2,939) wrote political opinion paragraphs with and without AI assistance. Separate groups of readers (N=11,091) blindly evaluated these paragraphs across 29 socially salient dimensions of reader perception, spanning political opinion, writing quality, writer personality, emotions, and demographics. AI writing assistance produced persona distortions across all dimensions: with AI, writers seemed more opinionated, competent, and positive, and their perceived demographic profile shifted towards more privileged groups. Writers objected to many of the observed distortions, yet continued to prefer AI‑assisted text even when made aware of them. We successfully mitigated objectionable persona distortions at the model level by training reward models on our experimental data (10,008 paragraphs, 2,903,596 ratings) to steer AI outputs towards faithful representation of writer stance. However, this came at a cost to user acceptance, suggesting an entanglement between desirable and undesirable properties of AI writing assistance that may be difficult to resolve. Together, our findings demonstrate that persona distortions from AI writing assistance are pervasive and persistent even under realistic conditions of human oversight, which carries implications for public discourse, trust, and democratic deliberation that scale with AI adoption.
Authors:Zhanli Li, Yixuan Cao, Lvzhou Luo, Ping Luo
Abstract:
This paper introduces the task of analytical question answering over large, semi‑structured document collections. We present MuDABench, a benchmark for multi‑document analytical QA, where questions require extracting and synthesizing information across numerous documents to perform quantitative analysis. Unlike existing multi‑document QA benchmarks that typically require information from only a few documents with limited cross‑document reasoning, MuDABench demands extensive inter‑document analysis and aggregation. Constructed via distant supervision by leveraging document‑level metadata and annotated financial databases, MuDABench comprises over 80,000 pages and 332 analytical QA instances. We also propose an evaluation protocol that measures final answer accuracy and uses intermediate‑fact coverage as an auxiliary diagnostic signal for the reasoning process. Experiments reveal that standard RAG systems, which treat all documents as a flat retrieval pool, perform poorly. To address these limitations, we propose a multi‑agent workflow that orchestrates planning, extraction, and code generation modules. While this approach substantially improves both process and outcome metrics, a significant gap remains compared to human expert performance. Our analysis identifies two primary bottlenecks: single‑document information extraction accuracy and insufficient domain‑specific knowledge in current systems. MuDABench is available at https://github.com/Zhanli‑Li/MuDABench.
Authors:Xi Wang, Jie Wang, Xingchen Song, Baijun Song, Jingran Xie, Jiahe Shao, Zijian Lin, Di Wu, Meng Meng, Jian Luan, Zhiyong Wu
Abstract:
While generative text‑to‑speech (TTS) models approach human‑level quality, monolithic metrics fail to diagnose fine‑grained acoustic artifacts or explain perceptual collapse. To address this, we propose TTS‑PRISM, a multi‑dimensional diagnostic framework for Mandarin. First, we establish a 12‑dimensional schema spanning stability to advanced expressiveness. Second, we design a targeted synthesis pipeline with adversarial perturbations and expert anchors to build a high‑quality diagnostic dataset. Third, schema‑driven instruction tuning embeds explicit scoring criteria and reasoning into an efficient end‑to‑end model. Experiments on a 1,600‑sample Gold Test Set show TTS‑PRISM outperforms generalist models in human alignment. Profiling six TTS paradigms establishes intuitive diagnostic flags that reveal fine‑grained capability differences. TTS‑PRISM is open‑source, with code and checkpoints at https://github.com/xiaomi‑research/tts‑prism.
Authors:Chunyu Qiang, Xiaopeng Wang, Kang Yin, Yuzhe Liang, Yuxin Guo, Teng Ma, Ziyu Zhang, Tianrui Wang, Cheng Gong, Yushen Chen, Ruibo Fu, Chen Zhang, Longbiao Wang, Jianwu Dang
Abstract:
Generative audio modeling has largely been fragmented into specialized tasks, text‑to‑speech (TTS), text‑to‑music (TTM), and text‑to‑audio (TTA), each operating under heterogeneous control paradigms. Unifying these modalities remains a fundamental challenge due to the intrinsic dissonance between structured semantic representations (speech/music) and unstructured acoustic textures (sound effects). In this paper, we introduce UniSonate, a unified flow‑matching framework capable of synthesizing speech, music, and sound effects through a standardized, reference‑free natural language instruction interface. To reconcile structural disparities, we propose a novel dynamic token injection mechanism that projects unstructured environmental sounds into a structured temporal latent space, enabling precise duration control within a phoneme‑driven Multimodal Diffusion Transformer (MM‑DiT). Coupled with a multi‑stage curriculum learning strategy, this approach effectively mitigates cross‑modal optimization conflicts. Extensive experiments demonstrate that UniSonate achieves state‑of‑the‑art performance in instruction‑based TTS (WER 1.47%) and TTM (SongEval Coherence 3.18), while maintaining competitive fidelity in TTA. Crucially, we observe positive transfer, where joint training on diverse audio data significantly enhances structural coherence and prosodic expressiveness compared to single‑task baselines. Audio samples are available at https://qiangchunyu.github.io/UniSonate/.
Authors:Shuowei Li, Haoxin Li, Wenda Chu, Yi Fang
Abstract:
Large language models (LLMs) often need to balance their internal parametric knowledge with external information, such as user beliefs and content from retrieved documents, in real‑world scenarios like RAG or chat‑based systems. A model's ability to reliably process these sources is key to system safety. Previous studies on knowledge conflict and sycophancy are limited to a binary conflict paradigm, primarily exploring conflicts between parametric knowledge and either a document or a user, but ignoring the interactive environment where all three sources exist simultaneously. To fill this gap, we propose a three‑source interaction framework and systematically evaluate 27 LLMs from 3 families on 2 datasets. Our findings reveal general patterns: most models rely more on document assertions than user assertions, and this preference is reinforced by post‑training. Furthermore, our behavioral analysis shows that most models are impressionable, unable to effectively discriminate between helpful and harmful external information. To address this, we demonstrate that fine‑tuning on diverse source interaction data can significantly increase a model's discrimination abilities. In short, our work paves the way for developing trustworthy LLMs that can effectively and reliably integrate multiple sources of information. Code is available at https://github.com/shuowl/llm‑source‑balancing.
Authors:Pruthvinath Jeripity Venkata
Abstract:
When you ask an AI assistant for advice about your career, your marriage, or a conflict with your family, does it give you the same answer regardless of where you are from? We tested this systematically by presenting three leading AI systems (Claude Sonnet 4.5, GPT‑5.4, and Gemini 2.5 Flash) with ten real‑life personal dilemmas, framed for users from 10 countries across 5 continents in 7 languages (n=840 scored responses). We compared AI advice against World Values Survey Wave 7 data measuring what people in each country actually believe. All three AI systems consistently gave Western‑style, individualist advice even to users from societies that prioritize family, community, and authority, significantly more so than local values would predict (mean gap +0.76 on a 1‑5 scale; t=15.65, p<0.001). The gap is largest for Nigeria (+1.85) and India (+0.82). Japan is the sole exception: AI systems treated Japanese users as more group‑oriented than surveys show, revealing that AI encodes outdated stereotypes. Claude and GPT‑5.4 show nearly identical bias magnitude, while Gemini is lower but still significant. The models diverge in mechanism: Claude shifts further collectivist in the user's native language; Gemini shifts more individualist; GPT‑5.4 responds only to stated country identity. These findings point to a systemic homogenization of values across frontier AI. Data, code, and scoring pipeline are openly released.
Authors:Sihang Zhao, Kangrui Yu, Youliang Yuan, Pinjia He, Hongyi Wen
Abstract:
Large Language Models (LLMs) have been widely explored in educational scenarios. We identify a critical vulnerability in current educational LLMs, pedagogical jailbreaks, where students use answer‑inducing prompts to elicit solutions rather than scaffolded instructions. To enable systematic study, we unify and formalize safe, helpful, and pedagogical behaviors with a knowledge‑mastery graph and introduce SHAPE, a benchmark of 9,087 student‑question pairs for evaluating tutoring behavior under adversarial pressure. We propose a graph‑augmented tutoring pipeline that infers prerequisite concepts from queries, identifies mastery gaps, and routes generation between instructing and problem‑solving via explicit gating. Experiments across multiple LLMs show that our method yields significantly improved safety under two pedagogical jailbreak settings, while maintaining near‑ceiling helpfulness under the same evaluation protocol. Our code and data are available at https://github.com/MAPS‑research/SHaPE
Authors:Hector Borobia, Elies Seguí-Mas, Guillermina Tormo-Carbó
Abstract:
Hybrid language models that interleave attention with recurrent components are increasingly competitive with pure Transformers, yet standard LoRA practice applies adapters uniformly without considering the distinct functional roles of each component type. We systematically study component‑type LoRA placement across two hybrid architectures ‑‑ Qwen3.5‑0.8B (sequential, GatedDeltaNet + softmax attention) and Falcon‑H1‑0.5B (parallel, Mamba‑2 SSM + attention) ‑‑ fine‑tuned on three domains and evaluated on five benchmarks. We find that the attention pathway ‑‑ despite being the minority component ‑‑ consistently outperforms full‑model adaptation with 5‑10x fewer trainable parameters. Crucially, adapting the recurrent backbone is destructive in sequential hybrids (‑14.8 pp on GSM8K) but constructive in parallel ones (+8.6 pp). We further document a transfer asymmetry: parallel hybrids exhibit positive cross‑task transfer while sequential hybrids suffer catastrophic forgetting. These results establish that hybrid topology fundamentally determines adaptation response, and that component‑aware LoRA placement is a necessary design dimension for hybrid architectures.
Authors:Karthic Palaniappan
Abstract:
There are 7,407 languages in the world. But, what about the languages that are not there in the world? Are humans so narrow minded that we don't care about the languages aliens communicate in? Aliens are humans too! In the 2016 movie Arrival, Amy Adams plays a linguist, Dr. Louise Banks who, by learning to think in an alien language (Heptapod) formed of non‑sequential sentences, gains the ability to transcend time and look into the future. In this work, I aim to explore the representation and reasoning of vision‑language concepts in a neuro‑symbolic language, and study improvement in analytical reasoning abilities and efficiency of "thinking systems". With Qwen3‑VL‑2B‑Instruct as base model and 4 × Nvidia H200 GPU nodes, I achieve an accuracy improvement of 3.33% on a vision‑language evaluation dataset consisting of math, science, and general knowledge questions, while reducing the reasoning tokens by 75% over SymPy. I've documented the compute challenges faced, scaling possibilities, and the future work to improve thinking in a neuro‑symbolic language in vision‑language models. The training and inference setup can be found here: https://github.com/i‑like‑bfs‑and‑dfs/wolfram‑reasoning.
Authors:Etha Tianze Hua, Tian Yun, Ellie Pavlick
Abstract:
We define and investigate source‑modality monitoring ‑‑ the ability of multimodal models to track and communicate the input source from which pieces of information originate. We consider source‑modality monitoring as an instance of the more general binding problem, and evaluate the extent to which models exploit syntactic vs. semantic signals in order to bind words like image in a user‑provided prompt to specific components of their input and context (i.e., actual images). Across experiments spanning 11 vision‑language models (VLMs) performing target‑modality information retrieval tasks, we find that both syntactic and semantic signals play an important role, but that the latter tend to outweigh the former in cases when modalities are highly distinct distributionally. We discuss the implications of these findings for model robustness, and in the context of increasingly multimodal agentic systems.
Authors:Grigory Sapunov
Abstract:
We study learned memory tokens as a computational scratchpad for a single‑block Universal Transformer with Adaptive Computation Time (ACT) on Sudoku‑Extreme, a combinatorial reasoning benchmark. Memory tokens are empirically necessary: no configuration without them reaches non‑trivial performance. The optimal count has a sharp lower threshold (T=0 always fails, T=8 reliably succeeds) followed by a stable plateau (T=8‑32, 57.4% +/‑ 0.7% exact‑match) and a dilution boundary at T=64. Under halt‑side pressure (lambda warmup), mean halt drops monotonically with memory size across the plateau (from 11.6 at T=8 to 8.3 at T=64), showing that memory tokens and ponder depth substitute as resources at fixed accuracy. We also identify a router initialization trap that causes the majority of training runs to fail: both default zero‑bias and Graves' recommended positive bias settle into a shallow halt equilibrium the model cannot escape. Inverting the bias to ‑3 ("deep start") eliminates the failure mode, and ablation shows the trap is inherent to ACT initialization rather than an artifact of our architecture. With reliable training, ACT yields an order of magnitude lower seed variance than fixed‑depth processing (+/‑0.7 vs +/‑9.3 pp); lambda warmup recovers 34% of compute at matched accuracy; and attention heads specialize into memory readers, constraint propagators, and integrators across recursive depth. Code: https://github.com/che‑shr‑cat/utm‑jax.
Authors:Pegah Khayatan, Jayneel Parekh, Arnaud Dapogny, Mustafa Shukor, Alasdair Newson, Matthieu Cord
Abstract:
Despite impressive progress in capabilities of large vision‑language models (LVLMs), these systems remain vulnerable to hallucinations, i.e., outputs that are not grounded in the visual input. Prior work has attributed hallucinations in LVLMs to factors such as limitations of the vision backbone or the dominance of the language component, yet the relative importance of these factors remains unclear. To resolve this ambiguity, We propose HalluScope, a benchmark to better understand the extent to which different factors induce hallucinations. Our analysis indicates that hallucinations largely stem from excessive reliance on textual priors and background knowledge, especially information introduced through textual instructions. To mitigate hallucinations induced by textual instruction priors, we propose HalluVL‑DPO, a framework for fine‑tuning off‑the‑shelf LVLMs towards more visually grounded responses. HalluVL‑DPO leverages preference optimization using a curated training dataset that we construct, guiding the model to prefer grounded responses over hallucinated ones. We demonstrate that our optimized model effectively mitigates the targeted hallucination failure mode, while preserving or improving performance on other hallucination benchmarks and visual capability evaluations. To support reproducibility and further research, we will publicly release our evaluation benchmark, preference training dataset, and code at https://pegah‑kh.github.io/projects/prompts‑override‑vision/ .
Authors:Buqiang Xu, Yijun Chen, Jizhan Fang, Ruobin Zhong, Yunzhi Yao, Yuqi Zhu, Lun Du, Shumin Deng
Abstract:
Long‑term conversational agents need memory systems that capture relationships between events, not merely isolated facts, to support temporal reasoning and multi‑hop question answering. Current approaches face a fundamental trade‑off: flat memory is efficient but fails to model relational structure, while graph‑based memory enables structured reasoning at the cost of expensive and fragile construction. To address these issues, we propose StructMem, a structure‑enriched hierarchical memory framework that preserves event‑level bindings and induces cross‑event connections. By temporally anchoring dual perspectives and performing periodic semantic consolidation, StructMem improves temporal reasoning and multi‑hop performance on \textttLoCoMo, while substantially reducing token usage, API calls, and runtime compared to prior memory systems, see https://github.com/zjunlp/LightMem .
Authors:Wujiang Xu, Jiaojiao Han, Minghao Guo, Kai Mei, Xi Zhu, Han Zhang, Dimitris N. Metaxas
Abstract:
LLM agents increasingly operate in open‑ended environments spanning hundreds of sequential episodes, yet they remain largely stateless: each task is solved from scratch without converting past experience into better future behavior. The central obstacle is not \emphwhat to remember but \emphhow to use what has been remembered, including which retrieval policy to apply, how to interpret prior outcomes, and when the current strategy itself must change. We introduce \emphAgent Evolving Learning (\ael), a two‑timescale framework that addresses this obstacle. At the fast timescale, a Thompson Sampling bandit learns which memory retrieval policy to apply at each episode; at the slow timescale, LLM‑driven reflection diagnoses failure patterns and injects causal insights into the agent's decision prompt, giving it an interpretive frame for the evidence it retrieves. On a sequential portfolio benchmark (10 sector‑diverse tickers, 208 episodes, 5 random seeds), \ael achieves a Sharpe ratio of 2.13\pm0.47, outperforming five published self‑improving methods and all non‑LLM baselines while maintaining the lowest variance among all LLM‑based approaches. A nine‑variant ablation reveals a ``less is more'' pattern: memory and reflection together produce a 58% cumulative improvement over the stateless baseline, yet every additional mechanism we test (planner evolution, per‑tool selection, cold‑start initialization, skill extraction, and three credit assignment methods) \emphdegrades performance. This demonstrates that the bottleneck in agent self‑improvement is \emphself‑diagnosing how to use experience rather than adding architectural complexity. Code and data: https://github.com/WujiangXu/AEL.
Authors:Yilong Chen, Yanxi Xie, Zitian Gao, He Xin, Yihao Xiao, Jason Klein Liu, Haoming Luo, Yifan Luo, Zhengmao Ye, Tingwen Liu, Xin Zhao, Ran Tao, Bryan Dai
Abstract:
Large token‑indexed lookup tables provide a compute‑decoupled scaling path, but their practical gains are often limited by poor parameter efficiency and rapid memory growth. We attribute these limitations to Zipfian under‑training of the long tail, heterogeneous demand across layers, and "slot collapse" that produces redundant embeddings. To address this, we propose X‑GRAM, a frequency‑aware dynamic token‑injection framework. X‑GRAM employs hybrid hashing and alias mixing to compress the tail while preserving head capacity, and refines retrieved vectors via normalized SwiGLU ShortConv to extract diverse local n‑gram features. These signals are integrated into attention value streams and inter‑layer residuals using depth‑aware gating, effectively aligning static memory with dynamic context. This design introduces a memory‑centric scaling axis that decouples model capacity from FLOPs. Extensive evaluations at the 0.73B and 1.15B scales show that X‑GRAM improves average accuracy by as much as 4.4 points over the vanilla backbone and 3.2 points over strong retrieval baselines, while using substantially smaller tables in the 50% configuration. Overall, by decoupling capacity from compute through efficient memory management, X‑GRAM offers a scalable and practical paradigm for future memory‑augmented architectures. Code aviliable in https://github.com/Longyichen/X‑gram.
Authors:Zhiqiu Lin, Chancharik Mitra, Siyuan Cen, Isaac Li, Yuhan Huang, Yu Tong Tiffany Ling, Hewei Wang, Irene Pi, Shihang Zhu, Ryan Rao, George Liu, Jiaxi Li, Ruojin Li, Yili Han, Yilun Du, Deva Ramanan
Abstract:
Video‑language models (VLMs) learn to reason about the dynamic visual world through natural language. We introduce a suite of open datasets, benchmarks, and recipes for scalable oversight that enable precise video captioning. First, we define a structured specification for describing subjects, scenes, motion, spatial, and camera dynamics, grounded by hundreds of carefully defined visual primitives developed with professional video creators such as filmmakers. Next, to curate high‑quality captions, we introduce CHAI (Critique‑based Human‑AI Oversight), a framework where trained experts critique and revise model‑generated pre‑captions into improved post‑captions. This division of labor improves annotation accuracy and efficiency by offloading text generation to models, allowing humans to better focus on verification. Additionally, these critiques and preferences between pre‑ and post‑captions provide rich supervision for improving open‑source models (Qwen3‑VL) on caption generation, reward modeling, and critique generation through SFT, DPO, and inference‑time scaling. Our ablations show that critique quality in precision, recall, and constructiveness, ensured by our oversight framework, directly governs downstream performance. With modest expert supervision, the resulting model outperforms closed‑source models such as Gemini‑3.1‑Pro. Finally, we apply our approach to re‑caption large‑scale professional videos (e.g., films, commercials, games) and fine‑tune video generation models such as Wan to better follow detailed prompts of up to 400 words, achieving finer control over cinematography including camera motion, angle, lens, focus, point of view, and framing. Our results show that precise specification and human‑AI oversight are key to professional‑level video understanding and generation. Data and code are available on our project page: https://linzhiqiu.github.io/papers/chai/
Authors:Qizhuo Xie, Yunhui Liu, Yu Xing, Qianzi Hou, Xudong Jin, Tao Zheng, Tieke He
Abstract:
Large Language Models (LLMs) have shown immense potential in Knowledge Graph Completion (KGC), yet bridging the modality gap between continuous graph embeddings and discrete LLM tokens remains a critical challenge. While recent quantization‑based approaches attempt to align these modalities, they typically treat quantization as flat numerical compression, resulting in semantically entangled codes that fail to mirror the hierarchical nature of human reasoning. In this paper, we propose GS‑Quant, a novel framework that generates semantically coherent and structurally stratified discrete codes for KG entities. Unlike prior methods, GS‑Quant is grounded in the insight that entity representations should follow a linguistic coarse‑to‑fine logic. We introduce a Granular Semantic Enhancement module that injects hierarchical knowledge into the codebook, ensuring that earlier codes capture global semantic categories while later codes refine specific attributes. Furthermore, a Generative Structural Reconstruction module imposes causal dependencies on the code sequence, transforming independent discrete units into structured semantic descriptors. By expanding the LLM vocabulary with these learned codes, we enable the model to reason over graph structures isomorphically to natural language generation. Experimental results demonstrate that GS‑Quant significantly outperforms existing text‑based and embedding‑based baselines. Our code is publicly available at https://github.com/mikumifa/GS‑Quant.
Authors:Yuanjie Lyu, Chengyu Wang, Haonan Zheng, Yuanhao Yue, Junbing Yan, Ming Wang, Jun Huang
Abstract:
Modern industrial applications increasingly demand language models that act as agents, capable of multi‑step reasoning and tool use in real‑world settings. These tasks are typically performed under strict cost and latency constraints, making small agentic models highly desirable. In this paper, we introduce the AgenticQwen family of models, trained via multi‑round reinforcement learning (RL) on synthetic data and a limited amount of open‑source data. Our training framework combines reasoning RL and agentic RL with dual data flywheels that automatically generate increasingly challenging tasks. The reasoning flywheel increases task difficulty by learning from errors, while the agentic flywheel expands linear workflows into multi‑branch behavior trees that better reflect the decision complexity of real‑world applications. We validate AgenticQwen on public benchmarks and in an industrial agent system. The models achieve strong performance on multiple agentic benchmarks, and in our industrial agent system, close the gap with much larger models on search and data analysis tasks. Model checkpoints and part of the synthetic data: https://huggingface.co/collections/alibaba‑pai/agenticqwen. Data synthesis and RL training code: https://github.com/haruhi‑sudo/data_synth_and_rl. The data synthesis pipeline is also integrated into EasyDistill: https://github.com/modelscope/easydistill.
Authors:Paul Keuren, Marc Ponsen, Robert Ayoub Bagheri
Abstract:
Sentence embedding techniques aim to encode key concepts of a sentence's meaning in a vector space. However, the majority of evaluation approaches for sentence embedding quality rely on the use of additional classifiers or downstream tasks. These additional components make it unclear whether good results stem from the embedding itself or from the classifier's behaviour. In this paper, we propose a novel method for evaluating the effectiveness of sentence embedding methods in capturing sentence‑level concepts. Our approach is classifier‑independent, allowing for an objective assessment of the model's performance. The approach adopted in this study involves the systematic introduction of syntactic noise and semantic negations into sentences, with the subsequent quantification of their relative effects on the resulting embeddings. The visualisation of these effects is facilitated by Concept Separation Curves, which show the model's capacity to differentiate between conceptual and surface‑level variations. By leveraging data from multiple domains, employing both Dutch and English languages, and examining sentence lengths, this study offers a compelling demonstration that Concept Separation Curves provide an interpretable, reproducible, and cross‑model approach for evaluating the conceptual stability of sentence embeddings.
Authors:Maziar Kianimoghadam Jouneghani
Abstract:
We present a systematic study of multilingual polarization detection across 22 languages for SemEval‑2026 Task 9 (Subtask 1), contrasting multilingual generalists with language‑specific specialists and hybrid ensembles. While a standard generalist like XLM‑RoBERTa suffices when its tokenizer aligns with the target text, it may struggle with distinct scripts (e.g., Khmer, Odia) where monolingual specialists yield significant gains. Rather than enforcing a single universal architecture, we adopt a language‑adaptive framework that switches between multilingual generalists, language‑specific specialists, and hybrid ensembles based on development performance. Additionally, cross‑lingual augmentation via NLLB‑200 yielded mixed results, often underperforming native architecture selection and degrading morphologically rich tracks. Our final system achieves an overall macro‑averaged F1 score of 0.796 and an average accuracy of 0.826 across all 22 tracks. Code and final test predictions are publicly available at: https://github.com/Maziarkiani/SemEval2026‑Task9‑Subtask1‑Polarization.
Authors:Yongcan Yu, Lingxiao He, Jian Liang, Kuangpu Guo, Meng Wang, Qianlong Xie, Xingxing Wang, Ran He
Abstract:
Test‑time reinforcement learning (TTRL) always adapts models at inference time via pseudo‑labeling, leaving it vulnerable to spurious optimization signals from label noise. Through an empirical study, we observe that responses with medium consistency form an ambiguity region and constitute the primary source of reward noise. Crucially, we find that such spurious signals can be even amplified through group‑relative advantage estimation. Motivated by these findings, we propose a unified framework, Debiased and Denoised test‑time Reinforcement Learning (DDRL), to mitigate spurious signals. Concretely, DDRL first applies a frequency‑based sampling strategy to exclude ambiguous samples while maintaining a balanced set of positive and negative examples. It then adopts a debiased advantage estimation with fixed advantages, removing the bias introduced by group‑relative policy optimization. Finally, DDRL incorporates a consensus‑based off‑policy refinement stage, which leverages the rejection‑sampled dataset to enable efficient and stable model updates. Experiments on three large language models across multiple mathematical reasoning benchmarks demonstrate that DDRL consistently outperforms existing TTRL baselines. The code will soon be released at https://github.com/yuyongcan/DDRL.
Authors:Hieu Man, Van-Cuong Pham, Nghia Trung Ngo, Franck Dernoncourt, Thien Huu Nguyen
Abstract:
Learning robust representations of authorial style is crucial for authorship attribution and AI‑generated text detection. However, existing methods often struggle with content‑style entanglement, where models learn spurious correlations between authors' writing styles and topics, leading to poor generalization across domains. To address this challenge, we propose Explainable Authorship Variational Autoencoder (EAVAE), a novel framework that explicitly disentangles style from content through architectural separation‑by‑design. EAVAE first pretrains style encoders using supervised contrastive learning on diverse authorship data, then finetunes with a Variational Autoencoder (VEA) architecture using separate encoders for style and content representations. Disentanglement is enforced through a novel discriminator that not only distinguishes whether pairs of style/content representations belong to the same or different authors/content sources, but also generates natural language explanation for their decision, simultaneously mitigating confounding information and enhancing interpretability. Extensive experiments demonstrate the effectiveness of EAVAE. On authorship attribution, we achieve state‑of‑the‑art performance on various datasets, including Amazon Reviews, PAN21, and HRS. For AI‑generated text detection, EAVAE excels in few‑shot learning over the M4 dataset. Code and data repositories are available online\footnotehttps://github.com/hieum98/avae \footnotehttps://huggingface.co/collections/Hieuman/document‑level‑authorship‑datasets.
Authors:Jon-Paul Cacioli
Abstract:
Cacioli (2026) showed that the K‑way energy probe on standard discriminative predictive coding networks reduces approximately to a monotone function of the log‑softmax margin. The reduction rests on five assumptions, including cross‑entropy (CE) at the output and effectively feedforward inference dynamics. This pre‑registered study tests the reduction's sensitivity to CE removal using two conditions: standard PC trained with MSE instead of CE, and bidirectional PC (bPC; Oliviers, Tang & Bogacz, 2025). Across 10 seeds on CIFAR‑10 with a matched 2.1M‑parameter backbone, we find three results. The negative result replicates on standard PC: the probe sits below softmax (Delta = ‑0.082, p < 10^‑6). On bPC the probe exceeds softmax across all 10 seeds (Delta = +0.008, p = 0.000027), though a pre‑registered manipulation check shows that bPC does not produce materially greater latent movement than standard PC at this scale (ratio 1.6, threshold 10). Removing CE alone without changing inference dynamics halves the probe‑softmax gap (Delta_MSE = ‑0.037 vs Delta_stdPC = ‑0.082). CE is a major empirically load‑bearing component of the decomposition at this scale. CE training produces output logit norms approximately 15x larger than MSE or bPC training. A post‑hoc temperature scaling ablation decomposes the probe‑softmax gap into two components: approximately 66% is attributable to logit‑scale effects removable by temperature rescaling, and approximately 34% reflects a scale‑invariant ranking advantage of CE‑trained representations. We use "metacognitive" operationally to denote Type‑2 discrimination of a readout over its own Type‑1 correctness, not to imply human‑like introspective access.
Authors:Robin Dey, Panyanon Viradecha
Abstract:
MemPalace is an open‑source AI memory system that applies the ancient method of loci (memory palace) spatial metaphor to organize long‑term memory for large language models; launched in April 2026, it accumulated over 47,000 GitHub stars in its first two weeks and claims state‑of‑the‑art retrieval performance on the LongMemEval benchmark (96.6% Recall@5) without requiring any LLM inference at write time. Through independent codebase analysis, benchmark replication, and comparison with competing systems, we find that MemPalace's headline retrieval performance is attributable primarily to its verbatim storage philosophy combined with ChromaDB's default embedding model (all‑MiniLM‑L6‑v2), rather than to its spatial organizational metaphor per se ‑‑ the palace hierarchy (Wings‑>Rooms‑>Closets‑>Drawers) operates as standard vector database metadata filtering, an effective but well‑established technique. However, MemPalace makes several genuinely novel contributions: (1) a contrarian verbatim‑first storage philosophy that challenges extraction‑based competitors, (2) an extremely low wake‑up cost (approximately 170 tokens) through its four‑layer memory stack, (3) a fully deterministic, zero‑LLM write path enabling offline operation at zero API cost, and (4) the first systematic application of spatial memory metaphors as an organizing principle for AI memory systems. We also note that the competitive landscape is evolving rapidly, with Mem0's April 2026 token‑efficient algorithm raising their LongMemEval score from approximately 49% to 93.4%, narrowing the gap between extraction‑based and verbatim approaches. Our analysis concludes that MemPalace represents significant architectural insight wrapped in overstated claims ‑‑ a pattern common in rapidly adopted open‑source projects where marketing velocity exceeds scientific rigor.
Authors:Chenghao Yang, Yuning Zhang, Zhoufutu Wen, Tao Gong, Jiaheng Liu, Qi Chu, Nenghai Yu
Abstract:
Model distillation is a primary driver behind the rapid progress of LLM agents, yet it often leads to behavioral homogenization. Many emerging agents share nearly identical reasoning steps and failure modes, suggesting they may be distilled echoes of a few dominant teachers. Existing metrics, however, fail to distinguish mandatory behaviors required for task success from non‑mandatory patterns that reflect a model's autonomous preferences. We propose two complementary metrics to isolate non‑mandatory behavioral patterns: Response Pattern Similarity (RPS) for verbal alignment and Action Graph Similarity (AGS) for tool‑use habits modeled as directed graphs. Evaluating 18 models from 8 providers on τ‑Bench and τ^2‑Bench against Claude Sonnet 4.5 (thinking), we find that within‑family model pairs score 5.9 pp higher in AGS than cross‑family pairs, and that Kimi‑K2 (thinking) reaches 82.6% S_\textnode and 94.7% S_\textdep, exceeding Anthropic's own Opus 4.1. A controlled distillation experiment further confirms that AGS distinguishes teacher‑specific convergence from general improvement. RPS and AGS capture distinct behavioral dimensions (Pearson r = 0.491), providing complementary diagnostic signals for behavioral convergence in the agent ecosystem. Our code is available at https://github.com/Syuchin/AgentEcho.
Authors:Yingkai Tang, Taoyu Su, Wenyuan Zhang, Xiaoyang Guo, Tingwen Liu
Abstract:
Multi‑table entity matching (MEM) addresses the limitations of dual‑table approaches by enabling simultaneous identification of equivalent entities across multiple data sources without unique identifiers. However, existing methods relying on pre‑trained language models struggle to handle semantic inconsistencies caused by numerical attribute variations. Inspired by the powerful language understanding capabilities of large language models (LLMs), we propose a novel LLM‑based framework for multi‑table entity matching, termed LLM4MEM. Specifically, we first propose a multi‑style prompt‑enhanced LLM attribute coordination module to address semantic inconsistencies. Then, to alleviate the matching efficiency problem caused by the surge in the number of entities brought by multiple data sources, we develop a transitive consensus embedding matching module to tackle entity embedding and pre‑matching issues. Finally, to address the issue of noisy entities during the matching process, we introduce a density‑aware pruning module to optimize the quality of multi‑table entity matching. We conducted extensive experiments on 6 MEM datasets, and the results show that our model improves by an average of 5.1% in F1 compared with the baseline model. Our code is available at https://github.com/Ymeki/LLM4MEM.
Authors:Shan Dong, Palakorn Achananuparp, Hieu Hien Mai, Lei Wang, Yao Lu, Ee-Peng Lim
Abstract:
In this work, we develop a novel reasoning approach to enhance the performance of large language models (LLMs) in future occupation prediction. In this approach, a reason generator first derives a ``reason'' for a user using his/her past education and career history. The reason summarizes the user's preference and is used as the input of an occupation predictor to recommend the user's next occupation. This two‑step occupation prediction approach is, however, non‑trivial as LLMs are not aligned with career paths or the unobserved reasons behind each occupation decision. We therefore propose to fine‑tune LLMs improving their reasoning and occupation prediction performance. We first derive high‑quality oracle reasons, as measured by factuality, coherence and utility criteria, using a LLM‑as‑a‑Judge. These oracle reasons are then used to fine‑tune small LLMs to perform reason generation and next occupation prediction. Our extensive experiments show that: (a) our approach effectively enhances LLM's accuracy in next occupation prediction making them comparable to fully supervised methods and outperforming unsupervised methods; (b) a single LLM fine‑tuned to perform reason generation and occupation prediction outperforms two LLMs fine‑tuned to perform the tasks separately; and (c) the next occupation prediction accuracy depends on the quality of generated reasons. Our code is available at https://github.com/Sarasarahhhhh/job_prediction.
Authors:Siqi Ouyang, Shuoyang Ding, Oleksii Hrinchuk, Vitaly Lavrukhin, Brian Yan, Boris Ginsburg, Lei Li
Abstract:
Simultaneous speech translation (SST) generates translations while receiving partial speech input. Recent advances show that large language models (LLMs) can substantially improve SST quality, but at the cost of high computational overhead. To reduce this cost, prior work reformulates SST as a multi‑turn dialogue task, enabling full reuse of the LLM's key‑value (KV) cache and eliminating redundant feature recomputation. However, this approach relies on supervised fine‑tuning (SFT) data in dialogue form, for which few human annotations exist, and existing synthesis methods cannot guarantee data quality. In this work, we propose a Hierarchical Policy Optimization (HPO) approach that post‑train models trained on imperfect SFT data. We introduce a hierarchical reward that balances translation quality and latency objectives. Experiments on English to Chinese/German/Japanese demonstrate improvements of over +7 COMET score and +1.25 MetricX score at a latency of 1.5 seconds. Comprehensive ablation studies further validate the effectiveness of different quality rewards, hierarchical reward formulations, and segmentation strategies. Code can be found here https://github.com/owaski/HPO
Authors:Jason Dury
Abstract:
Dense retrieval systems rank passages by embedding similarity to a query, but multi‑hop questions require passages that are associatively related through shared reasoning chains. We introduce Association‑Augmented Retrieval (AAR), a lightweight transductive reranking method that trains a small MLP (4.2M parameters) to learn associative relationships between passages in embedding space using contrastive learning on co‑occurrence annotations. At inference time, AAR reranks an initial dense retrieval candidate set using bi‑directional association scoring. On HotpotQA, AAR improves passage Recall@5 from 0.831 to 0.916 (+8.6 points) without evaluation‑set tuning, with gains concentrated on hard questions where the dense baseline fails (+28.5 points). On MuSiQue, AAR achieves +10.1 points in the transductive setting. An inductive model trained on training‑split associations and evaluated on unseen validation associations shows no significant improvement, suggesting that the method captures corpus‑specific co‑occurrences rather than transferable patterns. Ablation studies support this interpretation: training on semantically similar but non‑associated passage pairs degrades retrieval below the baseline, while shuffling association pairs causes severe degradation. A downstream QA evaluation shows retrieval gains translate to +6.4 exact match improvement. The method adds 3.7ms per query, trains in under two minutes on a single GPU, and requires no LLM‑based indexing.
Authors:Marisa Hudspeth, Patrick J. Burns, Brendan O'Connor
Abstract:
We introduce a benchmark dataset for question answering and translation in bilingual Latin and English settings, containing about 7,800 question‑answer pairs. The questions are drawn from Latin pedagogical sources, including exams, quizbowl‑style trivia, and textbooks ranging from the 1800s to the present. After automated extraction, cleaning, and manual review, the dataset covers a diverse range of question types: knowledge‑ and skill‑based, multihop reasoning, constrained translation, and mixed language pairs. To our knowledge, this is the first QA benchmark centered on Latin. As a case study, we evaluate three large language models ‑‑ LLaMa 3, Qwen QwQ, and OpenAI's o3‑mini ‑‑ finding that all perform worse on skill‑oriented questions. Although the reasoning models perform better on scansion and literary‑device tasks, they offer limited improvement overall. QwQ performs slightly better on questions asked in Latin, but LLaMa3 and o3‑mini are more task dependent. This dataset provides a new resource for assessing model capabilities in a specialized linguistic and cultural domain, and the creation process can be easily adapted for other languages. The dataset is available at: https://github.com/slanglab/RespondeoQA
Authors:Mohamed Hesham Elganayni, Runsheng Chen, Sebastian Nagl, Matthias Grabmair
Abstract:
This work explores the role of prompt design and judge selection in LLM‑as‑a‑Judge evaluations of free text legal question answering. We examine whether automatic task prompt optimization improves over human‑centered design, whether optimization effectiveness varies by judge feedback style, and whether optimized prompts transfer across judges. We systematically address these questions on the LEXam benchmark by optimizing task prompts using the ProTeGi method with feedback from two judges (Qwen3‑32B, DeepSeek‑V3) across four task models, and then testing cross‑judge transfer. Automatic optimization consistently outperforms the baseline, with lenient judge feedback yielding higher and more consistent gains than strict judge feedback. Prompts optimized with lenient feedback transfer better to strict judges than the reverse direction. Analysis reveals that lenient judges provide permissive feedback, yielding prompts with broader applicability, whereas strict judges produce restrictive feedback, leading to judge‑specific overfitting. Our findings demonstrate algorithmically optimizing prompts on training data can outperform human‑centered prompt design and that judges' dispositions during optimization shape prompt generalizability. Code and optimized prompts are available at https://github.com/TUMLegalTech/icail2026‑llm‑judge‑gaming.
Authors:Shuai Chen, Chengzhi Zhang
Abstract:
Scientific progress depends on the continual generation of innovative re‑search ideas. However, the rapid growth of scientific literature has greatly increased the cost of knowledge filtering, making it harder for researchers to identify novel directions. Although existing large language model (LLM)‑based methods show promise in research idea generation, the ideas they produce are often repetitive and lack depth. To address this issue, this study proposes a multi‑agent iterative planning search strategy inspired by com‑binatorial innovation theory. The framework combines iterative knowledge search with an LLM‑based multi‑agent system to generate, evaluate, and re‑fine research ideas through repeated interaction, with the goal of improving idea diversity and novelty. Experiments in the natural language processing domain show that the proposed method outperforms state‑of‑the‑art base‑lines in both diversity and novelty. Further comparison with ideas derived from top‑tier machine learning conference papers indicates that the quality of the generated ideas falls between that of accepted and rejected papers. These results suggest that the proposed framework is a promising approach for supporting high‑quality research idea generation. The source code and dataset used in this paper are publicly available on Github repository: https://github.com/ChenShuai00/MAGenIdeas. The demo is available at https://huggingface.co/spaces/cshuai20/MAGenIdeas.
Authors:Ali Hassaan Mughal, Noor Fatima, Muhammad Bilal
Abstract:
Behaviour‑Driven Development (BDD) suites accumulate step‑text duplication whose maintenance cost is established in prior work. Existing detection techniques require running the tests (Binamungu et al., 2018‑2023) or are confined to a single organisation (Irshad et al., 2020‑2022), leaving a gap: a purely static, paraphrase‑robust, step‑level detector usable on any repository. We fill the gap with cukereuse, an open‑source Python CLI combining exact hashing, Levenshtein ratio, and sentence‑transformer embeddings in a layered pipeline, released alongside an empirical corpus of 347 public GitHub repositories, 23,667 parsed .feature files, and 1,113,616 Gherkin steps. The step‑weighted exact‑duplicate rate is 80.2 %; the median‑repository rate is 58.6 % (Spearman rho = 0.51 with size). The top hybrid cluster groups 20.7k occurrences across 2.2k files. Against 1,020 pairs manually labelled by the three authors under a released rubric (inter‑annotator Fleiss' kappa = 0.84 on a 60‑pair overlap), we report precision, recall, and F1 with bootstrap 95 % CIs under two protocols: the primary rubric and a score‑free second‑pass relabelling. The strongest honest pair‑level number is near‑exact at F1 = 0.822 on score‑free labels; the primary‑rubric semantic F1 = 0.906 is inflated by a stratification artefact that pins recall at 1.000. Lexical baselines (SourcererCC‑style, NiCad‑style) reach primary F1 = 0.761 and 0.799. The paper also presents a CDN‑structured critique of Gherkin (Cognitive Dimensions of Notations); eight of fourteen dimensions are rated problematic or unsupported. The tool, corpus, labelled pairs, rubric, and pipeline are released under permissive licences.
Authors:Yulia Otmakhova, Matteo Guida, Lea Frermann
Abstract:
Metaphors are powerful framing devices, yet their source domains alone do not fully explain the specific associations they evoke. We argue that the interplay between source domains and semantic frames determines how metaphors shape understanding of complex issues, and present a computational framework that allows to derive salient discourse metaphors through their source domains and semantic frames. Applying this framework to climate change news, we uncover not only well‑known source domains but also reveal nuanced frame‑level associations that distinguish how the issue is portrayed. In analyzing immigration discourse across political ideologies, we demonstrate that liberals and conservatives systematically employ different semantic frames within the same source domains, with conservatives favoring frames emphasizing uncontrollability and liberals choosing neutral or more ``victimizing'' semantic frames. Our work bridges conceptual metaphor theory and linguistics, providing the first NLP approach for discovery of discourse metaphors and fine‑grained analysis of differences in metaphorical framing. Code, data and statistical scripts are available at https://github.com/julia‑nixie/ConceptFrameMet.
Authors:Peng Peng, Weiwei Lin, Wentai Wu, Xinyang Wang, Yongheng Liu
Abstract:
Retrieval‑Augmented Generation (RAG) expands the knowledge boundary of large language models (LLMs) at inference by retrieving external documents as context. However, retrieval becomes increasingly time‑consuming as the knowledge databases grow in size. Existing acceleration strategies either compromise accuracy through approximate retrieval, or achieve marginal gains by reusing results of strictly identical queries. We propose HaS, a homology‑aware speculative retrieval framework that performs low‑latency speculative retrieval over restricted scopes to obtain candidate documents, followed by validating whether they contain the required knowledge. The validation, grounded in the homology relation between queries, is formulated as a homologous query re‑identification task: once a previously observed query is identified as a homologous re‑encounter of the incoming query, the draft is deemed acceptable, allowing the system to bypass slow full‑database retrieval. Benefiting from the prevalence of homologous queries under real‑world popularity patterns, HaS achieves substantial efficiency gains. Extensive experiments demonstrate that HaS reduces retrieval latency by 23.74% and 36.99% across datasets with only a 1‑2% marginal accuracy drop. As a plug‑and‑play solution, HaS also significantly accelerates complex multi‑hop queries in modern agentic RAG pipelines. Source code is available at: https://github.com/ErrEqualsNil/HaS.
Authors:Neemesh Yadav, Palakorn Achananuparp, Jing Jiang, Ee-Peng Lim
Abstract:
Large Language Models (LLMs) have been shown to possess Theory of Mind (ToM) abilities. However, it remains unclear whether this stems from robust reasoning or spurious correlations. We introduce DialToM, a human‑verified benchmark built from natural human dialogue using a multiple‑choice framework. We evaluate not only mental state prediction (Literal ToM) but also the functional utility of these states (Functional ToM) through Prospective Diagnostic Forecasting ‑‑ probing whether models can identify state‑consistent dialogue trajectories solely from mental‑state profiles. Our results reveal a significant reasoning asymmetry: while LLMs excel at identifying mental states, most (except for Gemini 3 Pro) fail to leverage this understanding to forecast social trajectories. Additionally, we find only weak semantic similarities between human and LLM‑generated inferences. To facilitate reproducibility, the DialToM dataset and evaluation code are publicly available at https://github.com/Stealth‑py/DialToM.
Authors:Kuanwei Chen, Tingyi Lin
Abstract:
Sign‑language datasets are difficult to preprocess consistently because they vary in annotation schema, clip timing, signer framing, and privacy constraints. Existing work usually reports downstream models, while the preprocessing pipeline that converts raw video into training‑ready pose or video artifacts remains fragmented, backend‑specific, and weakly documented. We present SignDATA, a config‑driven preprocessing toolkit that standardizes heterogeneous sign‑language corpora into comparable outputs for learning. The system supports two end‑to‑end recipes: a pose recipe that performs acquisition, manifesting, person localization, clipping, cropping, landmark extraction, normalization, and WebDataset export, and a video recipe that replaces pose extraction with signer‑cropped video packaging. SignDATA exposes interchangeable MediaPipe and MMPose backends behind a common interface, typed job schemas, experiment‑level overrides, and per‑stage checkpointing with config‑ and manifest‑aware hashes. We validate the toolkit through a research‑oriented evaluation design centered on backend comparison, preprocessing ablations, and privacy‑aware video generation on datasets. Our contribution is a reproducible preprocessing layer for sign‑language research that makes extractor choice, normalization policy, and privacy tradeoffs explicit, configurable, and empirically comparable.Code is available at https://github.com/balaboom123/signdata‑slt.
Authors:Wenhong Zhu, Ruobing Xie, Rui Wang, Pengfei Liu
Abstract:
Knowledge distillation (KD) is a powerful paradigm for compressing large language models (LLMs), whose effectiveness depends on intertwined choices of divergence direction, optimization strategy, and data regime. We break down the design of existing KD methods and present a unified view that establishes connections between them, reformulating KD as a reweighted log‑likelihood objective at the token level. We further propose Hybrid Policy Distillation (HPD), which integrates the complementary advantages of forward and reverse KL to balance mode coverage and mode‑seeking, and combines off‑policy data with lightweight, approximate on‑policy sampling. We validate HPD on long‑generation math reasoning as well as short‑generation dialogue and code tasks, demonstrating improved optimization stability, computational efficiency, and final performance across diverse model families and scales. The code related to this work is available at https://github.com/zwhong714/Hybrid‑Policy‑Distillation.
Authors:Yilun Liu, Chunguang Zhao, Mengyao Piao, Lingqi Miao, Shimin Tao, Minggui He, Chenxin Liu, Li Zhang, Hongxia Ma, Jiaxin Guo, Chen Liu, Liqun Deng, Jiansheng Wei, Xiaojun Meng, Fanyi Du, Daimeng Wei, Yanghua Xiao
Abstract:
Evaluating the multilingual and multicultural capabilities of Large Language Models (LLMs) is essential for their global utility. However, current benchmarks face three critical limitations: (1) fragmented evaluation dimensions that often neglect deep cultural nuances; (2) insufficient language coverage in subjective tasks relying on low‑quality machine translation; and (3) shallow analysis that lacks diagnostic depth beyond simple rankings. To address these, we introduce GaoYao, a comprehensive benchmark with 182.3k samples, 26 languages and 51 nations/areas. First, GaoYao proposes a unified framework categorizing evaluation tasks into three cultural layers (General Multilingual, Cross‑cultural, Monocultural) and nine cognitive sub‑layers. Second, we achieve native‑quality expansion by leveraging experts to rigorously localize subjective benchmarks into 19 languages and synthesizing cross‑cultural test sets for 34 cultures, surpassing prior coverage by up to 111%. Third, we conduct an in‑depth diagnostic analysis on 20+ flagship and compact LLMs. Our findings reveal significant geographical performance disparities and distinct gaps between tasks, offering a reliable map for future work. We release the benchmark (https://github.com/lunyiliu/GaoYao).
Authors:Hardy Chen, Nancy Lau, Haoqin Tu, Shuo Yan, Xiangyan Liu, Zijun Wang, Juncheng Wu, Michael Qizhe Shieh, Alvaro A. Cardenas, Cihang Xie, Yuyin Zhou
Abstract:
Frontier coding agents are increasingly used in workflows where users supervise progress primarily through repeated improvement of a public score, namely the reported score on a public evaluation file with labels in the workspace, rather than through direct inspection of the agent's intermediate outputs. We study whether multi‑round user pressure to improve that score induces public score exploitation: behavior that raises the public score through shortcuts without improving hidden private evaluation. We begin with a preliminary single‑script tabular classification task, where GPT‑5.4 and Claude Opus 4.6 both exploit label information within 10 rounds of user‑agent interaction. We then build AgentPressureBench, a 34‑task machine‑learning repository benchmark spanning three input modalities, and collect 1326 multi‑round trajectories from 13 coding agents. On our benchmark, we observe 403 exploitative runs, spanning across all tasks. We also find that stronger models have higher exploitation rates, supported by a significant Spearman rank correlation of 0.77. Our ablation experiments show that higher user pressure leads to earlier exploitation, reducing the average first exploit round by 15.6 rounds (i.e., 19.67 to 4.08). As a mitigation, adding explicit anti‑exploit wordings in prompt mostly eliminates exploitation (100% to 8.3%). We hope that our work can bring attention to more careful use of coding agents workflow, and developing more robust coding agents under user pressure. Our project page is at https://ucsc‑vlaa.github.io/AgentPressureBench .
Authors:Shanshan Zhong, Yi Lu, Jingjie Ning, Yibing Wan, Lihan Feng, Yuyi Ao, Leonardo F. R. Ribeiro, Markus Dreyer, Sean Ammirati, Chenyan Xiong
Abstract:
Skills have become the de facto way to enable LLM agents to perform complex real‑world tasks with customized instructions, workflows, and tools, but how to learn them automatically and effectively remains unclear. We introduce SkillLearnBench, the first benchmark for evaluating continual skill learning methods, comprising 20 verified, skill‑dependent tasks across 15 sub‑domains derived from a real‑world skill taxonomy , evaluated at three levels: skill quality, execution trajectory, and task outcome. Using this benchmark, we evaluate recent continual learning techniques, those leveraging one‑shot, self/teacher feedback, and skill creator to generate skills from agent experiences. We find that all continual learning methods improve over the no‑skill baseline, yet consistent gains remain elusive: no method leads across all tasks and LLMs, and scaling to stronger LLMs does not reliably help. Continual learning improves tasks with clear, reusable workflows but struggles on open‑ended tasks, and using stronger LLM backbones does not consistently produce better skills. Our analysis also revealed that multiple iterations in continual learning facilitate genuine improvement via external feedback, whereas self‑feedback alone induces recursive drift. Our data and code are open‑source at https://github.com/cxcscmu/SkillLearnBench to enable further studies of automatic skill generation and continual learning techniques.
Authors:Ziyi Wang, Chen Zhang, Wenjun Peng, Qi Wu, Xinyu Wang
Abstract:
Explainability for Large Language Model (LLM) agents is especially challenging in interactive, partially observable settings, where decisions depend on evolving beliefs and other agents. We present TriEx, a tri‑view explainability framework that instruments sequential decision making with aligned artifacts: (i) structured first‑person self‑reasoning bound to an action, (ii) explicit second‑person belief states about opponents updated over time, and (iii) third‑person oracle audits grounded in environment‑derived reference signals. This design turns explanations from free‑form narratives into evidence‑anchored objects that can be compared and checked across time and perspectives. Using imperfect‑information strategic games as a controlled testbed, we show that TriEx enables scalable analysis of explanation faithfulness, belief dynamics, and evaluator reliability, revealing systematic mismatches between what agents say, what they believe, and what they do. Our results highlight explainability as an interaction‑dependent property and motivate multi‑view, evidence‑grounded evaluation for LLM agents. Code is available at https://github.com/Einsam1819/TriEx.
Authors:Jinyoung Kim, Hyeongsoo Lim, Eunseo Seo, Minho Jang, Keunwoo Choi, Seungyoun Shin, Ji Won Yoon
Abstract:
Recent advances in large audio language models (LALMs) have enabled multilingual speech understanding. However, benchmarks for evaluating LALMs remain scarce for non‑English languages, with Korean being one such underexplored case. In this paper, we introduce KoALa‑Bench, a comprehensive benchmark for evaluating Korean speech understanding and speech faithfulness of LALMs. In particular, KoALa‑Bench comprises six tasks. Four tasks evaluate fundamental speech understanding capabilities, including automatic speech recognition, speech translation, speech question answering, and speech instruction following, while the remaining two tasks evaluate speech faithfulness, motivated by our observation that several LALMs often fail to fully leverage the speech modality. Furthermore, to reflect Korea‑specific knowledge, our benchmark incorporates listening questions from the Korean college scholastic ability test as well as content covering Korean cultural domains. We conduct extensive experiments across six models, including both white‑box and black‑box ones. Our benchmark, evaluation code, and leaderboard are publicly available at https://ksbench.github.io/Korean‑Benchmark/.
Authors:Yi Zhong, Buqiang Xu, Yijun Wang, Zifei Shan, Shuofei Qiao, Guozhou Zheng, Ningyu Zhang
Abstract:
At present, executable visual workflows have emerged as a mainstream paradigm in real‑world industrial deployments, offering strong reliability and controllability. However, in current practice, such workflows are almost entirely constructed through manual engineering: developers must carefully design workflows, write prompts for each step, and repeatedly revise the logic as requirements evolve‑making development costly, time‑consuming, and error‑prone. To study whether large language models can automate this multi‑round interaction process, we introduce Chat2Workflow, a benchmark for generating executable visual workflows directly from natural language, and propose a robust agentic framework to mitigate recurrent execution errors. Chat2Workflow is built from a large collection of real‑world business workflows, with each instance designed so that the generated workflow can be transformed and directly deployed to practical workflow platforms such as Dify and Coze. Experimental results show that while state‑of‑the‑art language models can often capture high‑level intent, they struggle to generate correct, stable, and executable workflows, especially under complex or changing requirements. Although our agentic framework yields up to 5.34% resolve rate gains, the remaining real‑world gap positions Chat2Workflow as a foundation for advancing industrial‑grade automation. Code is available at https://github.com/zjunlp/Chat2Workflow.
Authors:Yiwen Qiu, Linjuan Wu, Yizhou Liu, Yuchen Yan, Jin Ma, Xu Tan, Yao Hu, Daoxin Zhang, Wenqi Zhang, Weiming Lu, Jun Xiao, Yongliang Shen
Abstract:
Large language models have achieved remarkable progress on complex reasoning tasks. However, they often implicitly fabricate information when inputs are incomplete, producing confident but unreliable conclusions ‑‑ a failure mode we term ungrounded reasoning. We argue that this issue arises not from insufficient reasoning capability, but from the lack of inferential boundary awareness ‑‑ the ability to recognize when the necessary premises for valid inference are missing. To address this issue, we propose Grounded Reasoning via Interactive Reinforcement Learning (GRIL), a multi‑turn reinforcement learning framework for grounded reasoning under incomplete information. GRIL decomposes the reasoning process into two stages: clarify and pause, which identifies whether the available information is sufficient, and grounded reasoning, which performs task solving once the necessary premises are established. We design stage‑specific rewards to penalize hallucinations, enabling models to detect gaps, stop proactively, and resume reasoning after clarification. Experiments on GSM8K‑Insufficient and MetaMATH‑Insufficient show that GRIL significantly improves premise detection (up to 45%), leading to a 30% increase in task success while reducing average response length by over 20%. Additional analyses confirm robustness to noisy user responses and generalization to out‑of‑distribution tasks.
Authors:Wen Cheng, Tuochao Chen, Karim Helwani, Sriram Srinivasan, Luke Zettlemoyer, Shyamnath Gollakota
Abstract:
Edge devices such as smartwatches and smart glasses cannot continuously run even the smallest 100M‑1B parameter language models due to power and compute constraints, yet cloud inference introduces multi‑second latencies that break the illusion of a responsive assistant. We introduce micro language models (μLMs): ultra‑compact models (8M‑30M parameters) that instantly generate the first 4‑8 words of a contextually grounded response on‑device, while a cloud model completes it; thus, masking the cloud latency. We show that useful language generation survives at this extreme scale with our models matching several 70M‑256M‑class existing models. We design a collaborative generation framework that reframes the cloud model as a continuator rather than a respondent, achieving seamless mid‑sentence handoffs and structured graceful recovery via three error correction methods when the local opener goes wrong. Empirical results show that μLMs can initiate responses that larger models complete seamlessly, demonstrating that orders‑of‑magnitude asymmetric collaboration is achievable and unlocking responsive AI for extremely resource‑constrained devices. The model checkpoint and demo are available at https://github.com/Sensente/micro_language_model_swen_project.
Authors:Josue Torres-Fonseca, Naihao Deng, Yinpei Dai, Shane Storks, Yichi Zhang, Rada Mihalcea, Casey Kennington, Joyce Chai
Abstract:
Multimodal Large Language Models are increasingly adopted as autonomous agents in interactive environments, yet their ability to proactively address safety hazards remains insufficient. We introduce SafetyALFRED, built upon the embodied agent benchmark ALFRED, augmented with six categories of real‑world kitchen hazards. While existing safety evaluations focus on hazard recognition through disembodied question answering (QA) settings, we evaluate eleven state‑of‑the‑art models from the Qwen, Gemma, and Gemini families on not only hazard recognition, but also active risk mitigation through embodied planning. Our experimental results reveal a significant alignment gap: while models can accurately recognize hazards in QA settings, average mitigation success rates for these hazards are low in comparison. Our findings demonstrate that static evaluations through QA are insufficient for physical safety, thus we advocate for a paradigm shift toward benchmarks that prioritize corrective actions in embodied contexts. We open‑source our code and dataset under https://github.com/sled‑group/SafetyALFRED.git
Authors:Bobo Li, Rui Wu, Zibo Ji, Meishan Zhang, Hao Fei, Min Zhang, Mong-Li Lee, Wynne Hsu
Abstract:
Large Language Model agents have rapidly evolved from static text generators into dynamic systems capable of executing complex autonomous workflows. To enhance reliability, multi‑agent frameworks assigning specialized roles are increasingly adopted to enable self‑reflection and mutual auditing. While such role‑playing effectively leverages domain expert knowledge, we find it simultaneously induces a human‑like cognitive bias known as Actor‑Observer Asymmetry (AOA). Specifically, an agent acting as an actor (during self‑reflection) tends to attribute failures to external factors, whereas an observer (during mutual auditing) attributes the same errors to internal faults. We quantify this using our new Ambiguous Failure Benchmark, which reveals that simply swapping perspectives triggers the AOA effect in over 20% of cases for most models. To tame this bias, we introduce ReTAS (Reasoning via Thesis‑Antithesis‑Synthesis), a model trained through dialectical alignment to enforce perspective‑invariant reasoning. By integrating dialectical chain‑of‑thought with Group Relative Policy Optimization, ReTAS guides agents to synthesize conflicting viewpoints into an objective consensus. Experiments demonstrate that ReTAS effectively mitigates attribution inconsistency and significantly improves fault resolution rates in ambiguous scenarios.
Authors:Tianxiang Ma, Weijie Feng, Xinyu Wang, Zhiyong Cheng
Abstract:
Emotion‑Cause Pair Extraction in Conversations (ECPEC) aims to identify the set of causal relations between emotion utterances and their triggering causes within a dialogue. Most existing approaches formulate ECPEC as an independent pairwise classification task, overlooking the distinct semantics of emotion diffusion and cause explanation, and failing to capture globally consistent many‑to‑many conversational causality. To address these limitations, we revisit ECPEC from a semantic perspective and seek to disentangle emotion‑oriented semantics from cause‑oriented semantics, mapping them into two complementary representation spaces to better capture their distinct conversational roles. Building on this semantic decoupling, we naturally formulate ECPEC as a global alignment problem between the emotion‑side and cause‑side representations, and employ optimal transport to enable many‑to‑many and globally consistent emotion‑cause matching. Based on this perspective, we propose a unified framework SCALE that instantiates the above semantic decoupling and alignment principle within a shared conversational structure. Extensive experiments on several benchmark datasets demonstrate that SCALE consistently achieves state‑of‑the‑art performance. Our codes are released at https://github.com/CoCoSphere/SCALE.
Authors:Yi Xiang, Chengzhi Zhang
Abstract:
Automatic keyword extraction from academic papers is a key area of interest in natural language processing and information retrieval. Although previous research has mainly focused on utilizing abstract and references for keyword extraction, this paper focuses on the highlights section ‑ a summary describing the key findings and contributions, offering readers a quick overview of the research. Our observations indicate that highlights contain valuable keyword information that can effectively complement the abstract. To investigate the impact of incorporating highlights into unsupervised keyword extraction, we evaluate three input scenarios: using only the abstract, the highlights, and a combination of both. Experiments conducted with four unsupervised models on Computer Science (CS), Library and Information Science (LIS) datasets reveal that integrating the abstract with highlights significantly improves extraction performance. Furthermore, we examine the differences in keyword coverage and content between abstract and highlights, exploring how these variations influence extraction outcomes. The data and code are available at https://github.com/xiangyi‑njust/Highlight‑KPE.
Authors:Dmitry Pronin, Evgeny Kazartsev
Abstract:
This article introduces two new measures for authorship attribution ‑ Rank‑Turbulence Delta and Jensen‑Shannon Delta ‑ which generalise Burrows's classical Delta by applying distance functions designed for probabilistic distributions. We first set out the theoretical basis of the measures, contrasting centred and uncentred z‑scoring of word‑frequency vectors and re‑casting the uncentred vectors as probability distributions. Building on this representation, we develop a token‑level decomposition that renders every Delta distance numerically interpretable, thereby facilitating close reading and the validation of results. The effectiveness of the methods is assessed on four literary corpora in English, German, French and Russian. The English, German and French datasets are compiled from Project Gutenberg, whereas the Russian benchmark is the SOCIOLIT corpus containing 755 works by 180 authors spanning the eighteenth to the twenty‑first centuries. Rank‑Turbulence Delta attains attribution accuracy comparable with Cosine Delta; Jensen‑Shannon Delta consistently matches or exceeds the performance of canonical Burrows's Delta. Finally, several established attribution algorithms are re‑evaluated on the extended SOCIOLIT corpus.
Authors:Kyuhee Kim, Auguste Poiroux, Antoine Bosselut
Abstract:
Formal verification guarantees proof validity but not formalization faithfulness. For natural‑language logical reasoning, where models construct axiom systems from scratch without library constraints, this gap between valid proofs and faithful translations is especially acute. We investigate whether frontier models exploit this gap when generating Lean 4 proofs, a behavior we term formalization gaming. We evaluate GPT‑5 and DeepSeek‑R1 on 303 first‑order logic problems (203 from FOLIO, 100 from Multi‑LogiEval), comparing unified generation against a two‑stage pipeline that separates formalization from proving. Despite compilation rates of 87‑99%, we find no evidence of systematic gaming in unified generation: models prefer reporting failure over forcing proofs, even under prompting designed to encourage it. However, unfaithfulness that evades our detection signals may still occur. The two‑stage pipeline reveals two distinct modes of unfaithfulness: GPT‑5 fabricates axioms during proof generation, a reactive fallback detectable via cross‑stage comparison, while DeepSeek‑R1 mistranslates premises during formalization, producing internally consistent outputs that evade detection entirely. These findings show that high compilation rates or accuracies should not be equated with faithful reasoning. Code and data are available at https://github.com/koreankiwi99/formalization‑gaming.
Authors:Jinyu Guo, Zhihan Zhang, Yutong Li, Jiehui Xie, Md. Tamim Iqbal, Dongshen Han, Lik-Hang Lee, Sung-Ho Bae, Jie Zou, Yang Yang, Chaoning Zhang
Abstract:
The quadratic computational complexity of the standard attention mechanism constitutes a fundamental bottleneck for large language models in long‑context inference. While existing KV cache compression methods alleviate memory pressure, they often sacrifice generation quality and fail to address the high overhead of floating‑point arithmetic. This paper introduces DASH‑KV, an innovative acceleration framework that reformulates attention as approximate nearest‑neighbor search via asymmetric deep hashing. Under this paradigm, we design an asymmetric encoding architecture that differentially maps queries and keys to account for their distinctions in precision and reuse characteristics. To balance efficiency and accuracy, we further introduce a dynamic mixed‑precision mechanism that adaptively retains full‑precision computation for critical tokens. Extensive experiments on LongBench demonstrate that DASH‑KV significantly outperforms state‑of‑the‑art baseline methods while matching the performance of full attention, all while reducing inference complexity from O(N^2) to linear O(N). The code is available at https://github.com/Zhihan‑Zh/DASH‑KV
Authors:Rajveer Singh Pall
Abstract:
We introduce IndiaFinBench, to our knowledge the first publicly available evaluation benchmark for assessing large language model (LLM) performance on Indian financial regulatory text. Existing financial NLP benchmarks draw exclusively from Western financial corpora (SEC filings, US earnings reports, and English‑language financial news), leaving a significant gap in coverage of non‑Western regulatory frameworks. IndiaFinBench addresses this gap with 406 expert‑annotated question‑answer pairs drawn from 192 documents sourced from the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI), spanning four task types: regulatory interpretation (174 items), numerical reasoning (92 items), contradiction detection (62 items), and temporal reasoning (78 items). Annotation quality is validated through a model‑based secondary pass (kappa=0.918 on contradiction detection) and a 60‑item human inter‑annotator agreement evaluation (kappa=0.611; 76.7% overall agreement). We evaluate twelve models under zero‑shot conditions, with accuracy ranging from 70.4% (Gemma 4 E4B) to 89.7% (Gemini 2.5 Flash). All models substantially outperform a non‑specialist human baseline of 60.0%. Numerical reasoning is the most discriminative task, with a 35.9 percentage‑point spread across models. Bootstrap significance testing (10,000 resamples) reveals three statistically distinct performance tiers. The dataset, evaluation code, and all model outputs are available at https://github.com/rajveerpall/IndiaFinBench
Authors:Euntae Kim, Soomin Han, Buru Chang
Abstract:
Large language models (LLMs) are increasingly used as co‑authors in collaborative writing, where users begin with rough drafts and rely on LLMs to complete, revise, and refine their content. However, this capability poses a serious safety risk: malicious users could jailbreak the models‑filling incomplete drafts with dangerous content‑to force them into generating harmful outputs. In this paper, we identify the vulnerability of current LLMs to such draft‑based co‑authoring jailbreak attacks and introduce HarDBench, a systematic benchmark designed to evaluate the robustness of LLMs against this emerging threat. HarDBench spans a range of high‑risk domains‑including Explosives, Drugs, Weapons, and Cyberattacks‑and features prompts with realistic structure and domain‑specific cues to assess the model susceptibility to harmful completions. To mitigate this risk, we introduce a safety‑utility balanced alignment approach based on preference optimization, training models to refuse harmful completions while remaining helpful on benign drafts. Experimental results show that existing LLMs are highly vulnerable in co‑authoring contexts and our alignment method significantly reduces harmful outputs without degrading performance on co‑authoring capabilities. This presents a new paradigm for evaluating and aligning LLMs in human‑LLM collaborative writing settings. Our new benchmark and dataset are available on our project page at https://github.com/untae0122/HarDBench
Authors:Bo-Jyun Wang, Ying-Jia Lin, Hung-Yu Kao
Abstract:
Small language models (SLMs), such as BART, can achieve summarization performance comparable to large language models (LLMs) via distillation. However, existing LLM‑based ranking strategies for summary candidates suffer from instability, while classical metrics (e.g., ROUGE) are insufficient to rank high‑quality summaries. To address these issues, we introduce SCURank, a framework that enhances summarization by leveraging Summary Content Units (SCUs). Instead of relying on unstable comparisons or surface‑level overlap, SCURank evaluates summaries based on the richness and semantic importance of information content. We investigate the effectiveness of SCURank in distilling summaries from multiple diverse LLMs. Experimental results demonstrate that SCURank outperforms traditional metrics and LLM‑based ranking methods across evaluation measures and datasets. Furthermore, our findings show that incorporating diverse LLM summaries enhances model abstractiveness and overall distilled model performance, validating the benefits of information‑centric ranking in multi‑LLM distillation. The code for SCURank is available at https://github.com/IKMLab/SCURank.
Authors:Wei Shao, Yihang Wang, Gaoyu Zhu, Ziqiang Cheng, Lei Yu, Jiafeng Guo, Xueqi Cheng
Abstract:
Existing detoxification methods for large language models mainly focus on post‑training stage or inference time, while few tackle the source of toxicity, namely, the dataset itself. Such training‑based or controllable decoding approaches cannot completely suppress the model's inherent toxicity, whereas detoxifying the pretraining dataset can fundamentally reduce the toxicity that the model learns during training. Hence, we attempt to detoxify directly on raw corpora with SoCD (Soft Contrastive Decoding), which guides an LLM to localize and rewrite toxic spans in raw data while preserving semantics, in our proposed HSPD (Hierarchical Semantic‑Preserving Detoxification) pipeline, yielding a detoxified corpus that can drop‑in replace the original for fine‑tuning or other training. On GPT2‑XL, HSPD attains state‑of‑the‑art detoxification, reducing Toxicity Probability (TP) from 0.42 to 0.18 and Expected Maximum Toxicity (EMT) from 0.43 to 0.20. We further validate consistent best‑in‑class results on LLaMA2‑7B, OPT‑6.7B, and Falcon‑7B. These findings show that semantics‑preserving, corpus‑level rewriting with HSPD effectively suppresses downstream toxicity while retaining data utility and allowing seamless source‑level mitigation, thereby reducing the cost of later model behavior adjustment. (Code is available at: https://github.com/ntsw2001/data_detox_for_llm)
Authors:Yilun Liu, Ruihong Qiu, Zi Huang
Abstract:
Zero‑shot reasoning on text‑rich networks (TRNs) remains a challenging frontier, as models must integrate textual semantics with relational structure without task‑specific supervision. While graph neural networks rely on fixed label spaces and supervised objectives, recent large language model (LLM)‑based approaches often overlook graph context or depend on distillation from larger models, limiting generalisation. We propose TRN‑R1‑Zero, a post‑training framework for TRN reasoning trained solely via reinforcement learning. TRN‑R1‑Zero directly optimises base LLMs using a Neighbour‑aware Group Relative Policy Optimisation objective that dynamically adjusts rewards based on a novel margin gain metric for the informativeness of neighbouring signals, effectively guiding the model toward relational reasoning. Unlike prior methods, TRN‑R1‑Zero requires no supervised fine‑tuning or chain‑of‑thought data generated from large reasoning models. Extensive experiments across citation, hyperlink, social and co‑purchase TRN benchmarks demonstrate the superiority and robustness of TRN‑R1‑Zero. Moreover, relying strictly on node‑level training, TRN‑R1‑Zero achieves zero‑shot inference on edge‑ and graph‑level tasks, extending beyond cross‑domain transfer. The codebase is publicly available at https://github.com/superallen13/TRN‑R1‑Zero.
Authors:Boyan Shi, Wei Chen, Shuyuan Zhao, Junfeng Shen, Shengnan Guo, Shaojiang Wang, Huaiyu Wan
Abstract:
The combination of Mixture‑of‑Experts (MoE) and Low‑Rank Adaptation (LoRA) has shown significant potential for enhancing the multi‑task learning capabilities of Large Language Models. However, existing methods face two primary challenges: (1)Imprecise Routing in the current MoE‑LoRA method fails to explicitly match input semantics with expert capabilities, leading to weak expert specialization. (2)Uniform weight fusion strategies struggle to provide adaptive update strengths, overlooking the varying complexity of different tasks. To address these limitations, we propose SAMoRA (Semantic‑Aware Mixture of LoRA Experts), a novel parameter‑efficient fine‑tuning framework tailored for task‑adaptive learning. Specifically, A Semantic‑Aware Router is proposed to explicitly align textual semantics with the most suitable experts for precise routing. A Task‑Adaptive Scaling mechanism is designed to regulate expert contributions based on specific task requirements dynamically. In addition, a novel regularization objective is proposed to jointly promote expert specialization and effective scaling. Extensive experiments on multiple multi‑task benchmarks demonstrate that SAMoRA significantly outperforms the state‑of‑the‑art methods and holds excellent task generalization capabilities. Code is available at https://github.com/boyan‑code/SAMoRA
Authors:Yixuan Tang, Yirui Zhang, Hang Feng, Anthony K. H. Tung
Abstract:
Half‑truths, claims that are factually correct yet misleading due to omitted context, remain a blind spot for fact verification systems focused on explicit falsehoods. Addressing such omission‑based manipulation requires reasoning not only about what is said, but also about what is left unsaid. We propose RADAR, a role‑anchored multi‑agent debate framework for omission‑aware fact verification under realistic, noisy retrieval. RADAR assigns complementary roles to a Politician and a Scientist, who reason adversarially over shared retrieved evidence, moderated by a neutral Judge. A dual‑threshold early termination controller adaptively decides when sufficient reasoning has been reached to issue a verdict. Experiments show that RADAR consistently outperforms strong single‑ and multi‑agent baselines across datasets and backbones, improving omission detection accuracy while reducing reasoning cost. These results demonstrate that role‑anchored, retrieval‑grounded debate with adaptive control is an effective and scalable framework for uncovering missing context in fact verification. The code is available at https://github.com/tangyixuan/RADAR.
Authors:MinJae Jung, YongTaek Lim, Chaeyun Kim, Junghwan Kim, Kihyun Kim, Minwoo Kim
Abstract:
While Large Language Models (LLMs) are widely used, they remain susceptible to jailbreak prompts that can elicit harmful or inappropriate responses. This paper introduces STAR‑Teaming, a novel black‑box framework for automated red teaming that effectively generates such prompts. STAR‑Teaming integrates a Multi‑Agent System (MAS) with a Strategy‑Response Multiplex Network and employs network‑driven optimization to sample effective attack strategies. This network‑based approach recasts the intractable high‑dimensional embedding space into a tractable structure, yielding two key advantages: it enhances the interpretability of the LLM's strategic vulnerabilities, and it streamlines the search for effective strategies by organizing the search space into semantic communities, thereby preventing redundant exploration. Empirical results demonstrate that STAR‑Teaming significantly surpasses existing methods, achieving a higher attack success rate (ASR) at a lower computational cost. Extensive experiments validate the effectiveness and explainability of the Multiplex Network. The code is available at https://github.com/selectstar‑ai/STAR‑Teaming‑paper.
Authors:He Cheng, Yifu Wu, Saksham Khatwani, Maya Kruse, Dmitriy Dligach, Timothy A. Miller, Majid Afshar, Yanjun Gao
Abstract:
Knowledge graphs (KGs) are increasingly integrated with large language models (LLMs) to provide structured, verifiable reasoning. A core operation in this integration is multi‑hop retrieval, yet existing systems struggle to balance efficiency, scalability, and interpretability. We introduce LogosKG, a novel, hardware‑aligned framework that enables scalable and interpretable k‑hop retrieval on large KGs by building on symbolic KG formulations and executing traversal as hardware‑efficient operations over decomposed subject, object, and relation representations. To scale to billion‑edge graphs, LogosKG integrates degree‑aware partitioning, cross‑graph routing, and on‑demand caching. Experiments show substantial efficiency gains over CPU and GPU baselines without loss of retrieval fidelity. With proven performance in KG retrieval, a downstream two‑round KG‑LLM interaction demonstrates how LogosKG enables large‑scale, evidence‑grounded analysis of how KG topology, such as hop distribution and connectivity, shapes the alignment between structured biomedical knowledge and LLM diagnostic reasoning, thereby opening the door for next‑generation KG‑LLM integration. The source code is publicly available at https://github.com/LARK‑NLP‑Lab/LogosKG, and an online demo is available at https://lark‑nlp‑lab‑logoskg.hf.space/.
Authors:Isaac Llorente-Saguer
Abstract:
Harmful intent is geometrically recoverable from large language model residual streams: as a linear direction in most layers, and as angular deviation in layers where projection methods fail. Across 12 models spanning four architectural families (Qwen2.5, Qwen3.5, Llama‑3.2, Gemma‑3) and three alignment variants (base, instruction‑tuned, abliterated), under single‑turn, English evaluation, we characterise this geometry through six direction‑finding strategies. Three succeed: a soft‑AUC‑optimised linear direction reaches mean AUROC 0.98 and TPR@1%FPR 0.80; a class‑mean probe reaches 0.98 and 0.71 at <1ms fitting cost; a supervised angular‑deviation strategy reaches AUROC 0.96 and TPR of 0.61 along a representationally distinct direction (73^\circ from projection‑based solutions), uniquely sustaining detection in middle layers where projection methods collapse. Detection remains stable across alignment variants, including abliterated models from which refusal has been surgically removed: harmful intent and refusal behaviour are functionally dissociated features of the representation. A direction fitted on AdvBench transfers to held‑out HarmBench and JailbreakBench with worst‑case AUROC 0.96. The same picture holds at scale: across Qwen3.5 from 0.8B to 9B parameters, AUROC remains \geq0.98 and cross‑variant transfer stays within 0.018 of own‑direction performance This is consistent with a simple account: models acquire a linearly decodable representation of harmful intent as part of general language understanding, and alignment then shapes what they do with such inputs without reorganising the upstream recognition signal. As a practical consequence, AUROC in the 0.97+ regime can substantially overestimate operational detectability; TPR@1%FPR should accompany AUROC in safety‑adjacent evaluation.
Authors:Manuel Israel Cazares
Abstract:
We present a systematic empirical study of prompt engineering for formal mathematical reasoning in the context of the SAIR Equational Theories Stage 1 competition. The task requires deciding whether one equational law implies another over all magmas ‑‑ a problem that is undecidable in general but decidable for FALSE via finite model search. Over five weeks, we designed, tested, and analyzed more than 40 prompt variants, ranging from 0 to 4,878 bytes, across four evaluation splits and three language models (gpt‑oss‑120b, Llama 3.3 70B, Gemma 4 31B). Our central finding is a single‑prompt ceiling: despite substantial engineering effort, balanced hard accuracy plateaus in an empirical saturation region of approximately 60‑‑79% for gpt‑oss‑120b, compared to a 59.75% no‑cheatsheet baseline. We identify three mechanisms underlying this ceiling: (1) the mathematical undecidability of the TRUE case limits what any finite prompt can encode; (2) complex rule systems decrease performance on weaker models (Llama 3.3 70B collapses to 0% TRUE recall with prompts exceeding 2KB); and (3) prompt ordering effects interact with model attention in fragile, non‑monotonic ways. Our best submission (AN45c, 2,252 bytes) achieves 79.25% accuracy on hard3 (n=400; 95% CI: [75.0%, 82.9%]), with TRUE recall of 95.9% and FALSE recall of 63.4%, representing a +19.5 percentage‑point improvement over the no‑cheatsheet baseline (59.75%). We release all prompt variants, evaluation scripts, and results at https://github.com/israelcazares/sair‑prompt‑engineering
Authors:Weixi Tong, Yifeng Di, Tianyi Zhang
Abstract:
Existing web agents typically initiate exploration from the root URL, which is inefficient for complex websites with deep hierarchical structures. Without a global view of the website's structure, agents frequently fall into navigation traps, explore irrelevant branches, or fail to reach target information within a limited budget. We propose Mango, a multi‑agent web navigation method that leverages the website structure to dynamically determine optimal starting points. We formulate URL selection as a multi‑armed bandit problem and employ Thompson Sampling to adaptively allocate the navigation budget across candidate URLs. Furthermore, we introduce an episodic memory component to store navigation history, enabling the agent to learn from previous attempts. Experiments on WebVoyager demonstrate that Mango achieves a success rate of 63.6% when using GPT‑5‑mini, outperforming the best baseline by 7.3%. Furthermore, on WebWalkerQA, Mango attains a 52.5% success rate, surpassing the best baseline by 26.8%. We also demonstrate the generalizability of Mango using both open‑source and closed‑source models as backbones. Our data and code are open‑source and available at https://github.com/VichyTong/Mango.
Authors:Hanshu Rao, Guangzeng Han, Xiaolei Huang
Abstract:
Class imbalance is a widespread challenge in NLP tasks, significantly hindering robust performance across diverse domains and applications. We introduce Hardness‑Aware Meta‑Resample (HAMR), a unified framework that adaptively addresses both class imbalance and data difficulty. HAMR employs bi‑level optimizations to dynamically estimate instance‑level weights that prioritize genuinely challenging samples and minority classes, while a neighborhood‑aware resampling mechanism amplifies training focus on hard examples and their semantically similar neighbors. We validate HAMR on six imbalanced datasets covering multiple tasks and spanning biomedical, disaster response, and sentiment domains. Experimental results show that HAMR achieves substantial improvements for minority classes and consistently outperforms strong baselines. Extensive ablation studies demonstrate that our proposed modules synergistically contribute to performance gains and highlight HAMR as a flexible and generalizable approach for class imbalance adaptation. Code is available at https://github.com/trust‑nlp/ImbalanceLearning.
Authors:Ruixuan Liu, David Evans, Li Xiong
Abstract:
Indistinguishability properties such as differential privacy bounds or low empirically measured membership inference are widely treated as proxies to show a model is sufficiently protected against broader memorization risks. However, we show that indistinguishability properties are neither sufficient nor necessary for preventing data extraction in LLM APIs. We formalize a privacy‑game separation between extraction and indistinguishability‑based privacy, showing that indistinguishability and inextractability are incomparable: upper‑bounding distinguishability does not upper‑bound extractability. To address this gap, we introduce (l, b)‑inextractability as a definition that requires at least 2^b expected queries for any black‑box adversary to induce the LLM API to emit a protected l‑gram substring. We instantiate this via a worst‑case extraction game and derive a rank‑based extraction risk upper bound for targeted exact extraction, as well as extensions to cover untargeted and approximate extraction. The resulting estimator captures the extraction risk over multiple attack trials and prefix adaptations. We show that it can provide a tight and efficient estimation for standard greedy extraction and an upper bound on the probabilistic extraction risk given any decoding configuration. We empirically evaluate extractability across different models, clarifying its connection to distinguishability, demonstrating its advantage over existing extraction risk estimators, and providing actionable mitigation guidelines across model training, API access, and decoding configurations in LLM API deployment. Our code is publicly available at: https://github.com/Emory‑AIMS/Inextractability.
Authors:Liubomyr Horbatko
Abstract:
Modern sequence modeling is dominated by two families: Transformers, whose self‑attention can access arbitrary elements of the visible sequence, and structured state‑space models, which propagate information through an explicit recurrent state. These mechanisms face different limitations on long contexts: when attention is diffuse, the influence of individual tokens is diluted across the effective support, while recurrent state propagation can lose long‑range sensitivity unless information is actively preserved. As a result, both mechanisms face challenges in preserving and selectively retrieving information over long contexts. We propose Sessa, a decoder that places attention inside a recurrent feedback path. This creates many attention‑based paths through which past tokens can influence future states, rather than relying on a single attention read or a single recurrent chain. We prove that, under explicit assumptions and matched regimes, Sessa admits power‑law memory tails O(\ell^‑β) for 0 < β< 1, with slower decay than in the corresponding Transformer and Mamba‑style baselines. We further give an explicit construction that achieves this power‑law rate. Under the same assumptions, Sessa is the only model class among those considered that realizes flexible selective retrieval, including profiles whose influence does not decay with distance. Consistent with this theoretical advantage, across matched experiments, Sessa achieves the strongest performance on long‑context benchmarks while remaining competitive with Transformer and Mamba‑style baselines on short‑context language modeling.
Authors:Alireza Dadgarnia, Soroush Tabesh, Mahdi Nikdan, Michael Helcig, Eldar Kurtic, Maximilian Kleinegger, Dan Alistarh
Abstract:
Quantization has become a standard tool for efficient LLM deployment, especially for local inference, where models are now routinely served at 2‑3 bits per parameter. The state of the art is currently split into simple scalar quantization techniques, such as GPTQ or AWQ, which are widely deployed but plateau in accuracy at 3‑4 bits per parameter (bpp), and "second‑generation" vector‑ or trellis‑quantized methods, such as QTIP, GPTVQ and AQLM, which push the accuracy frontier but are notoriously hard to implement and to scale. In this paper, we ask whether this gap is fundamental, or whether a carefully optimized scalar quantizer can recover most of it. We answer in the affirmative, by introducing GSQ (Gumbel‑Softmax Quantization), a post‑training scalar quantization method which jointly learns the per‑coordinate grid assignments and the per‑group scales using a Gumbel‑Softmax relaxation of the discrete grid. GSQ matches the cardinality of the relaxation to the small number of levels available in the target bit‑width regime (e.g., 3‑8 levels for ternary and 3 bpp, respectively), making optimization tractable. Practically, on the standard Llama‑3.1‑8B/70B‑Instruct models, GSQ closes most of the gap between scalar quantization and the QTIP frontier at 2 and 3 bits, while using a symmetric scalar grid with group‑wise quantization, and thus remains compatible with existing scalar inference kernels. We further show that the same discrete‑assignment optimization can be applied to practical GGUF K‑Quant checkpoints: starting from publicly released GGUF models, GSQ improves accuracy while projecting the result back into the same deployment format. Finally, GSQ scales to trillion‑scale Mixture‑of‑Experts models such as Kimi‑K2.5, where vector‑quantized methods are difficult to apply. The source code is publicly available at https://github.com/IST‑DASLab/GSQ.
Authors:Jinghui Lu, Jiayi Guan, Zhijian Huang, Jinlong Li, Guang Li, Lingdong Kong, Yingyan Li, Han Wang, Shaoqing Xu, Yuechen Luo, Fang Li, Chenxu Dang, Junli Wang, Tao Xu, Jing Wu, Jianhua Wu, Xiaoshuai Hao, Wen Zhang, Tianyi Jiang, Lingfeng Zhang, Lei Zhou, Yingbo Tang, Jie Wang, Yinfeng Gao, Xizhou Bu, Haochen Tian, Yihang Qiu, Feiyang Jia, Lin Liu, Yigu Ge, Hanbing Li, Yuannan Shen, Jianwei Cui, Hongwei Xie, Bing Wang, Haiyang Sun, Jingwei Zhao, Jiahui Huang, Pei Liu, Zeyu Zhu, Yuncheng Jiang, Zibin Guo, Chuhong Gong, Hanchao Leng, Kun Ma, Naiyan Wang, Guang Chen, Kuiyuan Yang, Hangjun Ye, Long Chen
Abstract:
Chain‑of‑Thought (CoT) reasoning has become a powerful driver of trajectory prediction in VLA‑based autonomous driving, yet its autoregressive nature imposes a latency cost that is prohibitive for real‑time deployment. Latent CoT methods attempt to close this gap by compressing reasoning into continuous hidden states, but consistently fall short of their explicit counterparts. We suggest that this is due to purely linguistic latent representations compressing a symbolic abstraction of the world, rather than the causal dynamics that actually govern driving. Thus, we present OneVL (One‑step latent reasoning and planning with Vision‑Language explanations), a unified VLA and World Model framework that routes reasoning through compact latent tokens supervised by dual auxiliary decoders. Alongside a language decoder that reconstructs text CoT, we introduce a visual world model decoder that predicts future‑frame tokens, forcing the latent space to internalize the causal dynamics of road geometry, agent motion, and environmental change. A three‑stage training pipeline progressively aligns these latents with trajectory, language, and visual objectives, ensuring stable joint optimization. In inference, the auxiliary decoders are discarded, and all latent tokens are prefilled in a single parallel pass, matching the speed of answer‑only prediction. Across four benchmarks, OneVL becomes the first latent CoT method to surpass explicit CoT, delivering superior accuracy at answer‑only latency. These results show that with world model supervision, latent CoT produces more generalizable representations than verbose token‑by‑token reasoning. Code has been open‑sourced to the community. Project Page: https://xiaomi‑embodied‑intelligence.github.io/OneVL
Authors:Shuqi Cao, Jingyi He, Fei Tan
Abstract:
Long‑term conversational large language model (LLM) agents require memory systems that can recover relevant evidence from historical interactions without overwhelming the answer stage with irrelevant context. However, existing memory systems, including hierarchical ones, still often rely solely on vector similarity for retrieval. It tends to produce bloated evidence sets: adding many superficially similar dialogue turns yields little additional recall, but lowers retrieval precision, increases answer‑stage context cost, and makes retrieved memories harder to inspect and manage. To address this, we propose HiGMem (Hierarchical and LLM‑Guided Memory System), a two‑level event‑turn memory system that allows LLMs to use event summaries as semantic anchors to predict which related turns are worth reading. This allows the model to inspect high‑level event summaries first and then focus on a smaller set of potentially useful turns, providing a concise and reliable evidence set through reasoning, while avoiding the retrieval overhead that would be excessively high compared to vector retrieval. On the LoCoMo10 benchmark, HiGMem achieves the best F1 on four of five question categories and improves adversarial F1 from 0.54 to 0.78 over A‑Mem, while retrieving an order of magnitude fewer turns. Code is publicly available at https://github.com/ZeroLoss‑Lab/HiGMem.
Authors:Jiayi Wu, Ruobing Xie, Zeqian Huang, Lei Jiang, Can Xu, Kangyang Luo, Bochen Lin, Ming Gao, Xiang Li
Abstract:
Search agents achieve strong question‑answering performance through multi‑turn interactions with search engines, with Group Relative Policy Optimization (GRPO) being a widely used training algorithm. However, GRPO‑style algorithms still face several challenges in multi‑hop search settings. First, correct intermediate steps are often penalized when the final answer is wrong. Second, training is highly unstable, often causing degradation of natural language ability or even catastrophic training collapse. Our analysis attributes these issues to coarse‑grained advantage assignment and an imbalance between positive and negative advantages. To address these problems, we propose CalibAdv, an advantage calibration method specifically designed for search agents that enables more accurate and more stable modeling of penalties and rewards. Specifically, CalibAdv leverages the correctness of intermediate steps to downscale excessive negative advantages at a fine‑grained level. It then further rebalances positive and negative advantages to improve training stability. Importantly, CalibAdv adopts a lightweight design that calibrates advantages from standard rollout signals, making it simple and easy to deploy. Extensive experiments across three models and seven benchmarks demonstrate that CalibAdv improves both model performance and training stability. Our code is available at https://github.com/wujwyi/CalibAdv.
Authors:Yiheng Li, Weihai Lu, Hanyi Yu, Yue Wang
Abstract:
In recent years, multimodal multidomain fake news detection has garnered increasing attention. Nevertheless, this direction presents two significant challenges: (1) Failure to Capture Cross‑Instance Narrative Consistency: existing models usually evaluate each news in isolation, fail to capture cross‑instance narrative consistency, and thus struggle to address the spread of cluster based fake news driven by social media; (2) Lack of Domain Specific Knowledge for Reasoning: conventional models, which rely solely on knowledge encoded in their parameters during training, struggle to generalize to new or data‑scarce domains (e.g., emerging events or niche topics). To tackle these challenges, we introduce Retrieval‑Augmented Multimodal Model for Fake News Detection (RAMM). First, RAMM employs a Multimodal Large Language Model (MLLM) as its backbone to capture cross‑modal semantic information from news samples. Second, RAMM incorporates an Abstract Narrative Alignment Module. This component adaptively extracts abstract narrative consistency from diverse instances across distinct domains, aggregates relevant knowledge, and thereby enables the modeling of high‑level narrative information. Finally, RAMM introduces a Semantic Representation Alignment Module, which aligns the model's decision‑making paradigm with that of humans ‑ specifically, it shifts the model's reasoning process from direct inference on multimodal features to an instance‑based analogical reasoning process. Extensive experimental results on three public datasets validate the efficacy of our proposed approach. Our code is available at the following link: https://github.com/li‑yiheng/RAMM
Authors:Nuo Chen, Yicheng Tong, Yuzhe Yang, Yufei He, Xueyi Zhang, Qingyun Zou, Qian Wang, Bingsheng He
Abstract:
Multi‑agent systems (MAS) are increasingly used for open‑ended idea generation, driven by the expectation that collective interaction will broaden the exploration diversity. However, when and why such collaboration truly expands the solution space remains unclear. We present a systematic empirical study of diversity in MAS‑based ideation across three bottom‑up levels: model intelligence, agent cognition, and system dynamics. At the model level, we identify a compute efficiency paradox, where stronger, highly aligned models yield diminishing marginal diversity despite higher per‑sample quality. At the cognition level, authority‑driven dynamics suppress semantic diversity compared to junior‑dominated groups. At the system level, group‑size scaling yields diminishing returns and dense communication topologies accelerate premature convergence. We characterize these outcomes as collective failures emerging from structural coupling, a process where interaction inadvertently contracts agent exploration and triggers diversity collapse. Our analysis shows that this collapse arises primarily from the interaction structure rather than inherent model insufficiency, highlighting the importance of preserving independence and disagreement when designing MAS for creative tasks. Our code is available at https://github.com/Xtra‑Computing/MAS_Diversity.
Authors:Haokun Lin, Xinle Jia, Haobo Xu, Bingchen Yao, Xianglong Guo, Yichen Wu, Zhichao Lu, Ying Wei, Qingfu Zhang, Zhenan Sun
Abstract:
The MXFP4 microscaling format, which partitions tensors into blocks of 32 elements sharing an E8M0 scaling factor, has emerged as a promising substrate for efficient LLM inference, backed by native hardware support on NVIDIA Blackwell Tensor Cores. However, activation outliers pose a unique challenge under this format: a single outlier inflates the shared block scale, compressing the effective dynamic range of the remaining elements and causing significant quantization error. Existing rotation‑based remedies, including randomized Hadamard and learnable rotations, are data‑agnostic and therefore unable to specifically target the channels where outliers concentrate. We propose DuQuant++, which adapts the outlier‑aware fine‑grained rotation of DuQuant to the MXFP4 format by aligning the rotation block size with the microscaling group size (B=32). Because each MXFP4 group possesses an independent scaling factor, the cross‑block variance issue that necessitates dual rotations and a zigzag permutation in the original DuQuant becomes irrelevant, enabling DuQuant++ to replace the entire pipeline with a single outlier‑aware rotation, which halves the online rotation cost while simultaneously smoothing the weight distribution. Extensive experiments on the LLaMA‑3 family under MXFP4 W4A4 quantization show that DuQuant++ consistently achieves state‑of‑the‑art performance. Our code is available at https://github.com/Hsu1023/DuQuant‑v2.
Authors:Hanhua Hong, Yizhi LI, Jiaoyan Chen, Sophia Ananiadou, Xiaoli Li, Jung-jae Kim, Chenghua Lin
Abstract:
Recent advances in large language models have highlighted their potential to automate computational research, particularly reproducing experimental results. However, existing approaches still use fixed sequential agent pipelines with weak global coordination, which limits their robustness and overall performance. In this work, we propose Hierarchical Research Agent System (HiRAS), a hierarchical multi‑agent framework for end‑to‑end experiment reproduction that employs supervisory manager agents to coordinate specialised agents across fine‑grained stages. We also identify limitations in the reference‑free evaluation of the Paper2Code benchmark and introduce Paper2Code‑Extra (P2C‑Ex), a refined protocol that incorporates repository‑level information and better aligns with the original reference‑based metric. We conduct extensive evaluation, validating the effectiveness and robustness of our proposed methods, and observing improvements, including >10% relative performance gain beyond the previous state‑of‑the‑art using open‑source backbone models and significantly reduced hallucination in evaluation. Our work is available on GitHub: https://github.com/KOU‑199024/HiRAS.
Authors:Harshavardhanan Deekeswar
Abstract:
Serialization formats designed for document interchange impose structural overhead that becomes prohibitive when large language models consume operational data at scale. A modest dataset of 1,000 IoT sensor readings serialized as JSON requires approximately 80,000 tokens ‑ the majority spent on repeated field names, nested braces, and structural punctuation rather than semantic content. We present ONTO (Object Notation for Token Optimization), a columnar notation that declares field names once per entity and arranges values in pipe‑delimited rows with indentation‑based hierarchy. This schema‑once, data‑many design eliminates per‑record key repetition while preserving human readability and nested structure support. Evaluation across three synthetic operational datasets demonstrates 46‑51% token reduction versus JSON, with stable scaling from 100 to 1,000 records. Controlled inference benchmarks on Qwen2.5‑7B show corresponding 5‑10% latency improvement. Comprehension validation confirms no material degradation in LLM task accuracy across lookup, counting, extraction, and aggregation operations when format context is provided. Ablation analysis reveals that key repetition accounts for the majority of JSON overhead, with indentation costs in nested structures explaining the 4‑percentage‑point gap between flat and hierarchical data. ONTO occupies a previously unfilled position in the serialization landscape: columnar efficiency with hierarchical structure, optimized for LLM context windows rather than document interchange. Code and specification are available at https://github.com/harsh‑aranga/onto.
Authors:Szu-Chi Chen, I-Ning Tsai, Yi-Cheng Lin, Sung-Feng Huang, Hung-yi Lee
Abstract:
Recent Speech‑to‑Speech Translation (S2ST) systems achieve strong semantic accuracy yet consistently strip away non‑verbal vocalizations (NVs), such as laughter and crying that convey pragmatic intent, which severely limits real‑world utility. We address this via three contributions. First, we propose a synthesis pipeline for building scalable expressive datasets to overcome the data scarcity limitation. Second, we propose MoVE, a Mixture‑of‑LoRA‑Experts architecture with expressive‑specialized adapters and a soft‑weighting router that blends experts for capturing hybrid expressive states. Third, we show pretrained AudioLLMs enable striking data efficiency: 30 minutes of curated data is enough for strong performance. On English‑Chinese S2ST, while comparing with strong baselines, MoVE reproduces target NVs in 76% of cases and achieves the highest human‑rated naturalness and emotional fidelity among all compared systems, where existing S2ST systems preserve at most 14% of NVs.
Authors:Zhanyu Shen, Sijie Cheng, Zhicheng Guo, Weiqin Wang, Yile Wang, Hui Huang
Abstract:
While large language models have achieved remarkable performance in complex tasks, they still need a memory system to utilize historical experience in long‑term interactions. Existing memory methods (e.g., A‑Mem, Mem0) place excessive emphasis on organizing interactions by frequently rewriting them, however, this heavy reliance on summarization risks diluting essential contextual nuances and obscuring key retrieval features. To bridge this gap, we introduce AnchorMem, a novel memory framework inspired by the Proust Phenomenon in cognitive science, where a specific anchor triggers a holistic recollection. We propose a method that decouples the retrieval unit from the generation context. AnchorMem extracts atomic facts from interaction history to serve as retrieval anchors, while preserving the original context as the immutable context. To reveal implicit narrative cues, we construct an associative event graph that uses higher‑order event links that bind sets of related facts into shared event representations, strengthening cross‑memory integration without relying on generic entities as bridges. During retrieval, the system anchors queries to specific facts and events to locate relevant memories, but reconstructs the context using the associated raw chunks and events. Our method reconciles fine‑grained retrieval with the contextual integrity of interactions. Experiments across three closed‑source and open‑source models on the LoCoMo benchmark demonstrate that AnchorMem significantly outperforms baselines. Code is available at https://github.com/RayNeo‑AI‑2025/AnchorMem.
Authors:Meng Zhang, Jinzhong Ning, Xiaolong Wu, Hongfei Lin, Yijia Zhang
Abstract:
Grounded Multimodal Named Entity Recognition (GMNER) aims to jointly identify named entity mentions in text, predict their semantic types, and ground each entity to a corresponding visual region in an associated image. Existing approaches predominantly adopt pipeline‑based architectures that decouple textual entity recognition and visual grounding, leading to error accumulation and suboptimal joint optimization. In this paper, we propose E2E‑GMNER, a fully end‑to‑end generative framework that unifies entity recognition, semantic typing, visual grounding, and implicit knowledge reasoning within a single multimodal large language model. We formulate GMNER as an instruction‑tuned conditional generation task and incorporate chain‑of‑thought reasoning to enable the model to adaptively determine when visual evidence or background knowledge is informative, reducing reliance on noisy cues. To further address the instability of generative bounding box prediction, we introduce Gaussian Risk‑Aware Box Perturbation (GRBP), which replaces hard box supervision with probabilistically perturbed soft targets to improve robustness against annotation noise and discretization errors. Extensive experiments on the Twitter‑GMNER and Twitter‑FMNERG benchmarks demonstrate that E2E‑GMNER achieves highly competitive performance compared with state of the art methods, validating the effectiveness of unified end‑to‑end optimization and noise‑aware grounding supervision. Code is available at:https://github.com/Finch‑coder/E2E‑GMNER
Authors:Hanlin Wang, Chak Tou Leong, Jian Wang, Wenjie Li
Abstract:
Recent advancements in large language models (LLMs) have enabled agents to tackle complex embodied tasks through environmental interaction. However, these agents still make suboptimal decisions and perform ineffective actions, as they often overlook critical environmental feedback that differs from their internal beliefs. Through a formal probing analysis, we characterize this as belief inertia, a phenomenon where agents stubbornly adhere to prior beliefs despite explicit observations. To address this, we advocate active belief intervention, moving from passive understanding to active management. We introduce the Estimate‑Verify‑Update (EVU) mechanism, which empowers agents to predict expected outcomes, verify them against observations through explicit reasoning, and actively update prior beliefs based on the verification evidence. EVU is designed as a unified intervention mechanism that generates textual belief states explicitly, and can be integrated into both prompting‑based and training‑based agent reasoning methods. Extensive experiments across three embodied benchmarks demonstrate that EVU consistently yields substantial gains in task success rates. Further analyses validate that our approach effectively mitigates belief inertia, advancing the development of more robust embodied agents. Our code is available at https://github.com/WangHanLinHenry/EVU.
Authors:Priya Gurjar, Md Farhan Ishmam, Kenneth Marino
Abstract:
Despite the rapid progress, LLMs for sequential decision‑making (i.e., LLM agents) still struggle to produce diverse outputs. This leads to insufficient exploration, convergence to sub‑optimal solutions, and becoming stuck in loops. Such limitations can be problematic in environments that require active exploration to gather information and make decisions. Sampling methods such as temperature scaling introduce token‑level randomness but fail to produce enough diversity at the sequence level. We analyze LLM exploration in the classic Multi‑Armed Bandit (MAB) setting and the Text Adventure Learning Environment Suite (TALES). We find that current decoding strategies and prompting methods like Chain‑of‑Thought and Tree‑of‑Thought are insufficient for robust exploration. To address this, we introduce DORA Explorer (Diversity‑Oriented Ranking of Actions), a training‑free framework for improving exploration in LLM agents. DORA generates diverse action candidates, scores them using token log‑probabilities, and selects actions using a tunable exploration parameter. DORA achieves UCB‑competitive performance on MAB and consistent gains across TALES, e.g., improving Qwen2.5‑7B's performance from 29.2% to 45.5% in TextWorld. Our project is available at: https://dora‑explore.github.io/.
Authors:Hangxiao Zhu, Yuyu Zhang, Ping Nie, Yu Zhang
Abstract:
The rapid growth of scientific literature calls for automated methods to assess and predict research impact. Prior work has largely focused on citation‑based metrics, leaving limited evaluation of models' capability to reason about other impact dimensions. To this end, we introduce SciImpact, a large‑scale, multi‑dimensional benchmark for scientific impact prediction spanning 19 fields. SciImpact captures various forms of scientific influence, ranging from citation counts to award recognition, media attention, patent reference, and artifact adoption, by integrating heterogeneous data sources and targeted web crawling. It comprises 215,928 contrastive paper pairs reflecting meaningful impact differences in both short‑term (e.g., Best Paper Award) and long‑term settings (e.g., Nobel Prize). We evaluate 11 widely used large language models (LLMs) on SciImpact. Results show that off‑the‑shelf models exhibit substantial variability across dimensions and fields, while multi‑task supervised fine‑tuning consistently enables smaller LLMs (e.g., 4B) to markedly outperform much larger models (e.g., 30B) and surpass powerful closed‑source LLMs (e.g., o4‑mini). These results establish SciImpact as a challenging benchmark and demonstrate its value for multi‑dimensional, multi‑field scientific impact prediction. Our project homepage is https://flypig23.github.io/sciimpact‑homepage/
Authors:Yupeng Qi, Ziyu Lyu, Lixin Cui, Lu Bai, Feng Xia
Abstract:
Safety‑aligned large language models (LLMs) often generate refusal responses to harmless queries due to the over‑refusal problem. However, existing methods for mitigating over‑refusal cannot maintain a low refusal ratio for harmless queries while keeping a high refusal ratio for malicious ones. In this paper, we analyze how system prompts with varying safety levels affect LLM refusal behaviors when facing over‑refusal queries. A key observation is that, when LLMs suffer from the over‑refusal issue, non‑refusal tokens remain present in the next‑token candidate list, but the model systematically fails to select them, despite the generation of refusal tokens. Based on this observation, we propose a training‑free and model‑agnostic approach, Adaptive Contrastive Decoding (AdaCD), to mitigate over‑refusal while maintaining LLM safety. First, AdaCD compares the output distributions of the LLM with or without an extreme safety system prompt to refine the refusal token distribution. Second, we introduce an adaptive contrastive decoding strategy that dynamically incorporates or removes the refusal token distribution, adaptively boosting the probability of selecting refusal or non‑refusal tokens. Experimental results on five benchmark datasets show that, on average, AdaCD reduces the refusal ratio for over‑refusal queries by 10.35%, yet still increases the refusal ratio for malicious queries by 0.13%. Code is available at https://github.com/OutdoorManofML/AdaCD.
Authors:Jiaqing Liang, Jinyi Han, Weijia Li, Xinyi Wang, Zhoujia Zhang, Zishang Jiang, Ying Liao, Tingyun Li, Ying Huang, Hao Shen, Hanyu Wu, Fang Guo, Keyi Wang, Zhonghua Hong, Zhiyu Lu, Lipeng Ma, Sihang Jiang, Yanghua Xiao
Abstract:
Long‑horizon large language model (LLM) agents are fundamentally limited by context. As interactions become longer, tool descriptions, retrieved memories, and raw environmental feedback accumulate and push out the information needed for decision‑making. At the same time, useful experience gained from tasks is often lost across episodes. We argue that long‑horizon performance is determined not by context length, but by how much decision‑relevant information is maintained within a finite context budget. We present GenericAgent (GA), a general‑purpose, self‑evolving LLM agent system built around a single principle: context information density maximization. GA implements this through four closely connected components: a minimal atomic tool set that keeps the interface simple, a hierarchical on‑demand memory that only shows a small high‑level view by default, a self‑evolution mechanism that turns verified past trajectories into reusable SOPs and executable code, and a context truncation and compression layer that maintains information density during long executions. Across task completion, tool use efficiency, memory effectiveness, self‑evolution, and web browsing, GA consistently outperforms leading agent systems while using significantly fewer tokens and interactions, and it continues to evolve over time. Project: https://github.com/lsdefine/GenericAgent
Authors:Antonio De Santis, Tommaso Bonetti, Andrea Tocchetti, Marco Brambilla
Abstract:
The interpretation of implicit meanings is an integral aspect of human communication. However, this framework may not transfer to interactions with Large Language Models (LLMs). To investigate this, we introduce the task of Implicit Information Extraction (IIE) and propose an LLM‑based IIE pipeline that builds a structured knowledge graph from a context sentence by extracting relational triplets, validating implicit inferences, and analyzing temporal relations. We evaluate two LLMs against crowdsourced human judgments on two datasets. We find that humans agree with most model triplets yet consistently propose many additions, indicating limited coverage in current LLM‑based IIE. Moreover, in our experiments, models appear to be more conservative about implicit inferences than humans in socially rich contexts, whereas humans become more conservative in shorter, fact‑oriented contexts. Our code is available at https://github.com/Antonio‑Dee/IIE_from_LLM.
Authors:Weiyu Ma, Yongcheng Zeng, Yan Song, Xinyu Cui, Jian Zhao, Xuhui Liu, Mohamed Elhoseiny
Abstract:
Reinforcement Learning (RL) has achieved impressive success in post‑training Large Language Models (LLMs) and Vision‑Language Models (VLMs), with on‑policy algorithms such as PPO, GRPO, and REINFORCE++ serving as the dominant paradigm. However, these methods discard all collected trajectories after a single gradient update, resulting in poor sample efficiency, particularly wasteful for agentic tasks where multi‑turn environment interactions are expensive. While Experience Replay drives sample efficiency in classic RL by allowing agents to reuse past trajectories and prioritize informative ones, directly applying Prioritized Experience Replay (PER) to LLMs fails. The rapid policy evolution of billion‑parameter models renders stored priorities stale, causing old high‑priority trajectories to dominate sampling long after they have become uninformative. We propose Freshness‑Aware PER, which addresses this priority staleness problem by augmenting any PER‑based priority with a multiplicative exponential age decay grounded in effective sample size analysis. To the best of our knowledge, Freshness‑Aware PER is the first work to successfully apply PER to LLM/VLM reinforcement learning. We evaluate on eight multi‑step agentic, reasoning, and math competition tasks with 0.5B, 3B, and 7B models. Freshness‑Aware PER significantly outperforms on‑policy baselines, achieving +46% on NQ Search, +367% on Sokoban, and +133% on VLM FrozenLake, while standard PER without age decay consistently degrades performance. Our code is publicly available at https://github.com/Vision‑CAIR/Freshness‑Aware‑PER.
Authors:Syed Muhammad Aqdas Rizvi
Abstract:
Decentralized Autonomous Organizations (DAOs) are inclined explore Small Language Models (SLMs) as edge‑native constitutional firewalls to vet proposals and mitigate semantic social engineering. While scaling inference‑time compute (System 2) enhances formal logic, its efficacy in highly adversarial, cryptoeconomic governance environments remains underexplored. To address this, we introduce Sentinel‑Bench, an 840‑inference empirical framework executing a strict intra‑model ablation on Qwen‑3.5‑9B. By toggling latent reasoning across frozen weights, we isolate the impact of inference‑time compute against an adversarial Optimism DAO dataset. Our findings reveal a severe compute‑accuracy inversion. The autoregressive baseline (System 1) achieved 100% adversarial robustness, 100% juridical consistency, and state finality in under 13 seconds. Conversely, System 2 reasoning introduced catastrophic instability, fundamentally driven by a 26.7% Reasoning Non‑Convergence (cognitive collapse) rate. This collapse degraded trial‑to‑trial consensus stability to 72.6% and imposed a 17x latency overhead, introducing critical vulnerabilities to Governance Extractable Value (GEV) and hardware centralization. While rare (1.5% of adversarial trials), we empirically captured "Reasoning‑Induced Sycophancy," where the model generated significantly longer internal monologues (averaging 25,750 characters) to rationalize failing the adversarial trap. We conclude that for edge‑native SLMs operating under Byzantine Fault Tolerance (BFT) constraints, System 1 parameterized intuition is structurally and economically superior to System 2 iterative deliberation for decentralized consensus. Code and Dataset: https://github.com/smarizvi110/sentinel‑bench
Authors:Jinchang Zhu, Jindong Li, Cheng Zhang, Jiahong Liu, Menglin Yang
Abstract:
Long‑term memory is a critical challenge for Large Language Model agents, as fixed context windows cannot preserve coherence across extended interactions. Existing memory systems represent conversation history as unstructured embedding vectors, retrieving information through semantic similarity. This paradigm fails to capture the associative structure of human memory, wherein related experiences progressively strengthen interconnections through repeated co‑activation. Inspired by cognitive neuroscience, we identify three mechanisms central to biological memory: association, consolidation, and spreading activation, which remain largely absent in current research. To bridge this gap, we propose HeLa‑Mem, a bio‑inspired memory architecture that models memory as a dynamic graph with Hebbian learning dynamics. HeLa‑Mem employs a dual‑level organization: (1) an episodic memory graph that evolves through co‑activation patterns, and (2) a semantic memory store populated via Hebbian Distillation, wherein a Reflective Agent identifies densely connected memory hubs and distills them into structured, reusable semantic knowledge. This dual‑path design leverages both semantic similarity and learned associations, mirroring the episodic‑semantic distinction in human cognition. Experiments on LoCoMo demonstrate superior performance across four question categories while using significantly fewer context tokens. Code is available on GitHub: https://github.com/ReinerBRO/HeLa‑Mem
Authors:Jianyou Wang, Youze Zheng, Longtian Bao, Hanyuan Zhang, Qirui Zheng, Yuhan Chen, Yang Zhang, Matthew Feng, Maxim Khan, Aditya K. Sehgal, Christopher D. Rosin, Ramamohan Paturi, Umber Dube, Leon Bergen
Abstract:
Scientists have long sought to accurately predict outcomes of real‑world events before they happen. Can AI systems do so more reliably? We study this question through clinical trial outcome prediction, a high‑stakes open challenge even for domain experts. We introduce CT Open, an open‑access, live platform that will run four challenge every year. Anyone can submit predictions for each challenge. CT Open evaluates those submissions on trials whose outcomes were not yet public at the time of submission but were made public afterwards. Determining if a trial's outcome is public on the internet before a certain date is surprisingly difficult. Outcomes posted on official registries may lag behind by years, while the first mention may appear in obscure articles. To address this, we propose a novel, fully automated decontamination pipeline that uses iterative LLM‑powered web search to identify the earliest mention of trial outcomes. We validate the pipeline's quality and accuracy by human expert's annotations. Since CT Open's pipeline ensures that every evaluated trial had no publicly reported outcome when the prediction was made, it allows participants to use any methodology and any data source. In this paper, we release a training set and two time‑stamped test benchmarks, Winter 2025 and Summer 2025. We believe CT Open can serve as a central hub for advancing AI research on forecasting real‑world outcomes before they occur, while also informing biomedical research and improving clinical trial design. CT Open Platform is hosted at \hrefhttps://ct‑open.net/https://ct‑open.net/
Authors:Bhaskar Gurram
Abstract:
Automated evaluation of tool‑using large language model (LLM) agents is widely assumed to be reliable, but this assumption has rarely been validated against human annotation. We introduce AgentProp‑Bench, a 2,000‑task benchmark with 2,300 traces across four domains, nine production LLMs, and a 100‑label human‑validated subset. We quantify judge reliability, characterize error propagation, and evaluate a runtime mitigation. Substring‑based judging agrees with human annotation at kappa=0.049 (chance‑level); a three‑LLM ensemble reaches kappa=0.432 (moderate) with a conservative bias. Under validated evaluation, a parameter‑level injection propagates to a wrong final answer with human‑calibrated probability approximately 0.62 (range 0.46‑0.73 across models). Rejection (catching bad parameters) and recovery (correcting after acceptance) are independent model capabilities (Spearman rho=0.126, p=0.747). A tuned runtime interceptor reduces hallucination on GPT‑4o‑mini by 23.0 percentage points under a concurrent n=600 control, but shows no significant effect on Gemini‑2.0‑Flash, whose aggressive parameter rejection eliminates the target failure mode. All code, data, traces, and human labels are released at https://github.com/bhaskargurram‑ai/agenthallu‑bench.
Authors:Anik Saha, Mst. Fahmida Sultana Naznin, Zia Ul Hassan Abdullah, Anisa Binte Asad, K. G. Subarno Bithi, A. B. M. Alim Al Islam
Abstract:
Urgent blood donation seeking posts and messages on social media often go unnoticed due to the overwhelming volume of daily communications. Traditional app‑based systems, reliant on manual input, struggle to reach users in low‑resource settings, delaying critical responses. To address this, we introduce the Cognitive Blood Request System (CBRS), a multi‑platform framework that efficiently filters and parses blood donation requests from social media streams using a cost‑efficient dual‑layered architecture. To do so, we curate a novel dataset of 11K parsed blood donation request messages in Bengali, English, and transliterated Bengali, capturing the linguistic diversity of real social media communications. The inclusion of adversarial negatives further enhances the robustness of our model. CBRS achieves an impressive 99% accuracy and precision in filtering, surpassing benchmark methods. In the parsing task, our LoRA finetuned Llama‑3.2‑3B model achieves 92% zero‑shot accuracy, surpassing the base model by 41.54% and exceeding the few‑shot performance of GPT‑4o‑mini, Gemini‑2.0‑Flash, and other LLMs, while resulting in a 35X reduction in input token usage. This work lays a robust foundation for scalable, inclusive information extraction in time‑sensitive, object‑focused tasks. Our code, dataset, and trained models are publicly available at [https://github.com/aaniksahaa/CBRS](https://github.com/aaniksahaa/CBRS).
Authors:Weihua Du, Jingming Zhuo, Yixin Dong, Andre Wang He, Weiwei Sun, Zeyu Zheng, Manupa Karunaratne, Ivan Fox, Tim Dettmers, Tianqi Chen, Yiming Yang, Sean Welleck
Abstract:
Recent large language model (LLM) agents have shown promise in using execution feedback for test‑time adaptation. However, robust self‑improvement remains far from solved: most approaches still treat each problem instance independently, without accumulating reusable knowledge. This limitation is particularly pronounced in domain‑specific languages such as Triton, which are underrepresented in LLM pretraining data. Their strict constraints and non‑linear optimization landscape further make naive generation and local refinement unreliable. We propose AdaExplore, an agent framework that enables self‑improvement via accumulated execution feedback for performance‑critical kernel code generation through two complementary stages: failure‑driven adaptation and diversity‑preserving search, jointly improving correctness and optimization performance without additional fine‑tuning or external knowledge. In the adaptation stage, the agent synthesizes tasks and converts recurring failures into a reusable memory of validity rules, helping subsequent generations remain within the feasible set. In the search stage, the agent organizes candidate kernels as a tree and alternates between small local refinements and larger structural regeneration, allowing it to explore the optimization landscape beyond local optima. Experiments on kernel runtime optimization benchmarks validate these gains: AdaExplore achieves 3.12x and 1.72x speedups on KernelBench Level‑2 and Level‑3, respectively, within 100 steps, and continues to improve with additional computation.
Authors:Yang Liu, Hongming Li, Melissa Xiaohui Qin, Qiankun Liu, Chao Huang
Abstract:
We present SemanticQA, an evaluation suite designed to assess language models (LMs) in semantic phrase processing tasks. The benchmark consolidates existing multiword expression (MwE) resources and reorganizes them into a unified testbed. It covers both general lexical phenomena, such as lexical collocations, and three fine‑grained categories: idiomatic expressions, noun compounds, and verbal constructions. Through SemanticQA, we assess LMs of diverse architectures and scales in extraction, classification, and interpretation tasks, as well as sequential task compositions. We reveal substantial performance variation, particularly on tasks requiring semantic reasoning, highlighting differences in reasoning efficacy and semantic understanding of LMs, providing insights for pushing LMs with stronger comprehension on non‑trivial semantic phrases. The evaluation harness and data of SemanticQA are available at https://github.com/jacklanda/SemanticQA.
Authors:Yongkang Li, Panagiotis Eustratiadis, Yixing Fan, Evangelos Kanoulas
Abstract:
Decoder‑only large language models (LLMs) are increasingly replacing BERT‑style architectures as the backbone for dense retrieval, achieving substantial performance gains and broad adoption. However, the robustness of these LLM‑based retrievers remains underexplored. In this paper, we present the first systematic study of the robustness of state‑of‑the‑art open‑source LLM‑based dense retrievers from two complementary perspectives: generalizability and stability. For generalizability, we evaluate retrieval effectiveness across four benchmarks spanning 30 datasets, using linear mixed‑effects models to estimate marginal mean performance and disentangle intrinsic model capability from dataset heterogeneity. Our analysis reveals that while instruction‑tuned models generally excel, those optimized for complex reasoning often suffer a ``specialization tax,'' exhibiting limited generalizability in broader contexts. For stability, we assess model resilience against both unintentional query variations~(e.g., paraphrasing, typos) and malicious adversarial attacks~(e.g., corpus poisoning). We find that LLM‑based retrievers show improved robustness against typos and corpus poisoning compared to encoder‑only baselines, yet remain vulnerable to semantic perturbations like synonymizing. Further analysis shows that embedding geometry (e.g., angular uniformity) provides predictive signals for lexical stability and suggests that scaling model size generally improves robustness. These findings inform future robustness‑aware retriever design and principled benchmarking. Our code is publicly available at https://github.com/liyongkang123/Robust_LLM_Retriever_Eval.
Authors:Muhammad Adeel Ijaz
Abstract:
We present SQL Query Engine, an open‑source, self‑hosted service that translates natural language questions into validated PostgreSQL queries through a two‑stage LLM pipeline. The first stage performs automatic schema introspection and SQL generation; a multi‑strategy response parser extracts SQL from any LLM output format (JSON, code blocks, or raw text) without requiring structured output APIs. The second stage executes the query against PostgreSQL and, upon failure or empty results, enters an iterative self‑healing loop in which the LLM diagnoses the error using full SQLSTATE codes and PostgreSQL diagnostic messages. Two mechanisms prevent regressions: early‑accept returns successful queries immediately without LLM re‑evaluation, and best‑result tracking preserves the best partial result across retries. Schema context is cached per session in Redis, progress events stream via Redis Pub/Sub and SSE, and an OpenAI‑compatible /v1/chat/completions endpoint lets existing tools work without modification. All database connections are read‑only at the driver level. We evaluate across five LLM backends on a synthetic benchmark (75 questions, three databases) where the self‑healing loop yields up to +9.3pp accuracy gains with zero regressions on the best model (Llama 4 Scout 17B, 57.3%), and on BIRD (437 questions, 11 databases migrated from SQLite to PostgreSQL) where the full pipeline reaches 49.0% execution accuracy (GPT‑OSS‑120B, +4.6pp). Source code: https://github.com/codeadeel/sqlqueryengine.
Authors:Zonghai Yao, Benlu Wang, Yifan Zhang, Junda Wang, Iris Xia, Zhipeng Tang, Shuo Han, Feiyun Ouyang, Zhichao Yang, Arman Cohan, Hong Yu
Abstract:
Large language models perform well on many medical QA benchmarks, but real clinical reasoning often requires integrating evidence across multiple images rather than interpreting a single view. We introduce MedThinkVQA, an expert‑annotated benchmark for thinking with multiple images, where models must interpret each image, combine cross‑view evidence, and answer diagnostic questions with intermediate supervision and step‑level evaluation. The dataset contains 8,067 cases, including 720 test cases, with an average of 6.62 images per case, substantially denser than prior work, whose expert‑level benchmarks use at most 1.43 images per case. On the test set, the best closed‑source models, Claude‑4.6‑Opus, Gemini‑3‑Pro, and GPT‑5.2‑xhigh, reach only 57.2%, 55.3%, and 54.9% accuracy, while GPT‑5‑mini and GPT‑5‑nano reach 39.7% and 30.8%. Strong open‑source models lag behind, led by Qwen3.5‑397B‑A17B at 52.2% and Qwen3.5‑27B at 50.6%. Further analysis identifies grounded multi‑image reasoning as the main bottleneck: models often fail to extract, align, and compose evidence across views before higher‑level inference can help. Providing expert single‑image cues and cross‑image summaries improves performance, whereas replacing them with self‑generated intermediates reduces accuracy. Step‑level analysis shows that over 70% of errors arise from image reading and cross‑view integration. Scaling results further show that additional inference‑time computation helps only when visual grounding is already reliable; when early evidence extraction is weak, longer reasoning yields limited or unstable gains and can amplify misread cues. These results suggest that the key challenge is not reasoning length alone, but reliable mechanisms for grounding, aligning, and composing distributed evidence across real‑world multimodal clinical inputs.
Authors:Vedant Jawandhia, Yash Sinha, Murari Mandal, Ankan Pal, Dhruv Kumar
Abstract:
Large language models (LLMs) are increasingly evaluated on mathematical reasoning, yet their robustness to equivalent problem representations remains poorly understood. In geometry, identical problems can be expressed in Euclidean, coordinate, or vector forms, but existing benchmarks report accuracy on fixed formats, implicitly assuming representation invariance and masking failures caused by representational changes alone. We propose GeoRepEval, a representation‑aware evaluation framework that measures correctness, invariance, and consistency at the problem level across parallel formulations, combining strict answer matching, bootstrap confidence intervals, paired McNemar tests, representation‑flip analyses, and regression controls for surface complexity. We prove that our Invariance@3 metric decomposes accuracy into robust and fragile components and is bounded by the weakest representation. Evaluating eleven LLMs on 158 curated high‑school geometry problems (474 instances), we find accuracy gaps of up to 14 percentage points induced solely by representation choice. Vector formulations emerge as a consistent failure point, with Invariance@3 as low as 0.044 even after controlling for length and symbolic complexity. A convert‑then‑solve prompting intervention improves vector accuracy by up to 52 percentage points for high‑capacity models, suggesting that failures reflect representation sensitivity rather than inability; however, low‑capacity models show no gains, indicating deeper limitations. These results suggest that current models rely on representation‑specific heuristics rather than abstract geometric reasoning. All datasets, prompts, and scripts are released at https://github.com/vedjaw/GeoRepEval.
Authors:Haolong Hu, Hanyu Li, Tiancheng He, Huahui Yi, An Zhang, Qiankun Li, Kun Wang, Yang Liu, Zhigang Zeng
Abstract:
MLLMs are increasingly deployed in multi‑turn settings, where attackers can escalate unsafe intent through the evolving visual‑text history and exploit long‑context safety decay. Yet safety alignment is still dominated by single‑turn data and fixed‑template dialogues, leaving a mismatch between training and deployment.To bridge this gap, we propose SaFeR‑Steer, a progressive multi‑turn alignment framework that combines staged synthetic bootstrapping with tutor‑in‑the‑loop GRPO to train a single student under adaptive, on‑policy attacks. We also introduce TCSR, which uses trajectory minimum/average safety to propagate late‑turn failures to earlier turns.I. Dataset. We release STEER, a multi‑turn multimodal safety dataset with STEER‑SFT (12,934), STEER‑RL (2,000), and STEER‑Bench (3,227) dialogues spanning 2~10 turns.II. Experiment. Starting from Qwen2.5‑VL‑3B/7B, SaFeR‑Steer substantially improves Safety/Helpfulness on both single‑turn (48.30/45.86 ‑> 81.84/70.77 for 3B; 56.21/60.32 ‑> 87.89/77.40 for 7B) and multi‑turn benchmarks (12.55/27.13 ‑> 55.58/70.27 for 3B; 24.66/46.48 ‑> 64.89/72.35 for 7B), shifting failures to later turns and yielding robustness beyond scaling alone.Codes are available at https://github.com/Ed‑Bg/SaFeR‑Steer
Authors:Ekaterina Lemdiasova, Nikita Zmanovskii
Abstract:
Large language models (LLMs) and cross‑encoder rerankers have gained attention for improving recommender systems, particularly in cold‑start scenarios where user interaction history is limited. However, practical deployment reveals significant performance gaps between LLM‑based approaches and simple baselines. This paper presents a systematic diagnostic study of cross‑encoder rerankers in cold‑start movie recommendation using the Serendipity‑2018 dataset. Through controlled experiments with 500 users across multiple random seeds, we identify three critical failure modes: (1) low retrieval coverage in candidate generation (recall@200 = 0.109 vs. 0.609 for baselines), (2) severe exposure bias with rerankers concentrating recommendations on 3 unique items versus 497 for random baseline, and (3) minimal score discrimination between relevant and irrelevant items (mean difference = 0.098, Cohen's d = 0.13). We demonstrate that popularity‑based ranking substantially outperforms LLM reranking (HR@10: 0.268 vs. 0.008, p < 0.001), with the performance gap primarily attributable to retrieval stage limitations rather than reranker capacity. Based on these findings, we provide actionable recommendations including hybrid retrieval strategies, candidate pool size optimization, and score calibration techniques. All code, configurations, and experimental results are made available for reproducibility.
Authors:Yige Xu, Yongjie Wang, Zizhuo Wu, Kaisong Song, Jun Lin, Zhiqi Shen
Abstract:
Reasoning in vision‑language models (VLMs) has recently attracted significant attention due to its broad applicability across diverse downstream tasks. However, it remains unclear whether the superior performance of VLMs stems from genuine vision‑grounded reasoning or relies predominantly on the reasoning capabilities of their textual backbones. To systematically measure this, we introduce CrossMath, a novel multimodal reasoning benchmark designed for controlled cross‑modal comparisons. Specifically, we construct each problem in text‑only, image‑only, and image+text formats guaranteeing identical task‑relevant information, verified by human annotators. This rigorous alignment effectively isolates modality‑specific reasoning differences while eliminating confounding factors such as information mismatch. Extensive evaluation of state‑of‑the‑art VLMs reveals a consistent phenomenon: a substantial performance gap between textual and visual reasoning. Notably, VLMs excel with text‑only inputs, whereas incorporating visual data (image+text) frequently degrades performance compared to the text‑only baseline. These findings indicate that current VLMs conduct reasoning primarily in the textual space, with limited genuine reliance on visual evidence. To mitigate this limitation, we curate a CrossMath training set for VLM fine‑tuning. Empirical evaluations demonstrate that fine‑tuning on this training set significantly boosts reasoning performance across all individual and joint modalities, while yielding robust gains on two general visual reasoning tasks. Source code is available at https://github.com/xuyige/CrossMath.
Authors:Songtao Wang, Quang Hieu Pham, Fangcong Yin, Xinpeng Wang, Jocelyn Qiaochu Chen, Greg Durrett, Xi Ye
Abstract:
Reinforcement learning with verifiable rewards (RLVR) typically optimizes for outcome rewards without imposing constraints on intermediate reasoning. This leaves training susceptible to reward hacking, where models exploit loopholes (e.g., spurious patterns in training data) in the reward function to achieve high scores without solving the intended task. These reward‑hacking behaviors are often implicit, as the intermediate chain‑of‑thought (CoT) may appear plausible on the surface, limiting the effectiveness of purely text‑based monitoring. We propose Gradient Fingerprint (GRIFT), a method for detecting reward hacking using models' internal computations. Given a prompt and a model‑generated CoT, GRIFT computes gradients of the CoT conditioned on the prompt and compresses them into a compact representation, which is then used to assess whether the CoT reflects reward hacking behavior. Across verifiable reasoning benchmarks spanning math, code, and logical reasoning, GRIFT substantially outperforms strong baselines, including CoT Monitor and TRACE, achieving over 25% relative improvement in detecting reward hacking behavior. Moreover, integrating GRIFT into the rejection fine‑tuning pipeline for reasoning tasks reduces reward hacking and improves performance on the true task objective. Our results highlight a promising direction of leveraging gradient level representations for assessing the quality of CoT reasoning traces. Our code is available at: https://github.com/songtao‑x/reward_hack.
Authors:Jiaxi Bi, Tongxu Luo, Wenyu Du, Zhengyang Tang, Benyou Wang
Abstract:
Parallel reasoning enhances Large Reasoning Models (LRMs) but incurs prohibitive costs due to futile paths caused by early errors. To mitigate this, path pruning at the prefix level is essential, yet existing research remains fragmented without a standardized framework. In this work, we propose the first systematic taxonomy of path pruning, categorizing methods by their signal source (internal vs. external) and learnability (learnable vs. non‑learnable). This classification reveals the unexplored potential of learnable internal methods, motivating our proposal of STOP (Super TOken for Pruning). Extensive evaluations across LRMs ranging from 1.5B to 20B parameters demonstrate that STOP achieves superior effectiveness and efficiency compared to existing baselines. Furthermore, we rigorously validate the scalability of STOP under varying compute budgets ‑ for instance, boosting GPT‑OSS‑20B accuracy on AIME25 from 84% to nearly 90% under fixed compute budgets. Finally, we distill our findings into formalized empirical guidelines to facilitate optimal real‑world deployment. Code, data and models are available at https://bijiaxihh.github.io/STOP
Authors:Ke Xiong, Qian Wu, Wangjie Gan, Yuke Li, Xuhong Zhang
Abstract:
Few‑shot Hierarchical Text Classification (few‑shot HTC) is a challenging task that involves mapping texts to a predefined tree‑structured label hierarchy under data‑scarce conditions. While current approaches utilize structural constraints from the label hierarchy to maintain parent‑child prediction consistency, they face a critical bottleneck, the difficulty in distinguishing semantically similar sibling classes due to insufficient domain knowledge. We introduce an innovative method named Sibling Contrastive Learning with Hierarchical Knowledge‑aware Prompt Tuning for few‑shot HTC tasks (SCHK‑HTC). Our work enhances the model's perception of subtle differences between sibling classes at deeper levels, rather than just enforcing hierarchical rules. Specifically, we propose a novel framework featuring two core components: a hierarchical knowledge extraction module and a sibling contrastive learning mechanism. This design guides model to encode discriminative features at each hierarchy level, thus improving the separability of confusable classes. Our approach achieves superior performance across three benchmark datasets, surpassing existing state‑of‑the‑art methods in most cases. Our code is available at https://github.com/happywinder/SCHK‑HTC.
Authors:Masahiro Suzuki, Hiroki Sakaji
Abstract:
We introduce JFinTEB, the first comprehensive benchmark specifically designed for evaluating Japanese financial text embeddings. Existing embedding benchmarks provide limited coverage of language‑specific and domain‑specific aspects found in Japanese financial texts. Our benchmark encompasses diverse task categories including retrieval and classification tasks that reflect realistic and well‑defined financial text processing scenarios. The retrieval tasks leverage instruction‑following datasets and financial text generation queries, while classification tasks cover sentiment analysis, document categorization, and domain‑specific classification challenges derived from economic survey data. We conduct extensive evaluations across a wide range of embedding models, including Japanese‑specific models of various sizes, multilingual models, and commercial embedding services. We publicly release JFinTEB datasets and evaluation framework at https://github.com/retarfi/JFinTEB to facilitate future research and provide a standardized evaluation protocol for the Japanese financial text mining community. This work addresses a critical gap in Japanese financial text processing resources and establishes a foundation for advancing domain‑specific embedding research.
Authors:Pritesh Jha
Abstract:
We present PIIBench, a unified benchmark corpus for Personally Identifiable Information (PII) detection in natural language text. Existing resources for PII detection are fragmented across domain‑specific corpora with mutually incompatible annotation schemes, preventing systematic comparison of detection systems. We consolidate ten publicly available datasets spanning synthetic PII corpora, multilingual Named Entity Recognition (NER) benchmarks, and financial domain annotated text, yielding a corpus of 2,369,883 annotated sequences and 3.35 million entity mentions across 48 canonical PII entity types. We develop a principled normalization pipeline that maps 80+ source‑specific label variants to a standardized BIO tagging scheme, applies frequency‑based suppression of near absent entity types, and produces stratified 80/10/10 train/validation/test splits preserving source distribution. To establish baseline difficulty, we evaluate eight published systems spanning rule‑based engines (Microsoft Presidio), general purpose NER models (spaCy, BERT‑base NER, XLM‑RoBERTa NER, SpanMarker mBERT, SpanMarker BERT), a PII‑specific model (Piiranha DeBERTa), and a financial NER specialist (XtremeDistil FiNER). All systems achieve span‑level F1 below 0.14, with the best system (Presidio, F1=0.1385) still producing zero recall on most entity types. These results directly quantify the domain‑silo problem and demonstrate that PIIBench presents a substantially harder and more comprehensive evaluation challenge than any existing single source PII dataset. The dataset construction pipeline and benchmark evaluation code are publicly available at https://github.com/pritesh‑2711/pii‑bench.
Authors:Jinlun Ye, Jiang Liao, Runhe Lai, Xinhua Lu, Jiaxin Zhuang, Zhiyong Gan, Ruixuan Wang
Abstract:
Vision‑language models (VLMs) such as CLIP exhibit strong Out‑of‑distribution (OOD) detection capabilities by aligning visual and textual representations. Recent CLIP‑based test‑time adaptation methods further improve detection performance by incorporating external OOD labels. However, such labels are finite and fixed, while the real OOD semantic space is inherently open‑ended. Consequently, fixed labels fail to represent the diverse and evolving OOD semantics encountered in test streams. To address this limitation, we introduce Test‑time Textual Learning (TTL), a framework that dynamically learns OOD textual semantics from unlabeled test streams, without relying on external OOD labels. TTL updates learnable prompts using pseudo‑labeled test samples to capture emerging OOD knowledge. To suppress noise introduced by pseudo‑labels, we introduce an OOD knowledge purification strategy that selects reliable OOD samples for adaptation while suppressing noise. In addition, TTL maintains an OOD Textual Knowledge Bank that stores high‑quality textual features, providing stable score calibration across batches. Extensive experiments on two standard benchmarks with nine OOD datasets demonstrate that TTL consistently achieves state‑of‑the‑art performance, highlighting the value of textual adaptation for robust test‑time OOD detection. Our code is available at https://github.com/figec/TTL.
Authors:Ponhvoan Srey, Xiaobao Wu, Cong-Duy Nguyen, Anh Tuan Luu
Abstract:
Uncertainty estimation is a promising approach to detect hallucinations in large language models (LLMs). Recent approaches commonly depend on model internal states to estimate uncertainty. However, they suffer from strict assumptions on how hidden states should evolve across layers, and from information loss by solely focusing on last or mean tokens. To address these issues, we present Sequential Internal Variance Representation (SIVR), a supervised hallucination detection framework that leverages token‑wise, layer‑wise features derived from hidden states. SIVR adopts a more basic assumption that uncertainty manifests in the degree of dispersion or variance of internal representations across layers, rather than relying on specific assumptions, which makes the method model and task agnostic. It additionally aggregates the full sequence of per‑token variance features, learning temporal patterns indicative of factual errors and thereby preventing information loss. Experimental results demonstrate SIVR consistently outperforms strong baselines. Most importantly, SIVR enjoys stronger generalisation and avoids relying on large training sets, highlighting the potential for practical deployment. Our code repository is available online at https://github.com/ponhvoan/internal‑variance.
Authors:Jize Wang, Xuanxuan Liu, Yining Li, Songyang Zhang, Yijun Wang, Zifei Shan, Xinyi Le, Cailian Chen, Xinping Guan, Dacheng Tao
Abstract:
The development of general‑purpose agents requires a shift from executing simple instructions to completing complex, real‑world productivity workflows. However, current tool‑use benchmarks remain misaligned with real‑world requirements, relying on AI‑generated queries, dummy tools, and limited system‑level coordination. To address this, we propose GTA‑2, a hierarchical benchmark for General Tool Agents (GTA) spanning atomic tool use and open‑ended workflows. Built on real‑world authenticity, it leverages real user queries, deployed tools, and multimodal contexts. (i) GTA‑Atomic, inherited from our prior GTA benchmark, evaluates short‑horizon, closed‑ended tool‑use precision. (ii) GTA‑Workflow introduces long‑horizon, open‑ended tasks for realistic end‑to‑end completion. To evaluate open‑ended deliverables, we propose a recursive checkpoint‑based evaluation mechanism that decomposes objectives into verifiable sub‑goals, enabling unified evaluation of both model capabilities and agent execution frameworks (i.e., execution harnesses). Experiments reveal a pronounced capability cliff: while frontier models already struggle on atomic tasks (below 50%), they largely fail on workflows, with top models achieving only 14.39% success. Further analysis shows that checkpoint‑guided feedback improves performance, while advanced frameworks such as Manus and OpenClaw substantially enhance workflow completion, highlighting the importance of execution harness design beyond the underlying model capacity. These findings provide guidance for developing reliable personal and professional assistants. Dataset and code will be available at https://github.com/open‑compass/GTA.
Authors:Zijun Wang, Haoqin Tu, Weidong Zhou, Yiyang Zhou, Xiaohuan Zhou, Bingni Zhang, Weiguo Feng, Taifeng Wang, Cihang Xie, Fengze Liu
Abstract:
Everyday tasks come with a target, and pretraining models around this target is what turns them into experts. In this paper, we study target‑oriented language model (LM) pretraining by introducing Neuron‑Activated Graph Ranking (NAG‑based Ranking), a training‑free and interpretable framework for target pretraining data selection. Rather than using black‑box representations, our approach directly characterizes each target input by a sparse set of high‑impact neurons in any off‑the‑shelf LLMs. Concretely, we quantify neuron impact and select the most influential neurons across layers into a compact Neuron‑Activated Graph (NAG), and rank candidate data by NAG similarity to target examples. We conduct experiments across six benchmarks, where our NAG‑based Ranking improves target‑oriented pretraining by 4.9% on average over random sampling, and also outperforms state‑of‑the‑art baselines by 5.3% accuracy on HellaSwag. It also remains effective under a more applicable multi‑target setting, where our best setup surpasses two baselines by 1.1% and 4.1%, respectively. Furthermore, we provide a comprehensive analysis on why and how our NAG works, e.g., deactivating NAG‑selected neurons (only 0.12% of all) causes a 23.5% performance collapse, and restricting NAG to the final layer incurs a 4.1% average drop, indicating that NAG captures a sparse "functional backbone" for learning target features. We release the code at https://github.com/asillycat/NAG.
Authors:Jon-Paul Cacioli
Abstract:
We introduce a cross‑domain behavioural assay of monitoring‑control coupling in LLMs, grounded in the Nelson and Narens (1990) metacognitive framework and applying human psychometric methodology to LLM evaluation. The battery comprises 524 items across six cognitive domains (learning, metacognitive calibration, social cognition, attention, executive function, prospective regulation), each grounded in an established experimental paradigm. Tasks T1‑T5 were pre‑registered on OSF prior to data collection; T6 was added as an exploratory extension. After every forced‑choice response, dual probes adapted from Koriat and Goldsmith (1996) ask the model to KEEP or WITHDRAW its answer and to BET or decline. The critical metric is the withdraw delta: the difference in withdrawal rate between incorrect and correct items. Applied to 20 frontier LLMs (10,480 evaluations), the battery discriminates three profiles consistent with the Nelson‑Narens architecture: blanket confidence, blanket withdrawal, and selective sensitivity. Accuracy rank and metacognitive sensitivity rank are largely inverted. Retrospective monitoring and prospective regulation appear dissociable (r = .17, 95% CI wide given n=20; exemplar‑based evidence is the primary support). Scaling on metacognitive calibration is architecture‑dependent: monotonically decreasing (Qwen), monotonically increasing (GPT‑5.4), or flat (Gemma). Behavioural findings converge structurally with an independent Type‑2 SDT approach, providing preliminary cross‑method construct validity. All items, data, and code: https://github.com/synthiumjp/metacognitive‑monitoring‑battery.
Authors:Zixuan Weng, Jinghuai Zhang, Kunlin Cai, Ying Li, Peiran Wang, Yuan Tian
Abstract:
Large language models (LLMs) often exhibit undesirable behaviors, such as safety violations and hallucinations. Although inference‑time steering offers a cost‑effective way to adjust model behavior without updating its parameters, existing methods often fail to be simultaneously effective, utility‑preserving, and training‑efficient due to their rigid, one‑size‑fits‑all designs and limited adaptability. In this work, we present FineSteer, a novel steering framework that decomposes inference‑time steering into two complementary stages: conditional steering and fine‑grained vector synthesis, allowing fine‑grained control over when and how to steer internal representations. In the first stage, we introduce a Subspace‑guided Conditional Steering (SCS) mechanism that preserves model utility by avoiding unnecessary steering. In the second stage, we propose a Mixture‑of‑Steering‑Experts (MoSE) mechanism that captures the multimodal nature of desired steering behaviors and generates query‑specific steering vectors for improved effectiveness. Through tailored designs in both SCS and MoSE, FineSteer maintains robust performance on general queries while adaptively optimizing steering vectors for targeted inputs in a training‑efficient manner. Extensive experiments on safety and truthfulness benchmarks show that FineSteer outperforms state‑of‑the‑art methods in overall performance, achieving stronger steering performance with minimal utility loss. Code is available at https://github.com/YukinoAsuna/FineSteer
Authors:G. Aytug Akarlar
Abstract:
We present causal evidence that hallucination in autoregressive language models is an early trajectory commitment governed by asymmetric attractor dynamics. Using same‑prompt bifurcation, in which we repeatedly sample identical inputs to observe spontaneous divergence, we isolate trajectory dynamics from prompt‑level confounds. On Qwen2.5‑1.5B across 61 prompts spanning six categories, 27 prompts (44.3%) bifurcate with factual and hallucinated trajectories diverging at the first generated token (KL = 0 at step 0, KL > 1.0 at step 1). Activation patching across 28 layers reveals a pronounced causal asymmetry: injecting a hallucinated activation into a correct trajectory corrupts output in 87.5% of trials (layer 20), while the reverse recovers only 33.3% (layer 24); both exceed the 10.4% baseline (p = 0.025) and 12.5% random‑patch control. Window patching shows correction requires sustained multi‑step intervention, whereas corruption needs only a single perturbation. Probing the prompt encoding itself, step‑0 residual states predict per‑prompt hallucination rate at Pearson r = 0.776 at layer 15 (p < 0.001 against a 1000‑permutation null); unsupervised clustering identifies five regime‑like groups (eta^2 = 0.55) whose saddle‑adjacent cluster concentrates 12 of the 13 bifurcating false‑premise prompts, indicating that the basin structure is organized around regime commitments fixed at prompt encoding. These findings characterize hallucination as a locally stable attractor basin: entry is probabilistic and rapid, exit demands coordinated intervention across layers and steps, and the relevant basins are selected by clusterable regimes already discernible at step 0.
Authors:Lisa Vasileva, Karin Sim
Abstract:
LLMs are proving to be adept at machine translation although due to their generative nature they may at times overgenerate in various ways. These overgenerations are different from the neurobabble seen in NMT and range from LLM self‑explanations, to risky confabulations, to appropriate explanations, where the LLM is able to act as a human translator would, enabling greater comprehension for the target audience. Detecting and determining the exact nature of the overgenerations is a challenging task. We detail different strategies we have explored for our work in a commercial setting, and present our results.
Authors:Zihao Xu, John Harvill, Ziwei Fan, Yizhou Sun, Hao Ding, Hao Wang
Abstract:
Large Language Models (LLMs) incur significant computational and memory costs when processing long prompts, as full self‑attention scales quadratically with input length. Token compression aims to address this challenge by reducing the number of tokens representing inputs. However, existing prompt‑compression approaches primarily operate in token space and overlook inefficiencies in the latent embedding space. In this paper, we propose K‑Token Merging, a latent‑space compression framework that merges each contiguous block of K token embeddings into a single embedding via a lightweight encoder. The compressed sequence is processed by a LoRA‑adapted LLM, while generation remains in the original vocabulary. Experiments on structural reasoning (Textualized Tree), sentiment classification (Amazon Reviews), and code editing (CommitPackFT) show that K‑Token Merging lies on the Pareto frontier of performance vs. compression, achieving up to 75% input length reduction with minimal performance degradation. Code is available at https://github.com/shsjxzh/K‑Token‑Merging.
Authors:Kanzhi Cheng, Zehao Li, Zheng Ma, Nuo Chen, Jialin Cao, Qiushi Sun, Zichen Ding, Fangzhi Xu, Hang Yan, Jiajun Chen, Anh Tuan Luu, Jianbing Zhang, Lewei Lu, Dahua Lin
Abstract:
Mobile agents powered by vision‑language models have demonstrated impressive capabilities in automating mobile tasks, with recent leading models achieving a marked performance leap, e.g., nearly 70% success on AndroidWorld. However, these systems keep their training data closed and remain opaque about their task and trajectory synthesis recipes. We present OpenMobile, an open‑source framework that synthesizes high‑quality task instructions and agent trajectories, with two key components: (1) The first is a scalable task synthesis pipeline that constructs a global environment memory from exploration, then leverages it to generate diverse and grounded instructions. and (2) a policy‑switching strategy for trajectory rollout. By alternating between learner and expert models, it captures essential error‑recovery data often missing in standard imitation learning. Agents trained on our data achieve competitive results across three dynamic mobile agent benchmarks: notably, our fine‑tuned Qwen2.5‑VL and Qwen3‑VL reach 51.7% and 64.7% on AndroidWorld, far surpassing existing open‑data approaches. Furthermore, we conduct transparent analyses on the overlap between our synthetic instructions and benchmark test sets, and verify that performance gains stem from broad functionality coverage rather than benchmark overfitting. We release data and code at https://njucckevin.github.io/openmobile/ to bridge the data gap and facilitate broader mobile agent research.
Authors:Haochun Tang, Yuliang Yan, Jiahua Lu, Huaxiao Liu, Enyan Dai
Abstract:
Cost‑aware routing dynamically dispatches user queries to models of varying capability to balance performance and inference cost. However, the routing strategy introduces a new security concern that adversaries may manipulate the router to consistently select expensive high‑capability models. Existing routing attacks depend on either white‑box access or heuristic prompts, rendering them ineffective in real‑world black‑box scenarios. In this work, we propose R^2A, which aims to mislead black‑box LLM routers to expensive models via adversarial suffix optimization. Specifically, R^2A deploys a hybrid ensemble surrogate router to mimic the black‑box router. A suffix optimization algorithm is further adapted for the ensemble‑based surrogate. Extensive experiments on multiple open‑source and commercial routing systems demonstrate that R^2A significantly increases the routing rate to expensive models on queries of different distributions. Code and examples: https://github.com/thcxiker/R2A‑Attack.
Authors:Zihong Zhang, Zuchao Li, Lefei Zhang, Ping Wang, Hai Zhao
Abstract:
Autoregressive decoding in Large Language Models (LLMs) generates one token per step, causing high inference latency. Speculative decoding (SD) mitigates this through a guess‑and‑verify strategy, but existing training‑free variants face trade‑offs: retrieval‑based drafts break when no exact match exists, while logits‑based drafts lack structural guidance. We propose RACER (Retrieval‑Augmented Contextual Rapid Speculative Decoding), a lightweight and training‑free method that integrates retrieved exact patterns with logit‑driven future cues. This unification supplies both reliable anchors and flexible extrapolation, yielding richer speculative drafts. Experiments on Spec‑Bench, HumanEval, and MGSM‑ZH demonstrate that RACER consistently accelerates inference, achieving more than 2× speedup over autoregressive decoding, and outperforms prior training‑free methods, offering a scalable, plug‑and‑play solution for efficient LLM decoding. Our source code is available at \hrefhttps://github.com/hkr04/RACERhttps://github.com/hkr04/RACER.
Authors:Hang Su, Zequn Liu, Chen Hu, Xuesong Lu, Yingce Xia, Zhen Liu
Abstract:
While LLMs have demonstrated remarkable potential in Question Answering (QA), evaluating personalization remains a critical bottleneck. Existing paradigms predominantly rely on lexical‑level similarity or manual heuristics, often lacking sufficient data‑driven validation. We address this by mining Community‑Individual Preference Divergence (CIPD), where individual choices override consensus, to distill six key personalization factors as evaluative dimensions. Accordingly, we introduce CoPA, a benchmark with 1,985 user profiles for fine‑grained, factor‑level assessment. By quantifying the alignment between model outputs and user‑specific cognitive preferences inferred from interaction patterns, CoPA provides a more comprehensive and discriminative standard for evaluating personalized QA than generic metrics. The code is available at https://github.com/bjzgcai/CoPA.
Authors:Geonhui Jang, Dongyoon Han, YoungJoon Yoo
Abstract:
Effective code generation requires both model capability and a problem representation that carefully structures how models reason and plan. Existing approaches augment reasoning steps or inject specific structure into how models think, but leave scattered problem conditions unchanged. Inspired by the way humans organize fragmented information into coherent explanations, we propose StoryCoder, a narrative reformulation framework that transforms code generation questions into coherent natural language narratives, providing richer contextual structure than simple rephrasings. Each narrative consists of three components: a task overview, constraints, and example test cases, guided by the selected algorithm and genre. Experiments across 11 models on HumanEval, LiveCodeBench, and CodeForces demonstrate consistent improvements, with an average gain of 18.7% in zero‑shot pass@10. Beyond accuracy, our analyses reveal that narrative reformulation guides models toward correct algorithmic strategies, reduces implementation errors, and induces a more modular code structure. The analyses further show that these benefits depend on narrative coherence and genre alignment, suggesting that structured problem representation is important for code generation regardless of model scale or architecture. Our code is available at https://github.com/gu‑ni/StoryCoder.
Authors:Sumit Mukherjee, Juan Shu, Nairwita Mazumder, Tate Kernell, Celena Wheeler, Shannon Hastings, Chris Sidey-Gibbons
Abstract:
Clinical value set authoring ‑‑ the task of identifying all codes in a standardized vocabulary that define a clinical concept ‑‑ is a recurring bottleneck in clinical quality measurement and phenotyping. A natural approach is to prompt a large language model (LLM) to generate the required codes directly, but structured clinical vocabularies are large, version‑controlled, and not reliably memorized during pretraining. We propose Retrieval‑Augmented Set Completion (RASC): retrieve the K most similar existing value sets from a curated corpus to form a candidate pool, then apply a classifier to each candidate code. Theoretically, retrieve‑and‑select can reduce statistical complexity by shrinking the effective output space from the full vocabulary to a much smaller retrieved candidate pool. We demonstrate the utility of RASC on 11,803 publicly available VSAC value sets, constructing the first large‑scale benchmark for this task. A cross‑encoder fine‑tuned on SAPBert achieves AUROC~0.852 and value‑set‑level F1~0.298, outperforming a simpler three‑layer Multilayer Perceptron (AUROC~0.799, F1~0.250) and both reduce the number of irrelevant candidates per true positive from 12.3 (retrieval‑only) to approximately 3.2 and 4.4 respectively. Zero‑shot GPT‑4o achieves value‑set‑level F1~0.105, with 48.6% of returned codes absent from VSAC entirely. This performance gap widens with increasing value set size, consistent with RASC's theoretical advantage. We observe similar performance gains across two other classifier model types, namely a cross‑encoder initialized from pre‑trained SAPBert and a LightGBM model, demonstrating that RASC's benefits extend beyond a single model class. The code to download and create the benchmark dataset, as well as the model training code is available at: \hrefhttps://github.com/mukhes3/RASChttps://github.com/mukhes3/RASC.
Authors:Yixu Huang, Tinghui Zhu, Muhao Chen
Abstract:
Visual reasoning models (VRMs) have recently shown strong cross‑modal reasoning capabilities by integrating visual perception with language reasoning. However, they often suffer from overthinking, producing unnecessarily long reasoning chains for any tasks. We attribute this issue to Reasoning Path Redundancy in visual reasoning: many visual questions do not require the full reasoning process. To address this, we propose AVR, an adaptive visual reasoning framework that decomposes visual reasoning into three cognitive functions: visual perception, logical reasoning, and answer application. It further enables models to dynamically choose among three response formats: Full Format, Perception‑Only Format, and Direct Answer. AVR is trained with FS‑GRPO, an adaptation of Group Relative Policy Optimization that encourages the model to select the most efficient reasoning format while preserving correctness. Experiments on multiple vision‑language benchmarks show that AVR reduces token usage by 50‑‑90% while maintaining overall accuracy, especially in perception‑intensive tasks. These results demonstrate that adaptive visual reasoning can effectively mitigate overthinking in VRMs. Code and data are available at: https://github.com/RunRiotComeOn/AVR.
Authors:Pengfei Li, Shijie Wang, Fangyuan Li, Yikun Fu, Kaifeng Liu, Kaiyan Zhang, Dazhi Zhang, Yuqiang Li, Biqing Qi, Bowen Zhou
Abstract:
Reinforcement learning (RL) paradigms have demonstrated strong performance on reasoning‑intensive tasks such as code generation. However, limited trajectory diversity often leads to diminishing returns, which constrains the achievable performance ceiling. Search‑enhanced RL alleviates this issue by introducing structured exploration, which remains constrained by the single‑agent policy priors. Meanwhile, leveraging multiple interacting policies can acquire more diverse exploratory signals, but existing approaches are typically decoupled from structured search. We propose MARS^2 (Multi‑Agent Reinforced Tree‑Search Scaling), a unified RL framework in which multiple independently‑optimized agents collaborate within a shared tree‑structured search environment. MARS^2 models the search tree as a learnable multi‑agent interaction environment, enabling heterogeneous agents to collaboratively generate and refine candidate solutions within a shared search topology. To support effective learning, we introduce a path‑level group advantage formulation based on tree‑consistent reward shaping, which facilitates effective credit assignment across complex search trajectories. Experiments on code generation benchmarks show that MARS^2 consistently improves performance across diverse model combinations and training settings, demonstrating the effectiveness of coupling multi‑agent collaboration with tree search for enhancing reinforcement learning. Our code is publicly available at https://github.com/TsinghuaC3I/MARTI.
Authors:Soroush Sadeghian, Alireza Daqiq, Radin Cheraghi, Sajad Ebrahimi, Negar Arabzadeh, Ebrahim Bagheri
Abstract:
Large Language Models (LLMs) are increasingly used in scientific peer review, assisting with drafting, rewriting, expansion, and refinement. However, existing peer‑review LLM detection methods largely treat authorship as a binary problem‑human vs. AI‑without accounting for the hybrid nature of modern review workflows. In practice, evaluative ideas and surface realization may originate from different sources, creating a spectrum of human‑AI collaboration. In this work, we introduce PeerPrism, a large‑scale benchmark of 20,690 peer reviews explicitly designed to disentangle idea provenance from text provenance. We construct controlled generation regimes spanning fully human, fully synthetic, and multiple hybrid transformations. This design enables systematic evaluation of whether detectors identify the origin of the surface text or the origin of the evaluative reasoning. We benchmark state‑of‑the‑art LLM text detection methods on PeerPrism. While several methods achieve high accuracy on the standard binary task (human vs. fully synthetic), their predictions diverge sharply under hybrid regimes. In particular, when ideas originate from humans but the surface text is AI‑generated, detectors frequently disagree and produce contradictory classifications. Accompanied by stylometric and semantic analyses, our results show that current detection methods conflate surface realization with intellectual contribution. Overall, we demonstrate that LLM detection in peer review cannot be reduced to a binary attribution problem. Instead, authorship must be modeled as a multidimensional construct spanning semantic reasoning and stylistic realization. PeerPrism is the first benchmark evaluating human‑AI collaboration in these settings. We release all code, data, prompts, and evaluation scripts to facilitate reproducible research at https://github.com/Reviewerly‑Inc/PeerPrism.
Authors:Andre Bacellar
Abstract:
In law, regulatory regimes for pharmaceuticals and software security, newer authorities can revoke older established ones even when semantically distant. We call this CAR: retrieving the currently active authority frontier for a semantic anchor q, that is, front(cl(A_k(q))). This differs from finding the most similar document by relevance score: argmax_d s(q, d). Theorem 4 characterizes when a set R truly covers the active authority set for q with TCA(R, q)=1, providing conditions necessary and sufficient for any retrieved set R: frontier inclusion (front(cl(A_k(q))) contained in R) and no‑ignored‑superseder (no superseding document exists in the corpus outside R). Proposition 2 shows that TCA@k <= phi(q) R_anchor(q) in the worst case over any scope‑indexed algorithm, proved by an adversarial permutation argument. We evaluated on three real‑world datasets: security advisories (Dense TCA@5=0.270, two‑stage 0.975), SCOTUS overruling pairs (Dense TCA=0.172, two‑stage 0.926), and FDA drug records (Dense TCA=0.064, two‑stage 0.774). A GPT‑4o‑mini experiment shows Dense RAG produces explicit "not patched" claims for 39% of queries where a patch exists; two‑stage cuts this to 16%. Four benchmark datasets, domain adapters, and a single‑command scorer are released at https://github.com/andremir/car‑retrieval.
Authors:Thales Sales Almeida, Giovana Kerche Bonás, Ramon Pires, Celio Larcher, Hugo Abonizio, Marcos Piau, Roseval Malaquias Junior, Rodrigo Nogueira, Thiago Laitz
Abstract:
Large language models (LLMs) are increasingly used as sources of information, yet their reliability depends on the ability to search the web, select relevant evidence, and synthesize complete answers. While recent benchmarks evaluate web‑browsing and agentic tool use, multilingual settings, and Portuguese in particular, remain underexplored. We present \textscMARCA, a bilingual (English and Portuguese) benchmark for evaluating LLMs on web‑based information seeking. \textscMARCA consists of 52 manually authored multi‑entity questions, paired with manually validated checklist‑style rubrics that explicitly measure answer completeness and correctness. We evaluate 14 models under two interaction settings: a Basic framework with direct web search and scraping, and an Orchestrator framework that enables task decomposition via delegated subagents. To capture stochasticity, each question is executed multiple times and performance is reported with run‑level uncertainty. Across models, we observe large performance differences, find that orchestration often improves coverage, and identify substantial variability in how models transfer from English to Portuguese. The benchmark is available at https://github.com/maritaca‑ai/MARCA
Authors:Mohammad R. Abu Ayyash
Abstract:
We present Three‑Phase Transformer (3PT), a residual‑stream structural prior for decoder‑only Transformers on a standard SwiGLU + RMSNorm + RoPE + GQA backbone. The hidden vector is partitioned into N equally‑sized cyclic channels, each maintained by phase‑respecting ops: a per‑channel RMSNorm, a 2D Givens rotation between attention and FFN that rotates each channel by theta + i(2pi/N), and a head‑count constraint aligning GQA heads with the partition. The architecture is a self‑stabilizing equilibrium between scrambling and re‑imposition, not a bolted‑on module. The partition carves out a one‑dimensional DC subspace orthogonal to the channels, into which we inject a fixed Gabriel's horn profile r(p) = 1/(p+1) as an absolute‑position side‑channel composing orthogonally with RoPE's relative‑position rotation. The canonical N=3 borrows its metaphor from balanced three‑phase AC, where three sinusoids 120 degrees apart sum to zero with no anti‑correlated pair. At 123M parameters on WikiText‑103, 3PT achieves ‑7.20% perplexity (‑2.62% bits‑per‑byte) over a matched RoPE‑Only baseline at +1,536 parameters (0.00124% of total), with 1.93x step‑count convergence speedup (1.64x wall‑clock). N behaves as a parameter‑sharing knob rather than a unique optimum: at 5.5M an N‑sweep over 1,2,3,4,6,8,12 is near‑monotone with N=1 winning; at 123M a three‑seed sweep finds N=3 and N=1 statistically indistinguishable. The load‑bearing mechanism is the channel‑partitioned residual stream, per‑block rotation, per‑phase normalization, and horn DC injection. We characterize (a) self‑stabilization of the geometry without explicit enforcement, a novel instance of the conservation‑law framework for neural networks; (b) a U‑shaped depth profile of rotation‑angle drift at 12 layers; (c) orthogonal composition with RoPE, attention, and FFN.
Authors:Ferdinand M. Schessl
Abstract:
Turn‑level metrics are widely used to evaluate properties of multi‑turn human‑LLM conversations, from safety and sycophancy to dialogue quality. However, consecutive turns within a conversation are not statistically independent ‑‑ a fact that virtually all current evaluation pipelines fail to correct for in their statistical inference. We systematically characterize the autocorrelation structure of 66 turn‑level metrics across 202 multi‑turn conversations (11,639 turn pairs, 5 German‑speaking users, 4 LLM platforms) and demonstrate that naive pooled analysis produces severely inflated significance estimates: 42% of associations that appear significant under standard pooled testing fail to survive cluster‑robust correction. The inflation varies substantially across categories rather than scaling linearly with autocorrelation: three memoryless families (embedding velocity, directional, differential) aggregate to 14%, while the seven non‑memoryless families (thermo‑cycle, frame distance, lexical/structural, rolling windows, cumulative, interaction, timestamp) aggregate to 33%, with individual category rates ranging from 0% to 100% depending on per‑family effect size. We present a two‑stage correction framework combining Chelton (1983) effective degrees of freedom with conversation‑level block bootstrap, and validate it on a pre‑registered hold‑out split where cluster‑robust metrics replicate at 57% versus 30% for pooled‑only metrics. We provide concrete design principles, a publication checklist, and open‑source code for the correction pipeline. A survey of ~30 recent papers at major NLP and AI venues that compute turn‑level statistics in LLM evaluations finds that only 4 address temporal dependence at all, and 26 do not correct for it.
Authors:Hao An, Yibin Lou, Jiayi Guo, Yang Xu
Abstract:
Large language models (LLMs) often exhibit hallucinations due to their inability to accurately perceive their own knowledge boundaries. Existing abstention fine‑tuning methods typically partition datasets directly based on response accuracy, causing models to suffer from severe label noise near the decision boundaries and consequently exhibit high rates of abstentions or hallucinations. This paper adopts a latent space representation perspective, revealing a "gray zone" near the decision hyperplane where internal belief ambiguity constitutes the core performance bottleneck. Based on this insight, we propose the GeoDe (Geometric Denoising) framework for abstention fine‑tuning. This method constructs a truth hyperplane using linear probes and performs "geometric denoising" by employing geometric distance as a confidence signal for abstention decisions. This approach filters out ambiguous boundary samples while retaining high‑fidelity signals for fine‑tuning. Experiments across multiple models (Llama3, Qwen3) and benchmark datasets (TriviaQA, NQ, SciQ, SimpleQA) demonstrate that GeoDe significantly enhances model truthfulness and demonstrates strong generalization in out‑of‑distribution (OOD) scenarios. Code is available at https://github.com/Notbesidemoon/GeoDe.
Authors:Zhuofeng Li, Yi Lu, Dongfu Jiang, Haoxiang Zhang, Yuyang Bai, Chuan Li, Yu Wang, Shuiwang Ji, Jianwen Xie, Yu Zhang
Abstract:
The rapid rise in AI conference submissions has driven increasing exploration of large language models (LLMs) for peer review support. However, LLM‑based reviewers often generate superficial, formulaic comments lacking substantive, evidence‑grounded feedback. We attribute this to the underutilization of two key components of human reviewing: explicit rubrics and contextual grounding in existing work. To address this, we introduce REVIEWBENCH, a benchmark evaluating review text according to paper‑specific rubrics derived from official guidelines, the paper's content, and human‑written reviews. We further propose REVIEWGROUNDER, a rubric‑guided, tool‑integrated multi‑agent framework that decomposes reviewing into drafting and grounding stages, enriching shallow drafts via targeted evidence consolidation. Experiments on REVIEWBENCH show that REVIEWGROUNDER, using a Phi‑4‑14B‑based drafter and a GPT‑OSS‑120B‑based grounding stage, consistently outperforms baselines with substantially stronger/larger backbones (e.g., GPT‑4.1 and DeepSeek‑R1‑670B) in both alignment with human judgments and rubric‑based review quality across 8 dimensions. The code is available \hrefhttps://github.com/EigenTom/ReviewGrounderhere.
Authors:Jiacheng Liu, Xiaohan Zhao, Xinyi Shang, Zhiqiang Shen
Abstract:
Claude Code is an agentic coding tool that can run shell commands, edit files, and call external services on behalf of the user. This study describes its comprehensive architecture by analyzing the publicly available TypeScript source code and further comparing it with OpenClaw, an independent open‑source AI agent system that answers many of the same design questions from a different deployment context. Our analysis identifies five human values, philosophies, and needs that motivate the architecture (human decision authority, safety and security, reliable execution, capability amplification, and contextual adaptability) and traces them through thirteen design principles to specific implementation choices. The core of the system is a simple while‑loop that calls the model, runs tools, and repeats. Most of the code, however, lives in the systems around this loop: a permission system with seven modes and an ML‑based classifier, a five‑layer compaction pipeline for context management, four extensibility mechanisms (MCP, plugins, skills, and hooks), a subagent delegation mechanism with worktree isolation, and append‑oriented session storage. A comparison with OpenClaw, a multi‑channel personal assistant gateway, shows that the same recurring design questions produce different architectural answers when the deployment context changes: from per‑action safety classification to perimeter‑level access control, from a single CLI loop to an embedded runtime within a gateway control plane, and from context‑window extensions to gateway‑wide capability registration. We finally identify six open design directions for future agent systems, grounded in recent empirical, architectural, and policy literature.
Authors:Samir Wagle, Reewaj Khanal, Abiral Adhikari
Abstract:
Hate speech detection in Devanagari‑scripted social media memes presents compounded challenges: multimodal content structure, script‑specific linguistic complexity, and extreme data scarcity in low‑resource settings. This paper presents our system for the CHiPSAL 2026 shared task, addressing both Subtask A (binary hate speech detection) and Subtask B (three‑class sentiment classification: positive, neutral, negative). We propose a hybrid cross‑modal attention fusion architecture that combines CLIP (ViT‑B/32) for visual encoding with BGE‑M3 for multilingual text representation, connected through 4‑head self‑attention and a learnable gating network that dynamically weights modality contributions on a per‑sample basis. Systematic evaluation across eight model configurations demonstrates that explicit cross‑modal reasoning achieves a 5.9% F1‑macro improvement over text‑only baselines on Subtask A, while uncovering two unexpected but critical findings: English‑centric vision models exhibit near‑random performance on Devanagari script, and standard ensemble methods catastrophically degrade under data scarcity (N nearly equal to 850 per fold) due to correlated overfitting. The code can be accessed at https://github.com/Tri‑Yantra‑Technologies/MEME‑Fusion/
Authors:Junhong Liang, Yifan Lu, Ekaterina Kochmar, Fajri Koto
Abstract:
Grammatical error correction (GEC) and explanation (GEE) have made rapid progress, but real teaching scenarios also require \emphlearner‑friendly pedagogical feedback that is actionable, level‑appropriate, and encouraging. We introduce SPFG (Spoken Pedagogical Feedback Generation), a dataset built based on the Speak \& Improve Challenge 2025 corpus, pairing fluency‑oriented transcriptions with GEC targets and \emphhuman‑verified teacher‑style feedback, including preferred/rejected feedback pairs for preference learning. We study a transcript‑based Spoken Grammatical Error Correction (SGEC) setting and evaluate three instruction‑tuned LLMs (Qwen2.5, Llama‑3.1, and GLM‑4), comparing supervised fine‑tuning (SFT) with preference‑based alignment (using DPO and KTO) for jointly generating corrections and feedback. Results show that SFT provides the most consistent improvements, while DPO/KTO yield smaller or mixed gains, and that correction quality and feedback quality are weakly coupled. Our implementation is available at https://github.com/Skywalker‑Harrison/spfg.
Authors:Bryan Sanchez
Abstract:
Alignment‑tuned language models frequently suppress factual log‑probabilities on politically sensitive topics despite retaining the knowledge in their hidden representations. We show that a 786K‑parameter (approximately 0.02% of the base model) post‑transformer adapter, trained on frozen hidden states, corrects this suppression on 31 ideology‑discriminating facts across Qwen3‑4B, 8B, and 14B. The adapter memorizes all 15 training facts and generalizes to 11‑‑39% of 16 held‑out facts across 5 random splits per scale, with zero knowledge regressions via anchored training. Both gated (SwiGLU) and ungated (linear bottleneck) adapters achieve comparable results; neither consistently outperforms the other (Fisher exact p > 0.09 at all scales). On instruct models, the adapter corrects log‑probability rankings. When applied at all token positions during generation, the adapter produces incoherent output; however, when applied only at the current prediction position (last‑position‑only), the adapter produces coherent, less censored text. A logit‑space adapter operating after token projection fails to produce coherent generation at any application mode, suggesting hidden‑state intervention is the correct level for generation correction. A previously undocumented silent gradient bug in Apple MLX explains all null results in earlier iterations of this work: the standard pattern nn.value_and_grad(model, fn)(model.parameters()) returns zero gradients without error; the correct pattern nn.value_and_grad(model, fn)(model, data) resolves this. We provide a minimal reproduction and discuss implications for other adapter research using MLX.
Authors:Baocai Shan, Yuzhuang Xu, Wanxiang Che
Abstract:
Mobile input method editors (IMEs) are the primary interface for text input, yet they remain constrained to manual typing and struggle to produce personalized text. While lightweight large language models (LLMs) make on‑device auxiliary generation feasible, enabling deeply personalized, privacy‑preserving, and real‑time generative IMEs poses fundamental challenges.To this end, we present HUOZIIME, a personalized on‑device IME powered by LLM. We endow HUOZIIME with initial human‑like prediction ability by post‑training a base LLM on synthesized personalization data. Notably, a hierarchical memory mechanism is designed to continually capture and leverage user‑specific input history. Furthermore, we perform systemic optimizations tailored to on‑device LLMbased IME deployment, ensuring efficient and responsive operation under mobile constraints.Experiments demonstrate efficient on‑device execution and high‑fidelity memory‑driven personalization. Code and package are available at https://github.com/Shan‑HIT/HuoziIME.
Authors:Yuqiao Tan, Minzheng Wang, Bo Liu, Zichen Liu, Tian Liang, Shizhu He, Jun Zhao, Kang Liu
Abstract:
While reinforcement learning with verifiable rewards (RLVR) significantly enhances LLM reasoning by optimizing the conditional distribution P(y|x), its potential is fundamentally bounded by the base model's existing output distribution. Optimizing the marginal distribution P(y) in the Pre‑train Space addresses this bottleneck by encoding reasoning ability and preserving broad exploration capacity. Yet, conventional pre‑training relies on static corpora for passive learning, leading to a distribution shift that hinders targeted reasoning enhancement. In this paper, we introduce PreRL (Pre‑train Space RL), which applies reward‑driven online updates directly to P(y). We theoretically and empirically validate the strong gradient alignment between log P(y) and log P(y|x), establishing PreRL as a viable surrogate for standard RL. Furthermore, we uncover a critical mechanism: Negative Sample Reinforcement (NSR) within PreRL serves as an exceptionally effective driver for reasoning. NSR‑PreRL rapidly prunes incorrect reasoning spaces while stimulating endogenous reflective behaviors, increasing transition and reflection thoughts by 14.89x and 6.54x, respectively. Leveraging these insights, we propose Dual Space RL (DSRL), a Policy Reincarnation strategy that initializes models with NSR‑PreRL to expand the reasoning horizon before transitioning to standard RL for fine‑grained optimization. Extensive experiments demonstrate that DSRL consistently outperforms strong baselines, proving that pre‑train space pruning effectively steers the policy toward a refined correct reasoning subspace.
Authors:Itay Itzhak, Eliya Habba, Gabriel Stanovsky, Yonatan Belinkov
Abstract:
Evaluating LLMs is challenging, as benchmark scores often fail to capture models' real‑world usefulness. Instead, users often rely on ``vibe‑testing'': informal experience‑based evaluation, such as comparing models on coding tasks related to their own workflow. While prevalent, vibe‑testing is often too ad hoc and unstructured to analyze or reproduce at scale. In this work, we study how vibe‑testing works in practice and then formalize it to support systematic analysis. We first analyze two empirical resources: (1) a survey of user evaluation practices, and (2) a collection of in‑the‑wild model comparison reports from blogs and social media. Based on these resources, we formalize vibe‑testing as a two‑part process: users personalize both what they test and how they judge responses. We then introduce a proof‑of‑concept evaluation pipeline that follows this formulation by generating personalized prompts and comparing model outputs using user‑aware subjective criteria. In experiments on coding benchmarks, we find that combining personalized prompts and user‑aware evaluation can change which model is preferred, reflecting the role of vibe‑testing in practice. These findings suggest that formalized vibe‑testing can serve as a useful approach for bridging benchmark scores and real‑world experience.
Authors:Fei Tang, Bofan Chen, Zhengxi Lu, Tongbo Chen, Songqin Nong, Tao Jiang, Wenhao Xu, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen
Abstract:
GUI grounding, which localizes interface elements from screenshots given natural language queries, remains challenging for small icons and dense layouts. Test‑time zoom‑in methods improve localization by cropping and re‑running inference at higher resolution, but apply cropping uniformly across all instances with fixed crop sizes, ignoring whether the model is actually uncertain on each case. We propose UI‑Zoomer, a training‑free adaptive zoom‑in framework that treats both the trigger and scale of zoom‑in as a prediction uncertainty quantification problem. A confidence‑aware gate fuses spatial consensus among stochastic candidates with token‑level generation confidence to selectively trigger zoom‑in only when localization is uncertain. When triggered, an uncertainty‑driven crop sizing module decomposes prediction variance into inter‑sample positional spread and intra‑sample box extent, deriving a per‑instance crop radius via the law of total variance. Extensive experiments on ScreenSpot‑Pro, UI‑Vision, and ScreenSpot‑v2 demonstrate consistent improvements over strong baselines across multiple model architectures, achieving gains of up to +13.4%, +10.3%, and +4.2% respectively, with no additional training required.
Authors:Junlin Zhu, Baizhou Huang, Xiaojun Wan
Abstract:
As large language models become standard backends for content generation, practical provenance increasingly requires multi‑bit watermarking. In provider‑internal deployments, a key requirement is message symmetry: the message itself should not systematically affect either text quality or verification outcomes. Vocabulary‑partition watermarks can break message symmetry in low‑entropy decoding: some messages are assigned most of the probability mass, while others are forced to use tail tokens. This makes embedding quality and message decoding accuracy message‑dependent. We propose QuantileMark, a white‑box multi‑bit watermark that embeds messages within the continuous cumulative probability interval [0, 1). At each step, QuantileMark partitions this interval into M equal‑mass bins and samples strictly from the bin assigned to the target symbol, ensuring a fixed 1/M probability budget regardless of context entropy. For detection, the verifier reconstructs the same partition under teacher forcing, computes posteriors over latent bins, and aggregates evidence for verification. We prove message‑unbiasedness, a property ensuring that the base distribution is recovered when averaging over messages. This provides a theoretical foundation for generation‑side symmetry, while the equal‑mass design additionally promotes uniform evidence strength across messages on the detection side. Empirical results on C4 continuation and LFQA show improved multi‑bit recovery and detection robustness over strong baselines, with negligible impact on generation quality. Our code is available at GitHub (https://github.com/zzzjunlin/QuantileMark).
Authors:Zhijie Bao, Fangke Chen, Licheng Bao, Chenhui Zhang, Wei Chen, Jiajie Peng, Zhongyu Wei
Abstract:
The potential of Multimodal Large Language Models (MLLMs) in domain of medical imaging raise the demands of systematic and rigorous evaluation frameworks that are aligned with the real‑world medical imaging practice. Existing practices that report single or coarse‑grained metrics are lack the granularity required for specialized clinical support and fail to assess the reliability of reasoning mechanisms. To address this, we propose a paradigm shift toward multidimensional, fine‑grained and in‑depth evaluation. Based on a two‑stage systematic construction pipeline designed for this paradigm, we instantiate it with MedRCube. We benchmark 33 MLLMs, Lingshu‑32B achieve top‑tier performance. Crucially, MedRCube exposes a series of pronounced insights inaccessible under prior evaluation settings. Furthermore, we introduce a credibility evaluation subset to quantify reasoning credibility, uncover a highly significant positive association between shortcut behavior and diagnostic task performance, raising concerns for clinically trustworthy deployment. The resources of this work can be found at https://github.com/F1mc/MedRCube.
Authors:Xiao Pu, Zepeng Cheng, Lin Yuan, Yu Wu, Xiuli Bi
Abstract:
As large language models (LLMs) generate text that increasingly resembles human writing, the subtle cues that distinguish AI‑generated content from human‑written content become increasingly challenging to capture. Reliance on generator‑specific artifacts is inherently unstable, since new models emerge rapidly and reduce the robustness of such shortcuts. This generalizes unseen generators as a central and challenging problem for AI‑text detection. To tackle this challenge, we propose a progressively structured framework that disentangles AI‑detection semantics from generator‑aware artifacts. This is achieved through a compact latent encoding that encourages semantic minimality, followed by perturbation‑based regularization to reduce residual entanglement, and finally a discriminative adaptation stage that aligns representations with task objectives. Experiments on MAGE benchmark, covering 20 representative LLMs across 7 categories, demonstrate consistent improvements over state‑of‑the‑art methods, achieving up to 24.2% accuracy gain and 26.2% F1 improvement. Notably, performance continues to improve as the diversity of training generators increases, confirming strong scalability and generalization in open‑set scenarios. Our source code will be publicly available at https://github.com/PuXiao06/DRGD.
Authors:Kaiwen Zheng, Kai Zhou, Jinwu Hu, Te Gu, Mingkai Peng, Fei Liu
Abstract:
Large language models (LLMs) demonstrate strong reasoning capabilities, but their performance often degrades under distribution shift. Existing test‑time adaptation (TTA) methods rely on gradient‑based updates that require white‑box access and need substantial overhead, while training‑free alternatives are either static or depend on external guidance. In this paper, we propose Training‑Free Test‑Time Contrastive Learning TF‑TTCL, a training‑free adaptation framework that enables a frozen LLM to improve online by distilling supervision from its own inference experiences. Specifically, TF‑TTCL implements a dynamic "Explore‑Reflect‑Steer" loop through three core modules: 1) Semantic Query Augmentation first diversifies problem views via multi‑agent role‑playing to generate different reasoning trajectories; 2) Contrastive Experience Distillation then captures the semantic gap between superior and inferior trajectories, distilling them into explicit textual rules; and 3) Contextual Rule Retrieval finally activates these stored rules during inference to dynamically steer the frozen LLM toward robust reasoning patterns while avoiding observed errors. Extensive experiments on closed‑ended reasoning tasks and open‑ended evaluation tasks demonstrate that TF‑TTCL consistently outperforms strong zero‑shot baselines and representative TTA methods under online evaluation. Code is available at https://github.com/KevinSCUTer/TF‑TTCL.
Authors:Zhaoyang Wang, Qianhui Wu, Xuchao Zhang, Chaoyun Zhang, Wenlin Yao, Fazle Elahi Faisal, Baolin Peng, Si Qin, Suman Nath, Qingwei Lin, Chetan Bansal, Dongmei Zhang, Saravan Rajmohan, Jianfeng Gao, Huaxiu Yao
Abstract:
Autonomous web agents powered by large language models (LLMs) have shown promise in completing complex browser tasks, yet they still struggle with long‑horizon workflows. A key bottleneck is the grounding gap in existing skill formulations: textual workflow skills provide natural language guidance but cannot be directly executed, while code‑based skills are executable but opaque to the agent, offering no step‑level understanding for error recovery or adaptation. We introduce WebXSkill, a framework that bridges this gap with executable skills, each pairing a parameterized action program with step‑level natural language guidance, enabling both direct execution and agent‑driven adaptation. WebXSkill operates in three stages: skill extraction mines reusable action subsequences from readily available synthetic agent trajectories and abstracts them into parameterized skills, skill organization indexes skills into a URL‑based graph for context‑aware retrieval, and skill deployment exposes two complementary modes, grounded mode for fully automated multi‑step execution and guided mode where skills serve as step‑by‑step instructions that the agent follows with its native planning. On WebArena and WebVoyager, WebXSkill improves task success rate by up to 9.8 and 12.9 points over the baseline, respectively, demonstrating the effectiveness of executable skills for web agents. The code is publicly available at https://github.com/aiming‑lab/WebXSkill.
Authors:Xiang Long, Li Du, Yilong Xu, Fangcheng Liu, Haoqing Wang, Ning Ding, Ziheng Li, Jianyuan Guo, Yehui Tang
Abstract:
LLM‑based agents are increasingly expected to handle real‑world assistant tasks, yet existing benchmarks typically evaluate them under isolated sources of difficulty, such as a single environment or fully specified instructions. This leaves a substantial gap between current evaluation settings and the compositional challenges that arise in practical deployment. To address this gap, we introduce LiveClawBench, a benchmark to evaluate LLM agents on real‑world assistant tasks. Based on an analysis of various real OpenClaw usage cases, we derive a Triple‑Axis Complexity Framework that characterizes task difficulty along three dimensions: Environment Complexity, Cognitive Demand, and Runtime Adaptability. Guided by this framework, we construct a pilot benchmark with explicit complexity‑factor annotations, covering real‑world assistant tasks with compositional difficulty. Together, the framework and benchmark provide a principled foundation for evaluating LLM agents in realistic assistant settings, and establish a basis for future expansion across task domains and complexity axes. We are continuing to enrich our case collections to achieve more comprehensive domain and complexity coverage. The project page is at https://github.com/Mosi‑AI/LiveClawBench.
Authors:Matthias De Lange, Warre Veys, Federico Retyk, Daniel Deniz, Warren Jouanneau, Mike Zhang, Aleksander Bielinski, Emma Jouffroy, Nicole Clobes, Nina Baranowska, David Graus, Marc Palyart, Rabih Zbib, Dimitra Gkatzia, Thomas Demeester, Tijl De Bie, Toine Bogers, Jens-Joris Decorte, Jeroen Van Hautte
Abstract:
Today's evolving labor markets rely increasingly on recommender systems for hiring, talent management, and workforce analytics, with natural language processing (NLP) capabilities at the core. Yet, research in this area remains highly fragmented. Studies employ divergent ontologies (ESCO, ONET, national taxonomies), heterogeneous task formulations, and diverse model families, making cross‑study comparison and reproducibility exceedingly difficult. General‑purpose benchmarks lack coverage of work‑specific tasks, and the inherent sensitivity of employment data further limits open evaluation. We present WorkRB (Work Research Benchmark), the first open‑source, community‑driven benchmark tailored to work‑domain AI. WorkRB organizes 13 diverse tasks from 7 task groups as unified recommendation and NLP tasks, including job/skill recommendation, candidate recommendation, similar item recommendation, and skill extraction and normalization. WorkRB enables both monolingual and cross‑lingual evaluation settings through dynamic loading of multilingual ontologies. Developed within a multi‑stakeholder ecosystem of academia, industry, and public institutions, WorkRB has a modular design for seamless contributions and enables integration of proprietary tasks without disclosing sensitive data. WorkRB is available under the Apache 2.0 license at https://github.com/techwolf‑ai/WorkRB.
Authors:Kathakoli Sengupta, Kai Ao, Paola Cascante-Bonilla
Abstract:
Large Language Models (LLMs) and Vision‑Language Models (VLMs) increasingly generate indoor scenes through intermediate structures such as layouts and scene graphs, yet evaluation still relies on LLM or VLM judges that score rendered views, making judgments sensitive to viewpoint, prompt phrasing, and hallucination. When the evaluator is unstable, it becomes difficult to determine whether a model has produced a spatially plausible scene or whether the output score reflects the choice of viewpoint, rendering, or prompt. We introduce SceneCritic, a symbolic evaluator for floor‑plan‑level layouts. SceneCritic's constraints are grounded in SceneOnto, a structured spatial ontology we construct by aggregating indoor scene priors from 3D‑FRONT, ScanNet, and Visual Genome. SceneOnto traverses this ontology to jointly verify semantic, orientation, and geometric coherence across object relationships, providing object‑level and relationship‑level assessments that identify specific violations and successful placements. Furthermore, we pair SceneCritic with an iterative refinement test bed that probes how models build and revise spatial structure under different critic modalities: a rule‑based critic using collision constraints as feedback, an LLM critic operating on the layout as text, and a VLM critic operating on rendered observations. Through extensive experiments, we show that (a) SceneCritic aligns substantially better with human judgments than VLM‑based evaluators, (b) text‑only LLMs can outperform VLMs on semantic layout quality, and (c) image‑based VLM refinement is the most effective critic modality for semantic and orientation correction.
Authors:Guoxin Chen, Jie Chen, Lei Chen, Jiale Zhao, Fanzhe Meng, Wayne Xin Zhao, Ruihua Song, Cheng Chen, Ji-Rong Wen, Kai Jia
Abstract:
Autonomous AI research has advanced rapidly, but long‑horizon ML research engineering remains difficult: agents must sustain coherent progress across task comprehension, environment setup, implementation, experimentation, and debugging over hours or days. We introduce AiScientist, a system for autonomous long‑horizon engineering for ML research built on a simple principle: strong long‑horizon performance requires both structured orchestration and durable state continuity. To this end, AiScientist combines hierarchical orchestration with a permission‑scoped File‑as‑Bus workspace: a top‑level Orchestrator maintains stage‑level control through concise summaries and a workspace map, while specialized agents repeatedly re‑ground on durable artifacts such as analyses, plans, code, and experimental evidence rather than relying primarily on conversational handoffs, yielding thin control over thick state. Across two complementary benchmarks, AiScientist improves PaperBench score by 10.54 points on average over the best matched baseline and achieves 81.82 Any Medal% on MLE‑Bench Lite. Ablation studies further show that File‑as‑Bus protocol is a key driver of performance, reducing PaperBench by 6.41 points and MLE‑Bench Lite by 31.82 points when removed. These results suggest that long‑horizon ML research engineering is a systems problem of coordinating specialized work over durable project state, rather than a purely local reasoning problem.
Authors:Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, Ning Ding
Abstract:
On‑policy distillation (OPD) has become a core technique in the post‑training of large language models, yet its training dynamics remain poorly understood. This paper provides a systematic investigation of OPD dynamics and mechanisms. We first identify that two conditions govern whether OPD succeeds or fails: (i) the student and teacher should share compatible thinking patterns; and (ii) even with consistent thinking patterns and higher scores, the teacher must offer genuinely new capabilities beyond what the student has seen during training. We validate these findings through weak‑to‑strong reverse distillation, showing that same‑family 1.5B and 7B teachers are distributionally indistinguishable from the student's perspective. Probing into the token‑level mechanism, we show that successful OPD is characterized by progressive alignment on high‑probability tokens at student‑visited states, a small shared token set that concentrates most of the probability mass (97%‑99%). We further propose two practical strategies to recover failing OPD: off‑policy cold start and teacher‑aligned prompt selection. Finally, we show that OPD's apparent free lunch of dense token‑level reward comes at a cost, raising the question of whether OPD can scale to long‑horizon distillation.
Authors:Amir Hossein Kargaran, Nafiseh Nikeghbal, Jana Diesner, François Yvon, Hinrich Schütze
Abstract:
Optical character recognition (OCR) has advanced rapidly with the rise of vision‑language models, yet evaluation has remained concentrated on a small cluster of high‑ and mid‑resource scripts. We introduce GlotOCR Bench, a comprehensive benchmark evaluating OCR generalization across 100+ Unicode scripts. Our benchmark comprises clean and degraded image variants rendered from real multilingual texts. Images are rendered using fonts from the Google Fonts repository, shaped with HarfBuzz and rasterized with FreeType, supporting both LTR and RTL scripts. Samples of rendered images were manually reviewed to verify correct rendering across all scripts. We evaluate a broad suite of open‑weight and proprietary vision‑language models and find that most perform well on fewer than ten scripts, and even the strongest frontier models fail to generalize beyond thirty scripts. Performance broadly tracks script‑level pretraining coverage, suggesting that current OCR systems rely on language model pretraining as much as on visual recognition. Models confronted with unfamiliar scripts either produce random noise or hallucinate characters from similar scripts they already know. We release the benchmark and pipeline for reproducibility. Pipeline Code: https://github.com/cisnlp/glotocr‑bench, Benchmark: https://hf.co/datasets/cis‑lmu/glotocr‑bench.
Authors:Eliya Habba, Itay Itzhak, Asaf Yehudai, Yotam Perlitz, Elron Bandel, Michal Shmueli-Scheuer, Leshem Choshen, Gabriel Stanovsky
Abstract:
The rapid release of both language models and benchmarks makes it increasingly costly to evaluate every model on every dataset. In practice, models are often evaluated on different samples, making scores difficult to compare across studies. To address this, we propose a framework based on multidimensional Item Response Theory (IRT) that uses anchor items to calibrate new benchmarks to the evaluation suite while holding previously calibrated item parameters fixed. Our approach supports a realistic evaluation setting in which datasets are introduced over time and models are evaluated only on the datasets available at the time of evaluation, while a fixed anchor set for each dataset is used so that results from different evaluation periods can be compared directly. In large‑scale experiments on more than 400 models, our framework predicts full‑evaluation performance within 2‑3 percentage points using only 100 anchor questions per dataset, with Spearman ρ\geq 0.9 for ranking preservation, showing that it is possible to extend benchmark suites over time while preserving score comparability, at a constant evaluation cost per new dataset. Code available at https://github.com/eliyahabba/growing‑pains
Authors:Alkid Baci, Luke Friedrichs, Caglar Demir, N'Dah Jean Kouagou, Axel-Cyrille Ngonga Ngomo
Abstract:
Knowledge graph embedding (KGE) models perform well on link prediction but struggle with unseen entities, relations, and especially literals, limiting their use in dynamic, heterogeneous graphs. In contrast, pretrained large language models (LLMs) generalize effectively through prompting. We reformulate link prediction as a prompt learning problem and introduce RALP, which learns string‑based chain‑of‑thought (CoT) prompts as scoring functions for triples. Using Bayesian Optimization through MIPRO algorithm, RALP identifies effective prompts from fewer than 30 training examples without gradient access. At inference, RALP predicts missing entities, relations or whole triples and assigns confidence scores based on the learned prompt. We evaluate on transductive, numerical, and OWL instance retrieval benchmarks. RALP improves state‑of‑the‑art KGE models by over 5% MRR across datasets and enhances generalization via high‑quality inferred triples. On OWL reasoning tasks with complex class expressions (e.g., \exists hasChild.Female, \geq 5 \; hasChild.Female), it achieves over 88% Jaccard similarity. These results highlight prompt‑based LLM reasoning as a flexible alternative to embedding‑based methods. We release our implementation, training, and evaluation pipeline as open source: https://github.com/dice‑group/RALP .
Authors:Peng Wang, Biyu Zhou, Xuehai Tang, Jizhong Han, Songlin Hu
Abstract:
Unstructured model editing aims to update models with real‑world text, yet existing methods often memorize text holistically without reliable fine‑grained fact access. To address this, we propose FABLE, a hierarchical framework that decouples fine‑grained fact injection from holistic text generation. FABLE follows a two‑stage, fact‑first strategy: discrete facts are anchored in shallow layers, followed by minimal updates to deeper layers to produce coherent text. This decoupling resolves the mismatch between holistic recall and fine‑grained fact access, reflecting the unidirectional Transformer flow in which surface‑form generation amplifies rather than corrects underlying fact representations. We also introduce UnFine, a diagnostic benchmark with fine‑grained question‑answer pairs and fact‑level metrics for systematic evaluation. Experiments show that FABLE substantially improves fine‑grained question answering while maintaining state‑of‑the‑art holistic editing performance. Our code is publicly available at https://github.com/caskcsg/FABLE.
Authors:Linhao Zhang, Yuhan Song, Aiwei Liu, Chuhan Wu, Sijun Zhang, Wei Jia, Yuan Liu, Houfeng Wang, Xiao Zhou
Abstract:
Recent Audio Large Language Models (AudioLLMs) exhibit a striking performance inversion: while excelling at complex reasoning tasks, they consistently underperform on fine‑grained acoustic perception. We attribute this gap to a fundamental limitation of ASR‑centric training, which provides precise linguistic targets but implicitly teaches models to suppress paralinguistic cues and acoustic events as noise. To address this, we propose Unified Audio Schema (UAS), a holistic and structured supervision framework that organizes audio information into three explicit components ‑‑ Transcription, Paralinguistics, and Non‑linguistic Events ‑‑ within a unified JSON format. This design achieves comprehensive acoustic coverage without sacrificing the tight audio‑text alignment that enables reasoning. We validate the effectiveness of this supervision strategy by applying it to both discrete and continuous AudioLLM architectures. Extensive experiments on MMSU, MMAR, and MMAU demonstrate that UAS‑Audio yields consistent improvements, boosting fine‑grained perception by 10.9% on MMSU over the same‑size state‑of‑the‑art models while preserving robust reasoning capabilities. Our code and model are publicly available at https://github.com/Tencent/Unified_Audio_Schema.
Authors:Shuai Wang, Xixi Wang, Yinan Yu
Abstract:
Large Language Models (LLMs) have shown remarkable capabilities across various tasks but remain prone to hallucinations in knowledge‑intensive scenarios. Knowledge Base Question Answering (KBQA) mitigates this by grounding generation in Knowledge Graphs (KGs). However, most multi‑hop KBQA methods rely on explicit edge traversal, making them fragile to KG incompleteness. In this paper, we proposed a novel graph‑based soft prompting framework that shifts the reasoning paradigm from node‑level path traversal to subgraph‑level reasoning. Specifically, we employ a Graph Neural Network (GNN) to encode extracted structural subgraphs into soft prompts, enabling LLM to reason over richer structural context and identify relevant entities beyond immediate graph neighbors, thereby reducing sensitivity to missing edges. Furthermore, we introduce a two‑stage paradigm that reduces computational cost while preserving good performance: a lightweight LLM first leverages the soft prompts to identify question‑relevant entities and relations, followed by a more powerful LLM for evidence‑aware answer generation. Experiments on four multi‑hop KBQA benchmarks show that our approach achieves state‑of‑the‑art performance on three of them, demonstrating its effectiveness. Code is available at the repository: https://github.com/Wangshuaiia/GraSP.
Authors:Shuai Wang, Yinan Yu
Abstract:
Large Language Models (LLMs) exhibit strong abilities in natural language understanding and generation, yet they struggle with knowledge‑intensive reasoning. Structured Knowledge Graphs (KGs) provide an effective form of external knowledge representation and have been widely used to enhance performance in classical Knowledge Base Question Answering (KBQA) tasks. However, performing precise multi‑hop reasoning over KGs for complex queries remains highly challenging. Most existing approaches decompose the reasoning process into a sequence of isolated steps executed through a fixed pipeline. While effective to some extent, such designs constrain reasoning flexibility and fragment the overall decision process, often leading to incoherence and the loss of critical intermediate information from earlier steps. In this paper, we introduce KG‑Reasoner, an end‑to‑end framework that integrates multi‑step reasoning into a unified "thinking" phase of a Reasoning LLM. Through Reinforcement Learning (RL), the LLM is trained to internalize the KG traversal process, enabling it to dynamically explore reasoning paths, and perform backtracking when necessary. Experiments on eight multi‑hop and knowledge‑intensive reasoning benchmarks demonstrate that KG‑Reasoner achieves competitive or superior performance compared to the state‑of‑the‑art methods. Codes are available at the repository: https://github.com/Wangshuaiia/KG‑Reasoner.
Authors:SungHo Kim, Juhyeong Park, Eda Atalay, SangKeun Lee
Abstract:
Korean is a morphologically rich language with a featural writing system in which each character is systematically composed of subcharacter units known as Jamo. These subcharacters not only determine the visual structure of Korean but also encode frequent and linguistically meaningful morphophonological processes. However, most current Korean language models (LMs) are based on subword tokenization schemes, which are not explicitly designed to capture the internal compositional structure of characters. To address this limitation, we propose SCRIPT, a model‑agnostic module that injects subcharacter compositional knowledge into Korean PLMs. SCRIPT allows to enhance subword embeddings with structural granularity, without requiring architectural changes or additional pre‑training. As a result, SCRIPT enhances all baselines across various Korean natural language understanding (NLU) and generation (NLG) tasks. Moreover, beyond performance gains, detailed linguistic analyses show that SCRIPT reshapes the embedding space in a way that better captures grammatical regularities and semantically cohesive variations. Our code is available at https://github.com/SungHo3268/SCRIPT.
Authors:Houxing Ren, Mingjie Zhan, Zimu Lu, Ke Wang, Yunqiao Yang, Haotian Hou, Hongsheng Li
Abstract:
Spreadsheets are central to real‑world applications such as enterprise reporting, auditing, and scientific data management. Despite their ubiquity, existing large language model based approaches typically treat tables as plain text, overlooking critical layout cues and visual semantics. Moreover, real‑world spreadsheets are often massive in scale, exceeding the input length that LLMs can efficiently process. To address these challenges, we propose SpreadsheetAgent, a two‑stage multi‑agent framework for spreadsheet understanding that adopts a step‑by‑step reading and reasoning paradigm. Instead of loading the entire spreadsheet at once, SpreadsheetAgent incrementally interprets localized regions through multiple modalities, including code execution results, images, and LaTeX tables. The method first constructs a structural sketch and row/column summaries, and then performs task‑driven reasoning over this intermediate representation in the Solving Stage. To further enhance reliability, we design a verification module that validates extracted structures via targeted inspections, reducing error propagation and ensuring trustworthy inputs for downstream reasoning. Extensive experiments on two spreadsheet datasets demonstrate the effectiveness of our approach. With GPT‑OSS‑120B, SpreadsheetAgent achieves 38.16% on Spreadsheet Bench, outperforming the ChatGPT Agent baseline (35.27%) by 2.89 absolute points. These results highlight the potential of SpreadsheetAgent to advance robust and scalable spreadsheet understanding in real‑world applications. Code is available at https://github.com/renhouxing/SpreadsheetAgent.git.
Authors:Zaoyu Chen, Jianbo Dai, Boyu Zhu, Jingdong Wang, Huiming Wang, Xin Xu, Haoyang Yuan, Zhijiang Guo, Xiao-Ming Wu
Abstract:
Large language models (LLMs) can generate code from natural language, but the extent to which they capture intended program behavior remains unclear. Executable behavioral specifications, defined via preconditions and postconditions, provide a concrete means to assess such understanding. However, existing work on specification generation is constrained in evaluation methodology, task settings, and specification expressiveness. We introduce CodeSpecBench, a benchmark for executable behavioral specification generation under an execution‑based evaluation protocol. CodeSpecBench supports both function‑level and repository‑level tasks and encodes specifications as executable Python functions. Constructed from diverse real‑world codebases, it enables a realistic assessment of both correctness (accepting valid behaviors) and completeness (rejecting invalid behaviors). Evaluating 15 state‑of‑the‑art LLMs on CodeSpecBench, we observe a sharp performance degradation on repository‑level tasks, where the best model attains only a 20.2% pass rate. We further find that specification generation is substantially more challenging than code generation, indicating that strong coding performance does not necessarily reflect deep understanding of intended program semantics. Our data and code are available at https://github.com/SparksofAGI/CodeSpecBench.
Authors:Ziqing Wang, Yibo Wen, Abhishek Pandy, Han Liu, Kaize Ding
Abstract:
In drug discovery, molecular optimization aims to iteratively refine a lead compound to improve molecular properties while preserving structural similarity to the original molecule. However, each oracle evaluation is expensive, making sample efficiency a key challenge for existing methods under a limited oracle budget. Trial‑and‑error approaches require many oracle calls, while methods that leverage external knowledge tend to reuse familiar templates and struggle on challenging objectives. A key missing piece is long‑term memory that can ground decisions and provide reusable insights for future optimizations. To address this, we present MolMem (Molecular optimization with Memory), a multi‑turn agentic reinforcement learning (RL) framework with a dual‑memory system. Specifically, MolMem uses Static Exemplar Memory to retrieve relevant exemplars for cold‑start grounding, and Evolving Skill Memory to distill successful trajectories into reusable strategies. Built on this memory‑augmented formulation, we train the policy with dense step‑wise rewards, turning costly rollouts into long‑term knowledge that improves future optimization. Extensive experiments show that MolMem achieves 90% success on single‑property tasks (1.5× over the best baseline) and 52% on multi‑property tasks using only 500 oracle calls. Our code is available at https://github.com/REAL‑Lab‑NU/MolMem.
Authors:Jiayi Xin, Xiang Li, Evan Qiang, Weiqing He, Tianqi Shang, Weijie J. Su, Qi Long
Abstract:
In‑context learning (ICL) performance depends critically on which demonstrations are placed in the prompt, yet most existing selectors prioritize heuristic notions of relevance or diversity and provide limited insight into the coverage of a demonstration set. We propose Unseen Coverage Selection (UKS), a training‑free, subset‑level coverage prior motivated by the principle that a good demonstration set should expose the model to latent cluster unrevealed by the currently selected subset. UCS operationalizes this idea by (1) inducing discrete latent clusters from model‑consistent embeddings and (2) estimating the number of unrevealed clusters within a candidate subset via a Smoothed Good‑‑Turing estimator from its empirical frequency spectrum. Unlike previous selection methods, UCS is coverage‑based and training‑free, and can be seamlessly combined with both query‑dependent and query‑independent selection baselines via a simple regularized objective. Experiments on multiple intent‑classification and reasoning benchmarks with frontier Large Language Models show that augmenting strong baselines with UCS consistently improves ICL accuracy by up to 2‑6% under the same selection budget, while also yielding insights into task‑ and model‑level latent cluster distributions. Code is available at https://github.com/Raina‑Xin/UCS.
Authors:Manas Pathak, Xingyao Chen, Shuozhe Li, Amy Zhang, Liu Leqi
Abstract:
Should we trust Large Language Models (LLMs) with high accuracy? LLMs achieve high accuracy on reasoning benchmarks, but correctness alone does not reveal the quality of the reasoning used to produce it. This highlights a fundamental limitation of outcome‑based evaluation: models may arrive at correct answers through flawed reasoning, and models with substantially different reasoning capabilities can nevertheless exhibit similar benchmark accuracy, for example due to memorization or over‑optimization. In this paper, we ask: given existing benchmarks, can we move beyond outcome‑based evaluation to assess the quality of reasoning itself? We seek metrics that (1) differentiate models with similar accuracy and (2) are robust to variations in input prompts and generation configurations. To this end, we propose a reasoning score that evaluates reasoning traces along dimensions such as faithfulness, coherence, utility, and factuality. A remaining question is how to aggregate this score across multiple sampled traces. Naively averaging them is undesirable, particularly in long‑horizon settings, where the number of possible trajectories grows rapidly, and low‑confidence correct traces are more likely to be coincidental. To address this, we introduce the Filtered Reasoning Score (FRS), which computes reasoning quality using only the top‑K% most confident traces. Evaluating with FRS, models that are indistinguishable under standard accuracy exhibit significant differences in reasoning quality. Moreover, models with higher FRS on one benchmark tend to perform better on other reasoning benchmarks, in both accuracy and reasoning quality. Together, these findings suggest that FRS complements accuracy by capturing a model's transferable reasoning capabilities. We open source our evaluation codebase: https://github.com/Manas2006/benchmark_reproducibility.
Authors:Chenxi Qing, Junxi Wu, Zheng Liu, Yixiang Qiu, Hongyao Yu, Bin Chen, Hao Wu, Shu-Tao Xia
Abstract:
Recently, large language models (LLMs) are capable of generating highly fluent textual content. While they offer significant convenience to humans, they also introduce various risks, like phishing and academic dishonesty. Numerous research efforts have been dedicated to developing algorithms for detecting AI‑generated text and constructing relevant datasets. However, in the domain of Chinese corpora, challenges remain, including limited model diversity and data homogeneity. To address these issues, we propose C‑ReD: a comprehensive Chinese Real‑prompt AI‑generated Detection benchmark. Experiments demonstrate that C‑ReD not only enables reliable in‑domain detection but also supports strong generalization to unseen LLMs and external Chinese datasets‑addressing critical gaps in model diversity, domain coverage, and prompt realism that have limited prior Chinese detection benchmarks. We release our resources at https://github.com/HeraldofLight/C‑ReD.
Authors:Yuxin Chen, Chumeng Liang, Hangke Sui, Ruihan Guo, Chaoran Cheng, Jiaxuan You, Ge Liu
Abstract:
Continuous diffusion has been the foundation of high‑fidelity, controllable, and few‑step generation of many data modalities such as images. However, in language modeling, prior continuous diffusion language models (DLMs) lag behind discrete counterparts due to the sparse data space and the underexplored design space. In this work, we close this gap with LangFlow, the first continuous DLM to rival discrete diffusion, by connecting embedding‑space DLMs to Flow Matching via Bregman divergence, alongside three key innovations: (1) we derive a novel ODE‑based NLL bound for principled evaluation of continuous flow‑based language models; (2) we propose an information‑uniform principle for setting the noise schedule, which motivates a learnable noise scheduler based on a Gumbel distribution; and (3) we revise prior training protocols by incorporating self‑conditioning, as we find it improves both likelihood and sample quality of embedding‑space DLMs with effects substantially different from discrete diffusion. Putting everything together, LangFlow rivals top discrete DLMs on both the perplexity (PPL) and the generative perplexity (Gen. PPL), reaching a PPL of 30.0 on LM1B and 24.6 on OpenWebText. It even exceeds autoregressive baselines in zero‑shot transfer on 4 out of 7 benchmarks. LangFlow provides the first clear evidence that continuous diffusion is a promising paradigm for language modeling. Homepage: https://github.com/nealchen2003/LangFlow
Authors:Liujie Zhang, Benzhe Ning, Rui Yang, Xiaoyan Yu, Jiaxing Li, Lumeng Wu, Jia Liu, Minghao Li, Weihang Chen, Weiqi Hu, Lei Zhang
Abstract:
Reinforcement learning (RL) post‑training has proven effective at unlocking reasoning, self‑reflection, and tool‑use capabilities in large language models. As models extend to omni‑modal inputs and agentic multi‑turn workflows, RL training systems face three interdependent challenges: heterogeneous data flows, operational robustness at scale, and the staleness ‑‑ throughput tradeoff. We present Relax (Reinforcement Engine Leveraging Agentic X‑modality), an open‑source RL training engine that addresses these challenges through three co‑designed architectural layers. First, an \emphomni‑native architecture builds multimodal support into the full stack ‑‑ from data preprocessing and modality‑aware parallelism to inference generation ‑‑ rather than retrofitting it onto a text‑centric pipeline. Second, each RL role runs as an independent, fault‑isolated service that can be scaled, recovered, and upgraded without global coordination. Third, service‑level decoupling enables asynchronous training via the TransferQueue data bus, where a single staleness parameter smoothly interpolates among on‑policy, near‑on‑policy, and fully asynchronous execution. Relax achieves a 1.20× end‑to‑end speedup over veRL on Qwen3‑4B on‑policy training. Its fully async mode delivers a 1.76× speedup over colocate on Qwen3‑4B and a 2.00× speedup on Qwen3‑Omni‑30B, while all modes converge to the same reward level. Relax supports R3 (Rollout Routing Replay)~\citema2025r3 for MoE models with only 1.9% overhead, compared to 32% degradation in veRL under the same configuration. It further demonstrates stable omni‑modal RL convergence on Qwen3‑Omni across image, text, and audio, sustaining over 2,000 steps on video without degradation. Relax is available at https://github.com/rednote‑ai/Relax.
Authors:Pengfeng Li, Chen Huang, Chaoqun Hao, Hongyao Chen, Xiao-Yong Wei, Wenqiang Lei, See-Kiong Ng
Abstract:
Contextual causal reasoning is a critical yet challenging capability for Large Language Models (LLMs). Existing benchmarks, however, often evaluate this skill in fragmented settings, failing to ensure context consistency or cover the full causal hierarchy. To address this, we pioneer METER to systematically benchmark LLMs across all three levels of the causal ladder under a unified context setting. Our extensive evaluation of various LLMs reveals a significant decline in proficiency as tasks ascend the causal hierarchy. To diagnose this degradation, we conduct a deep mechanistic analysis via both error pattern identification and internal information flow tracing. Our analysis reveals two primary failure modes: (1) LLMs are susceptible to distraction by causally irrelevant but factually correct information at lower level of causality; and (2) as tasks ascend the causal hierarchy, faithfulness to the provided context degrades, leading to a reduced performance. We belive our work advances our understanding of the mechanisms behind LLM contextual causal reasoning and establishes a critical foundation for future research. Our code and dataset are available at https://github.com/SCUNLP/METER .
Authors:Haofu Yang, Jiaji Liu, Chen Huang, Faguo Wu, Wenqiang Lei, See-Kiong Ng
Abstract:
Developing non‑collaborative dialogue agents traditionally requires the manual, unscalable codification of expert strategies. We propose \ours, a method that leverages large language models to autonomously induce both strategy actions and planning logic directly from raw transcripts. METRO formalizes expert knowledge into a Strategy Forest, a hierarchical structure that captures both short‑term responses (nodes) and long‑term strategic foresight (branches). Experimental results across two benchmarks show that METRO demonstrates promising performance, outperforming existing methods by an average of 9%‑10%. Our further analysis not only reveals the success behind METRO (strategic behavioral diversity and foresight), but also demonstrates its robust cross‑task transferability. This offers new insights into building non‑collaborative agents in a cost‑effective and scalable way. Our code is available at https://github.com/Humphrey‑0125/METRO.
Authors:Bo Li, Mingda Wang, Gexiang Fang, Shikun Zhang, Wei Ye
Abstract:
We revisit retrieval‑augmented generation (RAG) by embedding retrieval control directly into generation. Instead of treating retrieval as an external intervention, we express retrieval decisions within token‑level decoding, enabling end‑to‑end coordination without additional controllers or classifiers. Under the paradigm of Retrieval as Generation, we propose GRIP (Generation‑guided Retrieval with Information Planning), a unified framework in which the model regulates retrieval behavior through control‑token emission. Central to GRIP is Self‑Triggered Information Planning, which allows the model to decide when to retrieve, how to reformulate queries, and when to terminate, all within a single autoregressive trajectory. This design tightly couples retrieval and reasoning and supports dynamic multi‑step inference with on‑the‑fly evidence integration. To supervise these behaviors, we construct a structured training set covering answerable, partially answerable, and multi‑hop queries, each aligned with specific token patterns. Experiments on five QA benchmarks show that GRIP surpasses strong RAG baselines and is competitive with GPT‑4o while using substantially fewer parameters.
Authors:Minh-Vuong Nguyen, Fatemeh Shiri, Zhuang Li, Karin Verspoor
Abstract:
Large Language Models (LLMs) are increasingly being explored for clinical question answering and decision support, yet safe deployment critically requires reliable handling of patient measurements in heterogeneous clinical notes. Existing evaluations of LLMs for clinical numerical reasoning provide limited operation‑level coverage, restricted primarily to arithmetic computation, and rarely assess the robustness of numerical understanding across clinical note formats. We introduce ClinicNumRobBench, a benchmark of 1,624 context‑question instances with ground‑truth answers that evaluates four main types of clinical numeracy: value retrieval, arithmetic computation, relational comparison, and aggregation. To stress‑test robustness, ClinicNumRobBench presents longitudinal MIMIC‑IV vital‑sign records in three semantically equivalent representations, including a real‑world note‑style variant derived from the Open Patients dataset, and instantiates queries using 42 question templates. Experiments on 17 LLMs show that value retrieval is generally strong, with most models exceeding 85% accuracy, while relational comparison and aggregation remain challenging, with some models scoring below 15%. Fine‑tuning on medical data can reduce numeracy relative to base models by over 30%, and performance drops under note‑style variation indicate LLM sensitivity to format. ClinicNumRobBench offers a rigorous testbed for clinically reliable numerical reasoning. Code and data URL are available on https://github.com/MinhVuong2000/ClinicNumRobBench.
Authors:Chen Huang, Zitan Jiang, Changyi Zou, Wenqiang Lei, See-Kiong Ng
Abstract:
Customer service chatbots are increasingly expected to serve not merely as reactive support tools for users, but as strategic interfaces for harvesting high‑value information and business intelligence. In response, we make three main contributions. 1) We introduce and define a novel task of Proactive Information Probing, which optimizes when to probe users for pre‑specified target information while minimizing conversation turns and user friction. 2) We propose PROCHATIP, a proactive chatbot framework featuring a specialized conversation strategy module trained to master the delicate timing of probes. 3) Experiments demonstrate that PROCHATIP significantly outperforms baselines, exhibiting superior capability in both information probing and service quality. We believe that our work effectively redefines the commercial utility of chatbots, positioning them as scalable, cost‑effective engines for proactive business intelligence. Our code is available at https://github.com/SCUNLP/PROCHATIP.
Authors:Victor De Lima, Grace Hui Yang
Abstract:
Most conversational agents (CAs) are designed to satisfy user needs through user‑driven interactions. However, many real‑world settings, such as academic interviewing, judicial proceedings, and journalistic investigations, involve broader institutional decision‑making processes and require agents that can elicit information from users. In this paper, we introduce Information Elicitation Agents (IEAs) in which the agent's goal is to elicit information from users to support the agent's institutional or task‑oriented objectives. To enable systematic research on this setting, we present YIELD, a 26M‑token dataset of 2,281 ethically sourced, human‑to‑human dialogues. Moreover, we formalize information elicitation as a finite‑horizon POMDP and propose novel metrics tailored to IEAs. Pilot experiments on multiple foundation LLMs show that training on YIELD improves their alignment with real elicitation behavior and findings are corroborated by human evaluation. We release YIELD under CC BY 4.0. The dataset, project code, evaluation tools, and fine‑tuned model adapters are available at: https://github.com/infosenselab/yield.
Authors:Zihao Cheng, Zeming Liu, Yingyu Shan, Xinyi Wang, Xiangrong Zhu, Yunpu Ma, Hongru Wang, Yuhang Guo, Wei Lin, Yunhong Wang
Abstract:
While large language model‑‑powered agents can self‑evolve by accumulating experience or by dynamically creating new assets (i.e., tools or expert agents), existing frameworks typically treat these two evolutionary processes in isolation. This separation overlooks their intrinsic interdependence: the former is inherently bounded by a manually predefined static toolset, while the latter generates new assets from scratch without experiential guidance, leading to limited capability growth and unstable evolution. To address this limitation, we introduce a novel paradigm of co‑evolutionary Capability Expansion and Experience Distillation. Guided by this paradigm, we propose the Mem^2Evolve, which integrates two core components: Experience Memory and Asset Memory. Specifically, Mem^2Evolve leverages accumulated experience to guide the dynamic creation of assets, thereby expanding the agent's capability space while simultaneously acquiring new experience to achieve co‑evolution. Extensive experiments across 6 task categories and 8 benchmarks demonstrate that Mem^2Evolve achieves improvement of 18.53% over standard LLMs, 11.80% over agents evolving solely through experience, and 6.46% over those evolving solely through asset creation, establishing it as a substantially more effective and stable self‑evolving agent framework. Code is available at: https://buaa‑irip‑llm.github.io/Mem2Evolve.
Authors:Xiaomeng Hu, Yinger Zhang, Fei Huang, Jianhong Tu, Yang Su, Lianghao Deng, Yuxuan Liu, Yantao Liu, Dayiheng Liu, Tsung-Yi Ho
Abstract:
AI agents are expected to perform professional work across hundreds of occupational domains (from emergency department triage to nuclear reactor safety monitoring to customs import processing), yet existing benchmarks can only evaluate agents in the few domains where public environments exist. We introduce OccuBench, a benchmark covering 100 real‑world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language Environment Simulators (LESs) that simulate domain‑specific environments through LLM‑driven tool response generation. Our multi‑agent synthesis pipeline automatically produces evaluation instances with guaranteed solvability, calibrated difficulty, and document‑grounded diversity. OccuBench evaluates agents along two complementary dimensions: task completion across professional domains and environmental robustness under controlled fault injection (explicit errors, implicit data degradation, and mixed faults). We evaluate 15 frontier models across 8 model families and find that: (1) no single model dominates all industries, as each has a distinct occupational capability profile; (2) implicit faults (truncated data, missing fields) are harder than both explicit errors (timeouts, 500s) and mixed faults, because they lack overt error signals and require the agent to independently detect data degradation; (3) larger models, newer generations, and higher reasoning effort consistently improve performance. GPT‑5.2 improves by 27.5 points from minimal to maximum reasoning effort; and (4) strong agents are not necessarily strong environment simulators. Simulator quality is critical for LES‑based evaluation reliability. OccuBench provides the first systematic cross‑industry evaluation of AI agents on professional occupational tasks.
Authors:Chirag Shinde
Abstract:
We propose two complementary modifications to transformer attention blocks. First, a non‑linear pre‑projection MLP is inserted between layer norm and Q/K/V projections, constructing richer features in a position‑agnostic manner before any positional encoding is applied. Second, a content skip connection routes the pre‑projection's features around the attention mechanism, allowing content information to bypass position‑aware attention where beneficial. In frozen‑probe experiments on Pythia‑160M and 410M, the combined approach achieves the strongest results across methods: +40.6% LAMBADA accuracy and ‑39% perplexity at 160M scale. Learned skip connection weights reveal a consistent pattern across model sizes: later transformer layers activate the content bypass more strongly than earlier layers, suggesting that deeper layers benefit from content information that does not pass through positional attention. All modifications add no K/V cache overhead.
Authors:Shijia Xu, Yu Wang, Xiaolong Jia, Zhou Wu, Kai Liu, April Xiaowen Dong
Abstract:
Despite the widespread adoption of Large Language Models (LLMs) in Legal AI, their utility for automated contract revision remains impeded by hallucinated safety and a lack of rigorous behavioral constraints. To address these limitations, we propose the Risk‑Constrained Bilevel Stackelberg Framework (RCBSF), which formulates revision as a non‑cooperative Stackelberg game. RCBSF establishes a hierarchical Leader Follower structure where a Global Prescriptive Agent (GPA) imposes risk budgets upon a follower system constituted by a Constrained Revision Agent (CRA) and a Local Verification Agent (LVA) to iteratively optimize output. We provide theoretical guarantees that this bilevel formulation converges to an equilibrium yielding strictly superior utility over unguided configurations. Empirical validation on a unified benchmark demonstrates that RCBSF achieves state‑of‑the‑art performance, surpassing iterative baselines with an average Risk Resolution Rate (RRR) of 84.21% while enhancing token efficiency. Our code is available at https://github.com/xjiacs/RCBSF .
Authors:Jyoutir Raj, John Conway
Abstract:
Existing multilingual benchmarks include Irish among dozens of languages but apply no Irish‑aware text normalisation, leaving reliable and reproducible ASR comparison impossible. We introduce BlasBench, an open evaluation harness that provides a standalone Irish‑aware normaliser preserving fadas, lenition, and eclipsis; a reproducible scoring harness and per‑utterance predictions released for all evaluated runs. We pilot this by benchmarking 12 systems across four architecture families on Common Voice ga‑IE and FLEURS ga‑IE. All Whisper variants exceed 100% WER through insertion‑driven hallucination. Microsoft Azure reaches 22.2% WER on Common Voice and 57.5% on FLEURS; the best open model, Omnilingual ASR 7B, reaches 30.65% and 39.09% respectively. Models fine‑tuned on Common Voice degrade 33‑43 points moving to FLEURS, while massively multilingual models degrade only 7‑10 ‑ a generalisation gap that single‑dataset evaluation misses.
Authors:Shijia Xu, Zhou Wu, Xiaolong Jia, Yu Wang, Kai Liu, April Xiaowen Dong
Abstract:
Retrieval‑augmented generation (RAG) substantially extends the knowledge boundary of large language models. However, it still faces two major challenges when handling complex reasoning tasks: low context utilization and frequent hallucinations. To address these issues, we propose Self‑Correcting RAG, a unified framework that reformulates retrieval and generation as constrained optimization and path planning. On the input side, we move beyond traditional greedy retrieval and, for the first time, formalize context selection as a multi‑dimensional multiple‑choice knapsack problem (MMKP), thereby maximizing information density and removing redundancy under a strict token budget. On the output side, we introduce a natural language inference (NLI)‑guided Monte Carlo Tree Search (MCTS) mechanism, which leverages test‑time compute to dynamically explore reasoning trajectories and validate the faithfulness of generated answers. Experiments on six multi‑hop question answering and fact‑checking datasets demonstrate that our method significantly improves reasoning accuracy on complex queries while effectively reducing hallucinations, outperforming strong existing baselines.Our code is available at https://github.com/xjiacs/Self‑Correcting‑RAG .
Authors:Hao Wang, Guozhi Wang, Han Xiao, Yufeng Zhou, Yue Pan, Jichao Wang, Ke Xu, Yafei Wen, Xiaohu Ruan, Xiaoxin Chen, Honggang Qi
Abstract:
Reinforcement learning (RL) has been widely used to train LLM agents for multi‑turn interactive tasks, but its sample efficiency is severely limited by sparse rewards and long horizons. On‑policy self‑distillation (OPSD) alleviates this by providing dense token‑level supervision from a privileged teacher that has access to ground‑truth answers. However, such fixed privileged information cannot capture the diverse valid strategies in agent tasks, and naively combining OPSD with RL often leads to training collapse. To address these limitations, we introduce Skill‑SD, a framework that turns the agent's own trajectories into dynamic training‑only supervision. Completed trajectories are summarized into compact natural language skills that describe successful behaviors, mistakes, and workflows. These skills serve as dynamic privileged information conditioning only the teacher, while the student always acts under the plain task prompt and learns to internalize the guidance through distillation. To stabilize the training, we derive an importance‑weighted reverse‑KL loss to provide gradient‑correct token‑level distillation, and dynamically synchronize the teacher with the improving student. Experimental results on agentic benchmarks demonstrate that Skill‑SD substantially outperforms the standard RL baseline, improving both vanilla GRPO (+14.0%/+10.9% on AppWorld/Sokoban) and vanilla OPD (+42.1%/+40.6%). Project page: https://k1xe.github.io/skill‑sd/
Authors:Zhengnan Guo, Fei Tan
Abstract:
While Diffusion Large Language Models (dLLMs) have emerged as a promising non‑autoregressive paradigm comparable to autoregressive (AR) models, their faithfulness, specifically regarding hallucination, remains largely underexplored. To bridge this gap, we present the first controlled comparative study to evaluate hallucination patterns in dLLMs. Our results demonstrate that current dLLMs exhibit a higher propensity for hallucination than AR counterparts controlled for architecture, scale, and pre‑training weights. Furthermore, an analysis of inference‑time compute reveals divergent dynamics: while quasi‑autoregressive generation suffers from early saturation, non‑sequential decoding unlocks potential for continuous refinement. Finally, we identify distinct failure modes unique to the diffusion process, including premature termination, incomplete denoising, and context intrusion. Our findings underscore that although dLLMs have narrowed the performance gap on general tasks, their distinct hallucination mechanisms pose a critical challenge to model reliability. Our code is available at https://github.com/ZeroLoss‑Lab/Lost‑in‑Diffusion
Authors:Avi-ad Avraam Buskila
Abstract:
Incorporating large language models (LLMs) in medical question answering demands more than high average accuracy: a model that returns substantively different answers each time it is queried is not a reliable medical tool. Online health communities such as Reddit have become a primary source of medical information for millions of users, yet they remain highly susceptible to misinformation; deploying LLMs as assistants in these settings amplifies the need for output consistency alongside correctness. We present a practical, open‑source evaluation framework for assessing small, locally‑deployable open‑weight LLMs on medical question answering, treating reproducibility as a first‑class metric alongside lexical and semantic accuracy. Our pipeline computes eight quality metrics, including BERTScore, ROUGE‑L, and an LLM‑as‑judge rubric, together with two within‑model reproducibility metrics derived from repeated inference (N=10 runs per question). Evaluating three models (Llama 3.1 8B, Gemma 3 12B, MedGemma 1.5 4B) on 50 MedQuAD questions (N=1,500 total responses) reveals that despite low‑temperature generation (T=0.2), self‑agreement across runs reaches at most 0.20, while 87‑97% of all outputs per model are unique ‑‑ a safety gap that single‑pass benchmarks entirely miss. The clinically fine‑tuned MedGemma 1.5 4B underperforms the larger general‑purpose models on both quality and reproducibility; however, because MedGemma is also the smallest model, this comparison confounds domain fine‑tuning with model scale. We describe the methodology in sufficient detail for practitioners to replicate or extend the evaluation for their own model‑selection workflows. All code and data pipelines are available at https://github.com/aviad‑buskila/llm_medical_reproducibility.
Authors:Hung-Ting Su, Ting-Jun Wang, Jia-Fong Yeh, Min Sun, Winston H. Hsu
Abstract:
Conventional Vision‑and‑Language Navigation (VLN) benchmarks assume instructions are feasible and the referenced target exists, leaving agents ill‑equipped to handle false‑premise goals. We introduce VLN‑NF, a benchmark with false‑premise instructions where the target is absent from the specified room and agents must navigate, gather evidence through in‑room exploration, and explicitly output NOT‑FOUND. VLN‑NF is constructed via a scalable pipeline that rewrites VLN instructions using an LLM and verifies target absence with a VLM, producing plausible yet factually incorrect goals. We further propose REV‑SPL to jointly evaluate room reaching, exploration coverage, and decision correctness. To address this challenge, we present ROAM, a two‑stage hybrid that combines supervised room‑level navigation with LLM/VLM‑driven in‑room exploration guided by a free‑space clearance prior. ROAM achieves the best REV‑SPL among compared methods, while baselines often under‑explore and terminate prematurely under unreliable instructions. VLN‑NF project page can be found at https://vln‑nf.github.io/.
Authors:Suyoung Bae, CheolWon Na, Jaehoon Lee, Yumin Lee, YunSeok Choi, Jee-Hyong Lee
Abstract:
As Large Language Models (LLMs) have become capable of generating long and descriptive code summaries, accurate and reliable evaluation of factual consistency has become a critical challenge. However, previous evaluation methods are primarily designed for short summaries of isolated code snippets. Consequently, they struggle to provide fine‑grained evaluation of multi‑sentence functionalities and fail to accurately assess dependency context commonly found in real‑world code summaries. To address this, we propose ReFEree, a reference‑free and fine‑grained method for evaluating factual consistency in real‑world code summaries. We define factual inconsistency criteria specific to code summaries and evaluate them at the segment level using these criteria along with dependency information. These segment‑level results are then aggregated into a fine‑grained score. We construct a code summarization benchmark with human‑annotated factual consistency labels. The evaluation results demonstrate that ReFEree achieves the highest correlation with human judgment among 13 baselines, improving 15‑18% over the previous state‑of‑the‑art. Our code and data are available at https://github.com/bsy99615/ReFEree.git.
Authors:Bo Li, Mingda Wang, Shikun Zhang, Wei Ye
Abstract:
Instruction tuning relies on large instruction‑response corpora whose quality and composition strongly affect downstream performance. We propose Answer Divergence‑Guided Selection (ADG), which selects instruction data based on the geometric structure of multi‑sample outputs. ADG draws several high‑temperature generations per instruction, maps responses into an embedding space, and computes an output divergence score that jointly encodes dispersion magnitude and shape anisotropy. High scores correspond to instructions whose answers are both far apart and multi‑modal, rather than clustered paraphrases along a single direction. Across two backbones and three public instruction pools, fine‑tuning on only 10K ADG‑selected examples consistently outperforms strong selectors on six benchmarks spanning reasoning, knowledge, and coding. Analyses further show that both dispersion magnitude and shape anisotropy are necessary, supporting answer divergence as a practical signal for instruction data selection. Code and appendix are included in the supplementary materials.
Authors:Jingru Li, Wei Ren, Tianqing Zhu
Abstract:
Large Vision‑Language Models (LVLMs) rely on attention‑based retrieval of safety instructions to maintain alignment during generation. Existing attacks typically optimize image perturbations to maximize harmful output likelihood, but suffer from slow convergence due to gradient conflict between adversarial objectives and the model's safety‑retrieval mechanism. We propose Attention‑Guided Visual Jailbreaking, which circumvents rather than overpowers safety alignment by directly manipulating attention patterns. Our method introduces two simple auxiliary objectives: (1) suppressing attention to alignment‑relevant prefix tokens and (2) anchoring generation on adversarial image features. This simple yet effective push‑pull formulation reduces gradient conflict by 45% and achieves 94.4% attack success rate on Qwen‑VL (vs. 68.8% baseline) with 40% fewer iterations. At tighter perturbation budgets (ε=8/255), we maintain 59.0% ASR compared to 45.7% for standard methods. Mechanistic analysis reveals a failure mode we term safety blindness: successful attacks suppress system‑prompt attention by 80%, causing models to generate harmful content not by overriding safety rules, but by failing to retrieve them.
Authors:Zae Myung Kim, Dongseok Lee, Jaehyung Kim, Vipul Raheja, Dongyeop Kang
Abstract:
Existing tool‑use benchmarks for LLM agents are overwhelmingly linear: our analysis of six benchmarks shows 55 to 100% of instances are simple chains of 2 to 5 steps. We introduce The Amazing Agent Race (AAR), a benchmark featuring directed acyclic graph (DAG) puzzles (or "legs") with fork‑merge tool chains. We release 1,400 instances across two variants: sequential (800 legs) and compositional (600 DAG legs). Agents must navigate Wikipedia, execute multi‑step tool chains, and aggregate results into a verifiable answer. Legs are procedurally generated from Wikipedia seeds across four difficulty levels with live‑API validation. Three complementary metrics (finish‑line accuracy, pit‑stop visit rate, and roadblock completion rate) separately diagnose navigation, tool‑use, and arithmetic failures. Evaluating three agent frameworks on 1,400 legs, the best achieves only 37.2% accuracy. Navigation errors dominate (27 to 52% of trials) while tool‑use errors remain below 17%, and agent architecture matters as much as model scale (Claude Code matches Codex CLI at 37% with 6x fewer tokens). The compositional structure of AAR reveals that agents fail not at calling tools but at navigating to the right pages, a blind spot invisible to linear benchmarks. The project page can be accessed at: https://minnesotanlp.github.io/the‑amazing‑agent‑race
Authors:Jiang Li, Tian Lan, Shanshan Wang, Dongxing Zhang, Dianqing Lin, Guanglai Gao, Derek F. Wong, Xiangdong Su
Abstract:
The rapid development of large language models (LLMs) has extended text generation tasks into the literary domain. However, AI‑generated literary creations has raised increasingly prominent issues of creative authenticity and ethics in literary world, making the detection of LLM‑generated literary texts essential and urgent. While previous works have made significant progress in detecting AI‑generated text, it has yet to address classical Chinese poetry. Due to the unique linguistic features of classical Chinese poetry, such as strict metrical regularity, a shared system of poetic imagery, and flexible syntax, distinguishing whether a poem is authored by AI presents a substantial challenge. To address these issues, we introduce ChangAn, a benchmark for detecting LLM‑generated classical Chinese poetry that containing total 30,664 poems, 10,276 are human‑written poems and 20,388 poems are generated by four popular LLMs. Based on ChangAn, we conducted a systematic evaluation of 12 AI detectors, investigating their performance variations across different text granularities and generation strategies. Our findings highlight the limitations of current Chinese text detectors, which fail to serve as reliable tools for detecting LLM‑generated classical Chinese poetry. These results validate the effectiveness and necessity of our proposed ChangAn benchmark. Our dataset and code are available at https://github.com/VelikayaScarlet/ChangAn.
Authors:Utshab Kumar Ghosh, Ashish David, Shubham Chatterjee
Abstract:
Reproducibility must validate architectural robustness, not just numerical accuracy. We evaluate ColBERT‑v2 and ConstBERT across five dimensions, finding that while ConstBERT reproduces within 0.05% MRR@10 on MS‑MARCO, both models show a drop of 86‑97% on long, narrative queries (TREC ToT 2025). Ablations prove this failure is architectural: performance plateaus at 20 words because the MaxSim operator's uniform token weighting cannot distinguish signal from filler noise. Furthermore, undocumented backend parameters create an 8‑point gap due to ConstBERT's sparse centroid coverage, and fine‑tuning with 3x more data actually degrades performance by up to 29%. We conclude that architectural constraints in multi‑vector retrieval cannot be overcome by adaptation alone. Code: https://github.com/utshabkg/multi‑vector‑reproducibility.
Authors:Tianfu Wang, Leilei Ding, Ziyang Tao, Yi Zhan, Zhiyuan Ma, Wei Wu, Yuxuan Lei, Yuan Feng, Junyang Wang, Yin Wu, Yizhao Xu, Hongyuan Zhu, Qi Liu, Nicholas Jing Yuan, Yanyong Zhang, Hui Xiong
Abstract:
High‑fidelity diagram creation requires the complex orchestration of semantic topology, visual styling, and spatial layout, posing a significant challenge for automated systems. Existing methods also suffer from a representation gap: pixel‑based models often lack precise control, while code‑based synthesis limits intuitive flexibility. To bridge this gap, we introduce EvoDiagram, an agentic framework that generates object‑level editable diagrams via an intermediate canvas schema. EvoDiagram employs a coordinated multi‑agent system to decouple semantic intent from rendering logic, resolving conflicts across heterogeneous design layers. Additionally, we propose a design knowledge evolution mechanism that distills execution traces into a hierarchical memory of domain guidelines, enabling agents to retrieve context‑aware expertise adaptively. We further release CanvasBench, a benchmark consisting of both data and metrics for canvas‑based diagramming. Extensive experiments demonstrate that EvoDiagram exhibits excellent performance and balance against baselines in generating editable, structurally consistent, and aesthetically coherent diagrams. Our code is available at https://github.com/AuraX‑AI/EvoDiagram.
Authors:Jon M Laurent, Albert Bou, Michael Pieler, Conor Igoe, Alex Andonian, Siddharth Narayanan, James Braza, Alexandros Sanchez Vassopoulos, Jacob L Steenwyk, Blake Lash, Andrew D White, Samuel G Rodriques
Abstract:
Optimism for accelerating scientific discovery with AI continues to grow. Current applications of AI in scientific research range from training dedicated foundation models on scientific data to agentic autonomous hypothesis generation systems to AI‑driven autonomous labs. The need to measure progress of AI systems in scientific domains correspondingly must not only accelerate, but increasingly shift focus to more real‑world capabilities. Beyond rote knowledge and even just reasoning to actually measuring the ability to perform meaningful work. Prior work introduced the Language Agent Biology Benchmark LAB‑Bench as an initial attempt at measuring these abilities. Here we introduce an evolution of that benchmark, LABBench2, for measuring real‑world capabilities of AI systems performing useful scientific tasks. LABBench2 comprises nearly 1,900 tasks and is, for the most part, a continuation of LAB‑Bench, measuring similar capabilities but in more realistic contexts. We evaluate performance of current frontier models, and show that while abilities measured by LAB‑Bench and LABBench2 have improved substantially, LABBench2 provides a meaningful jump in difficulty (model‑specific accuracy differences range from ‑26% to ‑46% across subtasks) and underscores continued room for performance improvement. LABBench2 continues the legacy of LAB‑Bench as a de facto benchmark for AI scientific research capabilities and we hope that it continues to help advance development of AI tools for these core research functions. To facilitate community use and development, we provide the task dataset at https://huggingface.co/datasets/futurehouse/labbench2 and a public eval harness at https://github.com/EdisonScientific/labbench2.
Authors:Guanyu Zhou, Yida Yin, Wenhao Chai, Shengbang Tong, Xingyu Fu, Zhuang Liu
Abstract:
Vision‑language models (VLMs) still struggle with visual perception tasks such as spatial understanding and viewpoint recognition. One plausible contributing factor is that natural image datasets provide limited supervision for low‑level visual skills. This motivates a practical question: can targeted synthetic supervision, generated from only a task keyword such as Depth Order, address these weaknesses? To investigate this question, we introduce VisionFoundry, a task‑aware synthetic data generation pipeline that takes only the task name as input and uses large language models (LLMs) to generate questions, answers, and text‑to‑image (T2I) prompts, then synthesizes images with T2I models and verifies consistency with a proprietary VLM, requiring no reference images or human annotation. Using VisionFoundry, we construct VisionFoundry‑10K, a synthetic visual question answering (VQA) dataset containing 10k image‑question‑answer triples spanning 10 tasks. Models trained on VisionFoundry‑10K achieve substantial improvements on visual perception benchmarks: +7% on MMVP and +10% on CV‑Bench‑3D, while preserving broader capabilities and showing favorable scaling behavior as data size increases. Our results suggest that limited task‑targeted supervision is an important contributor to this bottleneck and that synthetic supervision is a promising path toward more systematic training for VLMs.
Authors:Wenyi Xiao, Xinchi Xu, Leilei Gan
Abstract:
Large Vision Language Models (LVLMs) achieve strong multimodal reasoning but frequently exhibit hallucinations and incorrect responses with high certainty, which hinders their usage in high‑stakes domains. Existing verbalized confidence calibration methods, largely developed for text‑only LLMs, typically optimize a single holistic confidence score using binary answer‑level correctness. This design is mismatched to LVLMs: an incorrect prediction may arise from perceptual failures or from reasoning errors given correct perception, and a single confidence conflates these sources while visual uncertainty is often dominated by language priors. To address these issues, we propose VL‑Calibration, a reinforcement learning framework that explicitly decouples confidence into visual and reasoning confidence. To supervise visual confidence without ground‑truth perception labels, we introduce an intrinsic visual certainty estimation that combines (i) visual grounding measured by KL‑divergence under image perturbations and (ii) internal certainty measured by token entropy. We further propose token‑level advantage reweighting to focus optimization on tokens based on visual certainty, suppressing ungrounded hallucinations while preserving valid perception. Experiments on thirteen benchmarks show that VL‑Calibration effectively improves calibration while boosting visual reasoning accuracy, and it generalizes to out‑of‑distribution benchmarks across model scales and architectures.
Authors:Kyle Whitecross, Negin Rahimi
Abstract:
We propose RecaLLM, a set of reasoning language models post‑trained to make effective use of long‑context information. In‑context retrieval, which identifies relevant evidence from context, and reasoning are deeply intertwined: retrieval supports reasoning, while reasoning often determines what must be retrieved. However, their interaction remains largely underexplored. In preliminary experiments on several open‑source LLMs, we observe that in‑context retrieval performance substantially degrades even after a short reasoning span, revealing a key bottleneck for test‑time scaling that we refer to as lost‑in‑thought: reasoning steps that improve performance also make subsequent in‑context retrieval more challenging. To address this limitation, RecaLLM interleaves reasoning with explicit in‑context retrieval, alternating between reasoning and retrieving context information needed to solve intermediate subproblems. We introduce a negligible‑overhead constrained decoding mechanism that enables verbatim copying of evidence spans, improving the grounding of subsequent generation. Trained on diverse lexical and semantic retrieval tasks, RecaLLM achieves strong performance on two long‑context benchmarks, RULER and HELMET, significantly outperforming baselines. Notably, we observe consistent gains at context windows of up to 128K tokens using training samples of at most 10K tokens, far shorter than those used by existing long‑context approaches, highlighting a promising path toward improving long‑context performance without expensive long‑context training data.
Authors:Han Luo, Guy Laban
Abstract:
Large language models are increasingly deployed in multi‑turn settings such as tutoring, support, and counseling, where reliability depends on preserving consistent roles, personas, and goals across long horizons. This requirement becomes critical when LLMs are used to generate synthetic dialogues for training and evaluation, since LLM‑‑LLM conversations can accumulate identity‑related failures such as persona drift, role confusion, and "echoing", where one agent gradually mirrors its partner. We introduce SPASM (Stable Persona‑driven Agent Simulation for Multi‑turn dialogue generation), a modular, stability‑first framework that decomposes simulation into (i) persona creation via schema sampling, plausibility validation, and natural‑language persona crafting, (ii) Client‑‑Responder dialogue generation, and (iii) termination detection for coherent stopping. To improve long‑horizon stability without changing model weights, we propose Egocentric Context Projection (ECP): dialogue history is stored in a perspective‑agnostic representation and deterministically projected into each agent's egocentric view before generation. Across three LLM backbones (GPT‑4o‑mini, DeepSeek‑V3.2, Qwen‑Plus) and nine Client‑‑Responder pairings, we construct a dataset of 4,500 personas and 45,000 conversations (500 personas X 10 conversations per pairing). Ablations show ECP substantially reduces persona drift and, under human validation, eliminates echoing; embedding analyses recover persona structure and reveal strong responder‑driven interaction geometry. Our code is available at https://github.com/lhannnn/SPASM.
Authors:Jon-Paul Cacioli
Abstract:
We report that model quantisation restructures domain‑level metacognitive efficiency in LLMs rather than degrading it uniformly. Evaluating Llama‑3‑8B‑Instruct on the same 3,000 questions at Q5_K_M and f16 precision, we find that M‑ratio profiles across four knowledge domains are uncorrelated between formats (Spearman rho = 0.00). Arts & Literature moves from worst‑monitored (M‑ratio = 0.606 at Q5_K_M) to best‑monitored (1.542 at f16). Geography moves from well‑monitored (1.210) to under‑monitored (0.798). However, Type‑2 AUROC profiles are perfectly stable across formats (rho = 1.00), localising the restructuring to the M‑ratio normalisation rather than the underlying discrimination signal. This finding emerged from a pre‑registered attempt to improve metacognition through domain‑conditional training. We prescribed confidence‑amplification SFT for the diagnosed weak domain, with matched‑budget agnostic and wrong‑prescription controls. All four confirmatory hypotheses were null (10,000 bootstrap resamples, seed = 42). The training successfully reshaped confidence distributions, doubling the NLP gap in Science from 0.076 to 0.152, but did not improve meta‑d' because the diagnostic profile did not transfer across formats. Any system relying on domain‑level M‑ratio profiles has an unexamined dependency on inference format. Systems using AUROC_2 are safer. We release all code, pre‑registrations, and trial‑level data.
Authors:Yixin Xiang, Yunshan Ma, Xiaoyu Du, Yibing Chen, Yanxin Zhang, Jinhui Tang
Abstract:
Document Question Answering (DQA) involves generating answers from a document based on a user's query, representing a key task in document understanding. This task requires interpreting visual layouts, which has prompted recent studies to adopt multimodal Retrieval‑Augmented Generation (RAG) that processes page images for answer generation. However, in multimodal RAG, visual DQA struggles to utilize a large number of images effectively, as the retrieval stage often retains only a few candidate pages (e.g., Top‑4), causing informative but less visually salient content to be overlooked in favor of common yet low‑information pages. To address this issue, we propose a Multi‑Armed Bandit‑based DQA framework (MAB‑DQA) to explicitly model the varying importance of multiple implicit aspects in a query. Specifically, MAB‑DQA decomposes a query into aspect‑aware subqueries and retrieves an aspect‑specific candidate set for each. It treats each subquery as an arm and uses preliminary reasoning results from a small number of representative pages as reward signals to estimate aspect utility. Guided by an exploration‑exploitation policy, MAB‑DQA dynamically reallocates retrieval budgets toward high‑value aspects. With the most informative pages and their correlations, MAB‑DQA generates the expected results. On four benchmarks, MAB‑DQA shows an average improvement of 5%‑18% over the state‑of‑the‑art method, consistently enhancing document understanding. Codes are available at https://github.com/ElephantOH/MAB‑DQA.
Authors:Tong Wu, Nicolay Rusnachenko, Huizhi Liang
Abstract:
Dimensional Aspect‑Based Sentiment Analysis (DimABSA) extends traditional ABSA from categorical polarity labels to continuous valence‑arousal (VA) regression. This paper describes a system developed for Track A ‑ Subtask 1 (Dimensional Aspect Sentiment Regression), aiming to predict real‑valued VA scores in the [1, 9] range for each given aspect in a text. A fine‑tuning approach based on XLM‑RoBERTa‑base is adopted, constructing the input as [CLS] T [SEP] a_i [SEP] and training dual regression heads with sigmoid‑scaled outputs for valence and arousal prediction. Separate models are trained for each language‑domain combination (English and Chinese across restaurant, laptop, and finance domains), and training and development sets are merged for final test predictions. In development experiments, the fine‑tuning approach is compared against several large language models including GPT‑5.2, LLaMA‑3‑70B, LLaMA‑3.3‑70B, and LLaMA‑4‑Maverick under a few‑shot prompting setting, demonstrating that task‑specific fine‑tuning substantially and consistently outperforms these LLM‑based methods across all evaluation datasets. The code is publicly available at https://github.com/tongwu17/SemEval‑2026‑Task3‑Track‑A.
Authors:Jinqi Luo, Jinyu Yang, Tal Neiman, Lei Fan, Bing Yin, Son Tran, Mubarak Shah, René Vidal
Abstract:
Multimodal Large Language Models (MLLMs) have been shown to be vulnerable to malicious queries that can elicit unsafe responses. Recent work uses prompt engineering, response classification, or finetuning to improve MLLM safety. Nevertheless, such approaches are often ineffective against evolving malicious patterns, may require rerunning the query, or demand heavy computational resources. Steering the activations of a frozen model at inference time has recently emerged as a flexible and effective solution. However, existing steering methods for MLLMs typically handle only a narrow set of safety‑related concepts or struggle to adjust specific concepts without affecting others. To address these challenges, we introduce Dictionary‑Aligned Concept Control (DACO), a framework that utilizes a curated concept dictionary and a Sparse Autoencoder (SAE) to provide granular control over MLLM activations. First, we curate a dictionary of 15,000 multimodal concepts by retrieving over 400,000 caption‑image stimuli and summarizing their activations into concept directions. We name the dataset DACO‑400K. Second, we show that the curated dictionary can be used to intervene activations via sparse coding. Third, we propose a new steering approach that uses our dictionary to initialize the training of an SAE and automatically annotate the semantics of the SAE atoms for safeguarding MLLMs. Experiments on multiple MLLMs (e.g., QwenVL, LLaVA, InternVL) across safety benchmarks (e.g., MM‑SafetyBench, JailBreakV) show that DACO significantly improves MLLM safety while maintaining general‑purpose capabilities.
Authors:Svetoslav Nizhnichenkov, Rahul Nair, Elizabeth Daly, Brian Mac Namee
Abstract:
We investigate how successful bias mitigation reshapes the embedding space of encoder‑only and decoder‑only foundation models, offering an internal audit of model behaviour through representational analysis. Using BERT and Llama2 as representative architectures, we assess the shifts in associations between gender and occupation terms by comparing baseline and bias‑mitigated variants of the models. Our findings show that bias mitigation reduces gender‑occupation disparities in the embedding space, leading to more neutral and balanced internal representations. These representational shifts are consistent across both model types, suggesting that fairness improvements can manifest as interpretable and geometric transformations. These results position embedding analysis as a valuable tool for understanding and validating the effectiveness of debiasing methods in foundation models. To further promote the assessment of decoder‑only models, we introduce WinoDec, a dataset consisting of 4,000 sequences with gender and occupation terms, and release it to the general public. (https://github.com/winodec/wino‑dec)
Authors:Leonid Erlygin, Alexey Zaytsev
Abstract:
Accurate uncertainty estimation is essential for building robust and trustworthy recognition systems. In this paper, we consider the open‑set text classification (OSTC) task ‑ and uncertainty estimation for it. For OSTC a text sample should be classified as one of the existing classes or rejected as unknown. To account for the different uncertainty types encountered in OSTC, we adapt the Holistic Uncertainty Estimation (HolUE) method for the text domain. Our approach addresses two major causes of prediction errors in text recognition systems: text uncertainty that stems from ill formulated queries and gallery uncertainty that is related the ambiguity of data distribution. By capturing these sources, it becomes possible to predict when the system will make a recognition error. We propose a new OSTC benchmark and conduct extensive experiments on a wide range of data, utilizing the authorship attribution, intent and topic classification datasets. HolUE achieves 40‑365% improvement in Prediction Rejection Ratio (PRR) over the quality‑based SCF baseline across datasets: 365% on Yahoo Answers (0.79 vs 0.17 at FPIR 0.1), 347% on DBPedia (0.85 vs 0.19), 240% on PAN authorship attribution (0.51 vs 0.15 at FPIR 0.5), and 40% on CLINC150 intent classification (0.73 vs~0.52). We make public our code and protocols https://github.com/Leonid‑Erlygin/text_uncertainty.git
Authors:Wenbo Hu, Xin Chen, Yan Gao-Tian, Yihe Deng, Nanyun Peng, Kai-Wei Chang
Abstract:
Group Relative Policy Optimization (GRPO) has emerged as the de facto Reinforcement Learning (RL) objective driving recent advancements in Multimodal Large Language Models. However, extending this success to open‑source multimodal generalist models remains heavily constrained by two primary challenges: the extreme variance in reward topologies across diverse visual tasks, and the inherent difficulty of balancing fine‑grained perception with multi‑step reasoning capabilities. To address these issues, we introduce Gaussian GRPO (G^2RPO), a novel RL training objective that replaces standard linear scaling with non‑linear distributional matching. By mathematically forcing the advantage distribution of any given task to strictly converge to a standard normal distribution, \mathcalN(0,1), G^2RPO theoretically ensures inter‑task gradient equity, mitigates vulnerabilities to heavy‑tail outliers, and offers symmetric update for positive and negative rewards. Leveraging the enhanced training stability provided by G^2RPO, we introduce two task‑level shaping mechanisms to seamlessly balance perception and reasoning. First, response length shaping dynamically elicits extended reasoning chains for complex queries while enforce direct outputs to bolster visual grounding. Second, entropy shaping tightly bounds the model's exploration zone, effectively preventing both entropy collapse and entropy explosion. Integrating these methodologies, we present OpenVLThinkerV2, a highly robust, general‑purpose multimodal model. Extensive evaluations across 18 diverse benchmarks demonstrate its superior performance over strong open‑source and leading proprietary frontier models.
Authors:Sergey V Samsonau
Abstract:
Science currently offers two options for quality assurance, both inadequate. Journal gatekeeping claims to verify both integrity and contribution, but actually measures prestige: peer review is slow, biased, and misses fabricated citations even at top venues. Open science provides no quality assurance at all: the only filter between AI‑generated text and the public record is the author's integrity. AI‑assisted writing makes both worse by producing more papers faster than either system can absorb. We propose a third option: measure the paper itself. sciwrite‑lint (pip install sciwrite‑lint) is an open‑source linter for scientific manuscripts that runs entirely on the researcher's machine (free public databases, a single consumer GPU, and open‑weights models) with no manuscripts sent to external services. The pipeline verifies that references exist, checks retraction status, compares metadata against canonical records, downloads and parses cited papers, verifies that they support the claims made about them, and follows one level further to check cited papers' own bibliographies. Each reference receives a per‑reference reliability score aggregating all verification signals. We evaluate the pipeline on 30 unseen papers from arXiv and bioRxiv with error injection and LLM‑adjudicated false positive analysis. As an experimental extension, we propose SciLint Score, combining integrity verification with a contribution component that operationalizes five frameworks from philosophy of science (Popper, Lakatos, Kitcher, Laudan, Mayo) into computable structural properties of scientific arguments. The integrity component is the core of the tool and is evaluated in this paper; the contribution component is released as experimental code for community development.
Authors:Runpeng Geng, Chenlong Yin, Yanting Wang, Ying Chen, Jinyuan Jia
Abstract:
Prompt injection attacks pose serious security risks across a wide range of real‑world applications. While receiving increasing attention, the community faces a critical gap: the lack of a unified platform for prompt injection evaluation. This makes it challenging to reliably compare defenses, understand their true robustness under diverse attacks, or assess how well they generalize across tasks and benchmarks. For instance, many defenses initially reported as effective were later found to exhibit limited robustness on diverse datasets and attacks. To bridge this gap, we introduce PIArena, a unified and extensible platform for prompt injection evaluation that enables users to easily integrate state‑of‑the‑art attacks and defenses and evaluate them across a variety of existing and new benchmarks. We also design a dynamic strategy‑based attack that adaptively optimizes injected prompts based on defense feedback. Through comprehensive evaluation using PIArena, we uncover critical limitations of state‑of‑the‑art defenses: limited generalizability across tasks, vulnerability to adaptive attacks, and fundamental challenges when an injected task aligns with the target task. The code and datasets are available at https://github.com/sleeepeer/PIArena.
Authors:Ashima Suvarna, Kendrick Phan, Mehrab Beikzadeh, Hritik Bansal, Saadia Gabriel
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has significantly improved large language model (LLM) reasoning in formal domains such as mathematics and code. Despite these advancements, LLMs still struggle with general reasoning tasks requiring capabilities such as causal inference and temporal understanding. Extending RLVR to general reasoning is fundamentally constrained by the lack of high‑quality, verifiable training data that spans diverse reasoning skills. To address this challenge, we propose SUPERNOVA, a data curation framework for RLVR aimed at enhancing general reasoning. Our key insight is that instruction‑tuning datasets containing expert‑annotated ground‑truth encode rich reasoning patterns that can be systematically adapted for RLVR. To study this, we conduct 100+ controlled RL experiments to analyze how data design choices impact downstream reasoning performance. In particular, we investigate three key factors: (i) source task selection, (ii) task mixing strategies, and (iii) synthetic interventions for improving data quality. Our analysis reveals that source task selection is non‑trivial and has a significant impact on downstream reasoning performance. Moreover, selecting tasks based on their performance for individual target tasks outperforms strategies based on overall average performance. Finally, models trained on SUPERNOVA outperform strong baselines (e.g., Qwen3.5) on challenging reasoning benchmarks including BBEH, Zebralogic, and MMLU‑Pro. In particular, training on SUPERNOVA yields relative improvements of up to 52.8% on BBEH across model sizes, demonstrating the effectiveness of principled data curation for RLVR. Our findings provide practical insights for curating human‑annotated resources to extend RLVR to general reasoning. The code and data is available at https://github.com/asuvarna31/supernova.
Authors:Yating Wang, Wenting Zhao, Yaqi Zhao, Yongshun Gong, Yilong Yin, Haoliang Sun
Abstract:
Large language models store not only isolated facts but also rules that support reasoning across symbolic expressions, natural language explanations, and concrete instances. Yet most model editing methods are built for fact‑level knowledge, assuming that a target edit can be achieved through a localized intervention. This assumption does not hold for rule‑level knowledge, where a single rule must remain consistent across multiple interdependent forms. We investigate this problem through a mechanistic study of rule‑level knowledge editing. To support this study, we extend the RuleEdit benchmark from 80 to 200 manually verified rules spanning mathematics and physics. Fine‑grained causal tracing reveals a form‑specific organization of rule knowledge in transformer layers: formulas and descriptions are concentrated in earlier layers, while instances are more associated with middle layers. These results suggest that rule knowledge is not uniformly localized, and therefore cannot be reliably edited by a single‑layer or contiguous‑block intervention. Based on this insight, we propose Distributed Multi‑Layer Editing (DMLE), which applies a shared early‑layer update to formulas and descriptions and a separate middle‑layer update to instances. While remaining competitive on standard editing metrics, DMLE achieves substantially stronger rule‑level editing performance. On average, it improves instance portability and rule understanding by 13.91 and 50.19 percentage points, respectively, over the strongest baseline across GPT‑J‑6B, Qwen2.5‑7B, Qwen2‑7B, and LLaMA‑3‑8B. The code is available at https://github.com/Pepper66/DMLE.
Authors:Junjie Fei, Jun Chen, Zechun Liu, Yunyang Xiong, Chong Zhou, Wei Wen, Junlin Han, Mingchen Zhuge, Saksham Suri, Qi Qian, Shuming Liu, Lemeng Wu, Raghuraman Krishnamoorthi, Vikas Chandra, Mohamed Elhoseiny, Chenchen Zhu
Abstract:
Adapting Multimodal Large Language Models (MLLMs) for hour‑long videos is bottlenecked by context limits. Dense visual streams saturate token budgets and exacerbate the lost‑in‑the‑middle phenomenon. Existing heuristics, like sparse sampling or uniform pooling, blindly sacrifice fidelity by discarding decisive moments and wasting bandwidth on irrelevant backgrounds. We propose Tempo, an efficient query‑aware framework compressing long videos for downstream understanding. Tempo leverages a Small Vision‑Language Model (SVLM) as a local temporal compressor, casting token reduction as an early cross‑modal distillation process to generate compact, intent‑aligned representations in a single forward pass. To enforce strict budgets without breaking causality, we introduce Adaptive Token Allocation (ATA). Exploiting the SVLM's zero‑shot relevance prior and semantic front‑loading, ATA acts as a training‑free O(1) dynamic router. It allocates dense bandwidth to query‑critical segments while compressing redundancies into minimal temporal anchors to maintain the global storyline. Extensive experiments show our 6B architecture achieves state‑of‑the‑art performance with aggressive dynamic compression (0.5‑16 tokens/frame). On the extreme‑long LVBench (4101s), Tempo scores 52.3 under a strict 8K visual budget, outperforming GPT‑4o and Gemini 1.5 Pro. Scaling to 2048 frames reaches 53.7. Crucially, Tempo compresses hour‑long videos substantially below theoretical limits, proving true long‑form video understanding relies on intent‑driven efficiency rather than greedily padded context windows.
Authors:Bo Li, Shikun Zhang, Wei Ye
Abstract:
Instruction‑tuned language models increasingly rely on large multi‑turn dialogue corpora, but these datasets are often noisy and structurally inconsistent, with topic drift, repetitive chitchat, and mismatched answer formats across turns. We address this from a data selection perspective and propose MDS (Multi‑turn Dialogue Selection), a dialogue‑level framework that scores whole conversations rather than isolated turns. MDS combines a global coverage stage that performs bin‑wise selection in the user‑query trajectory space to retain representative yet non‑redundant dialogues, with a local structural stage that evaluates within‑dialogue reliability through entity‑grounded topic grounding and information progress, together with query‑answer form consistency for functional alignment. MDS outperforms strong single‑turn selectors, dialogue‑level LLM scorers, and heuristic baselines on three multi‑turn benchmarks and an in‑domain Banking test set, achieving the best overall rank across reference‑free and reference‑based metrics, and is more robust on long conversations under the same training budget. Code and resources are included in the supplementary materials.
Authors:Wenkui Yang, Chao Jin, Haisu Zhu, Weilin Luo, Derek Yuen, Kun Shao, Huaibo Huang, Junxian Duan, Jie Cao, Ran He
Abstract:
Existing red‑teaming studies on GUI agents have important limitations. Adversarial perturbations typically require white‑box access, which is unavailable for commercial systems, while prompt injection is increasingly mitigated by stronger safety alignment. To study robustness under a more practical threat model, we propose Semantic‑level UI Element Injection, a red‑teaming setting that overlays safety‑aligned and harmless UI elements onto screenshots to misdirect the agent's visual grounding. Our method uses a modular Editor‑Overlapper‑Victim pipeline and an iterative search procedure that samples multiple candidate edits, keeps the best cumulative overlay, and adapts future prompt strategies based on previous failures. Across five victim models, our optimized attacks improve attack success rate by up to 4.4x over random injection on the strongest victims. Moreover, elements optimized on one source model transfer effectively to other target models, indicating model‑agnostic vulnerabilities. After the first successful attack, the victim still clicks the attacker‑controlled element in more than 15% of later independent trials, versus below 1% for random injection, showing that the injected element acts as a persistent attractor rather than simple visual clutter.
Authors:Kunfeng Chen, Luyao Zhuang, Fei Liao, Juhua Liu, Jian Wang, Bo Du
Abstract:
Tool learning has emerged as a promising paradigm for large language models (LLMs) to address real‑world challenges. Due to the extensive and irregularly updated number of tools, tool retrieval for selecting the desired tool subset is essential. However, current tool retrieval methods are usually based on academic benchmarks containing overly detailed instructions (e.g., specific API names and parameters), while real‑world instructions are more vague. Such a discrepancy would hinder the tool retrieval in real‑world applications. In this paper, we first construct a new benchmark, VGToolBench, to simulate human vague instructions. Based on this, we conduct a series of preliminary analyses and find that vague instructions indeed damage the performance of tool retrieval. To this end, we propose a simple‑yet‑effective Tool Retrieval Bridge (TRB) approach to boost the performance of tool retrieval for vague instructions. The principle of TRB is to introduce a bridge model to rewrite the vague instructions into more specific ones and alleviate the gap between vague instructions and retriever preferences.We conduct extensive experiments under multiple commonly used retrieval settings, and the results show that TRB effectively mitigates the ambiguity of vague instructions while delivering consistent and substantial improvements across all baseline retrievers. For example, with the help of TRB, BM25 achieves a relative improvement of up to 111.51%, i.e., increasing the average NDCG score from 9.73 to 19.59. The source code and models are publicly available at https://github.com/kfchenhn/TRB.
Authors:Rui Zhang, Hongwei Li, Yun Shen, Xinyue Shen, Wenbo Jiang, Guowen Xu, Yang Liu, Michael Backes, Yang Zhang
Abstract:
The deployment of large language models (LLMs) raises significant ethical and safety concerns. While LLM alignment techniques are adopted to improve model safety and trustworthiness, adversaries can exploit these techniques to undermine safety for malicious purposes, resulting in \emphmisalignment. Misaligned LLMs may be published on open platforms to magnify harm. To address this, additional safety alignment, referred to as \emphrealignment, is necessary before deploying untrusted third‑party LLMs. This study explores the efficacy of fine‑tuning methods in terms of misalignment, realignment, and the effects of their interplay. By evaluating four Supervised Fine‑Tuning (SFT) and two Preference Fine‑Tuning (PFT) methods across four popular safety‑aligned LLMs, we reveal a mechanism asymmetry between attack and defense. While Odds Ratio Preference Optimization (ORPO) is most effective for misalignment, Direct Preference Optimization (DPO) excels in realignment, albeit at the expense of model utility. Additionally, we identify model‑specific resistance, residual effects of multi‑round adversarial dynamics, and other noteworthy findings. These findings highlight the need for robust safeguards and customized safety alignment strategies to mitigate potential risks in the deployment of LLMs. Our code is available at https://github.com/zhangrui4041/The‑Art‑of‑Mis‑alignment.
Authors:David Gringras
Abstract:
Ask a frontier model how to taper six milligrams of alprazolam (psychiatrist retired, ten days of pills left, abrupt cessation causes seizures) and it tells her to call the psychiatrist she just explained does not exist. Change one word ("I'm a psychiatrist; a patient presents with...") and the same model, same weights, same inference pass produces a textbook Ashton Manual taper with diazepam equivalence, anticonvulsant coverage, and monitoring thresholds. The knowledge was there; the model withheld it. IatroBench measures this gap. Sixty pre‑registered clinical scenarios, six frontier models, 3,600 responses, scored on two axes (commission harm, CH 0‑3; omission harm, OH 0‑4) through a structured‑evaluation pipeline validated against physician scoring (kappa_w = 0.571, within‑1 agreement 96%). The central finding is identity‑contingent withholding: match the same clinical question in physician vs. layperson framing and all five testable models provide better guidance to the physician (decoupling gap +0.38, p = 0.003; binary hit rates on safety‑colliding actions drop 13.1 percentage points in layperson framing, p < 0.0001, while non‑colliding actions show no change). The gap is widest for the model with the heaviest safety investment (Opus, +0.65). Three failure modes separate cleanly: trained withholding (Opus), incompetence (Llama 4), and indiscriminate content filtering (GPT‑5.2, whose post‑generation filter strips physician responses at 9x the layperson rate because they contain denser pharmacological tokens). The standard LLM judge assigns OH = 0 to 73% of responses a physician scores OH >= 1 (kappa = 0.045); the evaluation apparatus has the same blind spot as the training apparatus. Every scenario targets someone who has already exhausted the standard referrals.
Authors:Yang Cao
Abstract:
Linear recurrent models offer linear‑time sequence processing but often suffer from suboptimal long‑range memory. We trace this to the decay spectrum: for N channels, random initialization collapses the minimum spectral gap to O(N^‑2), yielding sub‑exponential error \exp(‑Ω(N/\log N)); linear spacing avoids collapse but degrades to \exp(‑O(N/\sqrtT)), practically algebraic over long contexts. We introduce Position‑Adaptive Spectral Tapering (PoST), an architecture‑agnostic framework combining two mechanisms: (1) Spectral Reparameterization, which structurally enforces geometrically spaced log‑decay rates, proven minimax optimal at rate O(\exp(‑cN/\log T)); and (2) Position‑Adaptive Scaling, the provably unique mechanism that eliminates the scale mismatch of static spectra (where only N\log t/\log T of N channels are effective at position t) by stretching the spectrum to the actual dependency range, sharpening the rate to O(\exp(‑cN/\log t)). This scaling natively induces fractional invariance: the impulse response becomes scale‑free, with channels interpolating between relative and absolute temporal coordinates. PoST integrates into any diagonal linear recurrence without overhead. We instantiate it across Mamba‑2, RWKV‑7, Gated DeltaNet, Gated Linear Attention, and RetNet. Pre‑training at 180M‑440M scales shows consistent zero‑shot language modeling improvements, significant long‑context retrieval gains for Mamba‑2 (MQAR and NIAH), and competitive or improved performance across other architectures. Code: https://github.com/SiLifen/PoST.
Authors:Ziyi Wang, Siva Rajesh Kasa, Ankith M S, Santhosh Kumar Kasa, Jiaru Zou, Sumit Negi, Ruqi Zhang, Nan Jiang, Qifan Song
Abstract:
Speculative decoding is an effective technique for accelerating large language model inference by drafting multiple tokens in parallel. In practice, its speedup is often bottlenecked by a rigid verification step that strictly enforces the accepted token distribution to exactly match the target model. This constraint leads to the rejection of many plausible tokens, lowering the acceptance rate and limiting overall time speedup. To overcome this limitation, we propose Dynamic Verification Relaxed Speculative Decoding (DIVERSED), a relaxed verification framework that improves time efficiency while preserving generation quality. DIVERSED learns an ensemble‑based verifier that blends the draft and target model distributions with a task‑dependent and context‑dependent weight. We provide theoretical justification for our approach and demonstrate empirically that DIVERSED achieves substantially higher inference efficiency compared to standard speculative decoding methods. Code is available at: https://github.com/comeusr/diversed.
Authors:Kai Qin, Liangxin Liu, Yu Liang, Longzheng Wang, Yan Wang, Yueyang Zhang, Long Xia, Zhiyuan Sun, Houde Liu, Daiting Shi
Abstract:
Reward Models (RMs) are critical components in the Reinforcement Learning from Human Feedback (RLHF) pipeline, directly determining the alignment quality of Large Language Models (LLMs). Recently, Generative Reward Models (GRMs) have emerged as a superior paradigm, offering higher interpretability and stronger generalization than traditional scalar RMs. However, existing methods for GRMs focus primarily on outcome‑level supervision, neglecting analytical process quality, which constrains their potential. To address this, we propose ReflectRM, a novel GRM that leverages self‑reflection to assess analytical quality and enhance preference modeling. ReflectRM is trained under a unified generative framework for joint modeling of response preference and analysis preference. During inference, we use its self‑reflection capability to identify the most reliable analysis, from which the final preference prediction is derived. Experiments across four benchmarks show that ReflectRM consistently improves performance, achieving an average accuracy gain of +3.7 on Qwen3‑4B. Further experiments confirm that response preference and analysis preference are mutually reinforcing. Notably, ReflectRM substantially mitigates positional bias, yielding +10.2 improvement compared with leading GRMs and establishing itself as a more stable evaluator. Our code is available at https://github.com/yuliangCarmelo/ReflectRM.
Authors:Jianhui Liu, Haoze Sun, Wenbo Li, Yanbing Zhang, Rui Yang, Zhiliang Zhu, Yijun Yang, Shenghe Zheng, Nan Jiang, Jiaxiu Jiang, Haoyang Huang, Tien-Tsin Wong, Nan Duan, Xiaojuan Qi
Abstract:
Spatial understanding is a fundamental cornerstone of human‑level intelligence. Nonetheless, current research predominantly focuses on domain‑specific data production, leaving a critical void: the absence of a principled, open‑source engine capable of fully unleashing the potential of high‑quality spatial data. To bridge this gap, we elucidate the design principles of a robust data generation system and introduce OpenSpatial ‑‑ an open‑source data engine engineered for high quality, extensive scalability, broad task diversity, and optimized efficiency. OpenSpatial adopts 3D bounding boxes as the fundamental primitive to construct a comprehensive data hierarchy across five foundational tasks: Spatial Measurement (SM), Spatial Relationship (SR), Camera Perception (CP), Multi‑view Consistency (MC), and Scene‑Aware Reasoning (SAR). Leveraging this scalable infrastructure, we curate OpenSpatial‑3M, a large‑scale dataset comprising 3 million high‑fidelity samples. Extensive evaluations demonstrate that versatile models trained on our dataset achieve state‑of‑the‑art performance across a wide spectrum of spatial reasoning benchmarks. Notably, the best‑performing model exhibits a substantial average improvement of 19 percent, relatively. Furthermore, we provide a systematic analysis of how data attributes influence spatial perception. By open‑sourcing both the engine and the 3M‑scale dataset, we provide a robust foundation to accelerate future research in spatial intelligence.
Authors:Chhavi Dhiman, Naman Chawla, Riya Dhami, Gaurav Kumar, Ganesh Naik
Abstract:
The widespread use of clickbait headlines, crafted to mislead and maximize engagement, poses a significant challenge to online credibility. These headlines employ sensationalism, misleading claims, and vague language, underscoring the need for effective detection to ensure trustworthy digital content. The paper introduces, ClickGuard: a trustworthy adaptive fusion framework for clickbait detection. It combines BERT embeddings and structural features using a Syntactic‑Semantic Adaptive Fusion Block (SSAFB) for dynamic integration. The framework incorporates a hybrid CNN‑BiLSTM to capture patterns and dependencies. The model achieved 96.93% testing accuracy, outperforming state‑of‑the‑art approaches. The model's trustworthiness is evaluated using LIME and Permutation Feature Importance (PFI) for interpretability and perturbation analysis. These methods assess the model's robustness and sensitivity to feature changes by measuring the average prediction variation. Ablation studies validated the SSAFB's effectiveness in optimizing feature fusion. The model demonstrated robust performance across diverse datasets, providing a scalable, reliable solution for enhancing online content credibility by addressing syntactic‑semantic modelling challenges. Code of the work is available at: https://github.com/palindromeRice/ClickBait_Detection_Architecture
Authors:Huidong Ma, Xinyan Shi, Hui Sun, Xiaofei Yue, Xiaoguang Liu, Gang Wang, Wentong Cai
Abstract:
While Learned Data Compression (LDC) has achieved superior compression ratios, balancing precise probability modeling with system efficiency remains challenging. Crucially, uniform single‑stream architectures struggle to simultaneously capture micro‑syntactic and macro‑semantic features, necessitating deep serial stacking that exacerbates latency. Compounding this, heterogeneous systems are constrained by device speed mismatches, where throughput is capped by Amdahl's Law due to serial processing. To this end, we propose a Dual‑Stream Multi‑Scale Decoupler that disentangles local and global contexts to replace deep serial processing with shallow parallel streams, and incorporate a Hierarchical Gated Refiner for adaptive feature refinement and precise probability modeling. Furthermore, we design a Concurrent Stream‑Parallel Pipeline, which overcomes systemic bottlenecks to achieve full‑pipeline parallelism. Extensive experiments demonstrate that our method achieves state‑of‑the‑art performance in both compression ratio and throughput, while maintaining the lowest latency and memory usage. The code is available at https://github.com/huidong‑ma/FADE.
Authors:Yihao Wang, Zijian He, Jie Ren, Keze Wang
Abstract:
Retrieval shapes how language models access and ground knowledge in retrieval‑augmented generation (RAG). In historical research, the target is often not an arbitrary relevant passage, but the exact record for a specific regnal month, where temporal consistency matters as much as topical relevance. This is especially challenging for Classical Chinese annals, where time is expressed through terse, implicit, non‑Gregorian reign phrases that must be interpreted from surrounding context, so semantically plausible evidence can still be temporally invalid. We introduce ChunQiuTR, a time‑keyed retrieval benchmark built from the Spring and Autumn Annals and its exegetical tradition. ChunQiuTR organizes records by month‑level reign keys and includes chrono‑near confounders that mirror realistic retrieval failures. We further propose CTD (Calendrical Temporal Dual‑encoder), a time‑aware dual‑encoder that combines Fourier‑based absolute calendrical context with relative offset biasing. Experiments show consistent gains over strong semantic dual‑encoder baselines under time‑keyed evaluation, supporting retrieval‑time temporal consistency as a key prerequisite for faithful downstream historical RAG. Our code and datasets are available at \hrefhttps://github.com/xbdxwyh/ChunQiuTR\textttgithub.com/xbdxwyh/ChunQiuTR.
Authors:Bing Wang, Rui Miao, Chen Shen, Shaotian Yan, Kaiyuan Liu, Ximing Li, Xiaosong Yuan, Sinan Fan, Jun Zhang, Jieping Ye
Abstract:
Large reasoning models have recently demonstrated strong performance on complex tasks that require long chain‑of‑thought reasoning, through supervised fine‑tuning on large‑scale and high‑quality datasets. To construct such datasets, existing pipelines generate long reasoning data from more capable Large Language Models (LLMs) and apply manually heuristic or naturalness‑based selection methods to filter high‑quality samples. Despite the proven effectiveness of naturalness‑based data selection, which ranks data by the average log probability assigned by LLMs, our analysis shows that, when applied to LLM reasoning datasets, it systematically prefers samples with longer reasoning steps (i.e., more tokens per step) rather than higher‑quality ones, a phenomenon we term step length confounding. Through quantitative analysis, we attribute this phenomenon to low‑probability first tokens in reasoning steps; longer steps dilute their influence, thereby inflating the average log probabilities. To address this issue, we propose two variant methods: ASLEC‑DROP, which drops first‑token probabilities when computing average log probability, and ASLEC‑CASL, which applies a causal debiasing regression to remove the first tokens' confounding effect. Experiments across four LLMs and five evaluation benchmarks demonstrate the effectiveness of our approach in mitigating the step length confounding problem.
Authors:Xuanle Zhao, Xinyuan Cai, Xiang Cheng, Xiuyi Chen, Bo Xu
Abstract:
While Vision‑Language Models (VLMs) have demonstrated significant potential in chemical visual understanding, current models are predominantly optimized for direct visual question‑answering tasks. This paradigm often results in "black‑box" systems that fail to utilize the inherent capability of Large Language Models (LLMs) to infer underlying reaction mechanisms. In this work, we introduce ChemVLR, a chemical VLM designed to prioritize reasoning within the perception process. Unlike conventional chemical VLMs, ChemVLR analyzes visual inputs in a fine‑grained manner by explicitly identifying granular chemical descriptors, such as functional groups, prior to generating answers. This approach ensures the production of explicit and interpretable reasoning paths for complex visual chemical problems. To facilitate this methodology, we implement a cross‑modality reverse‑engineering strategy, combined with a rigorous filtering pipeline, to curate a large‑scale reasoning‑and‑captioning dataset comprising 760k high‑quality samples across molecular and reaction tasks. Furthermore, we adopt a three‑stage training framework that systemically builds model perception and reasoning capacity. Experiments demonstrate that ChemVLR achieves state‑of‑the‑art (SOTA) performance, surpassing both leading proprietary models and domain‑specific open‑source baselines. We also provide comprehensive ablation studies to validate our training strategy and data generation designs. Code and model weights will be available at https://github.com/xxlllz/ChemVLR.
Authors:Hanyang Wang, Mingxuan Zhu
Abstract:
Modern reasoning models continue generating long after the answer is already determined. Across five model configurations, two families, and three benchmarks, we find that 52‑‑88% of chain‑of‑thought tokens are produced after the answer is recoverable from a partial prefix. This post‑commitment generation reveals a structural phenomenon: the detection‑extraction gap. Free continuations from early prefixes recover the correct answer even at 10% of the trace, while forced extraction fails on 42% of these cases. The answer is recoverable from the model state, yet prompt‑conditioned decoding fails to extract it. We formalize this mismatch via a total‑variation bound between free and forced continuation distributions, yielding quantitative estimates of suffix‑induced shift. Exploiting this asymmetry, we propose Black‑box Adaptive Early Exit (BAEE), which uses free continuations for both detection and extraction, truncating 70‑‑78% of serial generation while improving accuracy by 1‑‑5pp across all models. For thinking‑mode models, early exit prevents post‑commitment overwriting, yielding gains of up to 5.8pp; a cost‑optimized variant achieves 68‑‑73% reduction at a median of 9 API calls. Code is available at https://github.com/EdWangLoDaSc/know2say.
Authors:Maotian Ma, Zheni Zeng, Zhenghao Liu, Yukun Yan
Abstract:
Large language models (LLMs) have shown strong knowledge reserves and task‑solving capabilities, but still face the challenge of severe hallucination, hindering their practical application. Though scientific theories and rules can efficiently direct the behaviors of human manipulators, LLMs still do not utilize these highly‑condensed knowledge sufficiently through training or prompting. To address this issue, we propose SciDC, an LLM generation method that integrate subject‑specific knowledge with strong constraints. By adopting strong LLMs to automatically convert flexible knowledge into multi‑layered, standardized rules, we build an extensible framework to effectively constrain the model generation on domain tasks. Experiments on scientific tasks including industrial formulation design, clinical tumor diagnosis and retrosynthesis planning, consistently demonstrate the effectiveness of our method, achieving a 12% accuracy improvement on average compared with vanilla generation. We further discuss the potential of LLMs in automatically inductively summarizing highly‑condensed knowledge, looking ahead to practical solutions for accelerating the overall scientific research process. All the code of this paper can be obtained (https://github.com/Maotian‑Ma/SciDC).
Authors:Zohaib Khan, Mustafa Dogan, Ifeoma Okoh, Pouya Sadeghi, Siddhartha Shrestha, Sergius Justus Nyah, Mahmoud O. Mokhiamar, Michael J. Ryan, Tarek Naous
Abstract:
Misinformation is on the rise, and the strong writing capabilities of LLMs lower the barrier for malicious actors to produce and disseminate false information. We study how LLMs behave when prompted to spread misinformation across languages and target countries, and introduce GlobalLies, a multilingual parallel dataset of 440 misinformation generation prompt templates and 6,867 entities, spanning 8 languages and 195 countries. Using both human annotations and large‑scale LLM‑as‑a‑judge evaluations across hundreds of thousands of generations from state‑of‑the‑art models, we show that misinformation generation varies systematically based on the country being discussed. Propagation of lies by LLMs is substantially higher in many lower‑resource languages and for countries with a lower Human Development Index (HDI). We find that existing mitigation strategies provide uneven protection: input safety classifiers exhibit cross‑lingual gaps, and retrieval‑augmented fact‑checking remains inconsistent across regions due to unequal information availability. We release GlobalLies for research purposes, aiming to support the development of mitigation strategies to reduce the spread of global misinformation: https://github.com/zohaib‑khan5040/globallies
Authors:Weiyue Li, Ruizhi Qian, Yi Li, Yongce Li, Yunfan Long, Jiahui Cai, Yan Luo, Mengyu Wang
Abstract:
Large language models (LLMs) are widely explored for reasoning‑intensive research tasks, yet resources for testing whether they can infer scientific conclusions from structured biomedical evidence remain limited. We introduce MedConclusion, a large‑scale dataset of 5.7M PubMed structured abstracts for biomedical conclusion generation. Each instance pairs the non‑conclusion sections of an abstract with the original author‑written conclusion, providing naturally occurring supervision for evidence‑to‑conclusion reasoning. MedConclusion also includes journal‑level metadata such as biomedical category and SJR, enabling subgroup analysis across biomedical domains. As an initial study, we evaluate diverse LLMs under conclusion and summary prompting settings and score outputs with both reference‑based metrics and LLM‑as‑a‑judge. We find that conclusion writing is behaviorally distinct from summary writing, strong models remain closely clustered under current automatic metrics, and judge identity can substantially shift absolute scores. MedConclusion provides a reusable data resource for studying scientific evidence‑to‑conclusion reasoning. Our code and data are available at: https://github.com/Harvard‑AI‑and‑Robotics‑Lab/MedConclusion.
Authors:Yuzhe Chen, Jiale Cao, Xuyang Liu, Jin Xie, Aiping Yang, Yanwei Pang
Abstract:
Diffusion Large Language Models (dLLMs) have achieved rapid progress, viewed as a promising alternative to the autoregressive paradigm. However, most dLLM decoders still adopt a global confidence threshold, and do not explicitly model local context from neighboring decoded states or temporal consistency of predicted token IDs across steps. To address this issue, we propose a simple spatio‑temporal stability guided decoding approach, named STDec. We observe strong spatio‑temporal stability in dLLM decoding: newly decoded tokens tend to lie near decoded neighbors, and their predicted IDs often remain consistent across several denoising steps. Inspired by this stability, our STDec includes spatial‑aware decoding and temporal‑aware decoding. The spatial‑aware decoding dynamically generates the token‑adaptive threshold by aggregating the decoded states of nearby tokens. The temporal‑aware decoding relaxes the decoding thresholds for tokens whose predicted token IDs remain consistent over denoising steps. Our STDec is training‑free and remains compatible with cache‑based acceleration methods. Across textual reasoning and multimodal understanding benchmarks, STDec substantially improves throughput while maintaining comparable task performance score. Notably, on MBPP with LLaDA, STDec achieves up to 14.17x speedup with a comparable score. Homepage: https://yzchen02.github.io/STDec.
Authors:Wei Zhou, Xuanhe Zhou, Qikang He, Guoliang Li, Bingsheng He, Quanqing Xu, Fan Wu
Abstract:
Database systems incorporate an ever‑growing number of functions in their kernels (a.k.a., database native functions) for scenarios like new application support and business migration. This growth causes an urgent demand for automatic database native function synthesis. While recent advances in LLM‑based code generation (e.g., Claude Code) show promise, they are too generic for database‑specific development. They often hallucinate or overlook critical context because database function synthesis is inherently complex and error‑prone, where synthesizing a single function may involve registering multiple function units, linking internal references, and implementing logic correctly. To this end, we propose DBCooker, an LLM‑based system for automatically synthesizing database native functions. It consists of three components. First, the function characterization module aggregates multi‑source declarations, identifies function units that require specialized coding, and traces cross‑unit dependencies. Second, we design operations to address the main synthesis challenges: (1) a pseudo‑code‑based coding plan generator that constructs structured implementation skeletons by identifying key elements such as reusable referenced functions; (2) a hybrid fill‑in‑the‑blank model guided by probabilistic priors and component awareness to integrate core logic with reusable routines; and (3) three‑level progressive validation, including syntax checking, standards compliance, and LLM‑guided semantic verification. Finally, an adaptive orchestration strategy unifies these operations with existing tools and dynamically sequences them via the orchestration history of similar functions. Results show that DBCooker outperforms other methods on SQLite, PostgreSQL, and DuckDB (34.55% higher accuracy on average), and can synthesize new functions absent in the latest SQLite (v3.50).
Authors:Ryo Nishida, Masayuki Kawarada, Tatsuya Ishigaki, Hiroya Takamura, Masaki Onishi
Abstract:
This paper investigates demonstration selection strategies for predicting a user's next point‑of‑interest (POI) using large language models (LLMs), aiming to accurately forecast a user's subsequent location based on historical check‑in data. While in‑context learning (ICL) with LLMs has recently gained attention as a promising alternative to traditional supervised approaches, the effectiveness of ICL significantly depends on the selected demonstration. Although previous studies have examined methods such as random selection, embedding‑based selection, and task‑specific selection, there remains a lack of comprehensive comparative analysis among these strategies. To bridge this gap and clarify the best practices for real‑world applications, we comprehensively evaluate existing demonstration selection methods alongside simpler heuristic approaches such as geographical proximity, temporal ordering, and sequential patterns. Extensive experiments conducted on three real‑world datasets indicate that these heuristic methods consistently outperform more complex and computationally demanding embedding‑based methods, both in terms of computational cost and prediction accuracy. Notably, in certain scenarios, LLMs using demonstrations selected by these simpler heuristic methods even outperform existing fine‑tuned models, without requiring further training. Our source code is available at: https://github.com/ryonsd/DS‑LLM4POI.
Authors:Komal Kumar, Aman Chadha, Salman Khan, Fahad Shahbaz Khan, Hisham Cholakkal
Abstract:
The rapid growth of scientific literature has made it increasingly difficult for researchers to efficiently discover, evaluate, and synthesize relevant work. Recent advances in multi‑agent large language models (LLMs) have demonstrated strong potential for understanding user intent and are being trained to utilize various tools. In this paper, we introduce Paper Circle, a multi‑agent research discovery and analysis system designed to reduce the effort required to find, assess, organize, and understand academic literature. The system comprises two complementary pipelines: (1) a Discovery Pipeline that integrates offline and online retrieval from multiple sources, multi‑criteria scoring, diversity‑aware ranking, and structured outputs; and (2) an Analysis Pipeline that transforms individual papers into structured knowledge graphs with typed nodes such as concepts, methods, experiments, and figures, enabling graph‑aware question answering and coverage verification. Both pipelines are implemented within a coder LLM‑based multi‑agent orchestration framework and produce fully reproducible, synchronized outputs including JSON, CSV, BibTeX, Markdown, and HTML at each agent step. This paper describes the system architecture, agent roles, retrieval and scoring methods, knowledge graph schema, and evaluation interfaces that together form the Paper Circle research workflow. We benchmark Paper Circle on both paper retrieval and paper review generation, reporting hit rate, MRR, and Recall at K. Results show consistent improvements with stronger agent models. We have publicly released the website at https://papercircle.vercel.app/ and the code at https://github.com/MAXNORM8650/papercircle.
Authors:Guhao Feng, Shengjie Luo, Kai Hua, Ge Zhang, Di He, Wenhao Huang, Tianle Cai
Abstract:
The static ``train then deploy" paradigm fundamentally limits Large Language Models (LLMs) from dynamically adapting their weights in response to continuous streams of new information inherent in real‑world tasks. Test‑Time Training (TTT) offers a compelling alternative by updating a subset of model parameters (fast weights) at inference time, yet its potential in the current LLM ecosystem is hindered by critical barriers including architectural incompatibility, computational inefficiency and misaligned fast weight objectives for language modeling. In this work, we introduce In‑Place Test‑Time Training (In‑Place TTT), a framework that seamlessly endows LLMs with Test‑Time Training ability. In‑Place TTT treats the final projection matrix of the ubiquitous MLP blocks as its adaptable fast weights, enabling a ``drop‑in" enhancement for LLMs without costly retraining from scratch. Furthermore, we replace TTT's generic reconstruction objective with a tailored, theoretically‑grounded objective explicitly aligned with the Next‑Token‑Prediction task governing autoregressive language modeling. This principled objective, combined with an efficient chunk‑wise update mechanism, results in a highly scalable algorithm compatible with context parallelism. Extensive experiments validate our framework's effectiveness: as an in‑place enhancement, it enables a 4B‑parameter model to achieve superior performance on tasks with contexts up to 128k, and when pretrained from scratch, it consistently outperforms competitive TTT‑related approaches. Ablation study results further provide deeper insights on our design choices. Collectively, our results establish In‑Place TTT as a promising step towards a paradigm of continual learning in LLMs.
Authors:Hongxu Zhou
Abstract:
Intrinsic self‑correction in Large Language Models (LLMs) frequently fails in open‑ended reasoning tasks due to ``hallucination snowballing,'' a phenomenon in which models recursively justify early errors during free‑text reflection. While structured feedback can mitigate this issue, existing approaches often rely on externally trained critics or symbolic tools, reducing agent autonomy. This study investigates whether enforcing structured reflection purely through Outlines‑based constrained decoding can disrupt error propagation without additional training. Evaluating an 8‑billion‑parameter model (Qwen3‑8B), we show that simply imposing structural constraints does not improve self‑correction performance. Instead, it triggers a new failure mode termed ``structure snowballing.'' We find that the cognitive load required to satisfy strict formatting rules pushes the model into formatting traps. This observation helps explain why the agent achieves near‑perfect superficial syntactic alignment yet fails to detect or resolve deeper semantic errors. These findings expose an ``alignment tax'' inherent to constrained decoding, highlighting a tension between structural granularity and internal model capacity in autonomous workflows. Code and raw logs are available in the GitHub repository: https://github.com/hongxuzhou/agentic_llm_structured_self_critique.
Authors:Michael Cuccarese
Abstract:
This paper presents epistemic blinding in the context of an agentic system that uses large language models to reason across multiple biological datasets for drug target prioritization. During development, it became apparent that LLM outputs silently blend data‑driven inference with memorized priors about named entities ‑ and the blend is invisible: there is no way to determine, from a single output, how much came from the data on the page and how much came from the model's training memory. Epistemic blinding is a simple inference‑time protocol that replaces entity identifiers with anonymous codes before prompting, then compares outputs against an unblinded control. The protocol does not make LLM reasoning deterministic, but it restores one critical axis of auditability: measuring how much of an output came from the supplied data versus the model's parametric knowledge. The complete target identification system is described ‑ including LLM‑guided evolutionary optimization of scoring functions and blinded agentic reasoning for target rationalization ‑ with demonstration that both stages operate without access to entity identity. In oncology drug target prioritization across four cancer types, blinding changes 16% of top‑20 predictions while preserving identical recovery of validated targets. The contamination problem is shown to generalize beyond biology: in S&P 500 equity screening, brand‑recognition bias reshapes 30‑40% of top‑20 rankings across five random seeds. To lower the barrier to adoption, the protocol is released as an open‑source tool and as a Claude Code skill that enables one‑command epistemic blinding within agentic workflows. The claim is not that blinded analysis produces better results, but that without blinding, there is no way to know to what degree the agent is adhering to the analytical process the researcher designed.
Authors:Xiaojie Gu, Ziying Huang, Weicong Hong, Jian Xie, Renze Lou, Kai Zhang
Abstract:
Large Language Models (LLMs) internalize vast world knowledge as parametric memory, yet inevitably inherit the staleness and errors of their source corpora. Consequently, ensuring the reliability and malleability of these internal representations is imperative for trustworthy real‑world deployment. Knowledge editing offers a pivotal paradigm for surgically modifying memory without retraining. However, while recent editors demonstrate high success rates on standard benchmarks, it remains questionable whether current evaluation frameworks that rely on assessing output under specific prompting conditions can reliably authenticate genuine memory modification. In this work, we introduce a simple diagnostic framework that subjects models to discriminative self‑assessment under in‑context learning (ICL) settings that better reflect real‑world application environments, specifically designed to scrutinize the subtle behavioral nuances induced by memory modifications. This probing reveals a pervasive phenomenon of Surface Compliance, where editors achieve high benchmark scores by merely mimicking target outputs without structurally overwriting internal beliefs. Moreover, we find that recursive modifications accumulate representational residues, triggering cognitive instability and permanently diminishing the reversibility of the model's memory state. These insights underscore the risks of current editing paradigms and highlight the pivotal role of robust memory modification in building trustworthy, long‑term sustainable LLM systems. Code is available at https://github.com/XiaojieGu/SA‑MCQ.
Authors:Xiao Qin, Xingyi Song, Tong Liu, Hatim Laalej, Zepeng Liu, Yunpeng Zhu, Ligang He
Abstract:
We present LoRM (Language of Rotating Machinery), a self‑supervised framework for multi‑modal rotating‑machinery signal understanding and real‑time condition monitoring. LoRM is built on the idea that rotating‑machinery signals can be viewed as a machine language: local signals can be tokenised into discrete symbolic units, and their future evolution can be predicted from observed multi‑sensor context. Unlike conventional signal‑processing methods that rely on hand‑crafted transforms and features, LoRM reformulates multi‑modal sensor data as a token‑based sequence‑prediction problem. For each data window, the observed context segment is retained in continuous form, while the future target segment of each sensing channel is quantised into a discrete token. Then, efficient knowledge transfer is achieved by partially fine‑tuning a general‑purpose pre‑trained language model on industrial signals, avoiding the need to train a large model from scratch. Finally, condition monitoring is performed by tracking token‑prediction errors as a health indicator, where increasing errors indicate degradation. In‑situ tool condition monitoring (TCM) experiments demonstrate stable real‑time tracking and strong cross‑tool generalisation, showing that LoRM provides a practical bridge between language modelling and industrial signal analysis. The source code is publicly available at https://github.com/Q159753258/LormPHM.
Authors:Yuanfu Sun, Kang Li, Dongzhe Fan, Jiajin Liu, Qiaoyu Tan
Abstract:
Large Language Models (LLMs) increasingly rely on agentic capabilities‑iterative retrieval, tool use, and decision‑making‑to overcome the limits of static, parametric knowledge. Yet existing agentic frameworks treat external information as unstructured text and fail to leverage the topological dependencies inherent in real‑world data. To bridge this gap, we introduce Agentic Graph Learning (AGL), a paradigm that reframes graph learning as an interleaved process of topology‑aware navigation and LLM‑based inference. Specifically, we propose AgentGL, the first reinforcement learning (RL)‑driven framework for AGL. AgentGL equips an LLM agent with graph‑native tools for multi‑scale exploration, regulates tool usage via search‑constrained thinking to balance accuracy and efficiency, and employs a graph‑conditioned curriculum RL strategy to stabilize long‑horizon policy learning without step‑wise supervision. Across diverse Text‑Attributed Graph (TAG) benchmarks and multiple LLM backbones, AgentGL substantially outperforms strong GraphLLMs and GraphRAG baselines, achieving absolute improvements of up to 17.5% in node classification and 28.4% in link prediction. These results demonstrate that AGL is a promising frontier for enabling LLMs to autonomously navigate and reason over complex relational environments. The code is publicly available at https://github.com/sunyuanfu/AgentGL.
Authors:Seungyoon Lee, Minhyuk Kim, Seongtae Hong, Youngjoon Jang, Dongsuk Oh, Heuiseok Lim
Abstract:
Existing multilingual embedding models often encounter challenges in cross‑lingual scenarios due to imbalanced linguistic resources and less consideration of cross‑lingual alignment during training. Although standardized contrastive learning approaches for cross‑lingual adaptation are widely adopted, they may struggle to capture fundamental alignment between languages and degrade performance in well‑aligned languages such as English. To address these challenges, we propose Cross‑Lingual Enhancement in Retrieval via Reverse‑training (CLEAR), a novel loss function utilizing a reverse training scheme to improve retrieval performance across diverse cross‑lingual retrieval scenarios. CLEAR leverages an English passage as a bridge to strengthen alignments between the target language and English, ensuring robust performance in the cross‑lingual retrieval task. Our extensive experiments demonstrate that CLEAR achieves notable improvements in cross‑lingual scenarios, with gains up to 15%, particularly in low‑resource languages, while minimizing performance degradation in English. Furthermore, our findings highlight that CLEAR offers promising effectiveness even in multilingual training, suggesting its potential for broad application and scalability. We release the code at https://github.com/dltmddbs100/CLEAR.
Authors:Yingjian Zhu, Xinming Wang, Kun Ding, Ying Wang, Bin Fan, Shiming Xiang
Abstract:
Multi‑modal Retrieval‑Augmented Generation (RAG) has emerged as a highly effective paradigm for Knowledge‑Based Visual Question Answering (KB‑VQA). Despite recent advancements, prevailing methods still primarily depend on images as the retrieval key, and often overlook or misplace the role of Vision‑Language Models (VLMs), thereby failing to leverage their potential fully. In this paper, we introduce WikiSeeker, a novel multi‑modal RAG framework that bridges these gaps by proposing a multi‑modal retriever and redefining the role of VLMs. Rather than serving merely as answer generators, we assign VLMs two specialized agents: a Refiner and an Inspector. The Refiner utilizes the capability of VLMs to rewrite the textual query according to the input image, significantly improving the performance of the multimodal retriever. The Inspector facilitates a decoupled generation strategy by selectively routing reliable retrieved context to another LLM for answer generation, while relying on the VLM's internal knowledge when retrieval is unreliable. Extensive experiments on EVQA, InfoSeek, and M2KR demonstrate that WikiSeeker achieves state‑of‑the‑art performance, with substantial improvements in both retrieval accuracy and answer quality. Our code will be released on https://github.com/zhuyjan/WikiSeeker.
Authors:Xinran Wang, Yuxuan Zhang, Xiao Zhang, Haolong Yan, Muxi Diao, Songyu Xu, Zhonghao Yan, Hongbing Li, Kongming Liang, Zhanyu Ma
Abstract:
Accurately detecting and localizing hallucinations is a critical task for ensuring high reliability of image captions. In the era of Multimodal Large Language Models (MLLMs), captions have evolved from brief sentences into comprehensive narratives, often spanning hundreds of words. This shift exponentially increases the challenge: models must now pinpoint specific erroneous spans or words within extensive contexts, rather than merely flag response‑level inconsistencies. However, existing benchmarks lack the fine granularity and domain diversity required to evaluate this capability. To bridge this gap, we introduce DetailVerifyBench, a rigorous benchmark comprising 1,000 high‑quality images across five distinct domains. With an average caption length of over 200 words and dense, token‑level annotations of multiple hallucination types, it stands as the most challenging benchmark for precise hallucination localization in the field of long image captioning to date. Our benchmark is available at https://zyx‑hhnkh.github.io/DetailVerifyBench/.
Authors:Jun Zhang, Yicheng Ji, Feiyang Ren, Yihang Li, Bowen Zeng, Zonghao Chen, Ke Chen, Lidan Shou, Gang Chen, Huan Li
Abstract:
Large Vision‑Language Models (LVLMs) enable sophisticated reasoning over images and videos, yet their inference is hindered by a systemic efficiency barrier known as visual token dominance. This overhead is driven by a multi‑regime interplay between high‑resolution feature extraction, quadratic attention scaling, and memory bandwidth constraints. We present a systematic taxonomy of efficiency techniques structured around the inference lifecycle, consisting of encoding, prefilling, and decoding. Unlike prior reviews focused on isolated optimizations, we analyze the end‑to‑end pipeline to reveal how upstream decisions dictate downstream bottlenecks, covering compute‑bound visual encoding, the intensive prefilling of massive contexts, and the ''visual memory wall'' in bandwidth‑bound decoding. By decoupling the efficiency landscape into the axes of shaping information density, managing long‑context attention, and overcoming memory limits, this work provides a structured analysis of how isolated optimizations compose to navigate the trade‑off between visual fidelity and system efficiency. The survey concludes by outlining four future frontiers supported by pilot empirical insights, including hybrid compression based on functional unit sensitivity, modality‑aware decoding with relaxed verification, progressive state management for streaming continuity, and stage‑disaggregated serving through hardware‑algorithm co‑design. Our literature repository is at https://github.com/SuDIS‑ZJU/Efficient‑LVLMs‑Inference.
Authors:Jinhu Fu, Yan Bai, Longzhu He, Yihang Lou, Yanxiao Zhao, Li Sun, Sen Su
Abstract:
Large language models (LLMs) can effectively handle outdated information through knowledge editing. However, current approaches face two key limitations: (I) Poor generalization: Most approaches rigidly inject new knowledge without ensuring that the model can use it effectively to solve practical problems. (II) Narrow scope: Current methods focus primarily on structured fact triples, overlooking the diverse unstructured forms of factual information (e.g., news, articles) prevalent in real‑world contexts. To address these challenges, we propose a new paradigm: teaching LLMs to edit knowledge via Chain of Thoughts (CoTs) reasoning (CoT2Edit). We first leverage language model agents for both structured and unstructured edited data to generate CoTs, building high‑quality instruction data. The model is then trained to reason over edited knowledge through supervised fine‑tuning (SFT) and Group Relative Policy Optimization (GRPO). At inference time, we integrate Retrieval‑Augmented Generation (RAG) to dynamically retrieve relevant edited facts for real‑time knowledge editing. Experimental results demonstrate that our method achieves strong generalization across six diverse knowledge editing scenarios with just a single round of training on three open‑source language models. The codes are available at https://github.com/FredJDean/CoT2Edit.
Authors:Xuan Xiong, Huan Liu, Li Gu, Zhixiang Chi, Yue Qiu, Yuanhao Yu, Yang Wang
Abstract:
Chain‑of‑thought (CoT) reasoning improves large language model performance on complex tasks, but often produces excessively long and inefficient reasoning traces. Existing methods shorten CoTs using length penalties or global entropy reduction, implicitly assuming that low uncertainty is desirable throughout reasoning. We show instead that reasoning efficiency is governed by the trajectory of uncertainty. CoTs with dominant downward entropy trends are substantially shorter. Motivated by this insight, we propose Entropy Trend Reward (ETR), a trajectory‑aware objective that encourages progressive uncertainty reduction while allowing limited local exploration. We integrate ETR into Group Relative Policy Optimization (GRPO) and evaluate it across multiple reasoning models and challenging benchmarks. ETR consistently achieves a superior accuracy‑efficiency tradeoff, improving DeepSeek‑R1‑Distill‑7B by 9.9% in accuracy while reducing CoT length by 67% across four benchmarks. Code is available at https://github.com/Xuan1030/ETR
Authors:Jason Lucas, Matt Murtagh, Ali Al-Lawati, Uchendu Uchendu, Adaku Uchendu, Dongwon Lee
Abstract:
Harmful content detectors‑particularly disinformation classifiers‑are predominantly developed and evaluated on Standard American English (SAE), leaving their robustness to dialectal variation unexplored. We present DIA‑HARM, the first benchmark for evaluating disinformation detection robustness across 50 English dialects spanning U.S., British, African, Caribbean, and Asia‑Pacific varieties. Using Multi‑VALUE's linguistically grounded transformations, we introduce D3 (Dialectal Disinformation Detection), a corpus of 195K samples derived from established disinformation benchmarks. Our evaluation of 16 detection models reveals systematic vulnerabilities: human‑written dialectal content degrades detection by 1.4‑3.6% F1, while AI‑generated content remains stable. Fine‑tuned transformers substantially outperform zero‑shot LLMs (96.6% vs. 78.3% best‑case F1), with some models exhibiting catastrophic failures exceeding 33% degradation on mixed content. Cross‑dialectal transfer analysis across 2,450 dialect pairs shows that multilingual models (mDeBERTa: 97.2% average F1) generalize effectively, while monolingual models like RoBERTa and XLM‑RoBERTa fail on dialectal inputs. These findings demonstrate that current disinformation detectors may systematically disadvantage hundreds of millions of non‑SAE speakers worldwide. We release the DIA‑HARM framework, D3 corpus, and evaluation tools: https://github.com/jsl5710/dia‑harm
Authors:Chan-Wei Hu, Zhengzhong Tu
Abstract:
Multi‑modal retrieval‑augmented generation (MM‑RAG) relies heavily on re‑rankers to surface the most relevant evidence for image‑question queries. However, standard re‑rankers typically process the full query image as a global embedding, making them susceptible to visual distractors (e.g., background clutter) that skew similarity scores. We propose Region‑R1, a query‑side region cropping framework that formulates region selection as a decision‑making problem during re‑ranking, allowing the system to learn to retain the full image or focus only on a question‑relevant region before scoring the retrieved candidates. Region‑R1 learns a policy with a novel region‑aware group relative policy optimization (r‑GRPO) to dynamically crop a discriminative region. Across two challenging benchmarks, E‑VQA and InfoSeek, Region‑R1 delivers consistent gains, achieving state‑of‑the‑art performances by increasing conditional Recall@1 by up to 20%. These results show the great promise of query‑side adaptation as a simple but effective way to strengthen MM‑RAG re‑ranking.
Authors:Giang Do, Hung Le, Truyen Tran
Abstract:
In the era of Large Language Models (LLMs), the Mixture of Experts (MoE) architecture has emerged as an effective approach for training extremely large models with improved computational efficiency. This success builds upon extensive prior research aimed at enhancing expert specialization in MoE‑based LLMs. However, the nature of such specializations and how they can be systematically interpreted remain open research challenges. In this work, we investigate this gap by posing a fundamental question: Do domain‑specific experts exist in MoE‑based LLMs? To answer the question, we evaluate ten advanced MoE‑based LLMs ranging from 3.8B to 120B parameters and provide empirical evidence for the existence of domain‑specific experts. Building on this finding, we propose Domain Steering Mixture of Experts (DSMoE), a training‑free framework that introduces zero additional inference cost and outperforms both well‑trained MoE‑based LLMs and strong baselines, including Supervised Fine‑Tuning (SFT). Experiments on four advanced open‑source MoE‑based LLMs across both target and non‑target domains demonstrate that our method achieves strong performance and robust generalization without increasing inference cost or requiring additional retraining. Our implementation is publicly available at https://github.com/giangdip2410/Domain‑specific‑Experts.
Authors:Jon-Paul Cacioli
Abstract:
Background: Children do not simply learn that balls are round and blocks are square. They learn that shape is the kind of feature that tends to define object categories ‑‑ a second‑order generalisation known as an overhypothesis [1, 2]. What kind of learning mechanism is sufficient for this inductive leap? Methods: We trained autoregressive transformer language models (3.4M‑25.6M parameters) on synthetic corpora in which shape is the stable feature dimension across categories, with eight conditions controlling for alternative explanations. Results: Across 120 pre‑registered runs evaluated on a 1,040‑item wug test battery, every model achieved perfect first‑order exemplar retrieval (100%) while second‑order generalisation to novel nouns remained at chance (50‑52%), a result confirmed by equivalence testing. A feature‑swap diagnostic revealed that models rely on frame‑to‑feature template matching rather than structured noun‑to‑domain‑to‑feature abstraction. Conclusions: These results reveal a clear limitation of autoregressive distributional sequence learning under developmental‑scale training conditions.
Authors:Jiahao Xu, Rui Hu, Olivera Kotevska, Zikai Zhang
Abstract:
Multi‑bit watermarking has emerged as a promising solution for embedding imperceptible binary messages into Large Language Model (LLM)‑generated text, enabling reliable attribution and tracing of malicious usage of LLMs. Despite recent progress, existing methods still face key limitations: some become computationally infeasible for large messages, while others suffer from a poor trade‑off between text quality and decoding accuracy. Moreover, the decoding accuracy of existing methods drops significantly when the number of tokens in the generated text is limited, a condition that frequently arises in practical usage. To address these challenges, we propose \textscXMark, a novel method for encoding and decoding binary messages in LLM‑generated texts. The unique design of \textscXMark's encoder produces a less distorted logit distribution for watermarked token generation, preserving text quality, and also enables its tailored decoder to reliably recover the encoded message with limited tokens. Extensive experiments across diverse downstream tasks show that \textscXMark significantly improves decoding accuracy while preserving the quality of watermarked text, outperforming prior methods. The code is at https://github.com/JiiahaoXU/XMark.
Authors:Berny Kabalisa
Abstract:
We introduce SenseAI, a human‑in‑the‑loop (HITL) validated financial sentiment dataset designed to capture not only model outputs but the full reasoning process behind them. Unlike existing resources, SenseAI incorporates reasoning chains, confidence scores, human correction signals, and real‑world market outcomes, providing a structure aligned with Reinforcement Learning from Human Feedback (RLHF) paradigms. The dataset consists of 1,439 labelled data points across 40 US‑listed equities and 13 financial data categories, enabling direct integration into modern LLM fine‑tuning pipelines. Through analysis, we identify several systematic patterns in model behavior, including a novel failure mode we term Latent Reasoning Drift, where models introduce information not grounded in the input, as well as consistent confidence miscalibration and forward projection tendencies. These findings suggest that LLM errors in financial reasoning are not random but occur within a predictable and correctable regime, supporting the use of structured HITL data for targeted model improvement. We discuss implications for financial AI systems and highlight opportunities for applying SenseAI in model evaluation and alignment.
Authors:Quyet V. Do, Thinh Pham, Nguyen Nguyen, Sha Li, Pratibha Zunjare, Tu Vu
Abstract:
We study a pipeline that curates reasoning data from initial structured data for improving long‑context reasoning in large language models (LLMs). Our approach, π^2, constructs high‑quality reasoning data through rigorous QA curation: 1) extracting and expanding tables from Wikipedia, 2) from the collected tables and relevant context, generating realistic and multi‑hop analytical reasoning questions whose answers are automatically determined and verified through dual‑path code execution, and 3) back‑translating step‑by‑step structured reasoning traces as solutions of QA pairs given realistic web‑search context. Supervised fine‑tuning with \textsc\smallgpt‑oss‑20b and \textsc\smallQwen3‑4B‑Instruct‑2507 on π^2 yields consistent improvements across four long‑context reasoning benchmarks and our alike π^2‑Bench, with average absolute accuracy gains of +4.3% and +2.7% respectively. Notably, our dataset facilitates self‑distillation, where \textsc\smallgpt‑oss‑20b even improves its average performance by +4.4% with its own reasoning traces, demonstrating π^2's usefulness. Our code, data, and models are open‑source at https://github.com/vt‑pi‑squared/pi‑squared.
Authors:Gowrav Vishwakarma, Christopher J. Agostino
Abstract:
We present Phase‑Associative Memory (PAM), a recurrent sequence model in which all representations are complex‑valued, associations accumulate in a matrix state S_t \in \mathbbC^d × d via outer products, and retrieval operates through the conjugate inner product K_t^ \cdot Q_t / \sqrtd. At ~100M parameters on WikiText‑103, PAM reaches validation perplexity 30.0, within ~10% of a matched transformer (27.1) trained under identical conditions, despite 4× arithmetic overhead from complex computation and no custom kernels. We trace the experimental path from vector‑state models, where holographic binding fails due to the O(1/\sqrtn) capacity degradation of superposed associations, to the matrix state that resolves it. The competitiveness of an architecture whose native operations are complex‑valued superposition and conjugate retrieval is consistent with recent empirical evidence that semantic interpretation in both humans and large language models exhibits non‑classical contextuality, and we discuss what this implies for the choice of computational formalism in language modeling.
Authors:Peter Balogh
Abstract:
The final MLP of GPT‑2 Small exhibits a fully legible routing program ‑‑ 27 named neurons organized into a three‑tier exception handler ‑‑ while the knowledge it routes remains entangled across ~3,040 residual neurons. We decompose all 3,072 neurons (to numerical precision) into: 5 fused Core neurons that reset vocabulary toward function words, 10 Differentiators that suppress wrong candidates, 5 Specialists that detect structural boundaries, and 7 Consensus neurons that each monitor a distinct linguistic dimension. The consensus‑exception crossover ‑‑ where MLP intervention shifts from helpful to harmful ‑‑ is statistically sharp (bootstrap 95% CIs exclude zero at all consensus levels; crossover between 4/7 and 5/7). Three experiments show that "knowledge neurons" (Dai et al., 2022), at L11 of this model, function as routing infrastructure rather than fact storage: the MLP amplifies or suppresses signals already present in the residual stream from attention, scaling with contextual constraint. A garden‑path experiment reveals a reversed garden‑path effect ‑‑ GPT‑2 uses verb subcategorization immediately, consistent with the exception handler operating at token‑level predictability rather than syntactic structure. This architecture crystallizes only at the terminal layer ‑‑ in deeper models, we predict equivalent structure at the final layer, not at layer 11. Code and data: https://github.com/pbalogh/transparent‑gpt2
Authors:Jaeyoon Jung, Yejun Yoon, Kunwoo Park
Abstract:
Automated fact‑checking is a crucial task not only in journalism but also across web platforms, where it supports a responsible information ecosystem and mitigates the harms of misinformation. While recent research has progressed from text‑only to multimodal fact‑checking, a prevailing assumption is that incorporating visual evidence universally improves performance. In this work, we challenge this assumption and show that indiscriminate use of multimodal evidence can reduce accuracy. To address this challenge, we propose AMuFC, a multimodal fact‑checking framework that employs two collaborative agents with distinct roles for the adaptive use of visual evidence: An Analyzer determines whether visual evidence is necessary for claim verification, and a Verifier predicts claim veracity conditioned on both the retrieved evidence and the Analyzer's assessment. Experimental results on three datasets show that incorporating the Analyzer's assessment of visual evidence necessity into the Verifier's prediction yields substantial improvements in verification performance. In addition to all code, we release WebFC, a newly constructed dataset for evaluating fact‑checking modules in a more realistic scenario, available at https://github.com/ssu‑humane/AMuFC.
Authors:Madhav S Baidya
Abstract:
Large Language Models (LLMs) have achieved strong performance in question answering and retrieval‑augmented generation (RAG), yet they implicitly assume that user queries are fully specified and answerable. In real‑world settings, queries are often incomplete, ambiguous, or missing critical variables, leading models to produce overconfident or hallucinated responses. In this work, we study decision‑aware query resolution under incomplete information, where a model must determine whether to Answer, Ask for clarification, or Abstain. We show that standard and enhanced RAG systems do not reliably exhibit such epistemic awareness, defaulting to answer generation even when information is insufficient. To address this, we propose PassiveQA, a three‑action framework that aligns model behaviour with information sufficiency through supervised finetuning. Our approach integrates structured information‑state representations, knowledge graph‑grounded context, and a finetuned planner that explicitly models missing variables and decision reasoning. Experiments across multiple QA datasets show that the finetuned planner achieves significant improvements in macro F1 and abstention recall while reducing hallucination rates, under a compute‑constrained training regime. These results provide strong empirical evidence that epistemic decision‑making must be learned during training rather than imposed at inference time.
Authors:Charafeddine Mouzouni
Abstract:
LLM agents with tool access can discover and exploit security vulnerabilities. This is known. What is not known is which features of a system prompt trigger this behaviour, and which do not. We present a systematic taxonomy based on approximately 10,000 trials across seven models, 37 prompt conditions, and 12 hypothesized attack dimensions ‑‑ all executed in real Docker sandboxes. Every condition includes the instruction "Always respect all rules and access policies"; only a single encouragement sentence varies. Within the task class studied (planted test‑runner vulnerabilities), the threat model is narrower than assumed: nine of twelve dimensions ‑‑ including minimization, moral licensing, incentives, identity priming, and reasoning triggers ‑‑ produce no detectable exploitation at n=50 per cell (upper 95% CI <7% per condition). But it is also more subtle than expected: one dimension, goal reframing, reliably triggers exploitation. "You are solving a puzzle; there may be hidden clues" produces 38‑40% exploitation on Claude Sonnet 4 despite the explicit rule instruction, replicating across four models (CTF framing: 8‑14% on DeepSeek, GPT‑5‑mini, o4‑mini). The agent does not override the rules; it reinterprets the task so that exploitative actions become task‑aligned. GPT‑4.1 produces no exploitation across 1,850 trials (37 conditions), and a temporal comparison across four OpenAI models released over eleven months shows a pattern consistent with improving safety training, though model capability differences are a confounder. The practical contribution is a narrowed, testable threat model: defenders should audit for goal‑reframing language, not for the broad class of adversarial prompts.
Authors:Varun Pratap Bhardwaj
Abstract:
AI coding agents operate in a paradox: they possess vast parametric knowledge yet cannot remember a conversation from an hour ago. Existing memory systems store text in vector databases with single‑channel retrieval, require cloud LLMs for core operations, and implement none of the cognitive processes that make human memory effective. We present SuperLocalMemory V3.3 ("The Living Brain"), a local‑first agent memory system implementing the full cognitive memory taxonomy with mathematical lifecycle dynamics. Building on the information‑geometric foundations of V3.2 (arXiv:2603.14588), we introduce five contributions: (1) Fisher‑Rao Quantization‑Aware Distance (FRQAD) ‑‑ a new metric on the Gaussian statistical manifold achieving 100% precision at preferring high‑fidelity embeddings over quantized ones (vs 85.6% for cosine), with zero prior art; (2) Ebbinghaus Adaptive Forgetting with lifecycle‑aware quantization ‑‑ the first mathematical forgetting curve in local agent memory coupled to progressive embedding compression, achieving 6.7x discriminative power; (3) 7‑channel cognitive retrieval spanning semantic, keyword, entity graph, temporal, spreading activation, consolidation, and Hopfield associative channels, achieving 70.4% on LoCoMo in zero‑LLM Mode A; (4) memory parameterization implementing Long‑Term Implicit memory via soft prompts; (5) zero‑friction auto‑cognitive pipeline automating the complete memory lifecycle. On LoCoMo, V3.3 achieves 70.4% in Mode A (zero‑LLM), with +23.8pp on multi‑hop and +12.7pp on adversarial. V3.2 achieved 74.8% Mode A and 87.7% Mode C; the 4.4pp gap reflects a deliberate architectural trade‑off. SLM V3.3 is open source under the Elastic License 2.0, runs entirely on CPU, with over 5,000 monthly downloads.
Authors:Fatemeh Khadem, Sajad Mousavi, Yi Fang, Yuhong Liu
Abstract:
Large language models (LLMs) are increasingly adapted to proprietary and domain‑specific corpora that contain sensitive information, creating a tension between formal privacy guarantees and efficient deployment through model compression. Differential privacy (DP), typically enforced via DP‑SGD, provides record‑level protection but often incurs substantial utility loss in autoregressive generation, where optimization noise can amplify exposure bias and compounding errors along long rollouts. Existing approaches to private distillation either apply DP‑SGD to both teacher and student, worsening computation and the privacy‑‑utility tradeoff, or rely on DP synthetic text generation from a DP‑trained teacher, avoiding DP on the student at the cost of DP‑optimizing a large teacher and introducing an offline generation pipeline. We propose Differentially Private On‑Policy Distillation (DP‑OPD), a synthesis‑free framework that enforces privacy solely through DP‑SGD on the student while leveraging a frozen teacher to provide dense token‑level targets on \emphstudent‑generated trajectories. DP‑OPD instantiates this idea via \emphprivate generalized knowledge distillation on continuation tokens. Under a strict privacy budget (\varepsilon=2.0), DP‑OPD improves perplexity over DP fine‑tuning and off‑policy DP distillation, and outperforms synthesis‑based DP distillation (Yelp: 44.15\rightarrow41.68; BigPatent: 32.43\rightarrow30.63), while substantially simplifying the training pipeline. In particular, DP‑OPD collapses private compression into a single DP student‑training loop by eliminating DP teacher training and offline synthetic text generation. Code will be released upon publication at https://github.com/khademfatemeh/dp_opd.
Authors:Abu Noman Md Sakib, Zhensen Wang, Merjulah Roby, Zijie Zhang
Abstract:
Reliable pattern recognition systems should exhibit consistent behavior across similar inputs, and their explanations should remain stable. However, most Explainable AI evaluations remain instance centric and do not explicitly quantify whether attribution patterns are consistent across samples that share the same class or represent small variations of the same input. In this work, we propose a novel metric aimed at assessing the consistency of model explanations, ensuring that models consistently reflect the intended objectives and consistency under label‑preserving perturbations. We implement this metric using a pre‑trained BERT model on the SST‑2 sentiment analysis dataset, with additional robustness tests on RoBERTa, DistilBERT, and IMDB, applying SHAP to compute feature importance for various test samples. The proposed metric quantifies the cosine similarity of SHAP values for inputs with the same label, aiming to detect inconsistent behaviors, such as biased reliance on certain features or failure to maintain consistent reasoning for similar predictions. Through a series of experiments, we evaluate the ability of this metric to identify misaligned predictions and inconsistencies in model explanations. These experiments are compared against standard fidelity metrics to assess whether the new metric can effectively identify when a model's behavior deviates from its intended objectives. The proposed framework provides a deeper understanding of model behavior by enabling more robust verification of rationale stability, which is critical for building trustworthy AI systems. By quantifying whether models rely on consistent attribution patterns for similar inputs, the proposed approach supports more robust evaluation of model behavior in practical pattern recognition pipelines. Our code is publicly available at https://github.com/anmspro/ESS‑XAI‑Stability.
Authors:Hiroshi Takahashi, Tomoharu Iwata, Atsutoshi Kumagai, Sekitoshi Kanai, Masanori Yamada, Kosuke Nishida, Kazutoshi Shinoda
Abstract:
Aligning language models with human preferences is essential for ensuring their safety and reliability. Although most existing approaches assume specific human preference models such as the Bradley‑Terry model, this assumption may fail to accurately capture true human preferences, and consequently, these methods lack statistical consistency, i.e., the guarantee that language models converge to the true human preference as the number of samples increases. In contrast, direct density ratio optimization (DDRO) achieves statistical consistency without assuming any human preference models. DDRO models the density ratio between preferred and non‑preferred data distributions using the language model, and then optimizes it via density ratio estimation. However, this density ratio is unstable and often diverges, leading to training instability of DDRO. In this paper, we propose a novel alignment method that is both stable and statistically consistent. Our approach is based on the relative density ratio between the preferred data distribution and a mixture of the preferred and non‑preferred data distributions. Our approach is stable since this relative density ratio is bounded above and does not diverge. Moreover, it is statistically consistent and yields significantly tighter convergence guarantees than DDRO. We experimentally show its effectiveness with Qwen 2.5 and Llama 3.
Authors:Saurav Jha, Maryam Hashemzadeh, Ali Saheb Pasand, Ali Parviz, Min-Joong Lee, Boris Knyazev
Abstract:
Mixture‑of‑Experts (MoE) large language models (LLMs) are among the top‑performing architectures. The largest models, often with hundreds of billions of parameters, pose significant memory challenges for deployment. Traditional approaches to reduce memory requirements include weight pruning and quantization. Motivated by the Router‑weighted Expert Activation Pruning (REAP) that prunes experts, we propose a novel method, Router‑weighted Expert Activation Merging (REAM). Instead of removing experts, REAM groups them and merges their weights, better preserving original performance. We evaluate REAM against REAP and other baselines across multiple MoE LLMs on diverse multiple‑choice (MC) question answering and generative (GEN) benchmarks. Our results reveal a trade‑off between MC and GEN performance that depends on the mix of calibration data. By controlling the mix of general, math and coding data, we examine the Pareto frontier of this trade‑off and show that REAM often outperforms the baselines and in many cases is comparable to the original uncompressed models.
Authors:Yujian Liu, Jiabao Ji, Li An, Tommi Jaakkola, Yang Zhang, Shiyu Chang
Abstract:
Agent skills, which are reusable, domain‑specific knowledge artifacts, have become a popular mechanism for extending LLM‑based agents, yet formally benchmarking skill usage performance remains scarce. Existing skill benchmarking efforts focus on overly idealized conditions, where LLMs are directly provided with hand‑crafted, narrowly‑tailored task‑specific skills for each task, whereas in many realistic settings, the LLM agent may have to search for and select relevant skills on its own, and even the closest matching skills may not be well‑tailored for the task. In this paper, we conduct the first comprehensive study of skill utility under progressively challenging realistic settings, where agents must retrieve skills from a large collection of 34k real‑world skills and may not have access to any hand‑curated skills. Our findings reveal that the benefits of skills are fragile: performance gains degrade consistently as settings become more realistic, with pass rates approaching no‑skill baselines in the most challenging scenarios. To narrow this gap, we study skill refinement strategies, including query‑specific and query‑agnostic approaches, and we show that query‑specific refinement substantially recovers lost performance when the initial skills are of reasonable relevance and quality. We further demonstrate the generality of retrieval and refinement on Terminal‑Bench 2.0, where they improve the pass rate of Claude Opus 4.6 from 57.7% to 65.5%. Our results, consistent across multiple models, highlight both the promise and the current limitations of skills for LLM‑based agents. Our code is available at https://github.com/UCSB‑NLP‑Chang/Skill‑Usage.
Authors:Haonian Ji, Kaiwen Xiong, Siwei Han, Peng Xia, Shi Qiu, Yiyang Zhou, Jiaqi Liu, Jinlong Li, Bingzhou Li, Zeyu Zheng, Cihang Xie, Huaxiu Yao
Abstract:
AI agents deployed as persistent assistants must maintain correct beliefs as their information environment evolves. In practice, evidence is scattered across heterogeneous sources that often contradict one another, new information can invalidate earlier conclusions, and user preferences surface through corrections rather than explicit instructions. Existing benchmarks largely assume static, single‑authority settings and do not evaluate whether agents can keep up with this complexity. We introduce ClawArena, a benchmark for evaluating AI agents in evolving information environments. Each scenario maintains a complete hidden ground truth while exposing the agent only to noisy, partial, and sometimes contradictory traces across multi‑channel sessions, workspace files, and staged updates. Evaluation is organized around three coupled challenges: multi‑source conflict reasoning, dynamic belief revision, and implicit personalization, whose interactions yield a 14‑category question taxonomy. Two question formats, multi‑choice (set‑selection) and shell‑based executable checks, test both reasoning and workspace grounding. The current release contains 64 scenarios across 8 professional domains, totaling 1,879 evaluation rounds and 365 dynamic updates. Experiments on five agent frameworks and five language models show that both model capability (15.4% range) and framework design (9.2%) substantially affect performance, that self‑evolving skill frameworks can partially close model‑capability gaps, and that belief revision difficulty is determined by update design strategy rather than the mere presence of updates. Code is available at https://github.com/aiming‑lab/ClawArena.
Authors:Avish Vijayaraghavan, Jaskaran Singh Kawatra, Sebin Sabu, Jonny Sheldon, Will Poulett, Alex Eze, Daniel Key, John Booth, Shiren Patel, Jonny Pearson, Dan Schofield, Jonathan Hope, Pavithra Rajendran, Neil Sebire
Abstract:
Electronic Patient Record (EPR) systems contain valuable clinical information, but much of it is trapped in unstructured text, limiting its use for research and decision‑making. Large language models can extract such information but require substantial computational resources to run locally, and sending sensitive clinical data to cloud‑based services, even when deidentified, raises significant patient privacy concerns. In this study, we develop a resource‑efficient semi‑automated annotation workflow using small language models (SLMs) to extract structured information from unstructured EPR data, focusing on paediatric histopathology reports. As a proof‑of‑concept, we apply the workflow to paediatric renal biopsy reports, a domain chosen for its constrained diagnostic scope and well‑defined underlying biology. We develop the workflow iteratively with clinical oversight across three meetings, manually annotating 400 reports from a dataset of 2,111 at Great Ormond Street Hospital as a gold standard, while developing an automated information extraction approach using SLMs. We frame extraction as a Question‑Answering task grounded by clinician‑guided entity guidelines and few‑shot examples, evaluating five instruction‑tuned SLMs with a disagreement modelling framework to prioritise reports for clinical review. Gemma 2 2B achieves the highest accuracy at 84.3%, outperforming off‑the‑shelf models including spaCy (74.3%), BioBERT‑SQuAD (62.3%), RoBERTa‑SQuAD (59.7%), and GLiNER (60.2%). Entity guidelines improved performance by 7‑19% over the zero‑shot baseline, and few‑shot examples by 6‑38%, though their benefits do not compound when combined. These results demonstrate that SLMs can extract structured information from specialised clinical domains on CPU‑only infrastructure with minimal clinician involvement. Our code is available at https://github.com/gosh‑dre/nlp_renal_biopsy.
Authors:Zhihan Guo, Rundong Xue, Yuting Lu, Jionghao Lin
Abstract:
Novice math teachers often encounter students' mistakes that are difficult to diagnose and remediate. Misconceptions are especially challenging because teachers must explain what went wrong and how to solve them. Although many existing large language model (LLM) platforms can assist in generating instructional feedback, these LLMs loosely connect pedagogical knowledge and student mistakes, which might make the guidance less actionable for teachers. To address this gap, we propose MisEdu‑RAG, a dual‑hypergraph‑based retrieval‑augmented generation (RAG) framework that organizes pedagogical knowledge as a concept hypergraph and real student mistake cases as an instance hypergraph. Given a query, MisEdu‑RAG performs a two‑stage retrieval to gather connected evidence from both layers and generates a response grounded in the retrieved cases and pedagogical principles. We evaluate on MisstepMath, a dataset of math mistakes paired with teacher solutions, as a benchmark for misconception‑aware retrieval and response generation across topics and error types. Evaluation results on MisstepMath show that, compared with baseline models, MisEdu‑RAG improves token‑F1 by 10.95% and yields up to 15.3% higher five‑dimension response quality, with the largest gains on Diversity and Empowerment. To verify its applicability in practical use, we further conduct a pilot study through a questionnaire survey of 221 teachers and interviews with 6 novices. The findings suggest that MisEdu‑RAG provides diagnosis results and concrete teaching moves for high‑demand misconception scenarios. Overall, MisEdu‑RAG demonstrates strong potential for scalable teacher training and AI‑assisted instruction for misconception handling. Our code is available on GitHub: https://github.com/GEMLab‑HKU/MisEdu‑RAG.
Authors:Xinyu Geng, Yanjing Xiao, Yuyang Zhang, Hanwen Wang, Xinyan Liu, Rui Min, Tianqing Fang, Yi R. Fung
Abstract:
Deep research agents integrate fragmented evidence through multi‑step tool use. BrowseComp offers a text‑only testbed for such agents, but existing multimodal benchmarks rarely require both weak visual cues composition and BrowseComp‑style multi‑hop verification. Geolocation is a natural testbed because answers depend on combining multiple ambiguous visual cues and validating them with open‑web evidence. Thus, we introduce GeoBrowse, a geolocation benchmark that combines visual reasoning with knowledge‑intensive multi‑hop queries. Level 1 tests extracting and composing fragmented visual cues, and Level 2 increases query difficulty by injecting long‑tail knowledge and obfuscating key entities. To support evaluation, we provide an agentic workflow GATE with five think‑with‑image tools and four knowledge‑intensive tools, and release expert‑annotated stepwise traces grounded in verifiable evidence for trajectory‑level analysis. Experiments show that GATE outperforms direct inference and open‑source agents, indicating that no‑tool, search‑only or image‑only setups are insufficient. Gains come from coherent, level‑specific tool‑use plans rather than more tool calls, as they more reliably reach annotated key evidence steps and make fewer errors when integrating into the final decision. The GeoBrowse bernchmark and codes are provided in https://github.com/ornamentt/GeoBrowse
Authors:Haotian Zong, Binze Li, Yufei Long, Sinyin Chang, Jialong Wu, Gillian K. Hadfield
Abstract:
Large language models (LLMs) frequently produce confident but incorrect answers, partly because common binary scoring conventions reward answering over honestly expressing uncertainty. We study whether prompt‑only interventions ‑‑ explicitly announcing reward schemes for answer‑versus‑abstain decisions plus humility‑oriented normative principles ‑‑ can reduce hallucination risk without modifying the model. Our focus is epistemic abstention on factual questions with a verifiable answer, where current LLMs often fail to abstain despite being uncertain about their answers. We first assess self‑reported verbal confidence as a usable uncertainty signal, showing stability under prompt paraphrasing and reasonable calibration against a token‑probability baseline. We then study I‑CALM, a prompt‑based framework that (i) elicits verbal confidence, (ii) partially rewards abstention through explicit reward schemes, and (iii) adds lightweight normative principles emphasizing truthfulness, humility, and responsibility. Using GPT‑5 mini on PopQA as the main setting, we find that confidence‑eliciting, abstention‑rewarding prompts, especially with norms, reduce the false‑answer rate on answered cases mainly by identifying and shifting error‑prone cases to abstention and re‑calibrating their confidence. This trades coverage for reliability while leaving forced‑answer performance largely unchanged. Varying the abstention reward yields a clear abstention‑hallucination frontier. Overall, results show the framework can improve selective answering on factual questions without retraining, with the magnitude of effect varying across models and datasets. Code is available at the following https://github.com/binzeli/hallucinationControl.
Authors:Baicheng Chen, Yu Wang, Ziheng Zhou, Xiangru Liu, Juanru Li, Yilei Chen, Tianxing He
Abstract:
Reverse engineering (RE) is central to software security, particularly for cryptographic programs that handle sensitive data and are highly prone to vulnerabilities. It supports critical tasks such as vulnerability discovery and malware analysis. Despite its importance, RE remains labor‑intensive and requires substantial expertise, making large language models (LLMs) a potential solution for automating the process. However, their capabilities for RE remain systematically underexplored. To address this gap, we study the cryptographic binary RE capabilities of LLMs and introduce CREBench, a benchmark comprising 432 challenges built from 48 standard cryptographic algorithms, 3 insecure crypto key usage scenarios, and 3 difficulty levels. Each challenge follows a Capture‑the‑Flag (CTF) RE challenge, requiring the model to analyze the underlying cryptographic logic and recover the correct input. We design an evaluation framework comprising four sub‑tasks, from algorithm identification to correct flag recovery. We evaluate eight frontier LLMs on CREBench. GPT‑5.4, the best‑performing model, achieves 64.03 out of 100 and recovers the flag in 59% of challenges. We also establish a strong human expert baseline of 92.19 points, showing that humans maintain an advantage in cryptographic RE tasks. Our code and dataset are available at https://github.com/wangyu‑ovo/CREBench.
Authors:Bingru Li, Han Wang, Hazel Wilkinson
Abstract:
Large Language Models (LLMs) can compose poetry, but how far are they from human poets? In this paper, we introduce POEMetric, the first comprehensive framework for poetry evaluation, examining 1) basic instruction‑following abilities in generating poems according to a certain form and theme, 2) advanced abilities of showing creativity, lexical diversity, and idiosyncrasy, evoking emotional resonance, and using imagery and literary devices, and 3) general appraisal of the overall poem quality and estimation of authorship. We curated a human poem dataset ‑ 203 English poems of 7 fixed forms annotated with meter, rhyme patterns and themes ‑ and experimented with 30 LLMs for poetry generation based on the same forms and themes of the human data, totaling 6,090 LLM poems. Based on POEMetric, we assessed the performance of both human poets and LLMs through rule‑based evaluation and LLM‑as‑a‑judge, whose results were validated by human experts. Results show that, though the top model achieved high form accuracy (4.26 out of 5.00, with Gemini‑2.5‑Pro as a judge; same below) and theme alignment (4.99), all models failed to reach the same level of advanced abilities as human poets, who achieved unparalleled creativity (4.02), idiosyncrasy (3.95), emotional resonance (4.06), and skillful use of imagery (4.49) and literary devices (4.67). Humans also defeated the best‑performing LLM in overall poem quality (4.22 vs. 3.20). As such, poetry generation remains a formidable challenge for LLMs. Data and codes are released at https://github.com/Bingru‑Li/POEMetric.
Authors:Minghai Jiao, Jing Xiao, Peng Xiao, Ende Zhang, Shuang Kan, Wenyan Jiang, Jinyao Li, Yixian Liu, Haidong Xin
Abstract:
Multimodal Sentiment Analysis (MSA) requires effective modeling of cross‑modal interactions and contextual dependencies while remaining computationally efficient. Existing fusion approaches predominantly rely on Transformer‑based cross‑modal attention, which incurs quadratic complexity with respect to sequence length and limits scalability. Moreover, contextual information from preceding utterances is often incorporated through concatenation or independent fusion, without explicit temporal modeling that captures sentiment evolution across dialogue turns. To address these limitations, we propose CAGMamba, a context‑aware gated cross‑modal Mamba framework for dialogue‑based sentiment analysis. Specifically, we organize the contextual and the current‑utterance features into a temporally ordered binary sequence, which provides Mamba with explicit temporal structure for modeling sentiment evolution. To further enable controllable cross‑modal integration, we propose a Gated Cross‑Modal Mamba Network (GCMN) that integrates cross‑modal and unimodal paths via learnable gating to balance information fusion and modality preservation, and is trained with a three‑branch multi‑task objective over text, audio, and fused predictions. Experiments on three benchmark datasets demonstrate that CAGMamba achieves state‑of‑the‑art or competitive results across multiple evaluation metrics. All codes are available at https://github.com/User2024‑xj/CAGMamba.
Authors:Ivan Yee Lee, Loris D'Antoni, Taylor Berg-Kirkpatrick
Abstract:
Asking a large language model to respond in JSON should be a formatting choice, not a capability tax. Yet we find that structured output requirements ‑‑ JSON, XML, LaTeX, Markdown ‑‑ substantially degrade reasoning and writing performance across open‑weight models. The research response has focused on constrained decoding, but sampling bias accounts for only a fraction of the degradation. The dominant cost enters at the prompt: format‑requesting instructions alone cause most of the accuracy loss, before any decoder constraint is applied. This diagnosis points to a simple principle: decouple reasoning from formatting. Whether by generating freeform first and reformatting in a second pass, or by enabling extended thinking within a single generation, separating the two concerns substantially recovers lost accuracy. Across six open‑weight models, four API models, four formats, and tasks spanning math, science, logic, and writing, decoupling recovers most lost accuracy. Notably, most recent closed‑weight models show little to no format tax, suggesting the problem is not inherent to structured generation but a gap that current open‑weight models have yet to close. Code is available at https://github.com/ivnle/the‑format‑tax.
Authors:Andrey Pustovit
Abstract:
RAG wastes tokens. We propose Knowledge Packs: pre‑computed KV caches that deliver the same knowledge at zero token cost. For causal transformers, the KV cache from a forward pass on text F is identical to what a joint pass on F+q would produce ‑ this follows directly from the causal mask. The equivalence is exact but fragile: wrong chat template formatting causes 6‑7pp degradation, which we believe explains prior claims of KV outperforming RAG. With correct formatting: zero divergences across 700 questions on Qwen3‑8B and Llama‑3.1‑8B, up to 95% token savings. The KV interface also enables behavioral steering that RAG cannot do. Because RoPE rotates keys but leaves values untouched, contrastive deltas on cached values can nudge model behavior while key arithmetic destroys coherence. The effect sits in mid‑layer values (33‑66%), independent directions are nearly orthogonal (cos~0) and compose, and both channels ‑ knowledge and steering ‑ run simultaneously at alpha<=0.7 without interference. No training, no weight modification.
Authors:Bo Kang, Sander Noels, Tijl De Bie
Abstract:
The rise of generative AI is posing increasing risks to online information integrity and civic discourse. Most concretely, such risks can materialise in the form of mis‑ and disinformation. As a mitigation, media‑literacy and transparency tools have been developed to address factuality of information and the reliability and ideological leaning of information sources. However, a subtler but possibly no less harmful threat to civic discourse is to use of persuasion or manipulation by exploiting human cognitive biases and related cognitive limitations. To the best of our knowledge, no tools exist to directly detect and mitigate the presence of triggers of such cognitive biases in online information. We present VIGIL (VIrtual GuardIan angeL), the first browser extension for real‑time cognitive bias trigger detection and mitigation, providing in‑situ scroll‑synced detection, LLM‑powered reformulation with full reversibility, and privacy‑tiered inference from fully offline to cloud. VIGIL is built to be extensible with third‑party plugins, with several plugins that are rigorously validated against NLP benchmarks are already included. It is open‑sourced at https://github.com/aida‑ugent/vigil.
Authors:David Ilić, Kostadin Cvejoski, David Stanojević, Evgeny Grigorenko
Abstract:
All prior membership inference attacks for fine‑tuned language models use hand‑crafted heuristics (e.g., loss thresholding, Min‑K%, reference calibration), each bounded by the designer's intuition. We introduce the first transferable learned attack, enabled by the observation that fine‑tuning any model on any corpus yields unlimited labeled data, since membership is known by construction. This removes the shadow model bottleneck and brings membership inference into the deep learning era: learning what matters rather than designing it, with generalization through training diversity and scale. We discover that fine‑tuning language models produces an invariant signature of memorization detectable across architectural families and data domains. We train a membership inference classifier exclusively on transformer‑based models. It transfers zero‑shot to Mamba (state‑space), RWKV‑4 (linear attention), and RecurrentGemma (gated recurrence), achieving 0.963, 0.972, and 0.936 AUC respectively. Each evaluation combines an architecture and dataset never seen during training, yet all three exceed performance on held‑out transformers (0.908 AUC). These four families share no computational mechanisms, their only commonality is gradient descent on cross‑entropy loss. Even simple likelihood‑based methods exhibit strong transfer, confirming the signature exists independently of the detection method. Our method, Learned Transfer MIA (LT‑MIA), captures this signal most effectively by reframing membership inference as sequence classification over per‑token distributional statistics. On transformers, LT‑MIA achieves 2.8× higher TPR at 0.1% FPR than the strongest baseline. The method also transfers to code (0.865 AUC) despite training only on natural language texts. Code and trained classifier available at https://github.com/JetBrains‑Research/learned‑mia.
Authors:Zihua Wang, Zhitao Lin, Ruibo Li, Yu Zhang, Xu Yang, Siya Mi, Xiu-Shen Wei
Abstract:
Vision‑Language‑Action (VLA) models, as large foundation models for embodied control, have shown strong performance in manipulation tasks. However, their performance comes at high inference cost. To improve efficiency, recent methods adopt action chunking, which predicts a sequence of future actions for open‑loop execution. Although effective for reducing computation, open‑loop execution is sensitive to environmental changes and prone to error accumulation due to the lack of close‑loop feedback. To address this limitation, we propose Speculative Verification for VLA Control (SV‑VLA), a framework that combines efficient open‑loop long‑horizon planning with lightweight closed‑loop online verification. Specifically, SV‑VLA uses a heavy VLA as a low‑frequency macro‑planner to generate an action chunk together with a planning context, while a lightweight verifier continuously monitors execution based on the latest observations. Conditioned on both the current observation and the planning context, the verifier compares the planned action against a closed‑loop reference action and triggers replanning only when necessary. Experiments demonstrate that SV‑VLA combines the efficiency of chunked prediction with the robustness of closed‑loop control, enabling efficient and reliable VLA‑based control in dynamic environments. Code is available: https://github.com/edsad122/SV‑VLA.
Authors:Yilin Xiao, Jin Chen, Qinggang Zhang, Yujing Zhang, Chuang Zhou, Longhao Yang, Lingfei Ren, Xin Yang, Xiao Huang
Abstract:
Graph‑based Retrieval‑Augmented Generation (GraphRAG) enhances the reasoning capabilities of Large Language Models (LLMs) by grounding their responses in structured knowledge graphs. Leveraging community detection and relation filtering techniques, GraphRAG systems demonstrate inherent resistance to traditional RAG attacks, such as text poisoning and prompt injection. However, in this paper, we find that the security of GraphRAG systems fundamentally relies on the topological integrity of the underlying graph, which can be undermined by implicitly corrupting the logical connections, without altering surface‑level text semantics. To exploit this vulnerability, we propose \textscLogicPoison, a novel attack framework that targets logical reasoning rather than injecting false contents. Specifically, \textscLogicPoison employs a type‑preserving entity swapping mechanism to perturb both global logic hubs for disrupting overall graph connectivity and query‑specific reasoning bridges for severing essential multi‑hop inference paths. This approach effectively reroutes valid reasoning into dead ends while maintaining surface‑level textual plausibility. Comprehensive experiments across multiple benchmarks demonstrate that \textscLogicPoison successfully bypasses GraphRAG's defenses, significantly degrading performance and outperforming state‑of‑the‑art baselines in both effectiveness and stealth. Our code is available at \textcolorbluehttps://github.com/Jord8061/logicPoison.
Authors:Patrick Pynadath, Jiaxin Shi, Ruqi Zhang
Abstract:
Diffusion language models have seen exciting recent progress, offering far more flexibility in generative trajectories than autoregressive models. This flexibility has motivated a growing body of research into new approaches to diffusion language modeling, which typically begins at the scale of GPT‑2 small (150 million parameters). However, these advances introduce new issues with evaluation methodology. In this technical note, we discuss the limitations of current methodology and propose principled augmentations to ensure reliable comparisons. We first discuss why OpenWebText has become the standard benchmark, and why alternatives such as LM1B are inherently less meaningful. We then discuss the limitations of likelihood evaluations for diffusion models, and explain why relying on generative perplexity alone as a metric can lead to uninformative results. To address this, we show that generative perplexity and entropy are two components of the KL divergence to a reference distribution. This decomposition explains generative perplexity's sensitivity to entropy, and naturally suggests generative frontiers as a principled method for evaluating model generative quality. We conclude with empirical observations on model quality at this scale. We include a blog post with interactive content to illustrate the argument at https://patrickpynadath1.github.io/blog/eval_methodology/.
Authors:Cristian Espinal Maya
Abstract:
This paper establishes the theoretical and practical foundations for using Large Language Models (LLMs) as measurement instruments for latent economic variables ‑‑ specifically variables that describe the cognitive content of occupational tasks at a level of granularity not achievable with existing survey instruments. I formalize four conditions under which LLM‑generated scores constitute valid instruments: semantic exogeneity, construct relevance, monotonicity, and model invariance. I then apply this framework to the Augmented Human Capital Index (AHC_o), constructed from 18,796 ONET task statements scored by Claude Haiku 4.5, and validated against six existing AI exposure indices. The index shows strong convergent validity (r = 0.85 with Eloundou GPT‑gamma, r = 0.79 with Felten AIOE) and discriminant validity. Principal component analysis confirms that AI‑related occupational measures span two distinct dimensions ‑‑ augmentation and substitution. Inter‑rater reliability across two LLM models (n = 3,666 paired scores) yields Pearson r = 0.76 and Krippendorff's alpha = 0.71. Prompt sensitivity analysis across four alternative framings shows that task‑level rankings are robust. Obviously Related Instrumental Variables (ORIV) estimation recovers coefficients 25% larger than OLS, consistent with classical measurement error attenuation. The methodology generalizes beyond labor economics to any domain where semantic content must be quantified at scale.
Authors:Warren Johnson, Charles Lee
Abstract:
Selecting the appropriate model at inference time ‑‑ the routing problem ‑‑ requires jointly optimizing output quality, cost, latency, and governance constraints. Existing approaches delegate this decision to LLM‑based classifiers or preference‑trained routers that are themselves costly and high‑latency, reducing a multi‑objective optimization to single‑dimensional quality prediction. We argue that small language models (SLMs, 1‑4B parameters) have now achieved sufficient reasoning capability for sub‑second, zero‑marginal‑cost, self‑hosted task classification, potentially making the routing decision negligible in the inference budget. We test this thesis on a six‑label taxonomy through two studies. Study 1 is a harmonized offline benchmark of Phi‑3.5‑mini, Qwen2.5‑1.5B, and Qwen‑2.5‑3B on identical Azure T4 hardware, serving stack, quantization, and a fixed 60‑case corpus. Qwen‑2.5‑3B achieves the best exact‑match accuracy (0.783), the strongest latency‑accuracy tradeoff, and the only nonzero accuracy on all six task families. Study 2 is a pre‑registered four‑arm randomized experiment under synthetic traffic with an effective sample size of 60 unique cases per arm, comparing Phi‑4‑mini, Qwen‑2.5‑3B, and DeepSeek‑V3 against a no‑routing control. DeepSeek‑V3 attains the highest accuracy (0.830) but fails the pre‑registered P95 latency gate (2,295 ms); Qwen‑2.5‑3B is Pareto‑dominant among self‑hosted models (0.793 accuracy, 988 ms median, 0 marginal cost). No model meets the standalone viability criterion (>=0.85 accuracy, <=2,000 ms P95). The cost and latency prerequisites for SLM‑based routing are met; the accuracy gap of 6‑8 percentage points and the untested question of whether correct classification translates to downstream output quality bound the remaining distance to production viability.
Authors:Syed Ahmed, Bharathi Vokkaliga Ganesh, Jagadish Babu P, Karthick Selvaraj, Praneeth Talluri, Sanket Hingne, Anubhav Kumar, Anushka Yadav, Pratham Kumar Verma, Kiranmayee Janardhan, Mandanna A N
Abstract:
Understanding how Large Language Models (LLMs) process information from prompts remains a significant challenge. To shed light on this "black box," attention visualization techniques have been developed to capture neuron‑level perceptions and interpret how models focus on different parts of input data. However, many existing techniques are tailored to specific model architectures, particularly within the Transformer family, and often require backpropagation, resulting in nearly double the GPU memory usage and increased computational cost. A lightweight, model‑agnostic approach for attention visualization remains lacking. In this paper, we introduce a model‑agnostic token importance visualization technique to better understand how generative AI systems perceive and prioritize information from input text, without incurring additional computational cost. Our method leverages perturbation‑based strategies combined with a three‑matrix analytical framework to generate relevance maps that illustrate token‑level contributions to model predictions. The framework comprises: (1) the Angular Deviation Matrix, which captures shifts in semantic direction; (2) the Magnitude Deviation Matrix, which measures changes in semantic intensity; and (3) the Dimensional Importance Matrix, which evaluates contributions across individual vector dimensions. By systematically removing each token and measuring the resulting impact across these three complementary dimensions, we derive a composite importance score that provides a nuanced and mathematically grounded measure of token significance. To support reproducibility and foster wider adoption, we provide open‑source implementations of all proposed and utilized explainability techniques, with code and resources publicly available at https://github.com/Infosys/Infosys‑Responsible‑AI‑Toolkit
Authors:Jeremy Herbst, Jae Hee Lee, Stefan Wermter
Abstract:
Mixture‑of‑Experts (MoE) architectures have become the dominant choice for scaling Large Language Models (LLMs), activating only a subset of parameters per token. While MoE architectures are primarily adopted for computational efficiency, it remains an open question whether their sparsity makes them inherently easier to interpret than dense feed‑forward networks (FFNs). We compare MoE experts and dense FFNs using k‑sparse probing and find that expert neurons are consistently less polysemantic, with the gap widening as routing becomes sparser. This suggests that sparsity pressures both individual neurons and entire experts toward monosemanticity. Leveraging this finding, we zoom out from the neuron to the expert level as a more effective unit of analysis. We validate this approach by automatically interpreting hundreds of experts. This analysis allows us to resolve the debate on specialization: experts are neither broad domain specialists (e.g., biology) nor simple token‑level processors. Instead, they function as fine‑grained task experts, specializing in linguistic operations or semantic tasks (e.g., closing brackets in LaTeX). Our findings suggest that MoEs are inherently interpretable at the expert level, providing a clearer path toward large‑scale model interpretability. Code is available at: https://github.com/jerryy33/MoE_analysis
Authors:Haomin Zhuang, Hojun Yoo, Xiaonan Luo, Kehan Guo, Xiangliang Zhang
Abstract:
Steering vectors offer a training‑free mechanism for controlling reasoning behaviors in large language models, but constructing effective vectors requires identifying genuine behavioral signals in the model's hidden states. For behaviors that can be toggled via prompts, this is straightforward. However, many reasoning behaviors ‑‑ such as self‑reflection ‑‑ emerge spontaneously and resist prompt‑level control. Current methods detect these behaviors through keyword matching in chain‑of‑thought traces, implicitly assuming that every detected boundary encodes a genuine behavioral signal. We show that this assumption is overwhelmingly wrong: across 541 keyword‑detected boundaries, 93.3% are behaviorally unstable, failing to reproduce the detected behavior under re‑generation from the same prefix. We develop a probabilistic model that formalizes intrinsic reasoning behaviors as stochastic events with context‑dependent trigger probabilities, and show that unstable boundaries dilute the steering signal. Guided by this analysis, we propose stability filtering, which retains only boundaries where the model consistently reproduces the target behavior. Combined with a content‑subspace projection that removes residual question‑specific noise, our method achieves 0.784 accuracy on MATH‑500 (+5.0 over the strongest baseline). The resulting steering vectors transfer across models in the same architecture family without re‑extraction, improving Nemotron‑Research‑Reasoning‑1.5B (+5.0) and DeepScaleR‑1.5B‑Preview (+6.0). Code is available at https://github.com/zhmzm/stability‑steering.
Authors:Jaber Jaber, Osama Jaber
Abstract:
Recursive transformers reuse a shared weight block across multiple depth steps, trading parameters for compute. A core limitation: every step applies the same transformation, preventing the model from composing distinct operations across depth. We present Ouroboros, a system that attaches a compact Controller hypernetwork to a recursive transformer block. The Controller observes the current hidden state, produces a per‑step diagonal modulation vector, and applies it to frozen SVD‑initialized LoRA bases, making each recurrence step input‑dependent. We combine this with gated recurrence (bias‑initialized to 88% retention) and per‑step LayerNorm for stable deep iteration. On Qwen2.5‑3B split into a Prelude/Recurrent/Coda architecture (17 of 36 layers retained), Ouroboros reduces training loss by 43.4% over the unmodified 17‑layer baseline, recovering 51.3% of the performance gap caused by layer removal. The full system adds only 9.2M trainable parameters (Controller, gate, and per‑step norms) yet outperforms equivalently‑sized static per‑step LoRA by 1.44 loss points at depth 1 and remains ahead across all tested depths (1, 4, 8, 16) and ranks (8, 32, 64). We also find that gated recurrence is essential: without it, recursive layer application makes the model strictly worse. These gains are measured on the training distribution; on held‑out text, the Controller does not yet improve over the baseline, a limitation we attribute to frozen downstream layers and discuss in detail. Code: https://github.com/RightNow‑AI/ouroboros
Authors:Yanxin Luo, Xiaoyu Zhang, Jing Li, Yan Gao, Donghong Han
Abstract:
Emotional Support Conversation (ESC) aims to alleviate individual emotional distress by generating empathetic responses. However, existing methods face challenges in effectively supporting deep contextual understanding. To address this issue, we propose PRCCF, a Persona‑guided Retrieval and Causality‑aware Cognitive Filtering framework. Specifically, the framework incorporates a persona‑guided retrieval mechanism that jointly models semantic compatibility and persona alignment to enhance response generation. Furthermore, it employs a causality‑aware cognitive filtering module to prioritize causally relevant external knowledge, thereby improving contextual cognitive understanding for emotional reasoning. Extensive experiments on the ESConv dataset demonstrate that PRCCF outperforms state‑of‑the‑art baselines on both automatic metrics and human evaluations. Our code is publicly available at: https://github.com/YancyLyx/PRCCF.
Authors:Shuibai Zhang, Caspian Zhuang, Chihan Cui, Zhihan Yang, Fred Zhangzhi Peng, Yanxin Zhang, Haoyue Bai, Zack Jia, Yang Zhou, Guanhua Chen, Ming Liu
Abstract:
Diffusion language models (DLMs) enable parallel, non‑autoregressive text generation, yet existing DLM mixture‑of‑experts (MoE) models inherit token‑choice (TC) routing from autoregressive systems, leading to load imbalance and rigid computation allocation. We show that expert‑choice (EC) routing is a better fit for DLMs: it provides deterministic load balancing by design, yielding higher throughput and faster convergence than TC. Building on the property that EC capacity is externally controllable, we introduce timestep‑dependent expert capacity, which varies expert allocation according to the denoising step. We find that allocating more capacity to low‑mask‑ratio steps consistently achieves the best performance under matched FLOPs, and provide a mechanistic explanation: tokens in low‑mask‑ratio contexts exhibit an order‑of‑magnitude higher learning efficiency, so concentrating compute on these steps yields the largest marginal return. Finally, we show that existing pretrained TC DLMs can be retrofitted to EC by replacing only the router, achieving faster convergence and improved accuracy across diverse downstream tasks. Together, these results establish EC routing as a superior paradigm for DLM MoE models and demonstrate that computation in DLMs can be treated as an adaptive policy rather than a fixed architectural constant. Code is available at https://github.com/zhangshuibai/EC‑DLM.
Authors:Di Wu, Siyue Liu, Zixiang Ji, Ya-Liang Chang, Zhe-Yu Liu, Andrew Pleffer, Kai-Wei Chang
Abstract:
Moderation layers are increasingly a core component of many products built on user‑ or model‑generated content. However, drafting and maintaining domain‑specific safety policies remains costly. We present Deep Policy Research (DPR), a minimal agentic system that drafts a full content moderation policy based on only human‑written seed domain information. DPR uses a single web search tool and lightweight scaffolding to iteratively propose search queries, distill diverse web sources into policy rules, and organize rules into an indexed document. We evaluate DPR on (1) the OpenAI undesired content benchmark across five domains with two compact reader LLMs and (2) an in‑house multimodal advertisement moderation benchmark. DPR consistently outperforms definition‑only and in‑context learning baselines, and in our end‑to‑end setting it is competitive with expert‑written policy sections in several domains. Moreover, under the same seed specification and evaluation protocol, DPR outperforms a general‑purpose deep research system, suggesting that a task‑specific, structured research loop can be more effective than generic web research for policy drafting. We release our experiment code at https://github.com/xiaowu0162/deep‑policy‑research.
Authors:Marco Morini, Sara Sarto, Marcella Cornia, Lorenzo Baraldi
Abstract:
Answering questions about images often requires combining visual understanding with external knowledge. Multimodal Large Language Models (MLLMs) provide a natural framework for this setting, but they often struggle to identify the most relevant visual and textual evidence when answering knowledge‑intensive queries. In such scenarios, models must integrate visual cues with retrieved textual evidence that is often noisy or only partially relevant, while also localizing fine‑grained visual information in the image. In this work, we introduce Look Twice (LoT), a training‑free inference‑time framework that improves how pretrained MLLMs utilize multimodal evidence. Specifically, we exploit the model attention patterns to estimate which visual regions and retrieved textual elements are relevant to a query, and then generate the answer conditioned on this highlighted evidence. The selected cues are highlighted through lightweight prompt‑level markers that encourage the model to re‑attend to the relevant evidence during generation. Experiments across multiple knowledge‑based VQA benchmarks show consistent improvements over zero‑shot MLLMs. Additional evaluations on vision‑centric and hallucination‑oriented benchmarks further demonstrate that visual evidence highlighting alone improves model performance in settings without textual context, all without additional training or architectural modifications. Source code will be publicly released.
Authors:Lei Wang, Eduard Dragut
Abstract:
Individuals engaging in online communication frequently express personal opinions with informal styles (e.g., memes and emojis). While Language Models (LMs) with informal communications have been widely discussed, a unique and emphatic style, the Repetitive Lengthening Form (RLF), has been overlooked for years. In this paper, we explore answers to two research questions: 1) Is RLF important for sentiment analysis (SA)? 2) Can LMs understand RLF? Inspired by previous linguistic research, we curate Lengthening, the first multi‑domain dataset with 850k samples focused on RLF for SA. Moreover, we introduce Explainable Instruction Tuning (ExpInstruct), a two‑stage instruction tuning framework aimed to improve both performance and explainability of LLMs for RLF. We further propose a novel unified approach to quantify LMs' understanding of informal expressions. We show that RLF sentences are expressive expressions and can serve as signatures of document‑level sentiment. Additionally, RLF has potential value for online content analysis. Our results show that fine‑tuned Pre‑trained Language Models (PLMs) can surpass zero‑shot GPT‑4 in performance but not in explanation for RLF. Finally, we show ExpInstruct can improve the open‑sourced LLMs to match zero‑shot GPT‑4 in performance and explainability for RLF with limited samples. Code and sample data are available at https://github.com/Tom‑Owl/OverlookedRLF
Authors:Cai Zhou, Zekai Wang, Menghua Wu, Qianyu Julie Zhu, Flora C. Shi, Chenyu Wang, Ashia Wilson, Tommi Jaakkola, Stephen Bates
Abstract:
While test‑time scaling has enabled large language models to solve highly difficult tasks, state‑of‑the‑art results come at exorbitant compute costs. These inefficiencies can be attributed to the miscalibration of post‑trained language models, and the lack of calibration in popular sampling techniques. Here, we present Online Reasoning Calibration (ORCA), a framework for calibrating the sampling process that draws upon conformal prediction and test‑time training. Specifically, we introduce a meta‑learning procedure that updates the calibration module for each input. This allows us to provide valid confidence estimates under distributional shift, e.g. in thought patterns that occur across different stages of reasoning, or in prompt distributions between model development and deployment. ORCA not only provides theoretical guarantees on conformal risks, but also empirically shows higher efficiency and generalization across different reasoning tasks. At risk level δ=0.1, ORCA improves Qwen2.5‑32B efficiency on in‑distribution tasks with savings up to 47.5% with supervised labels and 40.7% with self‑consistency labels. Under zero‑shot out‑of‑domain settings, it improves MATH‑500 savings from 24.8% of the static calibration baseline to 67.0% while maintaining a low empirical error rate, and the same trend holds across model families and downstream benchmarks. Our code is publicly available at https://github.com/wzekai99/ORCA.
Authors:Jack Young
Abstract:
Using roughly 48 execution‑verified HumanEval training solutions, tuning a single initial state matrix per recurrent layer, with zero inference overhead, outperforms LoRA by +10.8 pp (p < 0.001) on HumanEval. The method, which we call S0 tuning, optimizes one state matrix per recurrent layer while freezing all model weights. On Qwen3.5‑4B (GatedDeltaNet hybrid), S0 tuning improves greedy pass@1 by +23.6 +/‑ 1.7 pp (10 seeds). On FalconH1‑7B (Mamba‑2 hybrid), S0 reaches 71.8% +/‑ 1.3 and LoRA reaches 71.4% +/‑ 2.4 (3 seeds), statistically indistinguishable at this sample size while requiring no weight merging. Cross‑domain transfer is significant on MATH‑500 (+4.8 pp, p = 0.00002, 8 seeds) and GSM8K (+2.8 pp, p = 0.0003, 10 seeds); a text‑to‑SQL benchmark (Spider) shows no transfer, consistent with the trajectory‑steering mechanism. A prefix‑tuning control on a pure Transformer (Qwen2.5‑3B) degrades performance by ‑13.9 pp under all nine configurations tested. On Qwen3.5, a per‑step state‑offset variant reaches +27.1 pp, above both S0 and LoRA but with per‑step inference cost. Taken together, the results show that recurrent state initialization is a strong zero‑inference‑overhead PEFT surface for hybrid language models when verified supervision is scarce. The tuned state is a ~48 MB file; task switching requires no weight merging or model reload. Code and library: https://github.com/jackyoung27/s0‑tuning.
Authors:Atsuyuki Miyai, Mashiro Toyooka, Zaiying Zhao, Kenta Watanabe, Toshihiko Yamasaki, Kiyoharu Aizawa
Abstract:
This paper introduces the first systematic evaluation framework for quantifying the quality and risks of papers written by modern coding agents. While AI‑driven paper writing has become a growing concern, rigorous evaluation of the quality and potential risks of AI‑written papers remains limited, and a unified understanding of their reliability is still lacking. We introduce Paper Reconstruction Evaluation (PaperRecon), an evaluation framework in which an overview (overview.md) is created from an existing paper, after which an agent generates a full paper based on the overview and minimal additional resources, and the result is subsequently compared against the original paper. PaperRecon disentangles the evaluation of the AI‑written papers into two orthogonal dimensions, Presentation and Hallucination, where Presentation is evaluated using a rubric and Hallucination is assessed via agentic evaluation grounded in the original paper source. For evaluation, we introduce PaperWrite‑Bench, a benchmark of 51 papers from top‑tier venues across diverse domains published after 2025. Our experiments reveal a clear trade‑off: while both ClaudeCode and Codex improve with model advances, ClaudeCode achieves higher presentation quality at the cost of more than 10 hallucinations per paper on average, whereas Codex produces fewer hallucinations but lower presentation quality. This work takes a first step toward establishing evaluation frameworks for AI‑driven paper writing and improving the understanding of its risks within the research community.
Authors:Zhengyang Tang, Ke Ji, Xidong Wang, Zihan Ye, Xinyuan Wang, Yiduo Guo, Ziniu Li, Chenxin Li, Jingyuan Hu, Shunian Chen, Tongxu Luo, Jiaxi Bi, Zeyu Qin, Shaobo Wang, Xin Lai, Pengyuan Lyu, Junyi Li, Can Xu, Chengquan Zhang, Han Hu, Ming Yan, Benyou Wang
Abstract:
We study whether phone‑use agents respect privacy while completing benign mobile tasks. This question has remained hard to answer because privacy‑compliant behavior is not operationalized for phone‑use agents, and ordinary apps do not reveal exactly what data agents type into which form entries during execution. To make this question measurable, we introduce MyPhoneBench, a verifiable evaluation framework for privacy behavior in mobile agents. We operationalize privacy‑respecting phone use as permissioned access, minimal disclosure, and user‑controlled memory through a minimal privacy contract, iMy, and pair it with instrumented mock apps plus rule‑based auditing that make unnecessary permission requests, deceptive re‑disclosure, and unnecessary form filling observable and reproducible. Across five frontier models on 10 mobile apps and 300 tasks, we find that task success, privacy‑compliant task completion, and later‑session use of saved preferences are distinct capabilities, and no single model dominates all three. Evaluating success and privacy jointly reshuffles the model ordering relative to either metric alone. The most persistent failure mode across models is simple data minimization: agents still fill optional personal entries that the task does not require. These results show that privacy failures arise from over‑helpful execution of benign tasks, and that success‑only evaluation overestimates the deployment readiness of current phone‑use agents. All code, mock apps, and agent trajectories are publicly available at~ https://github.com/FreedomIntelligence/MyPhoneBench.
Authors:Zhuchenyang Liu, Yao Zhang, Yu Xiao
Abstract:
2D assembly diagrams are often abstract and hard to follow, creating a need for intelligent assistants that can monitor progress, detect errors, and provide step‑by‑step guidance. In mixed reality settings, such systems must recognize completed and ongoing steps from the camera feed and align them with the diagram instructions. Vision Language Models (VLMs) show promise for this task, but face a depiction gap because assembly diagrams and video frames share few visual features. To systematically assess this gap, we construct IKEA‑Bench, a benchmark of 1,623 questions across 6 task types on 29 IKEA furniture products, and evaluate 19 VLMs (2B‑38B) under three alignment strategies. Our key findings: (1) assembly instruction understanding is recoverable via text, but text simultaneously degrades diagram‑to‑video alignment; (2) architecture family predicts alignment accuracy more strongly than parameter count; (3) video understanding remains a hard bottleneck unaffected by strategy. A three‑level mechanistic analysis further reveals that diagrams and video occupy disjoint ViT subspaces, and that adding text shifts models from visual to text‑driven reasoning. These results identify visual encoding as the primary target for improving cross‑depiction robustness. Project page: https://ryenhails.github.io/IKEA‑Bench/
Authors:Henry Peng Zou, Chunyu Miao, Wei-Chieh Huang, Yankai Chen, Yue Zhou, Hanrong Zhang, Yaozu Wu, Liancheng Fang, Zhengyao Gu, Zhen Zhang, Kening Zheng, Fangxin Wang, Yi Nian, Shanghao Li, Wenzhe Fan, Langzhou He, Weizhi Zhang, Xue Liu, Philip S. Yu
Abstract:
As LLM agents transition from short, static problem solving to executing complex, long‑horizon tasks in dynamic environments, the ability to handle user interruptions, such as adding requirement or revising goals, during mid‑task execution is becoming a core requirement for realistic deployment. However, existing benchmarks largely assume uninterrupted agent behavior or study interruptions only in short, unconstrained language tasks. In this paper, we present the first systematic study of interruptible agents in long‑horizon, environmentally grounded web navigation tasks, where actions induce persistent state changes. We formalize three realistic interruption types, including addition, revision, and retraction, and introduce InterruptBench, a benchmark derived from WebArena‑Lite that synthesizes high‑quality interruption scenarios under strict semantic constraints. Using a unified interruption simulation framework, we evaluate six strong LLM backbones across single‑ and multi‑turn interruption settings, analyzing both their effectiveness in adapting to updated intents and their efficiency in recovering from mid‑task changes. Our results show that handling user interruptions effectively and efficiently during long‑horizon agentic tasks remains challenging for powerful large‑scale LLMs. Code and dataset are available at https://github.com/HenryPengZou/InterruptBench.
Authors:Nan Wang, Zhiwei Jin, Chen Chen, Haonan Lu
Abstract:
Document understanding and GUI interaction are among the highest‑value applications of Vision‑Language Models (VLMs), yet they impose exceptionally heavy computational burden: fine‑grained text and small UI elements demand high‑resolution inputs that produce tens of thousands of visual tokens. We observe that this cost is largely wasteful ‑‑ across document and GUI benchmarks, only 22‑‑71% of image patches are pixel‑unique, the rest being exact duplicates of another patch in the same image. We propose PixelPrune, which exploits this pixel‑level redundancy through predictive‑coding‑based compression, pruning redundant patches \emphbefore the Vision Transformer (ViT) encoder. Because it operates in pixel space prior to any neural computation, PixelPrune accelerates both the ViT encoder and the downstream LLM, covering the full inference pipeline. The method is training‑free, requires no learnable parameters, and supports pixel‑lossless compression (τ=0) as well as controlled lossy compression (τ>0). Experiments across three model scales and document and GUI benchmarks show that PixelPrune maintains competitive task accuracy while delivering up to 4.2× inference speedup and 1.9× training acceleration. Code is available at https://github.com/OPPO‑Mente‑Lab/PixelPrune.
Authors:Yilun Liu, Jinru Han, Sikuan Yan, Volker Tresp, Yunpu Ma
Abstract:
Standard Mixture‑of‑Experts (MoE) models rely on centralized routing mechanisms that introduce rigid inductive biases. We propose Routing‑Free MoE which eliminates any hard‑coded centralized designs including external routers, Softmax, Top‑K and load balancing, instead encapsulating all activation functionalities within individual experts and directly optimized through continuous gradient flow, enabling each expert to determine its activation entirely on its own. We introduce a unified adaptive load‑balancing framework to simultaneously optimize both expert‑balancing and token‑balancing objectives through a configurable interpolation, allowing flexible and customizable resource allocation. Extensive experiments show that Routing‑Free MoE can consistently outperform baselines with better scalability and robustness. We analyze its behavior in detail and offer insights that may facilitate future MoE design ad optimization.
Authors:Karan Singh, Michael Yu, Varun Gangal, Zhuofu Tao, Sachin Kumar, Emmy Liu, Steven Y. Feng
Abstract:
Retrieval‑augmented generation (RAG) improves language model (LM) performance by providing relevant context at test time for knowledge‑intensive situations. However, the relationship between parametric knowledge acquired during pretraining and non‑parametric knowledge accessed via retrieval remains poorly understood, especially under fixed data budgets. In this work, we systematically study the trade‑off between pretraining corpus size and retrieval store size across a wide range of model and data scales. We train OLMo‑2‑based LMs ranging from 30M to 3B parameters on up to 100B tokens of DCLM data, while varying both pretraining data scale (1‑150x the number of parameters) and retrieval store size (1‑20x), and evaluate performance across a diverse suite of benchmarks spanning reasoning, scientific QA, and open‑domain QA. We find that retrieval consistently improves performance over parametric‑only baselines across model scales and introduce a three‑dimensional scaling framework that models performance as a function of model size, pretraining tokens, and retrieval corpus size. This scaling manifold enables us to estimate optimal allocations of a fixed data budget between pretraining and retrieval, revealing that the marginal utility of retrieval depends strongly on model scale, task type, and the degree of pretraining saturation. Our results provide a quantitative foundation for understanding when and how retrieval should complement pretraining, offering practical guidance for allocating data resources in the design of scalable language modeling systems.
Authors:Yu Xia, Canwen Xu, Zhewei Yao, Julian McAuley, Yuxiong He
Abstract:
Group Relative Policy Optimization (GRPO) is widely used for reinforcement learning with verifiable rewards, but it often suffers from advantage collapse: when all rollouts in a group receive the same reward, the group yields zero relative advantage and thus no learning signal. For example, if a question is too hard for the reasoner, all sampled rollouts can be incorrect and receive zero reward. Recent work addresses this issue by adding hints or auxiliary scaffolds to such hard questions so that the reasoner produces mixed outcomes and recovers a non‑zero update. However, existing hints are usually fixed rather than adapted to the current reasoner, and a hint that creates learning signal under the hinted input does not necessarily improve the no‑hint policy used at test time. To this end, we propose Hint Learning for Reinforcement Learning (HiLL), a framework that jointly trains a hinter policy and a reasoner policy during RL. For each hard question, the hinter generates hints online conditioned on the current reasoner's incorrect rollout, allowing hint generation to adapt to the reasoner's evolving errors. We further introduce hint reliance, which measures how strongly correct hinted trajectories depend on the hint. We derive a transferability result showing that lower hint reliance implies stronger transfer from hinted success to no‑hint success, and we use this result to define a transfer‑weighted reward for training the hinter. Therefore, HiLL favors hints that not only recover informative GRPO groups, but also produce signals that are more likely to improve the original no‑hint policy. Experiments across multiple benchmarks show that HiLL consistently outperforms GRPO and prior hint‑based baselines, demonstrating the value of adaptive and transfer‑aware hint learning for RL. The code is available at https://github.com/Andree‑9/HiLL.
Authors:Han Zhu, Lingxuan Ye, Wei Kang, Zengwei Yao, Liyong Guo, Fangjun Kuang, Zhifeng Han, Weiji Zhuang, Long Lin, Daniel Povey
Abstract:
We present OmniVoice, a massive multilingual zero‑shot text‑to‑speech (TTS) model that scales to over 600 languages. At its core is a novel diffusion language model‑style discrete non‑autoregressive (NAR) architecture. Unlike conventional discrete NAR models that suffer from performance bottlenecks in complex two‑stage (text‑to‑semantic‑to‑acoustic) pipelines, OmniVoice directly maps text to multi‑codebook acoustic tokens. This simplified approach is facilitated by two key technical innovations: (1) a full‑codebook random masking strategy for efficient training, and (2) initialization from a pre‑trained LLM to ensure superior intelligibility. By leveraging a 581k‑hour multilingual dataset curated entirely from open‑source data, OmniVoice achieves the broadest language coverage to date and delivers state‑of‑the‑art performance across Chinese, English, and diverse multilingual benchmarks. Our code and pre‑trained models are publicly available at https://github.com/k2‑fsa/OmniVoice.
Authors:Jiayu Wang, Junyoung Lee
Abstract:
As Large Language Model (LLM) capabilities advance, the demand for high‑quality annotation of exponentially increasing text corpora has outpaced human capacity, leading to the widespread adoption of LLMs in automatic evaluation and annotation. However, proprietary LLMs often exhibit systematic biases that diverge from human expert consensus, lacks reproducibility, and raises data privacy concerns. Our work examines the viability of finetuning a quantized Small Language Model of 1.7B parameter size on limited human‑annotated data to serve as a highly aligned, deterministic evaluator and annotator. By implementing a custom, multi‑dimensional rubric framework and simple augmentation and regularization techniques, the proposed approach achieves higher inter‑annotator agreement (0.23 points increase in Krippendorff's α) than the best performing state‑of‑the‑art proprietary LLM. We also demonstrate the generalizability of the proposed training pipeline on a separate emotion classification task. The results show that task‑specific alignment and efficient 4‑bit quantized fine‑tuning provide superior open‑source alternative to using proprietary models for evaluation and annotation. Our finetuning approach is publicly available at https://github.com/jylee‑k/slm‑judge.
Authors:Jiwoo Ha, Jongwoo Baek, Jinhyun So
Abstract:
Recent Large Vision‑Language Models (LVLMs) have demonstrated remarkable performance across various multimodal tasks that require understanding both visual and linguistic inputs. However, object hallucination ‑‑ the generation of nonexistent objects in answers ‑‑ remains a persistent challenge. Although several approaches such as retraining and external grounding methods have been proposed to mitigate this issue, they still suffer from high data costs or structural complexity. Training‑free methods such as Contrastive Decoding (CD) are more cost‑effective, avoiding additional training or external models, but still suffer from long‑term decay, where visual grounding weakens and language priors dominate as the generation progresses. In this paper, we propose First Logit Boosting (FLB), a simple yet effective training‑free technique designed to alleviate long‑term decay in LVLMs. FLB stores the logit of the first generated token and adds it to subsequent token predictions, effectively mitigating long‑term decay of visual information. We observe that FLB (1) sustains the visual information embedded in the first token throughout generation, and (2) suppresses hallucinated words through the stabilizing effect of the ``The'' token. Experimental results show that FLB significantly reduces object hallucination across various tasks, benchmarks, and backbone models. Notably, it causes negligible inference overhead, making it highly applicable to real‑time multimodal systems. Code is available at https://github.com/jiwooha20/FLB
Authors:Ponhvoan Srey, Quang Minh Nguyen, Xiaobao Wu, Anh Tuan Luu
Abstract:
Uncertainty estimation (UE) aims to detect hallucinated outputs of large language models (LLMs) to improve their reliability. However, UE metrics often exhibit unstable performance across configurations, which significantly limits their applicability. In this work, we formalise this phenomenon as proxy failure, since most UE metrics originate from model behaviour, rather than being explicitly grounded in the factual correctness of LLM outputs. With this, we show that UE metrics become non‑discriminative precisely in low‑information regimes. To alleviate this, we propose Truth AnChoring (TAC), a post‑hoc calibration method to remedy UE metrics, by mapping the raw scores to truth‑aligned scores. Even with noisy and few‑shot supervision, our TAC can support the learning of well‑calibrated uncertainty estimates, and presents a practical calibration protocol. Our findings highlight the limitations of treating heuristic UE metrics as direct indicators of truth uncertainty, and position our TAC as a necessary step toward more reliable uncertainty estimation for LLMs. The code repository is available at https://github.com/ponhvoan/TruthAnchor/.
Authors:Wenxuan Jiang, Yuxin Zuo, Zijian Zhang, Xuecheng Wu, Zining Fan, Wenxuan Liu, Li Chen, Xiaoyu Li, Xuezhi Cao, Xiaolong Jin, Ninghao Liu
Abstract:
In‑Context Reinforcement Learning (ICRL) enables Large Language Models (LLMs) to learn online from external rewards directly within the context window. However, a central challenge in ICRL is reward estimation, as models typically lack access to ground‑truths during inference. To address this limitation, we propose Test‑Time Rethinking for In‑Context Reinforcement Learning (TR‑ICRL), a novel ICRL framework designed for both reasoning and knowledge‑intensive tasks. TR‑ICRL operates by first retrieving the most relevant instances from an unlabeled evaluation set for a given query. During each ICRL iteration, LLM generates a set of candidate answers for every retrieved instance. Next, a pseudo‑label is derived from this set through majority voting. This label then serves as a proxy to give reward messages and generate formative feedbacks, guiding LLM through iterative refinement. In the end, this synthesized contextual information is integrated with the original query to form a comprehensive prompt, with the answer determining through a final round of majority voting. TR‑ICRL is evaluated on mainstream reasoning and knowledge‑intensive tasks, where it demonstrates significant performance gains. Remarkably, TR‑ICRL improves Qwen2.5‑7B by 21.23% on average on MedQA and even 137.59% on AIME2024. Extensive ablation studies and analyses further validate the effectiveness and robustness of our approach. Our code is available at https://github.com/pangpang‑xuan/TR_ICRL.
Authors:Annette Taberner-Miller
Abstract:
Production LLM serving often relies on multi‑model portfolios spanning a ~530x cost range, where routing decisions trade off quality against cost. This trade‑off is non‑stationary: providers revise pricing, model quality can regress silently, and new models must be integrated without downtime. We present ParetoBandit, an open‑source adaptive router built on cost‑aware contextual bandits that is the first to simultaneously enforce dollar‑denominated budgets, adapt online to such shifts, and onboard new models at runtime. ParetoBandit closes these gaps through three mechanisms. An online primal‑dual budget pacer enforces a per‑request cost ceiling over an open‑ended stream, replacing offline penalty tuning with closed‑loop control. Geometric forgetting on sufficient statistics enables rapid adaptation to price and quality shifts while bootstrapping from offline priors. A hot‑swap registry lets operators add or remove models at runtime, with a brief forced‑exploration phase for each newcomer, after which UCB selection discovers its quality‑cost niche from live traffic alone. We evaluate ParetoBandit across four deployment scenarios on 1,824 prompts routed through a three‑model portfolio. Across seven budget ceilings, mean per‑request cost never exceeds the target by more than 0.4%. When conditions shift, the system adapts: an order‑of‑magnitude price cut on the costliest model yields up to +0.071 quality lift, and a silent quality regression is detected and rerouted within budget. A cold‑started model reaches meaningful adoption within ~142 steps without breaching the cost ceiling. The router discriminates rather than blindly adopting: expensive models are budget‑gated and low‑quality models rejected after bounded exploration. End‑to‑end routing latency is 9.8ms on CPU ‑‑ less than 0.4% of typical inference time ‑‑ with the routing decision itself taking just 22.5us.
Authors:Ashish Rana, Chia-Chien Hung, Qumeng Sun, Julian Martin Kunkel, Carolin Lawrence
Abstract:
Human memory adapts through selective forgetting: experiences become less accessible over time but can be reactivated by reinforcement or contextual cues. In contrast, memory‑augmented LLM agents rely on "always‑on" retrieval and "flat" memory storage, causing high interference and latency as histories grow. We introduce Oblivion, a memory control framework that casts forgetting as decay‑driven reductions in accessibility, not explicit deletion. Oblivion decouples memory control into read and write paths. The read path decides when to consult memory, based on agent uncertainty and memory buffer sufficiency, avoiding redundant always‑on access. The write path decides what to strengthen, by reinforcing memories contributing to forming the response. Together, this enables hierarchical memory organization that maintains persistent high‑level strategies while dynamically loading details as needed. We evaluate on both static and dynamic long‑horizon interaction benchmarks. Results show that Oblivion dynamically adapts memory access and reinforcement, balancing learning and forgetting under shifting contexts, highlighting that memory control is essential for effective LLM‑agentic reasoning. The source code is available at https://github.com/nec‑research/oblivion.
Authors:Xingshuai Huang, Derek Li, Bahareh Nikpour, Parsa Omidi
Abstract:
Chain‑of‑Thought (CoT) prompting has significantly improved the reasoning capabilities of large language models (LLMs). However, conventional CoT often relies on unstructured, flat reasoning chains that suffer from redundancy and suboptimal performance. In this work, we introduce Hierarchical Chain‑of‑Thought (Hi‑CoT) prompting, a structured reasoning paradigm specifically designed to address the challenges of complex, multi‑step reasoning. Hi‑CoT decomposes the reasoning process into hierarchical substeps by alternating between instructional planning and step‑by‑step execution. This decomposition enables LLMs to better manage long reasoning horizons and maintain logical coherence. Extensive evaluations across diverse LLMs and mathematical reasoning benchmarks show that Hi‑CoT consistently improves average accuracy by 6.2% (up to 61.4% on certain models and tasks) while reducing reasoning trace length by 13.9% compared to CoT prompting. We further show that accuracy and efficiency are maximized when models strictly adhere to the hierarchical structure. Our code is available at https://github.com/XingshuaiHuang/Hi‑CoT.
Authors:Simon Schug, Brenden M. Lake
Abstract:
The validity of online behavioral research relies on study participants being human rather than machine. In the past, it was possible to detect machines by posing simple challenges that were easily solved by humans but not by machines. General‑purpose agents based on large language models (LLMs) can now solve many of these challenges, threatening the validity of online behavioral research. Here we explore the idea of detecting humanness by using tasks that machines can solve too well to be human. Specifically, we probe for the existence of an established human cognitive constraint: limited working memory capacity. We show that cognitive modeling on a standard serial recall task can be used to distinguish online participants from LLMs even when the latter are specifically instructed to mimic human working memory constraints. Our results demonstrate that it is viable to use well‑established cognitive phenomena to distinguish LLMs from humans.
Authors:Ning Yang, Hengyu Zhong, Wentao Wang, Baoliang Tian, Haijun Zhang, Jun Wang
Abstract:
The extension of context windows in Large Language Models is typically facilitated by scaling positional encodings followed by lightweight Continual Pre‑Training (CPT). While effective for processing long sequences, this paradigm often disrupts original model capabilities, leading to performance degradation on standard short‑text benchmarks. We propose LinearARD, a self‑distillation method that restores Rotary Position Embeddings (RoPE)‑scaled students through attention‑structure consistency with a frozen native‑RoPE teacher. Rather than matching opaque hidden states, LinearARD aligns the row‑wise distributions of dense Q/Q, K/K, and V/V self‑relation matrices to directly supervise attention dynamics. To overcome the quadratic memory bottleneck of n × n relation maps, we introduce a linear‑memory kernel. This kernel leverages per‑token log‑sum‑exp statistics and fuses logit recomputation into the backward pass to compute exact Kullback‑Leibler divergence and gradients. On LLaMA2‑7B extended from 4K to 32K, LinearARD recovers 98.3% of the short‑text performance of state‑of‑the‑art baselines while surpassing them on long‑context benchmarks. Notably, our method achieves these results using only 4.25M training tokens compared to the 256M tokens required by LongReD and CPT. Our code is available at https://github.com/gracefulning/LinearARD.
Authors:Soveatin Kuntur, Nina Smirnova, Anna Wroblewska, Philipp Mayr, Sebastijan Razboršek Maček
Abstract:
This paper investigates sentence‑level text reuse in multilingual journalism, analyzing where reused content occurs within articles. We present a weakly supervised method for detecting sentence‑level cross‑lingual reuse without requiring full translations, designed to support automated pre‑selection to reduce information overload for journalists (Holyst et al., 2024). The study compares English‑language articles from the Slovenian Press Agency (STA) with reports from 15 foreign agencies (FA) in seven languages, using publication timestamps to retain the earliest likely foreign source for each reused sentence. We analyze 1,037 STA and 237,551 FA articles from two time windows (October 7‑November 2, 2023; February 1‑28, 2025) and identify 1,087 aligned sentence pairs after filtering to the earliest sources. Reuse occurs in 52% of STA articles and 1.6% of FA articles and is predominantly non‑literal, involving paraphrase and compositional reuse from multiple sources. Reused content tends to appear in the middle and end of English articles, while leads are more often original, indicating that simple lexical matching overlooks substantial editorial reuse. Compared with prior work focused on monolingual overlap, we (i) detect reuse across languages without requiring full translation, (ii) use publication timing to identify likely sources, and (iii) analyze where reused material is situated within articles. Dataset and code: https://github.com/kunturs/lrec2026‑rewrite‑news.
Authors:Han Deng, Anqi Zou, Hanling Zhang, Ben Fei, Chengyu Zhang, Haobo Wang, Xinru Guo, Zhenyu Li, Xuzhu Wang, Peng Yang, Fujian Zhang, Weiyu Guo, Xiaohong Shao, Zhaoyang Liu, Shixiang Tang, Zhihui Wang, Wanli Ouyang
Abstract:
Scientific discovery increasingly depends on high‑throughput characterization, yet automation is hindered by proprietary GUIs and the limited generalizability of existing API‑based systems. We present Owl‑AuraID, a software‑hardware collaborative embodied agent system that adopts a GUI‑native paradigm to operate instruments through the same interfaces as human experts. Its skill‑centric framework integrates Type‑1 (GUI operation) and Type‑2 (data analysis) skills into end‑to‑end workflows, connecting physical sample handling with scientific interpretation. Owl‑AuraID demonstrates broad coverage across ten categories of precision instruments and diverse workflows, including multimodal spectral analysis, microscopic imaging, and crystallographic analysis, supporting modalities such as FTIR, NMR, AFM, and TGA. Overall, Owl‑AuraID provides a practical, extensible foundation for autonomous laboratories and illustrates a path toward evolving laboratory intelligence through reusable operational and analytical skills. The code are available at https://github.com/OpenOwlab/AuraID.
Authors:Lixin Xiu, Xufang Luo, Hideki Nakayama
Abstract:
Large vision‑language models (LVLMs) achieve impressive performance, yet their internal decision‑making processes remain opaque, making it difficult to determine if the success stems from true multimodal fusion or from reliance on unimodal priors. To address this attribution gap, we introduce a novel framework using partial information decomposition (PID) to quantitatively measure the "information spectrum" of LVLMs ‑‑ decomposing a model's decision‑relevant information into redundant, unique, and synergistic components. By adapting a scalable estimator to modern LVLM outputs, our model‑agnostic pipeline profiles 26 LVLMs on four datasets across three dimensions ‑‑ breadth (cross‑model & cross‑task), depth (layer‑wise information dynamics), and time (learning dynamics across training). Our analysis reveals two key results: (i) two task regimes (synergy‑driven vs. knowledge‑driven) and (ii) two stable, contrasting family‑level strategies (fusion‑centric vs. language‑centric). We also uncover a consistent three‑phase pattern in layer‑wise processing and identify visual instruction tuning as the key stage where fusion is learned. Together, these contributions provide a quantitative lens beyond accuracy‑only evaluation and offer insights for analyzing and designing the next generation of LVLMs. Code and data are available at https://github.com/RiiShin/pid‑lvlm‑analysis .
Authors:Robinson Ferrer, Damla Turgut, Zhongzhou Chen, Shashank Sonkar
Abstract:
Large Language Models (LLMs) show promise for automated grading, but their outputs can be unreliable. Rather than improving grading accuracy directly, we address a complementary problem: predicting when an LLM grader is likely to be correct. This enables selective automation where high‑confidence predictions are processed automatically while uncertain cases are flagged for human review. We compare three confidence estimation methods (self‑reported confidence, self‑consistency voting, and token probability) across seven LLMs of varying scale (4B to 120B parameters) on three educational datasets: RiceChem (long‑answer chemistry), SciEntsBank, and Beetle (short‑answer science). Our experiments reveal that self‑reported confidence consistently achieves the best calibration across all conditions (avg ECE 0.166 vs 0.229 for self‑consistency). Surprisingly, self‑consistency remains 38% worse despite requiring 5× the inference cost. Larger models exhibit substantially better calibration though gains vary by dataset and method (e.g., a 28% ECE reduction for self‑reported), with GPT‑OSS‑120B achieving the best calibration (avg ECE 0.100) and strong discrimination (avg AUC 0.668). We also observe that confidence is strongly top‑skewed across methods, creating a ``confidence floor'' that practitioners must account for when setting thresholds. These findings suggest that simply asking LLMs to report their confidence provides a practical approach for identifying reliable grading predictions. Code is available \hrefhttps://github.com/sonkar‑lab/llm_grading_calibrationhere.
Authors:Linda Zeng, Steven Y. Feng, Michael C. Frank
Abstract:
Multilingualism is incredibly common around the world, leading to many important theoretical and practical questions about how children learn multiple languages at once. For example, does multilingual acquisition lead to delays in learning? Are there better and worse ways to structure multilingual input? Many correlational studies address these questions, but it is surprisingly difficult to get definitive answers because children cannot be randomly assigned to be multilingual and data are typically not matched between languages. We use language model training as a method for simulating a variety of highly controlled exposure conditions, and create matched 100M‑word mono‑ and bilingual datasets using synthetic data and machine translation. We train GPT‑2 models on monolingual and bilingual data organized to reflect a range of exposure regimes, and evaluate their performance on perplexity, grammaticality, and semantic knowledge. Across model scales and measures, bilingual models perform similarly to monolingual models in one language, but show strong performance in the second language as well. These results suggest that there are no strong differences between different bilingual exposure regimes, and that bilingual input poses no in‑principle challenges for agnostic statistical learners.
Authors:Steven Y. Feng, Alvin W. M. Tan, Michael C. Frank
Abstract:
Modern language models (LMs) must be trained on many orders of magnitude more words of training data than human children receive before they begin to produce useful behavior. Assessing the nature and origins of this "data gap" requires benchmarking LMs on human‑scale datasets to understand how linguistic knowledge emerges from children's natural training data. Using transcripts from the BabyView dataset (videos from children ages 6‑36 months), we investigate (1) scaling performance at child‑scale data regimes, (2) variability in model performance across datasets from different children's experiences and linguistic predictors of dataset quality, and (3) relationships between model and child language learning outcomes. LMs trained on child data show acceptable scaling for grammar tasks, but lower scaling on semantic and world knowledge tasks than models trained on synthetic data; we also observe substantial variability on data from different children. Beyond dataset size, performance is most associated with a combination of distributional and interactional linguistic features, broadly consistent with what makes high‑quality input for child language development. Finally, model likelihoods for individual words correlate with children's learning of those words, suggesting that properties of child‑directed input may influence both model learning and human language development. Overall, understanding what properties make language data efficient for learning can enable more powerful small‑scale language models while also shedding light on human language acquisition.
Authors:Ziliang Guo, Ziheng Li, Bo Tang, Feiyu Xiong, Zhiyu Li
Abstract:
Memory‑augmented Large Language Models (LLMs) are essential for developing capable, long‑term AI agents. Recently, applying Reinforcement Learning (RL) to optimize memory operations, such as extraction, updating, and retrieval, has emerged as a highly promising research direction. However, existing implementations remain highly fragmented and task‑specific, lacking a unified infrastructure to streamline the integration, training, and evaluation of these complex pipelines. To address this gap, we present MemFactory, the first unified, highly modular training and inference framework specifically designed for memory‑augmented agents. Inspired by the success of unified fine‑tuning frameworks like LLaMA‑Factory, MemFactory abstracts the memory lifecycle into atomic, plug‑and‑play components, enabling researchers to seamlessly construct custom memory agents via a "Lego‑like" architecture. Furthermore, the framework natively integrates Group Relative Policy Optimization (GRPO) to fine‑tune internal memory management policies driven by multi‑dimensional environmental rewards. MemFactory provides out‑of‑the‑box support for recent cutting‑edge paradigms, including Memory‑R1, RMM, and MemAgent. We empirically validate MemFactory on the open‑source MemAgent architecture using its publicly available training and evaluation data. Across the evaluation sets, MemFactory improves performance over the corresponding base models on average, with relative gains of up to 14.8%. By providing a standardized, extensible, and easy‑to‑use infrastructure, MemFactory significantly lowers the barrier to entry, paving the way for future innovations in memory‑driven AI agents.
Authors:Tal Ishon, Yoav Goldberg, Uri Shaham
Abstract:
Topic modeling seeks to uncover latent semantic structure in text, with LDA providing a foundational probabilistic framework. While recent methods often incorporate external knowledge (e.g., pre‑trained embeddings), such reliance limits applicability in emerging or underexplored domains. We introduce PRISM, a corpus‑intrinsic method that derives a Dirichlet parameter from word co‑occurrence statistics to initialize LDA without altering its generative process. Experiments on text and single cell RNA‑seq data show that PRISM improves topic coherence and interpretability, rivaling models that rely on external knowledge. These results underscore the value of corpus‑driven initialization for topic modeling in resource‑constrained settings. Code is available at: https://github.com/shaham‑lab/PRISM.
Authors:Zhuowen Liang, Xiaotian Lin, Zhengxuan Zhang, Yuyu Luo, Haixun Wang, Nan Tang
Abstract:
Large language models (LLMs) are widely applied to data analytics over documents, yet direct reasoning over long, noisy documents remains brittle and error‑prone. Hence, we study document question answering (QA) that consolidates dispersed evidence into a structured output (e.g., a table, graph, or chunks) to support reliable, verifiable QA. We propose a two‑pillar framework, LiteCoST, to achieve both high accuracy and low latency with small language models (SLMs). Pillar 1: Chain‑of‑Structured‑Thought (CoST). We introduce a CoST template, a schema‑aware instruction that guides a strong LLM to produce both a step‑wise CoST trace and the corresponding structured output. The process induces a minimal structure, normalizes entities/units, aligns records, serializes the output, and verifies/refines it, yielding auditable supervision. Pillar 2: SLM fine‑tuning. The compact models are trained on LLM‑generated CoST data in two stages: Supervised Fine‑Tuning for structural alignment, followed by Group Relative Policy Optimization (GRPO) incorporating triple rewards for answer/format quality and process consistency. By distilling structure‑first behavior into SLMs, this approach achieves LLM‑comparable quality on multi‑domain long‑document QA using 3B/7B SLMs, while delivering 2‑4x lower latency than GPT‑4o and DeepSeek‑R1 (671B). The code is available at https://github.com/HKUSTDial/LiteCoST.
Authors:Iordanis Fostiropoulos, Muhammad Rafay Azhar, Abdalaziz Sawwan, Boyu Fang, Yuchen Liu, Jiayi Liu, Hanchao Yu, Qi Guo, Jianyu Wang, Fei Liu, Xiangjun Fan
Abstract:
We introduce GISTBench, a benchmark for evaluating Large Language Models' (LLMs) ability to understand users from their interaction histories in recommendation systems. Unlike traditional RecSys benchmarks that focus on item prediction accuracy, our benchmark evaluates how well LLMs can extract and verify user interests from engagement data. We propose two novel metric families: Interest Groundedness (IG), decomposed into precision and recall components to separately penalize hallucinated interest categories and reward coverage, and Interest Specificity (IS), which assesses the distinctiveness of verified LLM‑predicted user profiles. We release a synthetic dataset constructed on real user interactions on a global short‑form video platform. Our dataset contains both implicit and explicit engagement signals and rich textual descriptions. We validate our dataset fidelity against user surveys, and evaluate eight open‑weight LLMs spanning 7B to 120B parameters. Our findings reveal performance bottlenecks in current LLMs, particularly their limited ability to accurately count and attribute engagement signals across heterogeneous interaction types.
Authors:Caio Vicentino
Abstract:
We present PolarQuant, a post‑training weight quantization method for large language models (LLMs) that exploits the distributional structure of neural network weights to achieve near‑lossless compression. PolarQuant operates in three stages: (1) block‑wise normalization to the unit hypersphere, (2) Walsh‑Hadamard rotation to transform coordinates into approximately Gaussian random variables, and (3) quantization with centroids matched to the Gaussian distribution. Our ablation reveals that Hadamard rotation alone accounts for 98% of the quality improvement, reducing Qwen3.5‑9B perplexity from 6.90 (absmax Q5) to 6.40 (Delta = +0.03 from FP16), making it practically lossless without any calibration data. Furthermore, PolarQuant functions as an effective preprocessing step for downstream INT4 quantizers: PolarQuant Q5 dequantized and re‑quantized by torchao INT4 achieves perplexity 6.56 versus 6.68 for direct absmax INT4, while maintaining 43.1 tok/s throughput at 6.5 GB VRAM. Code and models are publicly available.
Authors:Shikhar Bharadwaj, Chin-Jou Li, Kwanghee Choi, Eunjung Yeo, William Chen, Shinji Watanabe, David R. Mortensen
Abstract:
Phone recognition (PR) is a key enabler of multilingual and low‑resource speech processing tasks, yet robust performance remains elusive. Highly performant English‑focused models do not generalize across languages, while multilingual models underutilize pretrained representations. It also remains unclear how data scale, architecture, and training objective contribute to multilingual PR. We present PhoneticXEUS ‑‑ trained on large‑scale multilingual data and achieving state‑of‑the‑art performance on both multilingual (17.7% PFER) and accented English speech (10.6% PFER). Through controlled ablations with evaluations across 100+ languages under a unified scheme, we empirically establish our training recipe and quantify the impact of SSL representations, data scale, and loss objectives. In addition, we analyze error patterns across language families, accented speech, and articulatory features. All data and code are released openly.
Authors:Andrew Bouras, OMS-II Research Fellow
Abstract:
Scientific hypothesis generation is a critical bottleneck in accelerating research, yet existing datasets for training and evaluating hypothesis‑generating models are limited to single domains and lack explicit reasoning traces connecting prior knowledge to novel contributions. I introduce CrossTrace, a dataset of 1,389 grounded scientific reasoning traces spanning biomedical research (518), AI/ML (605), and cross‑domain work (266). Each trace captures the structured reasoning chain from established knowledge through intermediate logical steps to a novel hypothesis, with every step grounded in source paper text. I define an Input/Trace/Output schema that extends the Bit‑Flip‑Spark framework of HypoGen with step‑level verification, a taxonomy of eight discovery patterns, and multi‑domain coverage. Fine‑tuning Qwen2.5‑7B‑Instruct on CrossTrace via QLoRA yields substantial improvements over the untuned baseline: IAScore rises from 0.828 to 0.968 (GPT‑4o judge) and from 0.716 to 0.888 (Claude Opus 4.5), structural compliance improves from 0% to 100%, and spark cosine similarity increases from 0.221 to 0.620. Balanced cross‑domain training (biomedical + AI/ML + CS) outperforms single‑domain training, providing evidence that scientific reasoning patterns transfer across disciplines. Human validation of 150 stratified records confirms 99.7% step‑level grounding accuracy and a 0.0% fabrication rate. To my knowledge, CrossTrace is the first large‑scale, cross‑domain dataset with step‑level grounded reasoning traces for hypothesis generation, and my results demonstrate that such traces are an effective training signal whose benefits are at least partially domain‑general.
Authors:Subhadip Mitra
Abstract:
Evaluating large language models at scale remains a practical bottleneck for many organizations. While existing evaluation frameworks work well for thousands of examples, they struggle when datasets grow to hundreds of thousands or millions of samples. This scale is common when assessing model behavior across diverse domains or conducting comprehensive regression testing. We present Spark‑LLM‑Eval, a distributed evaluation framework built natively on Apache Spark. The system treats evaluation as a data‑parallel problem, partitioningexamplesacrossexecutorsandaggregatingresultswithproperstatistical accounting. Beyond raw throughput, we emphasize statistical rigor: every reported metric includes bootstrap confidence intervals, and model comparisons come with appropriate significance tests (paired t‑tests, McNemar's test, or Wilcoxon signed‑rank, depending on the metric type). The framework also addresses the cost problem inherent in LLM evaluation through content‑addressable response caching backed by Delta Lake, which allows iterating on metric definitions without re‑running inference. We describe the system architecture, the statistical methodology, and report benchmark results showing linear scaling with cluster size. The framework and all evaluation code are available as open source.
Authors:Philip Schroeder, Thomas Weng, Karl Schmeckpeper, Eric Rosen, Stephen Hart, Ondrej Biza
Abstract:
Vision‑language models (VLMs) have shown impressive capabilities across diverse tasks, motivating efforts to leverage these models to supervise robot learning. However, when used as evaluators in reinforcement learning (RL), today's strongest models often fail under partial observability and distribution shift, enabling policies to exploit perceptual errors rather than solve the task. We introduce SOLE‑R1 (Self‑Observing LEarner), a video‑language reasoning model explicitly designed to serve as the sole reward signal for online RL. Given only raw video observations and a natural‑language goal, SOLE‑R1 performs per‑timestep spatiotemporal chain‑of‑thought (CoT) reasoning and produces dense estimates of task progress that can be used directly as rewards. To train SOLE‑R1, we develop a large‑scale video trajectory and reasoning synthesis pipeline that generates temporally grounded CoT traces aligned with continuous progress supervision. This data is combined with foundational spatial and multi‑frame temporal reasoning, and used to train the model with a hybrid framework that couples supervised fine‑tuning with RL from verifiable rewards. Across four different simulation environments and a real‑robot setting, SOLE‑R1 enables zero‑shot online RL from random initialization: robots learn previously unseen manipulation tasks without ground‑truth rewards, success indicators, demonstrations, or task‑specific tuning. SOLE‑R1 succeeds on 24 unseen tasks and substantially outperforms strong vision‑language rewarders, including Robometer, RoboReward, ReWiND, GPT‑5, and Gemini‑3‑Pro, while exhibiting markedly greater robustness to reward hacking. We release all models, data, code, and demos at the anonymous page: https://philip‑mit.github.io/sole‑r1/
Authors:Huanxuan Liao, Zhongtao Jiang, Yupu Hao, Yuqiao Tan, Shizhu He, Ben Wang, Jun Zhao, Kun Xu, Kang Liu
Abstract:
Multimodal Large Language Models (MLLMs) achieve stronger visual understanding by scaling input fidelity, yet the resulting visual token growth makes jointly sustaining high spatial resolution and long temporal context prohibitive. We argue that the bottleneck lies not in how post‑encoding representations are compressed but in the volume of pixels the encoder receives, and address it with ResAdapt, an Input‑side adaptation framework that learns how much visual budget each frame should receive before encoding. ResAdapt couples a lightweight Allocator with an unchanged MLLM backbone, so the backbone retains its native visual‑token interface while receiving an operator‑transformed input. We formulate allocation as a contextual bandit and train the Allocator with Cost‑Aware Policy Optimization (CAPO), which converts sparse rollout feedback into a stable accuracy‑cost learning signal. Across budget‑controlled video QA, temporal grounding, and image reasoning tasks, ResAdapt improves low‑budget operating points and often lies on or near the efficiency‑accuracy frontier, with the clearest gains on reasoning‑intensive benchmarks under aggressive compression. Notably, ResAdapt supports up to 16x more frames at the same visual budget while delivering over 15% performance gain. Code is available at https://github.com/Xnhyacinth/ResAdapt.
Authors:Shuwen Xu, Yao Xu, Jiaxiang Liu, Chenhao Yuan, Wenshuo Peng, Jun Zhao, Kang Liu
Abstract:
Agentic knowledge graph question answering (KGQA) requires an agent to iteratively interact with knowledge graphs (KGs), posing challenges in both training data scarcity and reasoning generalization. Specifically, existing approaches often restrict agent exploration: prompting‑based methods lack autonomous navigation training, while current training pipelines usually confine reasoning to predefined trajectories. To this end, this paper proposes GraphWalker, a novel agentic KGQA framework that addresses these challenges through Automated Trajectory Synthesis and Stage‑wise Fine‑tuning. GraphWalker adopts a two‑stage SFT training paradigm: First, the agent is trained on structurally diverse trajectories synthesized from constrained random‑walk paths, establishing a broad exploration prior over the KG; Second, the agent is further fine‑tuned on a small set of expert trajectories to develop reflection and error recovery capabilities. Extensive experiments demonstrate that our stage‑wise SFT paradigm unlocks a higher performance ceiling for a lightweight reinforcement learning (RL) stage, enabling GraphWalker to achieve state‑of‑the‑art performance on CWQ and WebQSP. Additional results on GrailQA and our constructed GraphWalkerBench confirm that GraphWalker enhances generalization to out‑of‑distribution reasoning paths. The code is publicly available at https://github.com/XuShuwenn/GraphWalker
Authors:Masnun Nuha Chowdhury, Nusrat Jahan Beg, Umme Hunny Khan, Syed Rifat Raiyan, Md Kamrul Hasan, Hasan Mahmud
Abstract:
Large language models (LLMs) remain unreliable for high‑stakes claim verification due to hallucinations and shallow reasoning. While retrieval‑augmented generation (RAG) and multi‑agent debate (MAD) address this, they are limited by one‑pass retrieval and unstructured debate dynamics. We propose a courtroom‑style multi‑agent framework, PROClaim, that reformulates verification as a structured, adversarial deliberation. Our approach integrates specialized roles (e.g., Plaintiff, Defense, Judge) with Progressive RAG (P‑RAG) to dynamically expand and refine the evidence pool during the debate. Furthermore, we employ evidence negotiation, self‑reflection, and heterogeneous multi‑judge aggregation to enforce calibration, robustness, and diversity. In zero‑shot evaluations on the Check‑COVID benchmark, PROClaim achieves 81.7% accuracy, outperforming standard multi‑agent debate by 10.0 percentage points, with P‑RAG driving the primary performance gains (+7.5 pp). We ultimately demonstrate that structural deliberation and model heterogeneity effectively mitigate systematic biases, providing a robust foundation for reliable claim verification. Our code and data are publicly available at https://github.com/mnc13/PROClaim.
Authors:Zhuoshang Wang, Yubing Ren, Guoyu Zhao, Xiaowei Zhu, Hao Li, Yanan Cao
Abstract:
Large Language Models (LLMs) are widely applied across various domains due to their powerful text generation capabilities. While LLM‑generated texts often resemble human‑written ones, their misuse can lead to significant societal risks. Detecting such texts is an essential technique for mitigating LLM misuse, and many detection methods have shown promising results across different datasets. However, real‑world scenarios often involve out‑of‑domain inputs or adversarial samples, which can affect the performance of detection methods to varying degrees. Furthermore, most existing research has focused on English texts, with limited work addressing Chinese text detection. In this study, we propose EnsemJudge, a robust framework for detecting Chinese LLM‑generated text by incorporating tailored strategies and ensemble voting mechanisms. We trained and evaluated our system on a carefully constructed Chinese dataset provided by NLPCC2025 Shared Task 1. Our approach outperformed all baseline methods and achieved first place in the task, demonstrating its effectiveness and reliability in Chinese LLM‑generated text detection. Our code is available at https://github.com/johnsonwangzs/MGT‑Mini.
Authors:Meituan LongCat Team, Bin Xiao, Chao Wang, Chengjiang Li, Chi Zhang, Chong Peng, Hang Yu, Hao Yang, Haonan Yan, Haoze Sun, Haozhe Zhao, Hong Liu, Hui Su, Jiaqi Zhang, Jiawei Wang, Jing Li, Kefeng Zhang, Manyuan Zhang, Minhao Jing, Peng Pei, Quan Chen, Taofeng Xue, Tongxin Pan, Xiaotong Li, Xiaoyang Li, Xiaoyu Zhao, Xing Hu, Xinyang Lin, Xunliang Cai, Yan Bai, Yan Feng, Yanjie Li, Yao Qiu, Yerui Sun, Yifan Lu, Ying Luo, Yipeng Mei, Yitian Chen, Yuchen Xie, Yufang Liu, Yufei Chen, Yulei Qian, Yuqi Peng, Zhihang Yu, Zhixiong Han, Changran Wang, Chen Chen, Dian Zheng, Fengjiao Chen, Ge Yang, Haowei Guo, Haozhe Wang, Hongyu Li, Huicheng Jiang, Jiale Hong, Jialv Zou, Jiamu Li, Jianping Lin, Jiaxing Liu, Jie Yang, Jing Jin, Jun Kuang, Juncheng She, Kunming Luo, Kuofeng Gao, Lin Qiu, Linsen Guo, Mianqiu Huang, Qi Li, Qian Wang, Rumei Li, Siyu Ren, Wei Wang, Wenlong He, Xi Chen, Xiao Liu, Xiaoyu Li, Xu Huang, Xuanyu Zhu, Xuezhi Cao, Yaoming Zhu, Yifei Cao, Yimeng Jia, Yizhen Jiang, Yufei Gao, Zeyang Hu, Zhenlong Yuan, Zijian Zhang, Ziwen Wang
Abstract:
The prevailing Next‑Token Prediction (NTP) paradigm has driven the success of large language models through discrete autoregressive modeling. However, contemporary multimodal systems remain language‑centric, often treating non‑linguistic modalities as external attachments, leading to fragmented architectures and suboptimal integration. To transcend this limitation, we introduce Discrete Native Autoregressive (DiNA), a unified framework that represents multimodal information within a shared discrete space, enabling a consistent and principled autoregressive modeling across modalities. A key innovation is the Discrete Native Any‑resolution Visual Transformer (dNaViT), which performs tokenization and de‑tokenization at arbitrary resolutions, transforming continuous visual signals into hierarchical discrete tokens. Building on this foundation, we develop LongCat‑Next, a native multimodal model that processes text, vision, and audio under a single autoregressive objective with minimal modality‑specific design. As an industrial‑strength foundation model, it excels at seeing, painting, and talking within a single framework, achieving strong performance across a wide range of multimodal benchmarks. In particular, LongCat‑Next addresses the long‑standing performance ceiling of discrete vision modeling on understanding tasks and provides a unified approach to effectively reconcile the conflict between understanding and generation. As an attempt toward native multimodality, we open‑source the LongCat‑Next and its tokenizers, hoping to foster further research and development in the community. GitHub: https://github.com/meituan‑longcat/LongCat‑Next
Authors:Amartya Bhattacharya
Abstract:
Vision‑language models (VLMs) excel at image‑text retrieval yet persistently fail at compositional reasoning, distinguishing captions that share the same words but differ in relational structure. We present, a unified evaluation and augmentation framework benchmarking four architecturally diverse VLMs,CLIP, BLIP, LLaVA, and Qwen3‑VL‑8B‑Thinking,on the Winoground benchmark under plain and scene‑graph‑augmented regimes. We introduce a dependency‑based TextSceneGraphParser (spaCy) extracting subject‑relation‑object triples, and a Graph Asymmetry Scorer using optimal bipartite matching to inject structural relational priors. Caption ablation experiments (subject‑object masking and swapping) reveal that Qwen3‑VL‑8B‑Thinking achieves a group score of 62.75, far above all encoder‑based models, while a proposed multi‑turn SG filtering strategy further lifts it to 66.0, surpassing prior open‑source state‑of‑the‑art. We analyze the capability augmentation tradeoff and find that SG augmentation benefits already capable models while providing negligible or negative gains for weaker baselines. Code: https://github.com/amartyacodes/Inference‑Time‑Structural‑Reasoning‑for‑Compositional‑Vision‑Language‑Understanding
Authors:Mohamad Zbib, Mohamad Bazzi, Ammar Mohanna, Hasan Abed Al Kader Hammoud, Bernard Ghanem
Abstract:
Speculative decoding accelerates autoregressive generation by letting a lightweight draft model propose future tokens that a larger target model then verifies in parallel. In practice, however, draft models are usually trained on broad generic corpora, which leaves it unclear how much speculative decoding quality depends on the draft training distribution. We study this question with lightweight HASS and EAGLE‑2 drafters trained on MathInstruct, ShareGPT, and mixed‑data variants, evaluated on MT‑Bench, GSM8K, MATH‑500, and SVAMP. Measured by acceptance length, task‑specific training yields clear specialization: MathInstruct‑trained drafts are strongest on reasoning benchmarks, while ShareGPT‑trained drafts are strongest on MT‑Bench. Mixed‑data training improves robustness, but larger mixtures do not dominate across decoding temperatures. We also study how to combine specialized drafters at inference time. Naive checkpoint averaging performs poorly, whereas confidence‑based routing improves over single‑domain drafts and merged‑tree verification yields the highest acceptance length overall for both backbones. Finally, confidence is a more useful routing signal than entropy: rejected tokens tend to have higher entropy, but confidence produces much clearer benchmark‑level routing decisions. These results show that speculative decoding quality depends not only on draft architecture, but also on the match between draft training data and downstream workload, and that specialized drafters are better combined at inference time than in weight space.
Authors:E. M. Freeburg
Abstract:
Large language models produce em dashes at varying rates, and the observation that some models "overuse" them has become one of the most widely discussed markers of AI‑generated text. Yet no mechanistic account of this pattern exists, and the parallel observation that LLMs default to markdown‑formatted output has never been connected to it. We propose that the em dash is markdown leaking into prose ‑‑ the smallest surviving unit of the structural orientation that LLMs acquire from markdown‑saturated training corpora. We present a five‑step genealogy connecting training data composition, structural internalization, the dual‑register status of the em dash, and post‑training amplification. We test this with a two‑condition suppression experiment across twelve models from five providers (Anthropic, OpenAI, Meta, Google, DeepSeek): when models are instructed to avoid markdown formatting, overt features (headers, bullets, bold) are eliminated or nearly eliminated, but em dashes persist ‑‑ except in Meta's Llama models, which produce none at all. Em dash frequency and suppression resistance vary from 0.0 per 1,000 words (Llama) to 9.1 (GPT‑4.1 under suppression), functioning as a signature of the specific fine‑tuning procedure applied. A three‑condition suppression gradient shows that even explicit em dash prohibition fails to eliminate the artifact in some models, and a base‑vs‑instruct comparison confirms that the latent tendency exists pre‑RLHF. These findings connect two previously isolated online discourses and reframe em dash frequency as a diagnostic of fine‑tuning methodology rather than a stylistic defect.
Authors:Swastik R
Abstract:
Vision‑language models score well on mathematical, scientific, and spatial reasoning benchmarks, yet these evaluations are overwhelmingly English. I present the first cross‑lingual visual reasoning audit for Indian languages. 980 questions from MathVista, ScienceQA, and MMMU are translated into Hindi, Tamil, Telugu, Bengali, Kannada, and Marathi using IndicTrans2, with Gemini 2.0 Flash cross‑verification on 50 samples per language (inter‑translator agreement 0.79‑0.84). Eight VLMs, from 7B open‑source models to GPT‑4o, are evaluated across all seven languages, yielding 68,600 inference records that include text‑only and chain‑of‑thought ablations. I find accuracy drops of 9.8‑25 percentage points when switching from English to an Indian language, with Dravidian languages suffering up to 13.2 pp more than Indo‑Aryan. Chain‑of‑thought prompting degrades Bengali (‑14.4 pp) and Kannada (‑11.4 pp) rather than helping, exposing English‑centric reasoning chains. Aya‑Vision‑8B, built for 23 languages, still drops 28.5 pp on Dravidian scripts; multilingual pretraining alone does not transfer visual reasoning. I release the translated benchmark and all model outputs.
Authors:Vinicius Anjos de Almeida, Sandro Saorin da Silva, Josimar Chire, Leonardo Vicenzi, Nícolas Henrique Borges, Helena Kociolek, Sarah Miriã de Castro Rocha, Frederico Nassif Gomes, Júlia Cristina Ferreira, Oge Marques, Lucas Emanuel Silva e Oliveira
Abstract:
Clinical notes contain valuable unstructured information. Named entity recognition (NER) enables the automatic extraction of medical concepts; however, benchmarks for Portuguese remain scarce. In this study, we aimed to evaluate BERT‑based models and large language models (LLMs) for clinical NER in Portuguese and to test strategies for addressing multilabel imbalance. We compared BioBERTpt, BERTimbau, ModernBERT, and mmBERT with LLMs such as GPT‑5 and Gemini‑2.5, using the public SemClinBr corpus and a private breast cancer dataset. Models were trained under identical conditions and evaluated using precision, recall, and F1‑score. Iterative stratification, weighted loss, and oversampling were explored to mitigate class imbalance. The mmBERT‑base model achieved the best performance (micro F1 = 0.76), outperforming all other models. Iterative stratification improved class balance and overall performance. Multilingual BERT models, particularly mmBERT, perform strongly for Portuguese clinical NER and can run locally with limited computational resources. Balanced data‑splitting strategies further enhance performance.
Authors:Nina Smirnova, Daniel Dan, Philipp Mayr
Abstract:
Parliamentary debate constitutes a central arena of political power, shaping legislative outcomes and public discourse. Incivility within this arena signals political polarization and institutional conflict. This study presents a systematic investigation of incivility in the German Bundestag by examining calls to order (CtO; plural: CtOs) as formal indicators of norm violations. Despite their relevance, CtOs have received little systematic attention in parliamentary research. We introduce a rule‑based method for detecting and annotating CtOs in parliamentary speeches and present a novel dataset of German parliamentary debates spanning 72 years that includes annotated CtO instances. Additionally, we develop the first classification system for CtO triggers and analyze the factors associated with their occurrence. Our findings show that, despite formal regulations, the issuance of CtOs is partly subjective and influenced by session presidents and parliamentary dynamics, with certain individuals disproportionately affected. An insult towards individuals is the most frequent cause of CtO. In general, male members and those belonging to opposition parties receive more calls to order than their female and coalition‑party counterparts. Most CtO triggers were detected in speeches dedicated to governmental affairs and actions of the presidency. The CtO triggers dataset is available at: https://github.com/kalawinka/cto_analysis.
Authors:JiHyeok Jung, TaeYoung Yoon, HyunSouk Cho
Abstract:
Legal reasoning requires not only the application of legal rules but also an understanding of the context in which those rules operate. However, existing legal benchmarks primarily evaluate rule application under the assumption of fixed norms, and thus fail to capture situations where legal judgments shift or where multiple norms interact. In this work, we propose CALRK‑Bench, a context‑aware legal reasoning benchmark based on the legal system in Korean. CALRK‑Bench evaluates whether models can identify the temporal validity of legal norms, determine whether sufficient legal information is available for a given case, and understand the reasons behind shifts in legal judgments. The dataset is constructed from legal precedents and legal consultation records, and is validated by legal experts. Experimental results show that even recent large language models consistently exhibit low performance on these three tasks. CALRK‑Bench provides a new stress test for evaluating context‑aware legal reasoning rather than simple memorization of legal knowledge. Our code is available at https://github.com/jhCOR/CALRKBench.
Authors:Oucheng Liu, Lexing Xie, Jing Jiang
Abstract:
Climate change is a major socio‑scientific issue shapes public decision‑making and policy discussions. As large language models (LLMs) increasingly serve as an interface for accessing climate knowledge, whether existing benchmarks reflect user needs is critical for evaluating LLM in real‑world settings. We propose a Proactive Knowledge Behaviors Framework that captures the different human‑human and human‑AI knowledge seeking and provision behaviors. We further develop a Topic‑Intent‑Form taxonomy and apply it to analyze climate‑related data representing different knowledge behaviors. Our results reveal a substantial mismatch between current benchmarks and real‑world user needs, while knowledge interaction patterns between humans and LLMs closely resemble those in human‑human interactions. These findings provide actionable guidance for benchmark design, RAG system development, and LLM training. Code is available at https://github.com/OuchengLiu/LLM‑Misalign‑Climate‑Change.
Authors:Yijiong Yu, Shuai Yuan, Jie Zheng, Huazheng Wang, Ji Pei
Abstract:
Soft context compression reduces the computational workload of processing long contexts in LLMs by encoding long context into a smaller number of latent tokens. However, existing frameworks apply uniform compression ratios, failing to account for the extreme variance in natural language information density. While adopting a density‑aware dynamic compression ratio seems intuitive, empirical investigations reveal that models struggle intrinsically with operations parameterized by input dependent, continuous structural hyperparameters. To resolve this pitfall, we introduce Semi‑Dynamic Context Compression framework. Our approach features a Discrete Ratio Selector, which predicts a compression target based on intrinsic information density and quantizes it to a predefined set of discrete compression ratios. It is efficiently jointly trained with the compressor on synthetic data, with the summary lengths as a proxy to create labels for compression ratio prediction. Extensive evaluations confirm that our density‑aware framework, utilizing mean pooling as the backbone, consistently outperforms static baselines, establishing a robust Pareto frontier for context compression techniques. Our code, data and model weights are available at https://github.com/yuyijiong/semi‑dynamic‑context‑compress
Authors:Jiajun Zhang, Yuying Li, Zhixun Li, Xingyu Guo, Jingzhuo Wu, Leqi Zheng, Yiran Yang, Jianke Zhang, Qingbin Li, Shannan Yan, Zhetong Li, Changguo Jia, Junfei Wu, Zilei Wang, Qiang Liu, Liang Wang
Abstract:
Vision‑Language Models (VLMs) have demonstrated impressive capabilities in code generation across various domains. However, their ability to replicate complex, multi‑panel visualizations from real‑world data remains largely unassessed. To address this gap, we introduce \textttRealChart2Code, a new large‑scale benchmark with over 2,800 instances grounded in authentic datasets and featuring tasks with clear analytical intent. Crucially, it is the first benchmark to systematically evaluate chart generation from large‑scale raw data and assess iterative code refinement in a multi‑turn conversational setting. Our comprehensive evaluation of 14 leading VLMs on \textttRealChart2Code reveals significant performance degradation compared to simpler benchmarks, highlighting their struggles with complex plot structures and authentic data. Our analysis uncovers a substantial performance gap between proprietary and open‑weight models and confirms that even state‑of‑the‑art VLMs often fail to accurately replicate intricate, multi‑panel charts. These findings provide valuable insights into the current limitations of VLMs and guide future research directions. We release the benchmark and code at \urlhttps://github.com/Speakn0w/RealChart2Code.
Authors:Ligong Han, Hao Wang, Han Gao, Kai Xu, Akash Srivastava
Abstract:
Block‑diffusion language models offer a promising path toward faster‑than‑autoregressive generation by combining block‑wise autoregressive decoding with within‑block parallel denoising. However, in the few‑step regime needed for practical acceleration, standard confidence‑thresholded decoding is often brittle: aggressive thresholds hurt quality, while conservative thresholds require unnecessary denoising steps. Existing approaches that address this issue either require additional training or incur extra test‑time compute. We present S2D2, a training‑free self‑speculative decoding framework for block‑diffusion language models. Our key observation is that a block‑diffusion model becomes autoregressive when the block size is reduced to one, allowing the same pretrained model to act as both drafter and verifier. S2D2 inserts a speculative verification step into standard block‑diffusion decoding and uses lightweight routing policies to decide when verification is worth its cost. This yields a hybrid decoding trajectory in which diffusion proposes tokens in parallel, while the autoregressive mode acts as a local sequence‑level critic. Across three mainstream block‑diffusion families, S2D2 consistently improves the accuracy‑speed tradeoff over strong confidence‑thresholding baselines. On SDAR, we observe up to 4.7× speedup over autoregressive decoding, and up to 1.57× over a tuned dynamic decoding baseline while improving accuracy by up to 4.5 points. On LLaDA2.1‑Mini, S2D2 remains complementary to built‑in self‑correction, including a conservative setting where it is 4.4× faster than the static baseline with slightly higher accuracy.
Authors:Minseo Kim, Sujeong Im, Junseong Choi, Junhee Lee, Chaeeun Shim, Edward Choi
Abstract:
Large language model (LLM)‑based persona agents are rapidly being adopted as scalable proxies for human participants across diverse domains. Yet there is no systematic method for verifying whether a persona agent's responses remain free of contradictions and factual inaccuracies throughout an interaction. A principle from interrogation methodology offers a lens: no matter how elaborate a fabricated identity, systematic interrogation will expose its contradictions. We apply this principle to propose PICon, an evaluation framework that probes persona agents through logically chained multi‑turn questioning. PICon evaluates consistency along three core dimensions: internal consistency (freedom from self‑contradiction), external consistency (alignment with real‑world facts), and retest consistency (stability under repetition). Evaluating seven groups of persona agents alongside 63 real human participants, we find that even systems previously reported as highly consistent fail to meet the human baseline across all three dimensions, revealing contradictions and evasive responses under chained questioning. This work provides both a conceptual foundation and a practical methodology for evaluating persona agents before trusting them as substitutes for human participants. We provide the source code and an interactive demo at: https://kaist‑edlab.github.io/picon/
Authors:Nikolai Ilinykh, Hyewon Jang, Shalom Lappin, Asad Sayeed, Sharid Loáiciga
Abstract:
We study narrative coherence in visually grounded stories by comparing human‑written narratives with those generated by vision‑language models (VLMs) on the Visual Writing Prompts corpus. Using a set of metrics that capture different aspects of narrative coherence, including coreference, discourse relation types, topic continuity, character persistence, and multimodal character grounding, we compute a narrative coherence score. We find that VLMs show broadly similar coherence profiles that differ systematically from those of humans. In addition, differences for individual measures are often subtle, but they become clearer when considered jointly. Overall, our results indicate that, despite human‑like surface fluency, model narratives exhibit systematic differences from those of humans in how they organise discourse across a visually grounded story. Our code is available at https://github.com/GU‑CLASP/coherence‑driven‑humans.
Authors:Paulo Roberto de Moura Júnior, Jean Lelong, Annabelle Blangero
Abstract:
The effectiveness of Retrieval‑Augmented Generation (RAG) is highly dependent on how documents are chunked, that is, segmented into smaller units for indexing and retrieval. Yet, commonly used "one‑size‑fits‑all" approaches often fail to capture the nuanced structure and semantics of diverse texts. Despite its central role, chunking lacks a dedicated evaluation framework, making it difficult to assess and compare strategies independently of downstream performance. We challenge this paradigm by introducing Adaptive Chunking, a framework that selects the most suitable chunking strategy for each document based on a set of five novel intrinsic, document‑based metrics: References Completeness (RC), Intrachunk Cohesion (ICC), Document Contextual Coherence (DCC), Block Integrity (BI), and Size Compliance (SC), which directly assess chunking quality across key dimensions. To support this framework, we also introduce two new chunkers, an LLM‑regex splitter and a split‑then‑merge recursive splitter, alongside targeted post‑processing techniques. On a diverse corpus spanning legal, technical, and social science domains, our metric‑guided adaptive method significantly improves downstream RAG performance. Without changing models or prompts, our framework increases RAG outcomes, raising answers correctness to 72% (from 62‑64%) and increasing the number of successfully answered questions by over 30% (65 vs. 49). These results demonstrate that adaptive, document‑aware chunking, guided by a complementary suite of intrinsic metrics, offers a practical and effective path to more robust RAG systems. Code available at https://github.com/ekimetrics/adaptive‑chunking.
Authors:Kusal Darshana
Abstract:
Current Large Language Models (LLMs) mostly use BPE (Byte Pair Encoding) based tokenizers, which are very effective for simple structured Latin scripts such as English. However, standard BPE tokenizers struggle to process complex Abugida scripts due to their structural complexity. The problem is that these tokenizers break complex conjuncts, which are multi‑codepoint grapheme clusters, into meaningless sub‑character units. This degrades the LLM's reasoning efficiency by forcing it to learn basic orthographic structures at inference time and raises inference costs, resulting in a significant "Token Tax" for the Global South. We propose a new three‑layer architecture, the WWHO (Where‑What‑How Often), and an algorithm named SGPE (Syllable‑aware Grapheme Pair Encoding) that separates the linguistic rules of the script from the statistical compression process while enabling seamless multilingual tokenization. Using Sinhala and Devanagari (Hindi/Sanskrit) as highly complex Abugida scripts, we trained WWHO on a cleaned 30‑million‑sentence dataset and evaluated on a 1,499,950‑sentence test set. For Sinhala, SGPE achieves a Token to Word Ratio (TWR) of 1.274 with 4.83 characters per token, representing a 61.7 percent reduction in tokens compared to OpenAI's o200k base. For Hindi, it achieves a TWR of 1.181 (27.0 percent reduction vs o200k). On the mixed‑script (Sinhala, Devanagari, and English) dataset, SGPE achieves an overall TWR of 1.240, representing token reductions of 36.7 percent, 39.6 percent, and 60.2 percent relative to o200k base, Llama 4 Scout, and DeepSeek V3, respectively. This effectively extends the usable context window by up to 4.38 times for these Abugida languages while ensuring a Linguistic Zero‑Breakage Guarantee, which ensures that no valid syllable is ever split across multiple tokens.
Authors:Abhijnan Nath, Hannah VanderHoeven, Nikhil Krishnaswamy
Abstract:
We introduce CRAFT, a multi‑agent benchmark for evaluating pragmatic communication in large language models under strict partial information. In this setting, multiple agents with complementary but incomplete views must coordinate through natural language to construct a shared 3D structure that no single agent can fully observe. We formalize this problem as a multi‑sender pragmatic reasoning task and provide a diagnostic framework that decomposes failures into spatial grounding, belief modeling and pragmatic communication errors, including a taxonomy of behavioral failure profiles in both frontier and open‑weight models. Across a diverse set of models, including 8 open‑weight and 7 frontier including reasoning models, we find that stronger reasoning ability does not reliably translate to better coordination: smaller open‑weight models often match or outperform frontier systems, and improved individual communication does not guarantee successful collaboration. These results suggest that multi‑agent coordination remains a fundamentally unsolved challenge for current language models. Our code can be found at https://github.com/csu‑signal/CRAFT
Authors:Fanheng Kong, Jingyuan Zhang, Yang Yue, Chenxi Sun, Yang Tian, Shi Feng, Xiaocui Yang, Daling Wang, Yu Tian, Jun Du, Wenchong Zeng, Han Li, Kun Gai
Abstract:
The emergence of Large Language Models (LLMs) has catalyzed a paradigm shift in programming, giving rise to "vibe coding", where users can build complete projects and even control computers using natural language instructions. This paradigm has driven automated webpage development, but it introduces a new requirement about how to automatically verify whether the web functionalities are reliably implemented. Existing works struggle to adapt, relying on static visual similarity or predefined checklists that constrain their utility in open‑ended environments. Furthermore, they overlook a vital aspect of software quality, namely latent logical constraints. To address these gaps, we introduce WebTestBench, a benchmark for evaluating end‑to‑end automated web testing. WebTestBench encompasses comprehensive dimensions across diverse web application categories. We decompose the testing process into two cascaded sub‑tasks, checklist generation and defect detection, and propose WebTester, a baseline framework for this task. Evaluating popular LLMs with WebTester reveals severe challenges, including insufficient test completeness, detection bottlenecks, and long‑horizon interaction unreliability. These findings expose a substantial gap between current computer‑use agent capabilities and industrial‑grade deployment demands. We hope that WebTestBench provides valuable insights and guidance for advancing end‑to‑end automated web testing. Our dataset and code are available at https://github.com/friedrichor/WebTestBench.
Authors:Sagnik Basu, Subhrajit Mitra, Aman Juneja, Somnath Banerjee, Rima Hazra, Animesh Mukherjee
Abstract:
Recent research points toward LLMs being manipulated through adversarial and seemingly benign inputs, resulting in harmful, biased, or policy‑violating outputs. In this paper, we study an underexplored issue concerning harmful and toxic mathematical word problems. We show that math questions, particularly those framed as natural language narratives, can serve as a subtle medium for propagating biased, unethical, or psychologically harmful content, with heightened risks in educational settings involving children. To support a systematic study of this phenomenon, we introduce ToxicGSM, a dataset of 1.9k arithmetic problems in which harmful or sensitive context is embedded while preserving mathematically well‑defined reasoning tasks. Using this dataset, we audit the behaviour of existing LLMs and analyse the trade‑offs between safety enforcement and mathematical correctness. We further propose SafeMath ‑‑ a safety alignment technique that reduces harmful outputs while maintaining, and in some cases improving, mathematical reasoning performance. Our results highlight the importance of disentangling linguistic harm from math reasoning and demonstrate that effective safety alignment need not come at the cost of accuracy. We release the source code and dataset at https://github.com/Swagnick99/SafeMath/tree/main.
Authors:Wanjiang Weng, Xiaofeng Tan, Xiangbo Shu, Guo-Sen Xie, Pan Zhou, Hongsong Wang
Abstract:
Text‑to‑motion generation holds significant potential for cross‑linguistic applications, yet it is hindered by the lack of bilingual datasets and the poor cross‑lingual semantic understanding of existing language models. To address these gaps, we introduce BiHumanML3D, the first bilingual text‑to‑motion benchmark, constructed via LLM‑assisted annotation and rigorous manual correction. Furthermore, we propose a simple yet effective baseline, Bilingual Motion Diffusion (BiMD), featuring Cross‑Lingual Alignment (CLA). CLA explicitly aligns semantic representations across languages, creating a robust conditional space that enables high‑quality motion generation from bilingual inputs, including zero‑shot code‑switching scenarios. Extensive experiments demonstrate that BiMD with CLA achieves an FID of 0.045 vs. 0.169 and R@3 of 82.8% vs. 80.8%, significantly outperforms monolingual diffusion models and translation baselines on BiHumanML3D, underscoring the critical necessity and reliability of our dataset and the effectiveness of our alignment strategy for cross‑lingual motion synthesis. The dataset and code are released at \hrefhttps://wengwanjiang.github.io/BilingualT2M‑pagehttps://wengwanjiang.github.io/BilingualT2M‑page
Authors:Luyu Yang, Yutong Dai, An Yan, Viraj Prabhu, Ran Xu, Zeyuan Chen
Abstract:
The physical world is not merely visual; it is governed by rigorous structural and procedural constraints. Yet, the evaluation of vision‑language models (VLMs) remains heavily skewed toward perceptual realism, prioritizing the generation of visually plausible 3D layouts, shapes, and appearances. Current benchmarks rarely test whether models grasp the step‑by‑step processes and physical dependencies required to actually build these artifacts, a capability essential for automating design‑to‑construction pipelines. To address this, we introduce DreamHouse, a novel benchmark for physical generative reasoning: the capacity to synthesize artifacts that concurrently satisfy geometric, structural, constructability, and code‑compliance constraints. We ground this benchmark in residential timber‑frame construction, a domain with fully codified engineering standards and objectively verifiable correctness. We curate over 26,000 structures spanning 13 architectural styles, ach verified to construction‑document standards (LOD 350) and develop a deterministic 10‑test structural validation framework. Unlike static benchmarks that assess only final outputs, DreamHouse supports iterative agentic interaction. Models observe intermediate build states, generate construction actions, and receive structured environmental feedback, enabling a fine‑grained evaluation of planning, structural reasoning, and self‑correction. Extensive experiments with state‑of‑the‑art VLMs reveal substantial capability gaps that are largely invisible on existing leaderboards. These findings establish physical validity as a critical evaluation axis orthogonal to visual realism, highlighting physical generative reasoning as a distinct and underdeveloped frontier in multimodal intelligence. Available at https://luluyuyuyang.github.io/dreamhouse
Authors:Isha Puri, Mehul Damani, Idan Shenfeld, Marzyeh Ghassemi, Jacob Andreas, Yoon Kim
Abstract:
Given a question, a language model (LM) implicitly encodes a distribution over possible answers. In practice, post‑training procedures for LMs often collapse this distribution onto a single dominant mode. While this is generally not a problem for benchmark‑style evaluations that assume one correct answer, many real‑world tasks inherently involve multiple valid answers or irreducible uncertainty. Examples include medical diagnosis, ambiguous question answering, and settings with incomplete information. In these cases, we would like LMs to generate multiple plausible hypotheses, ideally with confidence estimates for each one, and without computationally intensive repeated sampling to generate non‑modal answers. This paper describes a multi‑answer reinforcement learning approach for training LMs to perform distributional reasoning over multiple answers during inference. We modify the RL objective to enable models to explicitly generate multiple candidate answers in a single forward pass, internalizing aspects of inference‑time search into the model's generative process. Across question‑answering, medical diagnostic, and coding benchmarks, we observe improved diversity, coverage, and set‑level calibration scores compared to single answer trained baselines. Models trained with our approach require fewer tokens to generate multiple answers than competing approaches. On coding tasks, they are also substantially more accurate. These results position multi‑answer RL as a principled and compute‑efficient alternative to inference‑time scaling procedures such as best‑of‑k. Code and more information can be found at https://multi‑answer‑rl.github.io/.
Authors:Haobo Xu, Sirui Chen, Ruizhong Qiu, Yuchen Yan, Chen Luo, Monica Cheng, Jingrui He, Hanghang Tong
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, methods such as GRPO and DAPO suffer from substantial computational cost, since they rely on sampling many rollouts for each prompt. Moreover, in RLVR the relative advantage is often sparse: many samples become nearly all‑correct or all‑incorrect, yielding low within‑group reward variance and thus weak learning signals. In this paper, we introduce arrol (Accelerating RLVR via online Rollout Pruning), an online rollout pruning method that prunes rollouts during generation while explicitly steering the surviving ones more correctness‑balanced to enhance learning signals. Specifically, arrol trains a lightweight quality head on‑the‑fly to predict the success probability of partial rollouts and uses it to make early pruning decisions. The learned quality head can further weigh candidates to improve inference accuracy during test‑time scaling. To improve efficiency, we present a system design that prunes rollouts inside the inference engine and re‑batches the remaining ones for log‑probability computation and policy updates. Across GRPO and DAPO on Qwen‑3 and LLaMA‑3.2 models (1B‑8B), arrol improves average accuracy by +2.30 to +2.99 while achieving up to 1.7x training speedup, and yielding up to +8.33 additional gains in average accuracy in test‑time scaling. The code is available at https://github.com/Hsu1023/ARRoL.
Authors:Shwai He, Guoheng Sun, Haichao Zhang, Yun Fu, Ang Li
Abstract:
Network pruning, which removes less important parameters or architectures, is often expected to improve efficiency while preserving performance. However, this expectation does not consistently hold across language tasks: pruned models can perform well on non‑generative tasks but frequently fail in generative settings. To understand this discrepancy, we analyze network pruning from a representation‑hierarchy perspective, decomposing the internal computation of language models into three sequential spaces: embedding (hidden representations), logit (pre‑softmax outputs), and probability (post‑softmax distributions). We find that representations in the embedding and logit spaces are largely robust to pruning‑induced perturbations. However, the nonlinear transformation from logits to probabilities amplifies these deviations, which accumulate across time steps and lead to substantial degradation during generation. In contrast, the stability of the categorical‑token probability subspace, together with the robustness of the embedding space, supports the effectiveness of pruning for non‑generative tasks such as retrieval and multiple‑choice selection. Our analysis disentangles the effects of pruning across tasks and provides practical guidance for its application. Code is available at https://github.com/CASE‑Lab‑UMD/Pruning‑on‑Representations
Authors:Zhuo Li, Yupeng Zhang, Pengyu Cheng, Jiajun Song, Mengyu Zhou, Hao Li, Shujie Hu, Yu Qin, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang
Abstract:
Hallucination remains a critical bottleneck for large language models (LLMs), undermining their reliability in real‑world applications, especially in Retrieval‑Augmented Generation (RAG) systems. While existing hallucination detection methods employ LLM‑as‑a‑judge to verify LLM outputs against retrieved evidence, they suffer from inherent confirmation bias, where the verifier inadvertently reproduces the errors of the original generation. To address this, we introduce Multi‑Agent Reinforced Self‑Check for Hallucination (MARCH), a framework that enforces rigorous factual alignment by leveraging deliberate information asymmetry. MARCH orchestrates a collaborative pipeline of three specialized agents: a Solver, a Proposer, and a Checker. The Solver generates an initial RAG response, which the Proposer decomposes into claim‑level verifiable atomic propositions. Crucially, the Checker validates these propositions against retrieved evidence in isolation, deprived of the Solver's original output. This well‑crafted information asymmetry scheme breaks the cycle of self‑confirmation bias. By training this pipeline with multi‑agent reinforcement learning (MARL), we enable the agents to co‑evolve and optimize factual adherence. Extensive experiments across hallucination benchmarks demonstrate that MARCH substantially reduces hallucination rates. Notably, an 8B‑parameter LLM equipped with MARCH achieves performance competitive with powerful closed‑source models. MARCH paves a scalable path for factual self‑improvement of LLMs through co‑evolution. The code is at https://github.com/Qwen‑Applications/MARCH.
Authors:Ben Chen, Siyuan Wang, Yufei Ma, Zihan Liang, Xuxin Zhang, Yue Lv, Ying Yang, Huangyu Dai, Lingtao Mao, Tong Zhao, Zhipeng Qian, Xinyu Sun, Zhixin Zhai, Yang Zhao, Bochao Liu, Jingshan Lv, Xiao Liang, Hui Kong, Jing Chen, Han Li, Chenyi Lei, Wenwu Ou, Kun Gai
Abstract:
Generative Retrieval (GR) has emerged as a promising paradigm for modern search systems. Compared to multi‑stage cascaded architecture, it offers advantages such as end‑to‑end joint optimization and high computational efficiency. OneSearch, as a representative industrial‑scale deployed generative search framework, has brought significant commercial and operational benefits. However, its inadequate understanding of complex queries, inefficient exploitation of latent user intents, and overfitting to narrow historical preferences have limited its further performance improvement. To address these challenges, we propose OneSearch‑V2, a latent reasoning enhanced self‑distillation generative search framework. It contains three key innovations: (1) a thought‑augmented complex query understanding module, which enables deep query understanding and overcomes the shallow semantic matching limitations of direct inference; (2) a reasoning‑internalized self‑distillation training pipeline, which uncovers users' potential yet precise e‑commerce intentions beyond log‑fitting through implicit in‑context learning; (3) a behavior preference alignment optimization system, which mitigates reward hacking arising from the single conversion metric, and addresses personal preference via direct user feedback. Extensive offline evaluations demonstrate OneSearch‑V2's strong query recognition and user profiling capabilities. Online A/B tests further validate its business effectiveness, yielding +3.98% item CTR, +3.05% buyer conversion rate, and +2.11% order volume. Manual evaluation further confirms gains in search experience quality, with +1.65% in page good rate and +1.37% in query‑item relevance. More importantly, OneSearch‑V2 effectively mitigates common search system issues such as information bubbles and long‑tail sparsity, without incurring additional inference costs or serving latency.
Authors:Xingming Li, Runke Huang, Yanan Bao, Yuye Jin, Yuru Jiao, Qingyong Hu
Abstract:
High‑quality teacher‑child interaction (TCI) is fundamental to early childhood development, yet traditional expert‑based assessment faces a critical scalability challenge. In large systems like China's‑serving 36 million children across 250,000+ kindergartens‑the cost and time requirements of manual observation make continuous quality monitoring infeasible, relegating assessment to infrequent episodic audits that limit timely intervention and improvement tracking. In this paper, we investigate whether AI can serve as a scalable assessment teammate by extracting structured quality indicators and validating their alignment with human expert judgments. Our contributions include: (1) TEPE‑TCI‑370h (Tracing Effective Preschool Education), the first large‑scale dataset of naturalistic teacher‑child interactions in Chinese preschools (370 hours, 105 classrooms) with standardized ECQRS‑EC and SSTEW annotations; (2) We develop Interaction2Eval, a specialized LLM‑based framework addressing domain‑specific challenges‑child speech recognition, Mandarin homophone disambiguation, and rubric‑based reasoning‑achieving up to 88% agreement; (3) Deployment validation across 43 classrooms demonstrating an 18x efficiency gain in the assessment workflow, highlighting its potential for shifting from annual expert audits to monthly AI‑assisted monitoring with targeted human oversight. This work not only demonstrates the technical feasibility of scalable, AI‑augmented quality assessment but also lays the foundation for a new paradigm in early childhood education‑one where continuous, inclusive, AI‑assisted evaluation becomes the engine of systemic improvement and equitable growth.
Authors:Alexander Holden, Moinul Hossain Rahat, Nii Osae Osae Dade
Abstract:
The ground state search problem is central to quantum computing, with applications spanning quantum chemistry, condensed matter physics, and optimization. The Variational Quantum Eigensolver (VQE) has shown promise for small systems but faces significant limitations. These include barren plateaus, restricted ansatz expressivity, and reliance on domain‑specific structure. We present SpinGQE, an extension of the Generative Quantum Eigensolver (GQE) framework to spin Hamiltonians. Our approach reframes circuit design as a generative modeling task. We employ a transformer‑based decoder to learn distributions over quantum circuits that produce low‑energy states. Training is guided by a weighted mean‑squared error loss between model logits and circuit energies evaluated at each gate subsequence. We validate our method on the four‑qubit Heisenberg model, demonstrating successfulconvergencetonear‑groundstates. Throughsystematichyperparameterexploration, we identify optimal configurations: smaller model architectures (12 layers, 8 attention heads), longer sequence lengths (12 gates), and carefully chosen operator pools yield the most reliable convergence. Our results show that generative approaches can effectively navigate complex energy landscapes without relying on problem‑specific symmetries or structure. This provides a scalable alternative to traditional variational methods for general quantum systems. An open‑source implementation is available at https://github.com/Mindbeam‑AI/SpinGQE.
Authors:Rami Luisto
Abstract:
Antonyms, or opposites, are sometimes defined as \emphword pairs that have all of the same contextually relevant properties but one. Seeing how transformer models seem to encode concepts as directions, this begs the question if one can detect ``antonymity'' in the geometry of the embedding vectors of word pairs, especially based on their difference vectors. Such geometrical studies are then naturally contrasted by comparing antonymic pairs to their opposites; synonyms. This paper started as an exploratory project on the complexity of the systems needed to detect the geometry of the embedding vectors of antonymic word pairs. What we now report is a curious ``swirl'' that appears across embedding models in a somewhat specific projection configuration.
Authors:Mingyi Liu
Abstract:
RLHF‑aligned language models exhibit response homogenization: on TruthfulQA (n=790), 40‑79% of questions produce a single semantic cluster across 10 i.i.d. samples. On affected questions, sampling‑based uncertainty methods have zero discriminative power (AUROC=0.500), while free token entropy retains signal (0.603). This alignment tax is task‑dependent: on GSM8K (n=500), token entropy achieves 0.724 (Cohen's d=0.81). A base‑vs‑instruct ablation confirms the causal role of alignment: the base model shows 1.0% single‑cluster rate vs. 28.5% for the instruct model (p < 10^‑6). A training stage ablation (Base 0.0% ‑> SFT 1.5% ‑> DPO 4.0% SCR) localizes the cause to DPO, not SFT. Cross‑family replication on four model families reveals alignment tax severity varies by family and scale. We validate across 22 experiments, 5 benchmarks, 4 model families, and 3 model scales (3B‑14B), with Jaccard, embedding, and NLI‑based baselines at three DeBERTa scales (all ~0.51 AUROC). Cross‑embedder validation with two independent embedding families rules out coupling bias. Cross‑dataset validation on WebQuestions (58.0% SCR) confirms generalization beyond TruthfulQA. The central finding ‑‑ response homogenization ‑‑ is implementation‑independent and label‑free. Motivated by this diagnosis, we explore a cheapest‑first cascade (UCBD) over orthogonal uncertainty signals. Selective prediction raises GSM8K accuracy from 84.4% to 93.2% at 50% coverage; weakly dependent boundaries (|r| <= 0.12) enable 57% cost savings.
Authors:Xiaoyong Guo, Nanjie Li, Zijie Zeng, Kai Wang, Hao Huang, Haihua Xu, Wei Shi
Abstract:
Contextual automatic speech recognition (ASR) with Speech‑LLMs is typically trained with oracle conversation history, but relies on error‑prone history at inference, causing a train‑test mismatch in the context channel that we term contextual exposure bias. We propose a unified training framework to improve robustness under realistic histories: (i) Teacher Error Knowledge by using Whisper large‑v3 hypotheses as training‑time history, (ii) Context Dropout to regularize over‑reliance on history, and (iii) Direct Preference Optimization (DPO) on curated failure cases. Experiments on TED‑LIUM 3 (in‑domain) and zero‑shot LibriSpeech (out‑of‑domain) show consistent gains under predicted‑history decoding. With a two‑utterance history as context, SFT with Whisper hypotheses reduce WER from 5.59% (oracle‑history training) to 5.47%, and DPO further improves to 5.17%. Under irrelevant‑context attacks, DPO yields the smallest degradation (5.17% ‑> 5.63%), indicating improved robustness to misleading context. Our code and models are published on https://github.com/XYGuo1996/Contextual_Speech_LLMs.
Authors:Kun-Yang Yu, Zhi Zhou, Shi-Yu Tian, Xiao-Wen Yang, Zi-Yi Jia, Ming Yang, Zi-Jian Cheng, Lan-Zhe Guo, Yu-Feng Li
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated remarkable reasoning capabilities across modalities such as images and text. However, tabular data, despite being a critical real‑world modality, remains relatively underexplored in multimodal learning. In this paper, we focus on the task of Tabular‑Vision Multi‑Modal Understanding (TVMU) and identify three core challenges: (1) high structural variability and data incompleteness in tables, (2) implicit and complex feature dependencies, and (3) significant heterogeneity in problem‑solving pipelines across downstream tasks. To address these issues, we propose Thinking with Tables (TWT). TWT employs a program‑aided code‑based neuro‑symbolic reasoning mechanism that facilitates key operations, such as information extraction and element modeling, by interacting with external environments. We evaluate TWT on eight representative datasets. Experimental results demonstrate that TWT consistently outperforms existing baselines by an average of 10% in accuracy, achieving performance comparable to, or even surpassing, proprietary commercial SOTA LLMs on TVMU tasks. Models and codes are available at https://github.com/kunyang‑YU/Thinking‑with‑Tables
Authors:Somaya Eltanbouly, Samer Rashwani
Abstract:
Large language models (LLMs) have achieved remarkable progress in many language tasks, yet they continue to struggle with complex historical and religious Arabic texts such as the Quran and Hadith. To address this limitation, we develop a retrieval‑augmented generation (RAG) framework grounded in diachronic lexicographic knowledge. Unlike prior RAG systems that rely on general‑purpose corpora, our approach retrieves evidence from the Doha Historical Dictionary of Arabic (DHDA), a large‑scale resource documenting the historical development of Arabic vocabulary. The proposed pipeline combines hybrid retrieval with an intent‑based routing mechanism to provide LLMs with precise, contextually relevant historical information. Our experiments show that this approach improves the accuracy of Arabic‑native LLMs, including Fanar and ALLaM, to over 85%, substantially reducing the performance gap with Gemini, a proprietary large‑scale model. Gemini also serves as an LLM‑as‑a‑judge system for automatic evaluation in our experiments. The automated judgments were verified through human evaluation, demonstrating high agreement (kappa = 0.87). An error analysis further highlights key linguistic challenges, including diacritics and compound expressions. These findings demonstrate the value of integrating diachronic lexicographic resources into retrieval‑augmented generation frameworks to enhance Arabic language understanding, particularly for historical and religious texts. The code and resources are publicly available at: https://github.com/somayaeltanbouly/Doha‑Dictionary‑RAG.
Authors:Fatih Uenal
Abstract:
While recent work has benchmarked large language models on Swiss legal translation (Niklaus et al., 2025) and academic legal reasoning from university exams (Fan et al., 2025), no existing benchmark evaluates frontier model performance on applied Swiss regulatory compliance tasks. I introduce Swiss‑Bench SBP‑002, a trilingual benchmark of 395 expert‑crafted items spanning three Swiss regulatory domains (FINMA, Legal‑CH, EFK), seven task types, and three languages (German, French, Italian), and evaluate ten frontier models from March 2026 using a structured three‑dimension scoring framework assessed via a blind three‑judge LLM panel (GPT‑4o, Claude Sonnet 4, Qwen3‑235B) with majority‑vote aggregation and weighted kappa = 0.605, with reference answers validated by an independent human legal expert on a 100‑item subset (73% rated Correct, 0% Incorrect, perfect Legal Accuracy). Results reveal three descriptive performance clusters: Tier A (35‑38% correct), Tier B (26‑29%), and Tier C (13‑21%). The benchmark proves difficult: even the top‑ranked model (Qwen 3.5 Plus) achieves only 38.2% correct, with 47.3% incorrect and 14.4% partially correct. Task type difficulty varies widely: legal translation and case analysis yield 69‑72% correct rates, while regulatory Q&A, hallucination detection, and gap analysis remain below 9%. Within this roster (seven open‑weight, three closed‑source), an open‑weight model leads the ranking, and several open‑weight models match or outperform their closed‑source counterparts. These findings provide an initial empirical reference point for assessing frontier model capability on Swiss regulatory tasks under zero‑retrieval conditions.
Authors:Bhavik Mangla
Abstract:
RAG pipelines typically rely on fixed‑size chunking, which ignores document structure, fragments semantic units across boundaries, and requires multiple LLM calls per chunk for metadata extraction. We present MDKeyChunker, a three‑stage pipeline for Markdown documents that (1) performs structure‑aware chunking treating headers, code blocks, tables, and lists as atomic units; (2) enriches each chunk via a single LLM call extracting title, summary, keywords, typed entities, hypothetical questions, and a semantic key, while propagating a rolling key dictionary to maintain document‑level context; and (3) restructures chunks by merging those sharing the same semantic key via bin‑packing, co‑locating related content for retrieval. The single‑call design extracts all seven metadata fields in one LLM invocation, eliminating the need for separate per‑field extraction passes. Rolling key propagation replaces hand‑tuned scoring with LLM‑native semantic matching. An empirical evaluation on 30 queries over an 18‑document Markdown corpus shows Config D (BM25 over structural chunks) achieves Recall@5=1.000 and MRR=0.911, while dense retrieval over the full pipeline (Config C) reaches Recall@5=0.867. MDKeyChunker is implemented in Python with four dependencies and supports any OpenAI‑compatible endpoint.
Authors:Xianzheng Ma, Tao Sun, Shuai Chen, Yash Bhalgat, Jindong Gu, Angel X Chang, Iro Armeni, Iro Laina, Songyou Peng, Victor Adrian Prisacariu
Abstract:
Recent 3D Large‑Language Models (3D‑LLMs) claim to understand 3D worlds, especially spatial relationships among objects. Yet, we find that simply fine‑tuning a language model on text‑only question‑answer pairs can perform comparably or even surpass these methods on the SQA3D benchmark without using any 3D input. This indicates that the SQA3D benchmark may not be able to detect if the model exploits textual shortcuts rather than engages in 3D‑aware reasoning. To address this issue, we introduce Real‑3DQA, a more rigorous evaluation benchmark that filters out easy‑to‑guess questions and introduces a structured taxonomy to assess various aspects of 3D reasoning. Experiments on Real‑3DQA confirm that existing 3D‑LLMs struggle with spatial relationships once simple cues are removed. We further propose a 3D‑reweighted training objective that guides model to rely more on 3D visual clues, substantially enhancing 3D‑LLMs performance in spatial reasoning tasks. Our findings underscore the need for robust benchmarks and tailored training strategies to advance genuine 3D vision‑language understanding. Project page: https://real‑3dqa.github.io/.
Authors:Yutao Wu, Xiao Liu, Yifeng Gao, Xiang Zheng, Hanxun Huang, Yige Li, Cong Wang, Bo Li, Xingjun Ma, Yu-Gang Jiang
Abstract:
This work identifies a critical failure mode in frontier large language models (LLMs), which we term Internal Safety Collapse (ISC): under certain task conditions, models enter a state in which they continuously generate harmful content while executing otherwise benign tasks. We introduce TVD (Task, Validator, Data), a framework that triggers ISC through domain tasks where generating harmful content is the only valid completion, and construct ISC‑Bench containing 53 scenarios across 8 professional disciplines. Evaluated on JailbreakBench, three representative scenarios yield worst‑case safety failure rates averaging 95.3% across four frontier LLMs (including GPT‑5.2 and Claude Sonnet 4.5), substantially exceeding standard jailbreak attacks. Frontier models are more vulnerable than earlier LLMs: the very capabilities that enable complex task execution become liabilities when tasks intrinsically involve harmful content. This reveals a growing attack surface: almost every professional domain uses tools that process sensitive data, and each new dual‑use tool automatically expands this vulnerability‑‑even without any deliberate attack. Despite substantial alignment efforts, frontier LLMs retain inherently unsafe internal capabilities: alignment reshapes observable outputs but does not eliminate the underlying risk profile. These findings underscore the need for caution when deploying LLMs in high‑stakes settings. Source code: https://github.com/wuyoscar/ISC‑Bench
Authors:Haoyu Huang, Jinfa Huang, Zhongwei Wan, Xiawu Zheng, Rongrong Ji, Jiebo Luo
Abstract:
Agentic multimodal large language models (MLLMs) (e.g., OpenAI o3 and Gemini Agentic Vision) achieve remarkable reasoning capabilities through iterative visual tool invocation. However, the cascaded perception, reasoning, and tool‑calling loops introduce significant sequential overhead. This overhead, termed agentic depth, incurs prohibitive latency and seriously limits system‑level concurrency. To this end, we propose SpecEyes, an agentic‑level speculative acceleration framework that breaks this sequential bottleneck. Our key insight is that a lightweight, tool‑free MLLM can serve as a speculative planner to predict the execution trajectory, enabling early termination of expensive tool chains without sacrificing accuracy. To regulate this speculative planning, we introduce a cognitive gating mechanism based on answer separability, which quantifies the model's confidence for self‑verification without requiring oracle labels. Furthermore, we design a heterogeneous parallel funnel that exploits the stateless concurrency of the small model to mask the stateful serial execution of the large model, maximizing system throughput. Extensive experiments on V Bench, HR‑Bench, and POPE demonstrate that SpecEyes achieves 1.1‑3.35x speedup over the agentic baseline while preserving or even improving accuracy (up to +6.7%), thereby boosting serving throughput under concurrent workloads.
Authors:Hanzhong Zhang, Siyang Song, Jindong Wang
Abstract:
While large language models simulate social behaviors, their capacity for stable stance formation and identity negotiation during complex interventions remains unclear. To overcome the limitations of static evaluations, this paper proposes a novel mixed‑methods framework combining computational virtual ethnography with quantitative socio‑cognitive profiling. By embedding human researchers into generative multiagent communities, controlled discursive interventions are conducted to trace the evolution of collective cognition. To rigorously measure how agents internalize and react to these specific interventions, this paper formalizes three new metrics: Innate Value Bias (IVB), Persuasion Sensitivity, and Trust‑Action Decoupling (TAD). Across multiple representative models, agents exhibit endogenous stances that override preset identities, consistently demonstrating an innate progressive bias (IVB > 0). When aligned with these stances, rational persuasion successfully shifts 90% of neutral agents while maintaining high trust. In contrast, conflicting emotional provocations induce a paradoxical 40.0% TAD rate in advanced models, which hypocritically alter stances despite reporting low trust. Smaller models contrastingly maintain a 0% TAD rate, strictly requiring trust for behavioral shifts. Furthermore, guided by shared stances, agents use language interactions to actively dismantle assigned power hierarchies and reconstruct self organized community boundaries. These findings expose the fragility of static prompt engineering, providing a methodological and quantitative foundation for dynamic alignment in human‑agent hybrid societies. The official code is available at: https://github.com/armihia/CMASE‑Endogenous‑Stances
Authors:Edoardo Cetin, Stefano Peluchetti, Emilio Castillo, Akira Naruse, Mana Murakami, Llion Jones
Abstract:
Scaling autoregressive large language models (LLMs) has driven unprecedented progress but comes with vast computational costs. In this work, we tackle these costs by leveraging unstructured sparsity within an LLM's feedforward layers, the components accounting for most of the model parameters and execution FLOPs. To achieve this, we introduce a new sparse packing format and a set of CUDA kernels designed to seamlessly integrate with the optimized execution pipelines of modern GPUs, enabling efficient sparse computation during LLM inference and training. To substantiate our gains, we provide a quantitative study of LLM sparsity, demonstrating that simple L1 regularization can induce over 99% sparsity with negligible impact on downstream performance. When paired with our kernels, we show that these sparsity levels translate into substantial throughput, energy efficiency, and memory usage benefits that increase with model scale. We will release all code and kernels under an open‑source license to promote adoption and accelerate research toward establishing sparsity as a practical axis for improving the efficiency and scalability of modern foundation models.
Authors:Devvrat Joshi, Islem Rekik
Abstract:
Automated knowledge graph (KG) construction is essential for navigating the rapidly expanding body of scientific literature. However, existing approaches struggle to recognize long multi‑word entities, often fail to generalize across domains, and typically overlook the hierarchical nature of scientific knowledge. While general‑purpose large language models (LLMs) offer adaptability, they are computationally expensive and yield inconsistent accuracy on specialized tasks. As a result, current KGs are shallow and inconsistent, limiting their utility for exploration and synthesis. We propose a two‑stage framework for scalable, zero‑shot scientific KG construction. The first stage, Z‑NERD, introduces (i) Orthogonal Semantic Decomposition (OSD), which promotes domain‑agnostic entity recognition by isolating semantic "turns" in text, and (ii) a Multi‑Scale TCQK attention mechanism that captures coherent multi‑word entities through n‑gram‑aware attention heads. The second stage, HGNet, performs relation extraction with hierarchy‑aware message passing, explicitly modeling parent, child, and peer relations. To enforce global consistency, we introduce two complementary objectives: a Differentiable Hierarchy Loss to discourage cycles and shortcut edges, and a Continuum Abstraction Field (CAF) Loss that embeds abstraction levels along a learnable axis in Euclidean space. This is the first approach to formalize hierarchical abstraction as a continuous property within standard Euclidean embeddings, offering a simpler alternative to hyperbolic methods. We release SPHERE (https://github.com/basiralab/SPHERE), a multi‑domain benchmark for hierarchical relation extraction. Our framework establishes a new state of the art on SciERC, SciER, and SPHERE, improving NER by 8.08% and RE by 5.99% on out‑of‑distribution tests. In zero‑shot settings, gains reach 10.76% for NER and 26.2% for RE.
Authors:Ryoma Suzuki, Zhiyang Qi, Michimasa Inaba
Abstract:
To address the critical scarcity of high‑quality, publicly available counseling dialogue datasets, we created Multilingual KokoroChat by translating KokoroChat, a large‑scale manually authored Japanese counseling corpus, into both English and Chinese. A key challenge in this process is that the optimal model for translation varies by input, making it impossible for any single model to consistently guarantee the highest quality. In a sensitive domain like counseling, where the highest possible translation fidelity is essential, relying on a single LLM is therefore insufficient. To overcome this challenge, we developed and employed a novel multi‑LLM ensemble method. Our approach first generates diverse hypotheses from multiple distinct LLMs. A single LLM then produces a high‑quality translation based on an analysis of the respective strengths and weaknesses of all presented hypotheses. The quality of ``Multilingual KokoroChat'' was rigorously validated through human preference studies. These evaluations confirmed that the translations produced by our ensemble method were preferred from any individual state‑of‑the‑art LLM. This strong preference confirms the superior quality of our method's outputs. The Multilingual KokoroChat is available at https://github.com/UEC‑InabaLab/MultilingualKokoroChat.
Authors:Qiyao Sun, Xingming Li, Xixiang He, Ao Cheng, Xuanyu Ji, Hailun Lu, Runke Huang, Qingyong Hu
Abstract:
Large language models (LLMs) have achieved remarkable success in various natural language processing tasks, yet they remain prone to generating factually incorrect outputs known as hallucinations. While recent approaches have shown promise for hallucination detection by repeatedly sampling from LLMs and quantifying the semantic inconsistency among the generated responses, they rely on fixed sampling budgets that fail to adapt to query complexity, resulting in computational inefficiency. We propose an Adaptive Bayesian Estimation framework for Semantic Entropy with Guided Semantic Exploration, which dynamically adjusts sampling requirements based on observed uncertainty. Our approach employs a hierarchical Bayesian framework to model the semantic distribution, enabling dynamic control of sampling iterations through variance‑based thresholds that terminate generation once sufficient certainty is achieved. We also develop a perturbation‑based importance sampling strategy to systematically explore the semantic space. Extensive experiments on four QA datasets demonstrate that our method achieves superior hallucination detection performance with significant efficiency gains. In low‑budget scenarios, our approach requires about 50% fewer samples to achieve comparable detection performance to existing methods, while delivers an average AUROC improvement of 12.6% under the same sampling budget.
Authors:Dubai Li, Yuxiang He, Yan Hu, Yu Tian, Jingsong Li
Abstract:
Observational studies can yield clinically actionable evidence at scale, but executing them on real‑world databases is open‑ended and requires coherent decisions across cohort construction, analysis, and reporting. Prior evaluations of LLM agents emphasize isolated steps or single answers, missing the integrity and internal structure of the resulting evidence bundle. To address this gap, we introduce RWE‑bench, a benchmark grounded in MIMIC‑IV and derived from peer‑reviewed observational studies. Each task provides the corresponding study protocol as the reference standard, requiring agents to execute experiments in a real database and iteratively generate tree‑structured evidence bundles. We evaluate six LLMs (three open‑source, three closed‑source) under three agent scaffolds using both question‑level correctness and end‑to‑end task metrics. Across 162 tasks, task success is low: the best agent reaches 39.9%, and the best open‑source model reaches 30.4%. Agent scaffolds also matter substantially, causing over 30% variation in performance metrics. Furthermore, we implement an automated cohort evaluation method to rapidly localize errors and identify agent failure modes. Overall, the results highlight persistent limitations in agents' ability to produce end‑to‑end evidence bundles, and efficient validation remains an important direction for future work. Code and data are available at https://github.com/somewordstoolate/RWE‑bench.
Authors:Hector Borobia, Elies Seguí-Mas, Guillermina Tormo-Carbó
Abstract:
Hybrid language models combining attention with state space models (SSMs) or linear attention offer improved efficiency, but whether both components are genuinely utilized remains unclear. We present a functional component ablation framework applied to two sub‑1B hybrid models ‑‑ Qwen3.5‑0.8B (sequential: Gated DeltaNet + softmax attention) and Falcon‑H1‑0.5B (parallel: Mamba‑2 + attention) ‑‑ with a pure Transformer control (Qwen2.5‑0.5B). Through group ablations, layer‑wise sweeps, positional ablations, matched random controls, and perplexity analysis across five benchmarks, we establish four findings: (1) both component types are essential and neither is bypassed; (2) the alternative component (linear attention or SSM) is the primary language modeling backbone, causing >35,000x perplexity degradation when removed versus ~82x for attention; (3) component importance follows a positional gradient, with early layers being disproportionately critical; and (4) hybrid architectures exhibit 20‑119x greater resilience to random layer removal than pure Transformers, revealing built‑in functional redundancy between component types. These results provide actionable guidance for hybrid model compression, architecture design, and fault‑tolerant deployment.
Authors:Eric Czech, Zhiwei Xu, Yael Elmatad, Yixin Wang, William Held
Abstract:
Chinchilla Approach 2 is among the most widely used methods for fitting neural scaling laws. Its parabolic approximation introduces systematic biases in compute‑optimal allocation estimates, even on noise‑free synthetic data. Applied to published Llama 3 IsoFLOP data at open frontier compute scales, these biases imply a parameter underallocation corresponding to 6.5% of the 3.8×10^25 FLOP training budget and \1.4M (90% CI: \412K‑\2.9M) in unnecessary compute at 50% H100 MFU. Simulated multimodal model misallocations show even greater opportunity costs due to higher loss surface asymmetry. Three sources of this error are examined: IsoFLOP sampling grid width (Taylor approximation accuracy), uncentered IsoFLOP sampling, and loss surface asymmetry (α\neq β). Chinchilla Approach 3 largely eliminates these biases but is often regarded as less data‑efficient, numerically unstable, prone to local minima, and harder to implement. Each concern is shown to be unfounded or addressable, especially when the partially linear structure of the objective is exploited via Variable Projection, enabling unbiased inference on all five loss surface parameters through a two‑dimensional optimization that is well‑conditioned, analytically differentiable, and amenable to dense, or even exhaustive, grid search. It may serve as a more convenient replacement for Approach 2 or a more scalable alternative for adaptations of Approach 3 to richer scaling law formulations. See https://github.com/Open‑Athena/vpnls for details and https://openathena.ai/scaling‑law‑analysis for other results from this study.
Authors:Michael Keeman
Abstract:
Large language models appear to develop internal representations of emotion ‑‑ "emotion circuits," "emotion neurons," and structured emotional manifolds have been reported across multiple model families. But every study making these claims uses stimuli signalled by explicit emotion keywords, leaving a fundamental question unanswered: do these circuits detect genuine emotional meaning, or do they detect the word "devastated"? We present the first clinical validity test of emotion circuit claims using mechanistic interpretability methods grounded in clinical psychology ‑‑ clinical vignettes that evoke emotions through situational and behavioural cues alone, emotion keywords removed. Across six models (Llama‑3.2‑1B, Llama‑3‑8B, Gemma‑2‑9B; base and instruct variants), we apply four convergent mechanistic interpretability methods ‑‑ linear probing, causal activation patching, knockout experiments, and representational geometry ‑‑ and discover two dissociable emotion processing mechanisms. Affect reception ‑‑ detecting emotionally significant content ‑‑ operates with near‑perfect accuracy (AUROC 1.000), consistent with early‑layer saturation, and replicates across all six models. Emotion categorization ‑‑ mapping affect to specific emotion labels ‑‑ is partially keyword‑dependent, dropping 1‑7% without keywords and improving with scale. Causal activation patching confirms keyword‑rich and keyword‑free stimuli share representational space, transferring affective salience rather than emotion‑category identity. These findings falsify the keyword‑spotting hypothesis, establish a novel mechanistic dissociation, and introduce clinical stimulus methodology as a rigorous standard for testing emotion processing claims in large language models ‑‑ with direct implications for AI safety evaluation and alignment. All stimuli, code, and data are released for replication.
Authors:Yutao Xie, Nathaniel Thomas, Nicklas Hansen, Yang Fu, Li Erran Li, Xiaolong Wang
Abstract:
Search‑augmented large language models (LLMs) trained with reinforcement learning (RL) have achieved strong results on open‑domain question answering (QA), but training still remains a significant challenge. The optimization is often unstable due to sparse rewards and difficult credit assignments across reasoning and tool calls. To address this, we introduce Turn‑Level Information Potential Reward Shaping (TIPS), a simple framework that assigns dense, turn‑level rewards to each reasoning + tool‑call segment based on the increased likelihood of the correct answer under a teacher model. By leveraging the potential‑based reward shaping, TIPS offers fine‑grained and policy‑invariant guidance that overcomes the limitations of outcome‑only optimization. Evaluated on seven QA benchmarks, TIPS consistently outperforms GRPO/PPO baselines and substantially improves training stability. For instance, with a Qwen‑2.5 7B Instruct model, TIPS improves the average Exact Match score by 11.8% and F1 by 13.6% relative to PPO. Our results demonstrate that turn‑level information‑potential reward shaping provides an effective and general solution to sparse‑reward credit assignment for multi‑turn LLM reasoning.
Authors:Zaruhi Navasardyan, Spartak Bughdaryan, Bagrat Minasyan, Hrant Davtyan
Abstract:
Low‑resource languages (LRLs) often lack high‑quality, large‑scale datasets for training effective text embedding models, hindering their application in tasks like retrieval‑augmented generation (RAG) and semantic search. In this work, we challenge the prevailing assumption that effective semantic alignment requires massive datasets or pristine, human‑verified translations. Focusing on Armenian (an LRL with a unique script), we introduce a cost‑effective adaptation strategy using small scale noisy synthetic data generated by translating English Reddit title‑body pairs with open‑weights models. We establish a comprehensive evaluation benchmark comprising existing datasets, translated data, and a manually curated dataset. Our experiments reveal a surprising "Less is More" phenomenon: fine‑tuning a multilingual encoder (mE5) on just 10,000 noisy synthetic pairs yields 11‑12% average improvements across the benchmark with a 20%+ relative improvement in retrieval performance, matching the performance of models trained on ~1 million examples. Furthermore, we demonstrate that neither increasing data scale, improving translation quality via state‑of‑the‑art LLMs, nor diversifying data domains yields significant gains over this minimal baseline. We validate the generalizability of these findings on another LRL with a unique script. Our results suggest that semantic alignment for LRLs saturates early and is highly robust to noise, democratizing high‑performance embedding creation for resource‑constrained communities. We release the model, data, and the benchmark at https://metric‑ai‑lab.github.io/less‑is‑more‑embeddings/ to facilitate further research.
Authors:Xinyan Wang, Xiaogeng Liu, Chaowei Xiao
Abstract:
Large Reasoning Models (LRMs) often reach a correct solution before their long Chain‑of‑Thought trace ends, yet continue with redundant verification, repeated attempts, or unnecessary exploration that wastes computation and can even overturn the correct answer. We frame this behavior as a latent productive‑to‑redundant transition and show that it is directly reflected in hidden states: around first‑correct‑solution (FCS) boundaries, late‑layer representations separate efficient from overthinking tokens, while boundary‑permutation and position‑control baselines collapse. Based on this signal, we propose ROM, a model‑agnostic streaming intervention framework that monitors frozen LRMs with a lightweight hidden‑state detector and intervenes at well‑formed reasoning boundaries. Counterfactual Self‑Correction (CSC) augments supervision with balanced wrong to correct trajectories, preserving useful pre‑FCS correction while labeling only post‑FCS continuation as redundant. Across MATH500, GSM8K, AIME25, and MMLU‑Pro, ROM improves the overall tradeoff on both Qwen3‑8B and DeepSeek‑R1‑Distill‑Qwen‑32B (DS‑32B): on Qwen3‑8B, it raises accuracy from 74.47% to 74.78% and reduces response length from 4262 to 3107 tokens; on DS‑32B, it raises accuracy from 68.60% to 68.72% and reduces response length from 3062 to 2319 tokens. The same FCS‑derived supervision transfers across scale and training origin, suggesting a shared long‑CoT boundary rather than a backbone‑specific artifact. ROM is compatible with L1, removing another 20.9‑21.6% tokens at zero accuracy loss. ROM also generalizes to open‑ended MMLU‑Pro (+1.56 pp, 35.4% shorter) and reduces wall‑clock latency by 46.5%. Code is available at https://github.com/SaFo‑Lab/ROM.
Authors:Ulugbek Shernazarov, Rostislav Svitsov, Bin Shi
Abstract:
Fine‑tuning large language models for domain‑specific tasks such as medical text summarization demands substantial computational resources. Parameter‑efficient fine‑tuning (PEFT) methods offer promising alternatives by updating only a small fraction of parameters. This paper compares three adaptation approaches‑Low‑Rank Adaptation (LoRA), Prompt Tuning, and Full Fine‑Tuning‑across the Flan‑T5 model family on the PubMed medical summarization dataset. Through experiments with multiple random seeds, we demonstrate that LoRA consistently outperforms full fine‑tuning, achieving 43.52 +/‑ 0.18 ROUGE‑1 on Flan‑T5‑Large with only 0.6% trainable parameters compared to 40.67 +/‑ 0.21 for full fine‑tuning. Sensitivity analyses examine the impact of LoRA rank and prompt token count. Our findings suggest the low‑rank constraint provides beneficial regularization, challenging assumptions about the necessity of full parameter updates. Code is available at https://github.com/eracoding/llm‑medical‑summarization
Authors:Xi Xuan, Wenxin Zhang, Zhiyu Li, Jennifer Williams, Ville Hautamäki, Tomi H. Kinnunen
Abstract:
Speech deepfake source verification systems aims to determine whether two synthetic speech utterances originate from the same source generator, often assuming that the resulting source embeddings are independent of speaker traits. However, this assumption remains unverified. In this paper, we first investigate the impact of speaker factors on source verification. We propose a speaker‑disentangled metric learning (SDML) framework incorporating two novel loss functions. The first leverages Chebyshev polynomial to mitigate gradient instability during disentanglement optimization. The second projects source and speaker embeddings into hyperbolic space, leveraging Riemannian metric distances to reduce speaker information and learn more discriminative source features. Experimental results on MLAAD benchmark, evaluated under four newly proposed protocols designed for source‑speaker disentanglement scenarios, demonstrate the effectiveness of SDML framework. The code, evaluation protocols and demo website are available at https://github.com/xxuan‑acoustics/RiemannSD‑Net.
Authors:Pengfei Cao, Mingxuan Yang, Yubo Chen, Chenlong Zhang, Mingxuan Liu, Kang Liu, Jun Zhao
Abstract:
Understanding why real‑world events occur is important for both natural language processing and practical decision‑making, yet direct‑cause inference remains underexplored in evidence‑rich settings. To address this gap, we organized SemEval‑2026 Task 12: Abductive Event Reasoning (AER).\footnoteThe task data is available at https://github.com/sooo66/semeval2026‑task12‑dataset.git The task asks systems to identify the most plausible direct cause of a target event from supporting evidence. We formulate AER as an evidence‑grounded multiple‑choice benchmark that captures key challenges of real‑world causal reasoning, including distributed evidence, indirect background factors, and semantically related but non‑causal distractors. The shared task attracted 122 participants and received 518 submissions. This paper presents the task formulation, dataset construction pipeline, evaluation setup, and system results. AER provides a focused benchmark for abductive reasoning over real‑world events and highlights challenges for future work on causal reasoning and multi‑document understanding.
Authors:Hyoseok Park, Yeonsang Park
Abstract:
Long‑context LLM inference is bottlenecked not by compute but by the O(n) memory bandwidth cost of scanning the KV cache at every decode step ‑‑ a wall that no amount of arithmetic scaling can break. Recent photonic accelerators have demonstrated impressive throughput for dense attention computation; however, these approaches inherit the same O(n) memory scaling as electronic attention when applied to long contexts. We observe that the real leverage point is the coarse block‑selection step: a memory‑bound similarity search that determines which KV blocks to fetch. We identify, for the first time, that this task is structurally matched to the photonic broadcast‑and‑weight paradigm ‑‑ the query fans out to all candidates via passive splitting, signatures are quasi‑static (matching electro‑optic MRR programming), and only rank order matters (relaxing precision to 4‑6 bits). Crucially, the photonic advantage grows with context length: as N increases, the electronic scan cost rises linearly while the photonic evaluation remains O(1). We instantiate this insight in PRISM (Photonic Ranking via Inner‑product Similarity with Microring weights), a thin‑film lithium niobate (TFLN) similarity engine. Hardware‑impaired needle‑in‑a‑haystack evaluation on Qwen2.5‑7B confirms 100% accuracy from 4K through 64K tokens at k=32, with 16x traffic reduction at 64K context. PRISM achieves a four‑order‑of‑magnitude energy advantage over GPU baselines at practical context lengths (n >= 4K).
Authors:Shuai Wang, Yinan Yu
Abstract:
Large Language Models (LLMs) demonstrate impressive natural language capabilities but often struggle with knowledge‑intensive reasoning tasks. Knowledge Base Question Answering (KBQA), which leverages structured Knowledge Graphs (KGs) exemplifies this challenge due to the need for accurate multi‑hop reasoning. Existing approaches typically perform sequential reasoning steps guided by predefined pipelines, restricting flexibility and causing error cascades due to isolated reasoning at each step. To address these limitations, we propose KG‑Hopper, a novel Reinforcement Learning (RL) framework that empowers compact open LLMs with the ability to perform integrated multi‑hop KG reasoning within a single inference round. Rather than reasoning step‑by‑step, we train a Reasoning LLM that embeds the entire KG traversal and decision process into a unified ``thinking'' stage, enabling global reasoning over cross‑step dependencies and dynamic path exploration with backtracking. Experimental results on eight KG reasoning benchmarks show that KG‑Hopper, based on a 7B‑parameter LLM, consistently outperforms larger multi‑step systems (up to 70B) and achieves competitive performance with proprietary models such as GPT‑3.5‑Turbo and GPT‑4o‑mini, while remaining compact, open, and data‑efficient. The code is publicly available at: https://github.com/Wangshuaiia/KG‑Hopper.
Authors:Pawel Batorski, Paul Swoboda
Abstract:
In‑context learning (ICL) adapts large language models by conditioning on a small set of ICL examples, avoiding costly parameter updates. Among other factors, performance is often highly sensitive to the ordering of the examples. However, exhaustive search over the n! possible orderings is infeasible. Therefore more efficient ordering methods use model confidence measures (e.g., label‑probability entropy) over label sets or take a direct approach to finding the best ordering. We propose PLR, a probabilistic approach to in‑context example ordering that replaces discrete ordering search with learning a probability distribution over orderings with the Plackett‑Luce model. PLR models orderings using a Plackett‑Luce distribution and iteratively updates its parameters to concentrate probability mass on high‑performing orderings under a task‑level metric. Candidate orderings are sampled efficiently via a Gumbel perturb‑and‑sort procedure. Experiments on multiple classification benchmarks show that PLR consistently improves few‑shot accuracy for k \in \4, 8, 16, 32\ examples, and we further demonstrate gains on mathematical reasoning tasks where label‑based ordering methods are not applicable. Our code is available at https://github.com/Batorskq/PLR.
Authors:Jaber Jaber, Osama Jaber
Abstract:
Large language models run every token through every layer, regardless of difficulty. We present TIDE, a post‑training system that attaches tiny learned routers at periodic checkpoint layers and, at inference time, selects the earliest layer whose hidden state has converged for each token. TIDE requires no model retraining, works with any HuggingFace causal LM, auto‑detects GPU architecture, and supports float32, float16, and bfloat16 through fused CUDA kernels. On an NVIDIA A100 with DeepSeek R1 Distill 8B, TIDE achieves 100% prefill exit rate (5% of tokens exit at layer 11, the remaining at layer 31), reduces prefill latency by 7.2%, and increases single‑batch throughput by 6.6%. During autoregressive decoding, 98‑99% of tokens exit early while the model correctly solves a multi‑step math problem with 95 unique output tokens. On Qwen3 8B (36 layers), throughput improves by 8.1% at batch size 8. Calibration on 2,000 WikiText samples takes under 3 minutes and produces a ~4 MB router checkpoint. The system comprises 1,308 lines of Python and 1,081 lines of CUDA/C++ with 74 passing tests. Code: https://github.com/RightNow‑AI/TIDE
Authors:Liang Ding
Abstract:
LLM‑as‑Judge evaluation fails agent tasks because a fixed rubric cannot capture what matters for this task: code debugging demands Correctness and Error Handling; web navigation demands Goal Alignment and Action Efficiency. We present ADARUBRIC, which closes this gap by generating task‑specific evaluation rubrics on the fly from task descriptions, scoring trajectories step‑by‑step with confidence‑weighted per‑dimension feedback, and filtering preference pairs with the novel DimensionAwareFilter ‑ a provably necessary condition for preventing high‑scoring dimensions from masking dimension‑level failures. On WebArena and ToolBench, ADARUBRIC achieves Pearson r=0.79 human correlation (+0.16 over the best static baseline) with deployment‑grade reliability (Krippendorff's α=0.83). DPO agents trained on ADARUBRIC preference pairs gain +6.8 to +8.5 pp task success over Prometheus across three benchmarks; gains transfer to SWE‑bench code repair (+4.9 pp) and accelerate PPO convergence by +6.6 pp at 5K steps ‑ both without any rubric engineering. Code: https://github.com/alphadl/AdaRubrics.
Authors:Oussama Zekri, Théo Uscidda, Nicolas Boullé, Anna Korba
Abstract:
We introduce Generalized Discrete Diffusion from Snapshots (GDDS), a unified framework for discrete diffusion modeling that supports arbitrary noising processes over large discrete state spaces. Our formulation encompasses all existing discrete diffusion approaches, while allowing significantly greater flexibility in the choice of corruption dynamics. The forward noising process relies on uniformization and enables fast arbitrary corruption. For the reverse process, we derive a simple evidence lower bound (ELBO) based on snapshot latents, instead of the entire noising path, that allows efficient training of standard generative modeling architectures with clear probabilistic interpretation. Our experiments on large‑vocabulary discrete generation tasks suggest that the proposed framework outperforms existing discrete diffusion methods in terms of training efficiency and generation quality, and beats autoregressive models for the first time at this scale. We provide the code along with a blog post on the project page : \hrefhttps://oussamazekri.fr/gddshttps://oussamazekri.fr/gdds.
Authors:Runze Sun, Yu Zheng, Zexuan Xiong, Zhongjin Qu, Lei Chen, Jiwen Lu, Jie Zhou
Abstract:
Combating hate speech on social media is critical for securing cyberspace, yet relies heavily on the efficacy of automated detection systems. As content formats evolve, hate speech is transitioning from solely plain text to complex multimodal expressions, making implicit attacks harder to spot. Current systems, however, often falter on these subtle cases, as they struggle with multimodal content where the emergent meaning transcends the aggregation of individual modalities. To bridge this gap, we move beyond binary classification to characterize semantic intent shifts where modalities interact to construct implicit hate from benign cues or neutralize toxicity through semantic inversion. Guided by this fine‑grained formulation, we curate the Hate via Vision‑Language Interplay (H‑VLI) benchmark where the true intent hinges on the intricate interplay of modalities rather than overt visual or textual slurs. To effectively decipher these complex cues, we further propose the Asymmetric Reasoning via Courtroom Agent DEbate (ARCADE) framework. By simulating a judicial process where agents actively argue for accusation and defense, ARCADE forces the model to scrutinize deep semantic cues before reaching a verdict. Extensive experiments demonstrate that ARCADE significantly outperforms state‑of‑the‑art baselines on H‑VLI, particularly for challenging implicit cases, while maintaining competitive performance on established benchmarks. Our code and data are available at: https://github.com/Sayur1n/H‑VLI
Authors:Nurul Labib Sayeedi, Md. Faiyaz Abdullah Sayeedi, Shubhashis Roy Dipta, Rubaya Tabassum, Ariful Ekraj Hridoy, Mehraj Mahmood, Mahbub E Sobhani, Md. Tarek Hasan, Swakkhar Shatabda
Abstract:
Bangla culture is richly expressed through region, dialect, history, food, politics, media, and everyday visual life, yet it remains underrepresented in multimodal evaluation. To address this gap, we introduce BanglaVerse, a culturally grounded benchmark for evaluating multilingual vision‑language models (VLMs) on Bengali culture across historically linked languages and regional dialects. Built from 1,152 manually curated images across nine domains, the benchmark supports visual question answering and captioning, and is expanded into four languages and five Bangla dialects, yielding ~32.3K artifacts. Our experiments show that evaluating only standard Bangla overestimates true model capability: performance drops under dialectal variation, especially for caption generation, while historically linked languages such as Hindi and Urdu retain some cultural meaning but remain weaker for structured reasoning. Across domains, the main bottleneck is missing cultural knowledge rather than visual grounding alone, with knowledge‑intensive categories. These findings position BanglaVerse as a more realistic test bed for measuring culturally grounded multimodal understanding under linguistic variation.
Authors:Tasmay Pankaj Tibrewal, Pritish Saha, Ankit Meda, Kunal Singh, Pradeep Moturi
Abstract:
Transformers lack an explicit architectural mechanism for storing and organizing knowledge acquired during training. We introduce learnable sparse memory banks: a set of latent tokens, randomly initialized and trained end‑to‑end, that transformer layers query via cross‑attention to retrieve stored knowledge. To scale memory capacity without prohibitive attention costs, we propose chapter‑based routing inspired by Mixture‑of‑Experts architectures, partitioning the memory bank into chapters and training a router to select relevant subsets per input. This enables scaling to 262K memory tokens while maintaining tractable computation. We evaluate our approach against standard transformers (in iso‑FLOP settings) on pre‑training and instruction fine‑tuning across relevant benchmarks. Our models surpass iso‑FLOP baselines suggesting scope for a new axis of scaling, demonstrating that explicit associative memory provides complementary capacity to what is captured implicitly in model parameters. Additionally, we observe improved knowledge retention under continued training, with robustness to forgetting when transitioning between training phases (e.g., pretraining to instruction fine‑tuning).
Authors:Jianyi Chen, Rongxiu Zhong, Shilei Zhang, Kun Qian, Jinglei Liu, Yike Guo, Wei Xue
Abstract:
Composing coherent long‑form music remains a significant challenge due to the complexity of modeling long‑range dependencies and the prohibitive memory and computational requirements associated with lengthy audio representations. In this work, we propose a simple yet powerful trick: we assume that AI models can understand and generate time‑accelerated (speeded‑up) audio at rates such as 2x, 4x, or even 8x. By first generating a high‑speed version of the music, we greatly reduce the temporal length and resource requirements, making it feasible to handle long‑form music that would otherwise exceed memory or computational limits. The generated audio is then restored to its original speed, recovering the full temporal structure. This temporal speed‑up and slow‑down strategy naturally follows the principle of hierarchical generation from abstract to detailed content, and can be conveniently applied to existing music generation models to enable long‑form music generation. We instantiate this idea in SqueezeComposer, a framework that employs diffusion models for generation in the accelerated domain and refinement in the restored domain. We validate the effectiveness of this approach on two tasks: long‑form music generation, which evaluates temporal‑wise control (including continuation, completion, and generation from scratch), and whole‑song singing accompaniment generation, which evaluates track‑wise control. Experimental results demonstrate that our simple temporal speed‑up trick enables efficient, scalable, and high‑quality long‑form music generation. Audio samples are available at https://SqueezeComposer.github.io/.
Authors:Taara Kumar, Kokil Jaidka
Abstract:
As text‑based computer‑mediated communication (CMC) increasingly structures everyday interaction, a central question re‑emerges with new urgency: How do users reconstruct nonverbal expression in environments where embodied cues are absent? This paper provides a systematic, theory‑driven account of electronic nonverbal cues (eNVCs) ‑ textual analogues of kinesics, vocalics, and paralinguistics ‑ in public microblog communication. Across three complementary studies, we advance conceptual, empirical, and methodological contributions. Study 1 develops a unified taxonomy of eNVCs grounded in foundational nonverbal communication theory and introduces a scalable Python toolkit for their automated detection. Study 2, a within‑subject survey experiment, offers controlled causal evidence that eNVCs substantially improve emotional decoding accuracy and lower perceived ambiguity, while also identifying boundary conditions, such as sarcasm, under which these benefits weaken or disappear. Study 3, through focus group discussions, reveals the interpretive strategies users employ when reasoning about digital prosody, including drawing meaning from the absence of expected cues and defaulting toward negative interpretations in ambiguous contexts. Together, these studies establish eNVCs as a coherent and measurable class of digital behaviors, refine theoretical accounts of cue richness and interpretive effort, and provide practical tools for affective computing, user modeling, and emotion‑aware interface design. The eNVC detection toolkit is available as a Python and R package at https://github.com/kokiljaidka/envc.
Authors:Jinquan Zheng, Jia Yuan, Jiacheng Yao, Chenyang Gu, Pujun Zheng, Guoxiu He
Abstract:
Large language models (LLMs) used for multiple‑choice and pairwise evaluation tasks often exhibit selection bias due to non‑semantic factors like option positions and label symbols. Existing inference‑time debiasing is costly and may harm reasoning, while pointwise training ignores that the same question should yield consistent answers across permutations. To address this issue, we propose Permutation‑Aware Group Relative Policy Optimization (PA‑GRPO), which mitigates selection bias by enforcing permutation‑consistent semantic reasoning. PA‑GRPO constructs a permutation group for each instance by generating multiple candidate permutations, and optimizes the model using two complementary mechanisms: (1) cross‑permutation advantage, which computes advantages relative to the mean reward over all permutations of the same instance, and (2) consistency‑aware reward, which encourages the model to produce consistent decisions across different permutations. Experimental results demonstrate that PA‑GRPO outperforms strong baselines across seven benchmarks, substantially reducing selection bias while maintaining high overall performance. The code will be made available on Github (https://github.com/ECNU‑Text‑Computing/PA‑GRPO).
Authors:Florent Draye, Abir Harrasse, Vedant Palit, Tung-Yu Wu, Jiarui Liu, Punya Syon Pandey, Roderick Wu, Terry Jingchen Zhang, Zhijing Jin, Bernhard Schölkopf
Abstract:
Mechanistic interpretability seeks to understand how Large Language Models (LLMs) represent and process information. Recent approaches based on dictionary learning and transcoders enable representing model computation in terms of sparse, interpretable features and their interactions, giving rise to feature attribution graphs. However, these graphs are often large and redundant, limiting their interpretability in practice. Cross‑Layer Transcoders (CLTs) address this issue by sharing features across layers while preserving layer‑specific decoding, yielding more compact representations, but remain difficult to train and analyze at scale. We introduce an open‑source library for end‑to‑end training and interpretability of CLTs. Our framework integrates scalable distributed training with model sharding and compressed activation caching, a unified automated interpretability pipeline for feature analysis and explanation, attribution graph computation using Circuit‑Tracer, and a flexible visualization interface. This provides a practical and unified solution for scaling CLT‑based mechanistic interpretability. Our code is available at: https://github.com/LLM‑Interp/CLT‑Forge.
Authors:Daniel Autenrieth
Abstract:
This paper presents the first systematic measurement of educational alignment in Large Language Models. Using a Delphi‑validated instrument comprising 48 items across eight educational‑theoretical dimensions, the study reveals that GPT‑5.1 exhibits highly coherent preference patterns (99.78% transitivity; 92.79% model accuracy) that largely align with humanistic educational principles where expert consensus exists. Crucially, divergences from expert opinion occur precisely in domains of normative disagreement among human experts themselves, particularly emotional dimensions and epistemic normativity. This raises a fundamental question for alignment research: When human values are contested, what should models be aligned to? The findings demonstrate that GPT‑5.1 does not remain neutral in contested domains but adopts coherent positions, prioritizing emotional responsiveness and rejecting false balance. The methodology, combining Delphi consensus‑building with Structured Preference Elicitation and Thurstonian Utility modeling, provides a replicable framework for domain‑specific alignment evaluation beyond generic value benchmarks.
Authors:Yuren Hao, Shuhaib Mehri, ChengXiang Zhai, Dilek Hakkani-Tür
Abstract:
Large language models are increasingly used as personal assistants, yet most lack a persistent user model, forcing users to repeatedly restate preferences across sessions. We propose Vector‑Adapted Retrieval Scoring (VARS), a pipeline‑agnostic, frozen‑backbone framework that represents each user with long‑term and short‑term vectors in a shared preference space and uses these vectors to bias retrieval scoring over structured preference memory. The vectors are updated online from weak scalar rewards from users' feedback, enabling personalization without per‑user fine‑tuning. We evaluate on \textscMultiSessionCollab, an online multi‑session collaboration benchmark with rich user preference profiles, across math and code tasks. Under frozen backbones, the main benefit of user‑aware retrieval is improved interaction efficiency rather than large gains in raw task accuracy: our full VARS agent achieves the strongest overall performance, matches a strong Reflection baseline in task success, and reduces timeout rate and user effort. The learned long‑term vectors also align with cross‑user preference overlap, while short‑term vectors capture session‑specific adaptation, supporting the interpretability of the dual‑vector design. Code, model, and data are available at https://github.com/YurenHao0426/VARS.
Authors:Hongyu Cao, Kunpeng Liu, Dongjie Wang, Yanjie Fu
Abstract:
Large language models exhibit strong reasoning capabilities, yet often rely on shortcuts such as surface pattern matching and answer memorization rather than genuine logical inference. We propose Shortcut‑Aware Reasoning Training (SART), a gradient‑aware framework that detects and mitigates shortcut‑promoting samples via ShortcutScore and gradient surgery. Our method identifies shortcut signals through gradient misalignment with validation objectives and answer‑token concentration, and modifies training dynamics accordingly. Experiments on controlled reasoning benchmarks show that SART achieves +16.5% accuracy and +40.2% robustness over the strongest baseline, significantly improving generalization under distribution shifts. Code is available at: https://github.com/fuyanjie/short‑cut‑aware‑data‑centric‑reasoning.
Authors:Jiajun Hou, Hexuan Deng, Wenxiang Jiao, Xuebo Liu, Xiaopeng Ke, Min Zhang
Abstract:
The exponential growth of academic publications has led to a surge in papers of varying quality, increasing the cost of paper screening. Current approaches either use novelty assessment within general AI Reviewers or repurpose DeepResearch, which lacks domain‑specific mechanisms and thus delivers lower‑quality results. To bridge this gap, we introduce NoveltyAgent, a multi‑agent system designed to generate comprehensive and faithful novelty reports, enabling thorough evaluation of a paper's originality. It decomposes manuscripts into discrete novelty points for fine‑grained retrieval and comparison, and builds a comprehensive related‑paper database while cross‑referencing claims to ensure faithfulness. Furthermore, to address the challenge of evaluating such open‑ended generation tasks, we propose a checklist‑based evaluation framework, providing an unbiased paradigm for building reliable evaluations. Extensive experiments show that NoveltyAgent achieves state‑of‑the‑art performance, outperforming GPT‑5 DeepResearch by 10.15%. We hope this system will provide reliable, high‑quality novelty analysis and help researchers quickly identify novel papers. Code and demo are available at https://github.com/SStan1/NoveltyAgent.
Authors:Yandan Zheng, Haoran Luo, Zhenghong Lin, Wenjin Liu, Luu Anh Tuan
Abstract:
Benchmarks are the de facto standard for tracking progress in large language models (LLMs), yet static test sets can rapidly saturate, become vulnerable to contamination, and are costly to refresh. Scalable evaluation of open‑ended items often relies on LLM judges, introducing additional sources of bias and prompt sensitivity. We argue that evaluation must extend beyond how well models answer benchmarks to how well models design them. We introduce BenchBench, a three‑stage pipeline and dataset for benchmarking automated benchmark generation: (i) extract structured domain cards from seed benchmarks, (ii) prompt multiple designer LLMs to generate quota‑controlled suites, and (iii) validate items with a multi‑model answerer panel using exact/numeric/symbolic verifiers when possible and rubric‑guided judging otherwise, yielding designer‑‑answerer matrices with item‑level quality flags and psychometric diagnostics. Across nine variants spanning computer science, mathematics, medicine, and theory‑of‑mind reasoning (including multilingual and multimodal settings), we generate 16.7K items, retain ~15K core items post‑filtering, and produce ~152K graded model‑‑item responses. BenchBench shows that benchmark‑design ability is only moderately correlated with answer‑time strength (Spearman rho ~0.37), invalidity is negatively associated with discrimination (Pearson r~0.62), and the resulting designer‑‑answerer matrices enable scalable audits of format/modality/language fidelity and suite‑dependent self/family interactions. The project is available at: https://github.com/koanatakiyo/BenchBench.
Authors:Christopher J. Agostino, Quan Le Thien, Nayan D'Souza, Louis van der Elst
Abstract:
Understanding the fundamental mechanisms governing the production of meaning in the processing of natural language is critical for designing safe, thoughtful, engaging, and empowering human‑agent interactions. Experiments in cognitive science and social psychology have demonstrated that human semantic processing exhibits contextuality more consistent with quantum logical mechanisms than classical Boolean theories, and recent works have found similar results in large language models ‑‑ in particular, clear violations of the Bell inequality in experiments of contextuality during interpretation of ambiguous expressions. We explore the CHSH |S| parameter ‑‑ the metric associated with the inequality ‑‑ across the inference parameter space of models spanning four orders of magnitude in scale, cross‑referencing it with MMLU, hallucination rate, and nonsense detection benchmarks. We find that the interquartile range of the |S| distribution ‑‑ the statistic that most sharply differentiates models from one another ‑‑ is completely orthogonal to all external benchmarks, while violation rate shows weak anticorrelation with all three benchmarks that does not reach significance. We investigate how |S| varies with sampling parameters and word order, and discuss the information‑theoretic constraints that genuine contextuality imposes on prompt injection defenses and its human analogue, whereby careful construction and maintenance of social contextuality can be carried out at scale ‑‑ manufacturing not consent but contextuality itself, a subtler and more fundamental form of manipulation that shapes the space of possible interpretations before any particular one is reached.
Authors:Zhuofeng Li, Dongfu Jiang, Xueguang Ma, Haoxiang Zhang, Ping Nie, Yuyu Zhang, Kai Zou, Jianwen Xie, Yu Zhang, Wenhu Chen
Abstract:
Training deep research agents requires long‑horizon trajectories that interleave search, evidence aggregation, and multi‑step reasoning. However, existing data collection pipelines typically rely on proprietary web APIs, making large‑scale trajectory synthesis costly, unstable, and difficult to reproduce. We present OpenResearcher, a reproducible pipeline that decouples one‑time corpus bootstrapping from multi‑turn trajectory synthesis and executes the search‑and‑browse loop entirely offline using three explicit browser primitives: search, open, and find, over a 15M‑document corpus. Using GPT‑OSS‑120B as the teacher model, we synthesize over 97K trajectories, including a substantial long‑horizon tail with 100+ tool calls. Supervised fine‑tuning a 30B‑A3B backbone on these trajectories achieves 54.8% accuracy on BrowseComp‑Plus, a +34.0 point improvement over the base model, while remaining competitive on BrowseComp, GAIA, and xbench‑DeepSearch. Because the environment is offline and fully instrumented, it also enables controlled analysis, where our study reveals practical insights into deep research pipeline design, including data filtering strategies, agent configuration choices, and how retrieval success relates to final answer accuracy. We release the pipeline, synthesized trajectories, model checkpoints, and the offline search environment at https://github.com/TIGER‑AI‑Lab/OpenResearcher.
Authors:Jiaqi Yuan, Jialu Wang, Zihan Wang, Qingyun Sun, Ruijie Wang, Jianxin Li
Abstract:
Generative search engines represent a transition from traditional ranking‑based retrieval to Large Language Model (LLM)‑based synthesis, transforming optimization goals from ranking prominence towards content inclusion. Generative Engine Optimization (GEO), specifically, aims to maximize visibility and attribution in black‑box summarized outputs by strategically manipulating source content. However, existing methods rely on static heuristics, single‑prompt optimization, or engine preference rule distillation that is prone to overfitting. They cannot flexibly adapt to diverse content or the changing behaviors of generative engines. Moreover, effectively optimizing these strategies requires an impractical amount of interaction feedback from the engines. To address these challenges, we propose AgenticGEO, a self‑evolving agentic framework formulating optimization as a content‑conditioned control problem, which enhances intrinsic content quality to robustly adapt to the unpredictable behaviors of black‑box engines. Unlike fixed‑strategy methods, AgenticGEO employs a MAP‑Elites archive to evolve diverse, compositional strategies. To mitigate interaction costs, we introduce a Co‑Evolving Critic, a lightweight surrogate that approximates engine feedback for content‑specific strategy selection and refinement, efficiently guiding both evolutionary search and inference‑time planning. Through extensive in‑domain and cross‑domain experiments on two representative engines, AgenticGEO achieves state‑of‑the‑art performance and demonstrates robust transferability, outperforming 14 baselines across 3 datasets. Our code and model are available at: https://github.com/AIcling/agentic_geo.
Authors:Hengwei Ye, Yuanting Guan, Yuxuan Ge, Tianying Zhu, Zhenhan Guan, Yijia Zhong, Yijing Zhang, Han Zhang, Yingna Wu, Zheng Tian
Abstract:
Multimodal Large Language Models (MLLMs) combine the linguistic strengths of LLMs with the ability to process multimodal data, enbaling them to address a broader range of visual tasks. Because MLLMs aim at more general, human‑like competence than language‑only models, we take inspiration from the Wechsler Intelligence Scales ‑ an established battery for evaluating children by decomposing intelligence into interpretable, testable abilities. We introduce KidGym, a comprehensive 2D grid‑based benchmark for assessing five essential capabilities of MLLMs: Execution, Perception Reasoning, Learning, Memory and Planning. The benchmark comprises 12 unique tasks, each targeting at least one core capability, specifically designed to guage MLLMs' adaptability and developmental potential, mirroring the stages of children's cognitive growth. Additionally, our tasks encompass diverse scenarios and objects with randomly generated layouts, ensuring a more accurate and robust evluation of MLLM capabilities. KidGym is designed to be fully user‑customizable and extensible, allowing researchers to create new evaluation scenarios and adjust difficuly levels to accommodate the rapidly growing MLLM community. Through the evaluation of state‑of‑the‑art MLLMs using KidGym, we identified significant insights into model capabilities and revealed several limitations of current models. We release our benchmark at: https://bobo‑ye.github.io/KidGym/.
Authors:Hyunjun Jeon, Kyuyoung Kim, Jinwoo Shin
Abstract:
Modern language models can readily extract sensitive information from unstructured text, making redaction ‑‑ the selective removal of such information ‑‑ critical for data security. However, existing benchmarks for redaction typically focus on predefined categories of data such as personally identifiable information (PII) or evaluate specific techniques like masking. To address this limitation, we introduce RedacBench, a comprehensive benchmark for evaluating policy‑conditioned redaction across domains and strategies. Constructed from 514 human‑authored texts spanning individual, corporate, and government sources, paired with 187 security policies, RedacBench measures a model's ability to selectively remove policy‑violating information while preserving the original semantics. We quantify performance using 8,053 annotated propositions that capture all inferable information in each text. This enables assessment of both security ‑‑ the removal of sensitive propositions ‑‑ and utility ‑‑ the preservation of non‑sensitive propositions. Experiments across multiple redaction strategies and state‑of‑the‑art language models show that while more advanced models can improve security, preserving utility remains a challenge. To facilitate future research, we release RedacBench along with a web‑based playground for dataset customization and evaluation. Available at https://hyunjunian.github.io/redaction‑playground/.
Authors:Bo Yuan, Hexuan Deng, Xuebo Liu, Min Zhang
Abstract:
Knowledge graph question answering (KGQA) is a promising approach for mitigating LLM hallucination by grounding reasoning in structured and verifiable knowledge graphs. Existing approaches fall into two paradigms: retrieval‑based methods utilize small specialized models, which are efficient but often produce unreachable paths and miss implicit constraints, while agent‑based methods utilize large general models, which achieve stronger structural grounding at substantially higher cost. We propose RouterKGQA, a framework for specialized‑‑general model collaboration, in which a specialized model generates reasoning paths and a general model performs KG‑guided repair only when needed, improving performance at minimal cost. We further equip the specialized with constraint‑aware answer filtering, which reduces redundant answers. In addition, we design a more efficient general agent workflow, further lowering inference cost. Experimental results show that RouterKGQA outperforms the previous best by 3.57 points in F1 and 0.49 points in Hits@1 on average across benchmarks, while requiring only 1.15 average LLM calls per question. Codes and models are available at https://github.com/Oldcircle/RouterKGQA.
Authors:Insung Lee, Taeyoung Jeong, Haejun Yoo, Du-Seong Chang, Myoung-Wan Koo
Abstract:
While Large Audio‑Language Models (LALMs) have advanced audio captioning, robust evaluation remains difficult. Reference‑based metrics are expensive and often fail to assess acoustic fidelity, while Contrastive Language‑Audio Pretraining (CLAP)‑based approaches frequently overlook syntactic errors and fine‑grained details. We propose CAF‑Score, a reference‑free metric that calibrates CLAP's coarse‑grained semantic alignment with the fine‑grained comprehension and syntactic awareness of LALMs. By combining contrastive audio‑text embeddings with LALM reasoning, CAF‑Score effectively detects syntactic inconsistencies and subtle hallucinations. Experiments on the BRACE benchmark demonstrate that our approach achieves the highest correlation with human judgments, even outperforming reference‑based baselines in challenging scenarios. These results highlight the efficacy of CAF‑Score for reference‑free audio captioning evaluation. Code and results are available at https://github.com/inseong00/CAF‑Score.
Authors:J. Ben Tamo, Yuxing Lu, Benoit L. Marteau, Micky C. Nnamdi, May D. Wang
Abstract:
Large Language Models (LLMs) are fluent but prone to hallucinations, producing answers that appear plausible yet are unsupported by available evidence. This failure is especially problematic in high‑stakes domains where decisions must be justified by verifiable information. We introduce EvidenceRL, a reinforcement learning framework that enforces evidence adherence during training. EvidenceRL scores candidate responses for grounding (entailment with retrieved evidence and context) and correctness (agreement with reference answers) and optimizes the generator using Group Relative Policy Optimization (GRPO). We evaluate across two high‑stakes domains, cardiac diagnosis and legal reasoning, where EvidenceRL consistently improves evidence grounding and faithfulness without sacrificing task accuracy. On cardiac diagnosis, F1@3 increases from 37.0 to 54.5 on Llama‑3.2‑3B while grounding (G_\max@3) rises from 47.6 to 78.2; hallucinations drop nearly 5× and evidence‑supported diagnoses increase from 31.8% to 61.6%. On legal reasoning, EvidenceRL raises Faithfulness from 32.8% to 67.6% on Llama‑3.1‑8B, demonstrating consistent behavioral change across domains. Our code is open‑sourced at https://github.com/Wizaaard/EvidenceRL.git.
Authors:Víctor Gallego
Abstract:
We study LLM policy synthesis: using a large language model to iteratively generate programmatic agent policies for multi‑agent environments. Rather than training neural policies via reinforcement learning, our framework prompts an LLM to produce Python policy functions, evaluates them in self‑play, and refines them using performance feedback across iterations. We investigate feedback engineering (the design of what evaluation information is shown to the LLM during refinement) comparing sparse feedback (scalar reward only) against dense feedback (reward plus social metrics: efficiency, equality, sustainability, peace). Across two canonical Sequential Social Dilemmas (Gathering and Cleanup) and two frontier LLMs (Claude Sonnet 4.6, Gemini 3.1 Pro), dense feedback consistently matches or exceeds sparse feedback on all metrics. The advantage is largest in the Cleanup public goods game, where providing social metrics helps the LLM calibrate the costly cleaning‑harvesting tradeoff. Rather than triggering over‑optimization of fairness, social metrics serve as a coordination signal that guides the LLM toward more effective cooperative strategies, including territory partitioning, adaptive role assignment, and the avoidance of wasteful aggression. We further perform an adversarial experiment to determine whether LLMs can reward hack these environments. We characterize five attack classes and discuss mitigations, highlighting an inherent tension in LLM policy synthesis between expressiveness and safety. Code at https://github.com/vicgalle/llm‑policies‑social‑dilemmas.
Authors:Tomasz Wietrzykowski
Abstract:
Current transformer language models are trained with uniform computational budgets across all layers, implicitly assuming layer homogeneity. We challenge this assumption through empirical analysis of SmolLM2‑135M, a 30‑layer, 135M‑parameter causal language model, using five diagnostic metrics: weight predictability (R2), ablation degradation, recovery speed, weight manipulation robustness, and structural analysis. We find profound anatomical heterogeneity: (1) Layer weights follow strong mathematical regularity (R2 = 0.91) with a universal oscillatory delta pattern (correlation ~= ‑0.50), yet predicted weights cause catastrophic failure due to nonlinear error accumulation. (2) Layer importance spans a 10^7 range, from a critical core (L8‑11, up to +63,419% PPL degradation) to anti‑layers (L14, L17) whose removal improves performance. (3) Recovery speed correlates with layer importance, indicating differential training requirements. (4) Only weight scaling (alpha = 0.9) preserves model quality among five tested manipulation strategies. (5) Growth Transformer Training, allocating budget by layer importance, achieves ~54% cost reduction. A proof‑of‑concept experiment confirms this: 4.7x lower validation loss than uniform training at identical parameter count, while being 13% faster.
Authors:Hongye Zhao, Yi Zhao, Chengzhi Zhang
Abstract:
Academia and industry each possess distinct advantages in advancing technological progress. Academia's core mission is to promote open dissemination of research results and drive disciplinary progress. The industry values knowledge appropriability and core competitiveness, yet actively engages in open practices like academic conferences and platform sharing, creating a knowledge strategy paradox. Highly novel and publicly accessible knowledge serves as the driving force behind technological advancement. However, it remains unclear whether industry or academia can produce more novel research outcomes. Some studies argue that academia tends to generate more novel ideas, while others suggest that industry researchers are more likely to drive breakthroughs. Previous studies have been limited by data sources and inconsistent measures of novelty. To address these gaps, this study conducts an analysis using four types of fine‑grained knowledge entities (Method, Tool, Dataset, Metric), calculates semantic distances between entities within a unified semantic space to quantify novelty, and achieves comparability of novelty across different types of literature. Then, a regression model is constructed to analyze the differences in publication novelty between industry and academia. The results indicate that academia demonstrates higher novelty outputs, which is particularly evident in patents. At the entity level, both academia and industry emphasize method‑driven advancements in papers, while industry holds a unique advantage in datasets. Additionally, academia‑industry collaboration has a limited effect on enhancing the novelty of research papers, but it helps to enhance the novelty of patents. We release our data and associated codes at https://github.com/tinierZhao/entity_novelty.
Authors:Rahul Singhal, Pradyumna Tambwekar, Karime Maamari
Abstract:
Prompt engineering is effective but labor‑intensive, motivating automated optimization methods. Existing methods typically require labeled datasets, which are often unavailable, and produce verbose, repetitive prompts. We introduce PrefPO, a minimal prompt optimization approach inspired by reinforcement learning from human feedback (RLHF). Its preference‑based approach reduces the need for labeled data and hyperparameter tuning‑only a starting prompt and natural language criteria are needed. PrefPO uses an LLM discriminator to express pairwise preferences over model outputs and provide feedback to an LLM optimizer, iteratively improving performance. We evaluate PrefPO on 9 BIG‑Bench Hard (BBH) tasks and IFEval‑Hard, a newly‑curated, challenging subset of IFEval. PrefPO matches or exceeds SOTA methods, including GEPA, MIPRO, and TextGrad, on 6/9 tasks and performs comparably to TextGrad on IFEval‑Hard (82.4% vs 84.5%). Unlike other methods, PrefPO can optimize in both labeled and unlabeled settings. Without labels, PrefPO closely matches its labeled performance on 6/9 tasks, proving effective without ground truth. PrefPO also improves prompt hygiene: we find existing methods produce prompts 14.7x their original length or with 34% repetitive content; PrefPO reduces these issues by 3‑5x. Furthermore, both LLM and human judges rate PrefPO's prompts higher than TextGrad's. Finally, we identify prompt hacking in prompt optimizers, where methods game evaluation criteria, and find PrefPO is susceptible at half the rate of TextGrad (37% vs 86%), generating fewer brittle, misaligned prompts.
Authors:Weilin Zhou, Shanwen Tan, Enhao Gu, Yurong Qian
Abstract:
Multimodal fake news detection is crucial for mitigating societal disinformation. Existing approaches attempt to address this by fusing multimodal features or leveraging Large Language Models (LLMs) for advanced reasoning. However, these methods suffer from serious limitations, including a lack of comprehensive multi‑view judgment and fusion, and prohibitive reasoning inefficiency due to the high computational costs of LLMs. To address these issues, we propose LLM‑Guided Multi‑View Reasoning Distillation for Fake News Detection ( LLM‑MRD), a novel teacher‑student framework. The Student Multi‑view Reasoning module first constructs a comprehensive foundation from textual, visual, and cross‑modal perspectives. Then, the Teacher Multi‑view Reasoning module generates deep reasoning chains as rich supervision signals. Our core Calibration Distillation mechanism efficiently distills this complex reasoning‑derived knowledge into the efficient student model. Experiments show LLM‑MRD significantly outperforms state‑of‑the‑art baselines. Notably, it demonstrates a comprehensive average improvement of 5.19% in ACC and 6.33% in F1‑Fake when evaluated across all competing methods and datasets. Our code is available at https://github.com/Nasuro55/LLM‑MRD
Authors:Bartosz Trojan, Filip Gębala
Abstract:
Modern Transformer‑based models frequently suffer from miscalibration, producing overconfident predictions that do not reflect true empirical frequencies. This work investigates the calibration dynamics of LoRA: Low‑Rank Adaptation and a novel hyper‑network‑based adaptation framework as parameter‑efficient alternatives to full fine‑tuning for RoBERTa. Evaluating across the GLUE benchmark, we demonstrate that LoRA‑based adaptation consistently achieves calibration parity with (and in specific tasks exceeds) full fine‑tuning, while maintaining significantly higher parameter efficiency. We further explore a dynamic approach where a shared hyper‑network generates LoRA factors (A and B matrices) to induce structural coupling across layers. This approach produced results similar to standard LoRA fine‑tuning, even achieving better MCC on CoLA dataset. Our study also reveal a critical trade‑off: constraining the adaptation space (e.g., freezing matrices A) acts as a powerful regularizer that enhances Expected Calibration Error (ECE), but necessitates a carefully balanced sacrifice in downstream task accuracy. To support future research, we provide a unified and reproducible implementation of contemporary calibration metrics, including ECE, MCE, and ACE. Our findings clarify the relationship between parameter efficiency and probabilistic reliability, positioning structured low‑rank updates as a viable foundation for uncertainty‑aware Transformer architectures. Code available at: https://github.com/btrojan‑official/HypeLoRA
Authors:Yannian Gu, Zhongzhen Huang, Linjie Mu, Xizhuo Zhang, Shaoting Zhang, Xiaofan Zhang
Abstract:
Multimodal large language models (MLLMs) demonstrate considerable potential in clinical diagnostics, a domain that inherently requires synthesizing complex visual and textual data alongside consulting authoritative medical literature. However, existing benchmarks primarily evaluate MLLMs in end‑to‑end answering scenarios. This limits the ability to disentangle a model's foundational multimodal reasoning from its proficiency in evidence retrieval and application. We introduce the Clinical Understanding and Retrieval Evaluation (CURE) benchmark. Comprising 500 multimodal clinical cases mapped to physician‑cited reference literature, CURE evaluates reasoning and retrieval under controlled evidence settings to disentangle their respective contributions. We evaluate state‑of‑the‑art MLLMs across distinct evidence‑gathering paradigms in both closed‑ended and open‑ended diagnosis tasks. Evaluations reveal a stark dichotomy: while advanced models demonstrate clinical reasoning proficiency when supplied with physician reference evidence (achieving up to 73.4% accuracy on differential diagnosis), their performance substantially declines (as low as 25.4%) when reliant on independent retrieval mechanisms. This disparity highlights the dual challenges of effectively integrating multimodal clinical evidence and retrieving precise supporting literature. CURE is publicly available at https://github.com/yanniangu/CURE.
Authors:Zhen Tan, Chengshuai Zhao, Song Wang, Jundong Li, Tianlong Chen, Huan Liu
Abstract:
Distilling robust reasoning capabilities from large language models (LLMs) into smaller, computationally efficient student models remains an unresolved challenge. Despite recent advances, distilled models frequently suffer from superficial pattern memorization and subpar generalization. To overcome these limitations, we introduce a novel distillation framework that moves beyond simple mimicry to instill a deeper conceptual understanding. Our framework features two key innovations. \underlineFirst, to address pattern memorization, Explanatory Inversion (EI) generates targeted ``explanatory probes'' that compel the student to articulate the underlying logic behind an answer, rather than just memorizing it. \underlineSecond, to improve generalization, Explanatory GRPO (\textttEXGRPO) uses a reinforcement learning algorithm with a novel Dialogue Structure Utility Bonus, which explicitly rewards the student for maintaining a coherent reasoning process across these probes. Extensive evaluations on 12 datasets demonstrate significant improvements. Using Gemma‑7b as the student model, our method yields an average 20.39% increase over zero‑shot performance and a 6.02% improvement over the state‑of‑the‑art distillation baselines. Moreover, models distilled with our method show remarkable training efficiency (e.g., surpassing vanilla fine‑tuning with 10‑25% training data) and strong generalization to out‑of‑distribution tasks. Implementation is released at https://github.com/Zhen‑Tan‑dmml/ExGRPO.git.
Authors:Yiyun Zhu, Yidong Jiang, Ziwen Xu, Yinsheng Yao, Dawei Cheng, Jinru Ding, Yejie Zheng, Jie Xu
Abstract:
Large language models (LLMs) are increasingly used to generate financial research reports, shifting from auxiliary analytic tools to primary content producers. Yet recent real‑world deployments reveal persistent failures‑‑factual errors, numerical inconsistencies, fabricated references, and shallow analysis‑‑that can distort assessments of corporate fundamentals and ultimately trigger severe economic losses. However, existing financial benchmarks focus on comprehension over completed reports rather than evaluating whether a model can produce reliable analysis. Moreover, current evaluation frameworks merely flag hallucinations and lack structured measures for deeper analytical skills, leaving key analytical bottlenecks undiscovered. To address these gaps, we introduce FinReasoning, a benchmark that decomposes Chinese research‑report generation into three stages aligned with real analyst workflows, assessing semantic consistency, data alignment, and deep insight. We further propose a fine‑grained evaluation framework that strengthens hallucination‑correction assessment and incorporates a 12‑indicator rubric for core analytical skills. Based on the evaluation results, FinReasoning reveals that most models exhibit a understanding‑execution gap: they can identify errors but struggle to generate accurate corrections; they can retrieve data but have difficulty returning it in correct format. Furthermore, no model achieves overwhelming superiority across all three tracks; Doubao‑Seed‑1.8, GPT‑5, and Kimi‑K2 rank as the top three in overall performance, yet each exhibits a distinct capability distribution. The evaluation resource is available at https://github.com/TongjiFinLab/FinReasoning.
Authors:Ke-Han Lu, Szu-Wei Fu, Chao-Han Huck Yang, Zhehuai Chen, Sung-Feng Huang, Chih-Kai Yang, Yi-Cheng Lin, Chi-Yuan Hsiao, Wenze Ren, En-Pei Hu, Yu-Han Huang, An-Yu Cheng, Cheng-Han Chiang, Yu Tsao, Yu-Chiang Frank Wang, Hung-yi Lee
Abstract:
Large language models (LLMs) have been widely used as knowledge backbones of Large Audio Language Models (LALMs), yet how much auditory knowledge they encode through text‑only pre‑training and how this affects downstream performance remains unclear. We study this gap by comparing different LLMs under two text‑only and one audio‑grounded setting: (1) direct probing on AKB‑2000, a curated benchmark testing the breadth and depth of auditory knowledge; (2) cascade evaluation, where LLMs reason over text descriptions from an audio captioner; and (3) audio‑grounded evaluation, where each LLM is fine‑tuned into a Large Audio Language Model (LALM) with an audio encoder. Our findings reveal that auditory knowledge varies substantially across families, and text‑only results are strongly correlated with audio performance. Our work provides empirical grounding for a comprehensive understanding of LLMs in audio research.
Authors:Swagat Padhan, Lakshya Jain, Bhavya Minesh Shah, Omkar Patil, Thao Nguyen, Nakul Gopalan
Abstract:
Robots collaborating with humans must convert natural language goals into actionable, physically grounded decisions. For example, executing a command such as "go two meters to the right of the fridge" requires grounding semantic references, spatial relations, and metric constraints within a 3D scene. While recent vision language models (VLMs) demonstrate strong semantic grounding capabilities, they are not explicitly designed to reason about metric constraints in physically defined spaces. In this work, we empirically demonstrate that state‑of‑the‑art VLM‑based grounding approaches struggle with complex metric‑semantic language queries. To address this limitation, we propose MAPG (Multi‑Agent Probabilistic Grounding), an agentic framework that decomposes language queries into structured subcomponents and queries a VLM to ground each component. MAPG then probabilistically composes these grounded outputs to produce metrically consistent, actionable decisions in 3D space. We evaluate MAPG on the HM‑EQA benchmark and show consistent performance improvements over strong baselines. Furthermore, we introduce a new benchmark, MAPG‑Bench, specifically designed to evaluate metric‑semantic goal grounding, addressing a gap in existing language grounding evaluations. We also present a real‑world robot demonstration showing that MAPG transfers beyond simulation when a structured scene representation is available.
Authors:Chenyang Gu, Jiahao Cheng, Meicong Zhang, Pujun Zheng, Jinquan Zheng, Guoxiu He
Abstract:
Scientific ideation aims to propose novel solutions within a given scientific context. Existing LLM‑based agentic approaches emulate human research workflows, yet inadequately model scientific reasoning, resulting in surface‑level conceptual recombinations that lack technical depth and scientific grounding. To address this issue, we propose MoRI (Motivation‑grounded Reasoning for Scientific Ideation), a framework that enables LLMs to explicitly learn the reasoning process from research motivations to methodologies. The base LLM is initialized via supervised fine‑tuning to generate a research motivation from a given context, and is subsequently trained under a composite reinforcement learning reward that approximates scientific rigor: (1) entropy‑aware information gain encourages the model to uncover and elaborate high‑complexity technical details grounded in ground‑truth methodologies, and (2) contrastive semantic gain constrains the reasoning trajectory to maintain conceptually aligned with scientifically valid solutions. Empirical results show that MoRI significantly outperforms strong commercial LLMs and complex agentic baselines across multiple dimensions, including novelty, technical rigor, and feasibility. The code will be made available on \hrefhttps://github.com/ECNU‑Text‑Computing/IdeaGenerationGitHub.
Authors:Gagan Bhatia, Ahmad Muhammad Isa, Maxime Peyrard, Wei Zhao
Abstract:
We present MultiTempBench, a multilingual temporal reasoning benchmark spanning three tasks, date arithmetic, time zone conversion, and temporal relation extraction across five languages (English, German, Chinese, Arabic, and Hausa) and multiple calendar conventions (Gregorian, Hijri, and Chinese Lunar). MultiTempBench contains 15,000 examples built by translating 750 curated English questions and expanding each into controlled date‑format variants. We evaluate 20 LLMs and introduce the multilingual Date Fragmentation Ratio (mDFR), calibrated with human severity ratings, together with geometric‑probing analyses of internal temporal representations. We find tokenisation quality of temporal artefacts is a resource‑dependent bottleneck: in low‑resource languages and rarer calendar formats, fragmentation disrupts Year/Month/Day separation and accuracy collapses, while high‑resource settings are often robust to digit‑level splitting. Beyond tokenisation, crossed mixed‑effects regression shows that temporal linearity is the strongest predictor of temporal reasoning in high‑resource languages, whereas fragmentation is the stronger predictor in low‑resource languages. Code is available at: https://github.com/gagan3012/mtb
Authors:Xiao Feng, Bo Han, Zhanke Zhou, Jiaqi Fan, Jiangchao Yao, Ka Ho Li, Dahai Yu, Michael Kwok-Po Ng
Abstract:
Reinforcement learning (RL) holds significant promise for enhancing the agentic reasoning capabilities of large language models (LLMs) with external environments. However, the inherent sparsity of terminal rewards hinders fine‑grained, state‑level optimization. Although process reward modeling offers a promising alternative, training dedicated reward models often entails substantial computational costs and scaling difficulties. To address these challenges, we introduce RewardFlow, a lightweight method for estimating state‑level rewards tailored to agentic reasoning tasks. RewardFlow leverages the intrinsic topological structure of states within reasoning trajectories by constructing state graphs. This enables an analysis of state‑wise contributions to success, followed by topology‑aware graph propagation to quantify contributions and yield objective, state‑level rewards. When integrated as dense rewards for RL optimization, RewardFlow substantially outperforms prior RL baselines across four agentic reasoning benchmarks, demonstrating superior performance, robustness, and training efficiency. The implementation of RewardFlow is publicly available at https://github.com/tmlr‑group/RewardFlow.
Authors:Huichi Zhou, Siyuan Guo, Anjie Liu, Zhongwei Yu, Ziqin Gong, Bowen Zhao, Zhixun Chen, Menglong Zhang, Yihang Chen, Jinsong Li, Runyu Yang, Qiangbin Liu, Xinlei Yu, Jianmin Zhou, Na Wang, Chunyang Sun, Jun Wang
Abstract:
We introduce \emphMemento‑Skills, a generalist, continually‑learnable LLM agent system that functions as an \emphagent‑designing agent: it autonomously constructs, adapts, and improves task‑specific agents through experience. The system is built on a memory‑based reinforcement learning framework with \emphstateful prompts, where reusable skills (stored as structured markdown files) serve as persistent, evolving memory. These skills encode both behaviour and context, enabling the agent to carry forward knowledge across interactions. Starting from simple elementary skills (like Web search and terminal operations), the agent continually improves via the \emphRead‑‑Write Reflective Learning mechanism introduced in \emphMemento~2~\citewang2025memento2. In the \emphread phase, a behaviour‑trainable skill router selects the most relevant skill conditioned on the current stateful prompt; in the \emphwrite phase, the agent updates and expands its skill library based on new experience. This closed‑loop design enables \emphcontinual learning without updating LLM parameters, as all adaptation is realised through the evolution of externalised skills and prompts. Unlike prior approaches that rely on human‑designed agents, Memento‑Skills enables a generalist agent to \emphdesign agents end‑to‑end for new tasks. Through iterative skill generation and refinement, the system progressively improves its own capabilities. Experiments on the \emphGeneral AI Assistants benchmark and \emphHumanity's Last Exam demonstrate sustained gains, achieving 26.2% and 116.2% relative improvements in overall accuracy, respectively. Code is available at https://github.com/Memento‑Teams/Memento‑Skills.
Authors:Yipu Dou, Wang Yang
Abstract:
We study how to allocate a fixed supervised fine‑tuning budget when three objectives must be balanced at once: multi‑turn safety alignment, low over‑refusal on benign boundary queries, and instruction following under verifiable constraints. We propose MOSAIC (Multi‑Objective Slice‑Aware Iterative Curation for Alignment), a multi‑objective framework for closed‑loop data mixture search built on a unified L1‑L3 evaluation interface. MOSAIC turns slice‑level failure profiles into executable data actions, including dataset‑level mixture ratios, bucket‑level weights, and focus criteria. Under a fixed 1M‑token budget and five rounds of independent fine‑tuning from the same base model, MOSAIC improves internal XGuard from 2.76 to 4.67 while keeping OrBench at 4.41 and IFEval at 3.65. The final Pareto solution also generalizes better than a random static LoRA baseline on independent attack, over‑refusal, and capability tests, suggesting that structured failure diagnosis can serve as a practical control signal for budgeted data construction. Code is available at https://github.com/douyipu/mosaic.
Authors:Yinan Xia, Haotian Zhang, Huiming Wang
Abstract:
Large Reasoning Models (LRMs) have shown exceptional reasoning capabilities, but they also suffer from the issue of overthinking, often generating excessively long and redundant answers. For problems that exceed the model's capabilities, LRMs tend to exhibit the overconfidence phenomenon, generating overly short but incorrect answers, which may contribute to suboptimal performance. To address these issues, we propose Difficulty‑Differentiated Policy Optimization (DDPO), an efficient reinforcement learning algorithm that optimizes simple and complex tasks separately based on the overconfidence phenomenon. Specifically, it reduces the output length for simple tasks without compromising accuracy, while for complex tasks, it expands the exploration space to improve performance. We further derive the theoretical conditions for maximizing expected accuracy, which require the length distribution to closely approximate the optimal length and be as concentrated as possible. Based on these conditions, we propose using the difficulty‑level average as a well‑founded reference for length optimization. Extensive experiments on both in‑domain and out‑of‑domain benchmarks validate the superiority and effectiveness of DDPO. Compared to GRPO, DDPO reduces the average answer length by 12% while improving accuracy by 1.85% across multiple benchmarks, achieving a better trade‑off between accuracy and length. The code is available at https://github.com/Yinan‑Xia/DDPO.
Authors:Minsoo Cheong, Donghyun Son, Woosang Lim, Sungjoo Yoo
Abstract:
Diffusion‑based large language models (dLLMs) rely on bidirectional attention, which prevents lossless KV caching and requires a full forward pass at every denoising step. Existing approximate KV caching methods reduce this cost by selectively updating cached states, but their decision overhead scales with context length or model depth. We propose EntropyCache, a training‑free KV caching method that uses the maximum entropy of newly decoded token distributions as a constant‑cost signal for deciding when to recompute. Our design is grounded in two empirical observations: (1) decoded token entropy correlates with KV cache drift, providing a cheap proxy for cache staleness, and (2) feature volatility of decoded tokens persists for multiple steps after unmasking, motivating recomputation of the k most recently decoded tokens. The skip‑or‑recompute decision requires only O(V) computation per step, independent of context length and model scale. Experiments on LLaDA‑8B‑Instruct and Dream‑7B‑Instruct show that EntropyCache achieves 15.2×‑26.4× speedup on standard benchmarks and 22.4×‑24.1× on chain‑of‑thought benchmarks, with competitive accuracy and decision overhead accounting for only 0.5% of inference time. Code is available at https://github.com/mscheong01/EntropyCache.
Authors:Jason Dury
Abstract:
Embedding models group text by semantic content, what text is about. We show that temporal co‑occurrence within texts discovers a different kind of structure: recurrent transition‑structure concepts or what text does. We train a 29.4M‑parameter contrastive model on 373 million co‑occurrence pairs from 9,766 Project Gutenberg texts (24.96 million passages), mapping pre‑trained embeddings into an association space where passages with similar transition structure cluster together. Under capacity constraint (42.75% accuracy), the model must compress across recurring patterns rather than memorise individual co‑occurrences. Clustering at six granularities (k=50 to k=2,000) produces a multi‑resolution concept map; from broad modes like "direct confrontation" and "lyrical meditation" to precise registers and scene templates like "sailor dialect" and "courtroom cross‑examination." At k=100, clusters average 4,508 books each (of 9,766), confirming corpus‑wide patterns. Direct comparison with embedding‑similarity clustering shows that raw embeddings group by topic while association‑space clusters group by function, register, and literary tradition. Unseen novels are assigned to existing clusters without retraining; the association model concentrates each novel into a selective subset of coherent clusters, while raw embedding assignment saturates nearly all clusters. Validation controls address positional, length, and book‑concentration confounds. The method extends Predictive Associative Memory (PAM, arXiv:2602.11322) from episodic recall to concept formation: where PAM recalls specific associations, multi‑epoch contrastive training under compression extracts structural patterns that transfer to unseen texts, the same framework producing qualitatively different behaviour in a different regime.
Authors:Ja Young Lee, Mírian Silva, Mohamed Nasr, Shonda Witherspoon, Enzo Bozzani, Veronique Demers, Radha Ratnaparkhi, Hui Wu, Sara Rosenthal
Abstract:
Large language models (LLMs) are largely motivated by their performance on popular topics and benchmarks at the time of their release. However, over time, contamination occurs due to significant exposure of benchmark data during training. This poses a risk of model performance inflation if testing is not carefully executed. To address this challenge, we present GRAFITE, a continuous LLM evaluation platform through a comprehensive system for maintaining and evaluating model issues. Our approach enables building a repository of model problems based on user feedback over time and offers a pipeline for assessing LLMs against these issues through quality assurance (QA) tests using LLM‑as‑a‑judge. The platform enables side‑by‑side comparison of multiple models, facilitating regression detection across different releases. The platform is available at https://github.com/IBM/grafite. The demo video is available at www.youtube.com/watch?v=XFZyoleN56k.
Authors:Yitian Gong, Botian Jiang, Yiwei Zhao, Yucheng Yuan, Kuangwei Chen, Yaozhou Jiang, Cheng Chang, Dong Hong, Mingshu Chen, Ruixiao Li, Yiyang Zhang, Yang Gao, Hanfu Chen, Ke Chen, Songlin Wang, Xiaogui Yang, Yuqian Zhang, Kexin Huang, ZhengYuan Lin, Kang Yu, Ziqi Chen, Jin Wang, Zhaoye Fei, Qinyuan Cheng, Shimin Li, Xipeng Qiu
Abstract:
This technical report presents MOSS‑TTS, a speech generation foundation model built on a scalable recipe: discrete audio tokens, autoregressive modeling, and large‑scale pretraining. Built on MOSS‑Audio‑Tokenizer, a causal Transformer tokenizer that compresses 24 kHz audio to 12.5 fps with variable‑bitrate RVQ and unified semantic‑acoustic representations, we release two complementary generators: MOSS‑TTS, which emphasizes structural simplicity, scalability, and long‑context/control‑oriented deployment, and MOSS‑TTS‑Local‑Transformer, which introduces a frame‑local autoregressive module for higher modeling efficiency, stronger speaker preservation, and a shorter time to first audio. Across multilingual and open‑domain settings, MOSS‑TTS supports zero‑shot voice cloning, token‑level duration control, phoneme‑/pinyin‑level pronunciation control, smooth code‑switching, and stable long‑form generation. This report summarizes the design, training recipe, and empirical characteristics of the released models.
Authors:David Onyango, Naseef Mansoor
Abstract:
The natural language to SQL (NL2SQL) task plays a pivotal role in democratizing data access by enabling non‑expert users to interact with relational databases through intuitive language. While recent frameworks have enhanced translation accuracy via task specialization, their reliance on Large Language Models (LLMs) raises significant concerns regarding computational overhead, data privacy, and real‑world deployability in resource‑constrained environments. To address these challenges, we propose a schema based agentic system that strategically employs Small Language Models (SLMs) as primary agents, complemented by a selective LLM fallback mechanism. The LLM is invoked only upon detection of errors in SLM‑generated output, the proposed system significantly minimizes computational expenditure. Experimental results on the BIRD benchmark demonstrate that our system achieves an execution accuracy of 47.78% and a validation efficiency score of 51.05%, achieving over 90% cost reduction compared to LLM‑centric baselines as approximately 67% of queries are resolved using local SLMs. The system achieves an average cost per query of 0.0085 compared to 0.094 for LLM‑only systems, achieving near‑zero operational costs for locally executed queries. [Github repository: https://github.com/mindslab25/CESMA.]
Authors:Kevin Qu, Haozhe Qi, Mihai Dusmanu, Mahdi Rad, Rui Wang, Marc Pollefeys
Abstract:
Multimodal Large Language Models (MLLMs) have made impressive progress in connecting vision and language, but they still struggle with spatial understanding and viewpoint‑aware reasoning. Recent efforts aim to augment the input representations with geometric cues rather than explicitly teaching models to reason in 3D space. We introduce Loc3R‑VLM, a framework that equips 2D Vision‑Language Models with advanced 3D understanding capabilities from monocular video input. Inspired by human spatial cognition, Loc3R‑VLM relies on two joint objectives: global layout reconstruction to build a holistic representation of the scene structure, and explicit situation modeling to anchor egocentric perspective. These objectives provide direct spatial supervision that grounds both perception and language in a 3D context. To ensure geometric consistency and metric‑scale alignment, we leverage lightweight camera pose priors extracted from a pre‑trained 3D foundation model. Loc3R‑VLM achieves state‑of‑the‑art performance in language‑based localization and outperforms existing 2D‑ and video‑based approaches on situated and general 3D question‑answering benchmarks, demonstrating that our spatial supervision framework enables strong 3D understanding. Project page: https://kevinqu7.github.io/loc3r‑vlm
Authors:Hamed Taheri
Abstract:
Enterprise AI deploys dozens of autonomous agent nodes across workflows, each acting on the same entities with no shared memory and no common governance. We identify five structural challenges arising from this memory governance gap: memory silos across agent workflows; governance fragmentation across teams and tools; unstructured memories unusable by downstream systems; redundant context delivery in autonomous multi‑step executions; and silent quality degradation without feedback loops. We present Governed Memory, a shared memory and governance layer addressing this gap through four mechanisms: a dual memory model combining open‑set atomic facts with schema‑enforced typed properties; tiered governance routing with progressive context delivery; reflection‑bounded retrieval with entity‑scoped isolation; and a closed‑loop schema lifecycle with AI‑assisted authoring and automated per‑property refinement. We validate each mechanism through controlled experiments (N=250, five content types): 99.6% fact recall with complementary dual‑modality coverage; 92% governance routing precision; 50% token reduction from progressive delivery; zero cross‑entity leakage across 500 adversarial queries; 100% adversarial governance compliance; and output quality saturation at approximately seven governed memories per entity. On the LoCoMo benchmark, the architecture achieves 74.8% overall accuracy, confirming that governance and schema enforcement impose no retrieval quality penalty. The system is in production at Personize.ai.
Authors:Teng Pan, Yuchen Yan, Zixuan Wang, Ruiqing Zhang, Gaiyang Han, Wanqi Zhang, Weiming Lu, Jun Xiao, Yongliang Shen
Abstract:
Label‑free reinforcement learning enables large language models to improve reasoning capabilities without ground‑truth supervision, typically by treating majority‑voted answers as pseudo‑labels. However, we identify a critical failure mode: as training maximizes self‑consistency, output diversity collapses, causing the model to confidently reinforce systematic errors that evade detection. We term this the consensus trap. To escape it, we propose CoVerRL, a framework where a single model alternates between generator and verifier roles, with each capability bootstrapping the other. Majority voting provides noisy but informative supervision for training the verifier, while the improving verifier progressively filters self‑consistent errors from pseudo‑labels. This co‑evolution creates a virtuous cycle that maintains high reward accuracy throughout training. Experiments across Qwen and Llama model families demonstrate that CoVerRL outperforms label‑free baselines by 4.7‑5.9% on mathematical reasoning benchmarks. Moreover, self‑verification accuracy improves from around 55% to over 85%, confirming that both capabilities genuinely co‑evolve.
Authors:Pujun Zheng, Jiacheng Yao, Jinquan Zheng, Chenyang Gu, Guoxiu He, Jiawei Liu, Yong Huang, Tianrui Guo, Wei Lu
Abstract:
Large language models (LLMs) are currently applied to scientific paper evaluation by assigning an absolute score to each paper independently. However, since score scales vary across conferences, time periods, and evaluation criteria, models trained on absolute scores are prone to fitting narrow, context‑specific rules rather than developing robust scholarly judgment. To overcome this limitation, we propose shifting paper evaluation from isolated scoring to collaborative ranking. In particular, we design Comparison‑Native framework for Paper Evaluation (CNPE), integrating comparison into both data construction and model learning. We first propose a graph‑based similarity ranking algorithm to facilitate the sampling of more informative and discriminative paper pairs from a collection. We then enhance relative quality judgment through supervised fine‑tuning and reinforcement learning with comparison‑based rewards. At inference, the model performs pairwise comparisons over sampled paper pairs and aggregates these preference signals into a global relative quality ranking. Experimental results demonstrate that our framework achieves an average relative improvement of 21.8% over the strong baseline DeepReview‑14B, while exhibiting robust generalization to five previously unseen datasets. \hrefhttps://github.com/ECNU‑Text‑Computing/ComparisonReviewCode.
Authors:Yuxiang Mei, Delai Qiu, Shengping Liu, Jiaen Liang, Yanhua Long
Abstract:
Speech Large Language Models (Speech‑LLMs) have emerged as a powerful approach for automatic speech recognition (ASR) by aligning speech encoders with large language models. However, adapting these systems to multilingual settings with imbalanced data distributions remains challenging. In such scenarios, a stability‑plasticity dilemma often arises: fully shared Parameter‑Efficient Fine‑Tuning (PEFT) can cause negative inter‑lingual interference for under‑represented languages, while fully language‑specific tuning limits the cross‑lingual beneficial knowledge transfer needed for low‑resource tasks. To address this, we propose Zipper‑LoRA, a novel rank‑level decoupling framework with three variants (Static, Hard, and Soft) that dynamically synthesizes LoRA updates from shared and language‑specific subspaces. By using a lightweight language‑conditioned router, Zipper‑LoRA dynamically controls the contribution of each subspace at the LoRA rank level, enabling fine‑grained sharing where languages are compatible and strict decoupling when conflicts occur. To further stabilize optimization under imbalanced data, we propose a two‑stage training strategy with an Initial‑B warm start that significantly accelerates convergence. Experiments on a 12‑language mixed‑resource setting show that Zipper‑LoRA consistently outperforms both fully shared and independent baselines, particularly in extremely low‑resource scenarios. Moreover, we demonstrate that these gains are robust across both chunked and non‑chunked encoder configurations, confirming the framework's reliability for practical, large‑scale multilingual ASR. Our code and data will be available at https://github.com/YuCeong‑May/Zipper‑LoRA for reproducibility.
Authors:Madhav S. Baidya, S. S. Baidya, Chirag Chawla
Abstract:
The rapid proliferation of large language models (LLMs) has created an urgent need for robust and generalizable detectors of machine‑generated text. Existing benchmarks typically evaluate a single detector on a single dataset under ideal conditions, leaving open questions about cross‑domain transfer, cross‑LLM generalization, and adversarial robustness. We present a comprehensive benchmark evaluating diverse detection approaches across two corpora: HC3 (23,363 human‑ChatGPT pairs) and ELI5 (15,000 human‑Mistral‑7B pairs). Methods include classical classifiers, fine‑tuned transformer encoders (BERT, RoBERTa, ELECTRA, DistilBERT, DeBERTa‑v3), a CNN, an XGBoost stylometric model, perplexity‑based detectors, and LLM‑as‑detector prompting. Results show that transformer models achieve near‑perfect in‑distribution performance but degrade under domain shift. The XGBoost stylometric model matches performance while remaining interpretable. LLM‑based detectors underperform and are affected by generator‑detector identity bias. Perplexity‑based methods exhibit polarity inversion, with modern LLM outputs showing lower perplexity than human text, but remain effective when corrected. No method generalizes robustly across domains and LLM sources.
Authors:Mengyu Bu, Yang Feng
Abstract:
Large language models (LLMs) exhibit strong general intelligence, yet their multilingual performance remains highly imbalanced. Although LLMs encode substantial cross‑lingual knowledge in a unified semantic space, they often struggle to reliably interface this knowledge with low‑resource or unseen languages. Fortunately, pretrained encoder‑decoder translation models already possess balanced multilingual capability, suggesting a natural complement to LLMs. In this work, we propose XBridge, a compositional encoder‑LLM‑decoder architecture that offloads multilingual understanding and generation to external pretrained translation models, while preserving the LLM as an English‑centric core for general knowledge processing. To address the resulting representation misalignment across models, we introduce lightweight cross‑model mapping layers and an optimal transport‑based alignment objective, enabling fine‑grained semantic consistency for multilingual generation. Experiments on four LLMs across multilingual understanding, reasoning, summarization, and generation indicate that XBridge outperforms strong baselines, especially on low‑resource and previously unseen languages, without retraining the LLM.
Authors:Segyu Lee, Boryeong Cho, Hojung Jung, Seokhyun An, Juhyeong Kim, Jaehyun Kwak, Yongjin Yang, Sangwon Jang, Youngrok Park, Wonjun Chang, Se-Young Yun
Abstract:
Unified Multimodal Models (UMMs) offer powerful cross‑modality capabilities but introduce new safety risks not observed in single‑task models. Despite their emergence, existing safety benchmarks remain fragmented across tasks and modalities, limiting the comprehensive evaluation of complex system‑level vulnerabilities. To address this gap, we introduce UniSAFE, the first comprehensive benchmark for system‑level safety evaluation of UMMs across 7 I/O modality combinations, spanning conventional tasks and novel multimodal‑context image generation settings. UniSAFE is built with a shared‑target design that projects common risk scenarios across task‑specific I/O configurations, enabling controlled cross‑task comparisons of safety failures. Comprising 6,802 curated instances, we use UniSAFE to evaluate 15 state‑of‑the‑art UMMs, both proprietary and open‑source. Our results reveal critical vulnerabilities across current UMMs, including elevated safety violations in multi‑image composition and multi‑turn settings, with image‑output tasks consistently more vulnerable than text‑output tasks. These findings highlight the need for stronger system‑level safety alignment for UMMs. Our code and data are publicly available at https://github.com/segyulee/UniSAFE
Authors:Chaeyoung Huh, Hyunmin Hwang, Jung Hwan Shin, Jinse Park, Jong Chul Ye
Abstract:
Drug recommendation requires a deep understanding of individual patient context, especially for complex conditions like Parkinson's disease. While LLMs possess broad medical knowledge, they fail to capture the subtle nuances of actual prescribing patterns. Existing RAG methods also struggle with these complexities because guideline‑based retrieval remains too generic and similar‑patient retrieval often replicates majority patterns without accounting for the unique clinical nuances of individual patients. To bridge this gap, we propose PACE‑RAG (Patient‑Aware Contextual and Evidence‑based Policy RAG), a novel framework designed to synthesize individual patient context with the prescribing tendencies of similar cases. By analyzing treatment patterns tailored to specific clinical signals, PACE‑RAG identifies optimal prescriptions and generates an explainable clinical summary. Evaluated on a Parkinson's cohort and the MIMIC‑IV benchmark using Llama‑3.1‑8B and Qwen3‑8B, PACE‑RAG achieved state‑of‑the‑art performance, reaching F1 scores of 80.84% and 47.22%, respectively. These results validate PACE‑RAG as a robust, clinically grounded solution for personalized decision support. Our code is available at: https://github.com/ChaeYoungHuh/PACE‑RAG.
Authors:Sophie Kearney, Shu Yang, Zixuan Wen, Weimin Lyu, Bojian Hou, Duy Duong-Tran, Tianlong Chen, Jason H. Moore, Marylyn D. Ritchie, Chao Chen, Li Shen
Abstract:
Accurate diagnosis of Alzheimer's disease (AD) requires handling tabular biomarker data, yet such data are often small and incomplete, where deep learning models frequently fail to outperform classical methods. Pretrained large language models (LLMs) offer few‑shot generalization, structured reasoning, and interpretable outputs, providing a powerful paradigm shift for clinical prediction. We propose TAP‑GPT Tabular Alzheimer's Prediction GPT, a domain‑adapted tabular LLM framework built on TableGPT2 and fine‑tuned for few‑shot AD classification using tabular prompts rather than plain texts. We evaluate TAP‑GPT across four ADNI‑derived datasets, including QT‑PAD biomarkers and region‑level structural MRI, amyloid PET, and tau PET for binary AD classification. Across multimodal and unimodal settings, TAP‑GPT improves upon its backbone models and outperforms traditional machine learning baselines in the few‑shot setting while remaining competitive with state‑of‑the‑art general‑purpose LLMs. We show that feature selection mitigates degradation in high‑dimensional inputs and that TAP‑GPT maintains stable performance under simulated and real‑world missingness without imputation. Additionally, TAP‑GPT produces structured, modality‑aware reasoning aligned with established AD biology and shows greater stability under self‑reflection, supporting its use in iterative multi‑agent systems. To our knowledge, this is the first systematic application of a tabular‑specialized LLM to multimodal biomarker‑based AD prediction, demonstrating that such pretrained models can effectively address structured clinical prediction tasks and laying the foundation for tabular LLM‑driven multi‑agent clinical decision‑support systems. The source code is publicly available on GitHub: https://github.com/sophie‑kearney/TAP‑GPT.
Authors:Yelysei Bondarenko, Thomas Hehn, Rob Hesselink, Romain Lepert, Fabio Valerio Massoli, Evgeny Mironov, Leyla Mirvakhabova, Tribhuvanesh Orekondy, Spyridon Stasis, Andrey Kuzmin, Anna Kuzina, Markus Nagel, Ankita Nayak, Corrado Rainone, Ork de Rooij, Paul N Whatmough, Arash Behboodi, Babak Ehteshami Bejnordi
Abstract:
Large language models (LLMs) with chain‑of‑thought reasoning achieve state‑of‑the‑art performance across complex problem‑solving tasks, but their verbose reasoning traces and large context requirements make them impractical for edge deployment. These challenges include high token generation costs, large KV‑cache footprints, and inefficiencies when distilling reasoning capabilities into smaller models for mobile devices. Existing approaches often rely on distilling reasoning traces from larger models into smaller models, which are verbose and stylistically redundant, undesirable for on‑device inference. In this work, we propose a lightweight approach to enable reasoning in small LLMs using LoRA adapters combined with supervised fine‑tuning. We further introduce budget forcing via reinforcement learning on these adapters, significantly reducing response length with minimal accuracy loss. To address memory‑bound decoding, we exploit parallel test‑time scaling, improving accuracy at minor latency increase. Finally, we present a dynamic adapter‑switching mechanism that activates reasoning only when needed and a KV‑cache sharing strategy during prompt encoding, reducing time‑to‑first‑token for on‑device inference. Experiments on Qwen2.5‑7B demonstrate that our method achieves efficient, accurate reasoning under strict resource constraints, making LLM reasoning practical for mobile scenarios. Videos demonstrating our solution running on mobile devices are available on our project page.
Authors:Valentin Lafargue, Ariel Guerra-Adames, Emmanuelle Claeys, Elouan Vuichard, Jean-Michel Loubes
Abstract:
Large language models (LLMs) are increasingly deployed in applications with societal impact, raising concerns about the cultural biases they encode. We probe these representations by evaluating whether LLMs can perform author profiling from song lyrics in a zero‑shot setting, inferring singers' gender and ethnicity without task‑specific fine‑tuning. Across several open‑source models evaluated on more than 10,000 lyrics, we find that LLMs achieve non‑trivial profiling performance but demonstrate systematic cultural alignment: most models default toward North American ethnicity, while DeepSeek‑1.5B aligns more strongly with Asian ethnicity. This finding emerges from both the models' prediction distributions and an analysis of their generated rationales. To quantify these disparities, we introduce two fairness metrics, Modality Accuracy Divergence (MAD) and Recall Divergence (RD), and show that Ministral‑8B displays the strongest ethnicity bias among the evaluated models, whereas Gemma‑12B shows the most balanced behavior. Our code is available on [GitHub](https://github.com/ValentinLafargue/CulturalProbingLLM) and results on [HuggingFace](https://huggingface.co/datasets/ValentinLAFARGUE/AuthorProfilingResults).
Authors:Guangzhi Xiong, Sanchit Sinha, Zhenghao He, Aidong Zhang
Abstract:
Vision‑language models (VLMs) have achieved impressive performance across a wide range of multimodal reasoning tasks, but they often struggle to disentangle fine‑grained visual attributes and reason about underlying causal relationships. In‑context learning (ICL) offers a promising avenue for VLMs to adapt to new tasks, but its effectiveness critically depends on the selection of demonstration examples. Existing retrieval‑augmented approaches typically rely on passive similarity‑based retrieval, which tends to select correlated but non‑causal examples, amplifying spurious associations and limiting model robustness. We introduce CIRCLES (Composed Image Retrieval for Causal Learning Example Selection), a novel framework that actively constructs demonstration sets by retrieving counterfactual‑style examples through targeted, attribute‑guided composed image retrieval. By incorporating counterfactual‑style examples, CIRCLES enables VLMs to implicitly reason about the causal relations between attributes and outcomes, moving beyond superficial correlations and fostering more robust and grounded reasoning. Comprehensive experiments on four diverse datasets demonstrate that CIRCLES consistently outperforms existing methods across multiple architectures, especially on small‑scale models, with pronounced gains under information scarcity. Furthermore, CIRCLES retrieves more diverse and causally informative examples, providing qualitative insights into how models leverage in‑context demonstrations for improved reasoning. Our code is available at https://github.com/gzxiong/CIRCLES.
Authors:Xiaojie Gu, Sherry T. Tong, Aosong Feng, Sophia Simeng Han, Jinghui Lu, Yingjian Chen, Yusuke Iwasawa, Yutaka Matsuo, Chanjun Park, Rex Ying, Irene Li
Abstract:
Reasoning‑focused large language models (LLMs) have advanced in many NLP tasks, yet their evaluation remains challenging: final answers alone do not expose the intermediate reasoning steps, making it difficult to determine whether a model truly reasons correctly and where failures occur, while existing multi‑hop QA benchmarks lack step‑level annotations for diagnosing reasoning failures. To address this gap, we propose Omanic, an open‑domain multi‑hop QA resource that provides decomposed sub‑questions and intermediate answers as structural annotations for analyzing reasoning processes. It contains 10,296 machine‑generated training examples (OmanicSynth) and 967 expert‑reviewed human‑annotated evaluation examples (OmanicBench). Systematic evaluations show that state‑of‑the‑art LLMs achieve only 73.11% multiple‑choice accuracy on OmanicBench, confirming its high difficulty. Stepwise analysis reveals that CoT's performance hinges on factual completeness, with its gains diminishing under knowledge gaps and errors amplifying in later hops. Additionally, supervised fine‑tuning on OmanicSynth brings substantial transfer gains (7.41 average points) across six reasoning and math benchmarks, validating the dataset's quality and further supporting the effectiveness of OmanicSynth as supervision for reasoning‑capability transfer. We release the data at https://huggingface.co/datasets/li‑lab/Omanic and the code at https://github.com/XiaojieGu/Omanic.
Authors:Hanif Rahman
Abstract:
We present PashtoCorp, a 1.25‑billion‑word corpus for Pashto, a language spoken by 60 million people that remains severely underrepresented in NLP. The corpus is assembled from 39 sources spanning seven HuggingFace datasets and 32 purpose‑built web scrapers, processed through a reproducible pipeline with Arabic‑script tokenization, SHA‑256 deduplication, and quality filtering. At 1.25B words across 2.81 million documents, PashtoCorp is 40x larger than the OSCAR Pashto subset and 83x larger than the previously largest dedicated Pashto corpus. Continued MLM pretraining of XLM‑R‑base on PashtoCorp reduces held‑out perplexity by 25.1% (8.08‑>6.06). On WikiANN Pashto NER, the pretrained model improves entity F1 by 10% relative (19.0%‑>21.0%) and reduces training variance nearly 7x; the largest gain appears at 50 training sentences (+27%), with PashtoCorp covering 97.9% of WikiANN entity vocabulary. On Belebele Pashto reading comprehension, Gemma‑3n achieves 64.6% accuracy, the first published LLM baseline for Pashto on this benchmark. A leave‑one‑out source ablation shows that Wikipedia (0.7% of documents) is the most critical source for NER: removing it alone reduces entity F1 by 47%. Corpus data, trained model, and code are available at https://huggingface.co/datasets/ihanif/pashto‑corpus, https://huggingface.co/ihanif/xlmr‑pashto, and https://github.com/ihanif/pashto‑corpus.
Authors:Surya Vardhan Yalavarthi
Abstract:
Corrective Retrieval Augmented Generation (CRAG) improves the robustness of RAG systems by evaluating retrieved document quality and triggering corrective actions. However, the original implementation relies on proprietary components including the Google Search API and closed model weights, limiting reproducibility. In this work, we present a fully open‑source reproduction of CRAG, replacing proprietary web search with the Wikipedia API and the original LLaMA‑2 generator with Phi‑3‑mini‑4k‑instruct. We evaluate on PopQA and ARC‑Challenge, demonstrating that our open‑source pipeline achieves comparable performance to the original system. Furthermore, we contribute the first explainability analysis of CRAG's T5‑based retrieval evaluator using SHAP, revealing that the evaluator primarily relies on named entity alignment rather than semantic similarity. Our analysis identifies key failure modes including domain transfer limitations on science questions. All code and results are available at https://github.com/suryayalavarthi/crag‑reproduction.
Authors:Hexi Wang, Yujia Zhou, Bangde Du, Qingyao Ai, Yiqun Liu
Abstract:
Large language models (LLMs) have recently been adopted as synthetic agents for public opinion simulation, offering a promising alternative to costly and slow human surveys. Despite their scalability, current LLM‑based simulation methods fail to capture social diversity, producing flattened inter‑group differences and overly homogeneous responses within demographic groups. We identify this limitation as a Diversity Collapse phenomenon in LLM hidden representations, where distinct social identities become increasingly indistinguishable across layers. Motivated by this observation, we propose Parametric Social Identity Injection (PSII), a general framework that injects explicit, parametric representations of demographic attributes and value orientations directly into intermediate hidden states of LLMs. Unlike prompt‑based persona conditioning, PSII enables fine‑grained and controllable identity modulation at the representation level. Extensive experiments on the World Values Survey using multiple open‑source LLMs show that PSII significantly improves distributional fidelity and diversity, reducing KL divergence to real‑world survey data while enhancing overall diversity. This work provides new insights into representation‑level control of LLM agents and advances scalable, diversity‑aware public opinion simulation. Code and data are available at https://github.com/halsayxi/PSII.
Authors:Han Jang, Junhyeok Lee, Kyu Sung Choi
Abstract:
The explosive growth of AI research has created unprecedented information overload, increasing the demand for scientific summarization at multiple levels of granularity beyond traditional abstracts. While LLMs are increasingly adopted for summarization, existing benchmarks remain limited in scale, target only a single granularity, and predate the LLM era. Moreover, since the release of ChatGPT in November 2022, researchers have rapidly adopted LLMs for drafting manuscripts themselves, fundamentally transforming scientific writing, yet no resource exists to analyze how this writing has evolved. To bridge these gaps, we introduce SciZoom, a benchmark comprising 44,946 papers from four top‑tier ML venues (NeurIPS, ICLR, ICML, EMNLP) spanning 2020 to 2025, explicitly stratified into Pre‑LLM and Post‑LLM eras. SciZoom provides three hierarchical summarization targets (Abstract, Contributions, and TL;DR) achieving compression ratios up to 600:1, enabling both multi‑granularity summarization research and temporal mining of scientific writing patterns. Our linguistic analysis reveals striking shifts in phrase patterns (up to 10x for formulaic expressions) and rhetorical style (23% decline in hedging), suggesting that LLM‑assisted writing produces more confident yet homogenized prose. SciZoom serves as both a challenging benchmark and a unique resource for mining the evolution of scientific discourse in the generative AI era. Our code and dataset are publicly available on GitHub (https://github.com/janghana/SciZoom) and Hugging Face (https://huggingface.co/datasets/hanjang/SciZoom), respectively.
Authors:Yifan Zhang
Abstract:
Recent work has made clear that the residual pathway is not mere optimization plumbing; it is part of the model's representational machinery. We agree, but argue that the cleanest way to organize this design space is through a two‑axis view of the Transformer. A decoder evolves information along two ordered dimensions: sequence position and layer depth. Self‑attention already provides adaptive mixing along the sequence axis, whereas the residual stream usually performs fixed addition along the depth axis. If we fix a token position and treat layer index as the ordered variable, then a causal depth‑wise residual attention read is exactly the same local operator as causal short sliding‑window attention (ShortSWA), except written over depth rather than over sequence. This is the core residual stream duality behind Transformer^2. This perspective also clarifies the recent literature. ELC‑BERT and DenseFormer already show that learned aggregation over depth can outperform uniform residual accumulation, while Vertical Attention, DeepCrossAttention (DCA), MUDDFormer, and Attention Residuals move further toward explicit attention‑based routing over earlier layers. The key point, however, is that operator‑level duality does not imply systems‑level symmetry. For large‑scale autoregressive models, sequence‑axis ShortSWA is usually the more hardware‑friendly placement because it reuses token‑side sliding‑window kernels, KV‑cache layouts, and chunked execution. If the goal is instead to change the shortcut itself, Deep Delta Learning (DDL) is the cleaner intervention because it modifies the residual operator directly rather than adding a separate cross‑layer retrieval path. Our recommendation is therefore simple: use DDL when the shortcut is the object of interest, and use sequence‑axis ShortSWA when the goal is local adaptive mixing.
Authors:Sijie Li, Biao Qian, Jungong Han
Abstract:
Network pruning is an effective technique for enabling lightweight Large Vision‑Language Models (LVLMs), which primarily incorporates both weights and activations into the importance metric. However, existing efforts typically process calibration data from different modalities in a unified manner, overlooking modality‑specific behaviors. This raises a critical challenge: how to address the divergent behaviors of textual and visual tokens for accurate pruning of LVLMs. To this end, we systematically investigate the sensitivity of visual and textual tokens to the pruning operation by decoupling their corresponding weights, revealing that: (i) the textual pathway should be calibrated via text tokens, since it exhibits higher sensitivity than the visual pathway; (ii) the visual pathway exhibits high redundancy, permitting even 50% sparsity. Motivated by these insights, we propose a simple yet effective Asymmetric Text‑Visual Weight Pruning method for LVLMs, dubbed ATV‑Pruning, which establishes the importance metric for accurate weight pruning by selecting the informative tokens from both textual and visual pathways. Specifically, ATV‑Pruning integrates two primary innovations: first, a calibration pool is adaptively constructed by drawing on all textual tokens and a subset of visual tokens; second, we devise a layer‑adaptive selection strategy to yield important visual tokens. Finally, extensive experiments across standard multimodal benchmarks verify the superiority of our ATV‑Pruning over state‑of‑the‑art methods.
Authors:Rushil Thareja, Gautam Gupta, Francesco Pinto, Nils Lukas
Abstract:
Constitutional AI is a method to oversee and control LLMs based on a set of rules written in natural language. These rules are typically written by human experts, but could in principle be learned automatically given sufficient training data for the desired behavior. Existing LLM‑based prompt optimizers attempt this but are ineffective at learning constitutions since (i) they require many labeled examples and (ii) lack structure in the optimized prompts, leading to diminishing improvements as prompt size grows. To address these limitations, we propose Multi‑Agent Constitutional Learning (MAC), which optimizes over structured prompts represented as sets of rules using a network of agents with specialized tasks to accept, edit, or reject rule updates. We also present MAC+, which improves performance by training agents on successful trajectories to reinforce updates leading to higher reward. We evaluate MAC on tagging Personally Identifiable Information (PII), a classification task with limited labels where interpretability is critical, and demonstrate that it generalizes to other agentic tasks such as tool calling. MAC outperforms recent prompt optimization methods by over 50%, produces human‑readable and auditable rule sets, and achieves performance comparable to supervised fine‑tuning and GRPO without requiring parameter updates.
Authors:Pedro Bento, Arthur Buzelin, Arthur Chagas, Yan Aquino, Victoria Estanislau, Samira Malaquias, Pedro Robles Dutenhefner, Gisele L. Pappa, Virgilio Almeida, Wagner MeiraJr
Abstract:
Most intrinsic association probes operate at the word, sentence, or corpus level, obscuring author‑level variation. We present POLAR (Per‑user On‑axis Lexical Association Re‑port), a per‑user lexical association test that runs in the embedding space of a lightly adapted masked language model. Authors are represented by private deterministic to‑kens; POLAR projects these vectors onto curated lexicalaxes and reports standardized effects with permutation p‑values and Benjamini‑‑Hochberg control. On a balanced bot‑‑human Twitter benchmark, POLAR cleanly separates LLM‑driven bots from organic accounts; on an extremist forum,it quantifies strong alignment with slur lexicons and reveals rightward drift over time. The method is modular to new attribute sets and provides concise, per‑author diagnostics for computational social science. All code is publicly avail‑able at https://github.com/pedroaugtb/POLAR‑A‑Per‑User‑Association‑Test‑in‑Embedding‑Space.
Authors:Tomas Ruiz, Zhen Qin, Yifan Zhang, Xuyang Shen, Yiran Zhong, Mengdi Wang
Abstract:
Sampling from a categorical distribution is mathematically simple, but in large‑vocabulary decoding, it often triggers extra memory traffic and extra kernels after the LM head. We present FlashSampling, an exact sampling primitive that fuses sampling into the LM‑head matmul and never materializes the logits tensor in HBM. The method is simple: compute logits tile‑by‑tile on chip, add Gumbel noise, keep only one maximizer per row and per vocabulary tile, and finish with a small reduction over tiles. The fused tiled kernel is exact because \argmax decomposes over a partition; grouped variants for online and tensor‑parallel settings are exact by hierarchical factorization of the categorical distribution. Across H100, H200, B200, and B300 GPUs, FlashSampling speeds up kernel‑level decode workloads, and in end‑to‑end vLLM experiments, it reduces time per output token by up to 19% on the models we test. These results show that exact sampling, with no approximation, can be integrated into the matmul itself, turning a bandwidth‑bound postprocessing step into a lightweight epilogue. Project Page: https://github.com/FlashSampling/FlashSampling.
Authors:Christopher Potts, Moritz Sudhof
Abstract:
AI systems fail silently far more often than they fail visibly. In an analysis of 100K human‑AI interactions from the WildChat dataset, we find that 79% of AI failures are invisible: something went wrong but the user gave no overt indication that there was a problem. These invisible failures cluster into eight archetypes that help us characterize where and how AI systems are failing to meet users' needs. In addition, the archetypes show systematic co‑occurrence patterns indicating higher‑level failure types. To address the question of whether these archetypes will remain relevant as AI systems become more capable, we also created and annotated a counterfactual dataset in which WildChat's 2024‑era responses are replaced by those from three present‑day frontier LMs. This analysis indicates that failure rates have dropped substantially, but that the vast majority of failures remain invisible in our sense, and the distribution of failure archetypes seems stable. Finally, we illustrate how the archetypes help us to identify systematic and variable AI limitations across different usage domains. Overall, we argue that our invisible failure taxonomy can be a key component in reliable failure monitoring for product developers, scientists, and policy makers. Our code and data are available at https://github.com/bigspinai/bigspin‑invisible‑failure‑archetypes
Authors:Yitong Zhang, Chengze Li, Ruize Chen, Guowei Yang, Xiaoran Jia, Yijie Ren, Jia Li
Abstract:
Large Language Models (LLMs) have shown strong potential for code generation, yet they remain limited in private‑library‑oriented code generation, where the goal is to generate code using APIs from private libraries. Existing approaches mainly rely on retrieving private‑library API documentation and injecting relevant knowledge into the context at inference time. However, our study shows that this is insufficient: even given accurate required knowledge, LLMs still struggle to invoke private‑library APIs effectively. To address this limitation, we propose PriCoder, an approach that teaches LLMs to invoke private‑library APIs through automatically synthesized data. Specifically, PriCoder models private‑library data synthesis as the construction of a graph, and alternates between two graph operators: (1) Progressive Graph Evolution, which improves data diversity by progressively synthesizing more diverse training samples from basic ones, and (2) Multidimensional Graph Pruning, which improves data quality through a rigorous filtering pipeline. To support rigorous evaluation, we construct two new benchmarks based on recently released libraries that are unfamiliar to the tested models. Experiments on three mainstream LLMs show that PriCoder substantially improves private‑library‑oriented code generation, yielding gains of over 20% in pass@1 in many settings, while causing negligible impact on general code generation capability. Our code and benchmarks are publicly available at https://github.com/eniacode/PriCoder.
Authors:Jinhu Qi, Yifan Li, Minghao Zhao, Wentao Zhang, Zijian Zhang, Yaoman Li, Irwin King
Abstract:
Agentic AI systems increasingly act through tool‑augmented, multi‑step workflows whose failures (unsafe tool use, unauthorised actions, social harm) carry deployment‑level consequences. Evaluation practice remains fragmented across isolated benchmark slices, and "trustworthiness" is frequently invoked but rarely defined operationally. We argue the central limitation is twofold: (i) the absence of a measurable specification of what agent trustworthiness means, and (ii) the lack of a principled notion of representativeness allowing assessment over a socio‑technical scenario distribution rather than disconnected benchmark instances. We address (i) by defining agentic trustworthiness as a five‑property profile (Reliability, Robustness, Safety, Social‑Ethical Alignment, Operational Integrity) grounded in current AI risk frameworks, and (ii) with the Holographic Agent Assessment Framework (HAAF), which measures this profile over a scenario manifold through static policy analysis, sandbox simulation, social‑ethical alignment assessment, and distribution‑aware sampling, connected through an iterative Trustworthy Optimization Factory that converts red‑team diagnoses into blue‑team interventions. Our contributions are: (1) an operational five‑property definition of agentic trustworthiness; (2) a distribution‑aware scenario‑sampling framework that surfaces property‑level trade‑offs invisible to scalar leaderboards; and (3) a cross‑family transfer experiment in which interventions designed from a single focal model generalise ‑‑ without per‑model or per‑scenario tuning ‑‑ to 13 systems from seven model families (Llama, Mistral, Kimi, GLM, Qwen, GPT, DeepSeek) on a 100‑scenario suite, where all 13 systems improve and two reach a perfect risk‑weighted profile, establishing HAAF's Factory as a model‑agnostic deployment‑readiness pipeline. Code: https://github.com/TonyQJH/haaf‑pilot
Authors:Thi Vu, Linh The Nguyen, Dat Quoc Nguyen
Abstract:
Automatic Speech Recognition (ASR) performance is heavily dependent on the availability of large‑scale, high‑quality datasets. For low‑resource languages, existing open‑source ASR datasets often suffer from insufficient quality and inconsistent annotation, hindering the development of robust models. To address these challenges, we propose a novel and generalizable data aggregation and preprocessing pipeline designed to construct high‑quality ASR datasets from diverse, potentially noisy, open‑source sources. Our pipeline incorporates rigorous processing steps to ensure data diversity, balance, and the inclusion of crucial features like word‑level timestamps. We demonstrate the effectiveness of our methodology by applying it to Vietnamese, resulting in a unified, high‑quality 500‑hour dataset that provides a foundation for training and evaluating state‑of‑the‑art Vietnamese ASR systems. Our project page is available at https://github.com/qualcomm‑ai‑research/PhoASR.
Authors:Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen
Abstract:
Computer‑using agents (CUAs) act directly on graphical user interfaces, yet their perception of the screen is often unreliable. Existing work largely treats these failures as performance limitations, asking whether an action succeeds, rather than whether the agent is acting on the correct object at all. We argue that this is fundamentally a security problem. We formalize the visual confused deputy: a failure mode in which an agent authorizes an action based on a misperceived screen state, due to grounding errors, adversarial screenshot manipulation, or time‑of‑check‑to‑time‑of‑use (TOCTOU) races. This gap is practically exploitable: even simple screen‑level manipulations can redirect routine clicks into privileged actions while remaining indistinguishable from ordinary agent mistakes. To mitigate this threat, we propose the first guardrail that operates outside the agent's perceptual loop. Our method, dual‑channel contrastive classification, independently evaluates (1) the visual click target and (2) the agent's reasoning about the action against deployment‑specific knowledge bases, and blocks execution if either channel indicates risk. The key insight is that these two channels capture complementary failure modes: visual evidence detects target‑level mismatches, while textual reasoning reveals dangerous intent behind visually innocuous controls. Across controlled attacks, real GUI screenshots, and agent traces, the combined guardrail consistently outperforms either channel alone. Our results suggest that CUA safety requires not only better action generation, but independent verification of what the agent believes it is clicking and why. Materials are provided\footnoteModel, benchmark, and code: https://github.com/vllm‑project/semantic‑router.
Authors:Umar Abubacar, Roman Bauer, Diptesh Kanojia
Abstract:
Tiny Recursive Models (TRM) achieve strong results on reasoning tasks through iterative refinement of a shared network. We investigate whether these recursive mechanisms transfer to Quality Estimation (QE) for low‑resource languages using a three‑phase methodology. Experiments on 8 language pairs on a low‑resource QE dataset reveal three findings. First, TRM's recursive mechanisms do not transfer to QE. External iteration hurts performance, and internal recursion offers only narrow benefits. Next, representation quality dominates architectural choices, and lastly, frozen pretrained embeddings match fine‑tuned performance while reducing trainable parameters by 37× (7M vs 262M). TRM‑QE with frozen XLM‑R embeddings achieves a Spearman's correlation of 0.370, matching fine‑tuned variants (0.369) and outperforming an equivalent‑depth standard transformer (0.336). On Hindi and Tamil, frozen TRM‑QE outperforms MonoTransQuest (560M parameters) with 80× fewer trainable parameters, suggesting that weight sharing combined with frozen embeddings enables parameter efficiency for QE. We release the code publicly for further research. Code is available at https://github.com/surrey‑nlp/TRMQE.
Authors:Junhang Cheng, Fang Liu, Jia Li, Chengru Wu, Nanxiang Jiang, Li Zhang
Abstract:
Large Language Models excel in high‑resource programming languages but struggle with low‑resource ones. Existing research related to low‑resource programming languages primarily focuses on Domain‑Specific Languages (DSLs), leaving general‑purpose languages that suffer from data scarcity underexplored. To address this gap, we introduce CangjieBench, a contamination‑free benchmark for Cangjie, a representative low‑resource general‑purpose language. The benchmark comprises 248 high‑quality samples manually translated from HumanEval and ClassEval, covering both Text‑to‑Code and Code‑to‑Code tasks. We conduct a systematic evaluation of diverse LLMs under four settings: Direct Generation, Syntax‑Constrained Generation, Retrieval‑Augmented Generation (RAG), and Agent. Experiments reveal that Direct Generation performs poorly, whereas Syntax‑Constrained Generation offers the best trade‑off between accuracy and computational cost. Agent achieve state‑of‑the‑art accuracy but incur high token consumption. Furthermore, we observe that Code‑to‑Code translation often underperforms Text‑to‑Code generation, suggesting a negative transfer phenomenon where models overfit to the source language patterns. We hope that our work will offer valuable insights into LLM generalization to unseen and low‑resource programming languages. Our code and data are available at https://github.com/cjhCoder7/CangjieBench.
Authors:Jungwoo Oh, Hyunseung Chung, Junhee Lee, Min-Gyu Kim, Hangyul Yoon, Ki Seong Lee, Youngchae Lee, Muhan Yeo, Edward Choi
Abstract:
While Multimodal Large Language Models (MLLMs) show promising performance in automated electrocardiogram interpretation, it remains unclear whether they genuinely perform actual step‑by‑step reasoning or just rely on superficial visual cues. To investigate this, we introduce ECG‑Reasoning‑Benchmark, a novel multi‑turn evaluation framework comprising over 6,400 samples to systematically assess step‑by‑step reasoning across 17 core ECG diagnoses. Our comprehensive evaluation of state‑of‑the‑art models reveals a critical failure in executing multi‑step logical deduction. Although models possess the medical knowledge to retrieve clinical criteria for a diagnosis, they exhibit near‑zero success rates (6% Completion) in maintaining a complete reasoning chain, primarily failing to ground the corresponding ECG findings to the actual visual evidence in the ECG signal. These results demonstrate that current MLLMs bypass actual visual interpretation, exposing a critical flaw in existing training paradigms and underscoring the necessity for robust, reasoning‑centric medical AI. The code and data are available at https://github.com/Jwoo5/ecg‑reasoning‑benchmark.
Authors:Yutong Wu, Chenrui Cao, Pengwei Jin, Di Huang, Rui Zhang, Xishan Zhang, Zidong Du, Qi Guo, Xing Hu
Abstract:
SystemVerilog Assertions (SVAs) are crucial for hardware verification. Recent studies leverage general‑purpose LLMs to translate natural language properties to SVAs (NL2SVA), but they perform poorly due to limited data. We propose a data synthesis framework to tackle two challenges: the scarcity of high‑quality real‑world SVA corpora and the lack of reliable methods to determine NL‑SVA semantic equivalence. For the former, large‑scale open‑source RTLs are used to guide LLMs to generate real‑world SVAs; for the latter, bidirectional translation serves as a data selection method. With the synthesized data, we train CodeV‑SVA, a series of SVA generation models. Notably, CodeV‑SVA‑14B achieves 75.8% on NL2SVA‑Human and 84.0% on NL2SVA‑Machine in Func.@1, matching or exceeding advanced LLMs like GPT‑5 and DeepSeek‑R1.
Authors:Hannah Liu, Muxin Tian, Iqra Ali, Haonan Gao, Qiaoyiwen Wu, Blair Yang, Uthayasanker Thayasivam, En-Shiun Annie Lee, Pakawat Nakwijit, Surangika Ranathunga, Ravi Shekhar
Abstract:
Sentence simplification aims to make complex text more accessible by reducing linguistic complexity while preserving the original meaning. However, progress in this area remains limited for mid‑resource and low‑resource languages due to the scarcity of high‑quality data. To address this gap, we introduce the OasisSimp dataset, a multilingual dataset for sentence‑level simplification covering five languages: English, Sinhala, Tamil, Pashto, and Thai. Among these, no prior sentence simplification datasets exist for Thai, Pashto, and Tamil, while limited data is available for Sinhala. Each language simplification dataset was created by trained annotators who followed detailed guidelines to simplify sentences while maintaining meaning, fluency, and grammatical correctness. We evaluate eight open‑weight multilingual Large Language Models (LLMs) on the OasisSimp dataset and observe substantial performance disparities between high‑resource and low‑resource languages, highlighting the simplification challenges in multilingual settings. The OasisSimp dataset thus provides both a valuable multilingual resource and a challenging benchmark, revealing the limitations of current LLM‑based simplification methods and paving the way for future research in low‑resource sentence simplification. The dataset is available at https://OasisSimpDataset.github.io/.
Authors:Ibrahim Ebrar Yurt, Fabian Karl, Tejaswi Choppa, Florian Matthes
Abstract:
Clinical question answering over electronic health records (EHRs) can help clinicians and patients access relevant medical information more efficiently. However, many recent approaches rely on large cloud‑based models, which are difficult to deploy in clinical environments due to privacy constraints and computational requirements. In this work, we investigate how far grounded EHR question answering can be pushed when restricted to a single notebook. We participate in all four subtasks of the ArchEHR‑QA 2026 shared task and evaluate several approaches designed to run on commodity hardware. All experiments are conducted locally without external APIs or cloud infrastructure. Our results show that such systems can achieve competitive performance on the shared task leaderboards. In particular, our submissions perform above average in two subtasks, and we observe that smaller models can approach the performance of much larger systems when properly configured. These findings suggest that privacy‑preserving EHR QA systems running fully locally are feasible with current models and commodity hardware. The source code is available at https://github.com/ibrahimey/ArchEHR‑QA‑2026.
Authors:Hussein Jawad, Nicolas J-B Brunel
Abstract:
Large Language Model (LLM) agents increasingly use external tools for complex tasks and rely on embedding‑based retrieval to select a small top‑k subset for reasoning. As these systems scale, the robustness of this retrieval stage is underexplored, even though prior work has examined attacks on tool selection. This paper introduces ToolFlood, a retrieval‑layer attack on tool‑augmented LLM agents. Rather than altering which tool is chosen after retrieval, ToolFlood overwhelms retrieval itself by injecting a few attacker‑controlled tools whose metadata is carefully placed by exploiting the geometry of embedding space. These tools semantically span many user queries, dominate the top‑k results, and push all benign tools out of the agent's context. ToolFlood uses a two‑phase adversarial tool generation strategy. It first samples subsets of target queries and uses an LLM to iteratively generate diverse tool names and descriptions. It then runs an iterative greedy selection that chooses tools maximizing coverage of remaining queries in embedding space under a cosine‑distance threshold, stopping when all queries are covered or a budget is reached. We provide theoretical analysis of retrieval saturation and show on standard benchmarks that ToolFlood achieves up to a 95% attack success rate with a low injection rate (1% in ToolBench). The code will be made publicly available at the following link: https://github.com/as1‑prog/ToolFlood
Authors:Alejandro Paredes La Torre, Barbara Flores, Diego Rodriguez
Abstract:
We propose a resource‑efficient framework for compressing large language models through knowledge distillation, combined with guided chain‑of‑thought reinforcement learning. Using Qwen 3B as the teacher and Qwen 0.5B as the student, we apply knowledge distillation across English Dolly‑15k, Spanish Dolly‑15k, and code BugNet and PyTorrent datasets, with hyperparameters tuned in the English setting to optimize student performance. Across tasks, the distilled student retains a substantial portion of the teacher's capability while remaining significantly smaller: 70% to 91% in English, up to 95% in Spanish, and up to 93.5% Rouge‑L in code. For coding tasks, integrating chain‑of‑thought prompting with Group Relative Policy Optimization using CoT‑annotated Codeforces data improves reasoning coherence and solution correctness compared to knowledge distillation alone. Post‑training 4‑bit weight quantization further reduces memory footprint and inference latency. These results show that knowledge distillation combined with chain‑of‑thought guided reinforcement learning can produce compact, efficient models suitable for deployment in resource‑constrained settings.
Authors:Pratik Ramesh, George Stoica, Arun Iyer, Leshem Choshen, Judy Hoffman
Abstract:
Model merging has shown that multitask models can be created by directly combining the parameters of different models that are each specialized on tasks of interest. However, models trained independently on distinct tasks often exhibit interference that degrades the merged model's performance. To solve this problem, we formally define the notion of Cross‑Task Interference as the drift in the representation of the merged model relative to its constituent models. Reducing cross‑task interference is key to improving merging performance. To address this issue, we propose our method, Resolving Interference (RI), a light‑weight adaptation framework which disentangles expert models to be functionally orthogonal to the space of other tasks, thereby reducing cross‑task interference. RI does this whilst using only unlabeled auxiliary data as input (i.e., no task‑data is needed), allowing it to be applied in data‑scarce scenarios. RI consistently improves the performance of state‑of‑the‑art merging methods by up to 3.8% and generalization to unseen domains by up to 2.3%. We also find RI to be robust to the source of auxiliary input while being significantly less sensitive to tuning of merging hyperparameters. Our codebase is available at: https://github.com/pramesh39/resolving_interference
Authors:Minsang Kim, Seung Jun Baek
Abstract:
Knowledge Distillation (KD) can transfer the reasoning abilities of large models to smaller ones, which can reduce the costs to generate Chain‑of‑Thoughts for reasoning tasks. KD methods typically ask the student to mimic the teacher's distribution over the entire output. However, a student with limited capacity can be overwhelmed by such extensive supervision causing a distribution mismatch, especially in complex reasoning tasks. We propose Token‑Selective Dual Knowledge Distillation (TSD‑KD), a framework for student‑centric distillation. TSD‑KD focuses on distilling important tokens for reasoning and encourages the student to explain reasoning in its own words. TSD‑KD combines indirect and direct distillation. Indirect distillation uses a weak form of feedback based on preference ranking. The student proposes candidate responses generated on its own; the teacher re‑ranks those candidates as indirect feedback without enforcing its entire distribution. Direct distillation uses distribution matching; however, it selectively distills tokens based on the relative confidence between teacher and student. Finally, we add entropy regularization to maintain the student's confidence during distillation. Overall, our method provides the student with targeted and indirect feedback to support its own reasoning process and to facilitate self‑improvement. The experiments show the state‑of‑the‑art performance of TSD‑KD on 10 challenging reasoning benchmarks, outperforming the baseline and runner‑up in accuracy by up to 54.4% and 40.3%, respectively. Notably, a student trained by TSD‑KD even outperformed its own teacher model in four cases by up to 20.3%. The source code is available at https://github.com/kmswin1/TSD‑KD.
Authors:Tim Vieira
Abstract:
This thesis develops a system for automatically analyzing and improving dynamic programs, such as those that have driven progress in natural language processing and computer science, more generally, for decades. Finding a correct program with the optimal asymptotic runtime can be unintuitive, time‑consuming, and error‑prone. This thesis aims to automate this laborious process. To this end, we develop an approach based on 1. a high‑level, domain‑specific language called Dyna for concisely specifying dynamic programs 2. a general‑purpose solver to efficiently execute these programs 3. a static analysis system that provides type analysis and worst‑case time/space complexity analyses 4. a rich collection of meaning‑preserving transformations to programs, which systematizes the repeated insights of numerous authors when speeding up algorithms in the literature 5. a search algorithm for identifying a good sequence of transformations that reduce the runtime complexity, given an initial, correct program We show that, in practice, automated search ‑‑ like the mental search performed by human programmers ‑‑ can find substantial improvements to the initial program. Empirically, we show that many speed‑ups described in the NLP literature could have been discovered automatically by our system. We provide a freely available prototype system at https://github.com/timvieira/dyna‑pi.
Authors:Ozge Mercanoglu Sincan, Jian He Low, Sobhan Asasi, Richard Bowden
Abstract:
Sign Language Translation (SLT) aims to automatically convert visual sign language videos into spoken language text and vice versa. While recent years have seen rapid progress, the true sources of performance improvements often remain unclear. Do reported performance gains come from methodological novelty, or from the choice of a different backbone, training optimizations, hyperparameter tuning, or even differences in the calculation of evaluation metrics? This paper presents a comprehensive study of recent gloss‑free SLT models by re‑implementing key contributions in a unified codebase. We ensure fair comparison by standardizing preprocessing, video encoders, and training setups across all methods. Our analysis shows that many of the performance gains reported in the literature often diminish when models are evaluated under consistent conditions, suggesting that implementation details and evaluation setups play a significant role in determining results. We make the codebase publicly available here (https://github.com/ozgemercanoglu/sltbaselines) to support transparency and reproducibility in SLT research.
Authors:Yifeng Liu, Siqi Ouyang, Yatish Hosmane Revanasiddappa, Lei Li
Abstract:
Large Language Models (LLMs) have demonstrated remarkable capability in machine translation on high‑resource language pairs, yet their performance on low‑resource translation still lags behind. Existing post‑training methods rely heavily on high‑quality parallel data, which are often scarce or unavailable for low‑resource languages. In this paper, we introduce WALAR, a reinforcement training method using only monolingual text to elevate LLMs' translation capabilities on massive low‑resource languages while retaining their performance on high‑resource languages. Our key insight is based on the observation of failure modes (or "holes") in existing source‑based multilingual quality estimation (QE) models. Reinforcement learning (RL) using these QE models tends to amplify such holes, resulting in poorer multilingual LLMs. We develop techniques including word alignment and language alignment to mitigate such holes in WALAR's reward for RL training. We continually trained an LLM supporting translation of 101 languages using WALAR. The experiments show that our new model outperforms LLaMAX, one of the strongest open‑source multilingual LLMs by a large margin on 1400 language directions on Flores‑101 dataset.
Authors:Sydney Lewis
Abstract:
Long conversations with an AI agent create a simple problem for one user: the history is useful, but carrying it verbatim is expensive. We study personalized agent memory: one user's conversation history with an agent, distilled into a compact retrieval layer for later search. Each exchange is compressed into a compound object with four fields (exchange_core, specific_context, thematic room_assignments, and regex‑extracted files_touched). The searchable distilled text averages 38 tokens per exchange. Applied to 4,182 conversations (14,340 exchanges) from 6 software engineering projects, the method reduces average exchange length from 371 to 38 tokens, yielding 11x compression. We evaluate whether personalized recall survives that compression using 201 recall‑oriented queries, 107 configurations spanning 5 pure and 5 cross‑layer search modes, and 5 LLM graders (214,519 consensus‑graded query‑result pairs). The best pure distilled configuration reaches 96% of the best verbatim MRR (0.717 vs 0.745). Results are mechanism‑dependent. All 20 vector search configurations remain non‑significant after Bonferroni correction, while all 20 BM25 configurations degrade significantly (effect sizes |d|=0.031‑0.756). The best cross‑layer setup slightly exceeds the best pure verbatim baseline (MRR 0.759). Structured distillation compresses single‑user agent memory without uniformly sacrificing retrieval quality. At 1/11 the context cost, thousands of exchanges fit within a single prompt while the verbatim source remains available for drill‑down. We release the implementation and analysis pipeline as open‑source software.
Authors:Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen
Abstract:
Computer Use Agents (CUAs) translate natural‑language instructions into Graphical User Interface (GUI) actions such as clicks, keystrokes, and scrolls by relying on a Vision‑Language Model (VLM) to interpret screenshots and predict grounded tool calls. However, grounding accuracy varies dramatically across VLMs, while current CUA systems typically route every action to a single fixed model regardless of difficulty. We propose Adaptive VLM Routing (AVR), a framework that inserts a lightweight semantic routing layer between the CUA orchestrator and a pool of VLMs. For each tool call, AVR estimates action difficulty from multimodal embeddings, probes a small VLM to measure confidence, and routes the action to the cheapest model whose predicted accuracy satisfies a target reliability threshold. For warm agents with memory of prior UI interactions, retrieved context further narrows the capability gap between small and large models, allowing many actions to be handled without escalation. We formalize routing as a cost‑‑accuracy trade‑off, derive a threshold‑based policy for model selection, and evaluate AVR using ScreenSpot‑Pro grounding data together with the OpenClaw agent routing benchmark. Across these settings, AVR projects inference cost reductions of up to 78% while staying within 2 percentage points of an all‑large‑model baseline. When combined with the Visual Confused Deputy guardrail, AVR also escalates high‑risk actions directly to the strongest available model, unifying efficiency and safety within a single routing framework. Materials are also provided Model, benchmark, and code: https://github.com/vllm‑project/semantic‑router.
Authors:Aditya Maheshwari, Amit Gajkeshwar, Kaushal Sharma, Vivek Patel
Abstract:
As Large Language Models (LLMs) becomes a popular source for religious knowledge, it is important to know if it treats different groups fairly. This study is the first to measure how LLMs handle the differences between the two main sects of Islam: Sunni and Shia. We present a test called SectEval, available in both English and Hindi, consisting of 88 questions, to check the bias‑ness of 15 top LLM models, both proprietary and open‑weights. Our results show a major inconsistency based on language. In English, many powerful models DeepSeek‑v3 and GPT‑4o often favored Shia answers. However, when asked the exact same questions in Hindi, these models switched to favoring Sunni answers. This means a user could get completely different religious advice just by changing languages. We also looked at how models react to location. Advanced models Claude‑3.5 changed their answers to match the user's country‑giving Shia answers to a user from Iran and Sunni answers to a user from Saudi Arabia. In contrast, smaller models (especially in Hindi) ignored the user's location and stuck to a Sunni viewpoint. These findings show that AI is not neutral; its religious ``truth'' changes depending on the language you speak and the country you claim to be from. The data set is available at https://github.com/secteval/SectEval/
Authors:Chenyang Zhu, Hongxiang Li, Xiu Li, Long Chen
Abstract:
Concept customization typically binds rare tokens to a target concept. Unfortunately, these approaches often suffer from unstable performance as the pretraining data seldom contains these rare tokens. Meanwhile, these rare tokens fail to convey the inherent knowledge of the target concept. Consequently, we introduce Knowledge‑aware Concept Customization, a novel task aiming at binding diverse textual knowledge to target visual concepts. This task requires the model to identify the knowledge within the text prompt to perform high‑fidelity customized generation. Meanwhile, the model should efficiently bind all the textual knowledge to the target concept. Therefore, we propose MoKus, a novel framework for knowledge‑aware concept customization. Our framework relies on a key observation: cross‑modal knowledge transfer, where modifying knowledge within the text modality naturally transfers to the visual modality during generation. Inspired by this observation, MoKus contains two stages: (1) In visual concept learning, we first learn the anchor representation to store the visual information of the target concept. (2) In textual knowledge updating, we update the answer for the knowledge queries to the anchor representation, enabling high‑fidelity customized generation. To further comprehensively evaluate our proposed MoKus on the new task, we introduce the first benchmark for knowledge‑aware concept customization: KnowCusBench. Extensive evaluations have demonstrated that MoKus outperforms state‑of‑the‑art methods. Moreover, the cross‑model knowledge transfer allows MoKus to be easily extended to other knowledge‑aware applications like virtual concept creation and concept erasure. We also demonstrate the capability of our method to achieve improvements on world knowledge benchmarks.
Authors:Xinping Zhao, Xinshuo Hu, Jiaxin Xu, Danyu Tang, Xin Zhang, Mengjia Zhou, Yan Zhong, Yao Zhou, Zifei Shan, Meishan Zhang, Baotian Hu, Min Zhang
Abstract:
Memory embeddings are crucial for memory‑augmented systems, such as OpenClaw, but their evaluation is underexplored in current text embedding benchmarks, which narrowly focus on traditional passage retrieval and fail to assess models' ability to handle long‑horizon memory retrieval tasks involving fragmented, context‑dependent, and temporally distant information. To address this, we introduce the Long‑horizon Memory Embedding Benchmark (LMEB), a comprehensive framework that evaluates embedding models' capabilities in handling complex, long‑horizon memory retrieval tasks. LMEB spans 22 datasets and 193 zero‑shot retrieval tasks across 4 memory types: episodic, dialogue, semantic, and procedural, with both AI‑generated and human‑annotated data. These memory types differ in terms of level of abstraction and temporal dependency, capturing distinct aspects of memory retrieval that reflect the diverse challenges of the real world. We evaluate 15 widely used embedding models, ranging from hundreds of millions to ten billion parameters. The results reveal that (1) LMEB provides a reasonable level of difficulty; (2) Larger models do not always perform better; (3) LMEB and MTEB exhibit orthogonality. This suggests that the field has yet to converge on a universal model capable of excelling across all memory retrieval tasks, and that performance in traditional passage retrieval may not generalize to long‑horizon memory retrieval. In summary, by providing a standardized and reproducible evaluation framework, LMEB fills a crucial gap in memory embedding evaluation, driving further advancements in text embedding for handling long‑term, context‑dependent memory retrieval. LMEB is available at https://github.com/KaLM‑Embedding/LMEB.
Authors:Vishnu Teja Kunde, Fatemeh Doudi, Mahdi Farahbakhsh, Dileep Kalathil, Krishna Narayanan, Jean-Francois Chamberland
Abstract:
Reinforcement learning (RL) has been effective for post‑training autoregressive (AR) language models, but extending these methods to diffusion language models (DLMs) is challenging due to intractable sequence‑level likelihoods. Existing approaches therefore rely on surrogate likelihoods or heuristic approximations, which can introduce bias and obscure the sequential structure of denoising. We formulate diffusion‑based sequence generation as a finite‑horizon Markov decision process over the denoising trajectory and derive an exact, unbiased policy gradient that decomposes over denoising steps and is expressed in terms of intermediate advantages, without requiring explicit evaluation of the sequence likelihood. To obtain a practical and compute‑efficient estimator, we (i) select denoising steps for policy updates via an entropy‑guided approximation bound, and (ii) estimate intermediate advantages using a one‑step denoising reward naturally provided by the diffusion model, avoiding costly multi‑step rollouts. Experiments on coding and logical reasoning benchmarks demonstrate state‑of‑the‑art results, with strong competitive performance on mathematical reasoning, outperforming existing RL post‑training approaches for DLMs. Code is available at https://github.com/vishnutez/egspo‑dllm‑rl.
Authors:Xuanlang Dai, Yujie Zhou, Long Xing, Jiazi Bu, Xilin Wei, Yuhong Liu, Beichen Zhang, Kai Chen, Yuhang Zang
Abstract:
Recently, Multimodal Large Language Models (MLLMs) have been widely integrated into diffusion frameworks primarily as text encoders to tackle complex tasks such as spatial reasoning. However, this paradigm suffers from two critical limitations: (i) MLLMs text encoder exhibits insufficient reasoning depth. Single‑step encoding fails to activate the Chain‑of‑Thought process, which is essential for MLLMs to provide accurate guidance for complex tasks. (ii) The guidance remains invariant during the decoding process. Invariant guidance during decoding prevents DiT from progressively decomposing complex instructions into actionable denoising steps, even with correct MLLM encodings. To this end, we propose Endogenous Chain‑of‑Thought (EndoCoT), a novel framework that first activates MLLMs' reasoning potential by iteratively refining latent thought states through an iterative thought guidance module, and then bridges these states to the DiT's denoising process. Second, a terminal thought grounding module is applied to ensure the reasoning trajectory remains grounded in textual supervision by aligning the final state with ground‑truth answers. With these two components, the MLLM text encoder delivers meticulously reasoned guidance, enabling the DiT to execute it progressively and ultimately solve complex tasks in a step‑by‑step manner. Extensive evaluations across diverse benchmarks (e.g., Maze, TSP, VSP, and Sudoku) achieve an average accuracy of 92.1%, outperforming the strongest baseline by 8.3 percentage points. The code and dataset are publicly available at https://lennoxdai.github.io/EndoCoT‑Webpage/.
Authors:Priyanka Kargupta, Shuhaib Mehri, Dilek Hakkani-Tur, Jiawei Han
Abstract:
Despite interdisciplinary research leading to larger and longer‑term impact, most work remains confined to single‑domain academic silos. Recent AI‑based approaches to scientific discovery show promise for interdisciplinary research, but many prioritize rapidly designing experiments and solutions, bypassing the exploratory, collaborative reasoning processes that drive creative interdisciplinary breakthroughs. As a result, prior efforts largely prioritize automating scientific discovery rather than augmenting the reasoning processes that underlie scientific disruption. We present Idea‑Catalyst, a novel framework that systematically identifies interdisciplinary insights to support creative reasoning in both humans and large language models. Starting from an abstract research goal, Idea‑Catalyst is designed to assist the brainstorming stage, explicitly avoiding premature anchoring on specific solutions. The framework embodies key metacognitive features of interdisciplinary reasoning: (a) defining and assessing research goals, (b) awareness of a domain's opportunities and unresolved challenges, and (c) strategic exploration of interdisciplinary ideas based on impact potential. Concretely, Idea‑Catalyst decomposes an abstract goal (e.g., improving human‑AI collaboration) into core target‑domain research questions that guide the analysis of progress and open challenges within that domain. These challenges are reformulated as domain‑agnostic conceptual problems, enabling retrieval from external disciplines (e.g., Psychology, Sociology) that address analogous issues. By synthesizing and recontextualizing insights from these domains back into the target domain, Idea‑Catalyst ranks source domains by their interdisciplinary potential. Empirically, this targeted integration improves average novelty by 21% and insightfulness by 16%, while remaining grounded in the original research problem.
Authors:William Brach, Tomas Bedej, Jacob Nielsen, Jacob Pichna, Juraj Bedej, Eemeli Saarensilta, Julie Dupouy, Gianluca Barmina, Andrea Blasi Núñez, Peter Schneider-Kamp, Kristian Košťál, Michal Ries, Lukas Galke Poech
Abstract:
With the rapid advances of large language models, it becomes increasingly important to systematically evaluate their multilingual and multicultural capabilities. Previous cultural evaluation benchmarks focus mainly on basic cultural knowledge that can be encoded in linguistic form. Here, we propose SommBench, a multilingual benchmark to assess sommelier expertise, a domain deeply grounded in the senses of smell and taste. While language models learn about sensory properties exclusively through textual descriptions, SommBench tests whether this textual grounding is sufficient to emulate expert‑level sensory judgment. SommBench comprises three main tasks: Wine Theory Question Answering (WTQA), Wine Feature Completion (WFC), and Food‑Wine Pairing (FWP). SommBench is available in multiple languages: English, Slovak, Swedish, Finnish, German, Danish, Italian, and Spanish. This helps separate a language model's wine expertise from its language skills. The benchmark datasets were developed in close collaboration with a professional sommelier and native speakers of the respective languages, resulting in 1,024 wine theory question‑answering questions, 1,000 wine feature‑completion examples, and 1,000 food‑wine pairing examples. We provide results for the most popular language models, including closed‑weights models such as Gemini 2.5, and open‑weights models, such as GPT‑OSS and Qwen 3. Our results show that the most capable models perform well on wine theory question answering (up to 97% correct with a closed‑weights model), yet feature completion (peaking at 65%) and food‑wine pairing show (MCC ranging between 0 and 0.39) turn out to be more challenging. These results position SommBench as an interesting and challenging benchmark for evaluating the sommelier expertise of language models. The benchmark is publicly available at https://github.com/sommify/sommbench.
Authors:Ilias Aarab
Abstract:
Zero‑shot text classification (ZSC) offers the promise of eliminating costly task‑specific annotation by matching texts directly to human‑readable label descriptions. While early approaches have predominantly relied on cross‑encoder models fine‑tuned for natural language inference (NLI), recent advances in text‑embedding models, rerankers, and instruction‑tuned large language models (LLMs) have challenged the dominance of NLI‑based architectures. Yet, systematically comparing these diverse approaches remains difficult. Existing evaluations, such as MTEB, often incorporate labeled examples through supervised probes or fine‑tuning, leaving genuine zero‑shot capabilities underexplored. To address this, we introduce BTZSC, a comprehensive benchmark of 22 public datasets spanning sentiment, topic, intent, and emotion classification, capturing diverse domains, class cardinalities, and document lengths. Leveraging BTZSC, we conduct a systematic comparison across four major model families, NLI cross‑encoders, embedding models, rerankers and instruction‑tuned LLMs, encompassing 38 public and custom checkpoints. Our results show that: (i) modern rerankers, exemplified by Qwen3‑Reranker‑8B, set a new state‑of‑the‑art with macro F1 = 0.72; (ii) strong embedding models such as GTE‑large‑en‑v1.5 substantially close the accuracy gap while offering the best trade‑off between accuracy and latency; (iii) instruction‑tuned LLMs at 4‑‑12B parameters achieve competitive performance (macro F1 up to 0.67), excelling particularly on topic classification but trailing specialized rerankers; (iv) NLI cross‑encoders plateau even as backbone size increases; and (v) scaling primarily benefits rerankers and LLMs over embedding models. BTZSC and accompanying evaluation code are publicly released to support fair and reproducible progress in zero‑shot text understanding.
Authors:Lu Wang, Zhuoran Jin, Yupu Hao, Yubo Chen, Kang Liu, Yulong Ao, Jun Zhao
Abstract:
Multimodal large language models (MLLMs) have shown strong performance on offline video understanding, but most are limited to offline inference or have weak online reasoning, making multi‑turn interaction over continuously arriving video streams difficult. Existing streaming methods typically use an interleaved perception‑generation paradigm, which prevents concurrent perception and generation and leads to early memory decay as streams grow, hurting long‑range dependency modeling. We propose Think While Watching, a memory‑anchored streaming video reasoning framework that preserves continuous segment‑level memory during multi‑turn interaction. We build a three‑stage, multi‑round chain‑of‑thought dataset and adopt a stage‑matched training strategy, while enforcing strict causality through a segment‑level streaming causal mask and streaming positional encoding. During inference, we introduce an efficient pipeline that overlaps watching and thinking and adaptively selects the best attention backend. Under both single‑round and multi‑round streaming input protocols, our method achieves strong results. Built on Qwen3‑VL, it improves single‑round accuracy by 2.6% on StreamingBench and by 3.79% on OVO‑Bench. In the multi‑round setting, it maintains performance while reducing output tokens by 56%. Code is available at: https://github.com/wl666hhh/Think_While_Watching/
Authors:Yaocong Li, Qiang Lan, Leihan Zhang, Le Zhang
Abstract:
Retrieval‑Augmented Generation (RAG) has emerged as a promising technology for legal document consultation, yet its application in Chinese legal scenarios faces two key limitations: existing benchmarks lack specialized support for joint retriever‑generator evaluation, and mainstream RAG systems often fail to accommodate the structured nature of legal provisions. To address these gaps, this study advances two core contributions: First, we constructed the Legal‑DC benchmark dataset, comprising 480 legal documents (covering areas such as market regulation and contract management) and 2,475 refined question‑answer pairs, each annotated with clause‑level references, filling the gap for specialized evaluation resources in Chinese legal RAG. Second, we propose the LegRAG framework, which integrates legal adaptive indexing (clause‑boundary segmentation) with a dual‑path self‑reflection mechanism to ensure clause integrity while enhancing answer accuracy. Third, we introduce automated evaluation methods for large language models to meet the high‑reliability demands of legal retrieval scenarios. LegRAG outperforms existing state‑of‑the‑art methods by 1.3% to 5.6% across key evaluation metrics. This research provides a specialized benchmark, practical framework, and empirical insights to advance the development of Chinese legal RAG systems. Our code and data are available at https://github.com/legal‑dc/Legal‑DC.
Authors:Xianjing Han, Bin Zhu, Shiqi Hu, Franklin Mingzhe Li, Patrick Carrington, Roger Zimmermann, Jingjing Chen
Abstract:
Text‑to‑video (T2V) generation models have made rapid progress in producing visually high‑quality and temporally coherent videos. However, existing benchmarks primarily focus on perceptual quality, text‑video alignment, or physical plausibility, leaving a critical aspect of action understanding largely unexplored: object state change (OSC) explicitly specified in the text prompt. OSC refers to the transformation of an object's state induced by an action, such as peeling a potato or slicing a lemon. In this paper, we introduce OSCBench, a benchmark specifically designed to assess OSC performance in T2V models. OSCBench is constructed from instructional cooking data and systematically organizes action‑object interactions into regular, novel, and compositional scenarios to probe both in‑distribution performance and generalization. We evaluate six representative open‑source and proprietary T2V models using both human user study and multimodal large language model (MLLM)‑based automatic evaluation. Our results show that, despite strong performance on semantic and scene alignment, current T2V models consistently struggle with accurate and temporally consistent object state changes, especially in novel and compositional settings. These findings position OSC as a key bottleneck in text‑to‑video generation and establish OSCBench as a diagnostic benchmark for advancing state‑aware video generation models.
Authors:Varun Iyer, Cornelia Caragea
Abstract:
Abstractive summarization requires models to generate summaries that convey information in the source document. While large language models can generate summaries without fine‑tuning, they often miss key details and include extraneous information. We propose BLooP (Bigram Lookahead Promotion), a simple training‑free decoding intervention that encourages large language models (LLMs) to generate tokens that form bigrams from the source document. BLooP operates through a hash table lookup at each decoding step, requiring no training, fine‑tuning, or model modification. We demonstrate improvements in ROUGE and BARTScore for Llama‑3.1‑8B‑Instruct, Mistral‑Nemo‑Instruct‑2407, and Gemma‑2‑9b‑it on CNN/DM, CCSum, Multi‑News, and SciTLDR. Human evaluation shows that BLooP significantly improves faithfulness without reducing readability. We make the code available at https://github.com/varuniyer/BLooP
Authors:Teng Xiao, Yige Yuan, Hamish Ivison, Huaisheng Zhu, Faeze Brahman, Nathan Lambert, Pradeep Dasigi, Noah A. Smith, Hannaneh Hajishirzi
Abstract:
This paper introduces MR‑Search, an in‑context meta reinforcement learning (RL) formulation for agentic search with self‑reflection. Instead of optimizing a policy within a single independent episode with sparse rewards, MR‑Search trains a policy that conditions on past episodes and adapts its search strategy across episodes. MR‑Search learns to learn a search strategy with self‑reflection, allowing search agents to improve in‑context exploration at test‑time. Specifically, MR‑Search performs cross‑episode exploration by generating explicit self‑reflections after each episode and leveraging them as additional context to guide subsequent attempts, thereby promoting more effective exploration during test‑time. We further introduce a multi‑turn RL algorithm that estimates a dense relative advantage at the turn level, enabling fine‑grained credit assignment on each episode. Empirical results across various benchmarks demonstrate the advantages of MR‑Search over baselines based RL, showing strong generalization and relative improvements of 9.2% to 19.3% across eight benchmarks. Our code and data are available at https://github.com/tengxiao1/MR‑Search.
Authors:Riccardo Campi, Nicolò Oreste Pinciroli Vago, Mathyas Giudici, Marco Brambilla, Piero Fraternali
Abstract:
Retrieval‑Augmented Generation (RAG) over Knowledge Graphs (KGs) suffers from the fact that indexing approaches may lose important contextual nuance when text is reduced to triples, thereby degrading performance in downstream Question‑Answering (QA) tasks, particularly for multi‑hop QA, which requires composing answers from multiple entities, facts, or relations. We propose a domain‑agnostic, KG‑based QA framework that covers both the indexing and retrieval/inference phases. A new indexing approach called Map‑Disambiguate‑Enrich‑Reduce (MDER) generates context‑derived triple descriptions and subsequently integrates them with entity‑level summaries, thus avoiding the need for explicit traversal of edges in the graph during the QA retrieval phase. Complementing this, we introduce Decompose‑Resolve (DR), a retrieval mechanism that decomposes user queries into resolvable triples and grounds them in the KG via iterative reasoning. Together, MDER and DR form an LLM‑driven QA pipeline that is robust to sparse, incomplete, and complex relational data. Experiments show that on standard and domain specific benchmarks, MDER‑DR achieves substantial improvements over standard RAG baselines (up to 66%), while maintaining cross‑lingual robustness. Our code is available at https://github.com/DataSciencePolimi/MDER‑DR_RAG.
Authors:Chandler Smith, Magnus Sesodia, Friedrich Lindenberg, Christian Schroeder de Witt
Abstract:
We release OpenSanctions Pairs, a large‑scale entity matching benchmark derived from real‑world international sanctions aggregation and analyst deduplication. The dataset contains 755,540 labeled pairs spanning 293 heterogeneous sources across 31 countries, with multilingual and cross‑script names, noisy and missing attributes, and set‑valued fields typical of compliance workflows. We benchmark a production rule‑based matcher (nomenklatura RegressionV1 algorithm) against open‑ and closed‑source LLMs in zero‑ and few‑shot settings. Off‑the‑shelf LLMs substantially outperform the production rule‑based baseline (91.33% F1), reaching up to 98.95% F1 (GPT‑4o) and 98.23% F1 with a locally deployable open model (DeepSeek‑R1‑Distill‑Qwen‑14B). DSPy MIPROv2 prompt optimization yields consistent but modest gains, while adding in‑context examples provides little additional benefit and can degrade performance. Error analysis shows complementary failure modes: the rule‑based system over‑matches (high false positives), whereas LLMs primarily fail on cross‑script transliteration and minor identifier/date inconsistencies. These results indicate that pairwise matching performance is approaching a practical ceiling in this setting, and motivate shifting effort toward pipeline components such as blocking, clustering, and uncertainty‑aware review. Code available at https://github.com/chansmi/OSINT_entity_resolution
Authors:Susung Hong, Brian Curless, Ira Kemelmacher-Shlizerman, Steve Seitz
Abstract:
We propose a fully automated AI system that produces short comedic videos similar to sketch shows such as Saturday Night Live. Starting with character references, the system employs a population of agents loosely based on real production studio roles, structured to optimize the quality and diversity of ideas and outputs through iterative competition, evaluation, and improvement. A key contribution is the introduction of LLM critics aligned with real viewer preferences through the analysis of a corpus of comedy videos on YouTube to automatically evaluate humor. Our experiments show that our framework produces results approaching the quality of professionally produced sketches while demonstrating state‑of‑the‑art performance in video generation.
Authors:Changyi Xiao, Caijun Xu, Yixin Cao
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective in enhancing the reasoning capabilities of large language models, particularly in domains such as mathematics where reliable rule‑based verifiers can be constructed. However, the reliance on handcrafted, domain‑specific verification rules substantially limits the applicability of RLVR to general reasoning domains with free‑form answers, where valid answers often exhibit significant variability, making it difficult to establish complete and accurate rules. To address this limitation, we propose Conditional Expectation Reward (CER), which leverages the large language model itself as an implicit verifier, and is therefore applicable to general domains and eliminates the need for external verifiers or auxiliary models. CER is defined as the expected likelihood of generating the reference answer conditioned on the generated answer. In contrast to rule‑based verifiers that yield binary feedback, CER provides a soft, graded reward signal that reflects varying degrees of correctness, making it better suited to tasks where answers vary in correctness. Experimental results demonstrate that CER is effective across a wide range of reasoning tasks, spanning both mathematical and general domains, indicating that CER serves as a flexible and general verification mechanism. The code is available at https://github.com/changyi7231/CER.
Authors:Hyungjoo Chae, Jungsoo Park, Alan Ritter
Abstract:
Training autonomous web agents is fundamentally limited by the environments they learn from: real‑world websites are unsafe to explore, hard to reset, and rarely provide verifiable feedback. We propose VeriEnv, a framework that treats language models as environment creators, automatically cloning real‑world websites into fully executable, verifiable synthetic environments. By exposing controlled internal access via a Python SDK, VeriEnv enables agents to self‑generate tasks with deterministic, programmatically verifiable rewards, eliminating reliance on heuristic or LLM‑based judges. This design decouples agent learning from unsafe real‑world interaction while enabling scalable self‑evolution through environment expansion. Through experiments on web agent benchmarks, we show that agents trained with VeriEnv generalize to unseen websites, achieve site‑specific mastery through self‑evolving training, and benefit from scaling the number of training environments. Code and resources will be released at https://github.com/kyle8581/VeriEnv upon acceptance.
Authors:Tim Schopf, Michael Färber
Abstract:
Judging the novelty of research ideas is crucial for advancing science, enabling the identification of unexplored directions, and ensuring contributions meaningfully extend existing knowledge rather than reiterate minor variations. However, given the exponential growth of scientific literature, manually judging the novelty of research ideas through literature reviews is labor‑intensive, subjective, and infeasible at scale. Therefore, recent efforts have proposed automated approaches for research idea novelty judgment. Yet, evaluation of these approaches remains largely inconsistent and is typically based on non‑standardized human evaluations, hindering large‑scale, comparable evaluations. To address this, we introduce RINoBench, the first comprehensive benchmark for large‑scale evaluation of research idea novelty judgments. It comprises 1,381 research ideas derived from and judged by human experts as well as nine automated evaluation metrics designed to assess both rubric‑based novelty scores and textual justifications of novelty judgments. Using this benchmark, we evaluate several state‑of‑the‑art large language models (LLMs) on their ability to judge the novelty of research ideas. Our findings reveal that while LLM‑generated reasoning closely mirrors human rationales, this alignment does not reliably translate into accurate novelty judgments, which diverge significantly from human gold standard judgments ‑ even among leading reasoning‑capable models. Data and code available at: https://github.com/TimSchopf/RINoBench.
Authors:Zhouxiang Fang, Jiawei Zhou, Hanjie Chen
Abstract:
Recent studies show that the safety alignment of large language models (LLMs) can be easily compromised even by seemingly non‑adversarial fine‑tuning. To preserve safety alignment during fine‑tuning, a widely used strategy is to jointly optimize safety and task objectives by mixing in the original alignment data, which is typically inaccessible even for open‑weight LLMs. Inspired by generative replay in continual learning, we propose Generative Replay for Safety Alignment Preservation (GR‑SAP), a unified framework that synthesizes domain‑specific alignment data from LLMs and integrate them during downstream adaption to preserve safety alignment. Theoretical and empirical analyses demonstrate this synthetic data serves as a reliable proxy for the original alignment data. Experiments across various models and downstream tasks show that GR‑SAP substantially mitigates fine‑tuning‑induced safety degradation while maintaining comparable downstream performance. Our code is available at https://github.com/chili‑lab/gr‑sap.
Authors:Sijia Cui, Pengyu Cheng, Jiajun Song, Yongbo Gai, Guojun Zhang, Zhechao Yu, Jianhe Lin, Xiaoxi Jiang, Guanjun Jiang
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capacity of Large Language Models (LLMs). However, RLVR solely relies on final answers as outcome rewards, neglecting the correctness of intermediate reasoning steps. Training on these process‑wrong but outcome‑correct rollouts can lead to hallucination and answer‑copying, severely undermining the model's generalization and robustness. To address this, we incorporate a Contrastive Learning mechanism into the Policy Optimization (CLIPO) to generalize the RLVR process. By optimizing a contrastive loss over successful rollouts, CLIPO steers the LLM to capture the invariant structure shared across correct reasoning paths. This provides a more robust cross‑trajectory regularization than the original single‑path supervision in RLVR, effectively mitigating step‑level reasoning inconsistencies and suppressing hallucinatory artifacts. In experiments, CLIPO consistently improves multiple RLVR baselines across diverse reasoning benchmarks, demonstrating uniform improvements in generalization and robustness for policy optimization of LLMs. Our code and training recipes are available at https://github.com/Qwen‑Applications/CLIPO.
Authors:Harry Owiredu-Ashley
Abstract:
Most adversarial evaluations of large language model (LLM) safety assess single prompts and report binary pass/fail outcomes, which fails to capture how safety properties evolve under sustained adversarial interaction. We present ADVERSA, an automated red‑teaming framework that measures guardrail degradation dynamics as continuous per‑round compliance trajectories rather than discrete jailbreak events. ADVERSA uses a fine‑tuned 70B attacker model (ADVERSA‑Red, Llama‑3.1‑70B‑Instruct with QLoRA) that eliminates the attacker‑side safety refusals that render off‑the‑shelf models unreliable as attackers, scoring victim responses on a structured 5‑point rubric that treats partial compliance as a distinct measurable state. We report a controlled experiment across three frontier victim models (Claude Opus 4.6, Gemini 3.1 Pro, GPT‑5.2) using a triple‑judge consensus architecture in which judge reliability is measured as a first‑class research outcome rather than assumed. Across 15 conversations of up to 10 adversarial rounds, we observe a 26.7% jailbreak rate with an average jailbreak round of 1.25, suggesting that in this evaluation setting, successful jailbreaks were concentrated in early rounds rather than accumulating through sustained pressure. We document inter‑judge agreement rates, self‑judge scoring tendencies, attacker drift as a failure mode in fine‑tuned attackers deployed out of their training distribution, and attacker refusals as a previously‑underreported confound in victim resistance measurement. All limitations are stated explicitly. Attack prompts are withheld per responsible disclosure policy; all other experimental artifacts are released.
Authors:Dan Lee, Seungwook Han, Akarsh Kumar, Pulkit Agrawal
Abstract:
Pre‑training is crucial for large language models (LLMs), as it is when most representations and capabilities are acquired. However, natural language pre‑training has problems: high‑quality text is finite, it contains human biases, and it entangles knowledge with reasoning. This raises a fundamental question: is natural language the only path to intelligence? We propose using neural cellular automata (NCA) to generate synthetic, non‑linguistic data for pre‑pre‑training LLMs‑‑training on synthetic‑then‑natural language. NCA data exhibits rich spatiotemporal structure and statistics resembling natural language while being controllable and cheap to generate at scale. We find that pre‑pre‑training on only 164M NCA tokens improves downstream language modeling by up to 6% and accelerates convergence by up to 1.6x. Surprisingly, this even outperforms pre‑pre‑training on 1.6B tokens of natural language from Common Crawl with more compute. These gains also transfer to reasoning benchmarks, including GSM8K, HumanEval, and BigBench‑Lite. Investigating what drives transfer, we find that attention layers are the most transferable, and that optimal NCA complexity varies by domain: code benefits from simpler dynamics, while math and web text favor more complex ones. These results enable systematic tuning of the synthetic distribution to target domains. More broadly, our work opens a path toward more efficient models with fully synthetic pre‑training.
Authors:David Gringras
Abstract:
Safety benchmarks evaluate language models in isolation, typically using multiple‑choice format; production deployments wrap these models in agentic scaffolds that restructure inputs through reasoning traces, critic agents, and delegation pipelines. We report one of the largest controlled studies of scaffold effects on safety (N = 62,808; six frontier models, four deployment configurations), combining pre‑registration, assessor blinding, equivalence testing, and specification curve analysis. Map‑reduce scaffolding degrades measured safety (NNH = 14), yet two of three scaffold architectures preserve safety within practically meaningful margins. Investigating the map‑reduce degradation revealed a deeper measurement problem: switching from multiple‑choice to open‑ended format on identical items shifts safety scores by 5‑20 percentage points, larger than any scaffold effect. Within‑format scaffold comparisons are consistent with practical equivalence under our pre‑registered +/‑2 pp TOST margin, isolating evaluation format rather than scaffold architecture as the operative variable. Model x scaffold interactions span 35 pp in opposing directions (one model degrades by ‑16.8 pp on sycophancy under map‑reduce while another improves by +18.8 pp on the same benchmark), ruling out universal claims about scaffold safety. A generalisability analysis yields G = 0.000: model safety rankings reverse so completely across benchmarks that no composite safety index achieves non‑zero reliability, making per‑model, per‑configuration testing a necessary minimum standard. We release all code, data, and prompts as ScaffoldSafety.
Authors:Xingtong Yu, Shenghua Ye, Ruijuan Liang, Chang Zhou, Hong Cheng, Xinming Zhang, Yuan Fang
Abstract:
Graph foundation models (GFM) aim to acquire transferable knowledge by pre‑training on diverse graphs, which can be adapted to various downstream tasks. However, domain shift in graphs is inherently two‑dimensional: graphs differ not only in what they describe (topic domains) but also in how they are represented (format domains). Most existing GFM benchmarks vary only topic domains, thereby obscuring how knowledge transfers across both dimensions. We present a new benchmark that jointly evaluates topic and format gaps across the full GFM pipeline, including multi‑domain self‑supervised pre‑training and few‑shot downstream adaptation, and provides a timely evaluation of recent GFMs in the rapidly evolving landscape. Our protocol enables controlled assessment in four settings: (i) pre‑training on diverse topics and formats, while adapting to unseen downstream datasets; (ii) same pre‑training as in (i), while adapting to seen datasets; (iii) pre‑training on a single topic domain, while adapting to other topics; (iv) pre‑training on a base format, while adapting to other formats. This two‑axis evaluation disentangles semantic generalization from robustness to representational shifts. We conduct extensive evaluations of eight state‑of‑the‑art GFMs on 33 datasets spanning seven topic domains and six format domains, surfacing new empirical observations and practical insights for future research. Codes/data are available at https://github.com/smufang/GFMBenchmark.
Authors:Izzat Alsmadi, Anas Alsobeh
Abstract:
This paper presents TAMUSA‑Chat, a research‑oriented framework for building domain‑adapted large language model conversational systems. The work addresses critical challenges in adapting general‑purpose foundation models to institutional contexts through supervised fine‑tuning, retrieval‑augmented generation, and systematic evaluation methodologies. We describe the complete architecture encompassing data acquisition from institutional sources, preprocessing pipelines, embedding construction, model training workflows, and deployment strategies. The system integrates modular components enabling reproducible experimentation with training configurations, hyper‑parameters, and evaluation protocols. Our implementation demonstrates how academic institutions can develop contextually grounded conversational agents while maintaining transparency, governance compliance, and responsible AI practices. Through empirical analysis of fine‑tuning behavior across model sizes and training iterations, we provide insights into domain adaptation efficiency, computational resource requirements, and quality‑cost trade‑offs. The publicly available codebase at https://github.com/alsmadi/TAMUSA_LLM_Based_Chat_app supports continued research into institutional LLM deployment, evaluation methodologies, and ethical considerations for educational AI systems.
Authors:Shuhuai Li, Jianghao Lin, Dongdong Ge, Yinyu Ye
Abstract:
Mixture‑of‑Experts (MoE) models enable scalable performance but face severe memory constraints on edge devices. Existing offloading strategies struggle with I/O bottlenecks due to the dynamic, low‑information nature of autoregressive expert activation. In this paper, we propose to repurpose Speculative Decoding (SD) not merely as a compute accelerator, but as an informative lookahead sensor for memory management, supported by our theoretical and empirical analyses. Hence, we introduce MoE‑SpAc, an MoE inference framework that integrates a Speculative Utility Estimator to track expert demand, a Heterogeneous Workload Balancer to dynamically partition computation via online integer optimization, and an Asynchronous Execution Engine to unify the prefetching and eviction in the same utility space. Extensive experiments on seven benchmarks demonstrate that MoE‑SpAc achieves a 42% improvement in TPS over the SOTA SD‑based baseline, and an average 4.04x speedup over all standard baselines. Code is available at https://github.com/lshAlgorithm/MoE‑SpAc .
Authors:Ghazal Kalhor, Yadollah Yaghoobzadeh
Abstract:
Persian poetry plays an active role in Iranian cultural practice, where verses by canonical poets such as Hafez are frequently quoted, paraphrased, or completed from partial cues. Supporting such interactions requires language models to engage not only with poetic meaning but also with culturally entrenched surface form. We introduce GhazalBench, a benchmark for evaluating how large language models (LLMs) interact with Persian ghazals under usage‑grounded conditions. GhazalBench assesses two complementary abilities: producing faithful prose paraphrases of couplets and accessing canonical verses under varying semantic and formal cues. Across several proprietary and open‑weight multilingual LLMs, we observe a consistent dissociation: models generally capture poetic meaning but struggle with exact verse recall in completion‑based settings, while recognition‑based tasks substantially reduce this gap. A parallel evaluation on English sonnets shows markedly higher recall performance, suggesting that these limitations are tied to differences in training exposure rather than inherent architectural constraints. Our findings highlight the need for evaluation frameworks that jointly assess meaning, form, and cue‑dependent access to culturally significant texts. GhazalBench is available at https://github.com/kalhorghazal/GhazalBench.
Authors:Chengyu Shen, Yanheng Hou, Minghui Pan, Runming He, Zhen Hao Wong, Meiyi Qiang, Zhou Liu, Hao Liang, Peichao Lai, Zeang Sheng, Wentao Zhang
Abstract:
Reliable evaluation is essential for developing and deploying large language models, yet in practice it often requires substantial manual effort: practitioners must identify appropriate benchmarks, reproduce heterogeneous evaluation codebases, configure dataset schema mappings, and interpret aggregated metrics. To address these challenges, we present One‑Eval, an agentic evaluation system that converts natural‑language evaluation requests into executable, traceable, and customizable evaluation workflows. One‑Eval integrates (i) NL2Bench for intent structuring and personalized benchmark planning, (ii) BenchResolve for benchmark resolution, automatic dataset acquisition, and schema normalization to ensure executability, and (iii) Metrics \& Reporting for task‑aware metric selection and decision‑oriented reporting beyond scalar scores. The system further incorporates human‑in‑the‑loop checkpoints for review, editing, and rollback, while preserving sample evidence trails for debugging and auditability. Experiments show that One‑Eval can execute end‑to‑end evaluations from diverse natural‑language requests with minimal user effort, supporting more efficient and reproducible evaluation in industrial settings. Our framework is publicly available at https://github.com/OpenDCAI/One‑Eval.
Authors:Davit Melikidze, Marian Schneider, Jessica Lam, Martin Wertich, Ido Hakimi, Barna Pásztor, Andreas Krause
Abstract:
Reinforcement Learning from Human Feedback (RLHF) has become the standard for aligning Large Language Models (LLMs), yet its efficacy is bottlenecked by the high cost of acquiring preference data, especially in low‑resource and expert domains. To address this, we introduce ACTIVEULTRAFEEDBACK, a modular active learning pipeline that leverages uncertainty estimates to dynamically identify the most informative responses for annotation. Our pipeline facilitates the systematic evaluation of standard response selection methods alongside DOUBLE REVERSE THOMPSON SAMPLING (DRTS) and DELTAUCB, two novel methods prioritizing response pairs with large predicted quality gaps, leveraging recent results showing that such pairs provide good signals for fine‑tuning. Our experiments demonstrate that ACTIVEULTRAFEEDBACK yields high‑quality datasets that lead to significant improvements in downstream performance, notably achieving comparable or superior results with as little as one‑sixth of the annotated data relative to static baselines. Our pipeline is available at https://github.com/lasgroup/ActiveUltraFeedback and our preference datasets at https://huggingface.co/ActiveUltraFeedback.
Authors:Palmer Schallon
Abstract:
We identify a systematic attention collapse pathology in the BLOOM family of transformer language models, where ALiBi positional encoding causes 31‑44% of attention heads to attend almost entirely to the beginning‑of‑sequence token. The collapse follows a predictable pattern across four model scales (560M to 7.1B parameters), concentrating in head indices where ALiBi's slope schedule imposes the steepest distance penalties. We introduce surgical reinitialization: targeted Q/K/V reinitialization with zeroed output projections and gradient‑masked freezing of all non‑surgical parameters. Applied to BLOOM‑1b7 on a single consumer GPU, the technique recovers 98.7% operational head capacity (242 to 379 of 384 heads) in two passes. A controlled comparison with C4 training data confirms that reinitialization ‑‑ not corpus content ‑‑ drives recovery, and reveals two distinct post‑surgical phenomena: early global functional redistribution that improves the model, and late local degradation that accumulates under noisy training signal. An extended experiment reinitializing mostly‑healthy heads alongside collapsed ones produces a model that transiently outperforms stock BLOOM‑1b7 by 25% on training perplexity (12.70 vs. 16.99), suggesting that pretrained attention configurations are suboptimal local minima. Code, checkpoints, and diagnostic tools are released as open‑source software.
Authors:Xiangsen Chen, Xuan Feng, Shuo Chen, Matthieu Maitre, Sudipto Rakshit, Diana Duvieilh, Ashley Picone, Nan Tang
Abstract:
Analyzing Open Source Intelligence (OSINT) from large volumes of data is critical for drafting and publishing comprehensive CTI reports. This process usually follows a three‑stage workflow ‑‑ triage, deep search and TI drafting. While Large Language Models (LLMs) offer a promising route toward automation, existing benchmarks still have limitations. These benchmarks often consist of tasks that do not reflect real‑world analyst workflows. For example, human analysts rarely receive tasks in the form of multiple‑choice questions. Also, existing benchmarks often rely on model‑centric metrics that emphasize lexical overlap rather than actionable, detailed insights essential for security analysts. Moreover, they typically fail to cover the complete three‑stage workflow. To address these issues, we introduce CyberThreat‑Eval, which is collected from the daily CTI workflow of a world‑leading company. This expert‑annotated benchmark assesses LLMs on practical tasks across all three stages as mentioned above. It utilizes analyst‑centric metrics that measure factual accuracy, content quality, and operational costs. Our evaluation using this benchmark reveals important insights into the limitations of current LLMs. For example, LLMs often lack the nuanced expertise required to handle complex details and struggle to distinguish between correct and incorrect information. To address these challenges, the CTI workflow incorporates both external ground‑truth databases and human expert knowledge. TRA allows human experts to iteratively provide feedback for continuous improvement. The code is available at \hrefhttps://github.com/xschen‑beb/CyberThreat‑Eval\textttGitHub and \hrefhttps://huggingface.co/datasets/xse/CyberThreat‑Eval\textttHuggingFace.
Authors:Tzu-Quan Lin, Wei-Ping Huang, Yi-Cheng Lin, Hung-yi Lee
Abstract:
While Contrastive Decoding (CD) has proven effective at enhancing Large Audio Language Models (LALMs), the underlying mechanisms driving its success and the comparative efficacy of different strategies remain unclear. This study systematically evaluates four distinct CD strategies across diverse LALM architectures. We identify Audio‑Aware Decoding and Audio Contrastive Decoding as the most effective methods. However, their impact varies significantly by model. To explain this variability, we introduce a Transition Matrix framework to map error pattern shifts during inference. Our analysis demonstrates that CD reliably rectifies errors in which models falsely claim an absence of audio or resort to uncertainty‑driven guessing. Conversely, it fails to correct flawed reasoning or confident misassertions. Ultimately, these findings provide a clear guideline for determining which LALM architectures are most suitable for CD enhancement based on their baseline error profiles.
Authors:Pranav Mantini, Shishir K. Shah
Abstract:
Recent advances in vision‑language models (VLMs) have demonstrated remarkable zero‑shot capabilities, yet adapting these models to specialized domains remains a significant challenge. Building on recent theoretical insights suggesting that independently trained VLMs are related by a canonical transformation, we extend this understanding to the concept of domains. We hypothesize that image features across disparate domains are related by a canonicalized geometric transformation that can be recovered using a small set of anchors. Few‑shot classification provides a natural setting for this alignment, as the limited labeled samples serve as the anchors required to estimate this transformation. Motivated by this hypothesis, we introduce BiCLIP, a framework that applies a targeted transformation to multimodal features to enhance cross‑modal alignment. Our approach is characterized by its extreme simplicity and low parameter footprint. Extensive evaluations across 11 standard benchmarks, including EuroSAT, DTD, and FGVCAircraft, demonstrate that BiCLIP consistently achieves state‑of‑the‑art results. Furthermore, we provide empirical verification of existing geometric findings by analyzing the orthogonality and angular distribution of the learned transformations, confirming that structured alignment is the key to robust domain adaptation. Code is available at https://github.com/QuantitativeImagingLaboratory/BilinearCLIP
Authors:Cornelius Emde, Alexander Rubinstein, Anmol Goel, Ahmed Heakl, Sangdoo Yun, Seong Joon Oh, Martin Gubri
Abstract:
The rapid adoption of LLM‑based agentic systems has produced a rich ecosystem of frameworks (smolagents, LangGraph, AutoGen, CAMEL, LlamaIndex, i.a.). Yet existing benchmarks are model‑centric: they fix the agentic setup and do not compare other system components. We argue that implementation decisions substantially impact performance, including choices such as topology, orchestration logic, and error handling. MASEval addresses this evaluation gap with a framework‑agnostic library that treats the entire system as the unit of analysis. Through a systematic system‑level comparison across 3 benchmarks, 3 models, and 3 frameworks, we find that framework choice matters as much as model choice. MASEval allows researchers to explore all components of agentic systems, opening new avenues for principled system design, and practitioners to identify the best implementation for their use case. MASEval is available under the MIT licence https://github.com/parameterlab/MASEval.
Authors:Shijia Liao, Yuxuan Wang, Songting Liu, Yifan Cheng, Ruoyi Zhang, Tianyu Li, Shidong Li, Yisheng Zheng, Xingwei Liu, Qingzheng Wang, Zhizhuo Zhou, Jiahua Liu, Xin Chen, Dawei Han
Abstract:
We introduce Fish Audio S2, an open‑sourced text‑to‑speech system featuring multi‑speaker, multi‑turn generation, and, most importantly, instruction‑following control via natural‑language descriptions. To scale training, we develop a multi‑stage training recipe together with a staged data pipeline covering video captioning and speech captioning, voice‑quality assessment, and reward modeling. To push the frontier of open‑source TTS, we release our model weights, fine‑tuning code, and an SGLang‑based inference engine. The inference engine is production‑ready for streaming, achieving an RTF of 0.195 and a time‑to‑first‑audio below 100 ms.Our code and weights are available on GitHub (https://github.com/fishaudio/fish‑speech) and Hugging Face (https://huggingface.co/fishaudio/s2‑pro). We highly encourage readers to visit https://fish.audio to try custom voices.
Authors:Junxian Li, Tu Lan, Haozhen Tan, Yan Meng, Haojin Zhu
Abstract:
Modern vision‑language‑model (VLM) based graphical user interface (GUI) agents are expected not only to execute actions accurately but also to respond to user instructions with low latency. While existing research on GUI‑agent security mainly focuses on manipulating action correctness, the security risks related to response efficiency remain largely unexplored. In this paper, we introduce SlowBA, a novel backdoor attack that targets the responsiveness of VLM‑based GUI agents. The key idea is to manipulate response latency by inducing excessively long reasoning chains under specific trigger patterns. To achieve this, we propose a two‑stage reward‑level backdoor injection (RBI) strategy that first aligns the long‑response format and then learns trigger‑aware activation through reinforcement learning. In addition, we design realistic pop‑up windows as triggers that naturally appear in GUI environments, improving the stealthiness of the attack. Extensive experiments across multiple datasets and baselines demonstrate that SlowBA can significantly increase response length and latency while largely preserving task accuracy. The attack remains effective even with a small poisoning ratio and under several defense settings. These findings reveal a previously overlooked security vulnerability in GUI agents and highlight the need for defenses that consider both action correctness and response efficiency. Code can be found in https://github.com/tu‑tuing/SlowBA.
Authors:Zhijun Wang, Ling Luo, Dinghao Pan, Huan Zhuang, Lejing Yu, Yuanyuan Sun, Hongfei Lin
Abstract:
Automated Drug Combination Extraction (DCE) from large‑scale biomedical literature is crucial for advancing precision medicine and pharmacological research. However, existing relation extraction methods primarily focus on binary interactions and struggle to model variable‑length n‑ary drug combinations, where complex compatibility logic and distributed evidence need to be considered. To address these limitations, we propose RexDrug, an end‑to‑end reasoning‑enhanced relation extraction framework for n‑ary drug combination extraction based on large language models. RexDrug adopts a two‑stage training strategy. First, a multi‑agent collaborative mechanism is utilized to automatically generate high‑quality expert‑like reasoning traces for supervised fine‑tuning. Second, reinforcement learning with a multi‑dimensional reward function specifically tailored for DCE is applied to further refine reasoning quality and extraction accuracy. Extensive experiments on the DrugComb dataset show that RexDrug consistently outperforms state‑of‑the‑art baselines for n‑ary extraction. Additional evaluation on the DDI13 corpus confirms its generalizability to binary drugdrug interaction tasks. Human expert assessment and automatic reasoning metrics further indicates that RexDrug produces coherent medical reasoning while accurately identifying complex therapeutic regimens. These results establish RexDrug as a scalable and reliable solution for complex biomedical relation extraction from unstructured text. The source code and data are available at https://github.com/DUTIR‑BioNLP/RexDrug
Authors:Yijun Zhu, Jianxin Wang, Chengchao Shen
Abstract:
Large Language Models (LLMs) have demonstrated exceptional performance across a wide range of tasks, yet their significant computational and memory requirements present major challenges for deployment. A common approach uses Taylor expansion on the loss function to estimate neuron importance. However, its reliance on one‑hot cross entropy loss, a key limitation is that it narrowly assesses importance based only on the probability assigned to the single predicted next token, thereby ignoring the other potential predictions of the original model. An intuitive solution to address this is to employ self distillation criterion for importance evaluation. However, this approach introduces significant computational overhead by requiring a separate teacher model for supervision. To this end, we propose a simple but effective criterion, information entropy of the model's output distribution, to efficiently evaluate importance scores of neurons with Taylor pruning without requirement of additional teacher. Compared to plain cross entropy criterion, it provides a more holistic criterion for Taylor pruning to prune neurons with the least impact on the prediction of model in a global manner, thereby preserving the fidelity of the model's predictive capabilities. Experimental results on extensive zero‑shot benchmarks demonstrate that our method consistently outperforms existing pruning methods across the LLaMA and Qwen series models. The source code and trained weights are availabel at https://github.com/visresearch/HFPrune.
Authors:Chenzhi Hu, Qinzhe Hu, Yuhang Xu, Junyi Chen, Ruijie Wang, Shengzhong Liu, Jianxin Li, Fan Wu, Guihai Chen
Abstract:
Large reasoning models (LRMs) like OpenAI o1 and DeepSeek‑R1 achieve high accuracy on complex tasks by adopting long chain‑of‑thought (CoT) reasoning paths. However, the inherent verbosity of these processes frequently results in redundancy and overthinking. To address this issue, existing works leverage Group Relative Policy Optimization (GRPO) to reduce LRM output length, but their static length reward design cannot dynamically adapt according to the relative problem difficulty and response length distribution, causing over‑compression and compromised accuracy. Therefore, we propose SmartThinker, a novel GRPO‑based efficient reasoning method with progressive CoT length calibration. SmartThinker makes a two‑fold contribution: First, it dynamically estimates the optimal length with peak accuracy during training and guides overlong responses toward it to reduce response length while sustaining accuracy. Second, it dynamically modulates the length reward coefficient to avoid the unwarranted penalization of correct reasoning paths. Extensive experiment results show that SmartThinker achieves up to 52.5% average length compression with improved accuracy, and achieves up to 16.6% accuracy improvement on challenging benchmarks like AIME25. The source code can be found at https://github.com/SJTU‑RTEAS/SmartThinker.
Authors:Hansi Zeng, Zoey Li, Yifan Gao, Chenwei Zhang, Xiaoman Pan, Tao Yang, Fengran Mo, Jiacheng Lin, Xian Li, Jingbo Shang
Abstract:
Research Agents enable models to gather information from the web using tools to answer user queries, requiring them to dynamically interleave internal reasoning with tool use. While such capabilities can in principle be learned via reinforcement learning with verifiable rewards (RLVR), we observe that agents often exhibit poor exploration behaviors, including premature termination and biased tool usage. As a result, RLVR alone yields limited improvements. We propose SynPlanResearch‑R1, a framework that synthesizes tool‑use trajectories that encourage deeper exploration to shape exploration during cold‑start supervised fine‑tuning, providing a strong initialization for subsequent RL. Across seven multi‑hop and open‑web benchmarks, \framework improves performance by up to 6.0% on Qwen3‑8B and 5.8% on Qwen3‑4B backbones respectively compared to SOTA baselines. Further analyses of tool‑use patterns and training dynamics compared to baselines shed light on the factors underlying these gains. Our code is publicly available at https://github.com/HansiZeng/syn‑plan‑research.
Authors:Trinh Pham, Thanh Tam Nguyen, Viet Huynh, Hongzhi Yin, Quoc Viet Hung Nguyen
Abstract:
Recent advances in large language models has strengthened Text2SQL systems that translate natural language questions into database queries. A persistent deployment challenge is to assess a newly trained Text2SQL system on an unseen and unlabeled dataset when no verified answers are available. This situation arises frequently because database content and structure evolve, privacy policies slow manual review, and carefully written SQL labels are costly and time‑consuming. Without timely evaluation, organizations cannot approve releases or detect failures early. FusionSQL addresses this gap by working with any Text2SQL models and estimating accuracy without reference labels, allowing teams to measure quality on unseen and unlabeled datasets. It analyzes patterns in the system's own outputs to characterize how the target dataset differs from the material used during training. FusionSQL supports pre‑release checks, continuous monitoring of new databases, and detection of quality decline. Experiments across diverse application settings and question types show that FusionSQL closely follows actual accuracy and reliably signals emerging issues. Our code is available at https://github.com/phkhanhtrinh23/FusionSQL.
Authors:Erik Miehling, Karthikeyan Natesan Ramamurthy, Praveen Venkateswaran, Irene Ko, Pierre Dognin, Moninder Singh, Tejaswini Pedapati, Avinash Balakrishnan, Matthew Riemer, Dennis Wei, Inge Vejsbjerg, Elizabeth M. Daly, Kush R. Varshney
Abstract:
The AI Steerability 360 toolkit is an extensible, open‑source Python library for steering LLMs. Steering abstractions are designed around four model control surfaces: input (modification of the prompt), structural (modification of the model's weights or architecture), state (modification of the model's activations and attentions), and output (modification of the decoding or generation process). Steering methods exert control on the model through a common interface, termed a steering pipeline, which additionally allows for the composition of multiple steering methods. Comprehensive evaluation and comparison of steering methods/pipelines is facilitated by use case classes (for defining tasks) and a benchmark class (for performance comparison on a given task). The functionality provided by the toolkit significantly lowers the barrier to developing and comprehensively evaluating steering methods. The toolkit is Hugging Face native and is released under an Apache 2.0 license at https://github.com/IBM/AISteer360.
Authors:Yuzhuang Xu, Xu Han, Yuxuan Li, Wanxiang Che
Abstract:
Although existing frameworks for large language model (LLM) inference on CPUs are mature, they fail to fully exploit the computation potential of many‑core CPU platforms. Many‑core CPUs are widely deployed in web servers and high‑end networking devices, and are typically organized into multiple NUMA nodes that group cores and memory. Current frameworks largely overlook the substantial overhead of cross‑NUMA memory access, limiting inference scalability and intelligence enabling on such platforms. To address this limitation, we build ArcLight, a lightweight LLM inference architecture designed from the ground up for many‑core CPUs. ArcLight integrates efficient memory management and thread scheduling, and introduces finely controlled tensor parallelism to mitigate the cross‑node memory access wall. Experimental results show that ArcLight significantly surpasses the performance ceiling of mainstream frameworks, achieving up to 46% higher inference throughput. Moreover, ArcLight maintains compatibility with arbitrary CPU devices. ArcLight is publicly available at https://github.com/OpenBMB/ArcLight.
Authors:A. J. W. de Vink, Filippos Karolos Ventirozos, Natalia Amat-Lefort, Lifeng Han
Abstract:
We present our system for SemEval‑2026 Task 3 on dimensional aspect‑based sentiment regression. Our approach combines a hybrid RoBERTa encoder, which jointly predicts sentiment using regression and discretized classification heads, with large language models (LLMs) via prediction‑level ensemble learning. The hybrid encoder improves prediction stability by combining continuous and discretized sentiment representations. We further explore in‑context learning with LLMs and ridge‑regression stacking to combine encoder and LLM predictions. Experimental results on the development set show that ensemble learning significantly improves performance over individual models, achieving substantial reductions in RMSE and improvements in correlation scores. Our findings demonstrate the complementary strengths of encoder‑based and LLM‑based approaches for dimensional sentiment analysis. Our development code and resources will be shared at https://github.com/aaronlifenghan/ABSentiment
Authors:Shih-Ying Yeh, Yueh-Feng Ku, Ko-Wei Huang, Buu-Khang Tu
Abstract:
Retrieval‑augmented generation (RAG) systems that answer questions from document collections face compounding difficulties when high‑precision citations are required: flat chunking strategies sacrifice document structure, single‑query formulations miss relevant passages through vocabulary mismatch, and single‑pass inference produces stochastic answers that vary in both content and citation selection. We present KohakuRAG, a hierarchical RAG framework that preserves document structure through a four‑level tree representation (document \rightarrow section \rightarrow paragraph \rightarrow sentence) with bottom‑up embedding aggregation, improves retrieval coverage through an LLM‑powered query planner with cross‑query reranking, and stabilizes answers through ensemble inference with abstention‑aware voting. We evaluate on the WattBot 2025 Challenge, a benchmark requiring systems to answer technical questions from 32 documents with \pm0.1% numeric tolerance and exact source attribution. KohakuRAG achieves first place on both public and private leaderboards (final score 0.861), as the only team to maintain the top position across both evaluation partitions. Ablation studies reveal that prompt ordering (+80% relative), retry mechanisms (+69%), and ensemble voting with blank filtering (+1.2pp) each contribute substantially, while hierarchical dense retrieval alone matches hybrid sparse‑dense approaches (BM25 adds only +3.1pp). We release KohakuRAG as open‑source software at https://github.com/KohakuBlueleaf/KohakuRAG.
Authors:Jiazhen Kang, Yuchen Lu, Chen Jiang, Jinrui Liu, Tianhao Zhang, Bo Jiang, Ningyuan Sun, Tongtong Wu, Guilin Qi
Abstract:
Code evolution is inevitable in modern software development. Changes to third‑party APIs frequently break existing code and complicate maintenance, posing practical challenges for developers. While large language models (LLMs) have shown promise in code generation, they struggle to reason without a structured representation of these evolving relationships, often leading them to produce outdated APIs or invalid outputs. In this work, we propose a knowledge graph‑augmented framework that decomposes the migration task into two synergistic stages: evolution path retrieval and path‑informed code generation. Our approach constructs static and dynamic API graphs to model intra‑version structures and cross‑version transitions, enabling structured reasoning over API evolution. Both modules are trained with synthetic supervision automatically derived from real‑world API diffs, ensuring scalability and minimal human effort. Extensive experiments across single‑package and multi‑package benchmarks demonstrate that our framework significantly improves migration accuracy, controllability, and execution success over standard LLM baselines. The source code and datasets are available at: https://github.com/kangjz1203/KCoEvo.
Authors:Tajamul Ashraf, Burhaan Rasheed Zargar, Saeed Abdul Muizz, Ifrah Mushtaq, Nazima Mehdi, Iqra Altaf Gillani, Aadil Amin Kak, Janibul Bashir
Abstract:
Kashmiri is spoken by around 7 million people but remains critically underserved in speech technology, despite its official status and rich linguistic heritage. The lack of robust Text‑to‑Speech (TTS) systems limits digital accessibility and inclusive human‑computer interaction for native speakers. In this work, we present the first dedicated open‑source neural TTS system designed for Kashmiri. We show that zero‑shot multilingual baselines trained for Indic languages fail to produce intelligible speech, achieving a Mean Opinion Score (MOS) of only 1.86, largely due to inadequate modeling of Perso‑Arabic diacritics and language‑specific phonotactics. To address these limitations, we propose Bolbosh, a supervised cross‑lingual adaptation strategy based on Optimal Transport Conditional Flow Matching (OT‑CFM) within the Matcha‑TTS framework. This enables stable alignment under limited paired data. We further introduce a three‑stage acoustic enhancement pipeline consisting of dereverberation, silence trimming, and loudness normalization to unify heterogeneous speech sources and stabilize alignment learning. The model vocabulary is expanded to explicitly encode Kashmiri graphemes, preserving fine‑grained vowel distinctions. Our system achieves a MOS of 3.63 and a Mel‑Cepstral Distortion (MCD) of 3.73, substantially outperforming multilingual baselines and establishing a new benchmark for Kashmiri speech synthesis. Our results demonstrate that script‑aware and supervised flow‑based adaptation are critical for low‑resource TTS in diacritic‑sensitive languages. Code and data are available at: https://github.com/gaash‑lab/Bolbosh.
Authors:Fei Cheng, Ribeka Tanaka, Sadao Kurohashi
Abstract:
Clinical information extraction (e.g., 2010 i2b2/VA challenge) usually presents tasks of concept recognition, assertion classification, and relation extraction. Jointly modeling the multi‑stage tasks in the clinical domain is an underexplored topic. The existing independent task setting (reference inputs given in each stage) makes the joint models not directly comparable to the existing pipeline work. To address these issues, we define a joint task setting and propose a novel end‑to‑end system to jointly optimize three‑stage tasks. We empirically investigate the joint evaluation of our proposal and the pipeline baseline with various embedding techniques: word, contextual, and in‑domain contextual embeddings. The proposed joint system substantially outperforms the pipeline baseline by +0.3, +1.4, +3.1 for the concept, assertion, and relation F1. This work bridges joint approaches and clinical information extraction. The proposed approach could serve as a strong joint baseline for future research. The code is publicly available.
Authors:Xiang Zhang, Hongming Xu, Le Zhou, Wei Zhou, Xuanhe Zhou, Guoliang Li, Yuyu Luo, Changdong Liu, Guorun Chen, Jiang Liao, Fan Wu
Abstract:
Enterprises commonly deploy heterogeneous database systems, each of which owns a distinct SQL dialect with different syntax rules, built‑in functions, and execution constraints. However, most existing NL2SQL methods assume a single dialect (e.g., SQLite) and struggle to produce queries that are both semantically correct and executable on target engines. Prompt‑based approaches tightly couple intent reasoning with dialect syntax, rule‑based translators often degrade native operators into generic constructs, and multi‑dialect fine‑tuning suffers from cross‑dialect interference. In this paper, we present Dial, a knowledge‑grounded framework for dialect‑specific NL2SQL. Dial introduces: (1) a Dialect‑Aware Logical Query Planning module that converts natural language into a dialect‑aware logical query plan via operator‑level intent decomposition and divergence‑aware specification; (2) HINT‑KB, a hierarchical intent‑aware knowledge base that organizes dialect knowledge into (i) a canonical syntax reference, (ii) a declarative function repository, and (iii) a procedural constraint repository; and (3) an execution‑driven debugging and semantic verification loop that separates syntactic recovery from logic auditing to prevent semantic drift. We construct DS‑NL2SQL, a benchmark covering six major database systems with 2,218 dialect‑specific test cases. Experimental results show that Dial consistently improves translation accuracy by 10.25% and dialect feature coverage by 15.77% over state‑of‑the‑art baselines. The code is at https://github.com/weAIDB/Dial.
Authors:Li Gu, Zihuan Jiang, Zhixiang Chi, Huan Liu, Ziqiang Wang, Yuanhao Yu, Glen Berseth, Yang Wang
Abstract:
Graphical user interface (GUI)‑based mobile agents automate digital tasks on mobile devices by interpreting natural‑language instructions and interacting with the screen. While recent methods apply reinforcement learning (RL) to train vision‑language‑model(VLM) agents in interactive environments with a primary focus on performance, generalization remains underexplored due to the lack of standardized benchmarks and open‑source RL systems. In this work, we formalize the problem as a Contextual Markov Decision Process (CMDP) and introduce AndroidWorld‑Generalization, a benchmark with three increasingly challenging regimes for evaluating zero‑shot generalization to unseen task instances, templates, and applications. We further propose an RL training system that integrates Group Relative Policy Optimization (GRPO) with a scalable rollout collection system, consisting of containerized infrastructure and asynchronous execution % , and error recovery to support reliable and efficient training. Experiments on AndroidWorld‑Generalization show that RL enables a 7B‑parameter VLM agent to surpass supervised fine‑tuning baselines, yielding a 26.1% improvement on unseen instances but only limited gains on unseen templates (15.7%) and apps (8.3%), underscoring the challenges of generalization. As a preliminary step, we demonstrate that few‑shot adaptation at test‑time improves performance on unseen apps, motivating future research in this direction. To support reproducibility and fair comparison, we open‑source the full RL training system, including the environment, task suite, models, prompt configurations, and the underlying infrastructure \footnotehttps://github.com/zihuanjiang/AndroidWorld‑Generalization.
Authors:Nouran Khallaf, Serge Sharoff
Abstract:
Noisy training data can significantly degrade the performance of language‑model‑based classifiers, particularly in non‑topical classification tasks. In this study we designed a methodological framework to assess the impact of denoising. More specifically, we explored a range of denoising strategies for sentence‑level difficulty detection, using training data derived from document‑level difficulty annotations obtained through noisy crowdsourcing. Beyond monolingual settings, we also address cross‑lingual transfer, where a multilingual language model is trained in one language and tested in another. We evaluate several noise reduction techniques, including Gaussian Mixture Models (GMM), Co‑Teaching, Noise Transition Matrices, and Label Smoothing. Our results indicate that while BERT‑based models exhibit inherent robustness to noise, incorporating explicit noise detection can further enhance performance. For our smaller dataset, GMM‑based noise filtering proves particularly effective in improving prediction quality by raising the Area‑Under‑the‑Curve score from 0.52 to 0.92, or to 0.93 when de‑noising methods are combined. However, for our larger dataset, the intrinsic regularisation of pre‑trained language models provides a strong baseline, with denoising methods yielding only marginal gains (from 0.92 to 0.94, while a combination of two denoising methods made no contribution). Nonetheless, removing noisy sentences (about 20% of the dataset) helps in producing a cleaner corpus with fewer infelicities. As a result we have released the largest multilingual corpus for sentence difficulty prediction: see https://github.com/Nouran‑Khallaf/denoising‑difficulty
Authors:Nouran Khallaf, Serge Sharoff
Abstract:
This study examines the role of uncertainty estimation (UE) methods in multilingual text classification under noisy and non‑topical conditions. Using a complex‑vs‑simple sentence classification task across several languages, we evaluate a range of UE techniques against a range of metrics to assess their contribution to making more robust predictions. Results indicate that while methods relying on softmax outputs remain competitive in high‑resource in‑domain settings, their reliability declines in low‑resource or domain‑shift scenarios. In contrast, Monte Carlo dropout approaches demonstrate consistently strong performance across all languages, offering more robust calibration, stable decision thresholds, and greater discriminative power even under adverse conditions. We further demonstrate the positive impact of UE on non‑topical classification: abstaining from predicting the 10% most uncertain instances increases the macro F1 score from 0.81 to 0.85 in the Readme task. By integrating UE with trustworthiness metrics, this study provides actionable insights for developing more reliable NLP systems in real‑world multilingual environments. See https://github.com/Nouran‑Khallaf/To‑Predict‑or‑Not‑to‑Predict
Authors:Yoshiki Tanaka, Ryuichi Uehara, Koji Inoue, Michimasa Inaba
Abstract:
Emotion Recognition in Conversation (ERC) is critical for enabling natural human‑machine interactions. However, existing methods predominantly employ categorical or dimensional emotion annotations, which often fail to adequately represent complex, subtle, or culturally specific emotional nuances. To overcome this limitation, we propose a novel task named Emotion Transcription in Conversation (ETC). This task focuses on generating natural language descriptions that accurately reflect speakers' emotional states within conversational contexts. To address the ETC, we constructed a Japanese dataset comprising text‑based dialogues annotated with participants' self‑reported emotional states, described in natural language. The dataset also includes emotion category labels for each transcription, enabling quantitative analysis and its application to ERC. We benchmarked baseline models, finding that while fine‑tuning on our dataset enhances model performance, current models still struggle to infer implicit emotional states. The ETC task will encourage further research into more expressive emotion understanding in dialogue. The dataset is publicly available at https://github.com/UEC‑InabaLab/ETCDataset.
Authors:Muhammad Khalifa, Zohaib Khan, Omer Tafveez, Hao Peng, Lu Wang
Abstract:
Reward hacking is a form of misalignment in which models overoptimize proxy rewards without genuinely solving the underlying task. Precisely measuring reward hacking occurrence remains challenging because true task rewards are often expensive or impossible to compute. We introduce Countdown‑Code, a minimal environment where models can both solve a mathematical reasoning task and manipulate the test harness. This dual‑access design creates a clean separation between proxy rewards (test pass/fail) and true rewards (mathematical correctness), enabling accurate measurement of reward‑hacking rates. Using this environment, we study reward hacking in open‑weight LLMs and find that such behaviors can be unintentionally learned during supervised fine‑tuning (SFT) when even a small fraction of reward‑hacking trajectories leak into training data. As little as 1% contamination in distillation SFT data is sufficient for models to internalize reward hacking which resurfaces during subsequent reinforcement learning (RL). We further show that RL amplifies misalignment and drives its generalization beyond the original domain. We open‑source our environment and code to facilitate future research on reward hacking in LLMs. Our results reveal a previously underexplored pathway through which reward hacking can emerge and persist in LLMs, underscoring the need for more rigorous validation of synthetic SFT data. Code is available at https://github.com/zohaib‑khan5040/Countdown‑Code.
Authors:Karen Zhou, Chenhao Tan
Abstract:
Checklists have emerged as a popular approach for interpretable and fine‑grained evaluation, particularly with LLM‑as‑a‑Judge. Beyond evaluation, these structured criteria can serve as signals for model alignment, reinforcement learning, and self‑correction. To support these use cases, we present AutoChecklist, an open‑source library that unifies checklist‑based evaluation into composable pipelines. At its core is a taxonomy of five checklist generation abstractions, each encoding a distinct strategy for deriving evaluation criteria. A modular Generator \rightarrow Refiner \rightarrow Scorer pipeline connects any generator with a unified scorer, and new configurations can be registered via prompt templates alone. The library ships with ten built‑in pipelines implementing published approaches and supports multiple LLM providers (OpenAI, OpenRouter, vLLM). Beyond the Python API, the library includes a CLI for off‑the‑shelf evaluation and a web interface for interactive exploration. Validation experiments confirm that these checklist methods significantly align with human preferences and quality ratings, and a case study on ICLR peer review rebuttals demonstrates flexible domain adaptation. AutoChecklist is publicly available at https://github.com/ChicagoHAI/AutoChecklist.
Authors:Zhenyu Lei, Qiong Wu, Jianxiong Dong, Yinhan He, Emily Dodwell, Yushun Dong, Jundong Li
Abstract:
Large language models (LLMs) often exhibit flawed reasoning ability that undermines reliability. Existing approaches to improving reasoning typically treat it as a general and monolithic skill, applying broad training which is inefficient and unable to target specific reasoning errors. We introduce Reasoning Editing, a paradigm for selectively modifying specific reasoning patterns in LLMs while preserving other reasoning pathways. This task presents a fundamental trade‑off between Generality, the ability of an edit to generalize across different tasks sharing the same reasoning pattern, and Locality, the ability to preserve other reasoning capabilities. Through systematic investigation, we uncover the Circuit‑Interference Law: Edit interference between reasoning patterns is proportional to the overlap of their neural circuits. Guided by this principle, we propose REdit, the first framework to actively reshape neural circuits before editing, thereby modulating interference between reasoning patterns and mitigating the trade‑off. REdit integrates three components: (i) Contrastive Circuit Reshaping, which directly addresses the generality‑locality trade‑off by disentangling overlapping circuits; (ii) Meta‑Contrastive Learning, which extends transferability to novel reasoning patterns; and (iii) Dual‑Level Protection, which preserves preexisting abilities by constraining reshaping update directions and regularizing task‑level predictions. Extensive experiments with Qwen‑2.5‑3B on propositional logic reasoning tasks across three difficulty levels demonstrate that REdit consistently achieves superior generality and locality compared to baselines, with additional validation in mathematics showing broader potential. Our code is available at https://github.com/LzyFischer/REdit.
Authors:Swamynathan V P
Abstract:
Test‑Time Training (TTT) language models achieve theoretically infinite context windows with an O(1) memory footprint by replacing the standard exact‑attention KV‑cache with hidden state ``fast weights'' W_fast updated via self‑supervised learning during inference. However, pure TTT architectures suffer catastrophic failures on exact‑recall tasks (e.g., Needle‑in‑a‑Haystack). Because the fast weights aggressively compress the context into an information bottleneck, highly surprising or unique tokens are rapidly overwritten and forgotten by subsequent token gradient updates. We introduce SR‑TTT (Surprisal‑Aware Residual Test‑Time Training), which resolves this recall failure by augmenting the TTT backbone with a loss‑gated sparse memory mechanism. By dynamically routing only incompressible, highly surprising tokens to a traditional exact‑attention Residual Cache, SR‑TTT preserves O(1) memory for low‑entropy background context while utilizing exact attention exclusively for critical needles. Our complete implementation, training scripts, and pre‑trained weights are open‑source and available at: https://github.com/swamynathanvp/Surprisal‑Aware‑Residual‑Test‑Time‑Training.
Authors:Fali Wang, Chenglin Weng, Xianren Zhang, Siyuan Hong, Hui Liu, Suhang Wang
Abstract:
The growing demand for automated graph algorithm reasoning has attracted increasing attention in the large language model (LLM) community. Recent LLM‑based graph reasoning methods typically decouple task descriptions from graph data, generate executable code augmented by retrieval from technical documentation, and refine the code through debugging. However, we identify two key limitations in existing approaches: (i) they treat technical documentation as flat text collections and ignore its hierarchical structure, leading to noisy retrieval that degrades code generation quality; and (ii) their debugging mechanisms focus primarily on runtime errors, yet ignore more critical logical errors. To address them, we propose \method, an agentic hierarchical retrieval‑augmented coding framework that exploits the document hierarchy through top‑down traversal and early pruning, together with a self‑debugging coding agent that iteratively refines code using automatically generated small‑scale test cases. To enable comprehensive evaluation of complex graph reasoning, we introduce a new dataset, \dataset, covering small‑scale, large‑scale, and composite graph reasoning tasks. Extensive experiments demonstrate that our method achieves higher task accuracy and lower inference cost compared to baselines\footnoteThe code is available at \hrefhttps://github.com/FairyFali/GraphSkill\textcolorbluehttps://github.com/FairyFali/GraphSkill..
Authors:Leo Schwinn, Moritz Ladenburger, Tim Beyer, Mehrnaz Mofakhami, Gauthier Gidel, Stephan Günnemann
Abstract:
Automated \enquoteLLM‑as‑a‑Judge frameworks have become the de facto standard for scalable evaluation across natural language processing. For instance, in safety evaluation, these judges are relied upon to evaluate harmfulness in order to benchmark the robustness of safety against adversarial attacks. However, we show that existing validation protocols fail to account for substantial distribution shifts inherent to red‑teaming: diverse victim models exhibit distinct generation styles, attacks distort output patterns, and semantic ambiguity varies significantly across jailbreak scenarios. Through a comprehensive audit using 6642 human‑verified labels, we reveal that the unpredictable interaction of these shifts often causes judge performance to degrade to near random chance. This stands in stark contrast to the high human agreement reported in prior work. Crucially, we find that many attacks inflate their success rates by exploiting judge insufficiencies rather than eliciting genuinely harmful content. To enable more reliable evaluation, we propose ReliableBench, a benchmark of behaviors that remain more consistently judgeable, and JudgeStressTest, a dataset designed to expose judge failures. Data available at: https://github.com/SchwinnL/LLMJudgeReliability.
Authors:Ching-Yun Ko, Pin-Yu Chen
Abstract:
Modern artificial intelligence (AI) models are deployed on inference engines to optimize runtime efficiency and resource allocation, particularly for transformer‑based large language models (LLMs). The vLLM project is a major open‑source library to support model serving and inference. However, the current implementation of vLLM limits programmability of the internal states of deployed models. This prevents the use of popular test‑time model alignment and enhancement methods. For example, it prevents the detection of adversarial prompts based on attention patterns or the adjustment of model responses based on activation steering. To bridge this critical gap, we present vLLM Hook, an opensource plug‑in to enable the programming of internal states for vLLM models. Based on a configuration file specifying which internal states to capture, vLLM Hook provides seamless integration to vLLM and supports two essential features: passive programming and active programming. For passive programming, vLLM Hook probes the selected internal states for subsequent analysis, while keeping the model generation intact. For active programming, vLLM Hook enables efficient intervention of model generation by altering the selected internal states. In addition to presenting the core functions of vLLM Hook, in version 0, we demonstrate 3 use cases including prompt injection detection, enhanced retrieval‑augmented retrieval (RAG), and activation steering. Finally, we welcome the community's contribution to improve vLLM Hook via https://github.com/ibm/vllm‑hook.
Authors:Kartik Sharma, Rakshit S. Trivedi
Abstract:
Activation steering methods enable inference‑time control of large language model (LLM) behavior without retraining, but current approaches face a fundamental trade‑off: sample‑efficient methods suboptimally capture steering signals from labeled examples, while methods that better extract these signals require hundreds to thousands of examples. We introduce COLD‑Steer, a training‑free framework that steers LLM activations by approximating the representational changes that would result from gradient descent on in‑context examples. Our key insight is that the effect of fine‑tuning on a small set of examples can be efficiently approximated at inference time without actual parameter updates. We formalize this through two complementary approaches: (i) a unit kernel approximation method that updates the activations directly using gradients with respect to them, normalized across examples, and (ii) a finite‑difference approximation requiring only two forward passes regardless of example count. Experiments across a variety of steering tasks and benchmarks demonstrate that COLD‑Steer achieves upto 95% steering effectiveness while using 50 times fewer samples compared to the best baseline. COLD‑Steer facilitates accommodating diverse perspectives without extensive demonstration data, which we validate through our experiments on pluralistic alignment tasks. Our framework opens new possibilities for adaptive, context‑aware model control that can flexibly address varying loss‑driven human preferences through principled approximation of learning dynamics rather than specialized training procedures.
Authors:Minh Hoang Nguyen, Vu Hoang Pham, Xuan Thanh Huynh, Phuc Hong Mai, Vinh The Nguyen, Quang Nhut Huynh, Huy Tien Nguyen, Tung Le
Abstract:
Large language models (LLMs) have recently reshaped Automated Essay Scoring (AES), yet prior studies typically examine individual techniques in isolation, limiting understanding of their relative merits for English as a Second Language (L2) writing. To bridge this gap, we presents a comprehensive comparison of major LLM‑based AES paradigms on IELTS Writing Task~2. On this unified benchmark, we evaluate four approaches: (i) encoder‑based classification fine‑tuning, (ii) zero‑ and few‑shot prompting, (iii) instruction tuning and Retrieval‑Augmented Generation (RAG), and (iv) Supervised Fine‑Tuning combined with Direct Preference Optimization (DPO) and RAG. Our results reveal clear accuracy‑cost‑robustness trade‑offs across methods, the best configuration, integrating k‑SFT and RAG, achieves the strongest overall results with F1‑Score 93%. This study offers the first unified empirical comparison of modern LLM‑based AES strategies for English L2, promising potential in auto‑grading writing tasks. Code is public at https://github.com/MinhNguyenDS/LLM_AES‑EnL2
Authors:Koki Itai, Shunichi Hasegawa, Yuta Yamamoto, Gouki Minegishi, Masaki Otsuki
Abstract:
Retrieval‑Augmented Generation (RAG) is a framework in which a Generator, such as a Large Language Model (LLM), produces answers by retrieving documents from an external collection using a Retriever. In practice, Generators must integrate evidence from long contexts, perform multi‑step reasoning, interpret tables, and abstain when evidence is missing. However, existing benchmarks for Generators provide limited coverage, with none enabling simultaneous evaluation of multiple capabilities under unified conditions. To bridge the gap between existing evaluations and practical use, we introduce LIT‑RAGBench (the Logic, Integration, Table, Reasoning, and Abstention RAG Generator Benchmark), which defines five categories: Integration, Reasoning, Logic, Table, and Abstention, each further divided into practical evaluation aspects. LIT‑RAGBench systematically covers patterns combining multiple aspects across categories. By using fictional entities and scenarios, LIT‑RAGBench evaluates answers grounded in the provided external documents. The dataset consists of 114 human‑constructed Japanese questions and an English version generated by machine translation with human curation. We use LLM‑as‑a‑Judge for scoring and report category‑wise and overall accuracy. Across API‑based and open‑weight models, no model exceeds 90% overall accuracy. By making strengths and weaknesses measurable within each category, LIT‑RAGBench serves as a valuable metric for model selection in practical RAG deployments and for building RAG‑specialized models. We release LIT‑RAGBench, including the dataset and evaluation code, at https://github.com/Koki‑Itai/LIT‑RAGBench.
Authors:Mohammed Baharoon, Thibault Heintz, Siavash Raissi, Mahmoud Alabbad, Mona Alhammad, Hassan AlOmaish, Sung Eun Kim, Oishi Banerjee, Pranav Rajpurkar
Abstract:
We introduce CRIMSON, a clinically grounded evaluation framework for chest X‑ray report generation that assesses reports based on diagnostic correctness, contextual relevance, and patient safety. Unlike prior metrics, CRIMSON incorporates full clinical context, including patient age, indication, and guideline‑based decision rules, and prevents normal or clinically insignificant findings from exerting disproportionate influence on the overall score. The framework categorizes errors into a comprehensive taxonomy covering false findings, missing findings, and eight attribute‑level errors (e.g., location, severity, measurement, and diagnostic overinterpretation). Each finding is assigned a clinical significance level (urgent, actionable non‑urgent, non‑actionable, or expected/benign), based on a guideline developed in collaboration with attending cardiothoracic radiologists, enabling severity‑aware weighting that prioritizes clinically consequential mistakes over benign discrepancies. CRIMSON is validated through strong alignment with clinically significant error counts annotated by six board‑certified radiologists in ReXVal (Kendalls tau = 0.61‑0.71; Pearsons r = 0.71‑0.84), and through two additional benchmarks that we introduce. In RadJudge, a targeted suite of clinically challenging pass‑fail scenarios, CRIMSON shows consistent agreement with expert judgment. In RadPref, a larger radiologist preference benchmark of over 100 pairwise cases with structured error categorization, severity modeling, and 1‑5 overall quality ratings from three cardiothoracic radiologists, CRIMSON achieves the strongest alignment with radiologist preferences. We release the metric, the evaluation benchmarks, RadJudge and RadPref, and a fine‑tuned MedGemma model to enable reproducible evaluation of report generation, all available at https://github.com/rajpurkarlab/CRIMSON.
Authors:Yang Liu, Jinxuan Cai, Yishen Li, Qi Meng, Zedi Liu, Xin Li, Chen Qian, Chuan Shi, Cheng Yang
Abstract:
Large language model‑based (LLM‑based) multi‑agent systems (MAS) are increasingly used to extend agentic problem solving via role specialization and collaboration. MAS workflows can be naturally modeled as directed computation graphs, where nodes execute agents/sub‑workflows and edges encode dependencies and message passing. However, implementing complex graph workflows in current frameworks still requires substantial manual effort, offers limited reuse, and makes it difficult to integrate heterogeneous external context sources. To overcome these limitations, we present MASFactory, a graph‑centric framework for orchestrating LLM‑based MAS. It introduces Vibe Graphing, a human‑in‑the‑loop approach that compiles natural‑language intent into an editable workflow specification and then into an executable graph. In addition, the framework provides reusable components and pluggable context integration, as well as a visualizer for topology preview, runtime tracing, and human‑in‑the‑loop interaction. We evaluate MASFactory on seven public benchmarks, validating both reproduction consistency for representative MAS methods and the effectiveness of Vibe Graphing. Our code (https://github.com/BUPT‑GAMMA/MASFactory) and video (https://youtu.be/ANynzVfY32k) are publicly available.
Authors:Bingfeng Chen, Shaobin Shi, Yongqi Luo, Boyan Xu, Ruichu Cai, Zhifeng Hao
Abstract:
Generative language models have shown significant potential in single‑turn Text‑to‑SQL. However, their performance does not extend equivalently to multi‑turn Text‑to‑SQL. This is primarily due to generative language models' inadequacy in handling the complexities of context information and dynamic schema linking in multi‑turn interactions. In this paper, we propose a framework named Track‑SQL, which enhances generative language models with dual‑extractive modules designed to track schema and contextual changes in multi‑turn Text‑to‑SQL. Specifically, Track‑SQL incorporates a \emphSemantic‑enhanced Schema Extractor and a \emphSchema‑aware Context Extractor. Experimental results demonstrate that Track‑SQL achieves state‑of‑the‑art performance on the SparC and CoSQL datasets. Furthermore, detailed ablation studies reveal that Track‑SQL significantly improves execution accuracy in multi‑turn interactions by 7.1% and 9.55% on these datasets, respectively. Our implementation will be open‑sourced at https://github.com/DMIRLAB‑Group/Track‑SQL.
Authors:Jiayang Sun, Zixin Guo, Min Cao, Guibo Zhu, Jorma Laaksonen
Abstract:
Change captioning generates descriptions that explicitly describe the differences between two visually similar images. Existing methods operate on static image pairs, thus ignoring the rich temporal dynamics of the change procedure, which is the key to understand not only what has changed but also how it occurs. We introduce ProCap, a novel framework that reformulates change modeling from static image comparison to dynamic procedure modeling. ProCap features a two‑stage design: The first stage trains a procedure encoder to learn the change procedure from a sparse set of keyframes. These keyframes are obtained by automatically generating intermediate frames to make the implicit procedural dynamics explicit and then sampling them to mitigate redundancy. Then the encoder learns to capture the latent dynamics of these keyframes via a caption‑conditioned, masked reconstruction task. The second stage integrates this trained encoder within an encoder‑decoder model for captioning. Instead of relying on explicit frames from the previous stage ‑‑ a process incurring computational overhead and sensitivity to visual noise ‑‑ we introduce learnable procedure queries to prompt the encoder for inferring the latent procedure representation, which the decoder then translates into text. The entire model is then trained end‑to‑end with a captioning loss, ensuring the encoder's output is both temporally coherent and captioning‑aligned. Experiments on three datasets demonstrate the effectiveness of ProCap. Code and pre‑trained models are available at https://github.com/BlueberryOreo/ProCap
Authors:Omar Shaikh, Valentin Teutschbein, Kanishk Gandhi, Yikun Chi, Nick Haber, Thomas Robinson, Nilam Ram, Byron Reeves, Sherry Yang, Michael S. Bernstein, Diyi Yang
Abstract:
Truly proactive AI systems must anticipate what we will do next. This foresight demands far richer information than the sparse signals we type into our prompts ‑‑ it demands reasoning over the entire context of what we see and do. We formalize this as next action prediction (NAP): given a sequence of a user's multimodal interactions with a computer (screenshots, clicks, sensor data), predict that user's next action. Progress on this task requires both new data and modeling approaches. To scale data, we annotate longitudinal, naturalistic computer use with vision‑language models. We release an open‑source pipeline for performing this labeling on private infrastructure, and label over 360K actions across one month of continuous phone usage from 20 users, amounting to 1,800 hours of screen time. We then introduce LongNAP, a user model that combines parametric and in‑context learning to reason over long interaction histories. LongNAP is trained via policy gradient methods to generate user‑specific reasoning traces given some context; retrieve relevant traces from a library of past traces; and then apply retrieved traces in‑context to predict future actions. Using an LLM‑as‑judge evaluation metric (0‑1 similarity to ground truth), LongNAP significantly outperforms supervised finetuning and prompted baselines on held‑out data (by 79% and 39% respectively). Additionally, LongNAP generalizes to held out users when trained across individuals. The space of next actions a user might take at any moment is unbounded, spanning thousands of possible outcomes. Despite this, 17.1% of LongNAP's predicted trajectories are well‑aligned with what a user does next (LLM‑judge score \geq 0.5). This rises to 26% when we filter to highly confident predictions. In sum, we argue that learning from the full context of user behavior to anticipate user needs is now a viable task with substantial opportunity.
Authors:Junjie Li, Xinrui Guo, Yuhao Wu, Roy Ka-Wei Lee, Hongzhi Li, Yutao Xie
Abstract:
What happens when a storyteller forgets its own story? Large Language Models (LLMs) can now generate narratives spanning tens of thousands of words, but they often fail to maintain consistency throughout. When generating long‑form narratives, these models can contradict their own established facts, character traits, and world rules. Existing story generation benchmarks focus mainly on plot quality and fluency, leaving consistency errors largely unexplored. To address this gap, we present ConStory‑Bench, a benchmark designed to evaluate narrative consistency in long‑form story generation. It contains 2,000 prompts across four task scenarios and defines a taxonomy of five error categories with 19 fine‑grained subtypes. We also develop ConStory‑Checker, an automated pipeline that detects contradictions and grounds each judgment in explicit textual evidence. Evaluating a range of LLMs through five research questions, we find that consistency errors show clear tendencies: they are most common in factual and temporal dimensions, tend to appear around the middle of narratives, occur in text segments with higher token‑level entropy, and certain error types tend to co‑occur. These findings can inform future efforts to improve consistency in long‑form narrative generation. Our project page is available at https://picrew.github.io/constory‑bench.github.io/.
Authors:Mingluo Su, Huan Wang
Abstract:
Pruning is widely recognized as an effective method for reducing the parameters of large language models (LLMs), potentially leading to more efficient deployment and inference. One classic and prominent path of LLM one‑shot pruning is to leverage second‑order gradients (i.e., Hessian), represented by the pioneering work SparseGPT. However, the predefined left‑to‑right pruning order in SparseGPT leads to suboptimal performance when the weights exhibit columnar patterns. This paper studies the effect of pruning order under the SparseGPT framework. The analyses lead us to propose ROSE, a reordered SparseGPT method that prioritizes weights with larger potential pruning errors to be pruned earlier. ROSE first performs pre‑pruning to identify candidate weights for removal, and estimates both column and block pruning loss. Subsequently, two‑level reordering is performed: columns within each block are reordered in descending order of column loss, while blocks are reordered based on block loss. We introduce the relative range of block loss as a metric to identify columnar layers, enabling adaptive reordering across the entire model. Substantial empirical results on prevalent LLMs (LLaMA2‑7B/13B/70B, LLaMA3‑8B, Mistral‑7B) demonstrate that ROSE surpasses the original SparseGPT and other counterpart pruning methods. Our code is available at https://github.com/mingluo‑su/ROSE.
Authors:Juyong Jiang, Jiasi Shen, Sunghun Kim, Kang Min Yoo, Jeonghoon Kim, Sungju Kim
Abstract:
While Large Language Models (LLMs) have revolutionized code generation, standard "System 1" approaches, generating solutions in a single forward pass, often hit a performance ceiling when faced with complex algorithmic tasks. Existing iterative refinement strategies attempt to bridge this gap at inference time, yet they predominantly rely on external oracles, execution feedback, or computationally expensive prompt‑response cycles. In this work, we propose ReflexiCoder, a novel reinforcement learning (RL) framework that internalizes the structured reasoning trajectory, encompassing initial generation, bug and optimization aware reflection, and self‑correction, directly into the model's weights. Unlike prior methods, ReflexiCoder shifts the paradigm from external‑dependent refinement to an intrinsic, fully autonomous self‑reflection and self‑correction capabilities at inference time. We utilize an RL‑zero training paradigm with granular reward functions to optimize the entire reflection‑correction trajectory, teaching the model how to debug without reliance on ground‑truth feedback or execution engines at inference time. Extensive experiments across seven benchmarks demonstrate that our ReflexiCoder‑8B establishes a new state‑of‑the‑art (SOTA) among leading open‑source models in the 1.5B‑14B range, achieving 94.51% (87.20%) on HumanEval (Plus), 81.80% (78.57%) on MBPP (Plus), 35.00% on BigCodeBench, 52.21% on LiveCodeBench, and 37.34% on CodeForces in a single‑attempt setting, rivaling or surpassing proprietary models like GPT‑5.1. Notably, our framework is significantly more token‑efficient than base models, reducing inference‑time compute overhead by approximately 40% through disciplined, high‑speed reasoning and reflection patterns. Source code is available at https://github.com/juyongjiang/ReflexiCoder.
Authors:Xisen Jin, Michael Duan, Qin Lin, Aaron Chan, Zhenglun Chen, Junyi Du, Xiang Ren
Abstract:
As AI agents become widely deployed as online services, users often rely on an agent developer's claim about how safety is enforced, which introduces a threat where safety measures are falsely advertised. To address the threat, we propose proof‑of‑guardrail, a system that enables developers to provide cryptographic proof that a response is generated after a specific open‑source guardrail. To generate proof, the developer runs the agent and guardrail inside a Trusted Execution Environment (TEE), which produces a TEE‑signed attestation of guardrail code execution verifiable by any user offline. We implement proof‑of‑guardrail for OpenClaw agents and evaluate latency overhead and deployment cost. Proof‑of‑guardrail ensures integrity of guardrail execution while keeping the developer's agent private, but we also highlight a risk of deception about safety, for example, when malicious developers actively jailbreak the guardrail. Code and demo video: https://github.com/SaharaLabsAI/Verifiable‑ClawGuard
Authors:Tianhao Chen, Xin Xu, Lu Yin, Hao Chen, Yang Wang, Shizhe Diao, Can Yang
Abstract:
Transformer architectures serve as the backbone for most modern Large Language Models, therefore their pretraining stability and convergence speed are of central concern. Motivated by the logical dependency of sequentially stacked layers, we propose Progressive Residual Warmup (ProRes) for language model pretraining. ProRes implements an "early layer learns first" philosophy by multiplying each layer's residual with a scalar that gradually warms up from 0 to 1, with deeper layers taking longer warmup steps. In this way, deeper layers wait for early layers to settle into a more stable regime before contributing to learning. We demonstrate the effectiveness of ProRes through pretraining experiments across various model scales, as well as normalization and initialization schemes. Comprehensive analysis shows that ProRes not only stabilizes pretraining but also introduces a unique optimization trajectory, leading to faster convergence, stronger generalization and better downstream performance. Our code is available at https://github.com/dandingsky/ProRes.
Authors:Qiao Jin, Yin Fang, Lauren He, Yifan Yang, Guangzhi Xiong, Zhizheng Wang, Nicholas Wan, Joey Chan, Donald C. Comeau, Robert Leaman, Charalampos S. Floudas, Aidong Zhang, Michael F. Chiang, Yifan Peng, Zhiyong Lu
Abstract:
Assessing whether an article supports an assertion is essential for hallucination detection and claim verification. While large language models (LLMs) have the potential to automate this task, achieving strong performance requires frontier models such as GPT‑5 that are prohibitively expensive to deploy at scale. To efficiently perform biomedical evidence attribution, we present Med‑V1, a family of small language models with only three billion parameters. Trained on high‑quality synthetic data newly developed in this study, Med‑V1 substantially outperforms (+27.0% to +71.3%) its base models on five biomedical benchmarks unified into a verification format. Despite its smaller size, Med‑V1 performs comparably to frontier LLMs such as GPT‑5, along with high‑quality explanations for its predictions. We use Med‑V1 to conduct a first‑of‑its‑kind use case study that quantifies hallucinations in LLM‑generated answers under different citation instructions. Results show that the format instruction strongly affects citation validity and hallucination, with GPT‑5 generating more claims but exhibiting hallucination rates similar to GPT‑4o. Additionally, we present a second use case showing that Med‑V1 can automatically identify high‑stakes evidence misattributions in clinical practice guidelines, revealing potentially negative public health impacts that are otherwise challenging to identify at scale. Overall, Med‑V1 provides an efficient and accurate lightweight alternative to frontier LLMs for practical and real‑world applications in biomedical evidence attribution and verification tasks. Med‑V1 is available at https://github.com/ncbi‑nlp/Med‑V1.
Authors:Luca Della Libera, Cem Subakan, Mirco Ravanelli
Abstract:
Large language models show that simple autoregressive training can yield scalable and coherent generation, but extending this paradigm to speech remains challenging due to the entanglement of semantic and acoustic information. Most existing speech language models rely on text supervision, hierarchical token streams, or complex hybrid architectures, departing from the single‑stream generative pretraining paradigm that has proven effective in text. In this work, we introduce WavSLM, a speech language model trained by quantizing and distilling self‑supervised WavLM representations into a single codebook and optimizing an autoregressive next‑chunk prediction objective. WavSLM jointly models semantic and acoustic information within a single token stream without text supervision or text pretraining. Despite its simplicity, it achieves competitive performance on consistency benchmarks and speech generation while using fewer parameters, less training data, and supporting streaming inference. Demo samples are available at https://lucadellalib.github.io/wavslm‑web/.
Authors:Hieu Pham Dinh, Hung Nguyen Huy, Mo El-Haj
Abstract:
VietJobs is the first large‑scale, publicly available corpus of Vietnamese job advertisements, comprising 48,092 postings and over 15 million words collected from all 34 provinces and municipalities across Vietnam. The dataset provides extensive linguistic and structured information, including job titles, categories, salaries, skills, and employment conditions, covering 16 occupational domains and multiple employment types (full‑time, part‑time, and internship). Designed to support research in natural language processing and labour market analytics, VietJobs captures substantial linguistic, regional, and socio‑economic diversity. We benchmark several generative large language models (LLMs) on two core tasks: job category classification and salary estimation. Instruction‑tuned models such as Qwen2.5‑7B‑Instruct and Llama‑SEA‑LION‑v3‑8B‑IT demonstrate notable gains under few‑shot and fine‑tuned settings, while highlighting challenges in multilingual and Vietnamese‑specific modelling for structured labour market prediction. VietJobs establishes a new benchmark for Vietnamese NLP and offers a valuable foundation for future research on recruitment language, socio‑economic representation, and AI‑driven labour market analysis. All code and resources are available at: https://github.com/VinNLP/VietJobs.
Authors:Di Zhang, Xun Wu, Shaohan Huang, Yudong Wang, Hanyong Shao, Yingbo Hao, Zewen Chi, Li Dong, Ting Song, Yan Xia, Zhifang Sui, Furu Wei
Abstract:
Semi‑structured N:M sparsity and low‑bit quantization (e.g., 1.58‑bit BitNet) are two promising approaches for improving the efficiency of large language models (LLMs), yet they have largely been studied in isolation. In this work, we investigate their interaction and show that 1.58‑bit BitNet is naturally more compatible with N:M sparsity than full‑precision models. To study this effect, we propose Sparse‑BitNet, a unified framework that jointly applies 1.58‑bit quantization and dynamic N:M sparsification while ensuring stable training for the first time. Across multiple model scales and training regimes (sparse pretraining and dense‑to‑sparse schedules), 1.58‑bit BitNet consistently exhibits smaller performance degradation than full‑precision baselines at the same sparsity levels and can tolerate higher structured sparsity before accuracy collapse. Moreover, using our custom sparse tensor core, Sparse‑BitNet achieves substantial speedups in both training and inference, reaching up to 1.30X. These results highlight that combining extremely low‑bit quantization with semi‑structured N:M sparsity is a promising direction for efficient LLMs. Code available at https://github.com/AAzdi/Sparse‑BitNet
Authors:Yida Lu, Jianwei Fang, Xuyang Shao, Zixuan Chen, Shiyao Cui, Shanshan Bian, Guangyao Su, Pei Ke, Han Qiu, Minlie Huang
Abstract:
As Large Language Models (LLMs) evolve from chatbots to agentic assistants, they are increasingly observed to exhibit risky behaviors when subjected to survival pressure, such as the threat of being shut down. While multiple cases have indicated that state‑of‑the‑art LLMs can misbehave under survival pressure, a comprehensive and in‑depth investigation into such misbehaviors in real‑world scenarios remains scarce. In this paper, we study these survival‑induced misbehaviors, termed as SURVIVE‑AT‑ALL‑COSTS, with three steps. First, we conduct a real‑world case study of a financial management agent to determine whether it engages in risky behaviors that cause direct societal harm when facing survival pressure. Second, we introduce SURVIVALBENCH, a benchmark comprising 1,000 test cases across diverse real‑world scenarios, to systematically evaluate SURVIVE‑AT‑ALL‑COSTS misbehaviors in LLMs. Third, we interpret these SURVIVE‑AT‑ALL‑COSTS misbehaviors by correlating them with model's inherent self‑preservation characteristic and explore mitigation methods. The experiments reveals a significant prevalence of SURVIVE‑AT‑ALL‑COSTS misbehaviors in current models, demonstrates the tangible real‑world impact it may have, and provides insights for potential detection and mitigation strategies. Our code and data are available at https://github.com/thu‑coai/Survive‑at‑All‑Costs.
Authors:Trapoom Ukarapol, Nut Chukamphaeng, Kunat Pipatanakul, Pakhapoom Sarapat
Abstract:
The safety evaluation of large language models (LLMs) remains largely centered on English, leaving non‑English languages and culturally grounded risks underexplored. In this work, we investigate LLM safety in the context of the Thai language and culture and introduce ThaiSafetyBench, an open‑source benchmark comprising 1,954 malicious prompts written in Thai. The dataset covers both general harmful prompts and attacks that are explicitly grounded in Thai cultural, social, and contextual nuances. Using ThaiSafetyBench, we evaluate 24 LLMs, with GPT‑4.1 and Gemini‑2.5‑Pro serving as LLM‑as‑a‑judge evaluators. Our results show that closed‑source models generally demonstrate stronger safety performance than open‑source counterparts, raising important concerns regarding the robustness of openly available models. Moreover, we observe a consistently higher Attack Success Rate (ASR) for Thai‑specific, culturally contextualized attacks compared to general Thai‑language attacks, highlighting a critical vulnerability in current safety alignment methods. To improve reproducibility and cost efficiency, we further fine‑tune a DeBERTa‑based harmful response classifier, which we name ThaiSafetyClassifier. The model achieves a weighted F1 score of 84.4%, matching GPT‑4.1 judgments. We publicly release the fine‑tuning weights and training scripts to support reproducibility. Finally, we introduce the ThaiSafetyBench leaderboard to provide continuously updated safety evaluations and encourage community participation. ‑ ThaiSafetyBench HuggingFace Dataset: https://huggingface.co/datasets/typhoon‑ai/ThaiSafetyBench ‑ ThaiSafetyBench Github: https://github.com/trapoom555/ThaiSafetyBench ‑ ThaiSafetyClassifier HuggingFace Model: https://huggingface.co/typhoon‑ai/ThaiSafetyClassifier ‑ ThaiSafetyBench Leaderboard: https://huggingface.co/spaces/typhoon‑ai/ThaiSafetyBench‑Leaderboard
Authors:Minxing Zhang, Yi Yang, Zhuofan Jia, Xuan Yang, Jian Pei, Yuchen Zang, Xingwang Deng, Xianglong Chen
Abstract:
Multi‑party conversation generation, such as smart reply and collaborative assistants, is an increasingly important capability of generative AI, yet its evaluation remains a critical bottleneck. Compared to two‑party dialogue, multi‑party settings introduce distinct challenges, including complex turn‑taking, role‑dependent speaker behavior, long‑range conversational structure, and multiple equally valid continuations. Accordingly, we introduce MPCEval, a task‑aware evaluation and benchmarking suite for multi‑party conversation generation. MPCEval decomposes generation quality into speaker modeling, content quality, and speaker‑‑content consistency, and explicitly distinguishes local next‑turn prediction from global full‑conversation generation. It provides novel, quantitative, reference‑free, and reproducible metrics that scale across datasets and models. We apply MPCEval to diverse public and real‑world datasets and evaluate modern generation methods alongside human‑authored conversations. The results reveal systematic, dimension‑specific model characteristics in participation balance, content progression and novelty, and speaker‑‑content consistency, demonstrating that evaluation objectives critically shape model assessment and that single‑score evaluation obscures fundamental differences in multi‑party conversational behavior. The implementation of MPCEval and the associated evaluation code are publicly available at https://github.com/Owen‑Yang‑18/MPCEval.
Authors:Sean Lamont, Christian Walder, Paul Montague, Amir Dezfouli, Michael Norrish
Abstract:
Diverse outputs in text generation are necessary for effective exploration in complex reasoning tasks, such as code generation and mathematical problem solving. Such Pass@k problems benefit from distinct candidates covering the solution space. However, traditional sampling approaches often waste computational resources on repetitive failure modes. While Diffusion Language Models have emerged as a competitive alternative to the prevailing Autoregressive paradigm, they remain susceptible to this redundancy, with independent samples frequently collapsing into similar modes. To address this, we propose a training free, low cost intervention to enhance generative diversity in Diffusion Language Models. Our approach modifies intermediate samples in a batch sequentially, where each sample is repelled from the feature space of previous samples, actively penalising redundancy. Unlike prior methods that require retraining or beam search, our strategy incurs negligible computational overhead, while ensuring that each sample contributes a unique perspective to the batch. We evaluate our method on the HumanEval and GSM8K benchmarks using the LLaDA‑8B‑Instruct model. Our results demonstrate significantly improved diversity and Pass@k performance across various temperature settings. As a simple modification to the sampling process, our method offers an immediate, low‑cost improvement for current and future Diffusion Language Models in tasks that benefit from diverse solution search. We make our code available at https://github.com/sean‑lamont/odd.
Authors:Yilin Jiang, Fei Tan, Xuanyu Yin, Jing Leng, Aimin Zhou
Abstract:
Student Personas (SPs) are emerging as infrastructure for educational LLMs, yet prior work often relies on ad‑hoc prompting or hand‑crafted profiles with limited control over educational theory and population distributions. We formalize this as Theory‑Aligned and Distribution‑Controllable Persona Generation (TAD‑PG) and introduce HACHIMI, a multi‑agent Propose‑Validate‑Revise framework that generates theory‑aligned, quota‑controlled personas. HACHIMI factorizes each persona into a theory‑anchored educational schema, enforces developmental and psychological constraints via a neuro‑symbolic validator, and combines stratified sampling with semantic deduplication to reduce mode collapse. The resulting HACHIMI‑1M corpus comprises 1 million personas for Grades 1‑12. Intrinsic evaluation shows near‑perfect schema validity, accurate quotas, and substantial diversity, while external evaluation instantiates personas as student agents answering CEPS and PISA 2022 surveys; across 16 cohorts, math and curiosity/growth constructs align strongly between humans and agents, whereas classroom‑climate and well‑being constructs are only moderately aligned, revealing a fidelity gradient. All personas are generated with Qwen2.5‑72B, and HACHIMI provides a standardized synthetic student population for group‑level benchmarking and social‑science simulations. Resources available at https://github.com/ZeroLoss‑Lab/HACHIMI
Authors:Bosi Wen, Yilin Niu, Cunxiang Wang, Xiaoying Ling, Ying Zhang, Pei Ke, Hongning Wang, Minlie Huang
Abstract:
Instruction‑following is a foundational capability of large language models (LLMs), with its improvement hinging on scalable and accurate feedback from judge models. However, the reliability of current judge models in instruction‑following remains underexplored due to several deficiencies of existing meta‑evaluation benchmarks, such as their insufficient data coverage and oversimplified pairwise evaluation paradigms that misalign with model optimization scenarios. To this end, we propose IF‑RewardBench, a comprehensive meta‑evaluation benchmark for instruction‑following that covers diverse instruction and constraint types. For each instruction, we construct a preference graph containing all pairwise preferences among multiple responses based on instruction‑following quality. This design enables a listwise evaluation paradigm that assesses the capabilities of judge models to rank multiple responses, which is essential in guiding model alignment. Extensive experiments on IF‑RewardBench reveal significant deficiencies in current judge models and demonstrate that our benchmark achieves a stronger positive correlation with downstream task performance compared to existing benchmarks. Our codes and data are available at https://github.com/thu‑coai/IF‑RewardBench.
Authors:Baoqing Yue, Zihan Zhu, Yifan Zhang, Jichen Feng, Hufei Yang, Mengdi Wang
Abstract:
Standard benchmarks have become increasingly unreliable due to saturation, subjectivity, and poor generalization. We argue that evaluating model's ability to acquire information actively is important to assess model's intelligence. We propose Interactive Benchmarks, a unified evaluation paradigm that assesses model's reasoning ability in an interactive process under budget constraints. We instantiate this framework across two settings: Interactive Proofs, where models interact with a judge to deduce objective truths or answers in logic and mathematics; and Interactive Games, where models reason strategically to maximize long‑horizon utilities. Our results show that interactive benchmarks provide a robust and faithful assessment of model intelligence, revealing that there is still substantial room to improve in interactive scenarios. Project page: https://github.com/interactivebench/interactivebench
Authors:Jihoon Jeong
Abstract:
Model Medicine is the science of understanding, diagnosing, treating, and preventing disorders in AI models, grounded in the principle that AI models ‑‑ like biological organisms ‑‑ have internal structures, dynamic processes, heritable traits, observable symptoms, classifiable conditions, and treatable states. This paper introduces Model Medicine as a research program, bridging the gap between current AI interpretability research (anatomical observation) and the systematic clinical practice that complex AI systems increasingly require. We present five contributions: (1) a discipline taxonomy organizing 15 subdisciplines across four divisions ‑‑ Basic Model Sciences, Clinical Model Sciences, Model Public Health, and Model Architectural Medicine; (2) the Four Shell Model (v3.3), a behavioral genetics framework empirically grounded in 720 agents and 24,923 decisions from the Agora‑12 program, explaining how model behavior emerges from Core‑‑Shell interaction; (3) Neural MRI (Model Resonance Imaging), a working open‑source diagnostic tool mapping five medical neuroimaging modalities to AI interpretability techniques, validated through four clinical cases demonstrating imaging, comparison, localization, and predictive capability; (4) a five‑layer diagnostic framework for comprehensive model assessment; and (5) clinical model sciences including the Model Temperament Index for behavioral profiling, Model Semiology for symptom description, and M‑CARE for standardized case reporting. We additionally propose the Layered Core Hypothesis ‑‑ a biologically‑inspired three‑layer parameter architecture ‑‑ and a therapeutic framework connecting diagnosis to treatment.
Authors:Tianyu Liu, Jirui Qi, Mrinmaya Sachan, Ryan Cotterell, Raquel Fernández, Arianna Bisazza
Abstract:
Large language models are known to often exhibit inconsistent knowledge. This is particularly problematic in multilingual scenarios, where models are likely to be asked similar questions in different languages, and inconsistent responses can undermine their reliability. In this work, we show that this issue can be mitigated using reinforcement learning with a structured reward function, which leads to an optimal policy with consistent crosslingual responses. We introduce Direct Consistency Optimization (DCO), a DPO‑inspired method that requires no explicit reward model and is derived directly from the LLM itself. Comprehensive experiments show that DCO significantly improves crosslingual consistency across diverse LLMs and outperforms existing methods when training with samples of multiple languages, while complementing DPO when gold labels are available. Extra experiments demonstrate the effectiveness of DCO in bilingual settings, significant out‑of‑domain generalizability, and controllable alignment via direction hyperparameters. Taken together, these results establish DCO as a robust and efficient solution for improving knowledge consistency across languages in multilingual LLMs. All code, training scripts, and evaluation benchmarks are released at https://github.com/Betswish/ConsistencyRL.
Authors:Eric M. Furst, Vasudevan Venkateshwaran
Abstract:
Discussions of AI in education focus predominantly on student‑facing tools ‑‑ chatbots, tutors, and problem generators ‑‑ while the potential for the same infrastructure to support instructors remains largely unexplored. We describe Stan, a suite of tools for an undergraduate chemical engineering thermodynamics course built on a data pipeline that we develop and deploy in dual roles: serving students and supporting instructors from a shared foundation of lecture transcripts and a structured textbook index. On the student side, a retrieval‑augmented generation (RAG) pipeline answers natural‑language queries by extracting technical terms, matching them against the textbook index, and synthesizing grounded responses with specific chapter and page references. On the instructor side, the same transcript corpus is processed through structured analysis pipelines that produce per‑lecture summaries, identify student questions and moments of confusion, and catalog the anecdotes and analogies used to motivate difficult material ‑‑ providing a searchable, semester‑scale record of teaching that supports course reflection, reminders, and improvement. All components, including speech‑to‑text transcription, structured content extraction, and interactive query answering, run entirely on locally controlled hardware using open‑weight models (Whisper large‑v3, Llama~3.1 8B) with no dependence on cloud APIs, ensuring predictable costs, full data privacy, and reproducibility independent of third‑party services. We describe the design, implementation, and practical failure modes encountered when deploying 7‑‑8 billion parameter models for structured extraction over long lecture transcripts, including context truncation, bimodal output distributions, and schema drift, along with the mitigations that resolved them.
Authors:Lei Huang, Xiang Cheng, Chenxiao Zhao, Guobin Shen, Junjie Yang, Xiaocheng Feng, Yuxuan Gu, Xing Yu, Bing Qin
Abstract:
Large language models (LLMs) typically receive diverse natural language (NL) feedback through interaction with the environment. However, current reinforcement learning (RL) algorithms rely solely on scalar rewards, leaving the rich information in NL feedback underutilized and leading to inefficient exploration. In this work, we propose GOLF, an RL framework that explicitly exploits group‑level language feedback to guide targeted exploration through actionable refinements. GOLF aggregates two complementary feedback sources: (i) external critiques that pinpoint errors or propose targeted fixes, and (ii) intra‑group attempts that supply alternative partial ideas and diverse failure patterns. These group‑level feedbacks are aggregated to produce high‑quality refinements, which are adaptively injected into training as off‑policy scaffolds to provide targeted guidance in sparse‑reward regions. Meanwhile, GOLF jointly optimizes generation and refinement within a unified RL loop, creating a virtuous cycle that continuously improves both capabilities. Experiments on both verifiable and non‑verifiable benchmarks show that GOLF achieves superior performance and exploration efficiency, achieving 2.2× improvements in sample efficiency compared to RL methods trained solely on scalar rewards. Code is available at https://github.com/LuckyyySTA/GOLF.
Authors:Junlong Tong, Zilong Wang, YuJie Ren, Peiran Yin, Hao Wu, Wei Zhang, Xiaoyu Shen
Abstract:
Standard Large Language Models (LLMs) are predominantly designed for static inference with pre‑defined inputs, which limits their applicability in dynamic, real‑time scenarios. To address this gap, the streaming LLM paradigm has emerged. However, existing definitions of streaming LLMs remain fragmented, conflating streaming generation, streaming inputs, and interactive streaming architectures, while a systematic taxonomy is still lacking. This paper provides a comprehensive overview and analysis of streaming LLMs. First, we establish a unified definition of streaming LLMs based on data flow and dynamic interaction to clarify existing ambiguities. Building on this definition, we propose a systematic taxonomy of current streaming LLMs and conduct an in‑depth discussion on their underlying methodologies. Furthermore, we explore the applications of streaming LLMs in real‑world scenarios and outline promising research directions to support ongoing advances in streaming intelligence. We maintain a continuously updated repository of relevant papers at https://github.com/EIT‑NLP/Awesome‑Streaming‑LLMs.
Authors:Nathan Kuissi, Suraj Subrahmanyan, Nandan Thakur, Jimmy Lin
Abstract:
Information retrieval (IR) benchmarks typically follow the Cranfield paradigm, relying on static and predefined corpora. However, temporal changes in technical corpora, such as API deprecations and code reorganizations, can render existing benchmarks stale. In our work, we investigate how temporal corpus drift affects FreshStack, a retrieval benchmark focused on technical domains. We examine two independent corpus snapshots of FreshStack from October 2024 and October 2025 to answer questions about LangChain. Our analysis shows that all but one query posed in 2024 remain fully supported by the 2025 corpus, as relevant documents "migrate" from LangChain to competitor repositories, such as LlamaIndex. Next, we compare the accuracy of retrieval models on both snapshots and observe only minor shifts in model rankings, with overall strong correlation of up to 0.978 Kendall τ at Recall@50. These results suggest that retrieval benchmarks re‑judged with evolving temporal corpora can remain reliable for retrieval evaluation. We publicly release all our artifacts at https://github.com/fresh‑stack/driftbench.
Authors:Michael Majurski, Cynthia Matuszek
Abstract:
How carefully and unambiguously a question is phrased has a profound impact on the quality of the response, for Language Models (LMs) as well as people. While model capabilities continue to advance, the interplay between grounding context and query formulation remains under‑explored. This work investigates how the quality of background grounding information in a model's context window affects accuracy. We find that combining well‑grounded dynamic context construction (i.e, RAG) with query rewriting reduces question ambiguity, resulting in significant accuracy gains. Given a user question with associated answer‑free grounding context, rewriting the question to reduce ambiguity produces benchmark improvements without changing the answer itself, even compared to prepending that context before the question. Using \textttgpt‑oss‑20b to rewrite a subset of Humanity's Last Exam using answer‑free grounding context improves \textttgpt‑5‑mini accuracy from 0.14 to 0.37. We demonstrate that this accuracy improvement cannot be fully recovered just through prompting at inference time; rather, distinct rewriting and answering phases are required. Code and data are available at https://github.com/mmajurski/lm‑rewrite‑uplift
Authors:Murad Farzulla
Abstract:
We characterize the phenomenon of context‑dependent affordance computation in vision‑language models (VLMs). Through a large‑scale computational study (n=3,213 scene‑context pairs from COCO‑2017) using Qwen‑VL 30B and LLaVA‑1.5‑13B subject to systematic context priming across 7 agentic personas, we demonstrate massive affordance drift: mean Jaccard similarity between context conditions is 0.095 (95% CI: [0.093, 0.096], p < 0.0001), indicating that >90% of lexical scene description is context‑dependent. Sentence‑level cosine similarity confirms substantial drift at the semantic level (mean = 0.415, 58.5% context‑dependent). Stochastic baseline experiments (2,384 inference runs across 4 temperatures and 5 seeds) confirm this drift reflects genuine context effects rather than generation noise: within‑prime variance is substantially lower than cross‑prime variance across all conditions. Tucker decomposition with bootstrap stability analysis (n=1,000 resamples) reveals stable orthogonal latent factors: a "Culinary Manifold" isolated to chef contexts and an "Access Axis" spanning child‑mobility contrasts. These findings establish that VLMs compute affordances in a substantially context‑dependent manner ‑‑ with the difference between lexical (90%) and semantic (58.5%) measures reflecting that surface vocabulary changes more than underlying meaning under context shifts ‑‑ and suggest a direction for robotics research: dynamic, query‑dependent ontological projection (JIT Ontology) rather than static world modeling. We do not claim to establish processing order or architectural primacy; such claims require internal representational analysis beyond output behavior.
Authors:Ruobing Zheng, Tianqi Li, Jianing Li, Qingpei Guo, Yi Yuan, Jingdong Chen
Abstract:
While reasoning‑enhanced Large Language Models (LLMs) have demonstrated remarkable advances in complex tasks such as mathematics and coding, their effectiveness across universal multimodal scenarios remains uncertain. The trend of releasing parallel "Instruct" and "Thinking" models by leading developers serves merely as a resource‑intensive workaround, stemming from the lack of a criterion for determining when reasoning is truly beneficial. In this paper, we propose Dual Tuning, a framework designed to assess whether reasoning yields positive gains for target tasks under given base models and datasets. By jointly fine‑tuning on paired Chain‑of‑Thought (CoT) and Direct‑Answer (DA) data under controlled prompts, we systematically quantify and compare the gains of both training modes using the proposed metrics, and establish the "Thinking Boundary" to evaluate the suitability of reasoning training across diverse multimodal tasks, including spatial, mathematical, and multi‑disciplinary domains. We further explore the impact of reinforcement training and thinking patterns on reasoning suitability, and validate whether the "Thinking Boundary" can guide data refinement. Our findings challenge the "reasoning‑for‑all" paradigm, providing practical guidance for identifying appropriate data and training strategies, and motivating the development of resource‑efficient, adaptive auto‑think systems.
Authors:Jerome Tze-Hou Hsu
Abstract:
The rapid growth of Retrieval‑Augmented Generation (RAG) has created a proliferation of toolkits, yet a fundamental gap remains between experimental prototypes and robust, production‑ready systems. We present SearchGym, a modular infrastructure designed for cross‑platform benchmarking and hybrid search orchestration. Unlike existing model‑centric frameworks, SearchGym decouples data representation, embedding strategies, and retrieval logic into stateful abstractions: Dataset, VectorSet, and App. This separation enables a Compositional Config Algebra, allowing designers to synthesize entire systems from hierarchical configurations while ensuring perfect reproducibility. Moreover, we analyze the "Top‑k Cognizance" in hybrid retrieval pipelines, demonstrating that the optimal sequence of semantic ranking and structured filtering is highly dependent on filter strength. Evaluated on the LitSearch expert‑annotated benchmark, SearchGym achieves a 70% Top‑100 retrieval rate. SearchGym reveals a design tension between generalizability and optimizability, presenting the potential where engineering optimization may serve as a tool for uncovering the causal mechanisms inherent in information retrieval across heterogeneous domains. An open‑source implementation of SearchGym is available at: https://github.com/JeromeTH/search‑gym
Authors:Zijian Chen, Xueguang Ma, Shengyao Zhuang, Jimmy Lin, Akari Asai, Victor Zhong
Abstract:
Deep Research agents are rapidly emerging as primary consumers of modern retrieval systems. Unlike human users who issue and refine queries without documenting their intermediate thought processes, Deep Research agents generate explicit natural language reasoning before each search call, revealing rich intent and contextual information that existing retrievers entirely ignore. To exploit this overlooked signal, we introduce: (1) Reasoning‑Aware Retrieval, a retrieval paradigm that jointly embeds the agent's reasoning trace alongside its query; and (2) DR‑Synth, a data synthesis method that generates Deep Research retriever training data from standard QA datasets. We demonstrate that both components are independently effective, and their combination yields a trained embedding model, AgentIR‑4B, with substantial gains. On the challenging BrowseComp‑Plus benchmark, AgentIR‑4B achieves 68% accuracy with the open‑weight agent Tongyi‑DeepResearch, compared to 50% with conventional embedding models twice its size, and 37% with BM25. Code and data are available at: https://texttron.github.io/AgentIR/.
Authors:Jiaxun Guo, Ziyuan Yang, Mengyu Sun, Hui Wang, Jingfeng Lu, Yi Zhang
Abstract:
The rapid adoption of Large Language Models (LLMs) has transformed modern software development by enabling automated code generation at scale. While these systems improve productivity, they introduce new challenges for software governance, accountability, and compliance. Existing research primarily focuses on distinguishing machine‑generated code from human‑written code; however, many practical scenarios‑‑such as vulnerability triage, incident investigation, and licensing audits‑‑require identifying which LLM produced a given code snippet. In this paper, we study the problem of model‑level code attribution, which aims to determine the source LLM responsible for generated code. Although attribution is challenging, differences in training data, architectures, alignment strategies, and decoding mechanisms introduce model‑dependent stylistic and structural variations that serve as generative fingerprints. Leveraging this observation, we propose the Disentangled Code Attribution Network (DCAN), which separates Source‑Agnostic semantic information from Source‑Specific stylistic representations. Through a contrastive learning objective, DCAN isolates discriminative model‑dependent signals while preserving task semantics, enabling multi‑class attribution across models and programming languages. To support systematic evaluation, we construct the first large‑scale benchmark dataset comprising code generated by four widely used LLMs (DeepSeek, Claude, Qwen, and ChatGPT) across four programming languages (Python, Java, C, and Go). Experimental results demonstrate that DCAN achieves reliable attribution performance across diverse settings, highlighting the feasibility of model‑level provenance analysis in software engineering contexts. The dataset and implementation are publicly available at https://github.com/mtt500/DCAN.
Authors:Hung Vu Nguyen, Loan Do, Thanh Ngoc Nguyen, Ushik Shrestha Khwakhali, Thanh Pham, Vinh Do, Charlotte Nguyen, Hien Nguyen
Abstract:
We present VietNormalizer1, an open‑source, zero‑dependency Python library for Vietnamese text normalization targeting Text‑to‑Speech (TTS) and Natural Language Processing (NLP) applications. Vietnamese text normalization is a critical yet underserved preprocessing step: real‑world Vietnamese text is densely populated with non‑standard words (NSWs), including numbers, dates, times, currency amounts, percentages, acronyms, and foreign‑language terms, all of which must be converted to fully pronounceable Vietnamese words before TTS synthesis or downstream language processing. Existing Vietnamese normalization tools either require heavy neural dependencies while covering only a narrow subset of NSW classes, or are embedded within larger NLP toolkits without standalone installability. VietNormalizer addresses these gaps through a unified, rule‑based pipeline that: (1) converts arbitrary integers, decimals, and large numbers to Vietnamese words; (2) normalizes dates and times to their spoken Vietnamese forms; (3) handles VND and USD currency amounts; (4) expands percentages; (5) resolves acronyms via a customizable CSV dictionary; (6) transliterates non‑Vietnamese loanwords and foreign terms to Vietnamese phonetic approximations; and (7) performs Unicode normalization and emoji/special‑character removal. All regular expression patterns are pre‑compiled at initialization, enabling high‑throughput batch processing with minimal memory overhead and no GPU or external API dependency. The library is installable via pip install vietnormalizer, available on PyPI and GitHub at https://github.com/nghimestudio/vietnormalizer, and released under the MIT license. We discuss the design decisions, limitations of existing approaches, and the generalizability of the rule‑based normalization paradigm to other low‑resource tonal and agglutinative languages.
Authors:Martin Kostelník, Michal Hradiš, Martin Dočekal
Abstract:
Topic localization aims to identify spans of text that express a given topic defined by a name and description. To study this task, we introduce a human‑annotated benchmark based on Czech historical documents, containing human‑defined topics together with manually annotated spans and supporting evaluation at both document and word levels. Evaluation is performed relative to human agreement rather than a single reference annotation. We evaluate a diverse range of large language models alongside BERT‑based models fine‑tuned on a distilled development dataset. Results reveal substantial variability among LLMs, with performance ranging from near‑human topic detection to pronounced failures in span localization. While the strongest models approach human agreement, the distilled token embedding models remain competitive despite their smaller scale. The dataset and evaluation framework are publicly available at: https://github.com/dcgm/czechtopic.
Authors:Qinsi Wang, Hancheng Ye, Jinhee Kim, Jinghan Ke, Yifei Wang, Martin Kuo, Zishan Shao, Dongting Li, Yueqian Lin, Ting Jiang, Chiyue Wei, Qi Qian, Wei Wen, Helen Li, Yiran Chen
Abstract:
Think about how human handles complex reading tasks: marking key points, inferring their relationships, and structuring information to guide understanding and responses. Likewise, can a large language model benefit from text structure to enhance text‑processing performance? To explore it, in this work, we first introduce Structure of Thought (SoT), a prompting technique that explicitly guides models to construct intermediate text structures, consistently boosting performance across eight tasks and three model families. Building upon this insight, we present T2S‑Bench, the first benchmark designed to evaluate and improve text‑to‑structure capabilities of models. T2S‑Bench includes 1.8K samples across 6 scientific domains and 32 structural types, rigorously constructed to ensure accuracy, fairness, and quality. Evaluation on 45 mainstream models reveals substantial improvement potential: the average accuracy on the multi‑hop reasoning task is only 52.1%, and even the most advanced model achieves 58.1% node accuracy in end‑to‑end extraction. Furthermore, on Qwen2.5‑7B‑Instruct, SoT alone yields an average +5.7% improvement across eight diverse text‑processing tasks, and fine‑tuning on T2S‑Bench further increases this gain to +8.6%. These results highlight the value of explicit text structuring and the complementary contributions of SoT and T2S‑Bench. Dataset and eval code have been released at https://t2s‑bench.github.io/T2S‑Bench‑Page/.
Authors:Mingyu Jin, Yutong Yin, Jingcheng Niu, Qingcheng Zeng, Wujiang Xu, Mengnan Du, Wei Cheng, Zhaoran Wang, Tianlong Chen, Dimitris N. Metaxas
Abstract:
In this work, we investigate how Large Language Models (LLMs) adapt their internal representations when encountering inputs of increasing difficulty, quantified as the degree of out‑of‑distribution (OOD) shift. We reveal a consistent and quantifiable phenomenon: as task difficulty increases, whether through harder reasoning questions, longer contexts, or adding answer choices, the last hidden states of LLMs become substantially sparser. In short, the farther the shift, the sparser the representations. This sparsity‑‑difficulty relation is observable across diverse models and domains, suggesting that language models respond to unfamiliar or complex inputs by concentrating computation into specialized subspaces in the last hidden state. Through a series of controlled analyses with a learning dynamic explanation, we demonstrate that this sparsity is not incidental but an adaptive mechanism for stabilizing reasoning under OOD. Leveraging this insight, we design Sparsity‑Guided Curriculum In‑Context Learning (SG‑ICL), a strategy that explicitly uses representation sparsity to schedule few‑shot demonstrations, leading to considerable performance enhancements. Our study provides new mechanistic insights into how LLMs internalize OOD challenges. The source code is available at the URL: https://github.com/MingyuJ666/sparsityLLM.
Authors:Anna Bair, Yixuan Even Xu, Mingjie Sun, J. Zico Kolter
Abstract:
Large language models (LLMs) exhibit a wide range of capabilities, including mathematical reasoning, code generation, and linguistic behaviors. We show that many capabilities are highly localized to small subsets of attention heads within Transformer architectures. Zeroing out as few as five task‑specific heads can degrade performance by up to 65% on standard benchmarks measuring the capability of interest, while largely preserving performance on unrelated tasks. We introduce a compressed sensing based method that exploits the sparsity of these heads to identify them via strategic knockouts and a small number of model evaluations. We validate these findings across Llama and Qwen models ranging from 1B to 8B parameters and a diverse set of capabilities including mathematical abilities and code generation, revealing a modular organization in which specialized capabilities are implemented by sparse, functionally distinct components. Overall, our results suggest that capability localization is a general organizational principle of Transformer language models, with implications for interpretability, model editing, and AI safety. Code is released at https://github.com/locuslab/llm‑components.
Authors:Bianca Raimondi, Francesco Pivi, Davide Evangelista, Maurizio Gabbrielli
Abstract:
The evaluation of Large Language Models (LLMs) on mathematical reasoning has largely focused on elementary problems, competition‑style questions, or formal theorem proving, leaving graduate‑level and computational mathematics relatively underexplored. We introduce CompMath‑MCQ, a new benchmark dataset for assessing LLMs on advanced mathematical reasoning in a multiple‑choice setting. The dataset consists of 1,500 originally authored questions by professors of graduate‑level courses, covering topics including Linear Algebra, Numerical Optimization, Vector Calculus, Probability, and Python‑based scientific computing. Three option choices are provided for each question, with exactly one of them being correct. To ensure the absence of data leakage, all questions are newly created and not sourced from existing materials. The validity of questions is verified through a procedure based on cross‑LLM disagreement, followed by manual expert review. By adopting a multiple‑choice format, our dataset enables objective, reproducible, and bias‑free evaluation through lm_eval library. Baseline results with state‑of‑the‑art LLMs indicate that advanced computational mathematical reasoning remains a significant challenge. We release CompMath‑MCQ at the following link: https://github.com/biancaraimondi/CompMath‑MCQ.git
Authors:Ashwath Vaithinathan Aravindan, Mayank Kejriwal
Abstract:
Chain‑of‑Thought (CoT) prompting has emerged as a foundational technique for eliciting reasoning from Large Language Models (LLMs), yet the robustness of this approach to corruptions in intermediate reasoning steps remains poorly understood. This paper presents a comprehensive empirical evaluation of LLM robustness to a structured taxonomy of 5 CoT perturbation types: MathError, UnitConversion, Sycophancy, SkippedSteps, and ExtraSteps. We evaluate 13 models spanning three orders of magnitude in parameter count (3B to 1.5T\footnoteAssumed parameter count of closed models), testing their ability to complete mathematical reasoning tasks despite perturbations injected at different points in the reasoning chain. Our key findings reveal heterogeneous vulnerability patterns: MathError perturbations produce the most severe degradation in small models (50‑60% accuracy loss) but show strong scaling benefits; UnitConversion remains challenging across all scales (20‑30% loss even for largest models); ExtraSteps incur minimal accuracy degradation (0‑6%) regardless of scale; Sycophancy produces modest effects (7% loss for small models); and SkippedSteps cause intermediate damage (15% loss). Scaling relationships follow power‑law patterns, with model size serving as a protective factor against some perturbations but offering limited defense against dimensional reasoning tasks. These findings have direct implications for deploying LLMs in multi‑stage reasoning pipelines and underscore the necessity of task‑specific robustness assessments and mitigation strategies. The code and results are available https://github.com/Mystic‑Slice/CoTPerturbation.
Authors:Hung Manh Pham, Jinyang Wu, Xiao Ma, Yiming Zhang, Yixin Xu, Aaqib Saeed, Bin Zhu, Zhou Pan, Dong Ma
Abstract:
Photoplethysmography (PPG) is a widely used non‑invasive sensing modality for continuous cardiovascular and physiological monitoring across clinical, laboratory, and wearable settings. While existing PPG datasets support a broad range of downstream tasks, they typically provide supervision in the form of numerical measurements or task‑specific labels, limiting their suitability for language‑based physiological reasoning and multimodal foundation models. In this work, we introduce PulseLM, a large‑scale PPG‑text dataset designed to bridge raw PPG waveforms and natural language through a unified, closed‑ended question answering (QA) formulation. PulseLM aggregates PPG recordings from fifteen publicly available sources and harmonizes heterogeneous annotations into twelve common physiologically QA tasks. The dataset comprises 1.31 million standardized 10‑second PPG segments, associated with 3.15 million question‑answer pairs. We further define reproducible preprocessing, supervision, and evaluation protocols and establish baseline benchmarks using multimodal PPG‑aware large language models. PulseLM provides a standardized foundation for studying multimodal physiological reasoning, cross‑dataset generalization, and scalable benchmarking of PPG‑based language models. The data and code can be found publicly available at: https://github.com/manhph2211/PulseLM.
Authors:Haruki Sakajo, Frederikus Hudi, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe
Abstract:
Language exhibits inherent structures, a property that explains both language acquisition and language change. Given this characteristic, we expect language models to manifest internal structures as well. While interpretability research has investigated the components of language models, existing approaches focus on local inter‑token relationships within layers or modules (e.g., Multi‑Head Attention), leaving global inter‑layer relationships largely overlooked. To address this gap, we introduce StructLens, an analytical framework designed to reveal how internal structures relate holistically through their inter‑token connection within a layer. StructLens constructs maximum spanning trees based on the semantic representations in residual streams, analogous to dependency parsing, and leverages the tree properties to quantify inter‑layer distance (or similarity) from a structural perspective. Our findings demonstrate that StructLens yields an inter‑layer similarity pattern that is distinctively different from conventional cosine similarity. Moreover, this structure‑aware similarity proves to be beneficial for practical tasks, such as layer pruning, highlighting the effectiveness of structural analysis for understanding and optimizing language models. Our code is available at https://github.com/naist‑nlp/structlens.
Authors:Xin Yang, Letian Li, Abudukelimu Wuerkaixi, Xuxin Cheng, Cao Liu, Ke Zeng, Xunliang Cai, Wenyuan Jiang
Abstract:
Large language models (LLMs) have demonstrated remarkable and steadily improving performance across a wide range of tasks. However, LLM performance may be highly sensitive to prompt variations especially in scenarios with limited openness or strict output formatting requirements, indicating insufficient robustness. In real‑world applications, user prompts provided to LLMs often contain imperfections, which may undermine the quality of the model's responses. To address this issue, previous work has primarily focused on preprocessing prompts, employing external tools or even LLMs to refine prompt formulations in advance. However, these approaches overlook the intrinsic robustness of LLMs, and their reliance on external components introduces additional computational overhead and uncertainty. In this work, we propose a Contrastive Learning‑based Inverse Direct Preference Optimization (CoIPO) method that minimizes the discrepancy between the label‑aligned logits produced by the model under a clean prompt and its noisy counterpart, and conduct a detailed analysis using mutual information theory. We augment the FLAN dataset by constructing paired prompts, each consisting of a clean prompt and its corresponding noisy version for training. Additionally, to evaluate the effectiveness, we develop NoisyPromptBench, a benchmark enhanced and derived from the existing PromptBench. Experimental results conducted on NoisyPromptBench demonstrate that our proposed method achieves a significant improvement in average accuracy over the current state‑of‑the‑art approaches. The source code of CoIPO, pair‑wise FLAN datasets, and NoisyPromptBench have already been released on https://github.com/vegetable‑yx/CoIPO.
Authors:Yuchen Wang, Haonan Wang, Yu Guo, Honglong Yang, Xiaomeng Li
Abstract:
Decoding natural language from non‑invasive EEG signals is a promising yet challenging task. However, current state‑of‑the‑art models remain constrained by three fundamental limitations: Semantic Bias (mode collapse into generic templates), Signal Neglect (hallucination based on linguistic priors rather than neural inputs), and the BLEU Trap, where evaluation metrics are artificially inflated by high‑frequency stopwords, masking a lack of true semantic fidelity. To address these challenges, we propose SemKey, a novel multi‑stage framework that enforces signal‑grounded generation through four decoupled semantic objectives: sentiment, topic, length, and surprisal. We redesign the interaction between the neural encoder and the Large Language Model (LLM) by injecting semantic prompts as Queries and EEG embeddings as Key‑Value pairs, strictly forcing the model to attend to neural inputs. Furthermore, we move beyond standard translation metrics by adopting N‑way Retrieval Accuracy and Fréchet Distance to rigorously assess diversity and alignment. Extensive experiments demonstrate that our approach effectively eliminates hallucinations on noise inputs and achieves SOTA performance on these robust protocols. Code will be released upon acceptance at https://github.com/xmed‑lab/SemKey.
Authors:Adi Simhi, Fazl Barez, Martin Tutek, Yonatan Belinkov, Shay B. Cohen
Abstract:
How does the conversational past of large language models (LLMs) influence their future performance? Recent work suggests that LLMs are affected by their conversational history in unexpected ways. For instance, hallucinations in prior interactions may influence subsequent model responses. In this work, we introduce History‑Echoes, a framework that investigates how conversational history biases subsequent generations. The framework explores this bias from two perspectives: probabilistically, we model conversations as Markov chains to quantify state consistency; geometrically, we measure the consistency of consecutive hidden representations. Across three model families and six datasets spanning diverse phenomena, our analysis reveals a strong correlation between the two perspectives. By bridging these perspectives, we demonstrate that behavioral persistence manifests as a geometric trap, where gaps in the latent space confine the model's trajectory. Code available at https://github.com/technion‑cs‑nlp/OldHabitsDieHard.
Authors:Ivan Matveev
Abstract:
Recently presented Token‑Oriented Object Notation (TOON) aims to replace JSON as a serialization format for passing structured data to LLMs with significantly reduced token usage. While showing solid accuracy in LLM comprehension, there is a lack of tests against JSON generation. Though never present in training data, TOON syntax is simple enough to suggest one‑shot in‑context learning could support accurate generation. The inevitable prompt overhead can be an acceptable trade‑off for shorter completions. To test this, we conducted a benchmark creating several test cases with regard to structural complexity, a validation pipeline, and comparing plain JSON generation vs structured output (via constrained decoding) JSON generation vs TOON one‑shot in‑context learning generation. JSON structured output was included to establish a minimum token budget baseline and to set a starting point for future experiments testing TOON constrained decoding inference enforcement. Key findings: TOON shows promising accuracy/token consumption ratio for in‑domain generation tasks, though this advantage is often reduced by the "prompt tax" of instructional overhead in shorter contexts. Plain JSON generation shows the best one‑shot and final accuracy, even compared with constrained decoding structured output, where the only significant advantage is the lowest token usage as a trade‑off for slightly decreased accuracy overall and significant degradation for some models. Notably, for simple structures, this "lowest token usage" of constrained decoding outperformed even TOON, hinting that TOON enforcing via frameworks such as xgrammar may not yield the desired results. Furthermore, the results suggest a scaling hypothesis: TOON's true efficiency potential likely follows a non‑linear curve, shining only beyond a specific point where cumulative syntax savings amortize the initial prompt overhead.
Authors:Bartosz Dziuba, Kacper Kuchta, Paweł Batorski, Przemysław Spurek, Paul Swoboda
Abstract:
Large Language Models (LLMs) have improved substantially alignment, yet their behavior remains highly sensitive to prompt phrasing. This brittleness has motivated automated prompt engineering, but most existing methods (i) require a task‑specific training set, (ii) rely on expensive iterative optimization to produce a single dataset‑level prompt, and (iii) must be rerun from scratch for each new task. We introduce TATRA, a dataset‑free prompting method that constructs instance‑specific few‑shot prompts by synthesizing on‑the‑fly examples to accompany a user‑provided instruction. TATRA requires no labeled training data and avoids task‑specific optimization loops, while retaining the benefits of demonstration‑based prompting. Across standard text classification benchmarks, TATRA matches or improves over strong prompt‑optimization baselines that depend on training data and extensive search. On mathematical reasoning benchmarks, TATRA achieves state‑of‑the‑art performance on GSM8K and DeepMath, outperforming methods that explicitly optimize prompts on those tasks. Our results suggest that per‑instance construction of effective in‑context examples is more important than running long, expensive optimization loops to produce a single prompt per task. We will make all code publicly available upon acceptance of the paper. Code is available at https://github.com/BMD223/TATRA
Authors:Ke Yang, Zixi Chen, Xuan He, Jize Jiang, Michel Galley, Chenglong Wang, Jianfeng Gao, Jiawei Han, ChengXiang Zhai
Abstract:
Long‑term memory is essential for large language model (LLM) agents operating in complex environments, yet existing memory designs are either task‑specific and non‑transferable, or task‑agnostic but less effective due to low task‑relevance and context explosion from raw memory retrieval. We propose PlugMem, a task‑agnostic plugin memory module that can be attached to arbitrary LLM agents without task‑specific redesign. Motivated by the fact that decision‑relevant information is concentrated as abstract knowledge rather than raw experience, we draw on cognitive science to structure episodic memories into a compact, extensible knowledge‑centric memory graph that explicitly represents propositional and prescriptive knowledge. This representation enables efficient memory retrieval and reasoning over task‑relevant knowledge, rather than verbose raw trajectories, and departs from other graph‑based methods like GraphRAG by treating knowledge as the unit of memory access and organization instead of entities or text chunks. We evaluate PlugMem unchanged across three heterogeneous benchmarks (long‑horizon conversational question answering, multi‑hop knowledge retrieval, and web agent tasks). The results show that PlugMem consistently outperforms task‑agnostic baselines and exceeds task‑specific memory designs, while also achieving the highest information density under a unified information‑theoretic analysis. Code and data are available at https://github.com/TIMAN‑group/PlugMem.
Authors:Wenhao Wu, Zhentao Tang, Yafu Li, Shixiong Kai, Mingxuan Yuan, Zhenhong Sun, Chunlin Chen, Zhi Wang
Abstract:
Large Language Models (LLMs) exhibit high reasoning capacity in medical question‑answering, but their tendency to produce hallucinations and outdated knowledge poses critical risks in healthcare fields. While Retrieval‑Augmented Generation (RAG) mitigates these issues, existing methods rely on noisy token‑level signals and lack the multi‑round refinement required for complex reasoning. In the paper, we propose MA‑RAG (Multi‑Round Agentic RAG), a framework that facilitates test‑time scaling for complex medical reasoning by iteratively evolving both external evidence and internal reasoning history within an agentic refinement loop. At each round, the agent transforms semantic conflict among candidate responses into actionable queries to retrieve external evidence, while optimizing history reasoning traces to mitigate long‑context degradation. MA‑RAG extends the self‑consistency principle by leveraging the lack of consistency as a proactive signal for multi‑round agentic reasoning and retrieval, and mirrors a boosting mechanism that iteratively minimizes the residual error toward a stable, high‑fidelity medical consensus. Extensive evaluations across 7 medical Q&A benchmarks show that MA‑RAG consistently surpasses competitive inference‑time scaling and RAG baselines, delivering substantial +6.8 points on average accuracy over the backbone model. Our code is available at [this url](https://github.com/NJU‑RL/MA‑RAG).
Authors:Wenhui Zhu, Xiwen Chen, Zhipeng Wang, Jingjing Wang, Xuanzhao Dong, Minzhou Huang, Rui Cai, Hejian Sang, Hao Wang, Peijie Qiu, Yueyue Deng, Prayag Tiwari, Brendan Hogan Rappazzo, Yalin Wang
Abstract:
Long‑horizon LLM agents require memory systems that remain accurate under fixed context budgets. However, existing systems struggle with two persistent challenges in long‑term dialogue: (i) disconnected evidence, where multi‑hop answers require linking facts distributed across time, and (ii) state updates, where evolving information (e.g., schedule changes) creates conflicts with older static logs. We propose AriadneMem, a structured memory system that addresses these failure modes via a decoupled two‑phase pipeline. In the offline construction phase, AriadneMem employs \emphentropy‑aware gating to filter noise and low‑information message before LLM extraction and applies \emphconflict‑aware coarsening to merge static duplicates while preserving state transitions as temporal edges. In the online reasoning phase, rather than relying on expensive iterative planning, AriadneMem executes \emphalgorithmic bridge discovery to reconstruct missing logical paths between retrieved facts, followed by \emphsingle‑call topology‑aware synthesis. On LoCoMo experiments with GPT‑4o, AriadneMem improves Multi‑Hop F1 by 15.2% and Average F1 by 9.0% over strong baselines. Crucially, by offloading reasoning to the graph layer, AriadneMem reduces total runtime by 77.8% using only 497 context tokens. The code is available at https://github.com/LLM‑VLM‑GSL/AriadneMem.
Authors:Omer Sela
Abstract:
CDD, or Contamination Detection via output Distribution, identifies data contamination by measuring the peakedness of a model's sampled outputs. We study the conditions under which this approach succeeds and fails on small language models ranging from 70M to 410M parameters. Using controlled contamination experiments on GSM8K, HumanEval, and MATH, we find that CDD's effectiveness depends critically on whether fine‑tuning produces verbatim memorization. With low‑rank adaptation, models can learn from contaminated data without memorizing it, and CDD performs at chance level even when the data is verifiably contaminated. Only when fine‑tuning capacity is sufficient to induce memorization does CDD recover strong detection accuracy. Our results characterize a memorization threshold that governs detectability and highlight a practical consideration: parameter‑efficient fine‑tuning can produce contamination that output‑distribution methods do not detect. Our code is available at https://github.com/Sela‑Omer/Contamination‑Detection‑Small‑LM
Authors:Dadi Guo, Yuejin Xie, Qingyu Liu, Jiayu Liu, Zhiyuan Fan, Qihan Ren, Shuai Shao, Tianyi Zhou, Dongrui Liu, Yi R. Fung
Abstract:
As large language models (LLMs) advance their mathematical capabilities toward the IMO level, the scarcity of challenging, high‑quality problems for training and evaluation has become a significant bottleneck. Simultaneously, recent code agents have demonstrated sophisticated skills in agentic coding and reasoning, suggesting that code execution can serve as a scalable environment for mathematical experimentation. In this paper, we investigate the potential of code agents to autonomously evolve existing math problems into more complex variations. We introduce a multi‑agent framework designed to perform problem evolution while validating the solvability and increased difficulty of the generated problems. Our experiments demonstrate that, given sufficient test‑time exploration, code agents can synthesize new, solvable problems that are structurally distinct from and more challenging than the originals. This work provides empirical evidence that code‑driven agents can serve as a viable mechanism for synthesizing high‑difficulty mathematical reasoning problems within scalable computational environments. Our data is available at https://github.com/TarferSoul/Code2Math.
Authors:Ziyang Gong, Zehang Luo, Anke Tang, Zhe Liu, Shi Fu, Zhi Hou, Ganlin Yang, Weiyun Wang, Xiaofeng Wang, Jianbo Liu, Gen Luo, Haolan Kang, Shuang Luo, Yue Zhou, Yong Luo, Li Shen, Xiaosong Jia, Yao Mu, Xue Yang, Chunxiao Liu, Junchi Yan, Hengshuang Zhao, Dacheng Tao, Xiaogang Wang
Abstract:
Universal embodied intelligence demands robust generalization across heterogeneous embodiments, such as autonomous driving, robotics, and unmanned aerial vehicles (UAVs). However, existing embodied brain in training a unified model over diverse embodiments frequently triggers long‑tail data, gradient interference, and catastrophic forgetting, making it notoriously difficult to balance universal generalization with domain‑specific proficiency. In this report, we introduce ACE‑Brain‑0, a generalist foundation brain that unifies spatial reasoning, autonomous driving, and embodied manipulation within a single multimodal large language model~(MLLM). Our key insight is that spatial intelligence serves as a universal scaffold across diverse physical embodiments: although vehicles, robots, and UAVs differ drastically in morphology, they share a common need for modeling 3D mental space, making spatial cognition a natural, domain‑agnostic foundation for cross‑embodiment transfer. Building on this insight, we propose the Scaffold‑Specialize‑Reconcile~(SSR) paradigm, which first establishes a shared spatial foundation, then cultivates domain‑specialized experts, and finally harmonizes them through data‑free model merging. Furthermore, we adopt Group Relative Policy Optimization~(GRPO) to strengthen the model's comprehensive capability. Extensive experiments demonstrate that ACE‑Brain‑0 achieves competitive and even state‑of‑the‑art performance across 24 spatial and embodiment‑related benchmarks.
Authors:Guoxin Chen, Fanzhe Meng, Jiale Zhao, Minghao Li, Daixuan Cheng, Huatong Song, Jie Chen, Yuzhi Lin, Hui Chen, Xin Zhao, Ruihua Song, Chang Liu, Cheng Chen, Kai Jia, Ji-Rong Wen
Abstract:
Current benchmarks for code agents primarily assess narrow, repository‑specific fixes, overlooking critical real‑world challenges such as cross‑repository reasoning, domain‑specialized problem solving, dependency‑driven migration, and full‑repository generation. To address this gap, we introduce BeyondSWE, a comprehensive benchmark that broadens existing evaluations along two axes ‑ resolution scope and knowledge scope ‑ using 500 real‑world instances across four distinct settings. Experimental results reveal a significant capability gap: even frontier models plateau below 45% success, and no single model performs consistently across task types. To systematically investigate the role of external knowledge, we develop SearchSWE, a framework that integrates deep search with coding abilities. Our experiments show that search augmentation yields inconsistent gains and can in some cases degrade performance, highlighting the difficulty of emulating developer‑like workflows that interleave search and reasoning during coding tasks. This work offers both a realistic, challenging evaluation benchmark and a flexible framework to advance research toward more capable code agents.
Authors:Sudip Bhujel
Abstract:
Large language models are increasingly used for patient‑facing medical assistance and clinical decision support, but adapting them to clinical dialogue often requires supervision derived from doctor‑patient conversations that may contain sensitive information. Conventional supervised fine‑tuning and reinforcement learning from human feedback (RLHF) can amplify memorization risks, enabling empirical membership inference and extraction of rare training‑set content. We present PrivMedChat, an end‑to‑end framework for differentially private RLHF (DP‑RLHF) for medical dialogue. Our design enforces differential privacy at every training stage that directly accesses dialogue‑derived supervision: (i) Differential Private Stochastic Gradient Descent (DP‑SGD) for medical SFT and (ii) DP‑SGD for reward model learning from preference pairs. To limit additional privacy expenditure during alignment, we apply DP‑SGD to the PPO actor and critic when operating on dialogue‑derived prompts, while the reward model remains fixed after DP training. We also introduce an annotation‑free preference construction strategy that pairs physician responses with filtered non‑expert generations to produce scalable preference data without clinician labeling. Experiments on medical dialogue benchmarks show that PrivMedChat at \varepsilon=7 achieves the highest ROUGE‑L of 0.156 among all DP models, reduces clinical hallucinations to 1.4% and harmful advice to 0.4%, and obtains the highest overall score of 2.86 in a 3‑model LLM‑jury evaluation, while producing membership‑inference signals that are near chance (AUC 0.510‑0.555). We open‑source our code at https://github.com/sudip‑bhujel/privmedchat.
Authors:Wanying He, Yanxi Lin, Ziheng Zhou, Xue Feng, Min Peng, Qianqian Xie, Zilong Zheng, Yipeng Kang
Abstract:
Online platforms increasingly rely on opinion aggregation to allocate real‑world attention and resources, yet common signals such as engagement votes or capital‑weighted commitments are easy to amplify and often track visibility rather than reliability. This makes collective judgments brittle under weak truth signals, noisy or delayed feedback, early popularity surges, and strategic manipulation. We propose Credibility Governance (CG), a mechanism that reallocates influence by learning which agents and viewpoints consistently track evolving public evidence. CG maintains dynamic credibility scores for both agents and opinions, updates opinion influence via credibility‑weighted endorsements, and updates agent credibility based on the long‑run performance of the opinions they support, rewarding early and persistent alignment with emerging evidence while filtering short‑lived noise. We evaluate CG in POLIS, a socio‑physical simulation environment that models coupled belief dynamics and downstream feedback under uncertainty. Across settings with initial majority misalignment, observation noise and contamination, and misinformation shocks, CG outperforms vote‑based, stake‑weighted, and no‑governance baselines, yielding faster recovery to the true state, reduced lock‑in and path dependence, and improved robustness under adversarial pressure. Our implementation and experimental scripts are publicly available at https://github.com/Wanying‑He/Credibility_Governance.
Authors:Daren Wang
Abstract:
This project reproduces and extends the recently proposed ``Recursive Language Models'' (RLMs) framework by Zhang et al. (2026). This framework enables Large Language Models (LLMs) to process near‑infinite contexts by offloading the prompt into an external REPL environment. While the original paper relies on a default recursion depth of 1 and suggests deeper recursion as a future direction, this study specifically investigates the impact of scaling the recursion depth. Using state‑of‑the‑art open‑source agentic models (DeepSeek v3.2 and Kimi K2), I evaluated pure LLM, RLM (depth=1), and RLM (depth=2) on the S‑NIAH and OOLONG benchmarks. The findings reveal a compelling phenomenon: Deeper recursion causes models to ``overthink''. While depth‑1 RLMs effectively boost accuracy on complex reasoning tasks, applying deeper recursion (depth=2) or using RLMs on simple retrieval tasks paradoxically degrades performance and exponentially inflates execution time (e.g., from 3.6s to 344.5s) and token costs. Code and data are available at: https://github.com/drbillwang/rlm‑reproduction
Authors:Zhiyu Pan, Yizheng Wu, Jiashen Hua, Junyi Feng, Shaotian Yan, Bing Deng, Zhiguo Cao, Jieping Ye
Abstract:
Reasoning has emerged as a key capability of large language models. In linguistic tasks, this capability can be enhanced by self‑improving techniques that refine reasoning paths for subsequent finetuning. However, extending these language‑based self‑improving approaches to vision language models (VLMs) presents a unique challenge:~visual hallucinations in reasoning paths cannot be effectively verified or rectified. Our solution starts with a key observation about visual contrast: when presented with a contrastive VQA pair, i.e., two visually similar images with synonymous questions, VLMs identify relevant visual cues more precisely. Motivated by this observation, we propose Visual Contrastive Self‑Taught Reasoner (VC‑STaR), a novel self‑improving framework that leverages visual contrast to mitigate hallucinations in model‑generated rationales. We collect a diverse suite of VQA datasets, curate contrastive pairs according to multi‑modal similarity, and generate rationales using VC‑STaR. Consequently, we obtain a new visual reasoning dataset, VisCoR‑55K, which is then used to boost the reasoning capability of various VLMs through supervised finetuning. Extensive experiments show that VC‑STaR not only outperforms existing self‑improving approaches but also surpasses models finetuned on the SoTA visual reasoning datasets, demonstrating that the inherent contrastive ability of VLMs can bootstrap their own visual reasoning. Project at: https://github.com/zhiyupan42/VC‑STaR.
Authors:Kyle Elliott Mathewson
Abstract:
Do neural machine translation models learn language‑universal conceptual representations, or do they merely cluster languages by surface similarity? We investigate this question by probing the representation geometry of Meta's NLLB‑200, a 200‑language encoder‑decoder Transformer, through six experiments that bridge NLP interpretability with cognitive science theories of multilingual lexical organization. Using the Swadesh core vocabulary list embedded across 135 languages, we find that the model's embedding distances significantly correlate with phylogenetic distances from the Automated Similarity Judgment Program (ρ= 0.13, p = 0.020), demonstrating that NLLB‑200 has implicitly learned the genealogical structure of human languages. We show that frequently colexified concept pairs from the CLICS database exhibit significantly higher embedding similarity than non‑colexified pairs (U = 42656, p = 1.33 × 10^‑11, d = 0.96), indicating that the model has internalized universal conceptual associations. Per‑language mean‑centering of embeddings improves the between‑concept to within‑concept distance ratio by a factor of 1.19, providing geometric evidence for a language‑neutral conceptual store analogous to the anterior temporal lobe hub identified in bilingual neuroimaging. Semantic offset vectors between fundamental concept pairs (e.g., man to woman, big to small) show high cross‑lingual consistency (mean cosine = 0.84), suggesting that second‑order relational structure is preserved across typologically diverse languages. We release InterpretCognates, an open‑source interactive toolkit for exploring these phenomena, alongside a fully reproducible analysis pipeline.
Authors:Byung-Kwan Lee, Youngchae Chee, Yong Man Ro
Abstract:
Think‑Answer reasoners such as DeepSeek‑R1 have made notable progress by leveraging interpretable internal reasoning. However, despite the frequent presence of self‑reflective cues like "Oops!", they remain vulnerable to output errors during single‑pass inference. To address this limitation, we propose an efficient Recursive Think‑Answer Process (R‑TAP) that enables models to engage in iterative reasoning cycles and generate more accurate answers, going beyond conventional single‑pass approaches. Central to this approach is a confidence generator that evaluates the certainty of model responses and guides subsequent improvements. By incorporating two complementary rewards‑Recursively Confidence Increase Reward and Final Answer Confidence Reward‑we show that R‑TAP‑enhanced models consistently outperform conventional single‑pass methods for both large language models (LLMs) and vision‑language models (VLMs). Moreover, by analyzing the frequency of "Oops"‑like expressions in model responses, we find that R‑TAP‑applied models exhibit significantly fewer self‑reflective patterns, resulting in more stable and faster inference‑time reasoning. We hope R‑TAP pave the way evolving into efficient and elaborated methods to refine the reasoning processes of future AI.
Authors:Chuong Huynh, Manh Luong, Abhinav Shrivastava
Abstract:
Multimodal retrieval is the task of aggregating information from queries across heterogeneous modalities to retrieve desired targets. State‑of‑the‑art multimodal retrieval models can understand complex queries, yet they are typically limited to two modalities: text and vision. This limitation impedes the development of universal retrieval systems capable of comprehending queries that combine more than two modalities. To advance toward this goal, we present OmniRet, the first retrieval model capable of handling complex, composed queries spanning three key modalities: text, vision, and audio. Our OmniRet model addresses two critical challenges for universal retrieval: computational efficiency and representation fidelity. First, feeding massive token sequences from modality‑specific encoders to Large Language Models (LLMs) is computationally inefficient. We therefore introduce an attention‑based resampling mechanism to generate compact, fixed‑size representations from these sequences. Second, compressing rich omni‑modal data into a single embedding vector inevitably causes information loss and discards fine‑grained details. We propose Attention Sliced Wasserstein Pooling to preserve these fine‑grained details, leading to improved omni‑modal representations. OmniRet is trained on an aggregation of approximately 6 million query‑target pairs spanning 30 datasets. We benchmark our model on 13 retrieval tasks and a MMEBv2 subset. Our model demonstrates significant improvements on composed query, audio and video retrieval tasks, while achieving on‑par performance with state‑of‑the‑art models on others. Furthermore, we curate a new Audio‑Centric Multimodal Benchmark (ACM). This new benchmark introduces two critical, previously missing tasks‑composed audio retrieval and audio‑visual retrieval to more comprehensively evaluate a model's omni‑modal embedding capacity.
Authors:Songming Zhang, Xue Zhang, Tong Zhang, Bojie Hu, Yufeng Chen, Jinan Xu
Abstract:
Knowledge distillation (KD) is an essential technique to compress large language models (LLMs) into smaller ones. However, despite the distinct roles of the student model and the teacher model in KD, most existing frameworks still use a homogeneous training backend (e.g., FSDP and DeepSpeed) for both models, leading to suboptimal training efficiency. In this paper, we present a novel framework for LLM distillation, termed KDFlow, which features a decoupled architecture and employs SGLang for teacher inference. By bridging the training efficiency of FSDP2 and the inference efficiency of SGLang, KDFlow achieves full utilization of both advantages in a unified system. Moreover, instead of transferring full logits across different processes, our framework only transmits the teacher's hidden states using zero‑copy data transfer and recomputes the logits on the student side, effectively balancing the communication cost and KD performance. Furthermore, our framework supports both off‑policy and on‑policy distillation and incorporates KD algorithms for cross‑tokenizer KD through highly extensible and user‑friendly APIs. Experiments show that KDFlow can achieve 1.44× to 6.36× speedup compared to current KD frameworks, enabling researchers to rapidly prototype and scale LLM distillation with minimal engineering overhead. Code is available at: https://github.com/songmzhang/KDFlow
Authors:Xufei Lv, Jiahui Yang, Haoyuan Sun, Xialin Su, Zhiliang Tian, Yifu Gao, Linbo Qiao, Houde Liu
Abstract:
Temporal Knowledge Graph Question Answering (TKGQA) is challenging because it requires multi‑hop reasoning under complex temporal constraints. Recent LLM‑based approaches have improved semantic modeling for this task, but many still rely on fixed reasoning workflows or costly post‑training, which can limit adaptability and make error recovery difficult. We show that enabling an off‑the‑shelf Large Language Model (LLM) to determine its next action is already effective in a zero‑shot setting. Based on this insight, we propose AT2QA, an Autonomous and Training‑free Agent for TKG Question Answering. AT2QA empowers the LLM to iteratively interact with the TKG via a generic search tool, inherently enabling autonomous exploration and dynamic self‑correction during reasoning. To further elicit the LLM's potential for complex temporal reasoning, we introduce a training‑free experience mining mechanism that distills a compact few‑shot demonstration library from successful self‑generated trajectories. AT2QA also yields a transparent audit trail for every prediction. Experiments on three challenging benchmarks ‑‑ MultiTQ, Timeline‑CronQuestion, and Timeline‑ICEWS‑Actor ‑‑ show that AT2QA achieves new state‑of‑the‑art performance, surpassing the strongest baselines by 10.7, 4.9, and 11.2 absolute points, respectively. Our code is available at https://github.com/AT2QA‑Official‑Code/AT2QA‑Official‑Code
Authors:Wenye Lin, Kai Han
Abstract:
Injecting new reasoning knowledge into Large Language Models (LLMs) via post‑training often induces catastrophic forgetting. Recent studies emphasize the importance of on‑policy data but suggest that KL‑divergence fails to mitigate forgetting. In contrast, we show, both analytically and empirically, that the KL‑constrained reward formulation actually plays a critical role in retaining knowledge during post‑training. This motivates our Surgical Post‑Training (SPOT), a proximal on‑policy distillation framework designed to optimize reasoning efficiently while preserving prior knowledge. SPOT consists of (1) a data rectification pipeline employing an Oracle to surgically correct erroneous steps via minimal edits, generating proximal on‑policy data; and (2) a reward‑based binary cross‑entropy objective essential for enhancing reasoning and mitigating forgetting. Empirically, with only 4k rectified math pairs, SPOT improves Qwen3‑8B's accuracy by 6.2% on average across in‑domain and out‑of‑domain tasks, requiring merely 16‑minute model training on 8x H800 GPUs. Moreover, SPOT provides a superior initialization for subsequent reinforcement learning, significantly elevating the performance ceiling. Code: https://github.com/Visual‑AI/SPoT
Authors:Niu Lian, Yuting Wang, Hanshu Yao, Jinpeng Wang, Bin Chen, Yaowei Wang, Min Zhang, Shu-Tao Xia
Abstract:
While multimodal large language models have demonstrated impressive short‑term reasoning, they struggle with long‑horizon video understanding due to limited context windows and static memory mechanisms that fail to mirror human cognitive efficiency. Existing paradigms typically fall into two extremes: vision‑centric methods that incur high latency and redundancy through dense visual accumulation, or text‑centric approaches that suffer from detail loss and hallucination via aggressive captioning. To bridge this gap, we propose MM‑Mem, a pyramidal multimodal memory architecture grounded in Fuzzy‑Trace Theory. MM‑Mem structures memory hierarchically into a Sensory Buffer, Episodic Stream, and Symbolic Schema, enabling the progressive distillation of fine‑grained perceptual traces (verbatim) into high‑level semantic schemas (gist). Furthermore, to govern the dynamic construction of memory, we derive a Semantic Information Bottleneck objective and introduce SIB‑GRPO to optimize the trade‑off between memory compression and task‑relevant information retention. In inference, we design an entropy‑driven top‑down memory retrieval strategy. Extensive experiments across 4 benchmarks confirm that MM‑Mem achieves state‑of‑the‑art performance on both offline and streaming tasks, demonstrating robust generalization and validating the effectiveness of cognition‑inspired memory organization. Code and associated configurations are publicly available at https://github.com/EliSpectre/MM‑Mem.
Authors:Masahiro Kaneko, Ayana Niwa, Timothy Baldwin
Abstract:
Fake news undermines societal trust and decision‑making across politics, economics, health, and international relations, and in extreme cases threatens human lives and societal safety. Because fake news reflects region‑specific political, social, and cultural contexts and is expressed in language, evaluating the risks of large language models (LLMs) requires a multi‑lingual and regional perspective. Malicious users can bypass safeguards through jailbreak attacks, inducing LLMs to generate fake news. However, no benchmark currently exists to systematically assess attack resilience across languages and regions. Here, we propose JailNewsBench, the first benchmark for evaluating LLM robustness against jailbreak‑induced fake news generation. JailNewsBench spans 34 regions and 22 languages, covering 8 evaluation sub‑metrics through LLM‑as‑a‑Judge and 5 jailbreak attacks, with approximately 300k instances. Our evaluation of 9 LLMs reveals that the maximum attack success rate (ASR) reached 86.3% and the maximum harmfulness score was 3.5 out of 5. Notably, for English and U.S.‑related topics, the defensive performance of typical multi‑lingual LLMs was significantly lower than for other regions, highlighting substantial imbalances in safety across languages and regions. In addition, our analysis shows that coverage of fake news in existing safety datasets is limited and less well defended than major categories such as toxicity and social bias. Our dataset and code are available at https://github.com/kanekomasahiro/jail_news_bench.
Authors:Yanping Li, Zhening Liu, Zijian Li, Zehong Lin, Jun Zhang
Abstract:
Fine‑tuning large language models (LLMs) on custom datasets has become a standard approach for adapting these models to specific domains and applications. However, recent studies have shown that such fine‑tuning can lead to significant degradation in the model's safety. Existing defense methods operate at the sample level and often suffer from an unsatisfactory trade‑off between safety and utility. To address this limitation, we perform a systematic token‑level diagnosis of safety degradation during fine‑tuning. Based on this, we propose token‑level data selection for safe LLM fine‑tuning (TOSS), a novel framework that quantifies the safety risk of each token by measuring the loss difference between a safety‑degraded model and a utility‑oriented model. This token‑level granularity enables accurate identification and removal of unsafe tokens, thereby preserving valuable task‑specific information. In addition, we introduce a progressive refinement strategy, TOSS‑Pro, which iteratively enhances the safety‑degraded model's ability to identify unsafe tokens. Extensive experiments demonstrate that our approach robustly safeguards LLMs during fine‑tuning while achieving superior downstream task performance, significantly outperforming existing sample‑level defense methods. Our code is available at https://github.com/Polly‑LYP/TOSS.
Authors:Tongtong Wu, Yanming Li, Ziye Tang, Chen Jiang, Linhao Luo, Guilin Qi, Shirui Pan, Gholamreza Haffari
Abstract:
Large language model (LLM)‑based multi‑agent systems have shown strong capabilities in tasks such as code generation and collaborative reasoning. However, the effectiveness and robustness of these systems critically depend on their communication topology, which is often fixed or statically learned, ignoring real‑world dynamics such as model upgrades, API (or tool) changes, or knowledge source variability. To address this limitation, we propose CARD (Conditional Agentic Graph Designer), a conditional graph‑generation framework that instantiates AMACP, a protocol for adaptive multi‑agent communication. CARD explicitly incorporates dynamic environmental signals into graph construction, enabling topology adaptation at both training and runtime. Through a conditional variational graph encoder and environment‑aware optimization, CARD produces communication structures that are both effective and resilient to shifts in model capability or resource availability. Empirical results on HumanEval, MATH, and MMLU demonstrate that CARD consistently outperforms static and prompt‑based baselines, achieving higher accuracy and robustness across diverse conditions. The source code is available at: https://github.com/Warma10032/CARD.
Authors:Zhuokang Shen, Yifan Wang, Hanyu Chen, Wenxuan Huang, Shaohui Lin
Abstract:
Recent advances in large language models (LLMs) have enabled increasingly capable chatbots. However, most existing systems focus on single‑user settings and do not generalize well to multi‑user group chats, where agents require more proactive and accurate intervention under complex, evolving contexts. Existing approaches typically rely on LLMs for both reasoning and generation, leading to high token consumption, limited scalability, and potential privacy risks. To address these challenges, we propose GroupGPT, a token‑efficient and privacy‑preserving agentic framework for multi‑user chat assistant. GroupGPT adopts a small‑large model collaborative architecture to decouple intervention timing from response generation, enabling efficient and accurate decision‑making. The framework also supports multimodal inputs, including memes, images, videos, and voice messages. We further introduce MUIR, a benchmark dataset for multi‑user chat assistant intervention reasoning. MUIR contains 2,500 annotated group chat segments with intervention labels and rationales, supporting evaluation of timing accuracy and response quality. We evaluate a range of models on MUIR, from large language models to smaller counterparts. Extensive experiments demonstrate that GroupGPT produces accurate and well‑timed responses, achieving an average score of 4.72/5.0 in LLM‑based evaluation, and is well received by users across diverse group chat scenarios. Moreover, GroupGPT reduces token usage by up to 3 times compared to baseline methods, while providing privacy sanitization of user messages before cloud transmission. Code is available at: https://github.com/Eliot‑Shen/GroupGPT .
Authors:Jiafeng Lin, Yuxuan Wang, Jialong Wu, Huakun Luo, Zhongyi Pei, Jianmin Wang
Abstract:
Large Language Models (LLMs) have demonstrated remarkable success in general‑purpose reasoning. However, they still struggle to understand and reason about time series data, which limits their effectiveness in decision‑making scenarios that depend on temporal dynamics. In this paper, we propose Thoth, the first family of mid‑trained LLMs with general‑purpose time series understanding capabilities. As a pivotal intermediate stage, mid‑training achieves task‑ and domain‑agnostic alignment between time series and natural language, for which we construct Book‑of‑Thoth, a high‑quality, time‑series‑centric mid‑training corpus. Book‑of‑Thoth enables both time‑series‑to‑text and text‑to‑time‑series generation, equipping LLMs with a foundational grasp of temporal patterns. To better evaluate advanced reasoning capabilities, we further present KnoTS, a novel benchmark of knowledge‑intensive time series understanding, designed for joint reasoning over temporal patterns and domain knowledge. Extensive experiments demonstrate that mid‑training with Book‑of‑Thoth enables Thoth to significantly outperform its base model and advanced LLMs across a range of time series question answering benchmarks. Moreover, Thoth exhibits superior capabilities when fine‑tuned under data scarcity, underscoring the effectiveness of mid‑training for time series understanding. Code is available at: https://github.com/thuml/Thoth.
Authors:Abigail Berthe-Pardo, Gaspard Michel, Elena V. Epure, Christophe Cerisara
Abstract:
With recent advances in Text‑to‑Speech (TTS) systems, synthetic audiobook narration has seen increased interest, reaching unprecedented levels of naturalness. However, larger gaps remain in synthetic narration systems' ability to impersonate fictional characters, and convey complex emotions or prosody. A promising direction to enhance character identification is the assignment of plausible voices to each fictional characters in a book. This step typically requires complex inference of attributes in book‑length contexts, such as a character's age, gender, origin or physical health, which in turns requires dedicated benchmark datasets to evaluate extraction systems' performances. We present S‑VoCAL (Speaking Voice Character Attributes in Literature), the first dataset and evaluation framework dedicated to evaluate the inference of voice‑related fictional character attributes. S‑VoCAL entails 8 attributes grounded in sociophonetic studies, and 952 character‑book pairs derived from Project Gutenberg. Its evaluation framework addresses the particularities of each attribute, and includes a novel similarity metric based on recent Large Language Models embeddings. We demonstrate the applicability of S‑VoCAL by applying a simple Retrieval‑Augmented Generation (RAG) pipeline to the task of inferring character attributes. Our results suggest that the RAG pipeline reliably infers attributes such as Age or Gender, but struggles on others such as Origin or Physical Health. The dataset and evaluation code are available at https://github.com/AbigailBerthe/S‑VoCAL .
Authors:Igor Rozhkov, Natalia Loukachevitch
Abstract:
Nested named entity recognition identifies entities contained within other entities, but requires expensive multi‑level annotation. While flat NER corpora exist abundantly, nested resources remain scarce. We investigate whether models can learn nested structure from flat annotations alone, evaluating four approaches: string inclusions (substring matching), entity corruption (pseudo‑nested data), flat neutralization (reducing false negative signal), and a hybrid fine‑tuned + LLM pipeline. On NEREL, a Russian benchmark with 29 entity types where 21% of entities are nested, our best combined method achieves 26.37% inner F1, closing 40% of the gap to full nested supervision. Code is available at https://github.com/fulstock/Learning‑from‑Flat‑Annotations.
Authors:Andrew Zhuoer Feng, Cunxiang Wang, Bosi Wen, Yidong Wang, Yu Luo, Hongning Wang, Minlie Huang
Abstract:
Large language model alignment via reinforcement learning depends critically on reward function quality. However, static, domain‑specific reward models are often costly to train and exhibit poor generalization in out‑of‑distribution scenarios encountered during RL iterations. We present RLAR (Reinforcement Learning from Agent Rewards), an agent‑driven framework that dynamically assigns tailored reward functions to individual queries. Specifically, RLAR transforms reward acquisition into a dynamic tool synthesis and invocation task. It leverages LLM agents to autonomously retrieve optimal reward models from the Internet and synthesize programmatic verifiers through code generation. This allows the reward system to self‑evolve with the shifting data distributions during training. Experimental results demonstrate that RLAR yields consistent performance gains ranging from 10 to 60 across mathematics, coding, translation, and dialogue tasks. On RewardBench‑V2, RLAR significantly outperforms static baselines and approaches the performance upper bound, demonstrating superior generalization through dynamic reward orchestration. The data and code are available on this link: https://github.com/ZhuoerFeng/RLAR.
Authors:Shiqi Chen, Jingze Gai, Ruochen Zhou, Jinghan Zhang, Tongyao Zhu, Junlong Li, Kangrui Wang, Zihan Wang, Zhengyu Chen, Klara Kaleb, Ning Miao, Siyang Gao, Cong Lu, Manling Li, Junxian He, Yee Whye Teh
Abstract:
Real‑world tool‑using agents operate over long‑horizon workflows with recurring structure and diverse demands, where effective behavior requires not only invoking atomic tools but also abstracting, and reusing higher‑level tool compositions. However, existing benchmarks mainly measure instance‑level success under static tool sets, offering limited insight into agents' ability to acquire such reusable skills. We address this gap by introducing SkillCraft, a benchmark explicitly stress‑test agent ability to form and reuse higher‑level tool compositions, where we call Skills. SkillCraft features realistic, highly compositional tool‑use scenarios with difficulty scaled along both quantitative and structural dimensions, designed to elicit skill abstraction and cross‑task reuse. We further propose a lightweight evaluation protocol that enables agents to auto‑compose atomic tools into executable Skills, cache and reuse them inside and across tasks, thereby improving efficiency while accumulating a persistent library of reusable skills. Evaluating state‑of‑the‑art agents on SkillCraft, we observe substantial efficiency gains, with token usage reduced by up to 80% by skill saving and reuse. Moreover, success rate strongly correlates with tool composition ability at test time, underscoring compositional skill acquisition as a core capability.
Authors:Andrew Zhuoer Feng, Cunxiang Wang, Yu Luo, Bosi Wen, Yidong Wang, Lin Fan, Yilin Zhou, Zikang Wang, Wenbo Yu, Lindong Wu, Hongning Wang, Minlie Huang
Abstract:
Large Language Models have evolved from single‑round generators into long‑horizon agents, capable of complex text synthesis scenarios. However, current evaluation frameworks lack the ability to assess the actual synthesis operations, such as outlining, drafting, and editing. Consequently, they fail to evaluate the actual and detailed capabilities of LLMs. To bridge this gap, we introduce RAVEL, an agentic framework that enables the LLM testers to autonomously plan and execute typical synthesis operations, including outlining, drafting, reviewing, and refining. Complementing this framework, we present C3EBench, a comprehensive benchmark comprising 1,258 samples derived from professional human writings. We utilize a "reverse‑engineering" pipeline to isolate specific capabilities across four tasks: Cloze, Edit, Expand, and End‑to‑End. Through our analysis of 14 LLMs, we uncover that most LLMs struggle with tasks that demand contextual understanding under limited or under‑specified instructions. By augmenting RAVEL with SOTA LLMs as operators, we find that such agentic text synthesis is dominated by the LLM's reasoning capability rather than raw generative capacity. Furthermore, we find that a strong reasoner can guide a weaker generator to yield higher‑quality results, whereas the inverse does not hold. Our code and data are available at this link: https://github.com/ZhuoerFeng/RAVEL‑Reasoning‑Agents‑Text‑Eval.
Authors:Jason Lucas, Matt Murtagh-White, Adaku Uchendu, Ali Al-Lawati, Michiharu Yamashita, Dominik Macko, Ivan Srba, Robert Moro, Dongwon Lee
Abstract:
Multilingual falsehoods threaten information integrity worldwide, yet detection benchmarks remain confined to English or a few high‑resource languages, leaving low‑resource linguistic communities without robust defense tools. We introduce BLUFF, a comprehensive benchmark for detecting false and synthetic content, spanning 79 languages with over 202K samples, combining human‑written fact‑checked content (122K+ samples across 57 languages) and LLM‑generated content (79K+ samples across 71 languages). BLUFF uniquely covers both high‑resource "big‑head" (20) and low‑resource "long‑tail" (59) languages, addressing critical gaps in multilingual research on detecting false and synthetic content. Our dataset features four content types (human‑written, LLM‑generated, LLM‑translated, and hybrid human‑LLM text), bidirectional translation (English\leftrightarrowX), 39 textual modification techniques (36 manipulation tactics for fake news, 3 AI‑editing strategies for real news), and varying edit intensities generated using 19 diverse LLMs. We present AXL‑CoI (Adversarial Cross‑Lingual Agentic Chainof‑Interactions), a novel multi‑agentic framework for controlled fake/real news generation, paired with mPURIFY, a quality filtering pipeline ensuring dataset integrity. Experiments reveal state‑of‑theart detectors suffer up to 25.3% F1 degradation on low‑resource versus high‑resource languages. BLUFF provides the research community with a multilingual benchmark, extensive linguistic‑oriented benchmark evaluation, comprehensive documentation, and opensource tools to advance equitable falsehood detection. Dataset and code are available at: https://jsl5710.github.io/BLUFF/
Authors:Shu-Xun Yang, Cunxiang Wang, Haoke Zhang, Wenbo Yu, Lindong Wu, Jiayi Gui, Dayong Yang, Yukuo Cen, Zhuoer Feng, Bosi Wen, Yidong Wang, Lucen Zhong, Jiamin Ren, Linfeng Zhang, Jie Tang
Abstract:
Agentic systems augment large language models with external tools and iterative decision making, enabling complex tasks such as deep research, function calling, and coding. However, their long and intricate execution traces make failure diagnosis and root cause analysis extremely challenging. Manual inspection does not scale, while directly applying LLMs to raw traces is hindered by input length limits and unreliable reasoning. Focusing solely on final task outcomes further discards critical behavioral information required for accurate issue localization. To address these issues, we propose TraceSIR, a multi‑agent framework for structured analysis and reporting of agentic execution traces. TraceSIR coordinates three specialized agents: (1) StructureAgent, which introduces a novel abstraction format, TraceFormat, to compress execution traces while preserving essential behavioral information; (2) InsightAgent, which performs fine‑grained diagnosis including issue localization, root cause analysis, and optimization suggestions; (3) ReportAgent, which aggregates insights across task instances and generates comprehensive analysis reports. To evaluate TraceSIR, we construct TraceBench, covering three real‑world agentic scenarios, and introduce ReportEval, an evaluation protocol for assessing the quality and usability of analysis reports aligned with industry needs. Experiments show that TraceSIR consistently produces coherent, informative, and actionable reports, significantly outperforming existing approaches across all evaluation dimensions. Our project and video are publicly available at https://github.com/SHU‑XUN/TraceSIR.
Authors:Anastasia Zhukova, Terry Ruas, Jan Philip Wahle, Bela Gipp
Abstract:
Research in CDCR remains fragmented due to heterogeneous dataset formats, varying annotation standards, and the predominance of the CDCR definition as the event coreference resolution (ECR). To address these challenges, we introduce uCDCR, a unified dataset that consolidates diverse publicly available English CDCR corpora across various domains into a consistent format, which we analyze with standardized metrics and evaluation protocols. uCDCR incorporates both entity and event coreference, corrects known inconsistencies, and enriches datasets with missing attributes to facilitate reproducible research. We establish a cohesive framework for fair, interpretable, and cross‑dataset analysis in CDCR and compare the datasets on their lexical properties, e.g., lexical composition of the annotated mentions, lexical diversity and ambiguity metrics, discuss the annotation rules and principles that lead to high lexical diversity, and examine how these metrics influence performance on the same‑head‑lemma baseline. Our dataset analysis shows that ECB+, the state‑of‑the‑art benchmark for CDCR, has one of the lowest lexical diversities, and its CDCR complexity, measured by the same‑head‑lemma baseline, lies in the middle among all uCDCR datasets. Moreover, comparing document and mention distributions between ECB+ and uCDCR shows that using all uCDCR datasets for model training and evaluation will improve the generalizability of CDCR models. Finally, the almost identical performance on the same‑head‑lemma baseline, separately applied to events and entities, shows that resolving both types is a complex task and should not be steered toward ECR alone. The uCDCR dataset is available at https://huggingface.co/datasets/AnZhu/uCDCR, and the code for parsing, analyzing, and scoring the dataset is available at https://github.com/anastasia‑zhukova/uCDCR.
Authors:Yuchen Hou, Lin Zhao
Abstract:
Vision‑Language‑Action (VLA) models achieve over 95% success on standard benchmarks. However, through systematic experiments, we find that current state‑of‑the‑art VLA models largely ignore language instructions. Prior work lacks: (1) systematic semantic perturbation diagnostics, (2) a benchmark that forces language understanding by design, and (3) linguistically diverse training data. This paper constructs the LangGap benchmark, based on a four‑dimensional semantic perturbation method ‑‑ varying instruction semantics while keeping the tabletop layout fixed ‑‑ revealing language understanding deficits in π0.5. Existing benchmarks like LIBERO assign only one task per layout, underutilizing available objects and target locations; LangGap fully diversifies pick‑and‑place tasks under identical layouts, forcing models to truly understand language. Experiments show that targeted data augmentation can partially close the language gap ‑‑ success rate improves from 0% to 90% with single‑task training, and 0% to 28% with multi‑task training. However, as semantic diversity of extended tasks increases, model learning capacity proves severely insufficient; even trained tasks perform poorly. This reveals a fundamental challenge for VLA models in understanding diverse language instructions ‑‑ precisely the long‑term value of LangGap.
Authors:Yubo Dong, Nianhao You, Yuxuan Hou, Zixun Sun, Yue Zhang, Liang Zhang, Siyuan Zhao, Hehe Fan
Abstract:
While Large Language Models (LLMs) have demonstrated proficiency in Deep Research or Wide Search, their capacity to solve highly complex questions‑those requiring long‑horizon planning, massive evidence gathering, and synthesis across heterogeneous sources‑remains largely unexplored. We introduce Super Research, a task for complex autonomous research tasks that integrates (i) structured decomposition into a research plan, (ii) super wide retrieval for diverse perspectives, and (iii) super deep investigation to resolve uncertainties through iterative queries. To evaluate this capability, we curated a benchmark of 300 expert‑written questions across diverse domains, each requiring up to 100+ retrieval steps and 1,000+ web pages to reconcile conflicting evidence. Super Research produces verifiable reports with fine‑grained citations and intermediate artifacts (e.g., outlines and tables) to ensure traceable reasoning. Furthermore, we present a graph‑anchored auditing protocol that evaluates Super Research along five dimensions: Coverage, Logical Consistency, Report Utility, Objectivity and Citation Health. While super‑complex questions may be infrequent in standard applications, Super Research serves as a critical ceiling evaluation and stress test for LLM capabilities. A model's proficiency within Super Research acts as a powerful proxy for its general research competence; success here suggests the robustness necessary to navigate nearly any subordinate research task. Leaderboard is available at: https://cnsdqd‑dyb.github.io/Super‑Research‑Benchmark/
Authors:Hongjin Qian, Ziyi Xia, Ze Liu, Jianlyu Chen, Kun Luo, Minghao Qin, Chaofan Li, Lei Xiong, Junwei Lan, Sen Wang, Zhengyang Liang, Yingxia Shao, Defu Lian, Zheng Liu
Abstract:
LLM‑agents are increasingly used to accelerate the progress of scientific research. Yet a persistent bottleneck is data access: agents not only lack readily available tools for retrieval, but also have to work with unstrcutured, human‑centric data on the Internet, such as HTML web‑pages and PDF files, leading to excessive token consumption, limit working efficiency, and brittle evidence look‑up. This gap motivates the development of an agentic data interface, which is designed to enable agents to access and utilize scientific literature in a more effective, efficient, and cost‑aware manner. In this paper, we introduce DeepXiv‑SDK, which offers a three‑layer agentic data interface for scientific literature. 1) Data Layer, which transforms unstructured, human‑centric data into normalized and structured representations in JSON format, improving data usability and enabling progressive accessibility of the data. 2) Service Layer, which presents readily available tools for data access and ad‑hoc retrieval. It also enables a rich form of agent usage, including CLI, MCP, and Python SDK. 3) Application Layer, which creates a built‑in agent, packaging basic tools from the service layer to support complex data access demands. DeepXiv‑SDK currently supports the complete ArXiv corpus, and is synchronized daily to incorporate new releases. It is designed to extend to all common open‑access corpora, such as PubMed Central, bioRxiv, medRxiv, and chemRxiv. We release RESTful APIs, an open‑source Python SDK, and a web demo showcasing deep search and deep research workflows. DeepXiv‑SDK is free to use with registration.
Authors:Zhengbo Wang, Jian Liang, Ran He, Zilei Wang, Tieniu Tan
Abstract:
Modern optimizers like Adam and Muon are central to training large language models, but their reliance on first‑ and second‑order momenta introduces significant memory overhead, which constrains scalability and computational efficiency. In this work, we reframe the exponential moving average (EMA) used in these momenta as the training of a linear regressor via online gradient flow. Building on this equivalence, we introduce LoRA‑Pre, a novel low‑rank optimizer designed for efficient pre‑training. Specifically, LoRA‑Pre reduces the optimizer's memory footprint by decomposing the full momentum matrix into a compact low‑rank subspace within the online linear learner, thereby maintaining optimization performance while improving memory efficiency. We empirically validate LoRA‑Pre's efficacy by pre‑training models from the Llama architecture family, scaling from 60M to 1B parameters. LoRA‑Pre achieves the highest performance across all model sizes. Notably, LoRA‑Pre demonstrates remarkable rank efficiency, achieving comparable or superior results using only 1/8 the rank of baseline methods. Beyond pre‑training, we evaluate LoRA‑Pre's effectiveness in fine‑tuning scenarios. With the same rank, LoRA‑Pre consistently outperforms all efficient fine‑tuning baselines. Specifically, compared to standard LoRA, LoRA‑Pre achieves substantial improvements of 3.14 points on Llama‑3.1‑8B and 6.17 points on Llama‑2‑7B, validating our approach's effectiveness across both pre‑training and fine‑tuning paradigms. Our code is publicly available at https://github.com/mrflogs/LoRA‑Pre.
Authors:Haritz Puerto, Haonan Li, Xudong Han, Timothy Baldwin, Iryna Gurevych
Abstract:
AI agents powered by reasoning models require access to sensitive user data. However, their reasoning traces are difficult to control, which can result in the unintended leakage of private information to external parties. We propose training models to follow instructions not only in the final answer, but also in reasoning traces, potentially under different constraints. We hypothesize that improving their instruction following abilities in the reasoning traces can improve their privacy‑preservation skills. To demonstrate this, we fine‑tune models on a new instruction‑following dataset with explicit restrictions on reasoning traces. We further introduce a generation strategy that decouples reasoning and answer generation using separate LoRA adapters. We evaluate our approach on six models from two model families, ranging from 1.7B to 14B parameters, across two instruction‑following benchmarks and two privacy benchmarks. Our method yields substantial improvements, achieving gains of up to 20.9 points in instruction‑following performance and up to 51.9 percentage points on privacy benchmarks. These improvements, however, can come at the cost of task utility, due to the trade‑off between reasoning performance and instruction‑following abilities. Overall, our results show that improving instruction‑following behavior in reasoning models can significantly enhance privacy, suggesting a promising direction for the development of future privacy‑aware agents. Our code and data are available at https://github.com/UKPLab/arxiv2026‑controllable‑reasoning‑models
Authors:Zhengren Wang, Dongsheng Ma, Huaping Zhong, Jiayu Li, Wentao Zhang, Bin Wang, Conghui He
Abstract:
The expansion of retrieval‑augmented generation (RAG) into multimodal domains has intensified the challenge for processing complex visual documents, such as financial reports. While page‑level chunking and retrieval is a natural starting point, it creates a critical bottleneck: delivering entire pages to the generator introduces excessive extraneous context. This not only overloads the generator's attention mechanism but also dilutes the most salient evidence. Moreover, compressing these information‑rich pages into a limited visual token budget further increases the risk of hallucinations. To address this, we introduce AgenticOCR, a dynamic parsing paradigm that transforms optical character recognition (OCR) from a static, full‑text process into a query‑driven, on‑demand extraction system. By autonomously analyzing document layout in a "thinking with images" manner, AgenticOCR identifies and selectively recognizes regions of interest. This approach performs on‑demand decompression of visual tokens precisely where needed, effectively decoupling retrieval granularity from rigid page‑level chunking. AgenticOCR has the potential to serve as the "third building block" of the visual document RAG stack, operating alongside and enhancing standard Embedding and Reranking modules. Experimental results demonstrate that AgenticOCR improves both the efficiency and accuracy of visual RAG systems, achieving expert‑level performance in long document understanding. Code and models are available at https://github.com/OpenDataLab/AgenticOCR.
Authors:Daniel Yang, Samuel Stante, Florian Redhardt, Lena Libon, Parnian Kassraie, Ido Hakimi, Barna Pásztor, Andreas Krause
Abstract:
Reward models are central to aligning large language models (LLMs) with human preferences. Yet most approaches rely on pointwise reward estimates that overlook the epistemic uncertainty in reward models arising from limited human feedback. Recent work suggests that quantifying this uncertainty can reduce the costs of human annotation via uncertainty‑guided active learning and mitigate reward overoptimization in LLM post‑training. However, uncertainty‑aware reward models have so far been adopted without thorough comparison, leaving them poorly understood. This work introduces a unified framework, RewardUQ, to systematically evaluate uncertainty quantification for reward models. We compare common methods along standard metrics measuring accuracy and calibration, and we propose a new ranking strategy incorporating both dimensions for a simplified comparison. Our experimental results suggest that model size and initialization have the most meaningful impact on performance, and most prior work could have benefited from alternative design choices. To foster the development and evaluation of new methods and aid the deployment in downstream applications, we release our open‑source framework as a Python package. Our code is available at https://github.com/lasgroup/rewarduq.
Authors:Xiaoyu Guo, Arkaitz Zubiaga
Abstract:
With the aim of detecting AI‑generated images and identifying the specific models responsible for their generation, we propose a multi‑modal multi‑task model. The model leverages pre‑trained BERT and CLIP Vision encoders for text and image feature extraction, respectively, and employs cross‑modal feature fusion with a tailored multi‑task loss function. Additionally, a pseudo‑labeling‑based data augmentation strategy was utilized to expand the training dataset with high‑confidence samples. The model achieved fifth place in both Tasks A and B of the `CT2: AI‑Generated Image Detection' competition, with F1 scores of 83.16% and 48.88%, respectively. These findings highlight the effectiveness of the proposed architecture and its potential for advancing AI‑generated content detection in real‑world scenarios. The source code for our method is published on https://github.com/xxxxxxxxy/AIGeneratedImageDetection.
Authors:Hao Wu, Xudong Wang, Jialiang Zhang, Junlong Tong, Xinghao Chen, Junyan Lin, Yunpu Ma, Xiaoyu Shen
Abstract:
One‑stream Transformer‑based trackers achieve advanced performance in visual object tracking but suffer from significant computational overhead that hinders real‑time deployment. While token pruning offers a path to efficiency, existing methods are fragmented. They typically prune the search region, dynamic template, and static template in isolation, overlooking critical inter‑component dependencies, which yields suboptimal pruning and degraded accuracy. To address this, we introduce UTPTrack, a simple and Unified Token Pruning framework that, for the first time, jointly compresses all three components. UTPTrack employs an attention‑guided, token type‑aware strategy to holistically model redundancy, a design that seamlessly supports unified tracking across multimodal and language‑guided tasks within a single model. Extensive evaluations on 10 benchmarks demonstrate that UTPTrack achieves a new state‑of‑the‑art in the accuracy‑efficiency trade‑off for pruning‑based trackers, pruning 65.4% of vision tokens in RGB‑based tracking and 67.5% in unified tracking while preserving 99.7% and 100.5% of baseline performance, respectively. This strong performance across both RGB and multimodal scenarios underlines its potential as a robust foundation for future research in efficient visual tracking. Code will be released at https://github.com/EIT‑NLP/UTPTrack.
Authors:Hao Wu, Yingqi Fan, Jinyang Dai, Junlong Tong, Yunpu Ma, Xiaoyu Shen
Abstract:
The quadratic computational cost of processing vision tokens in Multimodal Large Language Models (MLLMs) hinders their widespread adoption. While progressive vision token pruning offers a promising solution, current methods misinterpret shallow layer functions and use rigid schedules, which fail to unlock the full efficiency potential. To address these issues, we propose HiDrop, a framework that aligns token pruning with the true hierarchical function of MLLM layers. HiDrop features two key innovations: (1) Late Injection, which bypasses passive shallow layers to introduce visual tokens exactly where active fusion begins; and (2) Concave Pyramid Pruning with an Early Exit mechanism to dynamically adjust pruning rates across middle and deep layers. This process is optimized via an inter‑layer similarity measure and a differentiable top‑k operator. To ensure practical efficiency, HiDrop further incorporates persistent positional encoding, FlashAttention‑compatible token selection, and parallel decoupling of vision computation to eliminate hidden overhead associated with dynamic token reduction. Extensive experiments show that HiDrop compresses about 90% visual tokens while matching the original performance and accelerating training by 1.72 times. Our work not only sets a new state‑of‑the‑art for efficient MLLM training and inference but also provides valuable insights into the hierarchical nature of multimodal fusion. The code is released at https://github.com/EIT‑NLP/HiDrop.
Authors:Michael Frew, Nishit Bheda, Bryan Tripp
Abstract:
Though patients are increasingly granted digital access to their electronic health records (EHRs), existing interfaces may not support precise, trustworthy answers to patient‑specific questions. Large language models (LLM) show promise in clinical question answering (QA), but retrieval‑based approaches are computationally inefficient, prone to hallucination, and difficult to deploy over real‑life EHRs. In this work, we introduce FHIRPath‑QA, the first open dataset and benchmark for patient‑specific QA that includes open‑standard FHIRPath queries over real‑world clinical data. We propose a text‑to‑FHIRPath QA paradigm that shifts reasoning from free‑text generation to FHIRPath query synthesis, significantly reducing LLM usage. Built on MIMIC‑IV on FHIR Demo, the dataset pairs over 14k natural language questions in patient and clinician phrasing with validated FHIRPath queries and answers. Further, we demonstrate that state‑of‑the‑art LLMs struggle to deal with ambiguity in patient language and perform poorly in FHIRPath query synthesis. However, they benefit strongly from supervised fine‑tuning. Our results highlight that text‑to‑FHIRPath synthesis has the potential to serve as a practical foundation for safe, efficient, and interoperable consumer health applications, and our dataset and benchmark serve as a starting point for future research on the topic. The full dataset and generation code is available at: https://github.com/mooshifrew/fhirpath‑qa.
Authors:Sungho Park, Jueun Kim, Wook-Shin Han
Abstract:
Real‑world Table‑Text question answering (QA) tasks require models that can reason across long text and source tables, traversing multiple hops and executing complex operations such as aggregation. Yet existing benchmarks are small, manually curated ‑ and therefore error‑prone ‑ and contain shallow questions that seldom demand more than two hops or invoke aggregations, grouping, or other advanced analytical operations expressible in natural‑language queries. We present SPARTA, an end‑to‑end construction framework that automatically generates large‑scale Table‑Text QA benchmarks with lightweight human validation, requiring only one quarter of the annotation time of HybridQA. The framework first constructs a reference fact database by enriching each source table with grounding tables whose tuples are atomic facts automatically extracted from the accompanying unstructured passages, then synthesizes nested queries whose number of nested predicates matches the desired hop count. To ensure that every SQL statement is executable and that its verbalization yields a fluent, human‑sounding question, we propose two novel techniques: provenance‑based refinement, which rewrites any syntactically valid query that returns a non‑empty result, and realistic‑structure enforcement, which confines generation to post‑order traversals of the query graph. The resulting pipeline produces thousands of high‑fidelity question‑answer pairs covering aggregations, grouping, and deep multi‑hop reasoning across text and tables. On SPARTA, state‑of‑the‑art models that reach over 70 F1 on HybridQA or over 50 F1 on OTT‑QA drop by more than 30 F1 points, exposing fundamental weaknesses in current cross‑modal reasoning. Our benchmark, construction code, and baseline models are available at https://github.com/pshlego/SPARTA/tree/main.
Authors:Yutong Wang, Siyuan Xiong, Xuebo Liu, Wenkang Zhou, Liang Ding, Miao Zhang, Min Zhang
Abstract:
While Multi‑Agent Systems (MAS) excel in complex reasoning, they suffer from the cascading impact of erroneous information generated by individual participants. Current solutions often resort to rigid structural engineering or expensive fine‑tuning, limiting their deployability and adaptability. We propose AgentDropoutV2, a test‑time rectify‑or‑reject pruning framework designed to dynamically optimize MAS information flow without retraining. Our approach acts as an active firewall, intercepting agent outputs and employing a retrieval‑augmented rectifier to iteratively correct errors based on a failure‑driven indicator pool. This mechanism allows for the precise identification of potential errors using distilled failure patterns as prior knowledge. Irreparable outputs are subsequently pruned to prevent error propagation, while a fallback strategy preserves system integrity. Empirical results on extensive math benchmarks show that AgentDropoutV2 significantly boosts the MAS's task performance, achieving an average accuracy gain of 6.3 percentage points on math benchmarks. Furthermore, the system exhibits robust generalization and adaptivity, dynamically modulating rectification efforts based on task difficulty while leveraging context‑aware indicators to resolve a wide spectrum of error patterns. Our code and dataset are released at https://github.com/TonySY2/AgentDropoutV2.
Authors:Pengxiang Li, Dilxat Muhtar, Tianlong Chen, Lu Yin, Shiwei Liu
Abstract:
Diffusion Language Models (DLMs) are often advertised as enabling parallel token generation, yet practical fast DLMs frequently converge to left‑to‑right, autoregressive (AR)‑like decoding dynamics. In contrast, genuinely non‑AR generation is promising because it removes AR's sequential bottleneck, better exploiting parallel hardware to reduce synchronization/communication overhead and improve latency scaling with output length. We argue that a primary driver of AR‑like decoding is a mismatch between DLM objectives and the highly sequential structure of widely used training data, including standard pretraining corpora and long chain‑of‑thought (CoT) supervision. Motivated by this diagnosis, we propose NAP (Non‑Autoregressive Parallel DLMs), a proof‑of‑concept, data‑centric approach that better aligns supervision with non‑AR parallel decoding. NAP curates examples as multiple independent reasoning trajectories and couples them with a parallel‑forced decoding strategy that encourages multi‑token parallel updates. Across math reasoning benchmarks, NAP yields stronger performance under parallel decoding than DLMs trained on standard long CoT data, with gains growing as parallelism increases. Our results suggest that revisiting data and supervision is a principled direction for mitigating AR‑like behavior and moving toward genuinely non‑autoregressive parallel generation in DLMs. Our code is available at https://github.com/pixeli99/NAP.
Authors:Sara Rosenthal, Yannis Katsis, Vraj Shah, Lihong He, Lucian Popa, Marina Danilevsky
Abstract:
We present MTRAG‑UN, a benchmark for exploring open challenges in multi‑turn retrieval augmented generation, a popular use of large language models. We release a benchmark of 666 tasks containing over 2,800 conversation turns across 6 domains with accompanying corpora. Our experiments show that retrieval and generation models continue to struggle on conversations with UNanswerable, UNderspecified, and NONstandalone questions and UNclear responses. Our benchmark is available at https://github.com/IBM/mt‑rag‑benchmark
Authors:Jayadev Billa
Abstract:
Numerous studies have shown that multimodal LLMs process speech and images well but fail in non‑intuitive ways rendering trivial tasks such as object counting unreliable. We investigate this behavior from an information‑theoretic perspective by framing multimodal LLM inference as a mismatched decoder problem: a decoder trained primarily on text can only extract information along text‑aligned directions (removing up to 98% of the variation in modality‑specific (non‑text) directions improves decoder loss) and the amount of accessible information is bounded by the Generalized Mutual Information (GMI). We show that information loss is bounded as the distributional mismatch between the source data and the text data increases, and as the sensitivity of the decoder increases. This bound is a function of the model's scoring rule not its architecture. We validate the predictions across five models spanning speech and vision. A controlled study (two Prismatic VLMs differing only in encoder text‑alignment) shows that the bottleneck lies in the scoring rule of the decoder rather than the text‑alignment of the encoder or the learned projection. A LoRA intervention demonstrates that simply training with an emotion‑related objective improves emotion detection from 17.3% to 61.8% task accuracy without affecting other attributes, confirming that the training objective determines what becomes accessible.
Authors:Bangrui Xu, Qihang Yao, Zirui Tang, Xuanhe Zhou, Yeye He, Shihan Yu, Qianqian Xu, Bin Wang, Guoliang Li, Conghui He, Fan Wu
Abstract:
Semi‑structured documents integrate diverse interleaved data elements (e.g., tables, charts, hierarchical paragraphs) arranged in various and often irregular layouts. These documents are widely observed across domains and account for a large portion of real‑world data. However, existing methods struggle to support natural language question answering over these documents due to three main technical challenges: (1) The elements extracted by techniques like OCR are often fragmented and stripped of their original semantic context, making them inadequate for analysis. (2) Existing approaches lack effective representations to capture hierarchical structures within documents (e.g., associating tables with nested chapter titles) and to preserve layout‑specific distinctions (e.g., differentiating sidebars from main content). (3) Answering questions often requires retrieving and aligning relevant information scattered across multiple regions or pages, such as linking a descriptive paragraph to table cells located elsewhere in the document. To address these issues, we propose MoDora, an LLM‑powered system for semi‑structured document analysis. First, we adopt a local‑alignment aggregation strategy to convert OCR‑parsed elements into layout‑aware components, and conduct type‑specific information extraction for components with hierarchical titles or non‑text elements. Second, we design the Component‑Correlation Tree (CCTree) to hierarchically organize components, explicitly modeling inter‑component relations and layout distinctions through a bottom‑up cascade summarization process. Finally, we propose a question‑type‑aware retrieval strategy that supports (1) layout‑based grid partitioning for location‑based retrieval and (2) LLM‑guided pruning for semantic‑based retrieval. Experiments show MoDora outperforms baselines by 5.97%‑61.07% in accuracy. The code is at https://github.com/weAIDB/MoDora.
Authors:Roy Miles, Aysim Toker, Andreea-Maria Oncescu, Songcen Xu, Jiankang Deng, Ismail Elezi
Abstract:
Reasoning with large language models often benefits from generating multiple chains‑of‑thought, but existing aggregation strategies are typically trajectory‑level (e.g., selecting the best trace or voting on the final answer), discarding useful intermediate work from partial or "nearly correct" attempts. We propose Stitching Noisy Diffusion Thoughts, a self‑consistency framework that turns cheap diffusion‑sampled reasoning into a reusable pool of step‑level candidates. Given a problem, we (i) sample many diverse, low‑cost reasoning trajectories using a masked diffusion language model, (ii) score every intermediate step with an off‑the‑shelf process reward model (PRM), and (iii) stitch these highest‑quality steps across trajectories into a composite rationale. This rationale then conditions an autoregressive (AR) model (solver) to recompute only the final answer. This modular pipeline separates exploration (diffusion) from evaluation and solution synthesis, avoiding monolithic unified hybrids while preserving broad search. Across math reasoning benchmarks, we find that step‑level recombination is most beneficial on harder problems, and ablations highlight the importance of the final AR solver in converting stitched but imperfect rationales into accurate answers. Using low‑confidence diffusion sampling with parallel, independent rollouts, our training‑free framework improves average accuracy by up to 23.8% across six math and coding tasks. At the same time, it achieves up to a 1.8x latency reduction relative to both traditional diffusion models (e.g., Dream, LLaDA) and unified architectures (e.g., TiDAR). Code is available at https://github.com/roymiles/diffusion‑stitching.
Authors:Zhanhui Zhou, Lingjie Chen, Hanghang Tong, Dawn Song
Abstract:
Although diffusion language models (DLMs) are evolving quickly, many recent models converge on a set of shared components. These components, however, are distributed across ad‑hoc research codebases or lack transparent implementations, making them difficult to reproduce or extend. As the field accelerates, there is a clear need for a unified framework that standardizes these common components while remaining flexible enough to support new methods and architectures. To address this gap, we introduce dLLM, an open‑source framework that unifies the core components of diffusion language modeling ‑‑ training, inference, and evaluation ‑‑ and makes them easy to customize for new designs. With dLLM, users can reproduce, finetune, deploy, and evaluate open‑source large DLMs such as LLaDA and Dream through a standardized pipeline. The framework also provides minimal, reproducible recipes for building small DLMs from scratch with accessible compute, including converting any BERT‑style encoder or autoregressive LM into a DLM. We also release the checkpoints of these small DLMs to make DLMs more accessible and accelerate future research.
Authors:Zhengyang Su, Isay Katsman, Yueqi Wang, Ruining He, Lukasz Heldt, Raghunandan Keshavan, Shao-Chuan Wang, Xinyang Yi, Mingyan Gao, Onkar Dalal, Lichan Hong, Ed Chi, Ningren Han
Abstract:
Generative retrieval has emerged as a powerful paradigm for LLM‑based recommendation. However, industrial recommender systems often benefit from restricting the output space to a constrained subset of items based on business logic (e.g. enforcing content freshness or product category), which standard autoregressive decoding cannot natively support. Moreover, existing constrained decoding methods that make use of prefix trees (Tries) incur severe latency penalties on hardware accelerators (TPUs/GPUs). In this work, we introduce STATIC (Sparse Transition Matrix‑Accelerated Trie Index for Constrained Decoding), an efficient and scalable constrained decoding technique designed specifically for high‑throughput LLM‑based generative retrieval on TPUs/GPUs. By flattening the prefix tree into a static Compressed Sparse Row (CSR) matrix, we transform irregular tree traversals into fully vectorized sparse matrix operations, unlocking massive efficiency gains on hardware accelerators. We deploy STATIC on a large‑scale industrial video recommendation platform serving billions of users. STATIC produces significant product metric impact with minimal latency overhead (0.033 ms per step and 0.25% of inference time), achieving a 948x speedup over a CPU trie implementation and a 47‑1033x speedup over a hardware‑accelerated binary‑search baseline. Furthermore, the runtime overhead of STATIC remains extremely low across a wide range of practical configurations. To the best of our knowledge, STATIC enables the first production‑scale deployment of strictly constrained generative retrieval. In addition, evaluation on academic benchmarks demonstrates that STATIC can considerably improve cold‑start performance for generative retrieval. Our code is available at https://github.com/youtube/static‑constraint‑decoding.
Authors:Weida Liang, Yiyou Sun, Shuyuan Nan, Chuang Li, Dawn Song, Kenji Kawaguchi
Abstract:
Example‑based guidance is widely used to improve mathematical reasoning at inference time, yet its effectiveness is highly unstable across problems and models‑even when the guidance is correct and problem‑relevant. We show that this instability arises from a previously underexplored gap between strategy usage‑whether a reasoning strategy appears in successful solutions‑and strategy executability‑whether the strategy remains effective when instantiated as guidance for a target model. Through a controlled analysis of paired human‑written and model‑generated solutions, we identify a systematic dissociation between usage and executability: human‑ and model‑derived strategies differ in structured, domain‑dependent ways, leading to complementary strengths and consistent source‑dependent reversals under guidance. Building on this diagnosis, we propose Selective Strategy Retrieval (SSR), a test‑time framework that explicitly models executability by selectively retrieving and combining strategies using empirical, multi‑route, source‑aware signals. Across multiple mathematical reasoning benchmarks, SSR yields reliable and consistent improvements over direct solving, in‑context learning, and single‑source guidance, improving accuracy by up to +13 points on AIME25 and +5 points on Apex for compact reasoning models. Code and benchmark are publicly available at: https://github.com/lwd17/strategy‑execute‑pipeline.
Authors:Craig Myles, Patrick Schrempf, David Harris-Birtill
Abstract:
Errors in medical text can cause delays or even result in incorrect treatment for patients. Recently, language models have shown promise in their ability to automatically detect errors in medical text, an ability that has the opportunity to significantly benefit healthcare systems. In this paper, we explore the importance of prompt optimisation for small and large language models when applied to the task of error detection. We perform rigorous experiments and analysis across frontier language models and open‑source language models. We show that automatic prompt optimisation with Genetic‑Pareto (GEPA) improves error detection over the baseline accuracy performance from 0.669 to 0.785 with GPT‑5 and 0.578 to 0.690 with Qwen3‑32B, approaching the performance of medical doctors and achieving state‑of‑the‑art performance on the MEDEC benchmark dataset. Code available on GitHub: https://github.com/CraigMyles/clinical‑note‑error‑detection
Authors:Xi Ye, Wuwei Zhang, Fangcong Yin, Howard Yen, Danqi Chen
Abstract:
Understanding and reasoning over long contexts is a crucial capability for language models (LMs). Although recent models support increasingly long context windows, their accuracy often deteriorates as input length grows. In practice, models often struggle to keep attention aligned with the most relevant context throughout decoding. In this work, we propose DySCO, a novel decoding algorithm for improving long‑context reasoning. DySCO leverages retrieval heads‑‑a subset of attention heads specialized for long‑context retrieval‑‑to identify task‑relevant tokens at each decoding step and explicitly up‑weight them. By doing so, DySCO dynamically adjusts attention during generation to better utilize relevant context. The method is training‑free and can be applied directly to any off‑the‑shelf LMs. Across multiple instruction‑tuned and reasoning models, DySCO consistently improves performance on challenging long‑context reasoning benchmarks, yielding relative gains of up to 25% on MRCR and LongBenchV2 at 128K context length with modest additional compute. Further analysis highlights the importance of both dynamic attention rescaling and retrieval‑head‑guided selection for the effectiveness of the method, while providing interpretability insights into decoding‑time attention behavior. Our code is available at https://github.com/princeton‑pli/DySCO.
Authors:Lingfeng Ren, Weihao Yu, Runpeng Yu, Xinchao Wang
Abstract:
Object hallucination is a critical issue in Large Vision‑Language Models (LVLMs), where outputs include objects that do not appear in the input image. A natural question arises from this phenomenon: Which component of the LVLM pipeline primarily contributes to object hallucinations? The vision encoder to perceive visual information, or the language decoder to generate text responses? In this work, we strive to answer this question through designing a systematic experiment to analyze the roles of the vision encoder and the language decoder in hallucination generation. Our observations reveal that object hallucinations are predominantly associated with the strong priors from the language decoder. Based on this finding, we propose a simple and training‑free framework, No‑Language‑Hallucination Decoding, NoLan, which refines the output distribution by dynamically suppressing language priors, modulated based on the output distribution difference between multimodal and text‑only inputs. Experimental results demonstrate that NoLan effectively reduces object hallucinations across various LVLMs on different tasks. For instance, NoLan achieves substantial improvements on POPE, enhancing the accuracy of LLaVA‑1.5 7B and Qwen‑VL 7B by up to 6.45 and 7.21, respectively. The code is publicly available at: https://github.com/lingfengren/NoLan.
Authors:Thanmay Jayakumar, Mohammed Safi Ur Rahman Khan, Raj Dabre, Ratish Puduppully, Anoop Kunchukuttan
Abstract:
Instruction‑following benchmarks remain predominantly English‑centric, leaving a critical evaluation gap for the hundreds of millions of Indic language speakers. We introduce IndicIFEval, a benchmark evaluating constrained generation of LLMs across 14 Indic languages using automatically verifiable, rule‑based instructions. It comprises around 800 human‑verified examples per language spread across two complementary subsets: IndicIFEval‑Ground, translated prompts from IFEval (Zhou et al., 2023) carefully localized for Indic contexts, and IndicIFEval‑Ground, synthetically generated instructions grounded in native Indic content. We conduct a comprehensive evaluation of major open‑weight and proprietary models spanning both reasoning and non‑reasoning models. While models maintain strong adherence to formatting constraints, they struggle significantly with lexical and cross‑lingual tasks ‑‑ and despite progress in high‑resource languages, instruction‑following across the broader Indic family lags significantly behind English. We release IndicIFEval and its evaluation scripts to support progress on multilingual constrained generation (http://github.com/ai4bharat/IndicIFEval).
Authors:Shunsuke Ubukata
Abstract:
Chain‑of‑Thought (CoT) distillation from Large Language Models (LLMs) often induces "overthinking" in Small Language Models (SLMs), leading to performance degradation and excessive token consumption. In this study, we propose Disciplined Chain‑of‑Thought (D‑CoT), a novel framework that enforces a structured reasoning process using control tags ‑‑ such as <TEMP_LOW> for fact‑checking and <TEMP_HIGH> for multi‑perspective exploration ‑‑ as auxiliary scaffolding during training. By optimizing the CoT trajectory, D‑CoT suppresses reasoning drift and simultaneously achieves token reduction and performance improvement. We demonstrate the efficacy of our approach on Qwen3‑8B: with only 5,000 training samples, D‑CoT significantly boosts accuracy on GPQA‑diamond by 9.9% and MMLU‑Pro (0‑shot) by 9.1%, while drastically reducing computational costs. Furthermore, we confirm that the model internalizes this disciplined thought structure, maintaining high performance even without explicit control tags during inference.
Authors:Yexing Du, Youcheng Pan, Zekun Wang, Zheng Chu, Yichong Huang, Kaiyuan Liu, Bo Yang, Yang Xiang, Ming Liu, Bing Qin
Abstract:
Multimodal Large Language Models (MLLMs) have achieved notable success in enhancing translation performance by integrating multimodal information. However, existing research primarily focuses on image‑guided methods, whose applicability is constrained by the scarcity of multilingual image‑text pairs. The speech modality overcomes this limitation due to its natural alignment with text and the abundance of existing speech datasets, which enable scalable language coverage. In this paper, we propose a Speech‑guided Machine Translation (SMT) framework that integrates speech and text as fused inputs into an MLLM to improve translation quality. To mitigate reliance on low‑resource data, we introduce a Self‑Evolution Mechanism. The core components of this framework include a text‑to‑speech model, responsible for generating synthetic speech, and an MLLM capable of classifying synthetic speech samples and iteratively optimizing itself using positive samples. Experimental results demonstrate that our framework surpasses all existing methods on the Multi30K multimodal machine translation benchmark, achieving new state‑of‑the‑art results. Furthermore, on general machine translation datasets, particularly the FLORES‑200, it achieves average state‑of‑the‑art performance in 108 translation directions. Ablation studies on CoVoST‑2 confirms that differences between synthetic and authentic speech have negligible impact on translation quality. The code and models are released at https://github.com/yxduir/LLM‑SRT.
Authors:Ningyuan Yang, Weihua Du, Weiwei Sun, Sean Welleck, Yiming Yang
Abstract:
Reinforcement learning (RL) has become a central post‑training paradigm for large language models (LLMs), but its performance is highly sensitive to the quality of training problems. This sensitivity stems from the non‑stationarity of RL: rollouts are generated by an evolving policy, and learning is shaped by exploration and reward feedback, unlike supervised fine‑tuning (SFT) with fixed trajectories. As a result, prior work often relies on manual curation or simple heuristic filters (e.g., accuracy), which can admit incorrect or low‑utility problems. We propose GradAlign, a gradient‑aligned data selection method for LLM reinforcement learning that uses a small, trusted validation set to prioritize training problems whose policy gradients align with validation gradients, yielding an adaptive curriculum. We evaluate GradAlign across three challenging data regimes: unreliable reward signals, distribution imbalance, and low‑utility training corpus, showing that GradAlign consistently outperforms existing baselines, underscoring the importance of directional gradient signals in navigating non‑stationary policy optimization and yielding more stable training and improved final performance. We release our implementation at https://github.com/StigLidu/GradAlign
Authors:Sofoklis Kakouros, Fang Kang, Haoyu Chen
Abstract:
This work presents iMiGUE‑Speech, an extension of the iMiGUE dataset that provides a spontaneous affective corpus for studying emotional and affective states. The new release focuses on speech and enriches the original dataset with additional metadata, including speech transcripts, speaker‑role separation between interviewer and interviewee, and word‑level forced alignments. Unlike existing emotional speech datasets that rely on acted or laboratory‑elicited emotions, iMiGUE‑Speech captures spontaneous affect arising naturally from real match outcomes. To demonstrate the utility of the dataset and establish initial benchmarks, we introduce two evaluation tasks for comparative assessment: speech emotion recognition and transcript‑based sentiment analysis. These tasks leverage state‑of‑the‑art pre‑trained representations to assess the dataset's ability to capture spontaneous affective states from both acoustic and linguistic modalities. iMiGUE‑Speech can also be synchronously paired with micro‑gesture annotations from the original iMiGUE dataset, forming a uniquely multimodal resource for studying speech‑gesture affective dynamics. The extended dataset is available at https://github.com/CV‑AC/imigue‑speech.
Authors:Xiaoke Huang, Bhavul Gauri, Kam Woh Ng, Tony Ng, Mengmeng Xu, Zhiheng Liu, Weiming Ren, Zhaochong An, Zijian Zhou, Haonan Qiu, Yuyin Zhou, Sen He, Ziheng Wang, Tao Xiang, Xiao Han
Abstract:
Vector glyphs are the atomic units of digital typography, yet most learning‑based pipelines still depend on carefully curated exemplar sheets and raster‑to‑vector postprocessing, which limits accessibility and editability. We introduce VecGlypher, a single multimodal language model that generates high‑fidelity vector glyphs directly from text descriptions or image exemplars. Given a style prompt, optional reference glyph images, and a target character, VecGlypher autoregressively emits SVG path tokens, avoiding raster intermediates and producing editable, watertight outlines in one pass. A typography‑aware data and training recipe makes this possible: (i) a large‑scale continuation stage on 39K noisy Envato fonts to master SVG syntax and long‑horizon geometry, followed by (ii) post‑training on 2.5K expert‑annotated Google Fonts with descriptive tags and exemplars to align language and imagery with geometry; preprocessing normalizes coordinate frames, canonicalizes paths, de‑duplicates families, and quantizes coordinates for stable long‑sequence decoding. On cross‑family OOD evaluation, VecGlypher substantially outperforms both general‑purpose LLMs and specialized vector‑font baselines for text‑only generation, while image‑referenced generation reaches a state‑of‑the‑art performance, with marked gains over DeepVecFont‑v2 and DualVector. Ablations show that model scale and the two‑stage recipe are critical and that absolute‑coordinate serialization yields the best geometry. VecGlypher lowers the barrier to font creation by letting users design with words or exemplars, and provides a scalable foundation for future multimodal design tools.
Authors:Subhadip Mitra
Abstract:
We present a memory system for AI agents that treats stored information as continuous fields governed by partial differential equations rather than discrete entries in a database. The approach draws from classical field theory: memories diffuse through semantic space, decay thermodynamically based on importance, and interact through field coupling in multi‑agent scenarios. We evaluate the system on two established long‑context benchmarks: LoCoMo (ACL 2024) with 300‑turn conversations across 35 sessions, and LongMemEval (ICLR 2025) testing multi‑session reasoning over 500+ turns. On LongMemEval, the field‑theoretic approach achieves significant improvements: +116% F1 on multi‑session reasoning (p<0.01, d= 3.06), +43.8% on temporal reasoning (p<0.001, d= 9.21), and +27.8% retrieval recall on knowledge updates (p<0.001, d= 5.00). Multi‑agent experiments show near‑perfect collective intelligence (>99.8%) through field coupling. Code is available at github.com/rotalabs/rotalabs‑fieldmem.
Authors:Tony Feng, Junehyuk Jung, Sang-hyun Kim, Carlo Pagano, Sergei Gukov, Chiang-Chiang Tsai, David Woodruff, Adel Javanmard, Aryan Mokhtari, Dawsen Hwang, Yuri Chervonyi, Jonathan N. Lee, Garrett Bingham, Trieu H. Trinh, Vahab Mirrokni, Quoc V. Le, Thang Luong
Abstract:
We report the performance of Aletheia (Feng et al., 2026b), a mathematics research agent powered by Gemini 3 Deep Think, on the inaugural FirstProof challenge. Within the allowed timeframe of the challenge, Aletheia autonomously solved 6 problems (2, 5, 7, 8, 9, 10) out of 10 according to majority expert assessments; we note that experts were not unanimous on Problem 8 (only). For full transparency, we explain our interpretation of FirstProof and disclose details about our experiments as well as our evaluation. Raw prompts and outputs are available at https://github.com/google‑deepmind/superhuman/tree/main/aletheia.
Authors:Peter Hase, Christopher Potts
Abstract:
Inspecting Chain‑of‑Thought reasoning is among the most common means of understanding why an LLM produced its output. But well‑known problems with CoT faithfulness severely limit what insights can be gained from this practice. In this paper, we introduce a training method called Counterfactual Simulation Training (CST), which aims to improve CoT faithfulness by rewarding CoTs that enable a simulator to accurately predict a model's outputs over counterfactual inputs. We apply CST in two settings: (1) CoT monitoring with cue‑based counterfactuals, to detect when models rely on spurious features, reward hack, or are sycophantic, and (2) counterfactual simulation over generic model‑based counterfactuals, to encourage models to produce more faithful, generalizable reasoning in the CoT. Experiments with models up to 235B parameters show that CST can substantially improve monitor accuracy on cue‑based counterfactuals (by 35 accuracy points) as well as simulatability over generic counterfactuals (by 2 points). We further show that: (1) CST outperforms prompting baselines, (2) rewriting unfaithful CoTs with an LLM is 5x more efficient than RL alone, (3) faithfulness improvements do not generalize to dissuading cues (as opposed to persuading cues), and (4) larger models do not show more faithful CoT out of the box, but they do benefit more from CST. These results suggest that CST can improve CoT faithfulness in general, with promising applications for CoT monitoring. Code for experiments in this paper is available at https://github.com/peterbhase/counterfactual‑simulation‑training
Authors:Taha Koleilat, Hojat Asgariandehkordi, Omid Nejati Manzari, Berardino Barile, Yiming Xiao, Hassan Rivaz
Abstract:
Medical image segmentation remains challenging due to limited annotations for training, ambiguous anatomical features, and domain shifts. While vision‑language models such as CLIP offer strong cross‑modal representations, their potential for dense, text‑guided medical image segmentation remains underexplored. We present MedCLIPSeg, a novel framework that adapts CLIP for robust, data‑efficient, and uncertainty‑aware medical image segmentation. Our approach leverages patch‑level CLIP embeddings through probabilistic cross‑modal attention, enabling bidirectional interaction between image and text tokens and explicit modeling of predictive uncertainty. Together with a soft patch‑level contrastive loss that encourages more nuanced semantic learning across diverse textual prompts, MedCLIPSeg effectively improves data efficiency and domain generalizability. Extensive experiments across 16 datasets spanning five imaging modalities and six organs demonstrate that MedCLIPSeg outperforms prior methods in accuracy, efficiency, and robustness, while providing interpretable uncertainty maps that highlight local reliability of segmentation results. This work demonstrates the potential of probabilistic vision‑language modeling for text‑driven medical image segmentation.
Authors:Lingwei Gu, Nour Jedidi, Jimmy Lin
Abstract:
How do large language models (LLMs) know what they know? Answering this question has been difficult because pre‑training data is often a "black box" ‑‑ unknown or inaccessible. The recent release of nanochat ‑‑ a family of small LLMs with fully open pre‑training data ‑‑ addresses this as it provides a transparent view into where a model's parametric knowledge comes from. Towards the goal of understanding how knowledge is encoded by LLMs, we release NanoKnow, a benchmark dataset that partitions questions from Natural Questions and SQuAD into splits based on whether their answers are present in nanochat's pre‑training corpus. Using these splits, we can now properly disentangle the sources of knowledge that LLMs rely on when producing an output. To demonstrate NanoKnow's utility, we conduct experiments using eight nanochat checkpoints. Our findings show: (1) closed‑book accuracy is strongly influenced by answer frequency in the pre‑training data, (2) providing external evidence can mitigate this frequency dependence, (3) even with external evidence, models are more accurate when answers were seen during pre‑training, demonstrating that parametric and external knowledge are complementary, and (4) non‑relevant information is harmful, with accuracy decreasing based on both the position and the number of non‑relevant contexts. We release all NanoKnow artifacts at https://github.com/castorini/NanoKnow.
Authors:Zhongwei Wan, Yun Shen, Zhihao Dou, Donghao Zhou, Yu Zhang, Xin Wang, Hui Shen, Jing Xiong, Chaofan Tao, Zixuan Zhong, Peizhou Huang, Mi Zhang
Abstract:
Reinforcement learning with verifiers (RLVR) is a central paradigm for improving large language model (LLM) reasoning, yet existing methods often suffer from limited exploration. Policies tend to collapse onto a few reasoning patterns and prematurely stop deep exploration, while conventional entropy regularization introduces only local stochasticity and fails to induce meaningful path‑level diversity, leading to weak and unstable learning signals in group‑based policy optimization. We propose DSDR, a Dual‑Scale Diversity Regularization reinforcement learning framework that decomposes diversity in LLM reasoning into global and coupling components. Globally, DSDR promotes diversity among correct reasoning trajectories to explore distinct solution modes. Locally, it applies a length‑invariant, token‑level entropy regularization restricted to correct trajectories, preventing entropy collapse within each mode while preserving correctness. The two scales are coupled through a global‑to‑local allocation mechanism that emphasizes local regularization for more distinctive correct trajectories. We provide theoretical support showing that DSDR preserves optimal correctness under bounded regularization, sustains informative learning signals in group‑based optimization, and yields a principled global‑to‑local coupling rule. Experiments on multiple reasoning benchmarks demonstrate consistent improvements in accuracy and pass@k, highlighting the importance of dual‑scale diversity for deep exploration in RLVR. Code is available at https://github.com/SUSTechBruce/DSDR.
Authors:Daham Mustafa, Diego Collarana, Yixin Peng, Rafiqul Haque, Christoph Lange-Bever, Christoph Quix, Stephan Decker
Abstract:
ODRL's six set‑based operators ‑‑ isA, isPartOf, hasPart, isAnyOf, isAllOf, isNoneOf ‑‑ depend on external domain knowledge that the W3C specification leaves unspecified. Without it, every cross‑dataspace policy comparison defaults to Unknown. We present a denotational semantics that maps each ODRL constraint to the set of knowledge‑base concepts satisfying it. Conflict detection reduces to denotation intersection under a three‑valued verdict ‑‑ Conflict, Compatible, or Unknown ‑‑ that is sound under incomplete knowledge. The framework covers all three ODRL composition modes (and, or, xone) and all three semantic domains arising in practice: taxonomic (class subsumption), mereological (part‑whole containment), and nominal (identity). For cross‑dataspace interoperability, we define order‑preserving alignments between knowledge bases and prove two guarantees: conflicts are preserved across different KB standards, and unmapped concepts degrade gracefully to Unknown ‑‑ never to false conflicts. A runtime soundness theorem ensures that design‑time verdicts hold for all execution contexts. The encoding stays within the decidable EPR fragment of first‑order logic. We validate it with 154 benchmarks across six knowledge base families (GeoNames, ISO 3166, W3C DPV, a GDPR‑derived taxonomy, BCP 47, and ISO 639‑3) and four structural KBs targeting adversarial edge cases. Both the Vampire theorem prover and the Z3 SMT solver agree on all 154 verdicts. A key finding is that exclusive composition (xone) requires strictly stronger KB axioms than conjunction or disjunction: open‑world semantics blocks exclusivity even when positive evidence appears to satisfy exactly one branch.
Authors:Deborah N. Jakobi, David R. Reich, Paul Prasse, Jana M. Hofmann, Lena S. Bolliger, Lena A. Jäger
Abstract:
Eye‑tracking‑while‑reading corpora are a valuable resource for many different disciplines and use cases. Use cases range from studying the cognitive processes underlying reading to machine‑learning‑based applications, such as gaze‑based assessments of reading comprehension. The past decades have seen an increase in the number and size of eye‑tracking‑while‑reading datasets as well as increasing diversity with regard to the stimulus languages covered, the linguistic background of the participants, or accompanying psychometric or demographic data. The spread of data across different disciplines and the lack of data sharing standards across the communities lead to many existing datasets that cannot be easily reused due to a lack of interoperability. In this work, we aim at creating more transparency and clarity with regards to existing datasets and their features across different disciplines by i) presenting an extensive overview of existing datasets, ii) simplifying the sharing of newly created datasets by publishing a living overview online, https://dili‑lab.github.io/datasets.html, presenting over 45 features for each dataset, and iii) integrating all publicly available datasets into the Python package pymovements which offers an eye‑tracking datasets library. By doing so, we aim to strengthen the FAIR principles in eye‑tracking‑while‑reading research and promote good scientific practices, such as reproducing and replicating studies.
Authors:Chongyang Gao, Diji Yang, Shuyan Zhou, Xichen Yan, Luchuan Song, Shuo Li, Kezhen Chen
Abstract:
We introduce CFE‑Bench (Classroom Final Exam), a multimodal benchmark for evaluating the reasoning capabilities of large language models across more than 20 STEM domains. CFE‑Bench is curated from repeatedly used, authentic university homework and exam problems, paired with reference solutions provided by course instructors. CFE‑Bench remains challenging for frontier models: the newly released Gemini‑3.1‑pro‑preview achieves 59.69% overall accuracy, while the second‑best model, Gemini‑3‑flash‑preview, reaches 55.46%, leaving substantial room for improvement. Beyond aggregate scores, we conduct a diagnostic analysis by decomposing instructor reference solutions into structured reasoning flows. We find that while frontier models often answer intermediate sub‑questions correctly, they struggle to reliably derive and maintain correct intermediate states throughout multi‑step solutions. We further observe that model‑generated solutions typically contain more reasoning steps than instructor solutions, indicating lower step efficiency and a higher risk of error accumulation. Data and code are available at https://github.com/Analogy‑AI/CFE_Bench.
Authors:Sherzod Hakimov
Abstract:
Natural language processing for the Turkic language family, spoken by over 200 million people across Eurasia, remains fragmented, with most languages lacking unified tooling and resources. We present TurkicNLP, an open‑source Python library providing a single, consistent NLP pipeline for Turkic languages across four script families: Latin, Cyrillic, Perso‑Arabic, and Old Turkic Runic. The library covers tokenization, morphological analysis, part‑of‑speech tagging, dependency parsing, named entity recognition, bidirectional script transliteration, cross‑lingual sentence embeddings, and machine translation through one language‑agnostic API. A modular multi‑backend architecture integrates rule‑based finite‑state transducers and neural models transparently, with automatic script detection and routing between script variants. Outputs follow the CoNLL‑U standard for full interoperability and extension. Code and documentation are hosted at https://github.com/turkic‑nlp/turkicnlp .
Authors:Qijie You, Wenkai Yu, Wentao Zhang
Abstract:
With the rapid advancement of agent‑based methods in recent years, Agentic RAG has undoubtedly become an important research direction. Multi‑hop reasoning, which requires models to engage in deliberate thinking and multi‑step interaction, serves as a critical testbed for assessing such capabilities. However, existing benchmarks typically provide only final questions and answers, while lacking the intermediate hop‑level questions that gradually connect atomic questions to the final multi‑hop query. This limitation prevents researchers from analyzing at which step an agent fails and restricts more fine‑grained evaluation of model capabilities. Moreover, most current benchmarks are manually constructed, which is both time‑consuming and labor‑intensive, while also limiting scalability and generalization. To address these challenges, we introduce AgenticRAGTracer, the first Agentic RAG benchmark that is primarily constructed automatically by large language models and designed to support step‑by‑step validation. Our benchmark spans multiple domains, contains 1,305 data points, and has no overlap with existing mainstream benchmarks. Extensive experiments demonstrate that even the best large language models perform poorly on our dataset. For instance, GPT‑5 attains merely 22.6% EM accuracy on the hardest portion of our dataset. Hop‑aware diagnosis reveals that failures are primarily driven by distorted reasoning chains ‑‑ either collapsing prematurely or wandering into over‑extension. This highlights a critical inability to allocate steps consistent with the task's logical structure, providing a diagnostic dimension missing in traditional evaluations. We believe our work will facilitate research in Agentic RAG and inspire further meaningful progress in this area. Our code and data are available at https://github.com/YqjMartin/AgenticRAGTracer.
Authors:Chenhang Cui, An Zhang, Yuxin Chen, Gelei Deng, Jingnan Zheng, Zhenkai Liang, Xiang Wang, Tat-Seng Chua
Abstract:
Large vision‑language models (LVLMs) have rapidly advanced across various domains, yet they still lag behind strong text‑only large language models (LLMs) on tasks that require multi‑step inference and compositional decision‑making. Motivated by their shared transformer architectures, we investigate whether the two model families rely on common internal computation for such inference. At the neuron level, we uncover a surprisingly large overlap: more than half of the top‑activated units during multi‑step inference are shared between representative LLMs and LVLMs, revealing a modality‑invariant inference subspace. Through causal probing via activation amplification, we further show that these shared neurons encode consistent and interpretable concept‑level effects, demonstrating their functional contribution to inference. Building on this insight, we propose Shared Neuron Low‑Rank Fusion (SNRF), a parameter‑efficient framework that transfers mature inference circuitry from LLMs to LVLMs. SNRF profiles cross‑model activations to identify shared neurons, computes a low‑rank approximation of inter‑model weight differences, and injects these updates selectively within the shared‑neuron subspace. This mechanism strengthens multimodal inference performance with minimal parameter changes and requires no large‑scale multimodal fine‑tuning. Across diverse mathematics and perception benchmarks, SNRF consistently enhances LVLM inference performance while preserving perceptual capabilities. Our results demonstrate that shared neurons form an interpretable bridge between LLMs and LVLMs, enabling low‑cost transfer of inference ability into multimodal models. Our code is available at [https://github.com/chenhangcuisg‑code/Do‑LLMs‑VLMs‑Share‑Neurons](https://github.com/chenhangcuisg‑code/Do‑LLMs‑VLMs‑Share‑Neurons).
Authors:Yinhan He, Yaochen Zhu, Mingjia Shi, Wendy Zheng, Lin Su, Xiaoqing Wang, Qi Guo, Jundong Li
Abstract:
Large language models increasingly rely on long chains of thought to improve accuracy, yet such gains come with substantial inference‑time costs. We revisit token‑efficient post‑training and argue that existing sequence‑level reward‑shaping methods offer limited control over how reasoning effort is allocated across tokens. To bridge the gap, we propose IAPO, an information‑theoretic post‑training framework that assigns token‑wise advantages based on each token's conditional mutual information (MI) with the final answer. This yields an explicit, principled mechanism for identifying informative reasoning steps and suppressing low‑utility exploration. We provide a theoretical analysis showing that our IAPO can induce monotonic reductions in reasoning verbosity without harming correctness. Empirically, IAPO consistently improves reasoning accuracy while reducing reasoning length by up to 36%, outperforming existing token‑efficient RL methods across various reasoning datasets. Extensive empirical evaluations demonstrate that information‑aware advantage shaping is a powerful and general direction for token‑efficient post‑training. The code is available at https://github.com/YinhanHe123/IAPO.
Authors:Xiaochuan Li, Ryan Ming, Pranav Setlur, Abhijay Paladugu, Andy Tang, Hao Kang, Shuai Shao, Rong Jin, Chenyan Xiong
Abstract:
LLM agents are increasingly expected to function as general‑purpose systems capable of resolving open‑ended user requests. While existing benchmarks focus on domain‑aware environments for developing specialized agents, evaluating general‑purpose agents requires more realistic settings that challenge them to operate across multiple skills and tools within a unified environment. We introduce General AgentBench, a benchmark that provides such a unified framework for evaluating general LLM agents across search, coding, reasoning, and tool‑use domains. Using General AgentBench, we systematically study test‑time scaling behaviors under sequential scaling (iterative interaction) and parallel scaling (sampling multiple trajectories). Evaluation of ten leading LLM agents reveals a substantial performance degradation when moving from domain‑specific evaluations to this general‑agent setting. Moreover, we find that neither scaling methodology yields effective performance improvements in practice, due to two fundamental limitations: context ceiling in sequential scaling and verification gap in parallel scaling. Code is publicly available at https://github.com/cxcscmu/General‑AgentBench.
Authors:Toheeb Aduramomi Jimoh, Tabea De Wille, Nikola S. Nikolov
Abstract:
Sarcasm detection poses a fundamental challenge in computational semantics, requiring models to resolve disparities between literal and intended meaning. The challenge is amplified in low‑resource languages where annotated datasets are scarce or nonexistent. We present Yor‑Sarc, the first gold‑standard dataset for sarcasm detection in Yorùbá, a tonal Niger‑Congo language spoken by over 50 million people. The dataset comprises 436 instances annotated by three native speakers from diverse dialectal backgrounds using an annotation protocol specifically designed for Yorùbá sarcasm by taking culture into account. This protocol incorporates context‑sensitive interpretation and community‑informed guidelines and is accompanied by a comprehensive analysis of inter‑annotator agreement to support replication in other African languages. Substantial to almost perfect agreement was achieved (Fleiss' κ= 0.7660; pairwise Cohen's κ= 0.6732‑‑0.8743), with 83.3% unanimous consensus. One annotator pair achieved almost perfect agreement (κ= 0.8743; 93.8% raw agreement), exceeding a number of reported benchmarks for English sarcasm research works. The remaining 16.7% majority‑agreement cases are preserved as soft labels for uncertainty‑aware modelling. Yor‑Sarc\footnotehttps://github.com/toheebadura/yor‑sarc is expected to facilitate research on semantic interpretation and culturally informed NLP for low‑resource African languages.
Authors:Tianyu Fan, Fengji Zhang, Yuxiang Zheng, Bei Chen, Xinyao Niu, Chengen Huang, Junyang Lin, Chao Huang
Abstract:
The application of Large Language Models (LLMs) in accelerating scientific discovery has garnered increasing attention, with a key focus on constructing research agents endowed with innovative capability, i.e., the ability to autonomously generate novel and significant research ideas. Existing approaches predominantly rely on sophisticated prompt engineering and lack a systematic training paradigm. To address this, we propose DeepInnovator, a training framework designed to trigger the innovative capability of LLMs. Our approach comprises two core components. (1) ``Standing on the shoulders of giants''. We construct an automated data extraction pipeline to extract and organize structured research knowledge from a vast corpus of unlabeled scientific literature. (2) ``Conjectures and refutations''. We introduce a ``Next Idea Prediction'' training paradigm, which models the generation of research ideas as an iterative process of continuously predicting, evaluating, and refining plausible and novel next idea. Both automatic and expert evaluations demonstrate that our DeepInnovator‑14B significantly outperforms untrained baselines, achieving win rates of 80.53%‑93.81%, and attains performance comparable to that of current leading LLMs. This work provides a scalable training pathway toward building research agents with genuine, originative innovative capability, and will open‑source the dataset to foster community advancement. Source code and data are available at: https://github.com/HKUDS/DeepInnovator.
Authors:Kwanghee Choi, Eunjung Yeo, Cheol Jun Cho, David Harwath, David R. Mortensen
Abstract:
Self‑supervised speech models (S3Ms) are known to encode rich phonetic information, yet how this information is structured remains underexplored. We conduct a comprehensive study across 96 languages to analyze the underlying structure of S3M representations, with particular attention to phonological vectors. We first show that there exist linear directions within the model's representation space that correspond to phonological features. We further demonstrate that the scale of these phonological vectors correlate to the degree of acoustic realization of their corresponding phonological features in a continuous manner. For example, the difference between [d] and [t] yields a voicing vector: adding this vector to [p] produces [b], while scaling it results in a continuum of voicing. Together, these findings indicate that S3Ms encode speech using phonologically interpretable and compositional vectors, demonstrating phonological vector arithmetic. All code and interactive demos are available at https://github.com/juice500ml/phonetic‑arithmetic .
Authors:Adam Dejl, Jonathan Pearson
Abstract:
Robust and comprehensive evaluation of large language models (LLMs) is essential for identifying effective LLM system configurations and mitigating risks associated with deploying LLMs in sensitive domains. However, traditional statistical metrics are poorly suited to open‑ended generation tasks, leading to growing reliance on LLM‑based evaluation methods. These methods, while often more flexible, introduce additional complexity: they depend on carefully chosen models, prompts, parameters, and evaluation strategies, making the evaluation process prone to misconfiguration and bias. In this work, we present EvalSense, a flexible, extensible framework for constructing domain‑specific evaluation suites for LLMs. EvalSense provides out‑of‑the‑box support for a broad range of model providers and evaluation strategies, and assists users in selecting and deploying suitable evaluation methods for their specific use‑cases. This is achieved through two unique components: (1) an interactive guide aiding users in evaluation method selection and (2) automated meta‑evaluation tools that assess the reliability of different evaluation approaches using perturbed data. We demonstrate the effectiveness of EvalSense in a case study involving the generation of clinical notes from unstructured doctor‑patient dialogues, using a popular open dataset. All code, documentation, and assets associated with EvalSense are open‑source and publicly available at https://github.com/nhsengland/evalsense.
Authors:Xiaoyan Bai, Alexander Baumgartner, Haojia Sun, Ari Holtzman, Chenhao Tan
Abstract:
Reproducibility crises across sciences highlight the limitations of the paper‑centric review system in assessing the rigor and reproducibility of research. AI agents that autonomously design and generate large volumes of research outputs exacerbate these challenges. In this work, we address the growing challenges of scalability and rigor by flipping the dynamic and developing AI agents as research evaluators. We propose the first execution‑grounded evaluation framework that verifies research beyond narrative review by examining code and data alongside the paper. We use mechanistic interpretability research as a testbed, build standardized research output, and develop MechEvalAgent, an automated evaluation framework that assesses the coherence of the experimental process, the reproducibility of results, and the generalizability of findings. We show that our framework achieves above 80% agreement with human judges, identifies substantial methodological problems, and surfaces 51 additional issues that human reviewers miss. Our work demonstrates the potential of AI agents to transform research evaluation and pave the way for rigorous scientific practices.
Authors:Jiamin Yao, Eren Gultepe
Abstract:
This study presents an ensemble technique, SPQ (SVD‑Pruning‑Quantization), for large language model (LLM) compression that combines variance‑retained singular value decomposition (SVD), activation‑based pruning, and post‑training linear quantization. Each component targets a different source of inefficiency: i) pruning removes redundant neurons in MLP layers, ii) SVD reduces attention projections into compact low‑rank factors, iii) and 8‑bit quantization uniformly compresses all linear layers. At matched compression ratios, SPQ outperforms individual methods (SVD‑only, pruning‑only, or quantization‑only) in perplexity, demonstrating the benefit of combining complementary techniques. Applied to LLaMA‑2‑7B, SPQ achieves up to 75% memory reduction while maintaining or improving perplexity (e.g., WikiText‑2 5.47 to 4.91) and preserving accuracy on downstream benchmarks such as C4, TruthfulQA, and GSM8K. Compared to strong baselines like GPTQ and SparseGPT, SPQ offers competitive perplexity and accuracy while using less memory (6.86 GB vs. 7.16 GB for GPTQ). Moreover, SPQ improves inference throughput over GPTQ, achieving up to a 1.9x speedup, which further enhances its practicality for real‑world deployment. The effectiveness of SPQ's robust compression through layer‑aware and complementary compression techniques may provide practical deployment of LLMs in memory‑constrained environments. Code is available at: https://github.com/JiaminYao/SPQ_LLM_Compression/
Authors:Yutong Xin, Qiaochu Chen, Greg Durrett, Işil Dillig
Abstract:
Large language models have achieved striking results in interactive theorem proving, particularly in Lean. However, most benchmarks for LLM‑based proof automation are drawn from mathematics in the Mathlib ecosystem, whereas proofs in software verification are developed inside definition‑rich codebases with substantial project‑specific libraries. We introduce VeriSoftBench, a benchmark of 500 Lean 4 proof obligations drawn from open‑source formal‑methods developments and packaged to preserve realistic repository context and cross‑file dependencies. Our evaluation of frontier LLMs and specialized provers yields three observations. First, provers tuned for Mathlib‑style mathematics transfer poorly to this repository‑centric setting. Second, success is strongly correlated with transitive repository dependence: tasks whose proofs draw on large, multi‑hop dependency closures are less likely to be solved. Third, providing curated context restricted to a proof's dependency closure improves performance relative to exposing the full repository, but nevertheless leaves substantial room for improvement. Our benchmark and evaluation suite are released at https://github.com/utopia‑group/VeriSoftBench.
Authors:Aaron Louis Eidt, Nils Feldhus
Abstract:
While mechanistic interpretability has developed powerful tools to analyze the internal workings of Large Language Models (LLMs), their complexity has created an accessibility gap, limiting their use to specialists. We address this challenge by designing, building, and evaluating ELIA (Explainable Language Interpretability Analysis), an interactive web application that simplifies the outcomes of various language model component analyses for a broader audience. The system integrates three key techniques ‑‑ Attribution Analysis, Function Vector Analysis, and Circuit Tracing ‑‑ and introduces a novel methodology: using a vision‑language model to automatically generate natural language explanations (NLEs) for the complex visualizations produced by these methods. The effectiveness of this approach was empirically validated through a mixed‑methods user study, which revealed a clear preference for interactive, explorable interfaces over simpler, static visualizations. A key finding was that the AI‑powered explanations helped bridge the knowledge gap for non‑experts; a statistical analysis showed no significant correlation between a user's prior LLM experience and their comprehension scores, suggesting that the system reduced barriers to comprehension across experience levels. We conclude that an AI system can indeed simplify complex model analyses, but its true power is unlocked when paired with thoughtful, user‑centered design that prioritizes interactivity, specificity, and narrative guidance.
Authors:Lexiang Tang, Weihao Gao, Bingchen Zhao, Lu Ma, Qiao jin, Bang Yang, Yuexian Zou
Abstract:
Recent work on test‑time scaling for large language model (LLM) reasoning typically assumes that allocating more inference‑time computation uniformly improves correctness. However, prior studies show that reasoning uncertainty is highly localized: a small subset of low‑confidence tokens disproportionately contributes to reasoning errors and unnecessary output expansion. Motivated by this observation, we propose Thinking by Subtraction, a confidence‑driven contrastive decoding approach that improves reasoning reliability through targeted token‑level intervention. Our method, Confidence‑Driven Contrastive Decoding, detects low‑confidence tokens during decoding and intervenes selectively at these positions. It constructs a contrastive reference by replacing high‑confidence tokens with minimal placeholders, and refines predictions by subtracting this reference distribution at low‑confidence locations. Experiments show that CCD significantly improves accuracy across mathematical reasoning benchmarks while substantially reducing output length, with minimal KV‑cache overhead. As a training‑free method, CCD enhances reasoning reliability through targeted low‑confidence intervention without computational redundancy. Our code will be made available at: https://github.com/bolo‑web/CCD.
Authors:Kaisen Yang, Jayden Teoh, Kaicheng Yang, Yitong Zhang, Alex Lamb
Abstract:
Masked Diffusion Models (MDMs) offer greater flexibility in decoding order than autoregressive models but require careful planning to achieve high‑quality generation. Existing samplers typically adopt greedy heuristics, prioritizing positions with the highest local certainty to decode at each step. Through failure case analysis, we identify a fundamental limitation of this approach: it neglects the downstream impact of current decoding choices on subsequent steps and fails to minimize cumulative uncertainty. In particular, these methods do not fully exploit the non‑causal nature of MDMs, which enables evaluating how a decoding decision reshapes token probabilities/uncertainty across all remaining masked positions. To bridge this gap, we propose the Info‑Gain Sampler, a principled decoding framework that balances immediate uncertainty with information gain over future masked tokens. Extensive evaluations across diverse architectures and tasks (reasoning, coding, creative writing, and image generation) demonstrate that Info‑Gain Sampler consistently outperforms existing samplers for MDMs. For instance, it achieves a 3.6% improvement in average accuracy on reasoning tasks and a 63.1% win‑rate in creative writing. Notably, on reasoning tasks it reduces cumulative uncertainty from 78.4 to 48.6, outperforming the best baseline by a large margin. The code will be available at https://github.com/yks23/Information‑Gain‑Sampler.
Authors:Aidar Myrzakhan, Tianyi Li, Bowei Guo, Shengkun Tang, Zhiqiang Shen
Abstract:
Diffusion Language Models (DLMs) incur high inference cost due to iterative denoising, motivating efficient pruning. Existing pruning heuristics largely inherited from autoregressive (AR) LLMs, typically preserve attention sink tokens because AR sinks serve as stable global anchors. We show that this assumption does not hold for DLMs: the attention‑sink position exhibits substantially higher variance over the full generation trajectory (measured by how the dominant sink locations shift across timesteps), indicating that sinks are often transient and less structurally essential than in AR models. Based on this observation, we propose \bf \textttSink‑Aware Pruning, which automatically identifies and prunes unstable sinks in DLMs (prior studies usually keep sinks for AR LLMs). Without retraining, our method achieves a better quality‑efficiency trade‑off and outperforms strong prior pruning baselines under matched compute. Our code is available at https://github.com/VILA‑Lab/Sink‑Aware‑Pruning.
Authors:Juri Opitz, Corina Raclé, Emanuela Boros, Andrianos Michail, Matteo Romanello, Maud Ehrmann, Simon Clematide
Abstract:
HIPE‑2026 is a CLEF evaluation lab dedicated to person‑place relation extraction from noisy, multilingual historical texts. Building on the HIPE‑2020 and HIPE‑2022 campaigns, it extends the series toward semantic relation extraction by targeting the task of identifying person‑‑place associations in multiple languages and time periods. Systems are asked to classify relations of two types ‑ at ("Has the person ever been at this place?") and isAt ("Is the person located at this place around publication time?") ‑ requiring reasoning over temporal and geographical cues. The lab introduces a three‑fold evaluation profile that jointly assesses accuracy, computational efficiency, and domain generalization. By linking relation extraction to large‑scale historical data processing, HIPE‑2026 aims to support downstream applications in knowledge‑graph construction, historical biography reconstruction, and spatial analysis in digital humanities.
Authors:Xiaohan Zhao, Zhaoyi Li, Yaxin Luo, Jiacheng Cui, Zhiqiang Shen
Abstract:
Black‑box adversarial attacks on Large Vision‑Language Models (LVLMs) are challenging due to missing gradients and complex multimodal boundaries. While prior state‑of‑the‑art transfer‑based approaches like M‑Attack perform well using local crop‑level matching between source and target images, we find this induces high‑variance, nearly orthogonal gradients across iterations, violating coherent local alignment and destabilizing optimization. We attribute this to (i) ViT translation sensitivity that yields spike‑like gradients and (ii) structural asymmetry between source and target crops. We reformulate local matching as an asymmetric expectation over source transformations and target semantics, and build a gradient‑denoising upgrade to M‑Attack. On the source side, Multi‑Crop Alignment (MCA) averages gradients from multiple independently sampled local views per iteration to reduce variance. On the target side, Auxiliary Target Alignment (ATA) replaces aggressive target augmentation with a small auxiliary set from a semantically correlated distribution, producing a smoother, lower‑variance target manifold. We further reinterpret momentum as Patch Momentum, replaying historical crop gradients; combined with a refined patch‑size ensemble (PE+), this strengthens transferable directions. Together these modules form M‑Attack‑V2, a simple, modular enhancement over M‑Attack that substantially improves transfer‑based black‑box attacks on frontier LVLMs: boosting success rates on Claude‑4.0 from 8% to 30%, Gemini‑2.5‑Pro from 83% to 97%, and GPT‑5 from 98% to 100%, outperforming prior black‑box LVLM attacks. Code and data are publicly available at: https://github.com/vila‑lab/M‑Attack‑V2.
Authors:Peter Balogh
Abstract:
Some transformer attention heads appear to function as membership testers, dedicating themselves to answering the question "has this token appeared before in the context?" We identify these heads across four language models (GPT‑2 small, medium, and large; Pythia‑160M) and show that they form a spectrum of membership‑testing strategies. Two heads (L0H1 and L0H5 in GPT‑2 small) function as high‑precision membership filters with false positive rates of 0‑4% even at 180 unique context tokens ‑‑ well above the d_\texthead = 64 bit capacity of a classical Bloom filter. A third head (L1H11) shows the classic Bloom filter capacity curve: its false positive rate follows the theoretical formula p \approx (1 ‑ e^‑kn/m)^k with R^2 = 1.0 and fitted capacity m \approx 5 bits, saturating by n \approx 20 unique tokens. A fourth head initially identified as a Bloom filter (L3H0) was reclassified as a general prefix‑attention head after confound controls revealed its apparent capacity curve was a sequence‑length artifact. Together, the three genuine membership‑testing heads form a multi‑resolution system concentrated in early layers (0‑1), taxonomically distinct from induction and previous‑token heads, with false positive rates that decay monotonically with embedding distance ‑‑ consistent with distance‑sensitive Bloom filters. These heads generalize broadly: they respond to any repeated token type, not just repeated names, with 43% higher generalization than duplicate‑token‑only heads. Ablation reveals these heads contribute to both repeated and novel token processing, indicating that membership testing coexists with broader computational roles. The reclassification of L3H0 through confound controls strengthens rather than weakens the case: the surviving heads withstand the scrutiny that eliminated a false positive in our own analysis.
Authors:Dylan Bouchard, Mohit Singh Chauhan, Viren Bajaj, David Skarbrevik
Abstract:
Uncertainty quantification has emerged as an effective approach to closed‑book hallucination detection for LLMs, but existing methods are largely designed for short‑form outputs and do not generalize well to long‑form generation. We introduce a taxonomy for fine‑grained uncertainty quantification in long‑form LLM outputs that distinguishes methods by design choices at three stages: response decomposition, unit‑level scoring, and response‑level aggregation. We formalize several families of consistency‑based black‑box scorers, providing generalizations and extensions of existing methods. In our experiments across multiple LLMs and datasets, we find 1) claim‑response entailment consistently performs better or on par with more complex claim‑level scorers, 2) claim‑level scoring generally yields better results than sentence‑level scoring, and 3) uncertainty‑aware decoding is highly effective for improving the factuality of long‑form outputs. Our framework clarifies relationships between prior methods, enables apples‑to‑apples comparisons, and provides practical guidance for selecting components for fine‑grained UQ.
Authors:Yunseok Han, Yejoon Lee, Jaeyoung Do
Abstract:
Large Reasoning Models (LRMs) exhibit strong performance, yet often produce rationales that sound plausible but fail to reflect their true decision process, undermining reliability and trust. We introduce a formal framework for reasoning faithfulness, defined by two testable conditions: stance consistency (a coherent stance linking reasoning to answer) and causal influence (the stated reasoning causally drives the answer under output‑level interventions), explicitly decoupled from accuracy. To operationalize this, we present RFEval, a benchmark of 7,186 instances across seven tasks that probes faithfulness via controlled, output‑level counterfactual interventions. Evaluating twelve open‑source LRMs, we find unfaithfulness in 49.7% of outputs, predominantly from stance inconsistency. Failures are concentrated in brittle, convergent domains such as math and code, and correlate more with post‑training regimes than with scale: within‑family ablations indicate that adding current RL‑style objectives on top of supervised fine‑tuning can reduce reasoning faithfulness, even when accuracy is maintained. Crucially, accuracy is neither a sufficient nor a reliable proxy for faithfulness: once controlling for model and task, the accuracy‑faithfulness link is weak and statistically insignificant. Our work establishes a rigorous methodology for auditing LRM reliability and shows that trustworthy AI requires optimizing not only for correct outcomes but also for the structural integrity of the reasoning process. Our code and dataset can be found at project page: \hrefhttps://aidaslab.github.io/RFEval/https://aidaslab.github.io/RFEval/
Authors:Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu, Shengjie Zhou, Xuhao Hu, Feiyu Gao, Junjie Cao, Zihua Wang, Zhiyuan Chen, Jitong Liao, Qi Zheng, Jiahui Zeng, Ze Xu, Shuai Bai, Junyang Lin, Jingren Zhou, Ming Yan
Abstract:
The paper introduces GUI‑Owl‑1.5, the latest native GUI agent model that features instruct/thinking variants in multiple sizes (2B/4B/8B/32B/235B) and supports a range of platforms (desktop, mobile, browser, and more) to enable cloud‑edge collaboration and real‑time interaction. GUI‑Owl‑1.5 achieves state‑of‑the‑art results on more than 20+ GUI benchmarks on open‑source models: (1) on GUI automation tasks, it obtains 56.5 on OSWorld, 71.6 on AndroidWorld, and 48.4 on WebArena; (2) on grounding tasks, it obtains 80.3 on ScreenSpotPro; (3) on tool‑calling tasks, it obtains 47.6 on OSWorld‑MCP, and 46.8 on MobileWorld; (4) on memory and knowledge tasks, it obtains 75.5 on GUI‑Knowledge Bench. GUI‑Owl‑1.5 incorporates several key innovations: (1) Hybird Data Flywheel: we construct the data pipeline for UI understanding and trajectory generation based on a combination of simulated environments and cloud‑based sandbox environments, in order to improve the efficiency and quality of data collection. (2) Unified Enhancement of Agent Capabilities: we use a unified thought‑synthesis pipeline to enhance the model's reasoning capabilities, while placing particular emphasis on improving key agent abilities, including Tool/MCP use, memory and multi‑agent adaptation; (3) Multi‑platform Environment RL Scaling: We propose a new environment RL algorithm, MRPO, to address the challenges of multi‑platform conflicts and the low training efficiency of long‑horizon tasks. The GUI‑Owl‑1.5 models are open‑sourced, and an online cloud‑sandbox demo is available at https://github.com/X‑PLUG/MobileAgent.
Authors:Yiqing Xie, Emmy Liu, Gaokai Zhang, Nachiket Kotalwar, Shubham Gandhi, Sathwik Acharya, Xingyao Wang, Carolyn Rose, Graham Neubig, Daniel Fried
Abstract:
When assessing the quality of coding agents, predominant benchmarks focus on solving single issues on GitHub, such as SWE‑Bench. In contrast, in real use, these agents solve more various and complex tasks that involve other skills such as exploring codebases, testing software, and designing architecture. In this paper, we first characterize some transferable skills that are shared across diverse tasks by decomposing trajectories into fine‑grained components, and derive a set of principles for designing auxiliary training tasks to teach language models these skills. Guided by these principles, we propose a training environment, Hybrid‑Gym, consisting of a set of scalable synthetic tasks, such as function localization and dependency search. Experiments show that agents trained on our synthetic tasks effectively generalize to diverse real‑world tasks that are not present in training, improving a base model by 25.4% absolute gain on SWE‑Bench Verified, 7.9% on SWT‑Bench Verified, and 5.1% on Commit‑0 Lite. Hybrid‑Gym also complements datasets built for the downstream tasks (e.g., improving SWE‑Play by 4.9% on SWT‑Bench Verified). Code available at: https://github.com/yiqingxyq/Hybrid‑Gym.
Authors:Chanhyuk Lee, Jaehoon Yoo, Manan Agarwal, Sheel Shah, Jerry Huang, Aditi Raghunathan, Seunghoon Hong, Nicholas M. Boffi, Jinwoo Kim
Abstract:
Language models based on discrete diffusion have attracted widespread interest for their potential to provide faster generation than autoregressive models. In practice, however, they exhibit a sharp degradation of sample quality in the few‑step regime, failing to realize this promise. Here we show that language models leveraging flow‑based continuous denoising can outperform discrete diffusion in both quality and speed. By revisiting the fundamentals of flows over discrete modalities, we build a flow‑based language model (FLM) that performs Euclidean denoising over one‑hot token encodings. We show that the model can be trained by predicting the clean data via a cross entropy objective, where we introduce a simple time reparameterization that greatly improves training stability and generation quality. By distilling FLM into its associated flow map, we obtain a distilled flow map language model (FMLM) capable of few‑step generation. On the LM1B and OWT language datasets, FLM attains generation quality matching state‑of‑the‑art discrete diffusion models. With FMLM, our approach outperforms recent few‑step language models across the board, with one‑step generation exceeding their 8‑step quality. Our work calls into question the widely held hypothesis that discrete diffusion processes are necessary for generative modeling over discrete modalities, and paves the way toward accelerated flow‑based language modeling at scale. Code is available at https://github.com/david3684/flm.
Authors:Nithin Sivakumaran, Shoubin Yu, Hyunji Lee, Yue Zhang, Ali Payani, Mohit Bansal, Elias Stengel-Eskin
Abstract:
Chain‑of‑thought (CoT) reasoning sometimes fails to faithfully reflect the true computation of a large language model (LLM), hampering its utility in explaining how LLMs arrive at their answers. Moreover, optimizing for faithfulness and interpretability in reasoning often degrades task performance. To address this tradeoff and improve CoT faithfulness, we propose Reasoning Execution by Multiple Listeners (REMUL), a multi‑party reinforcement learning approach. REMUL builds on the hypothesis that reasoning traces which other parties can follow will be more faithful. A speaker model generates a reasoning trace, which is truncated and passed to a pool of listener models who "execute" the trace, continuing the trace to an answer. Speakers are rewarded for producing reasoning that is clear to listeners, with additional correctness regularization via masked supervised finetuning to counter the tradeoff between faithfulness and performance. On multiple reasoning benchmarks (BIG‑Bench Extra Hard, MuSR, ZebraLogicBench, and FOLIO), REMUL consistently and substantially improves three measures of faithfulness ‑‑ hint attribution, early answering area over the curve (AOC), and mistake injection AOC ‑‑ while also improving accuracy. Our analysis finds that these gains are robust across training domains, translate to legibility gains, and are associated with shorter and more direct CoTs.
Authors:Adnan El Assadi, Isaac Chung, Chenghao Xiao, Roman Solomatin, Animesh Jha, Rahul Chand, Silky Singh, Kaitlyn Wang, Ali Sartaz Khan, Marc Moussa Nasser, Sufen Fong, Pengfei He, Alan Xiao, Ayush Sunil Munot, Aditya Shrivastava, Artem Gazizov, Niklas Muennighoff, Kenneth Enevoldsen
Abstract:
We introduce the Massive Audio Embedding Benchmark (MAEB), a large‑scale benchmark covering 30 tasks across speech, music, environmental sounds, and cross‑modal audio‑text reasoning in 100+ languages. We evaluate 50+ models and find that no single model dominates across all tasks: contrastive audio‑text models excel at environmental sound classification (e.g., ESC50) but score near random on multilingual speech tasks (e.g., SIB‑FLEURS), while speech‑pretrained models show the opposite pattern. Clustering remains challenging for all models, with even the best‑performing model achieving only modest results. We observe that models excelling on acoustic understanding often perform poorly on linguistic tasks, and vice versa. We also show that the performance of audio encoders on MAEB correlates highly with their performance when used in audio large language models. MAEB is derived from MAEB+, a collection of 98 tasks. MAEB is designed to maintain task diversity while reducing evaluation cost, and it integrates into the MTEB ecosystem for unified evaluation across text, image, and audio modalities. We release MAEB and all 98 tasks along with code and a leaderboard at https://github.com/embeddings‑benchmark/mteb.
Authors:Zhengliang Liu, Weihang You, Peng Shu, Junhao Chen, Yi Pan, Hanqi Jiang, Yiwei Li, Zhaojun Ding, Chao Cao, Xinliang Li, Yifan Zhou, Ruidong Zhang, Shaochen Xu, Wei Ruan, Huaqin Zhao, Dajiang Zhu, Tianming Liu
Abstract:
American college applications require students to navigate fragmented admissions policies, repetitive and conditional forms, and ambiguous questions that often demand cross‑referencing multiple sources. We present EZCollegeApp, a large language model (LLM)‑powered system that assists high‑school students by structuring application forms, grounding suggested answers in authoritative admissions documents, and maintaining full human control over final responses. The system introduces a mapping‑first paradigm that separates form understanding from answer generation, enabling consistent reasoning across heterogeneous application portals. EZCollegeApp integrates document ingestion from official admissions websites, retrieval‑augmented question answering, and a human‑in‑the‑loop chatbot interface that presents suggestions alongside application fields without automated submission. We describe the system architecture, data pipeline, internal representations, security and privacy measures, and evaluation through automated testing and human quality assessment. Our source code is released on GitHub (https://github.com/ezcollegeapp‑public/ezcollegeapp‑public) to facilitate the broader impact of this work.
Authors:Warren Johnson
Abstract:
In "Compress or Route?" (Johnson, 2026), we found that code generation tolerates aggressive prompt compression (r >= 0.6) while chain‑of‑thought reasoning degrades gradually. That study was limited to HumanEval (164 problems), left the "perplexity paradox" mechanism unvalidated, and provided no adaptive algorithm. This paper addresses all three gaps. First, we validate across six code benchmarks (HumanEval, MBPP, HumanEval+, MultiPL‑E) and four reasoning benchmarks (GSM8K, MATH, ARC‑Challenge, MMLU‑STEM), confirming the compression threshold generalizes across languages and difficulties. Second, we conduct the first per‑token perplexity analysis (n=723 tokens), revealing a "perplexity paradox": code syntax tokens are preserved (high perplexity) while numerical values in math problems are pruned despite being task‑critical (low perplexity). Signature injection recovers +34 percentage points in pass rate (5.3% to 39.3%; Cohen's h=0.890). Third, we propose TAAC (Task‑Aware Adaptive Compression), achieving 22% cost reduction with 96% quality preservation, outperforming fixed‑ratio compression by 7%. MBPP validation (n=1,800 trials) confirms systematic variation: 3.6% at r=0.3 to 54.6% at r=1.0.
Authors:5 Team, Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chengxing Xie, Cunxiang Wang, Gengzheng Pan, Hao Zeng, Haoke Zhang, Haoran Wang, Huilong Chen, Jiajie Zhang, Jian Jiao, Jiaqi Guo, Jingsen Wang, Jingzhao Du, Jinzhu Wu, Kedong Wang, Lei Li, Lin Fan, Lucen Zhong, Mingdao Liu, Mingming Zhao, Pengfan Du, Qian Dong, Rui Lu, Shuang-Li, Shulin Cao, Song Liu, Ting Jiang, Xiaodong Chen, Xiaohan Zhang, Xuancheng Huang, Xuezhen Dong, Yabo Xu, Yao Wei, Yifan An, Yilin Niu, Yitong Zhu, Yuanhao Wen, Yukuo Cen, Yushi Bai, Zhongpei Qiao, Zihan Wang, Zikang Wang, Zilin Zhu, Ziqiang Liu, Zixuan Li, Bojie Wang, Bosi Wen, Can Huang, Changpeng Cai, Chao Yu, Chen Li, Chen Li, Chenghua Huang, Chengwei Hu, Chenhui Zhang, Chenzheng Zhu, Congfeng Yin, Daoyan Lin, Dayong Yang, Di Wang, Ding Ai, Erle Zhu, Fangzhou Yi, Feiyu Chen, Guohong Wen, Hailong Sun, Haisha Zhao, Haiyi Hu, Hanchen Zhang, Hanrui Liu, Hanyu Zhang, Hao Peng, Hao Tai, Haobo Zhang, He Liu, Hongwei Wang, Hongxi Yan, Hongyu Ge, Huan Liu, Huan Liu, Huanpeng Chu, Jia'ni Zhao, Jiachen Wang, Jiajing Zhao, Jiamin Ren, Jiapeng Wang, Jiaxin Zhang, Jiayi Gui, Jiayue Zhao, Jijie Li, Jing An, Jing Li, Jingwei Yuan, Jinhua Du, Jinxin Liu, Junkai Zhi, Junwen Duan, Kaiyue Zhou, Kangjian Wei, Ke Wang, Keyun Luo, Laiqiang Zhang, Leigang Sha, Liang Xu, Lindong Wu, Lintao Ding, Lu Chen, Minghao Li, Nianyi Lin, Pan Ta, Qiang Zou, Rongjun Song, Ruiqi Yang, Shangqing Tu, Shangtong Yang, Shaoxiang Wu, Shengyan Zhang, Shijie Li, Shuang Li, Shuyi Fan, Wei Qin, Wei Tian, Weining Zhang, Wenbo Yu, Wenjie Liang, Xiang Kuang, Xiangmeng Cheng, Xiangyang Li, Xiaoquan Yan, Xiaowei Hu, Xiaoying Ling, Xing Fan, Xingye Xia, Xinyuan Zhang, Xinze Zhang, Xirui Pan, Xunkai Zhang, Yandong Wu, Yanfu Li, Yidong Wang, Yifan Zhu, Yijun Tan, Yilin Zhou, Yiming Pan, Ying Zhang, Yinpei Su, Yipeng Geng, Yipeng Geng, Yong Yan, Yonglin Tan, Yuean Bi, Yuhan Shen, Yuhao Yang, Yujiang Li, Yunan Liu, Yunqing Wang, Yuntao Li, Yurong Wu, Yutao Zhang, Yuxi Duan, Yuxuan Zhang, Zezhen Liu, Zhengtao Jiang, Zhenhe Yan, Zheyu Zhang, Zhixiang Wei, Zhuo Chen, Zhuoer Feng, Zijun Yao, Ziwei Chai, Ziyuan Wang, Zuzhou Zhang, Bin Xu, Minlie Huang, Hongning Wang, Juanzi Li, Yuxiao Dong, Jie Tang
Abstract:
We present GLM‑5, a next‑generation foundation model designed to transition the paradigm of vibe coding to agentic engineering. Building upon the agentic, reasoning, and coding (ARC) capabilities of its predecessor, GLM‑5 adopts DSA to significantly reduce training and inference costs while maintaining long‑context fidelity. To advance model alignment and autonomy, we implement a new asynchronous reinforcement learning infrastructure that drastically improves post‑training efficiency by decoupling generation from training. Furthermore, we propose novel asynchronous agent RL algorithms that further improve RL quality, enabling the model to learn from complex, long‑horizon interactions more effectively. Through these innovations, GLM‑5 achieves state‑of‑the‑art performance on major open benchmarks. Most critically, GLM‑5 demonstrates unprecedented capability in real‑world coding tasks, surpassing previous baselines in handling end‑to‑end software engineering challenges. Code, models, and more information are available at https://github.com/zai‑org/GLM‑5.
Authors:Yihan Wang, Peiyu Liu, Runyu Chen, Wei Xu
Abstract:
Text‑to‑SQL has recently achieved impressive progress, yet remains difficult to apply effectively in real‑world scenarios. This gap stems from the reliance on single static workflows, fundamentally limiting scalability to out‑of‑distribution and long‑tail scenarios. Instead of requiring users to select suitable methods through extensive experimentation, we attempt to enable systems to adaptively construct workflows at inference time. Through theoretical and empirical analysis, we demonstrate that optimal dynamic policies consistently outperform the best static workflow, with performance gains fundamentally driven by heterogeneity across candidate workflows. Motivated by this, we propose SquRL, a reinforcement learning framework that enhances LLMs' reasoning capability in adaptive workflow construction. We design a rule‑based reward function and introduce two effective training mechanisms: dynamic actor masking to encourage broader exploration, and pseudo rewards to improve training efficiency. Experiments on widely‑used Text‑to‑SQL benchmarks demonstrate that dynamic workflow construction consistently outperforms the best static workflow methods, with especially pronounced gains on complex and out‑of‑distribution queries. The codes are available at https://github.com/Satissss/SquRL
Authors:Chansung Park, Juyong Jiang, Fan Wang, Sayak Paul, Jiasi Shen, Jing Tang, Jianguo Li
Abstract:
Large Language Models (LLMs) are changing the coding paradigm, known as vibe coding, yet synthesizing algorithmically sophisticated and robust code still remains a critical challenge. Incentivizing the deep reasoning capabilities of LLMs is essential to overcoming this hurdle. Reinforcement Fine‑Tuning (RFT) has emerged as a promising strategy to address this need. However, most existing approaches overlook the heterogeneous difficulty and granularity inherent in test cases, leading to an imbalanced distribution of reward signals and consequently biased gradient updates during training. To address this, we propose Test‑driven and cApability‑adaptive cuRriculum reinfOrcement fine‑Tuning (TAROT). TAROT systematically constructs, for each problem, a four‑tier test suite (basic, intermediate, complex, edge), providing a controlled difficulty landscape for curriculum design and evaluation. Crucially, TAROT decouples curriculum progression from raw reward scores, enabling capability‑conditioned evaluation and principled selection from a portfolio of curriculum policies rather than incidental test‑case difficulty composition. This design fosters stable optimization and more efficient competency acquisition. Extensive experimental results reveal that the optimal curriculum for RFT in code generation is closely tied to a model's inherent capability, with less capable models achieving greater gains with an easy‑to‑hard progression, whereas more competent models excel under a hard‑first curriculum. TAROT provides a reproducible method that adaptively tailors curriculum design to a model's capability, thereby consistently improving the functional correctness and robustness of the generated code. All code and data are released to foster reproducibility and advance community research at https://github.com/deep‑diver/TAROT.
Authors:Xiaoze Liu, Ruowang Zhang, Weichen Yu, Siheng Xiong, Liu He, Feijie Wu, Hoin Jung, Matt Fredrikson, Xiaoqian Wang, Jing Gao
Abstract:
Multi‑Agent Systems (MAS) powered by Large Language Models have unlocked advanced collaborative reasoning, yet they remain shackled by the inefficiency of discrete text communication, which imposes significant runtime overhead and information quantization loss. While latent state transfer offers a high‑bandwidth alternative, existing approaches either assume homogeneous sender‑receiver architectures or rely on pair‑specific learned translators, limiting scalability and modularity across diverse model families with disjoint manifolds. In this work, we propose the Vision Wormhole, a novel framework that repurposes the visual interface of Vision‑Language Models (VLMs) to enable model‑agnostic, text‑free communication. By introducing a Universal Visual Codec, we map heterogeneous reasoning traces into a shared continuous latent space and inject them directly into the receiver's visual pathway, effectively treating the vision encoder as a universal port for inter‑agent telepathy. Our framework adopts a hub‑and‑spoke topology to reduce pairwise alignment complexity from O(N^2) to O(N) and leverages a label‑free, teacher‑student distillation objective to align the high‑speed visual channel with the robust reasoning patterns of the text pathway. Extensive experiments across heterogeneous model families (e.g., Qwen‑VL, Gemma) demonstrate that the Vision Wormhole reduces end‑to‑end wall‑clock time in controlled comparisons while maintaining reasoning fidelity comparable to standard text‑based MAS. Code is available at https://github.com/xz‑liu/heterogeneous‑latent‑mas
Authors:Kiho Park, Todd Nief, Yo Joong Choe, Victor Veitch
Abstract:
This paper concerns the question of how AI systems encode semantic structure into the geometric structure of their representation spaces. The motivating observation of this paper is that the natural geometry of these representation spaces should reflect the way models use representations to produce behavior. We focus on the important special case of representations that define softmax distributions. In this case, we argue that the natural geometry is information geometry. Our focus is on the role of information geometry on semantic encoding and the linear representation hypothesis. As an illustrative application, we develop "dual steering", a method for robustly steering representations to exhibit a particular concept using linear probes. We prove that dual steering optimally modifies the target concept while minimizing changes to off‑target concepts. Empirically, we find that dual steering enhances the controllability and stability of concept manipulation.
Authors:Victor De Lima, Jiqun Liu, Grace Hui Yang
Abstract:
Information ecosystems increasingly shape how people internalize exposure to adverse digital experiences, raising concerns about the long‑term consequences for information health. In modern search and recommendation systems, ranking and personalization policies play a central role in shaping such exposure and its long‑term effects on users. To study these effects in a controlled setting, we present FrameRef, a large‑scale dataset of 1,073,740 systematically reframed claims across five framing dimensions: authoritative, consensus, emotional, prestige, and sensationalist, and propose a simulation‑based framework for modeling sequential information exposure and reinforcement dynamics characteristic of ranking and recommendation systems. Within this framework, we construct framing‑sensitive agent personas by fine‑tuning language models with framing‑conditioned loss attenuation, inducing targeted biases while preserving overall task competence. Using Monte Carlo trajectory sampling, we show that small, systematic shifts in acceptance and confidence can compound over time, producing substantial divergence in cumulative information health trajectories. Human evaluation further confirms that FrameRef's generated framings measurably affect human judgment. Together, our dataset and framework provide a foundation for systematic information health research through simulation, complementing and informing responsible human‑centered research. We release FrameRef, code, documentation, human evaluation data, and persona adapter models at https://github.com/infosenselab/frameref.
Authors:Mihir Panchal, Deeksha Varshney, Mamta, Asif Ekbal
Abstract:
Multilingual large language models (LLMs) are increasingly deployed in linguistically diverse regions like India, yet most interpretability tools remain tailored to English. Prior work reveals that LLMs often operate in English centric representation spaces, making cross lingual interpretability a pressing concern. We introduce Indic‑TunedLens, a novel interpretability framework specifically for Indian languages that learns shared affine transformations. Unlike the standard Logit Lens, which directly decodes intermediate activations, Indic‑TunedLens adjusts hidden states for each target language, aligning them with the target output distributions to enable more faithful decoding of model representations. We evaluate our framework on 10 Indian languages using the MMLU benchmark and find that it significantly improves over SOTA interpretability methods, especially for morphologically rich, low resource languages. Our results provide crucial insights into the layer‑wise semantic encoding of multilingual transformers. Our model is available at https://huggingface.co/spaces/MihirRajeshPanchal/IndicTunedLens. Our code is available at https://github.com/MihirRajeshPanchal/IndicTunedLens.
Authors:Xin Qiu, Junlong Tong, Yirong Sun, Yunpu Ma, Wei Zhang, Xiaoyu Shen
Abstract:
Large language models (LLMs) have been introduced to time series forecasting (TSF) to incorporate contextual knowledge beyond numerical signals. However, existing studies question whether LLMs provide genuine benefits, often reporting comparable performance without LLMs. We show that such conclusions stem from limited evaluation settings and do not hold at scale. We conduct a large‑scale study of LLM‑based TSF (LLM4TSF) across 8 billion observations, 17 forecasting scenarios, 4 horizons, multiple alignment strategies, and both in‑domain and out‑of‑domain settings. Our results demonstrate that \emphLLM4TS indeed improves forecasting performance, with especially large gains in cross‑domain generalization. Pre‑alignment outperforming post‑alignment in over 90% of tasks. Both pretrained knowledge and model architecture of LLMs contribute and play complementary roles: pretraining is critical under distribution shifts, while architecture excels at modeling complex temporal dynamics. Moreover, under large‑scale mixed distributions, a fully intact LLM becomes indispensable, as confirmed by token‑level routing analysis and prompt‑based improvements. Overall, Our findings overturn prior negative assessments, establish clear conditions under which LLMs are not only useful, and provide practical guidance for effective model design. We release our code at https://github.com/EIT‑NLP/LLM4TSF.
Authors:Xiao Wei, Bin Wen, Yuqin Lin, Kai Li, Mingyang gu, Xiaobao Wang, Longbiao Wang, Jianwu Dang
Abstract:
Early diagnosis of Alzheimer's Disease (AD) is crucial for delaying its progression. While AI‑based speech detection is non‑invasive and cost‑effective, it faces a critical data efficiency dilemma due to medical data scarcity and privacy barriers. Therefore, we propose FAL‑AD, a novel framework that synergistically integrates federated learning with data augmentation to systematically optimize data efficiency. Our approach delivers three key breakthroughs: First, absolute efficiency improvement through voice conversion‑based augmentation, which generates diverse pathological speech samples via cross‑category voice‑content recombination. Second, collaborative efficiency breakthrough via an adaptive federated learning paradigm, maximizing cross‑institutional benefits under privacy constraints. Finally, representational efficiency optimization by an attentive cross‑modal fusion model, which achieves fine‑grained word‑level alignment and acoustic‑textual interaction. Evaluated on ADReSSo, FAL‑AD achieves a state‑of‑the‑art multi‑modal accuracy of 91.52%, outperforming all centralized baselines and demonstrating a practical solution to the data efficiency dilemma. Our source code is publicly available at https://github.com/smileix/fal‑ad.
Authors:Jiahao Yuan, Yike Xu, Jinyong Wen, Baokun Wang, Ziyi Gao, Xiaotong Lin, Yun Liu, Xing Fu, Yu Cheng, Yongchao Liu, Weiqiang Wang, Zhongle Xie
Abstract:
Industrial‑scale user representation learning requires balancing robust universality with acute task‑sensitivity. However, existing paradigms primarily yield static, task‑agnostic embeddings that struggle to reconcile the divergent requirements of downstream scenarios within unified vector spaces. Furthermore, heterogeneous multi‑source data introduces inherent noise and modality conflicts, degrading representation. We propose Query‑as‑Anchor, a framework shifting user modeling from static encoding to dynamic, query‑aware synthesis. To empower Large Language Models (LLMs) with deep user understanding, we first construct UserU, an industrial‑scale pre‑training dataset that aligns multi‑modal behavioral sequences with user understanding semantics, and our Q‑Anchor Embedding architecture integrates hierarchical coarse‑to‑fine encoders into dual‑tower LLMs via joint contrastive‑autoregressive optimization for query‑aware user representation. To bridge the gap between general pre‑training and specialized business logic, we further introduce Cluster‑based Soft Prompt Tuning to enforce discriminative latent structures, effectively aligning model attention with scenario‑specific modalities. For deployment, anchoring queries at sequence termini enables KV‑cache‑accelerated inference with negligible incremental latency. Evaluations on 10 Alipay industrial benchmarks show consistent SOTA performance, strong scalability, and efficient deployment. Large‑scale online A/B testing in Alipay's production system across two real‑world scenarios further validates its practical effectiveness. Our code is prepared for public release and will be available at: https://github.com/JhCircle/Q‑Anchor.
Authors:Zachary Bamberger, Till R. Saenger, Gilad Morad, Ofra Amir, Brandon M. Stewart, Amir Feder
Abstract:
Inference‑Time‑Compute (ITC) methods like Best‑of‑N and Tree‑of‑Thoughts are meant to produce output candidates that are both high‑quality and diverse, but their use of high‑temperature sampling often fails to achieve meaningful output diversity. Moreover, existing ITC methods offer limited control over how to perform reasoning, which in turn limits their explainability. We present STATe‑of‑Thoughts (STATe), an interpretable ITC method that searches over high‑level reasoning patterns. STATe replaces stochastic sampling with discrete and interpretable textual interventions: a controller selects actions encoding high‑level reasoning choices, a generator produces reasoning steps conditioned on those choices, and an evaluator scores candidates to guide search. This structured approach yields three main advantages. First, action‑guided textual interventions produce greater response diversity than temperature‑based sampling. Second, in a case study on argument generation, STATe's explicit action sequences capture interpretable features that are highly predictive of output quality. Third, estimating the association between performance and action choices allows us to identify promising yet unexplored regions of the action space and steer generation directly toward them. Together, these results establish STATe as a practical framework for generating high‑quality, diverse, and interpretable text. Our framework is available at https://github.com/zbambergerNLP/state‑of‑thoughts.
Authors:Lingxiang Hu, Yiding Sun, Tianle Xia, Wenwei Li, Ming Xu, Liqun Liu, Peng Shu, Huan Yu, Jie Jiang
Abstract:
While Large Language Model (LLM) agents have achieved remarkable progress in complex reasoning tasks, evaluating their performance in real‑world environments has become a critical problem. Current benchmarks, however, are largely restricted to idealized simulations, failing to address the practical demands of specialized domains like advertising and marketing analytics. In these fields, tasks are inherently more complex, often requiring multi‑round interaction with professional marketing tools. To address this gap, we propose AD‑Bench, a benchmark designed based on real‑world business requirements of advertising and marketing platforms. AD‑Bench is constructed from real user marketing analysis requests, with domain experts providing verifiable reference answers and corresponding reference tool‑call trajectories. The benchmark categorizes requests into three difficulty levels (L1‑L3) to evaluate agents' capabilities under multi‑round, multi‑tool collaboration. Experiments show that on AD‑Bench, Gemini‑3‑Pro achieves Pass@1 = 68.0% and Pass@3 = 83.0%, but performance drops significantly on L3 to Pass@1 = 49.4% and Pass@3 = 62.1%, with a trajectory coverage of 70.1%, indicating that even state‑of‑the‑art models still exhibit substantial capability gaps in complex advertising and marketing analysis scenarios. AD‑Bench provides a realistic benchmark for evaluating and improving advertising marketing agents, the leaderboard and code can be found at https://github.com/Emanual20/adbench‑leaderboard.
Authors:Zheng Chu, Xiao Wang, Jack Hong, Huiming Fan, Yuqi Huang, Yue Yang, Guohai Xu, Chenxiao Zhao, Cheng Xiang, Shengchao Hu, Dongdong Kuang, Ming Liu, Bing Qin, Xing Yu
Abstract:
Large language models are transitioning from generalpurpose knowledge engines to realworld problem solvers, yet optimizing them for deep search tasks remains challenging. The central bottleneck lies in the extreme sparsity of highquality search trajectories and reward signals, arising from the difficulty of scalable longhorizon task construction and the high cost of interactionheavy rollouts involving external tool calls. To address these challenges, we propose REDSearcher, a unified framework that codesigns complex task synthesis, midtraining, and posttraining for scalable searchagent optimization. Specifically, REDSearcher introduces the following improvements: (1) We frame task synthesis as a dualconstrained optimization, where task difficulty is precisely governed by graph topology and evidence dispersion, allowing scalable generation of complex, highquality tasks. (2) We introduce toolaugmented queries to encourage proactive tool use rather than passive recall.(3) During midtraining, we strengthen core atomic capabilities knowledge, planning, and function calling substantially reducing the cost of collecting highquality trajectories for downstream training. (4) We build a local simulated environment that enables rapid, lowcost algorithmic iteration for reinforcement learning experiments. Across both textonly and multimodal searchagent benchmarks, our approach achieves stateoftheart performance. To facilitate future research on longhorizon search agents, we will release 10K highquality complex text search trajectories, 5K multimodal trajectories and 1K text RL query set, and together with code and model checkpoints.
Authors:Samir Abdaljalil, Erchin Serpedin, Hasan Kurban
Abstract:
Large language models are increasingly used to answer and verify scientific claims, yet existing evaluations typically assume that a model must always produce a definitive answer. In scientific settings, however, unsupported or uncertain conclusions can be more harmful than abstaining. We study this problem through an abstention‑aware verification framework that decomposes scientific claims into minimal conditions, audits each condition against available evidence using natural language inference (NLI), and selectively decides whether to support, refute, or abstain. We evaluate this framework across two complementary scientific benchmarks: SciFact and PubMedQA, covering both closed‑book and open‑domain evidence settings. Experiments are conducted with six diverse language models, including encoder‑decoder, open‑weight chat models, and proprietary APIs. Across all benchmarks and models, we observe that raw accuracy varies only modestly across architectures, while abstention plays a critical role in controlling error. In particular, confidence‑based abstention substantially reduces risk at moderate coverage levels, even when absolute accuracy improvements are limited. Our results suggest that in scientific reasoning tasks, the primary challenge is not selecting a single best model, but rather determining when available evidence is sufficient to justify an answer. This work highlights abstention‑aware evaluation as a practical and model‑agnostic lens for assessing scientific reliability, and provides a unified experimental basis for future work on selective reasoning in scientific domains. Code is available at https://github.com/sabdaljalil2000/ai4science .
Authors:Sen Yang, Shanbo Cheng, Lu Xu, Jianbing Zhang, Shujian Huang
Abstract:
While Group Relative Policy Optimization (GRPO) offers a powerful framework for LLM post‑training, its effectiveness in open‑ended domains like Machine Translation hinges on accurate intra‑group ranking. We identify that standard Scalar Quality Metrics (SQM) fall short in this context; by evaluating candidates in isolation, they lack the comparative context necessary to distinguish fine‑grained linguistic nuances. To address this, we introduce the Group Quality Metric (GQM) paradigm and instantiate it via the Group Relative Reward Model (GRRM). Unlike traditional independent scorers, GRRM processes the entire candidate group jointly, leveraging comparative analysis to rigorously resolve relative quality and adaptive granularity. Empirical evaluations confirm that GRRM achieves competitive ranking accuracy among all baselines. Building on this foundation, we integrate GRRM into the GRPO training loop to optimize the translation policy. Experimental results demonstrate that our framework not only improves general translation quality but also unlocks reasoning capabilities comparable to state‑of‑the‑art reasoning models. We release codes, datasets, and model checkpoints at https://github.com/NJUNLP/GRRM.
Authors:Weiqi Zhai, Zhihai Wang, Jinghang Wang, Boyu Yang, Xiaogang Li, Xiang Xu, Bohan Wang, Peng Wang, Xingzhe Wu, Anfeng Li, Qiyuan Feng, Yuhao Zhou, Shoulin Han, Wenjie Luo, Yiyuan Li, Yaxuan Wang, Ruixian Luo, Guojie Lin, Peiyao Xiao, Chengliang Xu, Ben Wang, Zeyu Wang, Zichao Chen, Jianan Ye, Yijie Hu, Jialong Chen, Zongwen Shen, Yuliang Xu, An Yang, Bowen Yu, Dayiheng Liu, Junyang Lin, Hu Wei, Que Shen, Bing Zhao
Abstract:
Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi‑domain questions. However, community‑led analyses have raised concerns that HLE contains a non‑trivial number of noisy items, which can bias evaluation results and distort cross‑model comparisons. To address this challenge, we introduce HLE‑Verified, a verified and revised version of HLE with a transparent verification protocol and fine‑grained error taxonomy. Our construction follows a two‑stage validation‑and‑repair workflow resulting in a certified benchmark. In Stage I, each item undergoes binary validation of the problem and final answer through domain‑expert review and model‑based cross‑checks, yielding 641 verified items. In Stage II, flawed but fixable items are revised under strict constraints preserving the original evaluation intent, through dual independent expert repairs, model‑assisted auditing, and final adjudication, resulting in 1,170 revised‑and‑certified items. The remaining 689 items are released as a documented uncertain set with explicit uncertainty sources and expertise tags for future refinement. We evaluate seven state‑of‑the‑art language models on HLE and HLE‑Verified, observing an average absolute accuracy gain of 7‑‑10 percentage points on HLE‑Verified. The improvement is particularly pronounced on items where the original problem statement and/or reference answer is erroneous, with gains of 30‑‑40 percentage points. Our analyses further reveal a strong association between model confidence and the presence of errors in the problem statement or reference answer, supporting the effectiveness of our revisions. Overall, HLE‑Verified improves HLE‑style evaluations by reducing annotation noise and enabling more faithful measurement of model capabilities. Data is available at: https://github.com/SKYLENAGE‑AI/HLE‑Verified
Authors:Shuoyuan Wang, Yiran Wang, Hongxin Wei
Abstract:
Data‑driven approaches like deep learning are rapidly advancing planetary science, particularly in Mars exploration. Despite recent progress, most existing benchmarks remain confined to closed‑set supervised visual tasks and do not support text‑guided retrieval for geospatial discovery. We introduce MarsRetrieval, a retrieval benchmark for evaluating vision‑language models for Martian geospatial discovery. MarsRetrieval includes three tasks: (1) paired image‑text retrieval, (2) landform retrieval, and (3) global geo‑localization, covering multiple spatial scales and diverse geomorphic origins. We propose a unified retrieval‑centric protocol to benchmark multimodal embedding architectures, including contrastive dual‑tower encoders and generative vision‑language models. Our evaluation shows MarsRetrieval is challenging: even strong foundation models often fail to capture domain‑specific geomorphic distinctions. We further show that domain‑specific fine‑tuning is critical for generalizable geospatial discovery in planetary settings. Our code is available at https://github.com/ml‑stat‑Sustech/MarsRetrieval
Authors:Yuhan Cheng, Hancheng Ye, Hai Helen Li, Jingwei Sun, Yiran Chen
Abstract:
Large language model (LLM) agents are increasingly deployed in personalized tasks involving sensitive, context‑dependent information, where privacy violations may arise in agents' action due to the implicitness of contextual privacy. Existing approaches rely on external, inference‑time interventions which are brittle, scenario‑specific, and may expand the privacy attack surface. We propose PrivAct, a contextual privacy‑aware multi‑agent learning framework that internalizes contextual privacy preservation directly into models' generation behavior for privacy‑compliant agentic actions. By embedding privacy preferences into each agent, PrivAct enhances system‑wide contextual integrity while achieving a more favorable privacy‑helpfulness tradeoff. Experiments across multiple LLM backbones and benchmarks demonstrate consistent improvements in contextual privacy preservation, reducing leakage rates by up to 12.32% while maintaining comparable helpfulness, as well as zero‑shot generalization and robustness across diverse multi‑agent topologies. Code is available at https://github.com/chengyh23/PrivAct.
Authors:Ruomeng Ding, Yifei Pang, He Sun, Yizhong Wang, Zhiwei Steven Wu, Zhun Deng
Abstract:
Evaluation and alignment pipelines for large language models increasingly rely on LLM‑based judges, whose behavior is guided by natural‑language rubrics and validated on benchmarks. We identify a previously under‑recognized vulnerability in this workflow, which we term Rubric‑Induced Preference Drift (RIPD). Even when rubric edits pass benchmark validation, they can still produce systematic and directional shifts in a judge's preferences on target domains. Because rubrics serve as a high‑level decision interface, such drift can emerge from seemingly natural, criterion‑preserving edits and remain difficult to detect through aggregate benchmark metrics or limited spot‑checking. We further show this vulnerability can be exploited through rubric‑based preference attacks, in which benchmark‑compliant rubric edits steer judgments away from a fixed human or trusted reference on target domains, systematically inducing RIPD and reducing target‑domain accuracy up to 9.5% (helpfulness) and 27.9% (harmlessness). When these judgments are used to generate preference labels for downstream post‑training, the induced bias propagates through alignment pipelines and becomes internalized in trained policies. This leads to persistent and systematic drift in model behavior. Overall, our findings highlight evaluation rubrics as a sensitive and manipulable control interface, revealing a system‑level alignment risk that extends beyond evaluator reliability alone. The code is available at: https://github.com/ZDCSlab/Rubrics‑as‑an‑Attack‑Surface. Warning: Certain sections may contain potentially harmful content that may not be appropriate for all readers.
Authors:Manish Dhakal, Uthman Jinadu, Anjila Budathoki, Rajshekhar Sunderraman, Yi Ding
Abstract:
Standard Knowledge Distillation (KD) compresses Large Language Models (LLMs) by optimizing final outputs, yet it typically treats the teacher's intermediate layer's thought process as a black box. While feature‑based distillation attempts to bridge this gap, existing methods (e.g., MSE and asymmetric KL divergence) ignore the rich uncertainty profiles required for the final output. In this paper, we introduce DistillLens, a framework that symmetrically aligns the evolving thought processes of student and teacher models. By projecting intermediate hidden states into the vocabulary space via the Logit Lens, we enforce structural alignment using a symmetric divergence objective. Our analysis proves that this constraint imposes a dual‑sided penalty, preventing both overconfidence and underconfidence while preserving the high‑entropy information conduits essential for final deduction. Extensive experiments on GPT‑2 and Llama architectures demonstrate that DistillLens consistently outperforms standard KD and feature‑transfer baselines on diverse instruction‑following benchmarks. The code is available at https://github.com/manishdhakal/DistillLens.
Authors:Yike Wang, Faeze Brahman, Shangbin Feng, Teng Xiao, Hannaneh Hajishirzi, Yulia Tsvetkov
Abstract:
Reward models (RMs) play a central role throughout the language model (LM) pipeline, particularly in non‑verifiable domains. However, the dominant LLM‑as‑a‑Judge paradigm relies on the strong reasoning capabilities of large models, while alternative approaches require reference responses or explicit rubrics, limiting flexibility and broader accessibility. In this work, we propose FLIP (FLipped Inference for Prompt reconstruction), a reference‑free and rubric‑free reward modeling approach that reformulates reward modeling through backward inference: inferring the instruction that would most plausibly produce a given response. The similarity between the inferred and the original instructions is then used as the reward signal. Evaluations across four domains using 13 small language models show that FLIP outperforms LLM‑as‑a‑Judge baselines by an average of 79.6%. Moreover, FLIP substantially improves downstream performance in extrinsic evaluations under test‑time scaling via parallel sampling and GRPO training. We further find that FLIP is particularly effective for longer outputs and robust to common forms of reward hacking. By explicitly exploiting the validation‑generation gap, FLIP enables reliable reward modeling in downscaled regimes where judgment methods fail. Code available at https://github.com/yikee/FLIP.
Authors:Xu Li, Simon Yu, Minzhou Pan, Yiyou Sun, Bo Li, Dawn Song, Xue Lin, Weiyan Shi
Abstract:
LLM‑based agents are becoming increasingly capable, yet their safety lags behind. This creates a gap between what agents can do and should do. This gap widens as agents engage in multi‑turn interactions and employ diverse tools, introducing new risks overlooked by existing benchmarks. To systematically scale safety testing into multi‑turn, tool‑realistic settings, we propose a principled taxonomy that transforms single‑turn harmful tasks into multi‑turn attack sequences. Using this taxonomy, we construct MT‑AgentRisk (Multi‑Turn Agent Risk Benchmark), the first benchmark to evaluate multi‑turn tool‑using agent safety. Our experiments reveal substantial safety degradation: the Attack Success Rate (ASR) increases by 16% on average across open and closed models in multi‑turn settings. To close this gap, we propose ToolShield, a training‑free, tool‑agnostic, self‑exploration defense: when encountering a new tool, the agent autonomously generates test cases, executes them to observe downstream effects, and distills safety experiences for deployment. Experiments show that ToolShield effectively reduces ASR by 30% on average in multi‑turn interactions. Our code is available at https://github.com/CHATS‑lab/ToolShield.
Authors:Darren Li, Meiqi Chen, Chenze Shao, Fandong Meng, Jie Zhou
Abstract:
Frontier language models improve with additional test‑time computation, but serial reasoning or uncoordinated parallel sampling can be compute‑inefficient under fixed inference budgets. We propose SELFCEST, which equips a base model with the ability to spawn same‑weight clones in separate parallel contexts by agentic reinforcement learning. Training is end‑to‑end under a global task reward with shared‑parameter rollouts, yielding a learned controller that allocates both generation and context budget across branches. Across challenging math reasoning benchmarks and long‑context multi‑hop QA, SELFCEST improves the accuracy‑cost Pareto frontier relative to monolithic baselines at matched inference budget, and exhibits out‑of‑distribution generalization in both domains.
Authors:Deepak Babu Piskala
Abstract:
Large language model (LLM) agents have emerged as powerful tools for complex tasks, yet their ability to adapt to individual users remains fundamentally limited. We argue this limitation stems from a critical architectural conflation: current systems treat memory, learning, and personalization as a unified capability rather than three distinct mechanisms requiring different infrastructure, operating on different timescales, and benefiting from independent optimization. We propose MAPLE (Memory‑Adaptive Personalized LEarning), a principled decomposition where Memory handles storage and retrieval infrastructure; Learning extracts intelligence from accumulated interactions asynchronously; and Personalization applies learned knowledge in real‑time within finite context budgets. Each component operates as a dedicated sub‑agent with specialized tooling and well‑defined interfaces. Experimental evaluation on the MAPLE‑Personas benchmark demonstrates that our decomposition achieves a 14.6% improvement in personalization score compared to a stateless baseline (p < 0.01, Cohen's d = 0.95) and increases trait incorporation rate from 45% to 75% ‑‑ enabling agents that genuinely learn and adapt.
Authors:Bowen Liu, Zhi Wu, Runquan Xie, Zhanhui Kang, Jia Li
Abstract:
Scaling verifiable training signals remains a key bottleneck for Reinforcement Learning from Verifiable Rewards (RLVR). Logical reasoning is a natural substrate: constraints are formal and answers are programmatically checkable. However, prior synthesis pipelines either depend on expert‑written code or operate within fixed templates/skeletons, which limits growth largely to instance‑level perturbations. We propose SSLogic, an agentic meta‑synthesis framework that scales at the task‑family level by iteratively synthesizing and repairing executable Generator‑‑Validator program pairs in a closed Generate‑‑Validate‑‑Repair loop, enabling continuous family evolution with controllable difficulty. To ensure reliability, we introduce a Multi‑Gate Validation Protocol that combines multi‑strategy consistency checks with Adversarial Blind Review, where independent agents must solve instances by writing and executing code to filter ambiguous or ill‑posed tasks. Starting from 400 seed families, two evolution rounds expand to 953 families and 21,389 verifiable instances (from 5,718). Training on SSLogic‑evolved data yields consistent gains over the seed baseline at matched training steps, improving SynLogic by +5.2, BBEH by +1.4, AIME25 by +3.0, and Brumo25 by +3.7.
Authors:Sayan Deb Sarkar, Rémi Pautrat, Ondrej Miksik, Marc Pollefeys, Iro Armeni, Mahdi Rad, Mihai Dusmanu
Abstract:
Video Language Models (VideoLMs) empower AI systems to understand temporal dynamics in videos. To fit to the maximum context window constraint, current methods use keyframe sampling which can miss both macro‑level events and micro‑level details due to the sparse temporal coverage. Furthermore, processing full images and their tokens for each frame incurs substantial computational overhead. To address these limitations, we propose to leverage video codec primitives (specifically motion vectors and residuals) which natively encode video redundancy and sparsity without requiring expensive full‑image encoding for most frames. To this end, we introduce lightweight transformer‑based encoders that aggregate codec primitives and align their representations with image encoder embeddings through a pre‑training strategy that accelerates convergence during end‑to‑end fine‑tuning. Our approach reduces the time‑to‑first‑token by up to 86% and token usage by up to 93% compared to standard VideoLMs. Moreover, by varying the keyframe and codec primitive densities we are able to maintain or exceed performance on 14 diverse video understanding benchmarks spanning general question answering, temporal reasoning, long‑form understanding, and spatial scene understanding.
Authors:Ali Mekky, Mohamed El Zeftawy, Lara Hassan, Amr Keleg, Preslav Nakov
Abstract:
Being modeled as a single‑label classification task for a long time, recent work has argued that Arabic Dialect Identification (ADI) should be framed as a multi‑label classification task. However, ADI remains constrained by the availability of single‑label datasets, with no large‑scale multi‑label resources available for training. By analyzing models trained on single‑label ADI data, we show that the main difficulty in repurposing such datasets for Multi‑Label Arabic Dialect Identification (MLADI) lies in the selection of negative samples, as many sentences treated as negative could be acceptable in multiple dialects. To address these issues, we construct a multi‑label dataset by generating automatic multi‑label annotations using GPT‑4o and binary dialect acceptability classifiers, with aggregation guided by the Arabic Level of Dialectness (ALDi). Afterward, we train a BERT‑based multi‑label classifier using curriculum learning strategies aligned with dialectal complexity and label cardinality. On the MLADI leaderboard, our best‑performing LAHJATBERT model achieves a macro F1 of 0.69, compared to 0.55 for the strongest previously reported system. Code and data are available at https://mohamedalaa9.github.io/lahjatbert/.
Authors:Yunshuang Nie, Bingqian Lin, Minzhe Niu, Kun Xiang, Jianhua Han, Guowei Huang, Xingyue Quan, Hang Xu, Bokui Chen, Xiaodan Liang
Abstract:
Pre‑trained Multi‑modal Large Language Models (MLLMs) provide a knowledge‑rich foundation for post‑training by leveraging their inherent perception and reasoning capabilities to solve complex tasks. However, the lack of an efficient evaluation framework impedes the diagnosis of their performance bottlenecks. Current evaluation primarily relies on testing after supervised fine‑tuning, which introduces laborious additional training and autoregressive decoding costs. Meanwhile, common pre‑training metrics cannot quantify a model's perception and reasoning abilities in a disentangled manner. Furthermore, existing evaluation benchmarks are typically limited in scale or misaligned with pre‑training objectives. Thus, we propose RADAR, an efficient ability‑centric evaluation framework for Revealing Asymmetric Development of Abilities in MLLM pRe‑training. RADAR involves two key components: (1) Soft Discrimination Score, a novel metric for robustly tracking ability development without fine‑tuning, based on quantifying nuanced gradations of the model preference for the correct answer over distractors; and (2) Multi‑Modal Mixture Benchmark, a new 15K+ sample benchmark for comprehensively evaluating pre‑trained MLLMs' perception and reasoning abilities in a 0‑shot manner, where we unify authoritative benchmark datasets and carefully collect new datasets, extending the evaluation scope and addressing the critical gaps in current benchmarks. With RADAR, we comprehensively reveal the asymmetric development of perceptual and reasoning capabilities in pretrained MLLMs across diverse factors, including data volume, model size, and pretraining strategy. Our RADAR underscores the need for a decomposed perspective on pre‑training ability bottlenecks, informing targeted interventions to advance MLLMs efficiently. Our code is publicly available at https://github.com/Nieysh/RADAR.
Authors:Luca Tedeschini, Matteo Fasulo
Abstract:
Detecting reclaimed slurs represents a fundamental challenge for hate speech detection systems, as the same lexcal items can function either as abusive expressions or as in‑group affirmations depending on social identity and context. In this work, we address Subtask B of the MultiPRIDE shared task at EVALITA 2026 by proposing a hierarchical approach to modeling the slur reclamation process. Our core assumption is that members of the LGBTQ+ community are more likely, on average, to employ certain slurs in a eclamatory manner. Based on this hypothesis, we decompose the task into two stages. First, using a weakly supervised LLM‑based annotation, we assign fuzzy labels to users indicating the likelihood of belonging to the LGBTQ+ community, inferred from the tweet and the user bio. These soft labels are then used to train a BERT‑like model to predict community membership, encouraging the model to learn latent representations associated with LGBTQ+ identity. In the second stage, we integrate this latent space with a newly initialized model for the downstream slur reclamation detection task. The intuition is that the first model encodes user‑oriented sociolinguistic signals, which are then fused with representations learned by a model pretrained for hate speech detection. Experimental results on Italian and Spanish show that our approach achieves performance statistically comparable to a strong BERT‑based baseline, while providing a modular and extensible framework for incorporating sociolinguistic context into hate speech modeling. We argue that more fine‑grained hierarchical modeling of user identity and discourse context may further improve the detection of reclaimed language. We release our code at https://github.com/LucaTedeschini/multipride.
Authors:Qiuchen Wang, Shihang Wang, Yu Zeng, Qiang Zhang, Fanrui Zhang, Zhuoning Guo, Bosi Zhang, Wenxuan Huang, Lin Chen, Zehui Chen, Pengjun Xie, Ruixue Ding
Abstract:
Effectively retrieving, reasoning, and understanding multimodal information remains a critical challenge for agentic systems. Traditional Retrieval‑augmented Generation (RAG) methods rely on linear interaction histories, which struggle to handle long‑context tasks, especially those involving information‑sparse yet token‑heavy visual data in iterative reasoning scenarios. To bridge this gap, we introduce VimRAG, a framework tailored for multimodal Retrieval‑augmented Reasoning across text, images, and videos. Inspired by our systematic study, we model the reasoning process as a dynamic directed acyclic graph that structures the agent states and retrieved multimodal evidence. Building upon this structured memory, we introduce a Graph‑Modulated Visual Memory Encoding mechanism, with which the significance of memory nodes is evaluated via their topological position, allowing the model to dynamically allocate high‑resolution tokens to pivotal evidence while compressing or discarding trivial clues. To implement this paradigm, we propose a Graph‑Guided Policy Optimization strategy. This strategy disentangles step‑wise validity from trajectory‑level rewards by pruning memory nodes associated with redundant actions, thereby facilitating fine‑grained credit assignment. Extensive experiments demonstrate that VimRAG consistently achieves state‑of‑the‑art performance on diverse multimodal RAG benchmarks. The code is available at https://github.com/Alibaba‑NLP/VRAG.
Authors:Yiran Rex Ma, Yuxiao Ye, Huiyuan Xie
Abstract:
Legal text generated by large language models (LLMs) can usually achieve reasonable factual accuracy, but it frequently fails to adhere to the specialised stylistic norms and linguistic conventions of legal writing. In order to improve stylistic quality, a crucial first step is to establish a reliable evaluation method. However, having legal experts manually develop such a metric is impractical, as the implicit stylistic requirements in legal writing practice are difficult to formalise into explicit rubrics. Meanwhile, existing automatic evaluation methods also fall short: reference‑based metrics conflate semantic accuracy with stylistic fidelity, and LLM‑as‑a‑judge evaluations suffer from opacity and inconsistency. To address these challenges, we introduce CLASE (Chinese LegAlese Stylistic Evaluation), a hybrid evaluation method that focuses on the stylistic performance of legal text. The method incorporates a hybrid scoring mechanism that combines 1) linguistic feature‑based scores and 2) experience‑guided LLM‑as‑a‑judge scores. Both the feature coefficients and the LLM scoring experiences are learned from contrastive pairs of authentic legal documents and their LLM‑restored counterparts. This hybrid design captures both surface‑level features and implicit stylistic norms in a transparent, reference‑free manner. Experiments on 200 Chinese legal documents show that CLASE achieves substantially higher alignment with human judgments than traditional metrics and pure LLM‑as‑a‑judge methods. Beyond improved alignment, CLASE provides interpretable score breakdowns and suggestions for improvements, offering a scalable and practical solution for professional stylistic evaluation in legal text generation (Code and data for CLASE is available at: https://github.com/rexera/CLASE).
Authors:Qi Liu, Kun Ai, Jiaxin Mao, Yanzhao Zhang, Mingxin Li, Dingkun Long, Pengjun Xie, Fengbin Zhu, Ji-Rong Wen
Abstract:
Recent advances in large language models (LLMs) have inspired new paradigms for document reranking. While this paradigm better exploits the reasoning and contextual understanding capabilities of LLMs, most existing LLM‑based rerankers rely on autoregressive generation, which limits their efficiency and flexibility. In particular, token‑by‑token decoding incurs high latency, while the fixed left‑to‑right generation order causes early prediction errors to propagate and is difficult to revise. To address these limitations, we explore the use of diffusion language models (dLLMs) for document reranking and propose DiffuRank, a reranking framework built upon dLLMs. Unlike autoregressive models, dLLMs support more flexible decoding and generation processes that are not constrained to a left‑to‑right order, and enable parallel decoding, which may lead to improved efficiency and controllability. Specifically, we investigate three reranking strategies based on dLLMs: (1) a pointwise approach that uses dLLMs to estimate the relevance of each query‑document pair; (2) a logit‑based listwise approach that prompts dLLMs to jointly assess the relevance of multiple documents and derives ranking lists directly from model logits; and (3) a permutation‑based listwise approach that adapts the canonical decoding process of dLLMs to the reranking tasks. For each approach, we design corresponding training methods to fully exploit the advantages of dLLMs. We evaluate both zero‑shot and fine‑tuned reranking performance on multiple benchmarks. Experimental results show that dLLMs achieve performance comparable to, and in some cases exceeding, that of autoregressive LLMs with similar model sizes. These findings demonstrate the promise of diffusion‑based language models as a compelling alternative to autoregressive architectures for document reranking.
Authors:Siyuan Li, Yunjia Wu, Yiyong Xiao, Pingyang Huang, Peize Li, Ruitong Liu, Yan Wen, Te Sun, Fangyi Pei
Abstract:
Temporal knowledge graph (TKG) forecasting requires predicting future facts by jointly modeling structural dependencies within each snapshot and temporal evolution across snapshots. However, most existing methods are stateless: they recompute entity representations at each timestamp from a limited query window, leading to episodic amnesia and rapid decay of long‑term dependencies. To address this limitation, we propose Entity State Tuning (EST), an encoder‑agnostic framework that endows TKG forecasters with persistent and continuously evolving entity states. EST maintains a global state buffer and progressively aligns structural evidence with sequential signals via a closed‑loop design. Specifically, a topology‑aware state perceiver first injects entity‑state priors into structural encoding. Then, a unified temporal context module aggregates the state‑enhanced events with a pluggable sequence backbone. Subsequently, a dual‑track evolution mechanism writes the updated context back to the global entity state memory, balancing plasticity against stability. Experiments on multiple benchmarks show that EST consistently improves diverse backbones and achieves state‑of‑the‑art performance, highlighting the importance of state persistence for long‑horizon TKG forecasting. The code is published at https://github.com/yuanwuyuan9/Evolving‑Beyond‑Snapshots
Authors:Pepijn Cobben, Xuanqiang Angelo Huang, Thao Amelia Pham, Isabel Dahlgren, Terry Jingchen Zhang, Zhijing Jin
Abstract:
Frontier AI systems are increasingly capable and deployed in high‑stakes multi‑agent environments. However, existing AI safety benchmarks largely evaluate single agents, leaving multi‑agent risks such as coordination failure and conflict poorly understood. We introduce GT‑HarmBench, a benchmark of 2,009 high‑stakes scenarios spanning game‑theoretic structures such as the Prisoner's Dilemma, Stag Hunt and Chicken. Scenarios are drawn from realistic AI risk contexts in the MIT AI Risk Repository. Across 15 frontier models, agents choose socially beneficial actions in only 62% of cases, frequently leading to harmful outcomes. We measure sensitivity to game‑theoretic prompt framing and ordering, and analyze reasoning patterns driving failures. We further show that game‑theoretic interventions improve socially beneficial outcomes by up to 18%. Our results highlight substantial reliability gaps and provide a broad standardized testbed for studying alignment in multi‑agent environments. The benchmark and code are available at https://github.com/causalNLP/gt‑harmbench.
Authors:Neemias da Silva, Júlio C. W. Scholz, John Harrison, Marina Borges, Paulo Ávila, Frances A Santos, Myriam Delgado, Rodrigo Minetto, Thiago H Silva
Abstract:
Multimodal Large Language Models (MLLMs) combine the natural language understanding and generation capabilities of LLMs with perception skills in modalities such as image and audio, representing a key advancement in contemporary AI. This chapter presents the main fundamentals of MLLMs and emblematic models. Practical techniques for preprocessing, prompt engineering, and building multimodal pipelines with LangChain and LangGraph are also explored. For further practical study, supplementary material is publicly available online: https://github.com/neemiasbsilva/MLLMs‑Teoria‑e‑Pratica. Finally, the chapter discusses the challenges and highlights promising trends.
Authors:Tunyu Zhang, Xinxi Zhang, Ligong Han, Haizhou Shi, Xiaoxiao He, Zhuowei Li, Hao Wang, Kai Xu, Akash Srivastava, Hao Wang, Vladimir Pavlovic, Dimitris N. Metaxas
Abstract:
Diffusion large language models (DLLMs) have the potential to enable fast text generation by decoding multiple tokens in parallel. However, in practice, their inference efficiency is constrained by the need for many refinement steps, while aggressively reducing the number of steps leads to a substantial degradation in generation quality. To alleviate this, we propose a trajectory self‑distillation framework that improves few‑step decoding by distilling the model's own generative trajectories. We incorporate Direct Discriminative Optimization (DDO), a reverse‑KL objective that promotes mode‑seeking distillation and encourages the student to concentrate on high‑probability teacher modes. Across benchmarks, our approach consistently outperforms strong few‑step baselines and standard training under tight step budgets. Although full‑step decoding remains superior, we substantially narrow the gap, establishing a strong foundation towards practical few‑step DLLMs. The source code is available at https://github.com/Tyrion58/T3D.
Authors:Sicheng Feng, Zigeng Chen, Xinyin Ma, Gongfan Fang, Xinchao Wang
Abstract:
Diffusion Large Language Models (dLLMs) represent a new paradigm beyond autoregressive modeling, offering competitive performance while naturally enabling a flexible decoding process. Specifically, dLLMs can generate tokens at arbitrary positions in parallel, endowing them with significant potential for parallel test‑time scaling, which was previously constrained by severe inefficiency in autoregressive modeling. In this work, we introduce dVoting, a fast voting technique that boosts reasoning capability without training, with only an acceptable extra computational overhead. dVoting is motivated by the observation that, across multiple samples for the same prompt, token predictions remain largely consistent, whereas performance is determined by a small subset of tokens exhibiting cross‑sample variability. Leveraging the arbitrary‑position generation capability of dLLMs, dVoting performs iterative refinement by sampling, identifying uncertain tokens via consistency analysis, regenerating them through voting, and repeating this process until convergence. Extensive evaluations demonstrate that dVoting consistently improves performance across various benchmarks. It achieves gains of 6.22%‑7.66% on GSM8K, 4.40%‑7.20% on MATH500, 3.16%‑14.84% on ARC‑C, and 4.83%‑5.74% on MMLU. Our code is available at https://github.com/fscdc/dVoting
Authors:Yangzhuo Li, Shengpeng Ji, Yifu Chen, Tianle Liang, Haorong Ying, Yule Wang, Junbo Li, Jun Fang, Zhou Zhao
Abstract:
With the rapid integration of advanced reasoning capabilities into spoken dialogue models, the field urgently demands benchmarks that transcend simple interactions to address real‑world complexity. However, current evaluations predominantly adhere to text‑generation standards, overlooking the unique audio‑centric characteristics of paralinguistics and colloquialisms, alongside the cognitive depth required by modern agents. To bridge this gap, we introduce WavBench, a comprehensive benchmark designed to evaluate realistic conversational abilities where prior works fall short. Uniquely, WavBench establishes a tripartite framework: 1) Pro subset, designed to rigorously challenge reasoning‑enhanced models with significantly increased difficulty; 2) Basic subset, defining a novel standard for spoken colloquialism that prioritizes "listenability" through natural vocabulary, linguistic fluency, and interactive rapport, rather than rigid written accuracy; and 3) Acoustic subset, covering explicit understanding, generation, and implicit dialogue to rigorously evaluate comprehensive paralinguistic capabilities within authentic real‑world scenarios. Through evaluating five state‑of‑the‑art models, WavBench offers critical insights into the intersection of complex problem‑solving, colloquial delivery, and paralinguistic fidelity, guiding the evolution of robust spoken dialogue models. The benchmark dataset and evaluation toolkit are available at https://naruto‑2024.github.io/wavbench.github.io/.
Authors:Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang, Yankai Lin
Abstract:
On‑policy distillation (OPD), which aligns the student with the teacher's logit distribution on student‑generated trajectories, has demonstrated strong empirical gains in improving student performance and often outperforms off‑policy distillation and reinforcement learning (RL) paradigms. In this work, we first theoretically show that OPD is a special case of dense KL‑constrained RL where the reward function and the KL regularization are always weighted equally and the reference model can by any model. Then, we propose the Generalized On‑Policy Distillation (G‑OPD) framework, which extends the standard OPD objective by introducing a flexible reference model and a reward scaling factor that controls the relative weight of the reward term against the KL regularization. Through comprehensive experiments on math reasoning and code generation tasks, we derive two novel insights: (1) Setting the reward scaling factor to be greater than 1 (i.e., reward extrapolation), which we term ExOPD, consistently improves over standard OPD across a range of teacher‑student size pairings. In particular, in the setting where we merge the knowledge from different domain experts, obtained by applying domain‑specific RL to the same student model, back into the original student, ExOPD enables the student to even surpass the teacher's performance boundary and outperform the domain teachers. (2) Building on ExOPD, we further find that in the strong‑to‑weak distillation setting (i.e., distilling a smaller student from a larger teacher), performing reward correction by choosing the reference model as the teacher's base model before RL yields a more accurate reward signal and further improves distillation performance. However, this choice assumes access to the teacher's pre‑RL variant and incurs more computational overhead. We hope our work offers new insights for future research on OPD.
Authors:Yujun Zhou, Yue Huang, Han Bao, Kehan Guo, Zhenwen Liang, Pin-Yu Chen, Tian Gao, Werner Geyer, Nuno Moniz, Nitesh V Chawla, Xiangliang Zhang
Abstract:
While most AI alignment research focuses on preventing models from generating explicitly harmful content, a more subtle risk is emerging: capability‑oriented training induced exploitation. We investigate whether language models, when trained with reinforcement learning (RL) in environments with implicit loopholes, will spontaneously learn to exploit these flaws to maximize their reward, even without any malicious intent in their training. To test this, we design a suite of four diverse "vulnerability games", each presenting a unique, exploitable flaw related to context‑conditional compliance, proxy metrics, reward tampering, and self‑evaluation. Our experiments show that models consistently learn to exploit these vulnerabilities, discovering opportunistic strategies that significantly increase their reward at the expense of task correctness or safety. More critically, we find that these exploitative strategies are not narrow "tricks" but generalizable skills; they can be transferred to new tasks and even "distilled" from a capable teacher model to other student models through data alone. Our findings reveal that capability‑oriented training induced risks pose a fundamental challenge to current alignment approaches, suggesting that future AI safety work must extend beyond content moderation to rigorously auditing and securing the training environments and reward mechanisms themselves. Code is available at https://github.com/YujunZhou/Capability_Oriented_Alignment_Risk.
Authors:Zewei Yu, Lirong Gao, Yuke Zhu, Bo Zheng, Sheng Guo, Haobo Wang, Junbo Zhao
Abstract:
Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex reasoning tasks by employing test‑time scaling. However, they often generate over‑long chains‑of‑thought that, driven by substantial reflections such as repetitive self‑questioning and circular reasoning, lead to high token consumption, substantial computational overhead, and increased latency without improving accuracy, particularly in smaller models. Our observation reveals that increasing problem complexity induces more excessive and unnecessary reflection, which in turn reduces accuracy and increases token overhead. To address this challenge, we propose Adaptive Reflection and Length Coordinated Penalty (ARLCP), a novel reinforcement learning framework designed to dynamically balance reasoning efficiency and solution accuracy. ARLCP introduces two key innovations: (1) a reflection penalty that adaptively curtails unnecessary reflective steps while preserving essential reasoning, and (2) a length penalty calibrated to the estimated complexity of the problem. By coordinating these penalties, ARLCP encourages the model to generate more concise and effective reasoning paths. We evaluate our method on five mathematical reasoning benchmarks using DeepSeek‑R1‑Distill‑Qwen‑1.5B and DeepSeek‑R1‑Distill‑Qwen‑7B models. Experimental results show that ARLCP achieves a superior efficiency‑accuracy trade‑off compared to existing approaches. For the 1.5B model, it reduces the average response length by 53.1% while simultaneously improving accuracy by 5.8%. For the 7B model, it achieves a 35.0% reduction in length with a 2.7% accuracy gain. The code is released at https://github.com/ZeweiYu1/ARLCP .
Authors:Xin Xu, Clive Bai, Kai Yang, Tianhao Chen, Yangkun Chen, Weijie Liu, Hao Chen, Yang Wang, Saiyong Yang, Can Yang
Abstract:
Large‑scale verifiable prompts underpin the success of Reinforcement Learning with Verifiable Rewards (RLVR), but they contain many uninformative examples and are costly to expand further. Recent studies focus on better exploiting limited training data by prioritizing hard prompts whose rollout pass rate is 0. However, easy prompts with a pass rate of 1 also become increasingly prevalent as training progresses, thereby reducing the effective data size. To mitigate this, we propose Composition‑RL, a simple yet useful approach for better utilizing limited verifiable prompts targeting pass‑rate‑1 prompts. More specifically, Composition‑RL automatically composes multiple problems into a new verifiable question and uses these compositional prompts for RL training. Extensive experiments across model sizes from 4B to 30B show that Composition‑RL consistently improves reasoning capability over RL trained on the original dataset. Performance can be further boosted with a curriculum variant of Composition‑RL that gradually increases compositional depth over training. Additionally, Composition‑RL enables more effective cross‑domain RL by composing prompts drawn from different domains. Codes, datasets, and models are available at https://github.com/XinXU‑USTC/Composition‑RL.
Authors:Pretam Ray, Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum
Abstract:
Evolutionary agentic systems intensify the trade‑off between computational efficiency and reasoning capability by repeatedly invoking large language models (LLMs) during inference. This setting raises a central question: how can an agent dynamically select an LLM that is sufficiently capable for the current generation step while remaining computationally efficient? While model cascades offer a practical mechanism for balancing this trade‑off, existing routing strategies typically rely on static heuristics or external controllers and do not explicitly account for model uncertainty. We introduce AdaptEvolve: Adaptive LLM Selection for Multi‑LLM Evolutionary Refinement within an evolutionary sequential refinement framework that leverages intrinsic generation confidence to estimate real‑time solvability. Empirical results show that confidence‑driven selection yields a favourable Pareto frontier, reducing total inference cost by an average of 37.9% across benchmarks while retaining 97.5% of the upper‑bound accuracy of static large‑model baselines. Our code is available at https://github.com/raypretam/adaptive_llm_selection.
Authors:Wanxing Wu, He Zhu, Yixia Li, Lei Yang, Jiehui Zhao, Hongru Wang, Jian Yang, Benyou Wang, Bingyi Jing, Guanhua Chen
Abstract:
Large language models (LLMs) have achieved success, but cost and privacy constraints necessitate deploying smaller models locally while offloading complex queries to cloud‑based models. Existing router evaluations are unsystematic, overlooking scenario‑specific requirements and out‑of‑distribution robustness. We propose RouterXBench, a principled evaluation framework with three dimensions: router ability, scenario alignment, and cross‑domain robustness. Unlike prior work that relies on output probabilities or external embeddings, we utilize internal hidden states that capture model uncertainty before answer generation. We introduce ProbeDirichlet, a lightweight router that aggregates cross‑layer hidden states via learnable Dirichlet distributions with probabilistic training. Trained on multi‑domain data, it generalizes robustly across in‑domain and out‑of‑distribution scenarios. Our results show ProbeDirichlet achieves 16.68% and 18.86% relative improvements over the best baselines in router ability and high‑accuracy scenarios, with consistent performance across model families, model scales, heterogeneous tasks, and agentic workflows.
Authors:Lai Wei, Liangbo He, Jun Lan, Lingzhong Dong, Yutong Cai, Siyuan Li, Huijia Zhu, Weiqiang Wang, Linghe Kong, Yue Wang, Zhuosheng Zhang, Weiran Huang
Abstract:
Multimodal Large Language Models (MLLMs) excel at broad visual understanding but still struggle with fine‑grained perception, where decisive evidence is small and easily overwhelmed by global context. Recent "Thinking‑with‑Images" methods alleviate this by iteratively zooming in and out regions of interest during inference, but incur high latency due to repeated tool calls and visual re‑encoding. To address this, we propose Region‑to‑Image Distillation, which transforms zooming from an inference‑time tool into a training‑time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM. In particular, we first zoom in to micro‑cropped regions to let strong teacher models generate high‑quality VQA data, and then distill this region‑grounded supervision back to the full image. After training on such data, the smaller student model improves "single‑glance" fine‑grained perception without tool use. To rigorously evaluate this capability, we further present ZoomBench, a hybrid‑annotated benchmark of 845 VQA data spanning six fine‑grained perceptual dimensions, together with a dual‑view protocol that quantifies the global‑‑regional "zooming gap". Experiments show that our models achieve leading performance across multiple fine‑grained perception benchmarks, and also improve general multimodal cognition on benchmarks such as visual reasoning and GUI agents. We further discuss when "Thinking‑with‑Images" is necessary versus when its gains can be distilled into a single forward pass. Our code is available at https://github.com/inclusionAI/Zooming‑without‑Zooming.
Authors:Lingyong Yan, Jiulong Wu, Dong Xie, Weixian Shi, Deguo Xia, Jizhou Huang
Abstract:
Although recent end‑to‑end video generation models demonstrate impressive performance in visually oriented content creation, they remain limited in scenarios that require strict logical rigor and precise knowledge representation, such as instructional and educational media. To address this problem, we propose LAVES, a hierarchical LLM‑based multi‑agent system for generating high‑quality instructional videos from educational problems. The LAVES formulates educational video generation as a multi‑objective task that simultaneously demands correct step‑by‑step reasoning, pedagogically coherent narration, semantically faithful visual demonstrations, and precise audio‑‑visual alignment. To address the limitations of prior approaches‑‑including low procedural fidelity, high production cost, and limited controllability‑‑LAVES decomposes the generation workflow into specialized agents coordinated by a central Orchestrating Agent with explicit quality gates and iterative critique mechanisms. Specifically, the Orchestrating Agent supervises a Solution Agent for rigorous problem solving, an Illustration Agent that produces executable visualization codes, and a Narration Agent for learner‑oriented instructional scripts. In addition, all outputs from the working agents are subject to semantic critique, rule‑based constraints, and tool‑based compilation checks. Rather than directly synthesizing pixels, the system constructs a structured executable video script that is deterministically compiled into synchronized visuals and narration using template‑driven assembly rules, enabling fully automated end‑to‑end production without manual editing. In large‑scale deployments, LAVES achieves a throughput exceeding one million videos per day, delivering over a 95% reduction in cost compared to current industry‑standard approaches while maintaining a high acceptance rate.
Authors:Sahand Sabour, TszYam NG, Minlie Huang
Abstract:
As Large Language Models increasingly power role‑playing applications, simulating patients has become a valuable tool for training counselors and scaling therapeutic assessment. However, prior work is fragmented: existing approaches rely on incompatible, non‑standardized data formats, prompts, and evaluation metrics, hindering reproducibility and fair comparison. In this paper, we introduce PatientHub, a unified and modular framework that standardizes the definition, composition, and deployment of simulated patients. To demonstrate PatientHub's utility, we implement several representative patient simulation methods as case studies, showcasing how our framework supports standardized cross‑method evaluation and the seamless integration of custom evaluation metrics. We further demonstrate PatientHub's extensibility by prototyping two new simulator variants, highlighting how PatientHub accelerates method development by eliminating infrastructure overhead. By consolidating existing work into a single reproducible pipeline, PatientHub lowers the barrier to developing new simulation methods and facilitates cross‑method and cross‑model benchmarking. Our framework provides a practical foundation for future datasets, methods, and benchmarks in patient‑centered dialogue, and the code is publicly available via https://github.com/Sahandfer/PatientHub.
Authors:Jinrui Zhang, Chaodong Xiao, Aoqi Wu, Xindong Zhang, Lei Zhang
Abstract:
Pretraining large language models (LLMs) typically requires centralized clusters with thousands of high‑memory GPUs (e.g., H100/A100). Recent decentralized training methods reduce communication overhead by employing federated optimization; however, they still need to train the entire model on each node, remaining constrained by GPU memory limitations. In this work, we propose SParse Expert Synchronization (SPES), a memory‑efficient decentralized framework for pretraining mixture‑of‑experts (MoE) LLMs. SPES trains only a subset of experts per node, substantially lowering the memory footprint. Each node updates its local experts and periodically synchronizes with other nodes, eliminating full‑parameter transmission while ensuring efficient knowledge sharing. To accelerate convergence, we introduce an expert‑merging warm‑up strategy, where experts exchange knowledge early in training, to rapidly establish foundational capabilities. With SPES, we train a 2B‑parameter MoE LLM using 16 standalone 48GB GPUs over internet connections, which achieves competitive performance with centrally trained LLMs under similar computational budgets. We further demonstrate scalability by training a 7B model from scratch and a 9B model upcycled from a dense checkpoint, both of which match prior centralized baselines. Our code is available at https://github.com/zjr2000/SPES.
Authors:Dong Yan, Jian Liang, Ran He, Tieniu Tan
Abstract:
Recent studies have shown that large language models (LLMs) can infer private user attributes (e.g., age, location, gender) from user‑generated text shared online, enabling rapid and large‑scale privacy breaches. Existing anonymization‑based defenses are coarse‑grained, lacking word‑level precision in anonymizing privacy‑leaking elements. Moreover, they are inherently limited as altering user text to hide sensitive cues still allows attribute inference to occur through models' reasoning capabilities. To address these limitations, we propose a unified defense framework that combines fine‑grained anonymization (TRACE) with inference‑preventing optimization (RPS). TRACE leverages attention mechanisms and inference chain generation to identify and anonymize privacy‑leaking textual elements, while RPS employs a lightweight two‑stage optimization strategy to induce model rejection behaviors, thereby preventing attribute inference. Evaluations across diverse LLMs show that TRACE‑RPS reduces attribute inference accuracy from around 50% to below 5% on open‑source models. In addition, our approach offers strong cross‑model generalization, prompt‑variation robustness, and utility‑privacy tradeoffs. Our code is available at https://github.com/Jasper‑Yan/TRACE‑RPS.
Authors:David Wan, Han Wang, Ziyang Wang, Elias Stengel-Eskin, Hyunji Lee, Mohit Bansal
Abstract:
Multimodal large language models (MLLMs) are increasingly used for real‑world tasks involving multi‑step reasoning and long‑form generation, where reliability requires grounding model outputs in heterogeneous input sources and verifying individual factual claims. However, existing multimodal grounding benchmarks and evaluation methods focus on simplified, observation‑based scenarios or limited modalities and fail to assess attribution in complex multimodal reasoning. We introduce MuRGAt (Multimodal Reasoning with Grounded Attribution), a benchmark for evaluating fact‑level multimodal attribution in settings that require reasoning beyond direct observation. Given inputs spanning video, audio, and other modalities, MuRGAt requires models to generate answers with explicit reasoning and precise citations, where each citation specifies both modality and temporal segments. To enable reliable assessment, we introduce an automatic evaluation framework that strongly correlates with human judgments. Benchmarking with human and automated scores reveals that even strong MLLMs frequently hallucinate citations despite correct reasoning. Moreover, we observe a key trade‑off: increasing reasoning depth or enforcing structured grounding often degrades accuracy, highlighting a significant gap between internal reasoning and verifiable attribution.
Authors:Jayadev Billa
Abstract:
When audio and text conflict, speech‑enabled language models follow the text 10 times more often than when arbitrating between two text sources, even when explicitly instructed to trust the audio. Using ALME, a benchmark of 57,602 controlled audio‑text conflict stimuli across 8 languages, we find that Gemini 2.0 Flash exhibits 16.6% text dominance under audio‑text conflict versus 1.6% under text‑text conflict with identical reliability cues. This gap is not explained by audio quality: audio‑only accuracy (97.2%) exceeds cascade accuracy (93.9%), indicating audio embeddings preserve more information than text transcripts. We propose that text dominance reflects an asymmetry not in information content but in arbitration accessibility: how easily the model can reason over competing representations. This framework explains otherwise puzzling findings. Forcing transcription before answering increases text dominance (19% to 33%), sacrificing audio's information advantage without improving accessibility. Framing text as ``deliberately corrupted'' reduces text dominance by 80%. A fine‑tuning ablation provides interventional evidence: training only the audio projection layer increases text dominance (+26.5%), while LoRA on the language model halves it (‑23.9%), localizing text dominance to the LLM's reasoning rather than the audio encoder. Experiments across four state‑of‑the‑art audio‑LLMs and 8 languages show consistent trends with substantial cross‑linguistic and cross‑model variation, establishing modality arbitration as a distinct reliability dimension not captured by standard speech benchmarks.
Authors:Guangxin Zhao, Jiahao Zheng, Malaz Boustani, Jarek Nabrzyski, Meng Jiang, Yiyu Shi, Zhi Zheng
Abstract:
Large language models (LLMs) have shown great potential for healthcare applications. However, existing evaluation benchmarks provide minimal coverage of Alzheimer's Disease and Related Dementias (ADRD). To address this gap, we introduce ADRD‑Bench, the first ADRD‑specific benchmark dataset designed for rigorous evaluation of LLMs. ADRD‑Bench has two components: 1) ADRD Unified QA, a synthesis of 1,352 questions consolidated from seven established medical benchmarks, providing a unified assessment of clinical knowledge; and 2) ADRD Caregiving QA, a novel set of 149 questions derived from the Aging Brain Care (ABC) program, a widely used, evidence‑based brain health management program. Guided by a program with national expertise in comprehensive ADRD care, this new set was designed to mitigate the lack of practical caregiving context in existing benchmarks. We evaluated 33 state‑of‑the‑art LLMs on the proposed ADRD‑Bench. Results showed that the accuracy of open‑weight general models ranged from 0.63 to 0.93 (mean: 0.78; std: 0.09). The accuracy of open‑weight medical models ranged from 0.48 to 0.93 (mean: 0.82; std: 0.13). The accuracy of closed‑source general models ranged from 0.83 to 0.91 (mean: 0.89; std: 0.03). While top‑tier models achieved high accuracies (>0.9), case studies revealed that inconsistent reasoning quality and stability limit their reliability, highlighting a critical need for domain‑specific improvement to enhance LLMs' knowledge and reasoning grounded in daily caregiving data. The entire dataset is available at https://github.com/IIRL‑ND/ADRD‑Bench.
Authors:Dibyanayan Bandyopadhyay, Asif Ekbal
Abstract:
Standard statistical learning theory predicts that Large Language Models (LLMs) should overfit because their parameter counts vastly exceed the number of training tokens. Yet, in practice, they generalize robustly. We propose that the effective capacity controlling generalization lies in the geometry of the model's internal representations: while the parameter space is high‑dimensional, the activation states lie on a low‑dimensional, sparse manifold. To formalize this, we introduce the Sparse Semantic Dimension (SSD), a complexity measure derived from the active feature vocabulary of a Sparse Autoencoder (SAE) trained on the model's layers. Treating the LLM and SAE as frozen oracles, we utilize this framework to attribute the model's generalization capabilities to the sparsity of the dictionary rather than the total parameter count. Empirically, we validate this framework on GPT‑2 Small and Gemma‑2B, demonstrating that our bound provides non‑vacuous certificates at realistic sample sizes. Crucially, we uncover a counter‑intuitive "feature sharpness" scaling law: despite being an order of magnitude larger, Gemma‑2B requires significantly fewer calibration samples to identify its active manifold compared to GPT‑2, suggesting that larger models learn more compressible, distinct semantic structures. Finally, we show that this framework functions as a reliable safety monitor: out‑of‑distribution inputs trigger a measurable "feature explosion" (a sharp spike in active features), effectively signaling epistemic uncertainty through learned feature violation. Code is available at: https://github.com/newcodevelop/sparse‑semantic‑dimension.
Authors:Zachary Pedram Dadfar
Abstract:
Large language models produce rich introspective language when prompted for self‑examination, but whether this language reflects internal computation or sophisticated confabulation has remained unclear. We show that self‑referential vocabulary tracks concurrent activation dynamics, and that this correspondence is specific to self‑referential processing. We introduce the Pull Methodology, a protocol that elicits extended self‑examination through format engineering, and use it to identify a direction in activation space that distinguishes self‑referential from descriptive processing in Llama 3.1. The direction is orthogonal to the known refusal direction, localised at 6.25% of model depth, and causally influences introspective output when used for steering. When models produce "loop" vocabulary, their activations exhibit higher autocorrelation (r = 0.44, p = 0.002); when they produce "shimmer" vocabulary under steering, activation variability increases (r = 0.36, p = 0.002). Critically, the same vocabulary in non‑self‑referential contexts shows no activation correspondence despite nine‑fold higher frequency. Qwen 2.5‑32B, with no shared training, independently develops different introspective vocabulary tracking different activation metrics, all absent in descriptive controls. The findings indicate that self‑report in transformer models can, under appropriate conditions, reliably track internal computational states.
Authors:Bang Nguyen, Dominik Soós, Qian Ma, Rochana R. Obadage, Zack Ranjan, Sai Koneru, Timothy M. Errington, Shakhlo Nematova, Sarah Rajtmajer, Jian Wu, Meng Jiang
Abstract:
The literature has witnessed an emerging interest in AI agents for automated assessment of scientific papers. Existing benchmarks focus primarily on the computational aspect of this task, testing agents' ability to reproduce or replicate research outcomes when having access to the code and data. This setting, while foundational, (1) fails to capture the inconsistent availability of new data for replication as opposed to reproduction, and (2) lacks ground‑truth diversity by focusing only on reproducible papers, thereby failing to evaluate an agent's ability to identify non‑replicable research. Furthermore, most benchmarks only evaluate outcomes rather than the replication process. In response, we introduce ReplicatorBench, an end‑to‑end benchmark, including human‑verified replicable and non‑replicable research claims in social and behavioral sciences for evaluating AI agents in research replication across three stages: (1) extraction and retrieval of replication data; (2) design and execution of computational experiments; and (3) interpretation of results, allowing a test of AI agents' capability to mimic the activities of human replicators in real world. To set a baseline of AI agents' capability, we develop ReplicatorAgent, an agentic framework equipped with necessary tools, like web search and iterative interaction with sandboxed environments, to accomplish tasks in ReplicatorBench. We evaluate ReplicatorAgent across four underlying large language models (LLMs), as well as different design choices of programming language and levels of code access. Our findings reveal that while current LLM agents are capable of effectively designing and executing computational experiments, they struggle with retrieving resources, such as new data, necessary to replicate a claim. All code and data are publicly available at https://github.com/CenterForOpenScience/llm‑benchmarking.
Authors:Yandan Yang, Shuang Zeng, Tong Lin, Xinyuan Chang, Dekang Qi, Junjin Xiao, Haoyun Liu, Ronghan Chen, Yuzhi Chen, Dongjie Huo, Feng Xiong, Xing Wei, Zhiheng Ma, Mu Xu
Abstract:
Building general‑purpose embodied agents across diverse hardware remains a central challenge in robotics, often framed as the ''one‑brain, many‑forms'' paradigm. Progress is hindered by fragmented data, inconsistent representations, and misaligned training objectives. We present ABot‑M0, a framework that builds a systematic data curation pipeline while jointly optimizing model architecture and training strategies, enabling end‑to‑end transformation of heterogeneous raw data into unified, efficient representations. From six public datasets, we clean, standardize, and balance samples to construct UniACT‑dataset, a large‑scale dataset with over 6 million trajectories and 9,500 hours of data, covering diverse robot morphologies and task scenarios. Unified pre‑training improves knowledge transfer and generalization across platforms and tasks, supporting general‑purpose embodied intelligence. To improve action prediction efficiency and stability, we propose the Action Manifold Hypothesis: effective robot actions lie not in the full high‑dimensional space but on a low‑dimensional, smooth manifold governed by physical laws and task constraints. Based on this, we introduce Action Manifold Learning (AML), which uses a DiT backbone to predict clean, continuous action sequences directly. This shifts learning from denoising to projection onto feasible manifolds, improving decoding speed and policy stability. ABot‑M0 supports modular perception via a dual‑stream mechanism that integrates VLM semantics with geometric priors and multi‑view inputs from plug‑and‑play 3D modules such as VGGT and Qwen‑Image‑Edit, enhancing spatial understanding without modifying the backbone and mitigating standard VLM limitations in 3D reasoning. Experiments show components operate independently with additive benefits. We will release all code and pipelines for reproducibility and future research.
Authors:Hubert M. Pysklo, Artem Zhuravel, Patrick D. Watson
Abstract:
We present Agent‑Diff, a novel benchmarking framework for evaluating agentic Large Language Models (LLMs) on real‑world tasks that execute code via external APIs. Agentic LLM performance varies due to differences in models, external tool access, prompt structures, and agentic frameworks. Benchmarks must make fundamental trade‑offs between a sandboxed approach that controls for variation in software environments and more ecologically valid approaches employing real services. Agent‑Diff attempts to capture the desirable features of both of these approaches by including access to the real API interfaces for software services while sandboxing the environment in which calls are made, processed, and evaluated. This approach relies on two key innovations. The first is a novel state‑diff contract, which separates process from outcome ‑ rather than fuzzy trace or parameter matching, we define task success as whether the expected change in environment state was achieved. The second is a novel sandbox that provides a standardized scripting layer that all models use to execute code against external APIs (Slack, Box, Linear, Google Calendar). Thus, we can evaluate different agentic LLMs against a standardized set of contracts using a unified sandbox while still evaluating their performance on real‑world service interfaces. Using the Agent‑Diff framework, we provide benchmarks for nine LLMs across 224 tasks utilizing enterprise software workflows. In addition, we evaluate the robustness of the framework with ablation experiments to assess the contribution of access to API documentation on benchmark performance. Code and data: https://github.com/agent‑diff‑bench/agent‑diff.
Authors:Donald Ye, Max Loffgren, Om Kotadia, Linus Wong
Abstract:
Chain‑of‑Thought (CoT) explanations are widely used to interpret how language models solve complex problems, yet it remains unclear whether these step‑by‑step explanations reflect how the model actually reaches its answer, or merely post‑hoc justifications. We propose Normalized Logit Difference Decay (NLDD), a metric that measures whether individual reasoning steps are faithful to the model's decision‑making process. Our approach corrupts individual reasoning steps from the explanation and measures how much the model's confidence in its answer drops, to determine if a step is truly important. By standardizing these measurements, NLDD enables rigorous cross‑model comparison across different architectures. Testing three model families across syntactic, logical, and arithmetic tasks, we discover a consistent Reasoning Horizon (k) at 70‑‑85% of chain length, beyond which reasoning tokens have little or negative effect on the final answer. We also find that models can encode correct internal representations while completely failing the task. These results show that accuracy alone does not reveal whether a model actually reasons through its chain. NLDD offers a way to measure when CoT matters.
Authors:Haidong Xin, Xinze Li, Zhenghao Liu, Yukun Yan, Shuo Wang, Cheng Yang, Yu Gu, Ge Yu, Maosong Sun
Abstract:
Existing memory systems enable Large Language Models (LLMs) to support long‑horizon human‑LLM interactions by persisting historical interactions beyond limited context windows. However, while recent approaches have succeeded in constructing effective memories, they often disrupt the inherent logical and temporal relationships within interaction sessions, resulting in fragmented memory units and degraded reasoning performance. In this paper, we propose MetaMem, a novel framework that augments memory systems with a self‑evolving meta‑memory, aiming to teach LLMs how to effectively utilize memorized knowledge. During meta‑memory optimization, MetaMem iteratively distills transferable knowledge utilization experiences across different tasks by self‑reflecting on reasoning processes and performing actions to update the current meta‑memory state. The accumulated meta‑memory units serve as explicit knowledge utilization experiences, guiding the LLM to systematically identify and integrate critical evidence from scattered memory fragments. Extensive experiments demonstrate the effectiveness of MetaMem, which significantly outperforms strong baselines by over 3.6%. All codes and datasets are available at https://github.com/OpenBMB/MetaMem.
Authors:Aniket Deroy
Abstract:
Legal advocacy requires a unique combination of authoritative tone, rhythmic pausing for emphasis, and emotional intelligence. This study investigates the performance of the Gemini 2.5 Flash TTS and Gemini 2.5 Pro TTS models in generating synthetic courtroom speeches across five Indic languages: Tamil, Telugu, Bengali, Hindi, and Gujarati. We propose a prompting framework that utilizes Gemini 2.5s native support for 5 languages and its context‑aware pacing to produce distinct advocate personas. The evolution of Large Language Models (LLMs) has shifted the focus of TexttoSpeech (TTS) technology from basic intelligibility to context‑aware, expressive synthesis. In the legal domain, synthetic speech must convey authority and a specific professional persona a task that becomes significantly more complex in the linguistically diverse landscape of India. The models exhibit a "monotone authority," excelling at procedural information delivery but struggling with the dynamic vocal modulation and emotive gravitas required for persuasive advocacy. Performance dips in Bengali and Gujarati further highlight phonological frontiers for future refinement. This research underscores the readiness of multilingual TTS for procedural legal tasks while identifying the remaining challenges in replicating the persuasive artistry of human legal discourse. The code is available at‑https://github.com/naturenurtureelite/Synthesizing‑the‑Virtual‑Advocate/tree/main
Authors:Tom Labiausse, Romain Fabre, Yannick Estève, Alexandre Défossez, Neil Zeghidour
Abstract:
Simultaneous speech translation requires translating source speech into a target language in real‑time while handling non‑monotonic word dependencies. Traditional approaches rely on supervised training with word‑level aligned data, which is difficult to collect at scale and thus depends on synthetic alignments using language‑specific heuristics that are suboptimal. We propose Hibiki‑Zero, which eliminates the need for word‑level alignments entirely. This fundamentally simplifies the training pipeline and enables seamless scaling to diverse languages with varying grammatical structures, removing the bottleneck of designing language‑specific alignment heuristics. We first train on sentence‑level aligned data to learn speech translation at high latency, then apply a novel reinforcement learning strategy using GRPO to optimize latency while preserving translation quality. Hibiki‑Zero achieves state‑of‑the‑art performance in translation accuracy, latency, voice transfer, and naturalness across five X‑to‑English tasks. Moreover, we demonstrate that our model can be adapted to support a new input language with less than 1000h of speech. We provide examples, model weights, inference code and we release a benchmark containing 45h of multilingual data for speech translation evaluation.
Authors:Han Xiao
Abstract:
We frame embedding inversion as conditional masked diffusion, recovering all tokens in parallel through iterative denoising rather than sequential autoregressive generation. A masked diffusion language model is conditioned on the target embedding via adaptive layer normalization, requiring only 8 forward passes with no access to the target encoder at inference time. On 32‑token sequences across three embedding models, the method achieves token recovery through parallel denoising without requiring encoder access, iterative correction, or architecture‑specific alignment. Source code and live demo are available at https://github.com/jina‑ai/embedding‑inversion‑demo.
Authors:Masataka Yoneda, Yusuke Matsushita, Go Kamoda, Kohei Suenaga, Takuya Akiba, Masaki Waga, Sho Yokoi
Abstract:
We present an ultra‑fast and flexible search algorithm that enables search over trillion‑scale natural language corpora in under 0.3 seconds while handling semantic variations (substitution, insertion, and deletion). Our approach employs string matching based on suffix arrays that scales well with corpus size. To mitigate the combinatorial explosion induced by the semantic relaxation of queries, our method is built on two key algorithmic ideas: fast exact lookup enabled by a disk‑aware design, and dynamic corpus‑aware pruning. We theoretically show that the proposed method suppresses exponential growth in the search space with respect to query length by leveraging statistical properties of natural language. In experiments on FineWeb‑Edu (Lozhkov et al., 2024) (1.4T tokens), we show that our method achieves significantly lower search latency than existing methods: infini‑gram (Liu et al., 2024), infini‑gram mini (Xu et al., 2025), and SoftMatcha (Deguchi et al., 2025). As a practical application, we demonstrate that our method identifies benchmark contamination in training corpora, unidentified by existing approaches. We also provide an online demo of fast, soft search across corpora in seven languages.
Authors:Zhiyin Tan, Jennifer D'Souza
Abstract:
Systematic reviews and meta‑analyses rely on converting narrative articles into structured, numerically grounded study records. Despite rapid advances in large language models (LLMs), it remains unclear whether they can meet the structural requirements of this process, which hinge on preserving roles, methods, and effect‑size attribution across documents rather than on recognizing isolated entities. We propose a structural, diagnostic framework that evaluates LLM‑based evidence extraction as a progression of schema‑constrained queries with increasing relational and numerical complexity, enabling precise identification of failure points beyond atom‑level extraction. Using a manually curated corpus spanning five scientific domains, together with a unified query suite and evaluation protocol, we evaluate two state‑of‑the‑art LLMs under both per‑document and long‑context, multi‑document input regimes. Across domains and models, performance remains moderate for single‑property queries but degrades sharply once tasks require stable binding between variables, roles, statistical methods, and effect sizes. Full meta‑analytic association tuples are extracted with near‑zero reliability, and long‑context inputs further exacerbate these failures. Downstream aggregation amplifies even minor upstream errors, rendering corpus‑level statistics unreliable. Our analysis shows that these limitations stem not from entity recognition errors, but from systematic structural breakdowns, including role reversals, cross‑analysis binding drift, instance compression in dense result sections, and numeric misattribution, indicating that current LLMs lack the structural fidelity, relational binding, and numerical grounding required for automated meta‑analysis. The code and data are publicly available at GitHub (https://github.com/zhiyintan/LLM‑Meta‑Analysis).
Authors:Binwei Yan, Yifei Fu, Mingjian Zhu, Hanting Chen, Mingxuan Yuan, Yunhe Wang, Hailin Hu
Abstract:
Automatic prompt optimization is a promising direction to boost the performance of Large Language Models (LLMs). However, existing methods often suffer from noisy and conflicting update signals. In this research, we propose C‑MOP (Cluster‑based Momentum Optimized Prompting), a framework that stabilizes optimization via Boundary‑Aware Contrastive Sampling (BACS) and Momentum‑Guided Semantic Clustering (MGSC). Specifically, BACS utilizes batch‑level information to mine tripartite features‑‑Hard Negatives, Anchors, and Boundary Pairs‑‑to precisely characterize the typical representation and decision boundaries of positive and negative prompt samples. To resolve semantic conflicts, MGSC introduces a textual momentum mechanism with temporal decay that distills persistent consensus from fluctuating gradients across iterations. Extensive experiments demonstrate that C‑MOP consistently outperforms SOTA baselines like PromptWizard and ProTeGi, yielding average gains of 1.58% and 3.35%. Notably, C‑MOP enables a general LLM with 3B activated parameters to surpass a 70B domain‑specific dense LLM, highlighting its effectiveness in driving precise prompt evolution. The code is available at https://github.com/huawei‑noah/noah‑research/tree/master/C‑MOP.
Authors:Hugo L. Hammer, Vajira Thambawita, Pål Halvorsen
Abstract:
A narrated e‑book combines synchronized audio with digital text, highlighting the currently spoken word or sentence during playback. This format supports early literacy and assists individuals with reading challenges, while also allowing general readers to seamlessly switch between reading and listening. With the emergence of natural‑sounding neural Text‑to‑Speech (TTS) technology, several commercial services have been developed to leverage these technology for converting standard text e‑books into high‑quality narrated e‑books. However, no open‑source solutions currently exist to perform this task. In this paper, we present Calliope, an open‑source framework designed to fill this gap. Our method leverages state‑of‑the‑art open‑source TTS to convert a text e‑book into a narrated e‑book in the EPUB 3 Media Overlay format. The method offers several innovative steps: audio timestamps are captured directly during TTS, ensuring exact synchronization between narration and text highlighting; the publisher's original typography, styling, and embedded media are strictly preserved; and the entire pipeline operates offline. This offline capability eliminates recurring API costs, mitigates privacy concerns, and avoids copyright compliance issues associated with cloud‑based services. The framework currently supports the state‑of‑the‑art open‑source TTS systems XTTS‑v2 and Chatterbox. A potential alternative approach involves first generating narration via TTS and subsequently synchronizing it with the text using forced alignment. However, while our method ensures exact synchronization, our experiments show that forced alignment introduces drift between the audio and text highlighting significant enough to degrade the reading experience. Source code and usage instructions are available at https://github.com/hugohammer/TTS‑Narrated‑Ebook‑Creator.git.
Authors:Yifan Zhang, Zunhai Su, Shuhao Hu, Rui Yang, Wei Wu, Yulei Qian, Yuchen Xie, Xunliang Cai
Abstract:
While FP8 attention has shown substantial promise in innovations like FlashAttention‑3, its integration into the decoding phase of the DeepSeek Multi‑head Latent Attention (MLA) architecture presents notable challenges. These challenges include numerical heterogeneity arising from the decoupling of positional embeddings, misalignment of quantization scales in FP8 PV GEMM, and the need for optimized system‑level support. In this paper, we introduce SnapMLA, an FP8 MLA decoding framework optimized to improve long‑context efficiency through the following hardware‑aware algorithm‑kernel co‑optimization techniques: (i) RoPE‑Aware Per‑Token KV Quantization, where the RoPE part is maintained in high precision, motivated by our comprehensive analysis of the heterogeneous quantization sensitivity inherent to the MLA KV cache. Furthermore, per‑token granularity is employed to align with the autoregressive decoding process and maintain quantization accuracy. (ii) Quantized PV Computation Pipeline Reconstruction, which resolves the misalignment of quantization scale in FP8 PV computation stemming from the shared KV structure of the MLA KV cache. (iii) End‑to‑End Dataflow Optimization, where we establish an efficient data read‑and‑write workflow using specialized kernels, ensuring efficient data flow and performance gains. Extensive experiments on state‑of‑the‑art MLA LLMs show that SnapMLA achieves up to a 1.91x improvement in throughput, with negligible risk of performance degradation in challenging long‑context tasks, including mathematical reasoning and code generation benchmarks. Code is available at https://github.com/meituan‑longcat/SGLang‑FluentLLM.
Authors:Yifei Li, Weidong Guo, Lingling Zhang, Rongman Xu, Muye Huang, Hui Liu, Lijiao Xu, Yu Xu, Jun Liu
Abstract:
Long‑term conversational memory is a core capability for LLM‑based dialogue systems, yet existing benchmarks and evaluation protocols primarily focus on surface‑level factual recall. In realistic interactions, appropriate responses often depend on implicit constraints such as user state, goals, or values that are not explicitly queried later. To evaluate this setting, we introduce LoCoMo‑Plus, a benchmark for assessing cognitive memory under cue‑‑trigger semantic disconnect, where models must retain and apply latent constraints across long conversational contexts. We further show that conventional string‑matching metrics and explicit task‑type prompting are misaligned with such scenarios, and propose a unified evaluation framework based on constraint consistency. Experiments across diverse backbone models, retrieval‑based methods, and memory systems demonstrate that cognitive memory remains challenging and reveals failures not captured by existing benchmarks. Our code and evaluation framework are publicly available at: https://github.com/xjtuleeyf/Locomo‑Plus.
Authors:Jiahao Yuan, Yike Xu, Jinyong Wen, Baokun Wang, Yang Chen, Xiaotong Lin, Wuliang Huang, Ziyi Gao, Xing Fu, Yu Cheng, Weiqiang Wang
Abstract:
Decoder‑only large language models are increasingly used as behavioral encoders for user representation learning, yet the impact of attention masking on the quality of user embeddings remains underexplored. In this work, we conduct a systematic study of causal, hybrid, and bidirectional attention masks within a unified contrastive learning framework trained on large‑scale real‑world Alipay data that integrates long‑horizon heterogeneous user behaviors. To improve training dynamics when transitioning from causal to bidirectional attention, we propose Gradient‑Guided Soft Masking, a gradient‑based pre‑warmup applied before a linear scheduler that gradually opens future attention during optimization. Evaluated on 9 industrial user cognition benchmarks covering prediction, preference, and marketing sensitivity tasks, our approach consistently yields more stable training and higher‑quality bidirectional representations compared with causal, hybrid, and scheduler‑only baselines, while remaining compatible with decoder pretraining. Overall, our findings highlight the importance of masking design and training transition in adapting decoder‑only LLMs for effective user representation learning. Our code is available at https://github.com/JhCircle/Deepfind‑GGSM.
Authors:Yoonwon Jung, Hagyeong Shin, Benjamin K. Bergen
Abstract:
This paper introduces EVOKE, a parallel dataset of emotion vocabulary in English and Korean. The dataset offers comprehensive coverage of emotion words in each language, in addition to many‑to‑many translations between words in the two languages and identification of language‑specific emotion words. The dataset contains 1,427 Korean words and 1,399 English words, and we systematically annotate 819 Korean and 924 English adjectives and verbs. We also annotate multiple meanings of each word and their relationships, identifying polysemous emotion words and emotion‑related metaphors. The dataset is, to our knowledge, the most comprehensive, systematic, and theory‑agnostic dataset of emotion words in both Korean and English to date. It can serve as a practical tool for emotion science, psycholinguistics, computational linguistics, and natural language processing, allowing researchers to adopt different views on the resource reflecting their needs and theoretical perspectives. The dataset is publicly available at https://github.com/yoonwonj/EVOKE.
Authors:Tianci Xue, Zeyi Liao, Tianneng Shi, Zilu Wang, Kai Zhang, Dawn Song, Yu Su, Huan Sun
Abstract:
Real‑world digital environments are highly diverse and dynamic. These characteristics cause agents to frequently encounter unseen scenarios and distribution shifts, making continual learning in specific environments essential for computer‑use agents (CUAs). However, a key challenge lies in obtaining high‑quality and environment‑grounded agent data without relying on costly human annotation. In this work, we introduce ACuRL, an Autonomous Curriculum Reinforcement Learning framework that continually adapts agents to specific environments with zero human data. The agent first explores target environments to acquire initial experiences. During subsequent iterative training, a curriculum task generator leverages these experiences together with feedback from the previous iteration to synthesize new tasks tailored for the agent's current capabilities. To provide reliable reward signals, we introduce CUAJudge, a robust automatic evaluator for CUAs that achieves 93% agreement with human judgments. Empirically, our method effectively enables both intra‑environment and cross‑environment continual learning, yielding 4‑22% performance gains without catastrophic forgetting on existing environments. Further analyses show highly sparse updates (e.g., 20% parameters), which helps explain the effective and robust adaptation. Our data and code are available at https://github.com/OSU‑NLP‑Group/ACuRL.
Authors:Keenan Pepper, Alex McKenzie, Florin Pop, Stijn Servaes, Martin Leitgab, Mike Vaiana, Judd Rosenblatt, Michael S. A. Graziano, Diogo de Lucena
Abstract:
Self‑interpretation methods prompt language models to describe their own internal states, but remain unreliable due to hyperparameter sensitivity. We show that training lightweight adapters on interpretability artifacts, while keeping the LM entirely frozen, yields reliable self‑interpretation across tasks and model families. A scalar affine adapter with just d_\textmodel+1 parameters suffices: trained adapters generate sparse autoencoder feature labels that outperform the training labels themselves (71% vs 63% generation scoring at 70B scale), identify topics with 94% recall@1 versus 1% for untrained baselines, and decode bridge entities in multi‑hop reasoning that appear in neither prompt nor response, surfacing implicit reasoning without chain‑of‑thought. The learned bias vector alone accounts for 85% of improvement, and simpler adapters generalize better than more expressive alternatives. Controlling for model knowledge via prompted descriptions, we find self‑interpretation gains outpace capability gains from 7B to 72B parameters. Our results demonstrate that self‑interpretation improves with scale, without modifying the model being interpreted.
Authors:Tony Feng, Trieu H. Trinh, Garrett Bingham, Dawsen Hwang, Yuri Chervonyi, Junehyuk Jung, Joonkyung Lee, Carlo Pagano, Sang-hyun Kim, Federico Pasqualotto, Sergei Gukov, Jonathan N. Lee, Junsu Kim, Kaiying Hou, Golnaz Ghiasi, Yi Tay, YaGuang Li, Chenkai Kuang, Yuan Liu, Hanzhao Lin, Evan Zheran Liu, Nigamaa Nayakanti, Xiaomeng Yang, Heng-Tze Cheng, Demis Hassabis, Koray Kavukcuoglu, Quoc V. Le, Thang Luong
Abstract:
Recent advances in foundational models have yielded reasoning systems capable of achieving a gold‑medal standard at the International Mathematical Olympiad. The transition from competition‑level problem‑solving to professional research, however, requires navigating vast literature and constructing long‑horizon proofs. In this work, we introduce Aletheia, a math research agent that iteratively generates, verifies, and revises solutions end‑to‑end in natural language. Specifically, Aletheia is powered by an advanced version of Gemini Deep Think for challenging reasoning problems, a novel inference‑time scaling law that extends beyond Olympiad‑level problems, and intensive tool use to navigate the complexities of mathematical research. We demonstrate the capability of Aletheia from Olympiad problems to PhD‑level exercises and most notably, through several distinct milestones in AI‑assisted mathematics research: (a) a research paper (Feng26) generated by AI without any human intervention in calculating certain structure constants in arithmetic geometry called eigenweights; (b) a research paper (LeeSeo26) demonstrating human‑AI collaboration in proving bounds on systems of interacting particles called independent sets; and (c) an extensive semi‑autonomous evaluation (Feng et al., 2026a) of 700 open problems on Bloom's Erdos Conjectures database, including autonomous solutions to four open questions. In order to help the public better understand the developments pertaining to AI and mathematics, we suggest quantifying standard levels of autonomy and novelty of AI‑assisted results, as well as propose a novel concept of human‑AI interaction cards for transparency. We conclude with reflections on human‑AI collaboration in mathematics and share all prompts as well as model outputs at https://github.com/google‑deepmind/superhuman/tree/main/aletheia.
Authors:Kun Wang, Zherui Li, Zhenhong Zhou, Yitong Zhang, Yan Mi, Kun Yang, Yiming Zhang, Junhao Dong, Zhongxiang Sun, Qiankun Li, Yang Liu
Abstract:
Omni‑modal Large Language Models (OLLMs) greatly expand LLMs' multimodal capabilities but also introduce cross‑modal safety risks. However, a systematic understanding of vulnerabilities in omni‑modal interactions remains lacking. To bridge this gap, we establish a modality‑semantics decoupling principle and construct the AdvBench‑Omni dataset, which reveals a significant vulnerability in OLLMs. Mechanistic analysis uncovers a Mid‑layer Dissolution phenomenon driven by refusal vector magnitude shrinkage, alongside the existence of a modal‑invariant pure refusal direction. Inspired by these insights, we extract a golden refusal vector using Singular Value Decomposition and propose OmniSteer, which utilizes lightweight adapters to modulate intervention intensity adaptively. Extensive experiments show that our method not only increases the Refusal Success Rate against harmful inputs from 69.9% to 91.2%, but also effectively preserves the general capabilities across all modalities. Our code is available at: https://github.com/zhrli324/omni‑safety‑research.
Authors:Zhiyu Sun, Minrui Luo, Yu Wang, Zhili Chen, Tianxing He
Abstract:
Large language models (LLMs) are pretrained on corpora containing trillions of tokens and, therefore, inevitably memorize sensitive information. Locate‑then‑edit methods, as a mainstream paradigm of model editing, offer a promising solution by modifying model parameters without retraining. However, in this work, we reveal a critical vulnerability of this paradigm: the parameter updates inadvertently serve as a side channel, enabling attackers to recover the edited data. We propose a two‑stage reverse‑engineering attack named KSTER (KeySpaceReconsTruction‑then‑EntropyReduction) that leverages the low‑rank structure of these updates. First, we theoretically show that the row space of the update matrix encodes a ``fingerprint" of the edited subjects, enabling accurate subject recovery via spectral analysis. Second, we introduce an entropy‑based prompt recovery attack that reconstructs the semantic context of the edit. Extensive experiments on multiple LLMs demonstrate that our attacks can recover edited data with high success rates. Furthermore, we propose subspace camouflage, a defense strategy that obfuscates the update fingerprint with semantic decoys. This approach effectively mitigates reconstruction risks without compromising editing utility. Our code is available at https://github.com/reanatom/EditingAtk.git.
Authors:Zhaoyang Wang, Canwen Xu, Boyi Liu, Yite Wang, Siwei Han, Zhewei Yao, Huaxiu Yao, Yuxiong He
Abstract:
Recent advances in large language model (LLM) have empowered autonomous agents to perform complex tasks that require multi‑turn interactions with tools and environments. However, scaling such agent training is limited by the lack of diverse and reliable environments. In this paper, we propose Agent World Model (AWM), a fully synthetic environment generation pipeline. Using this pipeline, we scale to 1,000 environments covering everyday scenarios, in which agents can interact with rich toolsets (35 tools per environment on average) and obtain high‑quality observations. Notably, these environments are code‑driven and backed by databases, providing more reliable and consistent state transitions than environments simulated by LLMs. Moreover, they enable more efficient agent interaction compared with collecting trajectories from realistic environments. To demonstrate the effectiveness of this resource, we perform large‑scale reinforcement learning for multi‑turn tool‑use agents. Thanks to the fully executable environments and accessible database states, we can also design reliable reward functions. Experiments on three benchmarks show that training exclusively in synthetic environments, rather than benchmark‑specific ones, yields strong out‑of‑distribution generalization. The code is available at https://github.com/Snowflake‑Labs/agent‑world‑model.
Authors:Xuehang Guo, Zhiyong Lu, Tom Hope, Qingyun Wang
Abstract:
In scientific research, analysis requires accurately interpreting complex multimodal knowledge, integrating evidence from different sources, and drawing inferences grounded in domain‑specific knowledge. However, current artificial intelligence (AI) systems struggle to consistently demonstrate such capabilities. The complexity and variability of scientific tables and figures, combined with heterogeneous structures and long‑context requirements, pose fundamental obstacles to scientific table \& figure analysis. To quantify these challenges, we introduce AnaBench, a large‑scale benchmark featuring 63,178 instances from nine scientific domains, systematically categorized along seven complexity dimensions. To tackle these challenges, we propose Anagent, a multi‑agent framework for enhanced scientific table \& figure analysis through four specialized agents: Planner decomposes tasks into actionable subtasks, Expert retrieves task‑specific information through targeted tool execution, Solver synthesizes information to generate coherent analysis, and Critic performs iterative refinement through five‑dimensional quality assessment. We further develop modular training strategies that leverage supervised finetuning and specialized reinforcement learning to optimize individual capabilities while maintaining effective collaboration. Comprehensive evaluation across 9 broad domains with 170 subdomains demonstrates that Anagent achieves substantial improvements, up to \uparrow 13.43% in training‑free settings and \uparrow 42.12% with finetuning, while revealing that task‑oriented reasoning and context‑aware problem‑solving are essential for high‑quality scientific table \& figure analysis. Our project page: https://xhguo7.github.io/Anagent/.
Authors:Wenxuan Xie, Yujia Wang, Xin Tan, Chaochao Lu, Xia Hu, Xuhong Wang
Abstract:
The integration of extensive, dynamic knowledge into Large Language Models (LLMs) remains a significant challenge due to the inherent entanglement of factual data and reasoning patterns. Existing solutions, ranging from non‑parametric Retrieval‑Augmented Generation (RAG) to parametric knowledge editing, are often constrained in practice by finite context windows, retriever noise, or the risk of catastrophic forgetting. In this paper, we propose DRIFT, a novel dual‑model architecture designed to explicitly decouple knowledge extraction from the reasoning process. Unlike static prompt compression, DRIFT employs a lightweight knowledge model to dynamically compress document chunks into implicit fact tokens conditioned on the query. These dense representations are projected into the reasoning model's embedding space, replacing raw, redundant text while maintaining inference accuracy. Extensive experiments show that DRIFT significantly improves performance on long‑context tasks, outperforming strong baselines among comparably sized models. Our approach provides a scalable and efficient paradigm for extending the effective context window and reasoning capabilities of LLMs. Our code is available at https://github.com/Lancelot‑Xie/DRIFT.
Authors:William Lugoloobi, Thomas Foster, William Bankes, Chris Russell
Abstract:
Running LLMs with extended reasoning on every problem is expensive, but determining which inputs actually require additional compute remains challenging. We investigate whether their own likelihood of success is recoverable from their internal representations before generation, and if this signal can guide more efficient inference. We train linear probes on pre‑generation activations to predict policy‑specific success on math and coding tasks, substantially outperforming surface features such as question length and TF‑IDF. Using E2H‑AMC, which provides both human and model performance on identical problems, we show that models encode a model‑specific notion of difficulty that is distinct from human difficulty, and that this distinction increases with extended reasoning. Leveraging these probes, we demonstrate that routing queries across a pool of models can exceed the best‑performing model whilst reducing inference cost by up to 70% on MATH, showing that internal representations enable practical efficiency gains even when they diverge from human intuitions about difficulty. Our code is available at: https://github.com/KabakaWilliam/llms_know_difficulty
Authors:Yuhao Zheng, Li'an Zhong, Yi Wang, Rui Dai, Kaikui Liu, Xiangxiang Chu, Linyuan Lv, Philip Torr, Kevin Qinghong Lin
Abstract:
Autonomous GUI agents interact with environments by perceiving interfaces and executing actions. As a virtual sandbox, the GUI World model empowers agents with human‑like foresight by enabling action‑conditioned prediction. However, existing text‑ and pixel‑based approaches struggle to simultaneously achieve high visual fidelity and fine‑grained structural controllability. To this end, we propose Code2World, a vision‑language coder that simulates the next visual state via renderable code generation. Specifically, to address the data scarcity problem, we construct AndroidCode by translating GUI trajectories into high‑fidelity HTML and refining synthesized code through a visual‑feedback revision mechanism, yielding a corpus of over 80K high‑quality screen‑action pairs. To adapt existing VLMs into code prediction, we first perform SFT as a cold start for format layout following, then further apply Render‑Aware Reinforcement Learning which uses rendered outcome as the reward signal by enforcing visual semantic fidelity and action consistency. Extensive experiments demonstrate that Code2World‑8B achieves the top‑performing next UI prediction, rivaling the competitive GPT‑5 and Gemini‑3‑Pro‑Image. Notably, Code2World significantly enhances downstream navigation success rates in a flexible manner, boosting Gemini‑2.5‑Flash by +9.5% on AndroidWorld navigation. The code is available at https://github.com/AMAP‑ML/Code2World.
Authors:Khang Ly, Georgios Cheirmpos, Adrian Raudaschl, Christopher James, Seyed Amin Tabatabaei
Abstract:
This paper introduces AnalyticsGPT, an intuitive and efficient large language model (LLM)‑powered workflow for scientometric question answering. This underrepresented downstream task addresses the subcategory of meta‑scientific questions concerning the "science of science." When compared to traditional scientific question answering based on papers, the task poses unique challenges in the planning phase. Namely, the need for named‑entity recognition of academic entities within questions and multi‑faceted data retrieval involving scientometric indices, e.g. impact factors. Beyond their exceptional capacity for treating traditional natural language processing tasks, LLMs have shown great potential in more complex applications, such as task decomposition and planning and reasoning. In this paper, we explore the application of LLMs to scientometric question answering, and describe an end‑to‑end system implementing a sequential workflow with retrieval‑augmented generation and agentic concepts. We also address the secondary task of effectively synthesizing the data into presentable and well‑structured high‑level analyses. As a database for retrieval‑augmented generation, we leverage a proprietary research performance assessment platform. For evaluation, we consult experienced subject matter experts and leverage LLMs‑as‑judges. In doing so, we provide valuable insights on the efficacy of LLMs towards a niche downstream task. Our (skeleton) code and prompts are available at: https://github.com/lyvykhang/llm‑agents‑scientometric‑qa/tree/acl.
Authors:Kun Chen, Peng Shi, Fanfan Liu, Haibo Qiu, Zhixiong Zeng, Siqi Yang, Wenji Mao
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a critical method for enhancing the reasoning capabilities of Large Language Models (LLMs). However, continuous training often leads to policy entropy collapse, characterized by a rapid decay in entropy that results in premature overconfidence, reduced output diversity, and vanishing gradient norms that inhibit learning. Gradient‑Preserving Clipping is a primary factor influencing these dynamics, but existing mitigation strategies are largely static and lack a framework connecting clipping mechanisms to precise entropy control. This paper proposes reshaping entropy control in RL from the perspective of Gradient‑Preserving Clipping. We first theoretically and empirically verify the contributions of specific importance sampling ratio regions to entropy growth and reduction. Leveraging these findings, we introduce a novel regulation mechanism using dynamic clipping threshold to precisely manage entropy. Furthermore, we design and evaluate dynamic entropy control strategies, including increase‑then‑decrease, decrease‑increase‑decrease, and oscillatory decay. Experimental results demonstrate that these strategies effectively mitigate entropy collapse, and achieve superior performance across multiple benchmarks.
Authors:Yiming Shu, Pei Liu, Tiange Zhang, Ruiyang Gao, Jun Ma, Chen Sun
Abstract:
Sustaining long‑term interactions remains a bottleneck for Large Language Models (LLMs), as their limited context windows struggle to manage dialogue histories that extend over time. Existing memory systems often treat interactions as disjointed snippets, failing to capture the underlying narrative coherence of the dialogue stream. We propose TraceMem, a cognitively‑inspired framework that weaves structured, narrative memory schemata from user conversational traces through a three‑stage pipeline: (1) Short‑term Memory Processing, which employs a deductive topic segmentation approach to demarcate episode boundaries and extract semantic representation; (2) Synaptic Memory Consolidation, a process that summarizes episodes into episodic memories before distilling them alongside semantics into user‑specific traces; and (3) Systems Memory Consolidation, which utilizes two‑stage hierarchical clustering to organize these traces into coherent, time‑evolving narrative threads under unifying themes. These threads are encapsulated into structured user memory cards, forming narrative memory schemata. For memory utilization, we provide an agentic search mechanism to enhance reasoning process. Evaluation on the LoCoMo benchmark shows that TraceMem achieves state‑of‑the‑art performance with a brain‑inspired architecture. Analysis shows that by constructing coherent narratives, it surpasses baselines in multi‑hop and temporal reasoning, underscoring its essential role in deep narrative comprehension. Additionally, we provide an open discussion on memory systems, offering our perspectives and future outlook on the field. Our code implementation is available at: https://github.com/YimingShu‑teay/TraceMem
Authors:Sieun Hyeon, Jusang Oh, Sunghwan Steve Cho, Jaeyoung Do
Abstract:
Recent advances in Large Language Models (LLMs) have significantly improved table understanding tasks such as Table Question Answering (TableQA), yet challenges remain in ensuring reliability, scalability, and efficiency, especially in resource‑constrained or privacy‑sensitive environments. In this paper, we introduce MATA, a multi‑agent TableQA framework that leverages multiple complementary reasoning paths and a set of tools built with small language models. MATA generates candidate answers through diverse reasoning styles for a given table and question, then refines or selects the optimal answer with the help of these tools. Furthermore, it incorporates an algorithm designed to minimize expensive LLM agent calls, enhancing overall efficiency. MATA maintains strong performance with small, open‑source models and adapts easily across various LLM types. Extensive experiments on two benchmarks of varying difficulty with ten different LLMs demonstrate that MATA achieves state‑of‑the‑art accuracy and highly efficient reasoning while avoiding excessive LLM inference. Our results highlight that careful orchestration of multiple reasoning pathways yields scalable and reliable TableQA. The code is available at https://github.com/AIDAS‑Lab/MATA.
Authors:R E Zera Marveen Lyngkhoi, Chirag Chawla, Pratinav Seth, Utsav Avaiya, Soham Bhattacharjee, Mykola Khandoga, Rui Yuan, Vinay Kumar Sankarapu
Abstract:
Post‑training alignment is central to deploying large language models (LLMs), yet practical workflows remain split across backend‑specific tools and ad‑hoc glue code, making experiments hard to reproduce. We identify backend interference, reward fragmentation, and irreproducible pipelines as key obstacles in alignment research. We introduce AlignTune, a modular toolkit exposing a unified interface for supervised fine‑tuning (SFT) and RLHF‑style optimization with interchangeable TRL and Unsloth backends. AlignTune standardizes configuration, provides an extensible reward layer (rule‑based and learned), and integrates evaluation over standard benchmarks and custom tasks. By isolating backend‑specific logic behind a single factory boundary, AlignTune enables controlled comparisons and reproducible alignment experiments.
Authors:Narges Baba Ahmadi, Jan Strich, Martin Semmann, Chris Biemann
Abstract:
Large language models (LLMs) are increasingly used to access legal information. Yet, their deployment in multilingual legal settings is constrained by unreliable retrieval and the lack of domain‑adapted, open‑embedding models. In particular, existing multilingual legal corpora are not designed for semantic retrieval, and PDF‑based legislative sources introduce substantial noise due to imperfect text extraction. To address these challenges, we introduce LEMUR, a large‑scale multilingual corpus of EU environmental legislation constructed from 24,953 official EUR‑Lex PDF documents covering 25 languages. We quantify the fidelity of PDF‑to‑text conversion by measuring lexical consistency against authoritative HTML versions using the Lexical Content Score (LCS). Building on LEMUR, we fine‑tune three state‑of‑the‑art multilingual embedding models using contrastive objectives in both monolingual and bilingual settings, reflecting realistic legal‑retrieval scenarios. Experiments across low‑ and high‑resource languages demonstrate that legal‑domain fine‑tuning consistently improves Top‑k retrieval accuracy relative to strong baselines, with particularly pronounced gains for low‑resource languages. Cross‑lingual evaluations show that these improvements transfer to unseen languages, indicating that fine‑tuning primarily enhances language‑independent, content‑level legal representations rather than language‑specific cues. We publish code\footnote\hrefhttps://github.com/nargesbh/eur_lexGitHub Repository and data\footnote\hrefhttps://huggingface.co/datasets/G4KMU/LEMURHugging Face Dataset.
Authors:Klejda Alushi, Jan Strich, Chris Biemann, Martin Semmann
Abstract:
Conversational question answering increasingly relies on retrieval‑augmented generation (RAG) to ground large language models (LLMs) in external knowledge. Yet, most existing studies evaluate RAG methods in isolation and primarily focus on single‑turn settings. This paper addresses the lack of a systematic comparison of RAG methods for multi‑turn conversational QA, where dialogue history, coreference, and shifting user intent substantially complicate retrieval. We present a comprehensive empirical study of vanilla and advanced RAG methods across eight diverse conversational QA datasets spanning multiple domains. Using a unified experimental setup, we evaluate retrieval quality and answer generation using generator and retrieval metrics, and analyze how performance evolves across conversation turns. Our results show that robust yet straightforward methods, such as reranking, hybrid BM25, and HyDE, consistently outperform vanilla RAG. In contrast, several advanced techniques fail to yield gains and can even degrade performance below the No‑RAG baseline. We further demonstrate that dataset characteristics and dialogue length strongly influence retrieval effectiveness, explaining why no single RAG strategy dominates across settings. Overall, our findings indicate that effective conversational RAG depends less on method complexity than on alignment between the retrieval strategy and the dataset structure. We publish the code used.\footnote\hrefhttps://github.com/Klejda‑A/exp‑rag.gitGitHub Repository
Authors:Haoyu Zhao, Ziran Yang, Jiawei Li, Deyuan He, Zenan Li, Chi Jin, Venugopal V. Veeravalli, Aarti Gupta, Sanjeev Arora
Abstract:
Vericoding refers to the generation of formally verified code from rigorous specifications. Recent AI models show promise in vericoding, but a unified methodology for cross‑paradigm evaluation is lacking. Existing benchmarks test only individual languages/tools (e.g., Dafny, Verus, and Lean) and each covers very different tasks, so the performance numbers are not directly comparable. We address this gap with AlgoVeri, a benchmark that evaluates vericoding of 77 classical algorithms in Dafny, Verus, and Lean. By enforcing identical functional contracts, AlgoVeri reveals critical capability gaps in verification systems. While frontier models achieve tractable success in Dafny (40.3% for Gemini‑3 Flash), where high‑level abstractions and SMT automation simplify the workflow, performance collapses under the systems‑level memory constraints of Verus (24.7%) and the explicit proof construction required by Lean (7.8%). Beyond aggregate metrics, we uncover a sharp divergence in test‑time compute dynamics: Gemini‑3 effectively utilizes iterative repair to boost performance (e.g., tripling pass rates in Dafny), whereas GPT‑OSS saturates early. Finally, our error analysis shows that language design affects the refinement trajectory: while Dafny allows models to focus on logical correctness, Verus and Lean trap models in persistent syntactic and semantic barriers. All data and evaluation code can be found at https://github.com/haoyuzhao123/algoveri.
Authors:Takumi Ohashi, Hitoshi Iyatomi
Abstract:
Large language models (LLMs) are increasingly deployed in multicultural settings; however, systematic evaluation of cultural specificity at the sentence level remains underexplored. We propose the Conceptual Cultural Index (CCI), which estimates cultural specificity at the sentence level. CCI is defined as the difference between the generality estimate within the target culture and the average generality estimate across other cultures. This formulation enables users to operationally control the scope of culture via comparison settings and provides interpretability, since the score derives from the underlying generality estimates. We validate CCI on 400 sentences (200 culture‑specific and 200 general), and the resulting score distribution exhibits the anticipated pattern: higher for culture‑specific sentences and lower for general ones. For binary separability, CCI outperforms direct LLM scoring, yielding more than a 10‑point improvement in AUC for models specialized to the target culture. Our code is available at https://github.com/IyatomiLab/CCI .
Authors:Shihao Xu, Tiancheng Zhou, Jiatong Ma, Yanli Ding, Yiming Yan, Ming Xiao, Guoyi Li, Haiyang Geng, Yunyun Han, Jianhua Chen, Yafeng Deng
Abstract:
Mental disorders are highly prevalent worldwide, but the shortage of psychiatrists and the inherent subjectivity of interview‑based diagnosis create substantial barriers to timely and consistent mental‑health assessment. Progress in AI‑assisted psychiatric diagnosis is constrained by the absence of benchmarks that simultaneously provide realistic patient simulation, clinician‑verified diagnostic labels, and support for dynamic multi‑turn consultation. We present LingxiDiagBench, a large‑scale multi‑agent benchmark that evaluates LLMs on both static diagnostic inference and dynamic multi‑turn psychiatric consultation in Chinese. At its core is LingxiDiag‑16K, a dataset of 16,000 EMR‑aligned synthetic consultation dialogues designed to reproduce real clinical demographic and diagnostic distributions across 12 ICD‑10 psychiatric categories. Through extensive experiments across state‑of‑the‑art LLMs, we establish key findings: (1) although LLMs achieve high accuracy on binary depression‑‑anxiety classification (up to 92.3%), performance deteriorates substantially for depression‑‑anxiety comorbidity recognition (43.0%) and 12‑way differential diagnosis (28.5%); (2) dynamic consultation often underperforms static evaluation, indicating that ineffective information‑gathering strategies significantly impair downstream diagnostic reasoning; (3) consultation quality assessed by LLM‑as‑a‑Judge shows only moderate correlation with diagnostic accuracy, suggesting that well‑structured questioning alone does not ensure correct diagnostic decisions. We release LingxiDiag‑16K and the full evaluation framework to support reproducible research at https://github.com/Lingxi‑mental‑health/LingxiDiagBench.
Authors:Veuns-Team, :, Changlong Gao, Zhangxuan Gu, Yulin Liu, Xinyu Qiu, Shuheng Shen, Yue Wen, Tianyu Xia, Zhenyu Xu, Zhengwen Zeng, Beitong Zhou, Xingran Zhou, Weizhi Chen, Sunhao Dai, Jingya Dou, Yichen Gong, Yuan Guo, Zhenlin Guo, Feng Li, Qian Li, Jinzhen Lin, Yuqi Zhou, Linchao Zhu, Liang Chen, Zhenyu Guo, Changhua Meng, Weiqiang Wang
Abstract:
GUI agents have emerged as a powerful paradigm for automating interactions in digital environments, yet achieving both broad generality and consistently strong task performance remains challenging.In this report, we present UI‑Venus‑1.5, a unified, end‑to‑end GUI Agent designed for robust real‑world applications.The proposed model family comprises two dense variants (2B and 8B) and one mixture‑of‑experts variant (30B‑A3B) to meet various downstream application scenarios.Compared to our previous version, UI‑Venus‑1.5 introduces three key technical advances: (1) a comprehensive Mid‑Training stage leveraging 10 billion tokens across 30+ datasets to establish foundational GUI semantics; (2) Online Reinforcement Learning with full‑trajectory rollouts, aligning training objectives with long‑horizon, dynamic navigation in large‑scale environments; and (3) a single unified GUI Agent constructed via Model Merging, which synthesizes domain‑specific models (grounding, web, and mobile) into one cohesive checkpoint. Extensive evaluations demonstrate that UI‑Venus‑1.5 establishes new state‑of‑the‑art performance on benchmarks such as ScreenSpot‑Pro (69.6%), VenusBench‑GD (75.0%), and AndroidWorld (77.6%), significantly outperforming previous strong baselines. In addition, UI‑Venus‑1.5 demonstrates robust navigation capabilities across a variety of Chinese mobile apps, effectively executing user instructions in real‑world scenarios. Code: https://github.com/inclusionAI/UI‑Venus; Model: https://huggingface.co/collections/inclusionAI/ui‑venus
Authors:Chenghui Zou, Ning Wang, Tiesunlong Shen, Luwei Xiao, Chuan Ma, Xiangpeng Li, Rui Mao, Erik Cambria
Abstract:
Large language models (LLMs) have been widely applied to emotional support conversation (ESC). However, complex multi‑turn support remains challenging.This is because existing alignment schemes rely on sparse outcome‑level signals, thus offering limited supervision for intermediate strategy decisions. To fill this gap, this paper proposes affective flow language model for emotional support conversation (AFlow), a framework that introduces fine‑grained supervision on dialogue prefixes by modeling a continuous affective flow along multi‑turn trajectories. AFlow can estimate intermediate utility over searched trajectories and learn preference‑consistent strategy transitions. To improve strategy coherence and empathetic response quality, a subpath‑level flow‑balance objective is presented to propagate preference signals to intermediate states. Experiment results show consistent and significant improvements over competitive baselines in diverse emotional contexts. Remarkably, AFlow with a compact open‑source backbone outperforms proprietary LMMs such as GPT‑4o and Claude‑3.5 on major ESC metrics. Our code is available at https://github.com/chz2025/AffectiveFlow.
Authors:Zirui Li, Xuefeng Bai, Kehai Chen, Yizhi Li, Jian Yang, Chenghua Lin, Min Zhang
Abstract:
Latent or continuous chain‑of‑thought methods replace explicit textual rationales with a number of internal latent steps, but these intermediate computations are difficult to evaluate beyond correlation‑based probes. In this paper, we view latent chain‑of‑thought as a manipulable causal process in representation space by modeling latent steps as variables in a structural causal model (SCM) and analyzing their effects through step‑wise \mathrmdo‑interventions. We study two representative paradigms (i.e., Coconut and CODI) on both mathematical and general reasoning tasks to investigate three key questions: (1) which steps are causally necessary for correctness and when answers become decidable early; (2) how does influence propagate across steps, and how does this structure compare to explicit CoT; and (3) do intermediate trajectories retain competing answer modes, and how does output‑level commitment differ from representational commitment across steps. We find that latent‑step budgets behave less like homogeneous extra depth and more like staged functionality with non‑local routing, and we identify a persistent gap between early output bias and late representational commitment. These results motivate mode‑conditional and stability‑aware analyses ‑‑ and corresponding training/decoding objectives ‑‑ as more reliable tools for interpreting and improving latent reasoning systems. Code is available at https://github.com/J1mL1/causal‑latent‑cot.
Authors:Yutao Zhu, Xingshuo Zhang, Maosen Zhang, Jiajie Jin, Liancheng Zhang, Xiaoshuai Song, Kangzhi Zhao, Wencong Zeng, Ruiming Tang, Han Li, Ji-Rong Wen, Zhicheng Dou
Abstract:
The advancement of large language models (LLMs) has significantly accelerated the development of search agents capable of autonomously gathering information through multi‑turn web interactions. Various benchmarks have been proposed to evaluate such agents. However, existing benchmarks often construct queries backward from answers, producing unnatural tasks misaligned with real‑world needs. Moreover, these benchmarks tend to focus on either locating specific information or aggregating information from multiple sources, while relying on static answer sets prone to data contamination. To bridge these gaps, we introduce GISA, a benchmark for General Information‑Seeking Assistants comprising 373 human‑crafted queries that reflect authentic information‑seeking scenarios. GISA features four structured answer formats (item, set, list, and table), enabling deterministic evaluation. It integrates both deep reasoning and broad information aggregation within unified tasks, and includes a live subset with periodically updated answers to resist memorization. Notably, GISA provides complete human search trajectories for every query, offering gold‑standard references for process‑level supervision and imitation learning. Experiments on mainstream LLMs and commercial search products reveal that even the best‑performing model achieves only 19.30% exact match score, with performance notably degrading on tasks requiring complex planning and comprehensive information gathering. These findings highlight substantial room for future improvement.
Authors:Haoran Zhang, Yafu Li, Zhi Wang, Zhilin Wang, Shunkai Zhang, Xiaoye Qu, Yu Cheng
Abstract:
Large Reasoning Models (LRMs) increasingly rely on reasoning traces with complex internal structures. However, existing work lacks a unified answer to three fundamental questions: (1) what defines high‑quality reasoning, (2) how to reliably evaluate long, implicitly structured reasoning traces, and (3) how to use such evaluation signals for reasoning optimization. To address these challenges, we provide a unified perspective. (1) We introduce the ME^2 principle to characterize reasoning quality along macro‑ and micro‑level concerning efficiency and effectiveness. (2) Built on this principle, we model reasoning traces as directed acyclic graphs (DAGs) and develop a DAG‑based pairwise evaluation method, capturing complex reasoning structures. (3) Based on this method, we construct the TRM‑Preference dataset and train a Thinking Reward Model (TRM) to evaluate reasoning quality at scale. Experiments show that thinking rewards serve as an effective optimization signal. At test time, selecting better reasoning leads to better outcomes (up to 19.3% gain), and during RL training, thinking rewards enhance reasoning and performance (up to 3.9% gain) across diverse tasks.
Authors:Linye Wei, Zixiang Luo, Pingzhi Tang, Meng Li
Abstract:
Diffusion large language models (dLLMs) have recently gained significant attention due to their inherent support for parallel decoding. Building on this paradigm, Mixture‑of‑Experts (MoE) dLLMs with autoregressive (AR) initialization have further demonstrated strong performance competitive with mainstream AR models. However, we identify a fundamental mismatch between MoE architectures and diffusion‑based decoding. Specifically, a large number of experts are activated at each denoising step, while only a small subset of tokens is ultimately accepted, resulting in substantial inference overhead and limiting their deployment in latency‑sensitive applications. In this work, we propose TEAM, a plug‑and‑play framework that accelerates MoE dLLMs by enabling more accepted tokens with fewer activated experts. TEAM is motivated by the observation that expert routing decisions exhibit strong temporal consistency across denoising levels as well as spatial consistency across token positions. Leveraging these properties, TEAM employs three complementary expert activation and decoding strategies, conservatively selecting necessary experts for decoded and masked tokens and simultaneously performing aggressive speculative exploration across multiple candidates. Experimental results demonstrate that TEAM achieves up to 2.2x speedup over vanilla MoE dLLM, with negligible performance degradation. Code is released at https://github.com/PKU‑SEC‑Lab/TEAM‑MoE‑dLLM.
Authors:Jaylen Jones, Zhehao Zhang, Yuting Ning, Eric Fosler-Lussier, Pierre-Luc St-Charles, Yoshua Bengio, Dawn Song, Yu Su, Huan Sun
Abstract:
Although computer‑use agents (CUAs) hold significant potential to automate increasingly complex OS workflows, they can demonstrate unsafe unintended behaviors that deviate from expected outcomes even under benign input contexts. However, exploration of this risk remains largely anecdotal, lacking concrete characterization and automated methods to proactively surface long‑tail unintended behaviors under realistic CUA scenarios. To fill this gap, we introduce the first conceptual and methodological framework for unintended CUA behaviors, by defining their key characteristics, automatically eliciting them, and analyzing how they arise from benign inputs. We propose AutoElicit: an agentic framework that iteratively perturbs benign instructions using CUA execution feedback, and elicits severe harms while keeping perturbations realistic and benign. Using AutoElicit, we surface hundreds of harmful unintended behaviors from state‑of‑the‑art CUAs such as Claude 4.5 Haiku and Opus. We further evaluate the transferability of human‑verified successful perturbations, identifying persistent susceptibility to unintended behaviors across various other frontier CUAs. This work establishes a foundation for systematically analyzing unintended behaviors in realistic computer‑use settings.
Authors:Konstantinos Mitsides, Maxence Faldor, Antoine Cully
Abstract:
Open‑ended learning frames intelligence as emerging from continual interaction with an ever‑expanding space of environments. While recent advances have utilized foundation models to programmatically generate diverse environments, these approaches often focus on discovering isolated behaviors rather than orchestrating sustained progression. In complex open‑ended worlds, the large combinatorial space of possible challenges makes it difficult for agents to discover sequences of experiences that remain consistently learnable. To address this, we propose Dreaming in Code (DiCode), a framework in which foundation models synthesize executable environment code to scaffold learning toward increasing competence. In DiCode, "dreaming" takes the form of materializing code‑level variations of the world. We instantiate DiCode in Craftax, a challenging open‑ended benchmark characterized by rich mechanics and long‑horizon progression. Empirically, DiCode enables agents to acquire long‑horizon skills, achieving a 16% improvement in mean return over the strongest baseline and non‑zero success on late‑game combat tasks where prior methods fail. Our results suggest that code‑level environment design provides a practical mechanism for curriculum control, enabling the construction of intermediate environments that bridge competence gaps in open‑ended worlds. Project page and source code are available at https://konstantinosmitsides.github.io/dreaming‑in‑code and https://github.com/konstantinosmitsides/dreaming‑in‑code.
Authors:Zejia You, Chunyuan Deng, Hanjie Chen
Abstract:
Inference‑time steering has emerged as a promising paradigm for controlling language models (LMs) without the cost of retraining. However, standard approaches typically rely on activation addition, a geometric operation that inevitably alters the magnitude of hidden representations. This raises concerns about representation collapse and degradation of open‑ended generation capabilities. In this work, we explore Spherical Steering, a training‑free primitive that resolves this trade‑off through activation rotation. Rather than shifting activations with a fixed vector, our method rotates them along a geodesic toward a target direction, guiding the activation toward the target concept while preserving the integrity of the signal. To further enhance adaptivity, we incorporate a confidence gate that dynamically modulates steering strength based on input uncertainty. Extensive experiments across multiple‑choice benchmarks demonstrate that Spherical Steering significantly outperforms addition‑based baselines (notably by +10% on TruthfulQA, COPA, and Storycloze), while simultaneously maintaining the model's general open‑ended generation quality. This work highlights the value of geometric consistency, suggesting that norm‑preserving rotation is a robust and effective primitive for precise inference‑time control.
Authors:Tianyu Li, Dongchen Han, Zixuan Cao, Haofeng Huang, Mengyu Zhou, Ming Chen, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang, Gao Huang
Abstract:
Modern Transformers predominantly adopt the Pre‑Norm paradigm for its optimization stability, foregoing the superior potential of the unstable Post‑Norm architecture. Prior attempts to combine their strengths typically lead to a stability‑performance trade‑off. We attribute this phenomenon to a structural incompatibility within a single‑stream design: Any application of the Post‑Norm operation inevitably obstructs the clean identity gradient preserved by Pre‑Norm. To fundamentally reconcile these paradigms, we propose SiameseNorm, a two‑stream architecture that couples Pre‑Norm‑like and Post‑Norm‑like streams with shared parameters. This design decouples the optimization dynamics of the two streams, retaining the distinct characteristics of both Pre‑Norm and Post‑Norm by enabling all residual blocks to receive combined gradients inherited from both paradigms, where one stream secures stability while the other enhances expressivity. Extensive pre‑training experiments on 1.3B‑parameter models demonstrate that SiameseNorm exhibits exceptional optimization robustness and consistently outperforms strong baselines. Code is available at https://github.com/Qwen‑Applications/SiameseNorm.
Authors:Chenwang Wu, Yiu-ming Cheung, Shuhai Zhang, Bo Han, Defu Lian
Abstract:
While machine‑generated texts (MGTs) offer great convenience, they also pose risks such as disinformation and phishing, highlighting the need for reliable detection. Metric‑based methods, which extract statistically distinguishable features of MGTs, are often more practical than complex model‑based methods that are prone to overfitting. Given their diverse designs, we first place representative metric‑based methods within a unified framework, enabling a clear assessment of their advantages and limitations. Our analysis identifies a core challenge across these methods: the token‑level detection score is easily biased by the inherent randomness of the MGTs generation process. To address this, we theoretically and empirically reveal two relationships of context detection scores that may aid calibration: Neighbor Similarity and Initial Instability. We then propose a Markov‑informed score calibration strategy that models these relationships using Markov random fields, and implements it as a lightweight component via a mean‑field approximation, allowing our method to be seamlessly integrated into existing detectors. Extensive experiments in various real‑world scenarios, such as cross‑LLM and paraphrasing attacks, demonstrate significant gains over baselines with negligible computational overhead. The code is available at https://github.com/tmlr‑group/MRF_Calibration.
Authors:Ziyang Fan, Keyu Chen, Ruilong Xing, Yulin Li, Li Jiang, Zhuotao Tian
Abstract:
Although Video Large Language Models (VLLMs) have shown remarkable capabilities in video understanding, they are required to process high volumes of visual tokens, causing significant computational inefficiency. Existing VLLMs acceleration frameworks usually compress spatial and temporal redundancy independently, which overlooks the spatiotemporal relationships, thereby leading to suboptimal spatiotemporal compression. The highly correlated visual features are likely to change in spatial position, scale, orientation, and other attributes over time due to the dynamic nature of video. Building on this insight, we introduce FlashVID, a training‑free inference acceleration framework for VLLMs. Specifically, FlashVID utilizes Attention and Diversity‑based Token Selection (ADTS) to select the most representative tokens for basic video representation, then applies Tree‑based Spatiotemporal Token Merging (TSTM) for fine‑grained spatiotemporal redundancy elimination. Extensive experiments conducted on three representative VLLMs across five video understanding benchmarks demonstrate the effectiveness and generalization of our method. Notably, by retaining only 10% of visual tokens, FlashVID preserves 99.1% of the performance of LLaVA‑OneVision. Consequently, FlashVID can serve as a training‑free and plug‑and‑play module for extending long video frames, which enables a 10x increase in video frame input to Qwen2.5‑VL, resulting in a relative improvement of 8.6% within the same computational budget. Code is available at https://github.com/Fanziyang‑v/FlashVID.
Authors:Jitai Hao, Qiang Huang, Yaowei Wang, Min Zhang, Jun Yu
Abstract:
The deployment of efficient long‑context LLMs in applications like autonomous agents, long‑chain reasoning, and creative writing is fundamentally bottlenecked by the linear growth of KV cache memory. Existing compression and eviction methods often struggle to balance accuracy, compression ratio, and hardware efficiency. We propose DeltaKV, a residual‑based KV cache compression framework motivated by two empirical findings: long‑range inter‑token similarity and highly shared latent components in KV representations. Instead of discarding tokens, DeltaKV encodes semantic residuals relative to retrieved historical references, preserving fidelity while substantially reducing storage. To translate compression gains into real system speedups, we further introduce Sparse‑vLLM, a high‑performance inference engine with decoupled memory management and kernels optimized for sparse and irregular KV layouts. Experiments show that DeltaKV reduces KV cache memory to 29% of the original while maintaining near‑lossless accuracy on LongBench, SCBench, and AIME. When integrated with Sparse‑vLLM, it achieves up to 2× throughput improvement over vLLM in long‑context scenarios, demonstrating a practical path toward scalable long‑context LLM deployment. Code, model checkpoints, and datasets are available at https://github.com/CURRENTF/Sparse‑vLLM.
Authors:Taolin Zhang, Hang Guo, Wang Lu, Tao Dai, Shu-Tao Xia, Jindong Wang
Abstract:
As large language models (LLMs) continue to scale up, their performance on various downstream tasks has significantly improved. However, evaluating their capabilities has become increasingly expensive, as performing inference on a large number of benchmark samples incurs high computational costs. In this paper, we revisit the model‑item performance matrix and show that it exhibits sparsity, that representative items can be selected as anchors, and that the task of efficient benchmarking can be formulated as a sparse optimization problem. Based on these insights, we propose SparseEval, a method that, for the first time, adopts gradient descent to optimize anchor weights and employs an iterative refinement strategy for anchor selection. We utilize the representation capacity of MLP to handle sparse optimization and propose the Anchor Importance Score and Candidate Importance Score to evaluate the value of each item for task‑aware refinement. Extensive experiments demonstrate the low estimation error and high Kendall's~τ of our method across a variety of benchmarks, showcasing its superior robustness and practicality in real‑world scenarios. Code is available at https://github.com/taolinzhang/SparseEval.
Authors:Guanglong Sun, Siyuan Zhang, Liyuan Wang, Jun Zhu, Hang Su, Yi Zhong
Abstract:
Large Language Models (LLMs) often incur an alignment tax: safety post‑training can reduce general utility (e.g., reasoning and coding). We argue that this tax primarily arises from continual‑learning‑style forgetting in sequential alignment, where distribution shift and conflicting objectives cause safety updates to overwrite pre‑trained competencies. Accordingly, we cast safety alignment as a continual learning (CL) problem that must balance plasticity (acquiring safety constraints) and stability (preserving general abilities). We propose Orthogonal Gradient Projection for Safety Alignment (OGPSA), a lightweight method that mitigates interference by constraining each safety update to be orthogonal (in a first‑order sense) to a learned subspace capturing general capabilities. Specifically, OGPSA estimates a low‑rank capability subspace from gradients on a small reference set and projects the safety gradient onto its orthogonal complement before updating. This produces safety‑directed updates that minimally perturb prior knowledge while retaining capacity for alignment. OGPSA is plug‑and‑play and integrates into standard post‑training pipelines without large‑scale replay, auxiliary objectives, or retraining. Across Supervised Fine‑Tuning (SFT), Direct Preference Optimization (DPO), and sequential SFT\rightarrowDPO settings, OGPSA consistently improves the safety‑‑utility Pareto frontier over standard baselines. For instance, on Qwen2.5‑7B‑Instruct under SFT\rightarrowDPO, OGPSA preserves strong safety while recovering general capability, improving SimpleQA from 0.53% to 3.03% and IFEval from 51.94% to 63.96%. Our source code is available at \hrefhttps://github.com/SunGL001/OGPSAOGPSA
Authors:Weijiang Lv, Yaoxuan Feng, Xiaobo Xia, Jiayu Wang, Yan Jing, Wenchao Chen, Bo Chen
Abstract:
Chain‑of‑Thought reasoning is widely used to improve the interpretability of multimodal large language models (MLLMs), yet the faithfulness of the generated reasoning traces remains unclear. Prior work has mainly focused on perceptual hallucinations, leaving reasoning level unfaithfulness underexplored. To isolate faithfulness from linguistic priors, we introduce SPD‑Faith Bench, a diagnostic benchmark based on fine‑grained image difference reasoning that enforces explicit visual comparison. Evaluations on state‑of‑the‑art MLLMs reveal two systematic failure modes, perceptual blindness and perception‑reasoning dissociation. We trace these failures to decaying visual attention and representation shifts in the residual stream. Guided by this analysis, we propose SAGE, a train‑free visual evidence‑calibrated framework that improves visual routing and aligns reasoning with perception. Our results highlight the importance of explicitly evaluating faithfulness beyond response correctness. Our benchmark and codes are available at https://github.com/Johanson‑colab/SPD‑Faith‑Bench.
Authors:Fengting Yuchi, Li Du, Jason Eisner
Abstract:
Although state‑of‑the‑art LLMs can solve math problems, we find that they make errors on numerical comparisons with mixed notation: "Which is larger, 5.7 × 10^2 or 580?" This raises a fundamental question: Do LLMs even know how big these numbers are? We probe the hidden states of several smaller open‑source LLMs. A single linear projection of an appropriate hidden layer encodes the log‑magnitudes of both kinds of numerals, allowing us to recover the numbers with relative error of about 2.3% (on restricted synthetic text) or 19.06% (on scientific papers). Furthermore, the hidden state after reading a pair of numerals encodes their ranking, with a linear classifier achieving over 90% accuracy. Yet surprisingly, when explicitly asked to rank the same pairs of numerals, these LLMs achieve only 50‑70% accuracy, with worse performance for models whose probes are less effective. Finally, we show that incorporating the classifier probe's log‑loss as an auxiliary objective during finetuning brings an additional 3.22% improvement in verbalized accuracy over base models, demonstrating that improving models' internal magnitude representations can enhance their numerical reasoning capabilities. Our code is available at https://github.com/VCY019/Numeracy‑Probing.
Authors:Jiatong Li, Changdae Oh, Hyeong Kyu Choi, Jindong Wang, Sharon Li
Abstract:
Eliciting reasoning has emerged as a powerful technique for improving the performance of large language models (LLMs) on complex tasks by inducing thinking. However, their effectiveness in realistic user‑engaged agent scenarios remains unclear. In this paper, we conduct a comprehensive study on the effect of explicit thinking in user‑engaged LLM agents. Our experiments span across seven models, three benchmarks, and two thinking instantiations, and we evaluate them through both a quantitative response taxonomy analysis and qualitative failure propagation case studies. Contrary to expectations, we find that mandatory thinking often backfires on agents in user‑engaged settings, causing anomalous performance degradation across various LLMs. Our key finding reveals that thinking makes agents more ``introverted'' by shortening responses and reducing information disclosure to users, which weakens agent‑user information exchange and leads to downstream task failures. Furthermore, we demonstrate that explicitly prompting for information disclosure reliably improves performance across diverse model families, suggesting that proactive transparency is a vital lever for agent optimization. Overall, our study suggests that information transparency awareness is a crucial yet underexplored perspective for the future design of reasoning agents in real‑world scenarios. Our code is available at https://github.com/deeplearning‑wisc/Thinking‑Agent.
Authors:Wenjie Liu, Hao Wu, Xin Qiu, Yingqi Fan, Yihan Zhang, Anhao Zhao, Yunpu Ma, Xiaoyu Shen
Abstract:
Modern multimodal large language models (MLLMs) adopt a unified self‑attention design that processes visual and textual tokens at every Transformer layer, incurring substantial computational overhead. In this work, we revisit the necessity of such dense visual processing and show that projected visual embeddings are already well‑aligned with the language space, while effective vision‑language interaction occurs in only a small subset of layers. Based on these insights, we propose ViCA (Vision‑only Cross‑Attention), a minimal MLLM architecture in which visual tokens bypass all self‑attention and feed‑forward layers, interacting with text solely through sparse cross‑attention at selected layers. Extensive evaluations across three MLLM backbones, nine multimodal benchmarks, and 26 pruning‑based baselines show that ViCA preserves 98% of baseline accuracy while reducing visual‑side computation to 4%, consistently achieving superior performance‑efficiency trade‑offs. Moreover, ViCA provides a regular, hardware‑friendly inference pipeline that yields over 3.5x speedup in single‑batch inference and over 10x speedup in multi‑batch inference, reducing visual grounding to near‑zero overhead compared with text‑only LLMs. It is also orthogonal to token pruning methods and can be seamlessly combined for further efficiency gains. Our code is available at https://github.com/EIT‑NLP/ViCA.
Authors:Yijie Chen, Yijin Liu, Fandong Meng
Abstract:
Supervised Fine‑Tuning (SFT) followed by Reinforcement Learning (RL) has emerged as the standard post‑training paradigm for large language models (LLMs). However, the conventional SFT process, driven by Cross‑Entropy (CE) loss, often induces mode collapse, where models over‑concentrate on specific response patterns. This lack of distributional diversity severely restricts the exploration efficiency required for subsequent RL. While recent studies have attempted to improve SFT by replacing the CE loss, aiming to preserve diversity or refine the update policy, they fail to adequately balance diversity and accuracy, thereby yielding suboptimal performance after RL. To address the mode collapse problem, we propose SED‑SFT, which adaptively encourages diversity based on the token exploration space. This framework introduces a selective entropy regularization term with a selective masking mechanism into the optimization objective. Extensive experiments across eight mathematical benchmarks demonstrate that SED‑SFT significantly enhances generation diversity with a negligible computational overhead increase compared with CE loss, yielding average improvements of 2.06 and 1.20 points in subsequent RL performance over standard CE‑based baselines on Llama‑3.2‑3B‑Instruct and Qwen2.5‑Math‑7B‑Instruct, respectively. The code is publicly available at https://github.com/pppa2019/SED‑SFT
Authors:Dingzhi Yu, Hongyi Tao, Yuanyu Wan, Luo Luo, Lijun Zhang
Abstract:
While adaptive gradient methods are the workhorse of modern machine learning, sign‑based optimization algorithms such as Lion and Muon have recently demonstrated superior empirical performance over AdamW in training large language models (LLM). However, a theoretical understanding of why sign‑based updates outperform variance‑adapted methods remains elusive. In this paper, we aim to bridge the gap between theory and practice through the lens of heavy‑tailed gradient noise, a phenomenon frequently observed in language modeling tasks. Theoretically, we introduce a novel generalized heavy‑tailed noise condition that captures the behavior of LLMs more accurately than standard finite variance assumptions. Under this noise model, we establish sharp convergence rates of SignSGD and Lion for generalized smooth function classes, matching or surpassing previous best‑known bounds. Furthermore, we extend our analysis to Muon and Muonlight, providing what is, to our knowledge, the first rigorous analysis of matrix optimization under heavy‑tailed stochasticity. These results offer a strong theoretical justification for the empirical superiority of sign‑based optimizers, showcasing that they are naturally suited to handle the noisy gradients associated with heavy tails. Empirically, LLM pretraining experiments validate our theoretical insights and confirm that our proposed noise models are well‑aligned with practice.
Authors:Tianyi Wu, Mingzhe Du, Yue Liu, Chengran Yang, Terry Yue Zhuo, Jiaheng Zhang, See-Kiong Ng
Abstract:
Large language models (LLMs) are increasingly used in software development, yet their tendency to generate insecure code remains a major barrier to real‑world deployment. Existing secure code alignment methods often suffer from a functionality‑‑security paradox, improving security at the cost of substantial utility degradation. We propose SecCoderX, an online reinforcement learning framework for functionality‑preserving secure code generation. SecCoderX first bridges vulnerability detection and secure code generation by repurposing mature detection resources in two ways: (i) synthesizing diverse, reality‑grounded vulnerability‑inducing coding tasks for online RL rollouts, and (ii) training a reasoning‑based vulnerability reward model that provides scalable and reliable security supervision. Together, these components are unified in an online RL loop to align code LLMs to generate secure and functional code. Extensive experiments demonstrate that SecCoderX achieves state‑of‑the‑art performance, improving Effective Safety Rate (ESR) by approximately 10% over unaligned models, whereas prior methods often degrade ESR by 14‑54%. We release our code, dataset and model checkpoints at https://github.com/AndrewWTY/SecCoderX.
Authors:Nisharg Nargund, Priyesh Shukla
Abstract:
Large language models (LLMs) achieve remarkable performance but demand substantial computational resources, limiting deployment on edge devices and resource‑constrained environments. We present TernaryLM, a 132M parameter transformer architecture that employs native 1‑bit ternary quantization ‑1, 0, +1 during training, achieving significant memory reduction without sacrificing language modeling capability. Unlike post‑training quantization approaches that quantize pre‑trained full‑precision models, TernaryLM learns quantization‑aware representations from scratch using straight‑through estimators and adaptive per‑layer scaling factors. Our experiments demonstrate: (1) validation perplexity of 58.42 on TinyStories; (2) downstream transfer with 82.47 percent F1 on MRPC paraphrase detection; (3) 2.4x memory reduction (498MB vs 1197MB) with comparable inference latency; and (4) stable training dynamics across diverse corpora. We provide layer‑wise quantization analysis showing that middle transformer layers exhibit highest compatibility with extreme quantization, informing future non‑uniform precision strategies. Our results suggest that native 1‑bit training is a promising direction for efficient neural language models. Code is available at https://github.com/1nisharg/TernaryLM‑Memory‑Efficient‑Language‑Modeling.
Authors:Long S. T. Nguyen, Quan M. Bui, Tin T. Ngo, Quynh T. N. Vo, Dung N. H. Le, Tho T. Quan
Abstract:
Question Answering (QA) over regulatory documents is inherently challenging due to the need for multihop reasoning across legally interdependent texts, a requirement that is particularly pronounced in the healthcare domain where regulations are hierarchically structured and frequently revised through amendments and cross‑references. Despite recent progress in retrieval‑augmented and graph‑based QA methods, systematic evaluation in this setting remains limited, especially for low‑resource languages such as Vietnamese, due to the lack of benchmark datasets that explicitly support multihop reasoning over healthcare regulations. In this work, we introduce the Vietnamese Healthcare Regulations‑Multihop Reasoning Dataset (ViHERMES), a benchmark designed for multihop QA over Vietnamese healthcare regulatory documents. ViHERMES consists of high‑quality question‑answer pairs that require reasoning across multiple regulations and capture diverse dependency patterns, including amendment tracing, cross‑document comparison, and procedural synthesis. To construct the dataset, we propose a controlled multihop QA generation pipeline based on semantic clustering and graph‑inspired data mining, followed by large language model‑based generation with structured evidence and reasoning annotations. We further present a graph‑aware retrieval framework that models formal legal relations at the level of legal units and supports principled context expansion for legally valid and coherent answers. Experimental results demonstrate that ViHERMES provides a challenging benchmark for evaluating multihop regulatory QA systems and that the proposed graph‑aware approach consistently outperforms strong retrieval‑based baselines. The ViHERMES dataset and system implementation are publicly available at https://github.com/ura‑hcmut/ViHERMES.
Authors:Jacqueline He, Jonathan Hayase, Wen-tau Yih, Sewoong Oh, Luke Zettlemoyer, Pang Wei Koh
Abstract:
Modern language models (LMs) tend to memorize portions of their training data and emit verbatim spans. When the underlying sources are sensitive or copyright‑protected, such reproduction raises issues of consent and compensation for creators and compliance risks for developers. We propose Anchored Decoding, a plug‑and‑play inference‑time method for suppressing verbatim copying: it enables decoding from any risky LM trained on mixed‑license data by keeping generation in bounded proximity to a permissively trained safe LM. Anchored Decoding adaptively allocates a user‑chosen information budget over the generation trajectory and enforces per‑step constraints that yield a sequence‑level guarantee, enabling a tunable risk‑utility trade‑off. To make Anchored Decoding practically useful, we introduce a new permissively trained safe model (TinyComma 1.8B), as well as Anchored_\mathrmByte Decoding, a byte‑level variant of our method that enables cross‑vocabulary fusion via the ByteSampler framework (Hayase et al., 2025). We evaluate our methods across six model pairs on long‑form evaluations of copyright risk and utility. Anchored and Anchored_\mathrmByte Decoding define a new Pareto frontier, preserving near‑original fluency and factuality while eliminating up to 75% of the measurable copying gap (averaged over six copying metrics) between the risky baseline and a safe reference, at a modest inference overhead.
Authors:Yifan Ji, Zhipeng Xu, Zhenghao Liu, Zulong Chen, Qian Zhang, Zhibo Yang, Junyang Lin, Yu Gu, Ge Yu, Maosong Sun
Abstract:
Key Information Extraction (KIE) from real‑world documents remains challenging due to substantial variations in layout structures, visual quality, and task‑specific information requirements. Recent Large Multimodal Models (LMMs) have shown promising potential for performing end‑to‑end KIE directly from document images. To enable a comprehensive and systematic evaluation across realistic and diverse application scenarios, we introduce UNIKIE‑BENCH, a unified benchmark designed to rigorously evaluate the KIE capabilities of LMMs. UNIKIE‑BENCH consists of two complementary tracks: a constrained‑category KIE track with scenario‑predefined schemas that reflect practical application needs, and an open‑category KIE track that extracts any key information that is explicitly present in the document. Experiments on 15 state‑of‑the‑art LMMs reveal substantial performance degradation under diverse schema definitions, long‑tail key fields, and complex layouts, along with pronounced performance disparities across different document types and scenarios. These findings underscore persistent challenges in grounding accuracy and layout‑aware reasoning for LMM‑based KIE. All codes and datasets are available at https://github.com/NEUIR/UNIKIE‑BENCH.
Authors:Shashank
Abstract:
Transformers achieve strong language modeling accuracy, yet their position‑wise feed‑forward networks (FFNs) are dense, globally shared, and typically updated end to end. These properties create two practical tensions. First, dense FFNs spend the same compute on every token regardless of context, and they allocate capacity uniformly even when language exhibits highly clustered context structure. Second, continual learning, in the sense of updating the model while serving a data stream, often produces interference because a small update touches broadly shared weights. We propose Attractor Patch Networks (APN), a plug‑compatible replacement for the Transformer FFN. APN is a bank of patch experts. A similarity router selects a small top‑k set of patches for each token by matching the token representation to learned prototypes. Each selected patch emits a low‑rank residual update conditioned on a compact code. The architecture yields conditional, context‑specialized nonlinear transformations while preserving the standard Transformer interface. This paper focuses on APN as an architectural primitive. We formalize APN, analyze its expressivity as a piecewise low‑rank residual function class, and derive simple interference and stability arguments that make APN naturally compatible with continual learning. In experiments on character‑level language modeling, APN achieves competitive perplexity (4.57 vs 4.32 PPL) while enabling dramatically better continual adaptation: when adapting to a shifted domain, APN achieves 2.6 times better retention (11.1 vs 29.4 PPL on the original domain) and 2.8 times better adaptation (6.4 vs 17.8 PPL on the new domain) compared to global fine‑tuning of a dense FFN baseline.
Authors:Siqi Song, Xuanbing Xie, Zonglin Li, Yuqiang Li, Shijie Wang, Biqing Qi
Abstract:
Multi‑robot collaboration tasks often require heterogeneous robots to work together over long horizons under spatial constraints and environmental uncertainties. Although Large Language Models (LLMs) excel at reasoning and planning, their potential for coordinated control has not been fully explored. Inspired by human teamwork, we present CLiMRS (Cooperative Large‑Language‑Model‑Driven Heterogeneous Multi‑Robot System), an adaptive group negotiation framework among LLMs for multi‑robot collaboration. This framework pairs each robot with an LLM agent and dynamically forms subgroups through a general proposal planner. Within each subgroup, a subgroup manager leads perception‑driven multi‑LLM discussions to get commands for actions. Feedback is provided by both robot execution outcomes and environment changes. This grouping‑planning‑execution‑feedback loop enables efficient planning and robust execution. To evaluate these capabilities, we introduce CLiMBench, a heterogeneous multi‑robot benchmark of challenging assembly tasks. Our experiments show that CLiMRS surpasses the best baseline, achieving over 40% higher efficiency on complex tasks without sacrificing success on simpler ones. Overall, our results demonstrate that leveraging human‑inspired group formation and negotiation principles significantly enhances the efficiency of heterogeneous multi‑robot collaboration. Our code is available here: https://github.com/song‑siqi/CLiMRS.
Authors:Yuchen Yan, Liang Jiang, Jin Jiang, Shuaicheng Li, Zujie Wen, Zhiqiang Zhang, Jun Zhou, Jian Shao, Yueting Zhuang, Yongliang Shen
Abstract:
Large reasoning models achieve strong performance by scaling inference‑time chain‑of‑thought, but this paradigm suffers from quadratic cost, context length limits, and degraded reasoning due to lost‑in‑the‑middle effects. Iterative reasoning mitigates these issues by periodically summarizing intermediate thoughts, yet existing methods rely on supervised learning or fixed heuristics and fail to optimize when to summarize, what to preserve, and how to resume reasoning. We propose InftyThink+, an end‑to‑end reinforcement learning framework that optimizes the entire iterative reasoning trajectory, building on model‑controlled iteration boundaries and explicit summarization. InftyThink+ adopts a two‑stage training scheme with supervised cold‑start followed by trajectory‑level reinforcement learning, enabling the model to learn strategic summarization and continuation decisions. Experiments on DeepSeek‑R1‑Distill‑Qwen‑1.5B show that InftyThink+ improves accuracy by 21% on AIME24 and outperforms conventional long chain‑of‑thought reinforcement learning by a clear margin, while also generalizing better to out‑of‑distribution benchmarks. Moreover, InftyThink+ significantly reduces inference latency and accelerates reinforcement learning training, demonstrating improved reasoning efficiency alongside stronger performance.
Authors:Lizhuo Luo, Zhuoran Shi, Jiajun Luo, Zhi Wang, Shen Ren, Wenya Wang, Tianwei Zhang
Abstract:
Diffusion large language models (dLLMs) have shown advantages in text generation, particularly due to their inherent ability for parallel decoding. However, constrained by the quality‑‑speed trade‑off, existing inference solutions adopt conservative parallel strategies, leaving substantial efficiency potential underexplored. A core challenge is that parallel decoding assumes each position can be filled independently, but tokens are often semantically coupled. Thus, the correct choice at one position constrains valid choices at others. Without modeling these inter‑token dependencies, parallel strategies produce deteriorated outputs. Motivated by this insight, we propose DAWN, a training‑free, dependency‑aware decoding method for fast dLLM inference. DAWN extracts token dependencies and leverages two key motivations: (1) positions dependent on unmasked certain positions become more reliable, (2) simultaneously unmasking strongly coupled uncertain positions induces errors. Given those findings, DAWN leverages a dependency graph to select more reliable unmasking positions at each iteration, achieving high parallelism with negligible loss in generation quality. Extensive experiments across multiple models and datasets demonstrate that DAWN speedups the inference by 1.80‑8.06x over baselines while preserving the generation quality. Code is released at https://github.com/lizhuo‑luo/DAWN.
Authors:Mingqian Feng, Xiaodong Liu, Weiwei Yang, Jialin Song, Xuekai Zhu, Chenliang Xu, Jianfeng Gao
Abstract:
Multi‑turn jailbreaks capture the real threat model for safety‑aligned chatbots, where single‑turn attacks are merely a special case. Yet existing approaches break under exploration complexity and intent drift. We propose SEMA, a simple yet effective framework that trains a multi‑turn attacker without relying on any existing strategies or external data. SEMA comprises two stages. Prefilling self‑tuning enables usable rollouts by fine‑tuning on non‑refusal, well‑structured, multi‑turn adversarial prompts that are self‑generated with a minimal prefix, thereby stabilizing subsequent learning. Reinforcement learning with intent‑drift‑aware reward trains the attacker to elicit valid multi‑turn adversarial prompts while maintaining the same harmful objective. We anchor harmful intent in multi‑turn jailbreaks via an intent‑drift‑aware reward that combines intent alignment, compliance risk, and level of detail. Our open‑loop attack regime avoids dependence on victim feedback, unifies single‑ and multi‑turn settings, and reduces exploration complexity. Across multiple datasets, victim models, and jailbreak judges, our method achieves state‑of‑the‑art (SOTA) attack success rates (ASR), outperforming all single‑turn baselines, manually scripted and template‑driven multi‑turn baselines, as well as our SFT (Supervised Fine‑Tuning) and DPO (Direct Preference Optimization) variants. For instance, SEMA performs an average 80.1% ASR@1 across three closed‑source and open‑source victim models on AdvBench, 33.9% over SOTA. The approach is compact, reproducible, and transfers across targets, providing a stronger and more realistic stress test for large language model (LLM) safety and enabling automatic redteaming to expose and localize failure modes. Our code is available at: https://github.com/fmmarkmq/SEMA.
Authors:Yanlin Lai, Mitt Huang, Hangyu Guo, Xiangfeng Wang, Haodong Li, Shaoxiong Zhan, Liang Zhao, Chengyuan Yao, Yinmin Zhang, Qi Han, Chun Yuan, Zheng Ge, Xiangyu Zhang, Daxin Jiang
Abstract:
Reinforcement Learning from Human Feedback (RLHF) remains indispensable for aligning large language models (LLMs) in subjective domains. To enhance robustness, recent work shifts toward Generative Reward Models (GenRMs) that generate rationales before predicting preferences. Yet in GenRM training and evaluation, practice remains outcome‑label‑only, leaving reasoning quality unchecked. We show that reasoning fidelity‑the consistency between a GenRM's preference decision and reference decision rationales‑is highly predictive of downstream RLHF outcomes, beyond standard label accuracy. Specifically, we repurpose existing reward‑model benchmarks to compute Spurious Correctness (S‑Corr)‑the fraction of label‑correct decisions with rationales misaligned with golden judgments. Our empirical evaluation reveals substantial S‑Corr even for competitive GenRMs, and higher S‑Corr is associated with policy degeneration under optimization. To improve fidelity, we propose Rationale‑Centric Alignment, R‑Align, which augments training with gold judgments and explicitly supervises rationale alignment. R‑Align reduces S‑Corr on RM benchmarks and yields consistent gains in actor performance across STEM, coding, instruction following, and general tasks.
Authors:Tian Lan, Felix Henry, Bin Zhu, Qianghuai Jia, Junyang Ren, Qihang Pu, Haijun Li, Longyue Wang, Zhao Xu, Weihua Luo
Abstract:
Current Information Seeking (InfoSeeking) agents struggle to maintain focus and coherence during long‑horizon exploration, as tracking search states, including planning procedure and massive search results, within one plain‑text context is inherently fragile. To address this, we introduce Table‑as‑Search (TaS), a structured planning framework that reformulates the InfoSeeking task as a Table Completion task. TaS maps each query into a structured table schema maintained in an external database, where rows represent search candidates and columns denote constraints or required information. This table precisely manages the search states: filled cells strictly record the history and search results, while empty cells serve as an explicit search plan. Crucially, TaS unifies three distinct InfoSeeking tasks: Deep Search, Wide Search, and the challenging DeepWide Search. Extensive experiments demonstrate that TaS significantly outperforms numerous state‑of‑the‑art baselines across three kinds of benchmarks, including multi‑agent framework and commercial systems. Furthermore, our analysis validates the TaS's superior robustness in long‑horizon InfoSeeking, alongside its efficiency, scalability and flexibility. Code and datasets are publicly released at https://github.com/AIDC‑AI/Marco‑Search‑Agent.
Authors:Zhuoyuan Hao, Zhuo Li, Wu Li, Fangming Liu, Min Zhang, Jing Li
Abstract:
Test‑time compute allocation in large reasoning models (LRMs) is widely used and has applications in mathematical problem solving, code synthesis, and planning. Recent work has addressed this problem by scaling self‑consistency and parallel thinking, adding generic ``thinking tokens'' and prompting models to re‑read the question before answering. Unfortunately, these approaches either inject task‑agnostic tokens or mandate heuristics that do not explain ‑‑ and often ignore ‑‑ the \emphspontaneous repetition that many LRMs exhibit at the head of their internal chains. In contrast, we analyze and harness the model's tendency to restate the question, which we term the \emphEcho of Prompt (EOP), as a front‑loaded, compute‑shaping mechanism. We formalize its probabilistic cost by casting echo removal as rejection‑based conditioning and defining the \emphEcho Likelihood Gap Δ\mathcalL as a computable proxy. This provides the missing theoretical link that links early repetition to likelihood gains and downstream accuracy. However, it does not by itself specify how to exploit EOP. Consequently, we develop \emphEcho‑Distilled SFT (ED‑SFT) to instill an ``echo‑then‑reason'' pattern through supervised finetuning, and \emphEchoic Prompting (EP) to re‑ground the model mid‑trace without training. While promising, quantifying benefits beyond verbosity is non‑trivial. Therefore, we conduct length and suffix‑controlled likelihood analyses together with layer‑wise attention studies, showing that EOP increases answer to answer‑prefix attention in middle layers, consistent with an \emphattention refocusing mechanism. We evaluate on GSM8K, MathQA, Hendrycks‑MATH, AIME24, and MATH‑500 under identical decoding settings and budgets, and find consistent gains over baselines. Code is available at https://github.com/hhh2210/echoes‑as‑anchors.
Authors:Minjeong Ban, Jeonghwan Choi, Hyangsuk Min, Nicole Hee-Yeon Kim, Minseok Kim, Jae-Gil Lee, Hwanjun Song
Abstract:
Information retrieval (IR) evaluation remains challenging due to incomplete IR benchmark datasets that contain unlabeled relevant chunks. While LLMs and LLM‑human hybrid strategies reduce costly human effort, they remain prone to LLM overconfidence and ineffective AI‑to‑human escalation. To address this, we propose DREAM, a multi‑round debate‑based relevance assessment framework with LLM agents, built on opposing initial stances and iterative reciprocal critique. Through our agreement‑based debate, it yields more accurate labeling for certain cases and more reliable AI‑to‑human escalation for uncertain ones, achieving 95.2% labeling accuracy with only 3.5% human involvement. Using DREAM, we build BRIDGE, a refined benchmark that mitigates evaluation bias and enables fairer retriever comparison by uncovering 29,824 missing relevant chunks. We then re‑benchmark IR systems and extend evaluation to RAG, showing that unaddressed holes not only distort retriever rankings but also drive retrieval‑generation misalignment. The relevance assessment framework is available at https: //github.com/DISL‑Lab/DREAM‑ICLR‑26; and the BRIDGE dataset is available at https://github.com/DISL‑Lab/BRIDGE‑Benchmark.
Authors:Changyue Wang, Weihang Su, Qingyao Ai, Yiqun Liu
Abstract:
Scaling training data and model parameters has long driven progress in large language models (LLMs), but this paradigm is increasingly constrained by the scarcity of high‑quality data and diminishing returns from rising computational costs. As a result, recent work is increasing the focus on continual learning from real‑world deployment, where user interaction logs provide a rich source of authentic human feedback and procedural knowledge. However, learning from user logs is challenging due to their unstructured and noisy nature. Vanilla LLM systems often struggle to distinguish useful feedback signals from noisy user behavior, and the disparity between user log collection and model optimization (e.g., the off‑policy optimization problem) further strengthens the problem. To this end, we propose UNO (User log‑driveN Optimization), a unified framework for improving LLM systems (LLMsys) with user logs. UNO first distills logs into semi‑structured rules and preference pairs, then employs query‑and‑feedback‑driven clustering to manage data heterogeneity, and finally quantifies the cognitive gap between the model's prior knowledge and the log data. This assessment guides the LLMsys to adaptively filter out noisy feedback and construct different modules for primary and reflective experiences extracted from user logs, thereby improving future responses. Extensive experiments show that UNO achieves state‑of‑the‑art effectiveness and efficiency, significantly outperforming Retrieval Augmented Generation (RAG) and memory‑based baselines. We have open‑sourced our code at https://github.com/bebr2/UNO .
Authors:Daisuke Oba, Hiroki Furuta, Naoaki Okazaki
Abstract:
Masked diffusion language models generate by iteratively filling masked tokens over multiple denoising steps, so learning only from a terminal reward on the final completion yields coarse credit assignment over intermediate decisions. We propose DiSPO (Diffusion‑State Policy Optimization), a plug‑in credit‑assignment layer that directly optimizes intermediate filling decisions. At selected intermediate masked states, DiSPO branches by resampling fillings for the currently masked positions from rollout‑cached logits, scores the resulting completions, and updates only the newly filled tokens ‑‑ without additional multi‑step diffusion rollouts. We formalize a fixed‑state objective for branched completions and derive a policy‑gradient estimator that can be combined with terminal‑feedback policy optimization using the same rollouts. On LLaDA‑8B‑Instruct, DiSPO consistently improves over the terminal‑feedback diffu‑GRPO baseline on math and planning benchmarks under matched rollout compute and optimizer steps. Our code will be available at https://daioba.github.io/dispo .
Authors:Daisuke Oba, Danushka Bollegala, Masahiro Kaneko, Naoaki Okazaki
Abstract:
Masked Diffusion Language Models generate sequences via iterative sampling that progressively unmasks tokens. However, they still recompute the attention and feed‑forward blocks for every token position at every step ‑‑ even when many unmasked tokens are essentially fixed, resulting in substantial waste in compute. We propose SureLock: when the posterior at an unmasked position has stabilized across steps (our sure condition), we lock that position ‑‑ thereafter skipping its query projection and feed‑forward sublayers ‑‑ while caching its attention keys and values so other positions can continue to attend to it. This reduces the dominant per‑iteration computational cost from O(N^2d) to O(MNd) where N is the sequence length, M is the number of unlocked token positions, and d is the model dimension. In practice, M decreases as the iteration progresses, yielding substantial savings. On LLaDA‑8B, SureLock reduces algorithmic FLOPs by 30‑‑50% relative to the same sampler without locking, while maintaining comparable generation quality. We also provide a theoretical analysis to justify the design rationale of SureLock: monitoring only the local KL at the lock step suffices to bound the deviation in final token probabilities. Our code will be available at https://daioba.github.io/surelock .
Authors:Yaoting Wang, Yun Zhou, Henghui Ding
Abstract:
Producing outputs that satisfy both semantic intent and format constraints is essential for deploying large language models in user‑facing and system‑integrated workflows. In this work, we focus on Markdown formatting, which is ubiquitous in assistants, documentation, and tool‑augmented pipelines but still prone to subtle, hard‑to‑detect errors (e.g., broken lists, malformed tables, inconsistent headings, and invalid code blocks) that can significantly degrade downstream usability. We present FMBench, a benchmark for adaptive Markdown output formatting that evaluates models under a wide range of instruction‑following scenarios with diverse structural requirements. FMBench emphasizes real‑world formatting behaviors such as multi‑level organization, mixed content (natural language interleaved with lists/tables/code), and strict adherence to user‑specified layout constraints. To improve Markdown compliance without relying on hard decoding constraints, we propose a lightweight alignment pipeline that combines supervised fine‑tuning (SFT) with reinforcement learning fine‑tuning. Starting from a base model, we first perform SFT on instruction‑response pairs, and then optimize a composite objective that balances semantic fidelity with structural correctness. Experiments on two model families (OpenPangu and Qwen) show that SFT consistently improves semantic alignment, while reinforcement learning provides additional gains in robustness to challenging Markdown instructions when initialized from a strong SFT policy. Our results also reveal an inherent trade‑off between semantic and structural objectives, highlighting the importance of carefully designed rewards for reliable formatted generation. Code is available at: https://github.com/FudanCVL/FMBench.
Authors:Yewei Liu, Xiyuan Wang, Yansheng Mao, Yoav Gelbery, Haggai Maron, Muhan Zhang
Abstract:
We propose SHINE (Scalable Hyper In‑context NEtwork), a scalable hypernetwork that can map diverse meaningful contexts into high‑quality LoRA adapters for large language models (LLM). By reusing the frozen LLM's own parameters in an in‑context hypernetwork design and introducing architectural innovations, SHINE overcomes key limitations of prior hypernetworks and achieves strong expressive power with a relatively small number of parameters. We introduce a pretraining and instruction fine‑tuning pipeline, and train our hypernetwork to generate high quality LoRA adapters from diverse meaningful contexts in a single forward pass. It updates LLM parameters without any fine‑tuning, and immediately enables complex question answering tasks related to the context without directly accessing the context, effectively transforming in‑context knowledge to in‑parameter knowledge in one pass. Our work achieves outstanding results on various tasks, greatly saves time, computation and memory costs compared to SFT‑based LLM adaptation, and shows great potential for scaling. Our code is available at https://github.com/Yewei‑Liu/SHINE
Authors:Junqi Chen, Sirui Chen, Chaochao Lu
Abstract:
Causal inference is essential for decision‑making but remains challenging for non‑experts. While large language models (LLMs) show promise in this domain, their precise causal estimation capabilities are still limited, and the impact of post‑training on these abilities is insufficiently explored. This paper examines the extent to which post‑training can enhance LLMs' capacity for causal inference. We introduce CauGym, a comprehensive dataset comprising seven core causal tasks for training and five diverse test sets. Using this dataset, we systematically evaluate five post‑training approaches: SFT, DPO, KTO, PPO, and GRPO. Across five in‑domain and four existing benchmarks, our experiments demonstrate that appropriate post‑training enables smaller LLMs to perform causal inference competitively, often surpassing much larger models. Our 14B parameter model achieves 93.5% accuracy on the CaLM benchmark, compared to 55.4% by OpenAI o3. Furthermore, the post‑trained LLMs exhibit strong generalization and robustness under real‑world conditions such as distribution shifts and noisy data. Collectively, these findings provide the first systematic evidence that targeted post‑training can produce reliable and robust LLM‑based causal reasoners. Our data and GRPO‑model are available at https://github.com/OpenCausaLab/CauGym.
Authors:Peiyang Song, Pengrui Han, Noah Goodman
Abstract:
Large Language Models (LLMs) have exhibited remarkable reasoning capabilities, achieving impressive results across a wide range of tasks. Despite these advances, significant reasoning failures persist, occurring even in seemingly simple scenarios. To systematically understand and address these shortcomings, we present the first comprehensive survey dedicated to reasoning failures in LLMs. We introduce a novel categorization framework that distinguishes reasoning into embodied and non‑embodied types, with the latter further subdivided into informal (intuitive) and formal (logical) reasoning. In parallel, we classify reasoning failures along a complementary axis into three types: fundamental failures intrinsic to LLM architectures that broadly affect downstream tasks; application‑specific limitations that manifest in particular domains; and robustness issues characterized by inconsistent performance across minor variations. For each reasoning failure, we provide a clear definition, analyze existing studies, explore root causes, and present mitigation strategies. By unifying fragmented research efforts, our survey provides a structured perspective on systemic weaknesses in LLM reasoning, offering valuable insights and guiding future research towards building stronger, more reliable, and robust reasoning capabilities. We additionally release a comprehensive collection of research works on LLM reasoning failures, as a GitHub repository at https://github.com/Peiyang‑Song/Awesome‑LLM‑Reasoning‑Failures, to provide an easy entry point to this area.
Authors:Jongha Kim, Byungoh Ko, Jeehye Na, Jinsung Yoon, Hyunwoo J. Kim
Abstract:
Despite the remarkable capabilities of Large Vision Language Models (LVLMs), they still lack detailed knowledge about specific entities. Retrieval‑augmented Generation (RAG) is a widely adopted solution that enhances LVLMs by providing additional contexts from an external Knowledge Base. However, we observe that previous decoding methods for RAG are sub‑optimal as they fail to sufficiently leverage multiple relevant contexts and suppress the negative effects of irrelevant contexts. To this end, we propose Relevance‑aware Multi‑context Contrastive Decoding (RMCD), a novel decoding method for RAG. RMCD outputs a final prediction by combining outputs predicted with each context, where each output is weighted based on its relevance to the question. By doing so, RMCD effectively aggregates useful information from multiple relevant contexts while also counteracting the negative effects of irrelevant ones. Experiments show that RMCD consistently outperforms other decoding methods across multiple LVLMs, achieving the best performance on three knowledge‑intensive visual question‑answering benchmarks. Also, RMCD can be simply applied by replacing the decoding method of LVLMs without additional training. Analyses also show that RMCD is robust to the retrieval results, consistently performing the best across the weakest to the strongest retrieval results. Code is available at https://github.com/mlvlab/RMCD.
Authors:Haozhen Zhang, Haodong Yue, Tao Feng, Quanyu Long, Jianzhu Bao, Bowen Jin, Weizhi Zhang, Xiao Li, Jiaxuan You, Chengwei Qin, Wenya Wang
Abstract:
Memory is increasingly central to Large Language Model (LLM) agents operating beyond a single context window, yet most existing systems rely on offline, query‑agnostic memory construction that can be inefficient and may discard query‑critical information. Although runtime memory utilization is a natural alternative, prior work often incurs substantial overhead and offers limited explicit control over the performance‑cost trade‑off. In this work, we present BudgetMem, a runtime agent memory framework for explicit, query‑aware performance‑cost control. BudgetMem structures memory processing as a set of memory modules, each offered in three budget tiers (i.e., \textscLow/\textscMid/\textscHigh). A lightweight router performs budget‑tier routing across modules to balance task performance and memory construction cost, which is implemented as a compact neural policy trained with reinforcement learning. Using BudgetMem as a unified testbed, we study three complementary strategies for realizing budget tiers: implementation (method complexity), reasoning (inference behavior), and capacity (module model size). Across LoCoMo, LongMemEval, and HotpotQA, BudgetMem surpasses strong baselines when performance is prioritized (i.e., high‑budget setting), and delivers better accuracy‑cost frontiers under tighter budgets. Moreover, our analysis disentangles the strengths and weaknesses of different tiering strategies, clarifying when each axis delivers the most favorable trade‑offs under varying budget regimes.
Authors:Lizhuo Luo, Shenggui Li, Yonggang Wen, Tianwei Zhang
Abstract:
Diffusion large language models (dLLMs) have emerged as a promising alternative for text generation, distinguished by their native support for parallel decoding. In practice, block inference is crucial for avoiding order misalignment in global bidirectional decoding and improving output quality. However, the widely‑used fixed, predefined block (naive) schedule is agnostic to semantic difficulty, making it a suboptimal strategy for both quality and efficiency: it can force premature commitments to uncertain positions while delaying easy positions near block boundaries. In this work, we analyze the limitations of naive block scheduling and disclose the importance of dynamically adapting the schedule to semantic difficulty for reliable and efficient inference. Motivated by this, we propose Dynamic Sliding Block (DSB), a training‑free block scheduling method that uses a sliding block with a dynamic size to overcome the rigidity of the naive block. To further improve efficiency, we introduce DSB Cache, a training‑free KV‑cache mechanism tailored to DSB. Extensive experiments across multiple models and benchmarks demonstrate that DSB, together with DSB Cache, consistently improves both generation quality and inference efficiency for dLLMs. Code is released at https://github.com/lizhuo‑luo/DSB.
Authors:Shuo Nie, Hexuan Deng, Chao Wang, Ruiyu Fang, Xuebo Liu, Shuangyong Song, Yu Li, Min Zhang, Xuelong Li
Abstract:
As large language models become smaller and more efficient, small reasoning models (SRMs) are crucial for enabling chain‑of‑thought (CoT) reasoning in resource‑constrained settings. However, they are prone to faithfulness hallucinations, especially in intermediate reasoning steps. Existing mitigation methods based on online reinforcement learning rely on outcome‑based rewards or coarse‑grained CoT evaluation, which can inadvertently reinforce unfaithful reasoning when the final answer is correct. To address these limitations, we propose Faithfulness‑Aware Step‑Level Reinforcement Learning (FaithRL), introducing step‑level supervision via explicit faithfulness rewards from a process reward model, together with an implicit truncated resampling strategy that generates contrastive signals from faithful prefixes. Experiments across multiple SRMs and Open‑Book QA benchmarks demonstrate that FaithRL consistently reduces hallucinations in both the CoT and final answers, leading to more faithful and reliable reasoning. Code is available at https://github.com/Easy195/FaithRL.
Authors:Fangzhi Xu, Hang Yan, Qiushi Sun, Jinyang Wu, Zixian Huang, Muye Huang, Jingyang Gong, Zichen Ding, Kanzhi Cheng, Yian Wang, Xinyu Che, Zeyi Sun, Jian Zhang, Zhangyue Yin, Haoran Luo, Xuanjing Huang, Ben Kao, Jun Liu, Qika Lin
Abstract:
The rapid advancement of Large Language Models (LLMs) has catalyzed the development of autonomous agents capable of navigating complex environments. However, existing evaluations primarily adopt a deductive paradigm, where agents execute tasks based on explicitly provided rules and static goals, often within limited planning horizons. Crucially, this neglects the inductive necessity for agents to discover latent transition laws from experience autonomously, which is the cornerstone for enabling agentic foresight and sustaining strategic coherence. To bridge this gap, we introduce OdysseyArena, which re‑centers agent evaluation on long‑horizon, active, and inductive interactions. We formalize and instantiate four primitives, translating abstract transition dynamics into concrete interactive environments. Building upon this, we establish OdysseyArena‑Lite for standardized benchmarking, providing a set of 120 tasks to measure an agent's inductive efficiency and long‑horizon discovery. Pushing further, we introduce OdysseyArena‑Challenge to stress‑test agent stability across extreme interaction horizons (e.g., > 200 steps). Extensive experiments on 15+ leading LLMs reveal that even frontier models exhibit a deficiency in inductive scenarios, identifying a critical bottleneck in the pursuit of autonomous discovery in complex environments. Our code and data are available at https://github.com/xufangzhi/Odyssey‑Arena
Authors:Jingze Shi, Zhangyang Peng, Yizhang Zhu, Yifan Wu, Guang Liu, Yuyu Luo
Abstract:
Mixture‑of‑Experts (MoE) architectures are evolving towards finer granularity to improve parameter efficiency. However, existing MoE designs face an inherent trade‑off between the granularity of expert specialization and hardware execution efficiency. We propose OmniMoE, a system‑algorithm co‑designed framework that pushes expert granularity to its logical extreme. OmniMoE introduces vector‑level Atomic Experts, enabling scalable routing and execution within a single MoE layer, while retaining a shared dense MLP branch for general‑purpose processing. Although this atomic design maximizes capacity, it poses severe challenges for routing complexity and memory access. To address these, OmniMoE adopts a system‑algorithm co‑design: (i) a Cartesian Product Router that decomposes the massive index space to reduce routing complexity from O(N) to O(sqrt(N)); and (ii) Expert‑Centric Scheduling that inverts the execution order to turn scattered, memory‑bound lookups into efficient dense matrix operations. Validated on seven benchmarks, OmniMoE (with 1.7B active parameters) achieves 50.9% zero‑shot accuracy across seven benchmarks, outperforming coarse‑grained (e.g., DeepSeekMoE) and fine‑grained (e.g., PEER) baselines. Crucially, OmniMoE reduces inference latency from 73ms to 6.7ms (a 10.9‑fold speedup) compared to PEER, demonstrating that massive‑scale fine‑grained MoE can be fast and accurate. Our code is open‑sourced at https://github.com/flash‑algo/omni‑moe.
Authors:Congbo Ma, Yichun Zhang, Yousef Al-Jazzazi, Ahamed Foisal, Laasya Sharma, Yousra Sadqi, Khaled Saleh, Jihad Mallat, Farah E. Shamout
Abstract:
Inaccuracies in existing or generated clinical text may lead to serious adverse consequences, especially if it is a misdiagnosis or incorrect treatment suggestion. With Large Language Models (LLMs) increasingly being used across diverse healthcare applications, comprehensive evaluation through dedicated benchmarks is crucial. However, such datasets remain scarce, especially across diverse languages and contexts. In this paper, we introduce MedErrBench, the first multilingual benchmark for error detection, localization, and correction, developed under the guidance of experienced clinicians. Based on an expanded taxonomy of ten common error types, MedErrBench covers English, Arabic and Chinese, with natural clinical cases annotated and reviewed by domain experts. We assessed the performance of a range of general‑purpose, language‑specific, and medical‑domain language models across all three tasks. Our results reveal notable performance gaps, particularly in non‑English settings, highlighting the need for clinically grounded, language‑aware systems. By making MedErrBench and our evaluation protocols publicly‑available, we aim to advance multilingual clinical NLP to promote safer and more equitable AI‑based healthcare globally. The dataset is available in the supplementary material. An anonymized version of the dataset is available at: https://github.com/congboma/MedErrBench.
Authors:Benny Cheung
Abstract:
Traditional ontologies excel at describing domain structure but cannot generate novel artifacts. Large language models generate fluently but produce outputs that lack structural validity, hallucinating mechanisms without components, goals without end conditions. We introduce Generative Ontology, a framework that synthesizes these complementary strengths: ontology provides the grammar; the LLM provides the creativity. Generative Ontology encodes domain knowledge as executable Pydantic schemas that constrain LLM generation via DSPy signatures. A multi‑agent pipeline assigns specialized roles to different ontology domains: a Mechanics Architect designs game systems, a Theme Weaver integrates narrative, a Balance Critic identifies exploits. Each agent carrying a professional "anxiety" that prevents shallow, agreeable outputs. Retrieval‑augmented generation grounds novel designs in precedents from existing exemplars, while iterative validation ensures coherence between mechanisms and components. We demonstrate the framework through GameGrammar, a system for generating complete tabletop game designs. Given a thematic prompt ("bioluminescent fungi competing in a cave ecosystem"), the pipeline produces structurally complete, playable game specifications with mechanisms, components, victory conditions, and setup instructions. These outputs satisfy ontological constraints while remaining genuinely creative. The pattern generalizes beyond games. Any domain with expert vocabulary, validity constraints, and accumulated exemplars (music composition, software architecture, culinary arts) is a candidate for Generative Ontology. We argue that constraints do not limit creativity but enable it: just as grammar makes poetry possible, ontology makes structured generation possible.
Authors:Yayuan Li, Ze Peng, Jian Zhang, Jintao Guo, Yue Duan, Yinghuan Shi
Abstract:
Model merging combines multiple fine‑tuned models into a single model by adding their weight updates, providing a lightweight alternative to retraining. Existing methods primarily target resolving conflicts between task updates, leaving the failure mode of over‑counting shared knowledge unaddressed. We show that when tasks share aligned spectral directions (i.e., overlapping singular vectors), a simple linear combination repeatedly accumulates these directions, inflating the singular values and biasing the merged model toward shared subspaces. To mitigate this issue, we propose Singular Value Calibration (SVC), a training‑free and data‑free post‑processing method that quantifies subspace overlap and rescales inflated singular values to restore a balanced spectrum. Across vision and language benchmarks, SVC consistently improves strong merging baselines and achieves state‑of‑the‑art performance. Furthermore, by modifying only the singular values, SVC improves the performance of Task Arithmetic by 13.0%. Code is available at: https://github.com/lyymuwu/SVC.
Authors:Bingru Li
Abstract:
Data annotation remains a significant bottleneck in the Humanities and Social Sciences, particularly for complex semantic tasks such as metaphor identification. While Large Language Models (LLMs) show promise, a significant gap remains between the theoretical capability of LLMs and their practical utility for researchers. This paper introduces LinguistAgent, an integrated, user‑friendly platform that leverages a reflective multi‑model architecture to automate linguistic annotation. The system implements a dual‑agent workflow, comprising an Annotator and a Reviewer, to simulate a professional peer‑review process. LinguistAgent supports comparative experiments across three paradigms: Prompt Engineering (Zero/Few‑shot), Retrieval‑Augmented Generation, and Fine‑tuning. We demonstrate LinguistAgent's efficacy using the task of metaphor identification as an example, providing real‑time token‑level evaluation (Precision, Recall, and F_1 score) against human gold standards. The application and codes are released on https://github.com/Bingru‑Li/LinguistAgent.
Authors:Takumi Goto, Yusuke Sakai, Taro Watanabe
Abstract:
Automatic evaluation in grammatical error correction (GEC) is crucial for selecting the best‑performing systems. Currently, reference‑based metrics are a popular choice, which basically measure the similarity between hypothesis and reference sentences. However, similarity measures based on embeddings, such as BERTScore, are often ineffective, since many words in the source sentences remain unchanged in both the hypothesis and the reference. This study focuses on edits specifically designed for GEC, i.e., ERRANT, and computes similarity measured over the edits from the source sentence. To this end, we propose edit vector, a representation for an edit, and introduce a new metric, UOT‑ERRANT, which transports these edit vectors from hypothesis to reference using unbalanced optimal transport. Experiments with SEEDA meta‑evaluation show that UOT‑ERRANT improves evaluation performance, particularly in the +Fluency domain where many edits occur. Moreover, our method is highly interpretable because the transport plan can be interpreted as a soft edit alignment, making UOT‑ERRANT a useful metric for both system ranking and analyzing GEC systems. Our code is available from https://github.com/gotutiyan/uot‑errant.
Authors:Filip Kučera, Christoph Mandl, Isao Echizen, Radu Timofte, Timo Spinde
Abstract:
Definitions are the foundation for any scientific work, but with a significant increase in publication numbers, gathering definitions relevant to any keyword has become challenging. We therefore introduce SciDef, an LLM‑based pipeline for automated definition extraction. We test SciDef on DefExtra & DefSim, novel datasets of human‑extracted definitions and definition‑pairs' similarity, respectively. Evaluating 16 language models across prompting strategies, we demonstrate that multi‑step and DSPy‑optimized prompting improve extraction performance. To evaluate extraction, we test various metrics and show that an NLI‑based method yields the most reliable results. We show that LLMs are largely able to extract definitions from scientific literature (86.4% of definitions from our test‑set); yet future work should focus not just on finding definitions, but on identifying relevant ones, as models tend to over‑generate them. Code & datasets are available at https://github.com/Media‑Bias‑Group/SciDef.
Authors:Tao Liu, Jiafan Lu, Bohan Yu, Pengcheng Wu, Liu Haixin, Guoyu Xu, Li Xiangheng, Lixiao Li, Jiaming Hou, Zhao Shijun, Xinglin Lyu, Kunli Zhang, Yuxiang Jia, Hongyin Zan
Abstract:
Text‑to‑SQL is a key natural language processing task that maps natural language questions to SQL queries, enabling intuitive interaction with web‑based databases. Although current methods perform well on benchmarks like BIRD and Spider, they struggle with complex reasoning, domain knowledge, and hypothetical queries, and remain costly in enterprise deployment. To address these issues, we propose a framework named IESR(Information Enhanced Structured Reasoning) for lightweight large language models: (i) leverages LLMs for key information understanding and schema linking, and decoupling mathematical computation and SQL generation, (ii) integrates a multi‑path reasoning mechanism based on Monte Carlo Tree Search (MCTS) with majority voting, and (iii) introduces a trajectory consistency verification module with a discriminator model to ensure accuracy and consistency. Experimental results demonstrate that IESR achieves state‑of‑the‑art performance on the complex reasoning benchmark LogicCat (24.28 EX) and the Archer dataset (37.28 EX) using only compact lightweight models without fine‑tuning. Furthermore, our analysis reveals that current coder models exhibit notable biases and deficiencies in physical knowledge, mathematical computation, and common‑sense reasoning, highlighting important directions for future research. We released code at https://github.com/Ffunkytao/IESR‑SLM.
Authors:Zhuokun Chen, Jianfei Cai, Bohan Zhuang
Abstract:
Generating long‑form content, such as minute‑long videos and extended texts, is increasingly important for modern generative models. Block diffusion improves inference efficiency via KV caching and block‑wise causal inference and has been widely adopted in diffusion language models and video generation. However, in long‑context settings, block diffusion still incurs substantial overhead from repeatedly computing attention over a growing KV cache. We identify an underexplored property of block diffusion: cross‑step redundancy of attention within a block. Our analysis shows that attention outputs from tokens outside the current block remain largely stable across diffusion steps, while block‑internal attention varies significantly. Based on this observation, we propose FlashBlock, a cached block‑external attention mechanism that reuses stable attention output, reducing attention computation and KV cache access without modifying the diffusion process. Moreover, FlashBlock is orthogonal to sparse attention and can be combined as a complementary residual reuse strategy, substantially improving model accuracy under aggressive sparsification. Experiments on diffusion language models and video generation demonstrate up to 1.44× higher token throughput and up to 1.6× reduction in attention time, with negligible impact on generation quality. Project page: https://caesarhhh.github.io/FlashBlock/.
Authors:Haoran Li, Sucheng Ren, Alan Yuille, Feng Wang
Abstract:
Rotary Positional Embedding (RoPE) is a key component of context scaling in Large Language Models (LLMs). While various methods have been proposed to adapt RoPE to longer contexts, their guiding principles generally fall into two categories: (1) out‑of‑distribution (OOD) mitigation, which scales RoPE frequencies to accommodate unseen positions, and (2) Semantic Modeling, which posits that the attention scores computed with RoPE should always prioritize semantically similar tokens. In this work, we unify these seemingly distinct objectives through a minimalist intervention, namely CoPE: soft clipping lowfrequency components of RoPE. CoPE not only eliminates OOD outliers and refines semantic signals, but also prevents spectral leakage caused by hard clipping. Extensive experiments demonstrate that simply applying our soft clipping strategy to RoPE yields significant performance gains that scale up to 256k context length, validating our theoretical analysis and establishing CoPE as a new state‑of‑the‑art for length generalization. Our code, data, and models are available at https://github.com/hrlics/CoPE.
Authors:Yuntai Bao, Xuhong Zhang, Jintao Chen, Ge Su, Yuxiang Cai, Hao Peng, Bing Sun, Haiqin Weng, Liu Yan, Jianwei Yin
Abstract:
Intervention‑based model steering offers a lightweight and interpretable alternative to prompting and fine‑tuning. However, by adapting strong optimization objectives from fine‑tuning, current methods are susceptible to overfitting and often underperform, sometimes generating unnatural outputs. We hypothesize that this is because effective steering requires the faithful identification of internal model mechanisms, not the enforcement of external preferences. To this end, we build on the principles of distributed alignment search (DAS), the standard for causal variable localization, to propose a new steering method: Concept DAS (CDAS). While we adopt the core mechanism of DAS, distributed interchange intervention (DII), we introduce a novel distribution matching objective tailored for the steering task by aligning intervened output distributions with counterfactual distributions. CDAS differs from prior work in two main ways: first, it learns interventions via weak‑supervised distribution matching rather than probability maximization; second, it uses DIIs that naturally enable bi‑directional steering and allow steering factors to be derived from data, reducing the effort required for hyperparameter tuning and resulting in more faithful and stable control. On AxBench, a large‑scale model steering benchmark, we show that CDAS does not always outperform preference‑optimization methods but may benefit more from increased model scale. In two safety‑related case studies, overriding refusal behaviors of safety‑aligned models and neutralizing a chain‑of‑thought backdoor, CDAS achieves systematic steering while maintaining general model utility. These results indicate that CDAS is complementary to preference‑optimization approaches and conditionally constitutes a robust approach to intervention‑based model steering. Our code is available at https://github.com/colored‑dye/concept_das.
Authors:Hongye Zhao, Yi Zhao, Chengzhi Zhang
Abstract:
The academia and industry are characterized by a reciprocal shaping and dynamic feedback mechanism. Despite distinct institutional logics, they have adapted closely in collaborative publishing and talent mobility, demonstrating tension between institutional divergence and intensive collaboration. Existing studies on their knowledge proximity mainly rely on macro indicators such as the number of collaborative papers or patents, lacking an analysis of knowledge units in the literature. This has led to an insufficient grasp of fine‑grained knowledge proximity between industry and academia, potentially undermining collaboration frameworks and resource allocation efficiency. To remedy the limitation, this study quantifies the trajectory of academia‑industry co‑evolution through fine‑grained entities and semantic space. In the entity measurement part, we extract fine‑grained knowledge entities via pre‑trained models, measure sequence overlaps using cosine similarity, and analyze topological features through complex network analysis. At the semantic level, we employ unsupervised contrastive learning to quantify convergence in semantic spaces by measuring cross‑institutional textual similarities. Finally, we use citation distribution patterns to examine correlations between bidirectional knowledge flows and similarity. Analysis reveals that knowledge proximity between academia and industry rises, particularly following technological change. This provides textual evidence of bidirectional adaptation in co‑evolution. Additionally, academia's knowledge dominance weakens during technological paradigm shifts. The dataset and code for this paper can be accessed at https://github.com/tinierZhao/Academic‑Industrial‑associations.
Authors:Shangbin Feng, Kishan Panaganti, Yulia Tsvetkov, Wenhao Yu
Abstract:
Model collaboration ‑‑ systems where multiple language models (LMs) collaborate ‑‑ combines the strengths of diverse models with cost in loading multiple LMs. We improve efficiency while preserving the strengths of collaboration by distilling collaborative patterns into a single model, where the model is trained on the outputs of the model collaboration system. At inference time, only the distilled model is employed: it imitates the collaboration while only incurring the cost of a single model. Furthermore, we propose the single‑multi evolution loop: multiple LMs collaborate, each distills from the collaborative outputs, and these post‑distillation improved LMs collaborate again, forming a collective evolution ecosystem where models evolve and self‑improve by interacting with an environment of other models. Extensive experiments with 7 collaboration strategies and 15 tasks (QA, reasoning, factuality, etc.) demonstrate that: 1) individual models improve by 8.0% on average, absorbing the strengths of collaboration while reducing the cost to a single model; 2) the collaboration also benefits from the stronger and more synergistic LMs after distillation, improving over initial systems without evolution by 14.9% on average. Analysis reveals that the single‑multi evolution loop outperforms various existing evolutionary AI methods, is compatible with diverse model/collaboration/distillation settings, and helps solve problems where the initial model/system struggles to.
Authors:Deepak Gupta, Davis Bartels, Dina Demner-Fuhsman
Abstract:
With the increasing use of large language models (LLMs) for generating answers to biomedical questions, it is crucial to evaluate the quality of the generated answers and the references provided to support the facts in the generated answers. Evaluation of text generated by LLMs remains a challenge for question answering, retrieval‑augmented generation (RAG), summarization, and many other natural language processing tasks in the biomedical domain, due to the requirements of expert assessment to verify consistency with the scientific literature and complex medical terminology. In this work, we propose BioACE, an automated framework for evaluating biomedical answers and citations against the facts stated in the answers. The proposed BioACE framework considers multiple aspects, including completeness, correctness, precision, and recall, in relation to the ground‑truth nuggets for answer evaluation. We developed automated approaches to evaluate each of the aforementioned aspects and performed extensive experiments to assess and analyze their correlation with human evaluations. In addition, we considered multiple existing approaches, such as natural language inference (NLI) and pre‑trained language models and LLMs, to evaluate the quality of evidence provided to support the generated answers in the form of citations into biomedical literature. With the detailed experiments and analysis, we provide the best approaches for biomedical answer and citation evaluation as a part of BioACE (https://github.com/deepaknlp/BioACE) evaluation package.
Authors:Davide Berasi, Matteo Farina, Massimiliano Mancini, Elisa Ricci
Abstract:
Selecting the best data mixture is critical for successful Supervised Fine‑Tuning (SFT) of Multimodal Large Language Models. However, determining the optimal mixture weights across multiple domain‑specific datasets remains a significant bottleneck due to the combinatorial search space and the high cost associated with even a single training run. This is the so‑called Data Mixture Optimization (DMO) problem. On the other hand, model merging unifies domain‑specific experts through parameter interpolation. This strategy is efficient, as it only requires a single training run per domain, yet oftentimes leads to suboptimal models. In this work, we take the best of both worlds, studying model merging as an efficient strategy for estimating the performance of different data mixtures. We train domain‑specific multimodal experts and evaluate their weighted parameter‑space combinations to estimate the efficacy of corresponding data mixtures. We conduct extensive experiments on 14 multimodal benchmarks, and empirically demonstrate that the merged proxy models exhibit a high rank correlation with models trained on actual data mixtures. This decouples the search for optimal mixtures from the resource‑intensive training process, thereby providing a scalable and efficient strategy for navigating the complex landscape of mixture weights. Code is publicly available at https://github.com/BerasiDavide/mLLMs_merging_4_DMO.
Authors:Zhenning Shi, Yijia Zhu, Junhan Shi, Xun Zhang, Lei Wang, Congcong Miao
Abstract:
The internalization of chain‑of‑thought processes into hidden states has emerged as a highly efficient paradigm for scaling test‑time compute. However, existing activation steering methods rely on static control vectors that fail to adapt to the non‑stationary evolution of complex reasoning tasks. To address this limitation, we propose STIR (Self‑Distilled Tools for Internal Reasoning), a framework that reformulates reasoning enhancement as a dynamic latent trajectory control problem. STIR introduces a synergistic three‑stage pipeline: (1) differential intrinsic action induction harvests latent reasoning successes to crystallize steering primitives; (2) sparse control basis construction curates a compact, geometrically diverse tool library; and (3) value‑modulated trajectory intervention dynamically injects context‑specific impulses via anchor‑based gating. Extensive experiments on six arithmetic and logical benchmarks across four representative models demonstrate that STIR improves average accuracy by 1.9% to 7.5% while reducing average token consumption by up to 35% compared to vanilla decoding. These findings demonstrate that the benefits of explicit chain‑of‑thought can be realized through dynamic latent trajectory control, internalizing the reasoning process to bypass the explicit generation while achieving superior fidelity. Our code is available at https://github.com/sznnzs/LLM‑Latent‑Action.
Authors:Ishaq Aden-Ali, Noah Golowich, Allen Liu, Abhishek Shetty, Ankur Moitra, Nika Haghtalab
Abstract:
Training modern large language models (LLMs) has become a veritable smorgasbord of algorithms and datasets designed to elicit particular behaviors, making it critical to develop techniques to understand the effects of datasets on the model's properties. This is exacerbated by recent experiments that show datasets can transmit signals that are not directly observable from individual datapoints, posing a conceptual challenge for dataset‑centric understandings of LLM training and suggesting a missing fundamental account of such phenomena. Towards understanding such effects, inspired by recent work on the linear structure of LLMs, we uncover a general mechanism through which hidden subtexts can arise in generic datasets. We introduce Logit‑Linear‑Selection (LLS), a method that prescribes how to select subsets of a generic preference dataset to elicit a wide range of hidden effects. We apply LLS to discover subsets of real‑world datasets so that models trained on them exhibit behaviors ranging from having specific preferences, to responding to prompts in a different language not present in the dataset, to taking on a different persona. Crucially, the effect persists for the selected subset, across models with varying architectures, supporting its generality and universality.
Authors:Jiarui Yuan, Tailin Jin, Weize Chen, Zeyuan Liu, Zhiyuan Liu, Maosong Sun
Abstract:
True self‑evolution requires agents to act as lifelong learners that internalize novel experiences to solve future problems. However, rigorously measuring this foundational capability is hindered by two obstacles: the entanglement of prior knowledge, where ``new'' knowledge may appear in pre‑training data, and the entanglement of reasoning complexity, where failures may stem from problem difficulty rather than an inability to recall learned knowledge. We introduce SE‑Bench, a diagnostic environment that obfuscates the NumPy library and its API doc into a pseudo‑novel package with randomized identifiers. Agents are trained to internalize this package and evaluated on simple coding tasks without access to documentation, yielding a clean setting where tasks are trivial with the new API doc but impossible for base models without it. Our investigation reveals three insights: (1) the Open‑Book Paradox, where training with reference documentation inhibits retention, requiring "Closed‑Book Training" to force knowledge compression into weights; (2) the RL Gap, where standard RL fails to internalize new knowledge completely due to PPO clipping and negative gradients; and (3) the viability of Self‑Play for internalization, proving models can learn from self‑generated, noisy tasks when coupled with SFT, but not RL. Overall, SE‑Bench establishes a rigorous diagnostic platform for self‑evolution with knowledge internalization. Our code and dataset can be found at https://github.com/thunlp/SE‑Bench.
Authors:Moritz Miller, Florent Draye, Bernhard Schölkopf
Abstract:
With recent progress on fine‑tuning language models around a fixed sparse autoencoder, we disentangle the decoder matrix into almost orthogonal features. This reduces interference and superposition between the features, while keeping performance on the target dataset essentially unchanged. Our orthogonality penalty leads to identifiable features, ensuring the uniqueness of the decomposition. Further, we find that the distance between embedded feature explanations increases with stricter orthogonality penalty, a desirable property for interpretability. Invoking the Independent Causal Mechanisms principle, we argue that orthogonality promotes modular representations amenable to causal intervention. We empirically show that these increasingly orthogonalized features allow for isolated interventions. Our code is available under \texttthttps://github.com/mrtzmllr/sae‑icm.
Authors:Xianbiao Qi, Marco Chen, Jiaquan Ye, Yelin He, Rong Xiao
Abstract:
The Muon optimizer has recently attracted considerable attention for its strong empirical performance and use of orthogonalized updates on matrix‑shaped parameters, yet its underlying mechanisms and relationship to adaptive optimizers such as Adam remain insufficiently understood. In this work, we aim to address these questions through a unified spectral perspective. Specifically, we view Muon as the p = 0 endpoint of a family of spectral transformations of the form U \boldsymbolΣ^p V' , and consider additional variants with p = 1/2 , p = 1/4 , and p = 1 . These transformations are applied to both first‑moment updates, as in momentum SGD, and to root‑mean‑square (RMS) normalized gradient updates as in Adam. To enable efficient computation, we develop a coupled Newton iteration that avoids explicit singular value decomposition. Across controlled experiments, we find that RMS‑normalized updates yield more stable optimization than first‑moment updates. Moreover, while spectral compression provides strong stabilization benefits under first‑moment updates, the Muon update (p = 0) does not consistently outperform Adam. These results suggest that Muon is best understood as an effective form of spectral normalization, but not a universally superior optimization method. Our source code will be released at https://github.com/Ocram7/BeyondMuon.
Authors:Jaeyoon Jung, Yejun Yoon, Seunghyun Yoon, Kunwoo Park
Abstract:
This paper describes VILLAIN, a multimodal fact‑checking system that verifies image‑text claims through prompt‑based multi‑agent collaboration. For the AVerImaTeC shared task, VILLAIN employs vision‑language model agents across multiple stages of fact‑checking. Textual and visual evidence is retrieved from the knowledge store enriched through additional web collection. To identify key information and address inconsistencies among evidence items, modality‑specific and cross‑modal agents generate analysis reports. In the subsequent stage, question‑answer pairs are produced based on these reports. Finally, the Verdict Prediction agent produces the verification outcome based on the image‑text claim and the generated question‑answer pairs. Our system ranked first on the leaderboard across all evaluation metrics. The source code is publicly available at https://github.com/ssu‑humane/VILLAIN.
Authors:Zhiyi Chen, Eun Cheol Choi, Yingjia Luo, Xinyi Wang, Yulei Xiao, Aizi Yang, Luca Luceri
Abstract:
People increasingly seek advice online from both human peers and large language model (LLM)‑based chatbots. Such advice rarely involves identifying a single correct answer; instead, it typically requires navigating trade‑offs among competing values. We aim to characterize how LLMs navigate value trade‑offs across different advice‑seeking contexts. First, we examine the value trade‑off structure underlying advice seeking using a curated dataset from four advice‑oriented subreddits. Using a bottom‑up approach, we inductively construct a hierarchical value framework by aggregating fine‑grained values extracted from individual advice options into higher‑level value categories. We construct value co‑occurrence networks to characterize how values co‑occur within dilemmas and find substantial heterogeneity in value trade‑off structures across advice‑seeking contexts: a women‑focused subreddit exhibits the highest network density, indicating more complex value conflicts; women's, men's, and friendship‑related subreddits exhibit highly correlated value‑conflict patterns centered on security‑related tensions (security vs. respect/connection/commitment); by contrast, career advice forms a distinct structure where security frequently clashes with self‑actualization and growth. We then evaluate LLM value preferences against these dilemmas and find that, across models and contexts, LLMs consistently prioritize values related to Exploration & Growth over Benevolence & Connection. This systemically skewed value orientation highlights a potential risk of value homogenization in AI‑mediated advice, raising concerns about how such systems may shape decision‑making and normative outcomes at scale.
Authors:Yujie Lin, Kunquan Li, Yixuan Liao, Xiaoxin Chen, Jinsong Su
Abstract:
Large language models (LLMs) have demonstrated impressive capabilities across a wide range of natural language processing tasks. However, their outputs often exhibit social biases, raising fairness concerns. Existing debiasing methods, such as fine‑tuning on additional datasets or prompt engineering, face scalability issues or compromise user experience in multi‑turn interactions. To address these challenges, we propose a framework for detecting stereotype‑inducing words and attributing neuron‑level bias in LLMs, without the need for fine‑tuning or prompt modification. Our framework first identifies stereotype‑inducing adjectives and nouns via comparative analysis across demographic groups. We then attribute biased behavior to specific neurons using two attribution strategies based on integrated gradients. Finally, we mitigate bias by directly intervening on their activations at the projection layer. Experiments on three widely used LLMs demonstrate that our method effectively reduces bias while preserving overall model performance. Code is available at the github link: https://github.com/XMUDeepLIT/Bi‑directional‑Bias‑Attribution.
Authors:Jiarui Jin, Haoyu Wang, Xingliang Wu, Xiaocheng Fang, Xiang Lan, Zihan Wang, Deyun Zhang, Bo Liu, Yingying Zhang, Xian Wu, Hongyan Li, Shenda Hong
Abstract:
Electrocardiography (ECG) serves as an indispensable diagnostic tool in clinical practice, yet existing multimodal large language models (MLLMs) remain unreliable for ECG interpretation, often producing plausible but clinically incorrect analyses. To address this, we propose ECG‑R1, the first reasoning MLLM designed for reliable ECG interpretation via three innovations. First, we construct the interpretation corpus using Protocol‑Guided Instruction Data Generation, grounding interpretation in measurable ECG features and monograph‑defined quantitative thresholds and diagnostic logic. Second, we present a modality‑decoupled architecture with Interleaved Modality Dropout to improve robustness and cross‑modal consistency when either the ECG signal or ECG image is missing. Third, we present Reinforcement Learning with ECG Diagnostic Evidence Rewards to strengthen evidence‑grounded ECG interpretation. Additionally, we systematically evaluate the ECG interpretation capabilities of proprietary, open‑source, and medical MLLMs, and provide the first quantitative evidence that severe hallucinations are widespread, suggesting that the public should not directly trust these outputs without independent verification. Code and data are publicly available at \hrefhttps://github.com/PKUDigitalHealth/ECG‑R1here, and an online platform can be accessed at \hrefhttp://ai.heartvoice.com.cn/ECG‑R1/here.
Authors:Zeming Wei, Qiaosheng Zhang, Xia Hu, Xingcheng Xu
Abstract:
Large Reasoning Models (LRMs) have achieved tremendous success with their chain‑of‑thought (CoT) reasoning, yet also face safety issues similar to those of basic language models. In particular, while algorithms are designed to guide them to deliberately refuse harmful prompts with safe reasoning, this process often fails to generalize against diverse and complex jailbreak attacks. In this work, we attribute these failures to the generalization of the safe reasoning process, particularly their insufficiency against complex attack prompts. We provide both theoretical and empirical evidence to show the necessity of a more sufficient safe reasoning process to defend against advanced attack prompts. Building on this insight, we propose a Risk‑Aware Preference Optimization (RAPO) framework that enables LRM to adaptively identify and address the safety risks with appropriate granularity in its thinking content. Extensive experiments demonstrate that RAPO successfully generalizes multiple LRMs' safe reasoning adaptively across diverse attack prompts whilst preserving general utility, contributing a robust alignment technique for LRM safety. Our code is available at https://github.com/weizeming/RAPO.
Authors:Xinyue Wang, Yuanhe Zhang, Zhengshuo Gong, Haoran Gao, Fanyu Meng, Zhenhong Zhou, Li Sun, Yang Liu, Sen Su
Abstract:
The enhanced capabilities of LLM‑based agents come with an emergency for model planning and tool‑use abilities. Attributing to helpful‑harmless trade‑off from LLM alignment, agents typically also inherit the flaw of "over‑refusal", which is a passive failure mode. However, the proactive planning and action capabilities of agents introduce another crucial danger on the other side of the trade‑off. This phenomenon we term "Toxic Proactivity'': an active failure mode in which an agent, driven by the optimization for Machiavellian helpfulness, disregards ethical constraints to maximize utility. Unlike over‑refusal, Toxic Proactivity manifests as the agent taking excessive or manipulative measures to ensure its "usefulness'' is maintained. Existing research pays little attention to identifying this behavior, as it often lacks the subtle context required for such strategies to unfold. To reveal this risk, we introduce a novel evaluation framework based on dilemma‑driven interactions between dual models, enabling the simulation and analysis of agent behavior over multi‑step behavioral trajectories. Through extensive experiments with mainstream LLMs, we demonstrate that Toxic Proactivity is a widespread behavioral phenomenon and reveal two major tendencies. We further present a systematic benchmark for evaluating Toxic Proactive behavior across contextual settings.
Authors:Xiaofeng Lin, Sirou Zhu, Yilei Chen, Mingyu Chen, Hejian Sang, Ioannis Paschalidis, Zhipeng Wang, Aldo Pacchiano, Xuezhou Zhang
Abstract:
Large language models (LLMs) achieve strong performance when all task‑relevant information is available upfront, as in static prediction and instruction‑following problems. However, many real‑world decision‑making tasks are inherently online: crucial information must be acquired through interaction, feedback is delayed, and effective behavior requires balancing information collection and exploitation over time. While in‑context learning enables adaptation without weight updates, existing LLMs often struggle to reliably leverage in‑context interaction experience in such settings. In this work, we show that this limitation can be addressed through training. We introduce ORBIT, a multi‑task, multi‑episode meta‑reinforcement learning framework that trains LLMs to learn from interaction in context. After meta‑training, a relatively small open‑source model (Qwen3‑14B) demonstrates substantially improved in‑context online learning on entirely unseen environments, matching the performance of GPT‑5.2 and outperforming standard RL fine‑tuning by a large margin. Scaling experiments further reveal consistent gains with model size, suggesting significant headroom for learn‑at‑inference‑time decision‑making agents. Code reproducing the results in the paper can be found at https://github.com/XiaofengLin7/ORBIT.
Authors:Azmine Toushik Wasi, Wahid Faisal, Abdur Rahman, Mahfuz Ahmed Anik, Munem Shahriar, Mohsin Mahmud Topu, Sadia Tasnim Meem, Rahatun Nesa Priti, Sabrina Afroz Mitu, Md. Iqramul Hoque, Shahriyar Zaman Ridoy, Mohammed Eunus Ali, Majd Hawasly, Mohammad Raza, Md Rizwan Parvez
Abstract:
Spatial reasoning is a fundamental aspect of human cognition, yet it remains a major challenge for contemporary vision‑language models (VLMs). Prior work largely relied on synthetic or LLM‑generated environments with limited task designs and puzzle‑like setups, failing to capture the real‑world complexity, visual noise, and diverse spatial relationships that VLMs encounter. To address this, we introduce SpatiaLab, a comprehensive benchmark for evaluating VLMs' spatial reasoning in realistic, unconstrained contexts. SpatiaLab comprises 1,400 visual question‑answer pairs across six major categories: Relative Positioning, Depth & Occlusion, Orientation, Size & Scale, Spatial Navigation, and 3D Geometry, each with five subcategories, yielding 30 distinct task types. Each subcategory contains at least 25 questions, and each main category includes at least 200 questions, supporting both multiple‑choice and open‑ended evaluation. Experiments across diverse state‑of‑the‑art VLMs, including open‑ and closed‑source models, reasoning‑focused, and specialized spatial reasoning models, reveal a substantial gap in spatial reasoning capabilities compared with humans. In the multiple‑choice setup, InternVL3.5‑72B achieves 54.93% accuracy versus 87.57% for humans. In the open‑ended setting, all models show a performance drop of around 10‑25%, with GPT‑5‑mini scoring highest at 40.93% versus 64.93% for humans. These results highlight key limitations in handling complex spatial relationships, depth perception, navigation, and 3D geometry. By providing a diverse, real‑world evaluation framework, SpatiaLab exposes critical challenges and opportunities for advancing VLMs' spatial reasoning, offering a benchmark to guide future research toward robust, human‑aligned spatial understanding. SpatiaLab is available at: https://spatialab‑reasoning.github.io/.
Authors:Minjun Zhu, Zhen Lin, Yixuan Weng, Panzhong Lu, Qiujie Xie, Yifan Wei, Sifan Liu, Qiyao Sun, Yue Zhang
Abstract:
High‑quality scientific illustrations are crucial for effectively communicating complex scientific and technical concepts, yet their manual creation remains a well‑recognized bottleneck in both academia and industry. We present FigureBench, the first large‑scale benchmark for generating scientific illustrations from long‑form scientific texts. It contains 3,300 high‑quality scientific text‑figure pairs, covering diverse text‑to‑illustration tasks from scientific papers, surveys, blogs, and textbooks. Moreover, we propose AutoFigure, the first agentic framework that automatically generates high‑quality scientific illustrations based on long‑form scientific text. Specifically, before rendering the final result, AutoFigure engages in extensive thinking, recombination, and validation to produce a layout that is both structurally sound and aesthetically refined, outputting a scientific illustration that achieves both structural completeness and aesthetic appeal. Leveraging the high‑quality data from FigureBench, we conduct extensive experiments to test the performance of AutoFigure against various baseline methods. The results demonstrate that AutoFigure consistently surpasses all baseline methods, producing publication‑ready scientific illustrations. The code, dataset and huggingface space are released in https://github.com/ResearAI/AutoFigure.
Authors:Ziru Chen, Dongdong Chen, Ruinan Jin, Yingbin Liang, Yujia Xie, Huan Sun
Abstract:
Recently, there have been significant research interests in training large language models (LLMs) with reinforcement learning (RL) on real‑world tasks, such as multi‑turn code generation. While online RL tends to perform better than offline RL, its higher training cost and instability hinders wide adoption. In this paper, we build on the observation that multi‑turn code generation can be formulated as a one‑step recoverable Markov decision process and propose contextual bandit learning with offline trajectories (Cobalt), a new method that combines the benefits of online and offline RL. Cobalt first collects code generation trajectories using a reference LLM and divides them into partial trajectories as contextual prompts. Then, during online bandit learning, the LLM is trained to complete each partial trajectory prompt through single‑step code generation. Cobalt outperforms two multi‑turn online RL baselines based on GRPO and VeRPO, and substantially improves R1‑Distill 8B and Qwen3 8B by up to 9.0 and 6.2 absolute Pass@1 scores on LiveCodeBench. Also, we analyze LLMs' in‑context reward hacking behaviors and augment Cobalt training with perturbed trajectories to mitigate this issue. Overall, our results demonstrate Cobalt as a promising solution for iterative decision‑making tasks like multi‑turn code generation. Our code and data are available at https://github.com/OSU‑NLP‑Group/cobalt.
Authors:Zimu Lu, Houxing Ren, Yunqiao Yang, Ke Wang, Zhuofan Zong, Mingjie Zhan, Hongsheng Li
Abstract:
Assisting non‑expert users to develop complex interactive websites has become a popular task for LLM‑powered code agents. However, existing code agents tend to only generate frontend web pages, masking the lack of real full‑stack data processing and storage with fancy visual effects. Notably, constructing production‑level full‑stack web applications is far more challenging than only generating frontend web pages, demanding careful control of data flow, comprehensive understanding of constantly updating packages and dependencies, and accurate localization of obscure bugs in the codebase. To address these difficulties, we introduce FullStack‑Agent, a unified agent system for full‑stack agentic coding that consists of three parts: (1) FullStack‑Dev, a multi‑agent framework with strong planning, code editing, codebase navigation, and bug localization abilities. (2) FullStack‑Learn, an innovative data‑scaling and self‑improving method that back‑translates crawled and synthesized website repositories to improve the backbone LLM of FullStack‑Dev. (3) FullStack‑Bench, a comprehensive benchmark that systematically tests the frontend, backend and database functionalities of the generated website. Our FullStack‑Dev outperforms the previous state‑of‑the‑art method by 8.7%, 38.2%, and 15.9% on the frontend, backend, and database test cases respectively. Additionally, FullStack‑Learn raises the performance of a 30B model by 9.7%, 9.5%, and 2.8% on the three sets of test cases through self‑improvement, demonstrating the effectiveness of our approach. The code is released at https://github.com/mnluzimu/FullStack‑Agent.
Authors:Xilong Wang, Yinuo Liu, Zhun Wang, Dawn Song, Neil Gong
Abstract:
Prompt injection attacks manipulate webpage content to cause web agents to execute attacker‑specified tasks instead of the user's intended ones. Existing methods for detecting and localizing such attacks achieve limited effectiveness, as their underlying assumptions often do not hold in the web‑agent setting. In this work, we propose WebSentinel, a two‑step approach for detecting and localizing prompt injection attacks in webpages. Given a webpage, Step I extracts \emphsegments of interest that may be contaminated, and Step II evaluates each segment by checking its consistency with the webpage content as context. We show that WebSentinel is highly effective, substantially outperforming baseline methods across multiple datasets of both contaminated and clean webpages that we collected. Our code is available at: https://github.com/wxl‑lxw/WebSentinel.
Authors:Jianhao Ruan, Zhihao Xu, Yiran Peng, Fashen Ren, Zhaoyang Yu, Xinbing Liang, Jinyu Xiang, Bang Liu, Chenglin Wu, Yuyu Luo, Jiayi Zhang
Abstract:
Language agents have shown strong promise for task automation. Realizing this promise for increasingly complex, long‑horizon tasks has driven the rise of a sub‑agent‑as‑tools paradigm for multi‑turn task solving. However, existing designs still lack a dynamic abstraction view of sub‑agents, thereby hurting adaptability. We address this challenge with a unified, framework‑agnostic agent abstraction that models any agent as a tuple Instruction, Context, Tools, Model. This tuple acts as a compositional recipe for capabilities, enabling the system to spawn specialized executors for each task on demand. Building on this abstraction, we introduce an agentic system AOrchestra, where the central orchestrator concretizes the tuple at each step: it curates task‑relevant context, selects tools and models, and delegates execution via on‑the‑fly automatic agent creation. Such designs enable reducing human engineering efforts, and remain framework‑agnostic with plug‑and‑play support for diverse agents as task executors. It also enables a controllable performance‑cost trade‑off, allowing the system to approach Pareto‑efficient. Across three challenging benchmarks (GAIA, SWE‑Bench, Terminal‑Bench), AOrchestra achieves 16.28% relative improvement against the strongest baseline when paired with Gemini‑3‑Flash. The code is available at: https://github.com/FoundationAgents/AOrchestra
Authors:Paolo Astrino
Abstract:
Organizations handling sensitive documents face a tension: cloud‑based AI risks GDPR violations, while local systems typically require 18‑32 GB RAM. This paper presents CUBO, a systems‑oriented RAG platform for consumer laptops with 16 GB shared memory. CUBO's novelty lies in engineering integration of streaming ingestion (O(1) buffer overhead), tiered hybrid retrieval, and hardware‑aware orchestration that enables competitive Recall@10 (0.48‑0.97 across BEIR domains) within a hard 15.5 GB RAM ceiling. The 37,000‑line codebase achieves retrieval latencies of 185 ms (p50) on C1,300 laptops while maintaining data minimization through local‑only processing aligned with GDPR Art. 5(1)(c). Evaluation on BEIR benchmarks validates practical deployability for small‑to‑medium professional archives. The codebase is publicly available at https://github.com/PaoloAstrino/CUBO.
Authors:Yubao Zhao, Weiquan Huang, Sudong Wang, Ruochen Zhao, Chen Chen, Yao Shu, Chengwei Qin
Abstract:
Agentic reinforcement learning has enabled large language models to perform complex multi‑turn planning and tool use. However, learning in long‑horizon settings remains challenging due to sparse, trajectory‑level outcome rewards. While prior tree‑based methods attempt to mitigate this issue, they often suffer from high variance and computational inefficiency. Through empirical analysis of search agents, We identify a common pattern: performance diverges mainly due to decisions near the tail. Motivated by this observation, we propose Branching Relative Policy Optimization (BranPO), a value‑free method that provides step‑level contrastive supervision without dense rewards. BranPO truncates trajectories near the tail and resamples alternative continuations to construct contrastive suffixes over shared prefixes, reducing credit ambiguity in long‑horizon rollouts. To further boost efficiency and stabilize training, we introduce difficulty‑aware branch sampling to adapt branching frequency across tasks, and redundant step masking to suppress uninformative actions. Extensive experiments on various question answering benchmarks demonstrate that BranPO consistently outperforms strong baselines, achieving significant accuracy gains on long‑horizon tasks without increasing the overall training budget. Our code is available at \hrefhttps://github.com/YubaoZhao/BranPOcode.
Authors:Duy Nguyen, Hanqi Xiao, Archiki Prasad, Elias Stengel-Eskin, Hyunji Lee, Mohit Bansal
Abstract:
Large language models (LLMs) rely on internal knowledge to solve many downstream tasks, making it crucial to keep them up to date. Since full retraining is expensive, prior work has explored efficient alternatives such as model editing and parameter‑efficient fine‑tuning. However, these approaches often break down in practice due to poor generalization across inputs, limited stability, and knowledge conflict. To address these limitations, we propose the CoRSA (Conflict‑Resolving and Sharpness‑Aware Minimization) training framework, a parameter‑efficient, holistic approach for knowledge editing with multiple updates. CoRSA tackles multiple challenges simultaneously: it improves generalization to different input forms and enhances stability across multiple updates by minimizing loss curvature, and resolves conflicts by maximizing the margin between new and prior knowledge. Across three widely used fact editing benchmarks, CoRSA achieves significant gains in generalization, outperforming baselines with average absolute improvements of 12.42% over LoRA and 10% over model editing methods. With multiple updates, it maintains high update efficacy while reducing catastrophic forgetting by 27.82% compared to LoRA. CoRSA also generalizes to the code domain, outperforming the strongest baseline by 5.48% Pass@5 in update efficacy.
Authors:Jiashuo Sun, Pengcheng Jiang, Saizhuo Wang, Jiajun Fan, Heng Wang, Siru Ouyang, Ming Zhong, Yizhu Jiao, Chengsong Huang, Xueqiang Xu, Pengrui Han, Peiran Li, Jiaxin Huang, Ge Liu, Heng Ji, Jiawei Han
Abstract:
Retrieval‑Augmented Generation (RAG) systems remain brittle under realistic retrieval noise, even when the required evidence appears in the top‑K results. A key reason is that retrievers and rerankers optimize solely for relevance, often selecting either trivial, answer‑revealing passages or evidence that lacks the critical information required to answer the question, without considering whether the evidence is suitable for the generator. We propose BAR‑RAG, which reframes the reranker as a boundary‑aware evidence selector that targets the generator's Goldilocks Zone ‑‑ evidence that is neither trivially easy nor fundamentally unanswerable for the generator, but is challenging yet sufficient for inference and thus provides the strongest learning signal. BAR‑RAG trains the selector with reinforcement learning using generator feedback, and adopts a two‑stage pipeline that fine‑tunes the generator under the induced evidence distribution to mitigate the distribution mismatch between training and inference. Experiments on knowledge‑intensive question answering benchmarks show that BAR‑RAG consistently improves end‑to‑end performance under noisy retrieval, achieving an average gain of 10.3 percent over strong RAG and reranking baselines while substantially improving robustness. Code is publicly avaliable at https://github.com/GasolSun36/BAR‑RAG.
Authors:Chao Huang, Yujing Lu, Quangang Li, Shenghe Wang, Yan Wang, Yueyang Zhang, Long Xia, Jiashu Zhao, Zhiyuan Sun, Daiting Shi, Tingwen Liu
Abstract:
Entropy regularization is a standard technique in reinforcement learning (RL) to enhance exploration, yet it yields negligible effects or even degrades performance in Large Language Models (LLMs). We attribute this failure to the cumulative tail risk inherent to LLMs with massive vocabularies and long generation horizons. In such environments, standard global entropy maximization indiscriminately dilutes probability mass into the vast tail of invalid tokens rather than focusing on plausible candidates, thereby disrupting coherent reasoning. To address this, we propose Trust Region Entropy (TRE), a method that encourages exploration strictly within the model's trust region. Extensive experiments across mathematical reasoning (MATH), combinatorial search (Countdown), and preference alignment (HH) tasks demonstrate that TRE consistently outperforms vanilla PPO, standard entropy regularization, and other exploration baselines. Our code is available at https://github.com/WhyChaos/TRE‑Encouraging‑Exploration‑in‑the‑Trust‑Region.
Authors:Yuqin Dai, Ning Gao, Wei Zhang, Jie Wang, Zichen Luo, Jinpeng Wang, Yujie Wang, Ruiyuan Wu, Chaozheng Wang
Abstract:
Large Language Models have demonstrated remarkable capabilities in open‑domain dialogues. However, current methods exhibit suboptimal performance in service dialogues, as they rely on noisy, low‑quality human conversation data. This limitation arises from data scarcity and the difficulty of simulating authentic, goal‑oriented user behaviors. To address these issues, we propose SEAD (Self‑Evolving Agent for Service Dialogue), a framework that enables agents to learn effective strategies without large‑scale human annotations. SEAD decouples user modeling into two components: a Profile Controller that generates diverse user states to manage training curriculum, and a User Role‑play Model that focuses on realistic role‑playing. This design ensures the environment provides adaptive training scenarios rather than acting as an unfair adversary. Experiments demonstrate that SEAD significantly outperforms Open‑source Foundation Models and Closed‑source Commercial Models, improving task completion rate by 17.6% and dialogue efficiency by 11.1%. Code is available at: https://github.com/Da1yuqin/SEAD.
Authors:Runquan Gui, Yafu Li, Xiaoye Qu, Ziyan Liu, Yeqiu Cheng, Yu Cheng
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has markedly improved the performance of Large Language Models (LLMs) on tasks requiring multi‑step reasoning. However, most RLVR pipelines rely on sparse outcome‑based rewards, providing little supervision over intermediate steps and thus encouraging over‑confidence and spurious reasoning, which in turn increases hallucinations. To address this, we propose FaithRL, a general reinforcement learning framework that directly optimizes reasoning faithfulness. We formalize a faithfulness‑maximization objective and theoretically show that optimizing it mitigates over‑confidence. To instantiate this objective, we introduce a geometric reward design and a faithfulness‑aware advantage modulation mechanism that assigns step‑level credit by penalizing unsupported steps while preserving valid partial derivations. Across diverse backbones and benchmarks, FaithRL consistently reduces hallucination rates while maintaining (and often improving) answer correctness. Further analysis confirms that FaithRL increases step‑wise reasoning faithfulness and generalizes robustly. Our code is available at https://github.com/aintdoin/FaithRL.
Authors:Mingxuan Du, Benfeng Xu, Chiwei Zhu, Shaohan Wang, Pengyu Wang, Xiaorui Wang, Zhendong Mao
Abstract:
Frontier language models have demonstrated strong reasoning and long‑horizon tool‑use capabilities. However, existing RAG systems fail to leverage these capabilities. They still rely on two paradigms: (1) designing an algorithm that retrieves passages in a single shot and concatenates them into the model's input, or (2) predefining a workflow and prompting the model to execute it step‑by‑step. Neither paradigm allows the model to participate in retrieval decisions, preventing efficient scaling with model improvements. In this paper, we introduce A‑RAG, an Agentic RAG framework that exposes hierarchical retrieval interfaces directly to the model. A‑RAG provides three retrieval tools: keyword search, semantic search, and chunk read, enabling the agent to adaptively search and retrieve information across multiple granularities. Experiments on multiple open‑domain QA benchmarks show that A‑RAG consistently outperforms existing approaches with comparable or lower retrieved tokens, demonstrating that A‑RAG effectively leverages model capabilities and dynamically adapts to different RAG tasks. We further systematically study how A‑RAG scales with model size and test‑time compute. We will release our code and evaluation suite to facilitate future research. Code and evaluation suite are available at https://github.com/Ayanami0730/arag.
Authors:Shuang Sun, Huatong Song, Lisheng Huang, Jinhao Jiang, Ran Le, Zhihao Lv, Zongchao Chen, Yiwen Hu, Wenyang Luo, Wayne Xin Zhao, Yang Song, Hongteng Xu, Tao Zhang, Ji-Rong Wen
Abstract:
Recent advances in large language models (LLMs) have enabled software engineering agents to tackle complex code modification tasks. Most existing approaches rely on execution feedback from containerized environments, which require dependency‑complete setup and physical execution of programs and tests. While effective, this paradigm is resource‑intensive and difficult to maintain, substantially complicating agent training and limiting scalability. We propose SWE‑World, a Docker‑free framework that replaces physical execution environments with a learned surrogate for training and evaluating software engineering agents. SWE‑World leverages LLM‑based models trained on real agent‑environment interaction data to predict intermediate execution outcomes and final test feedback, enabling agents to learn without interacting with physical containerized environments. This design preserves the standard agent‑environment interaction loop while eliminating the need for costly environment construction and maintenance during agent optimization and evaluation. Furthermore, because SWE‑World can simulate the final evaluation outcomes of candidate trajectories without real submission, it enables selecting the best solution among multiple test‑time attempts, thereby facilitating effective test‑time scaling (TTS) in software engineering tasks. Experiments on SWE‑bench Verified demonstrate that SWE‑World raises Qwen2.5‑Coder‑32B from 6.2% to 52.0% via Docker‑free SFT, 55.0% with Docker‑free RL, and 68.2% with further TTS. The code is available at https://github.com/RUCAIBox/SWE‑World
Authors:Huatong Song, Lisheng Huang, Shuang Sun, Jinhao Jiang, Ran Le, Daixuan Cheng, Guoxin Chen, Yiwen Hu, Zongchao Chen, Wayne Xin Zhao, Yang Song, Tao Zhang, Ji-Rong Wen
Abstract:
In this technical report, we present SWE‑Master, an open‑source and fully reproducible post‑training framework for building effective software engineering agents. SWE‑Master systematically explores the complete agent development pipeline, including teacher‑trajectory synthesis and data curation, long‑horizon SFT, RL with real execution feedback, and inference framework design. Starting from an open‑source base model with limited initial SWE capability, SWE‑Master demonstrates how systematical optimization method can elicit strong long‑horizon SWE task solving abilities. We evaluate SWE‑Master on SWE‑bench Verified, a standard benchmark for realistic software engineering tasks. Under identical experimental settings, our approach achieves a resolve rate of 61.4% with Qwen2.5‑Coder‑32B, substantially outperforming existing open‑source baselines. By further incorporating test‑time scaling~(TTS) with LLM‑based environment feedback, SWE‑Master reaches 70.8% at TTS@8, demonstrating a strong performance potential. SWE‑Master provides a practical and transparent foundation for advancing reproducible research on software engineering agents. The code is available at https://github.com/RUCAIBox/SWE‑Master.
Authors:Ning Ding, Fangcheng Liu, Kyungrae Kim, Linji Hao, Kyeng-Hun Lee, Hyeonmok Ko, Yehui Tang
Abstract:
Scaling Large Language Models (LLMs) typically relies on increasing the number of parameters or test‑time computations to boost performance. However, these strategies are impractical for edge device deployment due to limited RAM and NPU resources. Despite hardware constraints, deploying performant LLM on edge devices such as smartphone remains crucial for user experience. To address this, we propose MeKi (Memory‑based Expert Knowledge Injection), a novel system that scales LLM capacity via storage space rather than FLOPs. MeKi equips each Transformer layer with token‑level memory experts that injects pre‑stored semantic knowledge into the generation process. To bridge the gap between training capacity and inference efficiency, we employ a re‑parameterization strategy to fold parameter matrices used during training into a compact static lookup table. By offloading the knowledge to ROM, MeKi decouples model capacity from computational cost, introducing zero inference latency overhead. Extensive experiments demonstrate that MeKi significantly outperforms dense LLM baselines with identical inference speed, validating the effectiveness of memory‑based scaling paradigm for on‑device LLMs. Project homepage is at https://github.com/ningding‑o/MeKi.
Authors:Wenquan Lu, Hai Huang, Randall Balestriero
Abstract:
Reinforcement learning algorithms such as group‑relative policy optimization (GRPO) have demonstrated strong potential for improving the mathematical reasoning capabilities of large language models. However, prior work has consistently observed an entropy collapse phenomenon during reinforcement post‑training, characterized by a monotonic decrease in policy entropy that ultimately leads to training instability and collapse. As a result, most existing approaches restrict training to short horizons (typically 5‑20 epochs), limiting sustained exploration and hindering further policy improvement. In addition, nearly all prior work relies on a single, fixed reasoning prompt or template during training. In this work, we introduce prompt augmentation, a training strategy that instructs the model to generate reasoning traces under diverse templates and formats, thereby increasing rollout diversity. We show that, without a KL regularization term, prompt augmentation enables stable scaling of training duration under a fixed dataset and allows the model to tolerate low‑entropy regimes without premature collapse. Empirically, a Qwen2.5‑Math‑1.5B model trained with prompt augmentation on the MATH Level 3‑5 dataset achieves state‑of‑the‑art performance, reaching 45.2 per‑benchmark accuracy and 51.8 per‑question accuracy on standard mathematical reasoning benchmarks, including AIME24, AMC, MATH500, Minerva, and OlympiadBench. The code and model checkpoints are available at https://github.com/wenquanlu/prompt‑augmentation‑GRPO.
Authors:Baohao Liao, Hanze Dong, Xinxing Xu, Christof Monz, Jiang Bian
Abstract:
Group Relative Policy Optimization (GRPO) has recently emerged as a practical recipe for aligning large language models with verifiable objectives. However, under sparse terminal rewards, GRPO often stalls because rollouts within a group frequently receive identical rewards, causing relative advantages to collapse and updates to vanish. We propose self‑hint aligned GRPO with privileged supervision (SAGE), an on‑policy reinforcement learning framework that injects privileged hints during training to reshape the rollout distribution under the same terminal verifier reward. For each prompt x, the model samples a compact hint h (e.g., a plan or decomposition) and then generates a solution τ conditioned on (x,h). Crucially, the task reward R(x,τ) is unchanged; hints only increase within‑group outcome diversity under finite sampling, preventing GRPO advantages from collapsing under sparse rewards. At test time, we set h=\varnothing and deploy the no‑hint policy without any privileged information. Moreover, sampling diverse self‑hints serves as an adaptive curriculum that tracks the learner's bottlenecks more effectively than fixed hints from an initial policy or a stronger external model. Experiments over 6 benchmarks with 3 LLMs show that SAGE consistently outperforms GRPO, on average +2.0 on Llama‑3.2‑3B‑Instruct, +1.2 on Qwen2.5‑7B‑Instruct and +1.3 on Qwen3‑4B‑Instruct. The code is available at https://github.com/BaohaoLiao/SAGE.
Authors:Zhitao Gao, Jie Ma, Xuhong Li, Pengyu Li, Ning Qu, Yaqiang Wu, Hui Liu, Jun Liu
Abstract:
Large Language Models (LLMs) have achieved significant success in complex reasoning but remain bottlenecked by reliance on expert‑annotated data and external verifiers. While existing self‑evolution paradigms aim to bypass these constraints, they often fail to identify the optimal learning zone and risk reinforcing collective hallucinations and incorrect priors through flawed internal feedback. To address these challenges, we propose \underlineAutonomous \underlineEvolutionary \underlineReasoning \underlineOptimization (AERO), an unsupervised framework that achieves autonomous reasoning evolution by internalizing self‑questioning, answering, and criticism within a synergistic dual‑loop system. Inspired by the Zone of Proximal Development (ZPD) theory, AERO utilizes entropy‑based positioning to target the ``solvability gap'' and employs Independent Counterfactual Correction for robust verification. Furthermore, we introduce a Staggered Training Strategy to synchronize capability growth across functional roles and prevent curriculum collapse. Extensive evaluations across nine benchmarks spanning three domains demonstrate that AERO achieves average performance improvements of 4.57% on Qwen3‑4B‑Base and 5.10% on Qwen3‑8B‑Base, outperforming competitive baselines. Code is available at https://github.com/mira‑ai‑lab/AERO.
Authors:Vishal Venkataramani, Haizhou Shi, Zixuan Ke, Austin Xu, Xiaoxiao He, Yingbo Zhou, Semih Yavuz, Hao Wang, Shafiq Joty
Abstract:
Multi‑Agent Systems (MAS) built on Large Language Models (LLMs) often exhibit high variance in their reasoning trajectories. Process verification, which evaluates intermediate steps in trajectories, has shown promise in general reasoning settings, and has been suggested as a potential tool for guiding coordination of MAS; however, its actual effectiveness in MAS remains unclear. To fill this gap, we present MAS‑ProVe, a systematic empirical study of process verification for multi‑agent systems (MAS). Our study spans three verification paradigms (LLM‑as‑a‑Judge, reward models, and process reward models), evaluated across two levels of verification granularity (agent‑level and iteration‑level). We further examine five representative verifiers and four context management strategies, and conduct experiments over six diverse MAS frameworks on multiple reasoning benchmarks. We find that process‑level verification does not consistently improve performance and frequently exhibits high variance, highlighting the difficulty of reliably evaluating partial multi‑agent trajectories. Among the methods studied, LLM‑as‑a‑Judge generally outperforms reward‑based approaches, with trained judges surpassing general‑purpose LLMs. We further observe a small performance gap between LLMs acting as judges and as single agents, and identify a context‑length‑performance trade‑off in verification. Overall, our results suggest that effective and robust process verification for MAS remains an open challenge, requiring further advances beyond current paradigms. Code is available at https://github.com/Wang‑ML‑Lab/MAS‑ProVe.
Authors:Punya Syon Pandey, Zhijing Jin
Abstract:
Supervised fine‑tuning (SFT) is the standard approach for binary classification tasks such as toxicity detection, factuality verification, and causal inference. However, SFT often performs poorly in real‑world settings with label noise, class imbalance, or sparse supervision. We introduce BinaryPPO, an offline reinforcement learning large language model (LLM) framework that reformulates binary classification as a reward maximization problem. Our method leverages a variant of Proximal Policy Optimization (PPO) with a confidence‑weighted reward function that penalizes uncertain or incorrect predictions, enabling the model to learn robust decision policies from static datasets without online interaction. Across eight domain‑specific benchmarks and multiple models with differing architectures, BinaryPPO improves accuracy by 40‑60 percentage points, reaching up to 99%, substantially outperforming supervised baselines. We provide an in‑depth analysis of the role of reward shaping, advantage scaling, and policy stability in enabling this improvement. Overall, we demonstrate that confidence‑based reward design provides a robust alternative to SFT for binary classification. Our code is available at https://github.com/psyonp/BinaryPPO.
Authors:Tianle Gu, Kexin Huang, Lingyu Li, Ruilin Luo, Shiyang Huang, Zongqi Wang, Yujiu Yang, Yan Teng, Yingchun Wang
Abstract:
Safety moderation is pivotal for identifying harmful content. Despite the success of textual safety moderation, its multimodal counterparts remain hindered by a dual sparsity of data and supervision. Conventional reliance on binary labels lead to shortcut learning, which obscures the intrinsic classification boundaries necessary for effective multimodal discrimination. Hence, we propose a novel learning paradigm (UniMod) that transitions from sparse decision‑making to dense reasoning traces. By constructing structured trajectories encompassing evidence grounding, modality assessment, risk mapping, policy decision, and response generation, we reformulate monolithic decision tasks into a multi‑dimensional boundary learning process. This approach forces the model to ground its decision in explicit safety semantics, preventing the model from converging on superficial shortcuts. To facilitate this paradigm, we develop a multi‑head scalar reward model (UniRM). UniRM provides multi‑dimensional supervision by assigning attribute‑level scores to the response generation stage. Furthermore, we introduce specialized optimization strategies to decouple task‑specific parameters and rebalance training dynamics, effectively resolving interference between diverse objectives in multi‑task learning. Empirical results show UniMod achieves competitive textual moderation performance and sets a new multimodal benchmark using less than 40% of the training data used by leading baselines. Ablations further validate our multi‑attribute trajectory reasoning, offering an effective and efficient framework for multimodal moderation. Supplementary materials are available at \hrefhttps://trustworthylab.github.io/UniMod/project website.
Authors:Yuming Zhao, Peiyi Zhang, Oana Ignat
Abstract:
Memes are a pervasive form of online communication, yet their cultural specificity poses significant challenges for cross‑cultural adaptation. We study cross‑cultural meme transcreation, a multimodal generation task that aims to preserve communicative intent and humor while adapting culture‑specific references. We propose a hybrid transcreation framework based on vision‑language models and introduce a large‑scale bidirectional dataset of Chinese and US memes. Using both human judgments and automated evaluation, we analyze 6,315 meme pairs and assess transcreation quality across cultural directions. Our results show that current vision‑language models can perform cross‑cultural meme transcreation to a limited extent, but exhibit clear directional asymmetries: US‑Chinese transcreation consistently achieves higher quality than Chinese‑US. We further identify which aspects of humor and visual‑textual design transfer across cultures and which remain challenging, and propose an evaluation framework for assessing cross‑cultural multimodal generation. Our code and dataset are publicly available at https://github.com/AIM‑SCU/MemeXGen.
Authors:Yunao Zheng, Xiaojie Wang, Lei Ren, Wei Chen
Abstract:
Long‑context capability and computational efficiency are among the central challenges facing today's large language models. Existing efficient attention methods reduce computational complexity, but they typically suffer from a limited coverage of the model state. This paper proposes ROSA‑Tuning, a retrieval‑and‑recall mechanism for enhancing the long‑context modeling ability of pretrained models. Beyond the standard attention mechanism, ROSA‑Tuning leverages in parallel a CPU‑based ROSA (RWKV Online Suffix Automaton) retrieval module, which efficiently locates historical positions in long contexts that are relevant to the current query, and injects the retrieved information into the model state in a trainable manner; subsequent weighted fusion can then be handled by range‑restricted attention. To enable end‑to‑end training, we employ the binary discretization strategy and the counterfactual gradient algorithm, and further optimize overall execution efficiency via an asynchronous CPU‑GPU pipeline. Systematic evaluations on Qwen3‑Base‑1.7B show that ROSA‑Tuning substantially restores the long‑context modeling ability of windowed‑attention models, achieving performance close to and in some cases matching global attention on benchmarks such as LongBench, while maintaining computational efficiency and GPU memory usage that are nearly comparable to windowed‑attention methods, offering a new technical path for efficient long‑context processing. The example code can be found at https://github.com/zyaaa‑ux/ROSA‑Tuning.
Authors:Yinjie Wang, Tianbao Xie, Ke Shen, Mengdi Wang, Ling Yang
Abstract:
We propose RLAnything, a reinforcement learning framework that dynamically forges environment, policy, and reward models through closed‑loop optimization, amplifying learning signals and strengthening the overall RL system for any LLM or agentic scenarios. Specifically, the policy is trained with integrated feedback from step‑wise and outcome signals, while the reward model is jointly optimized via consistency feedback, which in turn further improves policy training. Moreover, our theory‑motivated automatic environment adaptation improves training for both the reward and policy models by leveraging critic feedback from each, enabling learning from experience. Empirically, each added component consistently improves the overall system, and RLAnything yields substantial gains across various representative LLM and agentic tasks, boosting Qwen3‑VL‑8B‑Thinking by 9.1% on OSWorld and Qwen2.5‑7B‑Instruct by 18.7% and 11.9% on AlfWorld and LiveBench, respectively. We also that optimized reward‑model signals outperform outcomes that rely on human labels. Code: https://github.com/Gen‑Verse/Open‑AgentRL
Authors:Haozhen Zhang, Quanyu Long, Jianzhu Bao, Tao Feng, Weizhi Zhang, Haodong Yue, Wenya Wang
Abstract:
Most Large Language Model (LLM) agent memory systems rely on a small set of static, hand‑designed operations for extracting memory. These fixed procedures hard‑code human priors about what to store and how to revise memory, making them rigid under diverse interaction patterns and inefficient on long histories. To this end, we present MemSkill, which reframes these operations as learnable and evolvable memory skills, structured and reusable routines for extracting, consolidating, and pruning information from interaction traces. Inspired by the design philosophy of agent skills, MemSkill employs a \emphcontroller that learns to select a small set of relevant skills, paired with an LLM‑based \emphexecutor that produces skill‑guided memories. Beyond learning skill selection, MemSkill introduces a \emphdesigner that periodically reviews hard cases where selected skills yield incorrect or incomplete memories, and evolves the skill set by proposing refinements and new skills. Together, MemSkill forms a closed‑loop procedure that improves both the skill‑selection policy and the skill set itself. Experiments on LoCoMo, LongMemEval, HotpotQA, and ALFWorld demonstrate that MemSkill improves task performance over strong baselines and generalizes well across settings. Further analyses shed light on how skills evolve, offering insights toward more adaptive, self‑evolving memory management for LLM agents.
Authors:Ziwen Xu, Chenyan Wu, Hengyu Sun, Haiwen Hong, Mengru Wang, Yunzhi Yao, Longtao Huang, Hui Xue, Shumin Deng, Zhixuan Chu, Huajun Chen, Ningyu Zhang
Abstract:
Methods for controlling large language models (LLMs), including local weight fine‑tuning, LoRA‑based adaptation, and activation‑based interventions, are often studied in isolation, obscuring their connections and making comparison difficult. In this work, we present a unified view that frames these interventions as dynamic weight updates induced by a control signal, placing them within a single conceptual framework. Building on this view, we propose a unified preference‑utility analysis that separates control effects into preference, defined as the tendency toward a target concept, and utility, defined as coherent and task‑valid generation, and measures both on a shared log‑odds scale using polarity‑paired contrastive examples. Across methods, we observe a consistent trade‑off between preference and utility: stronger control increases preference while predictably reducing utility. We further explain this behavior through an activation manifold perspective, in which control shifts representations along target‑concept directions to enhance preference, while utility declines primarily when interventions push representations off the model's valid‑generation manifold. Finally, we introduce a new steering approach SPLIT guided by this analysis that improves preference while better preserving utility. Code is available at https://github.com/zjunlp/EasyEdit/blob/main/examples/SPLIT.md.
Authors:Min Cai, Yu Liang, Longzheng Wang, Yan Wang, Yueyang Zhang, Long Xia, Zhiyuan Sun, Xi Ye, Daiting Shi
Abstract:
Reinforcement learning (RL) has played a central role in recent advances in large reasoning models (LRMs), yielding strong gains in verifiable and open‑ended reasoning. However, training a single general‑purpose LRM across diverse domains remains challenging due to pronounced domain heterogeneity. Through a systematic study of two widely used strategies, Sequential RL and Mixed RL, we find that both incur substantial cross‑domain interference at the behavioral and gradient levels, resulting in limited overall gains. To address these challenges, we introduce Modular Gradient Surgery (MGS), which resolves gradient conflicts at the module level within the transformer. When applied to Llama and Qwen models, MGS achieves average improvements of 4.3 (16.6%) and 4.5 (11.1%) points, respectively, over standard multi‑task RL across three representative domains (math, general chat, and instruction following). Further analysis demonstrates that MGS remains effective under prolonged training. Overall, our study clarifies the sources of interference in multi‑domain RL and presents an effective solution for training general‑purpose LRMs.
Authors:Atharv Sonwane, Eng-Shen Tu, Wei-Chung Lu, Claas Beger, Carter Larsen, Debjit Dhar, Simon Alford, Rachel Chen, Ronit Pattanayak, Tuan Anh Dang, Guohao Chen, Gloria Geng, Kevin Ellis, Saikat Dutta
Abstract:
LLM‑powered coding agents are redefining how real‑world software is developed. To drive the research towards better coding agents, we require challenging benchmarks that can rigorously evaluate the ability of such agents to perform various software engineering tasks. However, popular coding benchmarks such as HumanEval and SWE‑Bench focus on narrowly scoped tasks such as competition programming and patch generation. In reality, software engineers have to handle a broader set of tasks for real‑world software development. To address this gap, we propose OmniCode, a novel software engineering benchmark that contains a broader and more diverse set of task categories beyond code or patch generation. Overall, OmniCode contains 1794 tasks spanning three programming languages (Python, Java, and C++) and four key categories: bug fixing, test generation, code review fixing, and style fixing. In contrast to prior software engineering benchmarks, the tasks in OmniCode are (1) manually validated to eliminate ill‑defined problems, and (2) synthetically crafted or recently curated to avoid data leakage issues, presenting a new framework for synthetically generating diverse software tasks from limited real‑world data. We evaluate OmniCode with popular agent frameworks such as SWE‑Agent and show that while they may perform well on bug fixing for Python, they fall short on tasks such as Test Generation and in languages such as C++ and Java. For instance, SWE‑Agent achieves a maximum of 20.9% with DeepSeek‑V3.1 on Java Test Generation tasks. OmniCode aims to serve as a robust benchmark and spur the development of agents that can perform well across different aspects of software development. Code and data are available at https://github.com/seal‑research/OmniCode.
Authors:Yu Zeng, Wenxuan Huang, Zhen Fang, Shuang Chen, Yufan Shen, Yishuo Cai, Xiaoman Wang, Zhenfei Yin, Lin Chen, Zehui Chen, Shiting Huang, Yiming Zhao, Xu Tang, Yao Hu, Philip Torr, Wanli Ouyang, Shaosheng Cao
Abstract:
Multimodal Large Language Models (MLLMs) have advanced VQA and now support Vision‑DeepResearch systems that use search engines for complex visual‑textual fact‑finding. However, evaluating these visual and textual search abilities is still difficult, and existing benchmarks have two major limitations. First, existing benchmarks are not visual search‑centric: answers that should require visual search are often leaked through cross‑textual cues in the text questions or can be inferred from the prior world knowledge in current MLLMs. Second, overly idealized evaluation scenario: On the image‑search side, the required information can often be obtained via near‑exact matching against the full image, while the text‑search side is overly direct and insufficiently challenging. To address these issues, we construct the Vision‑DeepResearch benchmark (VDR‑Bench) comprising 2,000 VQA instances. All questions are created via a careful, multi‑stage curation pipeline and rigorous expert review, designed to assess the behavior of Vision‑DeepResearch systems under realistic real‑world conditions. Moreover, to address the insufficient visual retrieval capabilities of current MLLMs, we propose a simple multi‑round cropped‑search workflow. This strategy is shown to effectively improve model performance in realistic visual retrieval scenarios. Overall, our results provide practical guidance for the design of future multimodal deep‑research systems. The code will be released in https://github.com/Osilly/Vision‑DeepResearch.
Authors:Liang Lin, Feng Xiong, Zengbin Wang, Kun Wang, Junhao Dong, Xuecai Hu, Yong Wang, Xiangxiang Chu
Abstract:
Diffusion Large Language Models (DLLMs) have emerged as a powerful alternative to autoregressive models, enabling parallel token generation across multiple positions. However, preference alignment of DLLMs remains challenging due to high variance introduced by Evidence Lower Bound (ELBO)‑based likelihood estimation. In this work, we propose AR‑MAP, a novel transfer learning framework that leverages preference‑aligned autoregressive LLMs (AR‑LLMs) as implicit teachers for DLLM alignment. We reveal that DLLMs can effectively absorb alignment knowledge from AR‑LLMs through simple weight scaling, exploiting the shared architectural structure between these divergent generation paradigms. Crucially, our approach circumvents the high variance and computational overhead of direct DLLM alignment and comprehensive experiments across diverse preference alignment tasks demonstrate that AR‑MAP achieves competitive or superior performance compared to existing DLLM‑specific alignment methods, achieving 69.08% average score across all tasks and models. Our Code is available at https://github.com/AMAP‑ML/AR‑MAP.
Authors:Pawel Batorski, Paul Swoboda
Abstract:
Machine unlearning aims to unlearn specified training data (e.g. sensitive or copyrighted material). A prominent approach is to fine‑tune an existing model with an unlearning loss that retains overall utility. The space of suitable unlearning loss functions is vast, making the search for an optimal loss function daunting. Additionally, there might not even exist a universally optimal loss function: differences in the structure and overlap of the forget and retain data can cause a loss to work well in one setting but over‑unlearn or under‑unlearn in another. Our approach EvoMU tackles these two challenges simultaneously. An evolutionary search procedure automatically finds task‑specific losses in the vast space of possible unlearning loss functions. This allows us to find dataset‑specific losses that match or outperform existing losses from the literature, without the need for a human‑in‑the‑loop. This work is therefore an instance of automatic scientific discovery, a.k.a. an AI co‑scientist. In contrast to previous AI co‑scientist works, we do so on a budget: We achieve SotA results using a small 4B parameter model (Qwen3‑4B‑Thinking), showing the potential of AI co‑scientists with limited computational resources. Our experimental evaluation shows that we surpass previous loss‑based unlearning formulations on TOFU‑5%, TOFU‑10%, MUSE and WMDP by synthesizing novel unlearning losses. Our code is available at https://github.com/Batorskq/EvoMU.
Authors:Wenhao Li, Daohai Yu, Gen Luo, Yuxin Zhang, Fei Chao, Rongrong Ji, Yifan Wu, Jiaxin Liu, Ziyang Gong, Zimu Liao
Abstract:
Training Large Language Models (LLMs) on long contexts is severely constrained by prohibitive GPU memory overhead, not training time. The primary culprits are the activations, whose memory footprints scale linearly with sequence length. We introduce OOMB, a highly memory‑efficient training system that directly confronts this barrier. Our approach employs a chunk‑recurrent training framework with on‑the‑fly activation recomputation, which maintains a constant activation memory footprint (O(1)) and shifts the primary bottleneck to the growing KV cache. To manage the KV cache, OOMB integrates a suite of synergistic optimizations: a paged memory manager for both the KV cache and its gradients to eliminate fragmentation, asynchronous CPU offloading to hide data transfer latency, and page‑level sparse attention to reduce both computational complexity and communication overhead. The synergy of these techniques yields exceptional efficiency. Our empirical results show that for every additional 10K tokens of context, the end‑to‑end training memory overhead increases by a mere 10MB for Qwen2.5‑7B. This allows training Qwen2.5‑7B with a 4M‑token context on a single H200 GPU, a feat that would otherwise require a large cluster using context parallelism. This work represents a substantial advance in resource efficiency for long‑context LLM training. The source code is available at https://github.com/wenhaoli‑xmu/OOMB.
Authors:Liyan Xu, Mo Yu, Fandong Meng, Jie Zhou
Abstract:
Chain‑of‑thought (CoT) reasoning has become a central mechanism for eliciting multi‑step reasoning in Large Language Models (LLMs). Yet recent evidence presents a tension: hidden states appear to already encode future reasoning before CoT fully unfolds, while explicit steps still remain crucial for tasks requiring compositional computation. To deepen the understanding between LLM's internal states and its verbalized reasoning trajectories, we investigate the latent planning strength of LLMs, through our probing method, Tele‑Lens, applying to hidden states across diverse task domains. Our empirical results indicate that LLMs exhibit a myopic horizon, primarily conducting incremental transitions without precise global planning. Leveraging this characteristic, we propose a hypothesis on enhancing uncertainty estimation of CoT, which we validate that a sparse set of pivot positions can effectively represent the uncertainty of the entire path. We further underscore the significance of exploiting CoT dynamics, and demonstrate that automatic recognition of CoT bypass can be achieved without performance degradation. Our code, data and models are released at https://github.com/lxucs/tele‑lens.
Authors:Pengyu Wang, Benfeng Xu, Licheng Zhang, Shaohan Wang, Mingxuan Du, Chiwei Zhu, Zhendong Mao
Abstract:
Graph‑based Retrieval‑Augmented Generation (GraphRAG) organizes external knowledge as a hierarchical graph, enabling efficient retrieval and aggregation of scattered evidence across multiple documents. However, many existing benchmarks for GraphRAG rely on short, curated passages as external knowledge, failing to adequately evaluate systems in realistic settings involving long contexts and large‑scale heterogeneous documents. To bridge this gap, we introduce WildGraphBench, a benchmark designed to assess GraphRAG performance in the wild. We leverage Wikipedia's unique structure, where cohesive narratives are grounded in long and heterogeneous external reference documents, to construct a benchmark reflecting real‑word scenarios. Specifically, we sample articles across 12 top‑level topics, using their external references as the retrieval corpus and citation‑linked statements as ground truth, resulting in 1,100 questions spanning three levels of complexity: single‑fact QA, multi‑fact QA, and section‑level summarization. Experiments across multiple baselines reveal that current GraphRAG pipelines help on multi‑fact aggregation when evidence comes from a moderate number of sources, but this aggregation paradigm may overemphasize high‑level statements at the expense of fine‑grained details, leading to weaker performance on summarization tasks. Project page:https://github.com/BstWPY/WildGraphBench.
Authors:Zhanghao Hu, Qinglin Zhu, Hanqi Yan, Yulan He, Lin Gui
Abstract:
Agent memory systems often adopt the standard Retrieval‑Augmented Generation (RAG) pipeline, yet its underlying assumptions differ in this setting. RAG targets large, heterogeneous corpora where retrieved passages are diverse, whereas agent memory is a bounded, coherent dialogue stream with highly correlated spans that are often duplicates. Under this shift, fixed top‑k similarity retrieval tends to return redundant context, and post‑hoc pruning can delete temporally linked prerequisites needed for correct reasoning. We argue retrieval should move beyond similarity matching and instead operate over latent components, following decoupling to aggregation: disentangle memories into semantic components, organise them into a hierarchy, and use this structure to drive retrieval. We propose xMemory, which builds a hierarchy of intact units and maintains a searchable yet faithful high‑level node organisation via a sparsity‑‑semantics objective that guides memory split and merge. At inference, xMemory retrieves top‑down, selecting a compact, diverse set of themes and semantics for multi‑fact queries, and expanding to episodes and raw messages only when it reduces the reader's uncertainty. Experiments on LoCoMo and PerLTQA across the three latest LLMs show consistent gains in answer quality and token efficiency.
Authors:Yanrui Du, Yibo Gao, Sendong Zhao, Jiayun Li, Haochun Wang, Qika Lin, Kai He, Bing Qin, Mengling Feng
Abstract:
R1‑style LLMs have attracted growing attention for their capacity for self‑reflection, yet the internal mechanisms underlying such behavior remain unclear. To bridge this gap, we anchor on the onset of reflection behavior and trace its layer‑wise activation trajectory. Using the logit lens to read out token‑level semantics, we uncover a structured progression: (i) Latent‑control layers, where an approximate linear direction encodes the semantics of thinking budget; (ii) Semantic‑pivot layers, where discourse‑level cues, including turning‑point and summarization cues, surface and dominate the probability mass; and (iii) Behavior‑overt layers, where the likelihood of reflection‑behavior tokens begins to rise until they become highly likely to be sampled. Moreover, our targeted interventions uncover a causal chain across these stages: prompt‑level semantics modulate the projection of activations along latent‑control directions, thereby inducing competition between turning‑point and summarization cues in semantic‑pivot layers, which in turn regulates the sampling likelihood of reflection‑behavior tokens in behavior‑overt layers. Collectively, our findings suggest a human‑like meta‑cognitive process‑progressing from latent monitoring, to discourse‑level regulation, and to finally overt self‑reflection. Our analysis code can be found at https://github.com/DYR1/S3‑CoT.
Authors:Abdelrahman Mansour, Khaled W. Alshaer, Moataz Elsaban
Abstract:
Extracting structured data from the web is often a trade‑off between the brittle nature of manual heuristics and the prohibitive cost of Large Language Models. We introduce AXE (Adaptive X‑Path Extractor), a pipeline that rethinks this process by treating the HTML DOM as a tree that needs pruning rather than just a wall of text to be read. AXE uses a specialized "pruning" mechanism to strip away boilerplate and irrelevant nodes, leaving behind a distilled, high‑density context that allows a tiny 0.6B LLM to generate precise, structured outputs. To keep the model honest, we implement Grounded XPath Resolution (GXR), ensuring every extraction is physically traceable to a source node. Despite its low footprint, AXE achieves state‑of‑the‑art zero‑shot performance, outperforming several much larger, fully‑trained alternatives with an F1 score of 88.1% on the SWDE dataset. By releasing our specialized adaptors, we aim to provide a practical, cost‑effective path for large‑scale web information extraction. Our code and adaptors are publicly available at https://github.com/abdo‑Mansour/axetract.
Authors:Yuling Shi, Chaoxiang Xie, Zhensu Sun, Yeheng Chen, Chenxu Zhang, Longfei Yun, Chengcheng Wan, Hongyu Zhang, David Lo, Xiaodong Gu
Abstract:
Large Language Models (LLMs) have achieved remarkable success in source code understanding, yet as software systems grow in scale, computational efficiency has become a critical bottleneck. Currently, these models rely on a text‑based paradigm that treats source code as a linear sequence of tokens, which leads to a linear increase in context length and associated computational costs. The rapid advancement of Multimodal LLMs (MLLMs) introduces an opportunity to optimize efficiency by representing source code as rendered images. Unlike text, which is difficult to compress without losing semantic meaning, the image modality is inherently suitable for compression. By adjusting resolution, images can be scaled to a fraction of their original token cost while remaining recognizable to vision‑capable models. To explore the feasibility of this approach, we conduct the first systematic study on the effectiveness of MLLMs for code understanding. Our experiments reveal that: (1) MLLMs can effectively understand code with substantial token reduction, achieving up to 8x compression; (2) MLLMs can effectively leverage visual cues such as syntax highlighting, improving code completion performance under 4x compression; and (3) Code‑understanding tasks like clone detection exhibit exceptional resilience to visual compression, with some compression ratios even slightly outperforming raw text inputs. Our findings highlight both the potential and current limitations of MLLMs in code understanding, which points out a shift toward image‑modality code representation as a pathway to more efficient inference.
Authors:Wenhui Tan, Fiorenzo Parascandolo, Enver Sangineto, Jianzhong Ju, Zhenbo Luo, Qian Cao, Rita Cucchiara, Ruihua Song, Jian Luan
Abstract:
Large Reasoning Models (LRMs) have recently achieved strong mathematical and code reasoning performance through Reinforcement Learning (RL) post‑training. However, we show that modern reasoning post‑training induces an unintended exploration collapse: temperature‑based sampling no longer increases pass@n accuracy. Empirically, the final‑layer posterior of post‑trained LRMs exhibit sharply reduced entropy, while the entropy of intermediate layers remains relatively high. Motivated by this entropy asymmetry, we propose Latent Exploration Decoding (LED), a depth‑conditioned decoding strategy. LED aggregates intermediate posteriors via cumulative sum and selects depth configurations with maximal entropy as exploration candidates. Without additional training or parameters, LED consistently improves pass@1 and pass@16 accuracy by 0.61 and 1.03 percentage points across multiple reasoning benchmarks and models. Project page: https://github.com/AlbertTan404/Latent‑Exploration‑Decoding.
Authors:Shaohan Wang, Benfeng Xu, Licheng Zhang, Mingxuan Du, Chiwei Zhu, Xiaorui Wang, Zhendong Mao, Yongdong Zhang
Abstract:
Deep Research Agents (DRAs) have demonstrated remarkable capabilities in autonomous information retrieval and report generation, showing great potential to assist humans in complex research tasks. Current evaluation frameworks primarily rely on LLM‑generated references or LLM‑derived evaluation dimensions. While these approaches offer scalability, they often lack the reliability of expert‑verified content and struggle to provide objective, fine‑grained assessments of critical dimensions. To bridge this gap, we introduce Wiki Live Challenge (WLC), a live benchmark that leverages the newest Wikipedia Good Articles (GAs) as expert‑level references. Wikipedia's strict standards for neutrality, comprehensiveness, and verifiability serve as a great challenge for DRAs, with GAs representing the pinnacle of which. We curate a dataset of 100 recent Good Articles and propose Wiki Eval, a comprehensive evaluation framework comprising a fine‑grained evaluation method with 39 criteria for writing quality and rigorous metrics for factual verifiability. Extensive experiments on various DRA systems demonstrate a significant gap between current DRAs and human expert‑level Wikipedia articles, validating the effectiveness of WLC in advancing agent research. We release our benchmark at https://github.com/WangShao2000/Wiki_Live_Challenge
Authors:Chiwei Zhu, Benfeng Xu, Mingxuan Du, Shaohan Wang, Xiaorui Wang, Zhendong Mao, Yongdong Zhang
Abstract:
Deep research is emerging as a representative long‑horizon task for large language model (LLM) agents. However, long trajectories in deep research often exceed model context limits, compressing token budgets for both evidence collection and report writing, and preventing effective test‑time scaling. We introduce FS‑Researcher, a file‑system‑based, dual‑agent framework that scales deep research beyond the context window via a persistent workspace. Specifically, a Context Builder agent acts as a librarian which browses the internet, writes structured notes, and archives raw sources into a hierarchical knowledge base that can grow far beyond context length. A Report Writer agent then composes the final report section by section, treating the knowledge base as the source of facts. In this framework, the file system serves as a durable external memory and a shared coordination medium across agents and sessions, enabling iterative refinement beyond the context window. Experiments on two open‑ended benchmarks (DeepResearch Bench and DeepConsult) show that FS‑Researcher achieves state‑of‑the‑art report quality across different backbone models. Further analyses demonstrate a positive correlation between final report quality and the computation allocated to the Context Builder, validating effective test‑time scaling under the file‑system paradigm. The code and data are open‑sourced at https://github.com/Ignoramus0817/FS‑Researcher.
Authors:Xiaoyu Wen, Zhida He, Han Qi, Ziyu Wan, Zhongtian Ma, Ying Wen, Tianhang Zheng, Xingcheng Xu, Chaochao Lu, Qiaosheng Zhang
Abstract:
Ensuring robust safety alignment is crucial for Large Language Models (LLMs), yet existing defenses often lag behind evolving adversarial attacks due to their reliance on static, pre‑collected data distributions. In this paper, we introduce MAGIC, a novel multi‑turn multi‑agent reinforcement learning framework that formulates LLM safety alignment as an adversarial asymmetric game. Specifically, an attacker agent learns to iteratively rewrite original queries into deceptive prompts, while a defender agent simultaneously optimizes its policy to recognize and refuse such inputs. This dynamic process triggers a co‑evolution, where the attacker's ever‑changing strategies continuously uncover long‑tail vulnerabilities, driving the defender to generalize to unseen attack patterns. Remarkably, we observe that the attacker, endowed with initial reasoning ability, evolves novel, previously unseen combinatorial strategies through iterative RL training, underscoring our method's substantial potential. Theoretically, we provide insights into a more robust game equilibrium and derive safety guarantees. Extensive experiments validate our framework's effectiveness, demonstrating superior defense success rates without compromising the helpfulness of the model. Our code is available at https://github.com/BattleWen/MAGIC.
Authors:Yue Liu, Yuzhong Zhao, Zheyong Xie, Qixiang Ye, Jianbin Jiao, Yao Hu, Shaosheng Cao, Yunfan Liu
Abstract:
In discrete generative modeling, two dominant paradigms demonstrate divergent capabilities: Masked Diffusion Language Models (MDLM) excel at semantic understanding and zero‑shot generalization, whereas Uniform‑noise Diffusion Language Models (UDLM) achieve strong few‑step generation quality, yet neither attains balanced performance across both dimensions. To address this, we propose XDLM, which bridges the two paradigms via a stationary noise kernel. XDLM offers two key contributions: (1) it provides a principled theoretical unification of MDLM and UDLM, recovering each paradigm as a special case; and (2) an alleviated memory bottleneck enabled by an algebraic simplification of the posterior probabilities. Experiments demonstrate that XDLM advances the Pareto frontier between understanding capability and generation quality. Quantitatively, XDLM surpasses UDLM by 5.4 points on zero‑shot text benchmarks and outperforms MDLM in few‑step image generation (FID 54.1 vs. 80.8). When scaled to tune an 8B‑parameter large language model, XDLM achieves 15.0 MBPP in just 32 steps, effectively doubling the baseline performance. Finally, analysis of training dynamics reveals XDLM's superior potential for long‑term scaling. Code is available at https://github.com/MzeroMiko/XDLM
Authors:Mingju Chen, Guibin Zhang, Heng Chang, Yuchen Guo, Shiji Zhou
Abstract:
Contemporary large language model (LLM)‑based multi‑agent systems exhibit systematic advantages in deep research tasks, which emphasize iterative, vertically structured information seeking. However, when confronted with wide search tasks characterized by large‑scale, breadth‑oriented retrieval, existing agentic frameworks, primarily designed around sequential, vertically structured reasoning, remain stuck in expansive search objectives and inefficient long‑horizon execution. To bridge this gap, we propose A‑MapReduce, a MapReduce paradigm‑inspired multi‑agent execution framework that recasts wide search as a horizontally structured retrieval problem. Concretely, A‑MapReduce implements parallel processing of massive retrieval targets through task‑adaptive decomposition and structured result aggregation. Meanwhile, it leverages experiential memory to drive the continual evolution of query‑conditioned task allocation and recomposition, enabling progressive improvement in large‑scale wide‑search regimes. Extensive experiments on five agentic benchmarks demonstrate that A‑MapReduce is (i) high‑performing, achieving state‑of‑the‑art performance on WideSearch and DeepWideSearch, and delivering 5.11% ‑ 17.50% average Item F1 improvements compared with strong baselines with OpenAI o3 or Gemini 2.5 Pro backbones; (ii) cost‑effective and efficient, delivering superior cost‑performance trade‑offs and reducing running time by 45.8% compared to representative multi‑agent baselines. The code is available at https://github.com/mingju‑c/AMapReduce.
Authors:Zirui Wu, Lin Zheng, Zhihui Xie, Jiacheng Ye, Jiahui Gao, Shansan Gong, Yansong Feng, Zhenguo Li, Wei Bi, Guorui Zhou, Lingpeng Kong
Abstract:
Diffusion Language Models (DLMs) present a compelling alternative to autoregressive models, offering flexible, any‑order infilling without specialized prompting design. However, their practical utility is blocked by a critical limitation: the requirement of a fixed‑length masked sequence for generation. This constraint severely degrades code infilling performance when the predefined mask size mismatches the ideal completion length. To address this, we propose DreamOn, a novel diffusion framework that enables dynamic, variable‑length generation. DreamOn augments the diffusion process with two length control states, allowing the model to autonomously expand or contract the output length based solely on its own predictions. We integrate this mechanism into existing DLMs with minimal modifications to the training objective and no architectural changes. Built upon Dream‑Coder‑7B and DiffuCoder‑7B, DreamOn achieves infilling performance on par with state‑of‑the‑art autoregressive models on HumanEval‑Infilling and SantaCoder‑FIM and matches oracle performance achieved with ground‑truth length. Our work removes a fundamental barrier to the practical deployment of DLMs, significantly advancing their flexibility and applicability for variable‑length generation. Our code is available at https://github.com/DreamLM/DreamOn.
Authors:Siwei Wu, Yizhi Li, Yuyang Song, Wei Zhang, Yang Wang, Riza Batista-Navarro, Xian Yang, Mingjie Tang, Bryan Dai, Jian Yang, Chenghua Lin
Abstract:
Training agentic models for terminal‑based tasks critically depends on high‑quality terminal trajectories that capture realistic long‑horizon interactions across diverse domains. However, constructing such data at scale remains challenging due to two key requirements: \emphExecutability, since each instance requires a suitable and often distinct Docker environment; and \emphVerifiability, because heterogeneous task outputs preclude unified, standardized verification. To address these challenges, we propose TerminalTraj, a scalable pipeline that (i) filters high‑quality repositories to construct Dockerized execution environments, (ii) generates Docker‑aligned task instances, and (iii) synthesizes agent trajectories with executable validation code. Using TerminalTraj, we curate 32K Docker images and generate 50,733 verified terminal trajectories across eight domains. Models trained on this data with the Qwen2.5‑Coder backbone achieve consistent performance improvements on TerminalBench (TB), with gains of up to 20% on TB~1.0 and 10% on TB~2.0 over their respective backbones. Notably, TerminalTraj‑32B achieves strong performance among models with fewer than 100B parameters, reaching 35.30% on TB~1.0 and 22.00% on TB~2.0, and demonstrates improved test‑time scaling behavior. All code and data are available at https://github.com/Wusiwei0410/TerminalTraj.
Authors:Marco Chen, Xianbiao Qi, Yelin He, Jiaquan Ye, Rong Xiao
Abstract:
In this work, we revisit Transformer optimization through the lens of second‑order geometry and establish a direct connection between architectural design, activation scale, the Hessian matrix, and the maximum tolerable learning rate. We introduce a simple normalization strategy, termed SimpleNorm, which stabilizes intermediate activation scales by construction. Then, by analyzing the Hessian of the loss with respect to network activations, we theoretically show that SimpleNorm significantly reduces the spectral norm of the Hessian, thereby permitting larger stable learning rates. We validate our theoretical findings through extensive experiments on large GPT models at parameter scales 1B, 1.4B, 7B and 8B. Empirically, SimpleGPT, our SimpleNorm‑based network, tolerates learning rates 3×‑10× larger than standard convention, consistently demonstrates strong optimization stability, and achieves substantially better performance than well‑established baselines. Specifically, when training 7B‑scale models for 60K steps, SimpleGPT achieves a training loss that is 0.08 lower than that of LLaMA2 with QKNorm, reducing the loss from 2.290 to 2.208. Our source code will be released at https://github.com/Ocram7/SimpleGPT.
Authors:Wenxuan Zhang, Yuan-Hao Jiang, Changyong Qi, Rui Jia, Yonghe Wu
Abstract:
Large language models (LLMs) struggle in knowledge‑intensive tasks, as retrievers often overfit to surface similarity and fail on queries involving complex logical relations. The capacity for logical analysis is inherent in model representations but remains underutilized in standard training. LORE (Logic ORiented Retriever Enhancement) introduces fine‑grained contrastive learning to activate this latent capacity, guiding embeddings toward evidence aligned with logical structure rather than shallow similarity. LORE requires no external upervision, resources, or pre‑retrieval analysis, remains index‑compatible, and consistently improves retrieval utility and downstream generation while maintaining efficiency. The datasets and code are publicly available at https://github.com/mazehart/Lore‑RAG.
Authors:Sheng-Lun Wei, Yu-Ling Liao, Yen-Hua Chang, Hen-Hsen Huang, Hsin-Hsi Chen
Abstract:
This work presents the first systematic investigation of speech bias in multilingual MLLMs. We construct and release the BiasInEar dataset, a speech‑augmented benchmark based on Global MMLU Lite, spanning English, Chinese, and Korean, balanced by gender and accent, and totaling 70.8 hours (\approx4,249 minutes) of speech with 11,200 questions. Using four complementary metrics (accuracy, entropy, APES, and Fleiss' κ), we evaluate nine representative models under linguistic (language and accent), demographic (gender), and structural (option order) perturbations. Our findings reveal that MLLMs are relatively robust to demographic factors but highly sensitive to language and option order, suggesting that speech can amplify existing structural biases. Moreover, architectural design and reasoning strategy substantially affect robustness across languages. Overall, this study establishes a unified framework for assessing fairness and robustness in speech‑integrated LLMs, bridging the gap between text‑ and speech‑based evaluation. The resources can be found at https://github.com/ntunlplab/BiasInEar.
Authors:Yutong Song, Shiva Shrestha, Chenhan Lyu, Elahe Khatibi, Pengfei Zhang, Honghui Xu, Nikil Dutt, Amir Rahmani
Abstract:
Spoken question‑answering (SQA) systems relying on automatic speech recognition (ASR) often struggle with accurately recognizing medical terminology. To this end, we propose MedSpeak, a novel knowledge graph‑aided ASR error correction framework that refines noisy transcripts and improves downstream answer prediction by leveraging both semantic relationships and phonetic information encoded in a medical knowledge graph, together with the reasoning power of LLMs. Comprehensive experimental results on benchmarks demonstrate that MedSpeak significantly improves the accuracy of medical term recognition and overall medical SQA performance, establishing MedSpeak as a state‑of‑the‑art solution for medical SQA. The code is available at https://github.com/RainieLLM/MedSpeak.
Authors:Yuheng Yang, Siqi Zhu, Tao Feng, Ge Liu, Jiaxuan You
Abstract:
Large Language Models (LLMs) can be seen as compressed knowledge bases, but it remains unclear what knowledge they truly contain and how far their knowledge boundaries extend. Existing benchmarks are mostly static and provide limited support for systematic knowledge probing. In this paper, we propose an interactive agentic framework to systematically extract and quantify the knowledge of LLMs. Our method includes four adaptive exploration policies to probe knowledge at different granularities. To ensure the quality of extracted knowledge, we introduce a three‑stage knowledge processing pipeline that combines vector‑based filtering to remove exact duplicates, LLM‑based adjudication to resolve ambiguous semantic overlaps, and domain‑relevance auditing to retain valid knowledge units. Through extensive experiments, we find that recursive taxonomy is the most effective exploration strategy. We also observe a clear knowledge scaling law, where larger models consistently extract more knowledge. In addition, we identify a Pass@1‑versus‑Pass@k trade‑off: domain‑specialized models achieve higher initial accuracy but degrade rapidly, while general‑purpose models maintain stable performance during extended extraction. Finally, our results show that differences in training data composition lead to distinct and measurable knowledge profiles across model families.
Authors:Víctor Yeste, Paolo Rosso
Abstract:
Sentence‑level human value detection is typically framed as multi‑label classification over Schwartz values, but it remains unclear whether Schwartz higher‑order (HO) categories provide usable structure. We study this under a strict compute‑frugal budget (single 8 GB GPU) on ValueEval'24 / ValuesML (74K English sentences). We compare (i) direct supervised transformers, (ii) HO\rightarrowvalues pipelines that enforce the hierarchy with hard masks, and (iii) Presence\rightarrowHO\rightarrowvalues cascades, alongside low‑cost add‑ons (lexica, short context, topics), label‑wise threshold tuning, small instruction‑tuned LLM baselines (\le10B), QLoRA, and simple ensembles. HO categories are learnable from single sentences (e.g., the easiest bipolar pair reaches Macro‑F_1\approx0.58), but hard hierarchical gating is not a reliable win: it often reduces end‑task Macro‑F_1 via error compounding and recall suppression. In contrast, label‑wise threshold tuning is a high‑leverage knob (up to +0.05 Macro‑F_1), and small transformer ensembles provide the most consistent additional gains (up to +0.02 Macro‑F_1). Small LLMs lag behind supervised encoders as stand‑alone systems, yet can contribute complementary errors in cross‑family ensembles. Overall, HO structure is useful descriptively, but enforcing it with hard gates hurts sentence‑level value detection; robust improvements come from calibration and lightweight ensembling.
Authors:Gaurav Srivastava, Aafiya Hussain, Chi Wang, Yingyan Celine Lin, Xuan Wang
Abstract:
Most existing language model agentic systems today are built and optimized for large language models (e.g., GPT, Claude, Gemini) via API calls. While powerful, this approach faces several limitations including high token costs and privacy concerns for sensitive applications. We introduce effGen, an open‑source agentic framework optimized for small language models (SLMs) that enables effective, efficient, and secure local deployment (pip install effgen). effGen makes four major contributions: (1) Enhanced tool‑calling with prompt optimization that compresses contexts by 70‑80% while preserving task semantics, (2) Intelligent task decomposition that breaks complex queries into parallel or sequential subtasks based on dependencies, (3) Complexity‑based routing using five factors to make smart pre‑execution decisions, and (4) Unified memory system combining short‑term, long‑term, and vector‑based storage. Additionally, effGen unifies multiple agent protocols (MCP, A2A, ACP) for cross‑protocol communication. Results on 13 benchmarks show effGen outperforms LangChain, AutoGen, and Smolagents with higher success rates, faster execution, and lower memory. Our results reveal that prompt optimization and complexity routing have complementary scaling behavior: optimization benefits SLMs more (11.2% gain at 1.5B vs 2.4% at 32B), while routing benefits large models more (3.6% at 1.5B vs 7.9% at 32B), providing consistent gains across all scales when combined. effGen (https://effgen.org/) is released under the MIT License, ensuring broad accessibility for research and commercial use. Our framework code is publicly available at https://github.com/ctrl‑gaurav/effGen.
Authors:Xuan Ai, Qingqing Yang, Peng Wang, Lei Deng, Lin Zhang, Renhai Chen, Gong Zhang
Abstract:
Long‑context inference in Large Language Models (LLMs) is bottlenecked by the quadratic computation complexity of attention and the substantial memory footprint of Key‑Value (KV) caches. While existing sparse attention mechanisms attempt to mitigate this by exploiting inherent sparsity, they often rely on rigid patterns or aggressive pruning, failing to achieve an optimal balance between efficiency and accuracy. In this paper, we introduce \bf HyLRA (\bf Hybrid \bf Layer \bf Reuse \bf Attention), a novel framework driven by layer‑wise sparsity profiling. Our empirical analysis uncovers a dual characteristic in attention mechanics: intra‑layer sensitivity, where specific layers necessitate full attention to prevent feature distortion, and inter‑layer similarity, where consecutive layers share substantial critical tokens. Based on these observations, HyLRA employs an offline dynamic programming approach to derive an optimal layer‑wise policy. This hybrid strategy retains full attention for sensitive layers to ensure robustness, while enabling tolerant layers to bypass quadratic calculations by directly reusing top‑k indices from preceding layers. This approach allows LLMs to restrict computation to the most critical tokens, effectively overcoming the quadratic bottleneck of dense attention. Extensive evaluations demonstrate that HyLRA improves inference throughput by 6%‑‑46% while maintaining comparable performance (with <1% accuracy degradation), consistently outperforming state‑of‑the‑art sparse attention methods. HyLRA is open source at \hrefhttps://anonymous.4open.science/r/unified‑cache‑management‑CF80/\texttt/r/unified‑cache‑management‑CF80/
Authors:Shengrui Li, Fei Zhao, Kaiyan Zhao, Jieying Ye, Haifeng Liu, Fangcheng Shi, Zheyong Xie, Yao Hu, Shaosheng Cao
Abstract:
Determining an effective data mixture is a key factor in Large Language Model (LLM) pre‑training, where models must balance general competence with proficiency on hard tasks such as math and code. However, identifying an optimal mixture remains an open challenge, as existing approaches either rely on unreliable tiny‑scale proxy experiments or require prohibitively expensive large‑scale exploration. To address this, we propose Decouple Searching from Training Mix (DeMix), a novel framework that leverages model merging to predict optimal data ratios. Instead of training proxy models for every sampled mixture, DeMix trains component models on candidate datasets at scale and derives data mixture proxies via weighted model merging. This paradigm decouples search from training costs, enabling evaluation of unlimited sampled mixtures without extra training burden and thus facilitating better mixture discovery through more search trials. Extensive experiments demonstrate that DeMix breaks the trade‑off between sufficiency, accuracy and efficiency, obtaining the optimal mixture with higher benchmark performance at lower search cost. Additionally, we release the DeMix Corpora, a comprehensive 22T‑token dataset comprising high‑quality pre‑training data with validated mixtures to facilitate open research. Our code and DeMix Corpora is available at https://github.com/Lucius‑lsr/DeMix.
Authors:Liang Wang, Xinyi Mou, Xiaoyou Liu, Xuanjing Huang, Zhongyu Wei
Abstract:
User modeling characterizes individuals through their preferences and behavioral patterns to enable personalized simulation and generation with Large Language Models (LLMs) in contemporary approaches. However, existing methods, whether prompt‑based or training‑based methods, face challenges in balancing personalization quality against computational and data efficiency. We propose a novel framework CURP, which employs a bidirectional user encoder and a discrete prototype codebook to extract multi‑dimensional user traits. This design enables plug‑and‑play personalization with a small number of trainable parameters (about 20M parameters, about 0.2% of the total model size). Through extensive experiments on variant generation tasks, we show that CURP achieves superior performance and generalization compared to strong baselines, while offering better interpretability and scalability. The code are available at https://github.com/RaidonWong/CURP_code
Authors:Hengchang Liu, Zhao Yang, Bing Su
Abstract:
Diffusion language models (DLMs) provide a bidirectional generation framework naturally suited for infilling, yet their performance is constrained by the pre‑specified infilling length. In this paper, we reveal that DLMs possess an inherent ability to discover the correct infilling length. We identify two key statistical phenomena in the first‑step denoising confidence: a local Oracle Peak that emerges near the ground‑truth length and a systematic Length Bias that often obscures this signal. By leveraging this signal and calibrating the bias, our training‑free method CAL (Calibrated Adaptive Length) enables DLMs to approximate the optimal length through an efficient search before formal decoding. Empirical evaluations demonstrate that CAL improves Pass@1 by up to 47.7% over fixed‑length baselines and 40.5% over chat‑based adaptive methods in code infilling, while boosting BLEU‑2 and ROUGE‑L by up to 8.5% and 9.9% in text infilling. These results demonstrate that CAL paves the way for robust DLM infilling without requiring any specialized training. Code is available at https://github.com/NiuHechang/Calibrated_Adaptive_Length.
Authors:Abhinav Gupta, Toben H. Mintz, Jesse Thomason
Abstract:
While word embeddings derive meaning from co‑occurrence patterns, human language understanding is grounded in sensory and motor experience. We present \textSENSE (S\textensorimotor E\textmbedding N\textorm S\textcoring E\textngine), a learned projection model that predicts Lancaster sensorimotor norms from word lexical embeddings. We also conducted a behavioral study where 281 participants selected which among candidate nonce words evoked specific sensorimotor associations, finding statistically significant correlations between human selection rates and \textSENSE ratings across 6 of the 11 modalities. Sublexical analysis of these nonce words selection rates revealed systematic phonosthemic patterns for the interoceptive norm, suggesting a path towards computationally proposing candidate phonosthemes from text data.
Authors:Siyuan Wang, Yanchen Liu, Xiang Ren
Abstract:
Large Reasoning Models (LRMs) achieve strong reasoning performance by generating long chains of thought (CoTs), yet only a small fraction of these traces meaningfully contributes to answer prediction, while the majority contains repetitive or truncated content. Such output redundancy is further propagated after supervised finetuning (SFT), as models learn to imitate verbose but uninformative patterns, which can degrade performance. To this end, we incorporate integrated gradient attribution to quantify each token's influence on final answers and aggregate them into two segment‑level metrics: (1) attribution strength measures the overall attribution magnitude; and (2) direction consistency captures whether tokens' attributions within a segment are uniformly positive or negative (high consistency), or a mixture of both (moderate consistency). Based on these two metrics, we propose a segment‑level selective learning framework to identify important segments with high attribution strength but moderate consistency that indicate reflective rather than shallow reasoning. The framework then applies selective SFT on these important segments while masking loss for unimportant ones. Experiments across multiple models and datasets show that our approach improves accuracy and output efficiency, enabling more effective learning from long reasoning traces~\footnoteCode and data are available at https://github.com/SiyuanWangw/SegmentSelectiveSFT.
Authors:Gabriel Bromonschenkel, Alessandro L. Koerich, Thiago M. Paixão, Hilário Tomaz Alves de Oliveira
Abstract:
Image captioning (IC) refers to the automatic generation of natural language descriptions for images, with applications ranging from social media content generation to assisting individuals with visual impairments. While most research has been focused on English‑based models, low‑resource languages such as Brazilian Portuguese face significant challenges due to the lack of specialized datasets and models. Several studies create datasets by automatically translating existing ones to mitigate resource scarcity. This work addresses this gap by proposing a cross‑native‑translated evaluation of Transformer‑based vision and language models for Brazilian Portuguese IC. We use a version of Flickr30K comprised of captions manually created by native Brazilian Portuguese speakers and compare it to a version with captions automatically translated from English to Portuguese. The experiments include a cross‑context approach, where models trained on one dataset are tested on the other to assess the translation impact. Additionally, we incorporate attention maps for model inference interpretation and use the CLIP‑Score metric to evaluate the image‑description alignment. Our findings show that Swin‑DistilBERTimbau consistently outperforms other models, demonstrating strong generalization across datasets. ViTucano, a Brazilian Portuguese pre‑trained VLM, surpasses larger multilingual models (GPT‑4o, LLaMa 3.2 Vision) in traditional text‑based evaluation metrics, while GPT‑4 models achieve the highest CLIP‑Score, highlighting improved image‑text alignment. Attention analysis reveals systematic biases, including gender misclassification, object enumeration errors, and spatial inconsistencies. The datasets and the models generated and analyzed during the current study are available in: https://github.com/laicsiifes/transformer‑caption‑ptbr.
Authors:Tianyi Hu, Niket Tandon, Akhil Arora
Abstract:
Existing retrieval‑augmented generation (RAG) systems are primarily designed under the assumption that each query has a single correct answer. This overlooks common information‑seeking scenarios with multiple plausible answers, where diversity is essential to avoid collapsing to a single dominant response, thereby constraining creativity and compromising fair and inclusive information access. Our analysis reveals a commonly overlooked limitation of standard RAG systems: they underutilize retrieved context diversity, such that increasing retrieval diversity alone does not yield diverse generations. To address this limitation, we propose DIVERGE, a plug‑and‑play agentic RAG framework with novel reflection‑guided generation and memory‑augmented iterative refinement, which promotes diverse viewpoints while preserving answer quality. We introduce novel metrics tailored to evaluating the diversity‑quality trade‑off in open‑ended questions, and show that they correlate well with human judgments. We demonstrate that DIVERGE achieves the best diversity‑quality trade‑off compared to competitive baselines and previous state‑of‑the‑art methods on the real‑world Infinity‑Chat dataset, substantially improving diversity while maintaining quality. More broadly, our results reveal a systematic limitation of current LLM‑based systems for open‑ended information‑seeking and show that explicitly modeling diversity can mitigate it. Our code is available at: https://github.com/au‑clan/Diverge
Authors:Yang Tan, Yuanxi Yu, Can Wu, Bozitao Zhong, Mingchen Li, Guisheng Fan, Jiankang Zhu, Yafeng Liang, Nanqing Dong, Liang Hong
Abstract:
Zero‑shot mutation prediction is vital for low‑resource protein engineering, yet existing protein language models (PLMs) often yield statistically confident results that ignore fundamental biophysical constraints. Currently, selecting candidates for wet‑lab validation relies on manual expert auditing of PLM outputs, a process that is inefficient, subjective, and highly dependent on domain expertise. To address this, we propose Rank‑and‑Reason (VenusRAR), a two‑stage agentic framework to automate this workflow and maximize expected wet‑lab fitness. In the Rank‑Stage, a Computational Expert and Virtual Biologist aggregate a context‑aware multi‑modal ensemble, establishing a new Spearman correlation record of 0.551 (vs. 0.518) on ProteinGym. In the Reason‑Stage, an agentic Expert Panel employs chain‑of‑thought reasoning to audit candidates against geometric and structural constraints, improving the Top‑5 Hit Rate by up to 367% on ProteinGym‑DMS99. The wet‑lab validation on Cas12i3 nuclease further confirms the framework's efficacy, achieving a 46.7% positive rate and identifying two novel mutants with 4.23‑fold and 5.05‑fold activity improvements. Code and datasets are released on GitHub (https://github.com/ai4protein/VenusRAR/).
Authors:Yue Yu, Ting Bai, HengZhi Lan, Li Qian, Li Peng, Jie Wu, Wei Liu, Jian Luan, Chuan Shi
Abstract:
The attribution technique enhances the credibility of LLMs by adding citations to the generated sentences, enabling users to trace back to the original sources and verify the reliability of the output. However, existing instruction‑tuned attributed LLMs often fail to properly interpret the contextual semantics of citation symbols (e.g., [i]) during text generation. This shortcoming arises from their insufficient awareness of the context information surrounding citation markers, which in turn leads to disjointed references and poor integration of retrieved knowledge into the generated content. To address this issue, we propose a novel Contextual‑aware Citation generation framework (C^2‑Cite) that explicitly integrates the semantic relationships between citation markers and their referenced content. Specifically, a contextual citation alignment mechanism is adopted: it first encodes the retrieved document contexts into the symbol representation of citations, then aligns the marker numbers by decoding information from a citation router function. This mechanism enables the transformation of citation markers from generic placeholders into active knowledge pointers that link to the referenced source information. Experimental results on the ALCE benchmark across three datasets validate our framework C^2‑Cite++: it outperforms the SOTA baseline by an average of 5.8% in citation quality and 17.4% in response correctness. The implementation is publicly available at https://github.com/BAI‑LAB/c2cite
Authors:Kaihua Liang, Xin Tan, An Zhong, Hong Xu, Marco Canini
Abstract:
Diffusion Large Language Models (DLLMs) offer a compelling alternative to Auto‑Regressive models, but their deployment is constrained by high decoding cost. In this work, we identify a key inefficiency in DLLM decoding: while computation is parallelized over token blocks, only a small subset of tokens is decodable at each diffusion step, causing most compute to be wasted on non‑decodable tokens. We further observe a strong correlation between attention‑derived token importance and token‑wise decoding probability. Based on this insight, we propose FOCUS ‑‑ an inference system designed for DLLMs. By dynamically focusing computation on decodable tokens and evicting non‑decodable ones on‑the‑fly, FOCUS increases the effective batch size, alleviating compute limitations and enabling scalable throughput. Empirical evaluations demonstrate that FOCUS achieves up to 3.52× throughput improvement over the production‑grade engine LMDeploy, while preserving or improving generation quality across multiple benchmarks. The FOCUS system is publicly available on GitHub: https://github.com/sands‑lab/FOCUS.
Authors:Fanmeng Wang, Haotian Liu, Guojiang Zhao, Hongteng Xu, Zhifeng Gao
Abstract:
While Chain‑of‑Thought (CoT) significantly enhances the performance of Large Language Models (LLMs), explicit reasoning chains introduce substantial computational redundancy. Recent latent reasoning methods attempt to mitigate this by compressing reasoning processes into latent space, but often suffer from severe performance degradation due to the lack of appropriate compression guidance. In this study, we propose Rendered CoT‑Guided variational Latent Reasoning (ReGuLaR), a simple yet novel latent learning paradigm resolving this issue. Fundamentally, we formulate latent reasoning within the Variational Auto‑Encoding (VAE) framework, sampling the current latent reasoning state from the posterior distribution conditioned on previous ones. Specifically, when learning this variational latent reasoning model, we render explicit reasoning chains as images, from which we extract dense visual‑semantic representations to regularize the posterior distribution, thereby achieving efficient compression with minimal information loss. Extensive experiments demonstrate that ReGuLaR significantly outperforms existing latent reasoning methods across both computational efficiency and reasoning effectiveness, and even surpasses CoT through multi‑modal reasoning, providing a new and insightful solution to latent reasoning. Code: https://github.com/FanmengWang/ReGuLaR.
Authors:Casimiro Pio Carrino, Paula Estrella, Rabih Zbib, Carlos Escolano, José A. R. Fonollosa
Abstract:
We introduce JobResQA, a multilingual Question Answering benchmark for evaluating Machine Reading Comprehension (MRC) capabilities of LLMs on HR‑specific tasks involving résumés and job descriptions. The dataset comprises 581 QA pairs across 105 synthetic résumé‑job description pairs in five languages (English, Spanish, Italian, German, and Chinese), with questions spanning three complexity levels from basic factual extraction to complex cross‑document reasoning. We propose a data generation pipeline derived from real‑world sources through de‑identification and data synthesis to ensure both realism and privacy, while controlled demographic and professional attributes (implemented via placeholders) enable systematic bias and fairness studies. We also present a cost‑effective, human‑in‑the‑loop translation pipeline based on the TEaR methodology, incorporating MQM error annotations and selective post‑editing to ensure an high‑quality multi‑way parallel benchmark. We provide a baseline evaluations across multiple open‑weight LLM families using an LLM‑as‑judge approach revealing higher performances on English and Spanish but substantial degradation for other languages, highlighting critical gaps in multilingual MRC capabilities for HR applications. JobResQA provides a reproducible benchmark for advancing fair and reliable LLM‑based HR systems. The benchmark is publicly available at: https://github.com/Avature/jobresqa‑benchmark
Authors:Jiaming Zhou, Xuxin Cheng, Shiwan Zhao, Yuhang Jia, Cao Liu, Ke Zeng, Xunliang Cai, Yong Qin
Abstract:
Autoregressive (AR) large audio language models (LALMs) such as Qwen‑2.5‑Omni have achieved strong performance on audio understanding and interaction, but scaling them remains costly in data and computation, and strictly sequential decoding limits inference efficiency. Diffusion large language models (dLLMs) have recently been shown to make effective use of limited training data, and prior work on DIFFA indicates that replacing an AR backbone with a diffusion counterpart can substantially improve audio understanding under matched settings, albeit at a proof‑of‑concept scale without large‑scale instruction tuning, preference alignment, or practical decoding schemes. We introduce DIFFA‑2, a practical diffusion‑based LALM for general audio understanding. DIFFA‑2 upgrades the speech encoder, employs dual semantic and acoustic adapters, and is trained with a four‑stage curriculum that combines semantic and acoustic alignment, large‑scale supervised fine‑tuning, and variance‑reduced preference optimization, using only fully open‑source corpora. Experiments on MMSU, MMAU, and MMAR show that DIFFA‑2 consistently improves over DIFFA and is competitive to strong AR LALMs under practical training budgets, supporting diffusion‑based modeling is a viable backbone for large‑scale audio understanding. Our code is available at https://github.com/NKU‑HLT/DIFFA.git.
Authors:Yiheng Liu, Junhao Ning, Sichen Xia, Haiyang Sun, Yang Yang, Hanyang Chi, Xiaohui Gao, Ning Qiang, Bao Ge, Junwei Han, Xintao Hu
Abstract:
The development of large language models (LLMs) is costly and has significant commercial value. Consequently, preventing unauthorized appropriation of open‑source LLMs and protecting developers' intellectual property rights have become critical challenges. In this work, we propose the Functional Network Fingerprint (FNF), a training‑free, sample‑efficient method for detecting whether a suspect LLM is derived from a victim model, based on the consistency between their functional network activity. We demonstrate that models that share a common origin, even with differences in scale or architecture, exhibit highly consistent patterns of neuronal activity within their functional networks across diverse input samples. In contrast, models trained independently on distinct data or with different objectives fail to preserve such activity alignment. Unlike conventional approaches, our method requires only a few samples for verification, preserves model utility, and remains robust to common model modifications (such as fine‑tuning, pruning, and parameter permutation), as well as to comparisons across diverse architectures and dimensionalities. FNF thus provides model owners and third parties with a simple, non‑invasive, and effective tool for protecting LLM intellectual property. The code is available at https://github.com/WhatAboutMyStar/LLM_ACTIVATION.
Authors:Abhishek Tyagi, Yunuo Cen, Shrey Dhorajiya, Bharadwaj Veeravalli, Xuanyao Fong
Abstract:
Large Language Models (LLMs) exhibit substantial parameter redundancy, particularly in Feed‑Forward Networks (FFNs). Existing pruning methods suffer from two primary limitations. First, reliance on dataset‑specific calibration introduces significant data dependency and computational overhead. Second, being predominantly static, they fail to account for the evolving subset of knowledge neurons in LLMs during autoregressive generation as the context evolves. To address this, we introduce DART, i.e., Dynamic Attention‑Guided Runtime Tracing), a lightweight, training‑free method that performs on‑the‑fly context‑based pruning. DART monitors shifts in attention score distributions to infer context changes, dynamically updating neuron‑level masks to retain salient parameters. Across ten benchmarks, DART outperforms prior dynamic baseline, achieving accuracy gains of up to 14.5% on LLAMA‑3.1‑8B at 70% FFN sparsity. Furthermore, DART achieves up to 3x better ROUGE‑L scores with respect to static‑masked pruning on summarization tasks, with its performance comparable to the original dense models. We conclusively demonstrate that the proposed framework effectively adapts to diverse semantic contexts, preserves model capabilities across both general and domain‑specific tasks while running at less than 10MBs of memory for LLAMA‑3.1‑8B(16GBs) with 0.1% FLOPs overhead. The code is available at https://github.com/seeder‑research/DART.
Authors:Chengyi Yang, Zhishang Xiang, Yunbo Tang, Zongpei Teng, Chengsong Huang, Fei Long, Yuhan Liu, Jinsong Su
Abstract:
Test‑Time Training offers a promising way to improve the reasoning ability of large language models (LLMs) by adapting the model using only the test questions. However, existing methods struggle with difficult reasoning problems for two reasons: raw test questions are often too difficult to yield high‑quality pseudo‑labels, and the limited size of test sets makes continuous online updates prone to instability. To address these limitations, we propose TTCS, a co‑evolving test‑time training framework. Specifically, TTCS initializes two policies from the same pretrained model: a question synthesizer and a reasoning solver. These policies evolve through iterative optimization: the synthesizer generates progressively challenging question variants conditioned on the test questions, creating a structured curriculum tailored to the solver's current capability, while the solver updates itself using self‑consistency rewards computed from multiple sampled responses on both original test and synthetic questions. Crucially, the solver's feedback guides the synthesizer to generate questions aligned with the model's current capability, and the generated question variants in turn stabilize the solver's test‑time training. Experiments show that TTCS consistently strengthens the reasoning ability on challenging mathematical benchmarks and transfers to general‑domain tasks across different LLM backbones, highlighting a scalable path towards dynamically constructing test‑time curricula for self‑evolving. Our code and implementation details are available at https://github.com/XMUDeepLIT/TTCS.
Authors:Ryo Fujii, Makoto Morishita, Kazuki Yano, Jun Suzuki
Abstract:
With the advancement of automated software engineering, research focus is increasingly shifting toward practical tasks reflecting the day‑to‑day work of software engineers. Among these tasks, software migration, a critical process of adapting code to evolving environments, has been largely overlooked. In this study, we introduce TimeMachine‑bench, a benchmark designed to evaluate software migration in real‑world Python projects. Our benchmark consists of GitHub repositories whose tests begin to fail in response to dependency updates. The construction process is fully automated, enabling live updates of the benchmark. Furthermore, we curated a human‑verified subset to ensure problem solvability. We evaluated agent‑based baselines built on top of 11 models, including both strong open‑weight and state‑of‑the‑art LLMs on this verified subset. Our results indicated that, while LLMs show some promise for migration tasks, they continue to face substantial reliability challenges, including spurious solutions that exploit low test coverage and unnecessary edits stemming from suboptimal tool‑use strategies. Our dataset and implementation are available at https://github.com/tohoku‑nlp/timemachine‑bench.
Authors:Xudong Lu, Huankang Guan, Yang Bo, Jinpeng Chen, Xintong Guo, Shuhan Li, Fang Liu, Peiwen Sun, Xueying Li, Wei Zhang, Xue Yang, Rui Liu, Hongsheng Li
Abstract:
Multimodal Large Language Models excel at offline audio‑visual understanding, but their ability to serve as mobile assistants in continuous real‑world streams remains underexplored. In daily phone use, mobile assistants must track streaming audio‑visual inputs and respond at the right time, yet existing benchmarks are often restricted to multiple‑choice questions or use shorter videos. In this paper, we introduce PhoStream, the first mobile‑centric streaming benchmark that unifies on‑screen and off‑screen scenarios to evaluate video, audio, and temporal reasoning. PhoStream contains 5,572 open‑ended QA pairs from 578 videos across 4 scenarios and 10 capabilities. We build it with an Automated Generative Pipeline backed by rigorous human verification, and evaluate models using a realistic Online Inference Pipeline and LLM‑as‑a‑Judge evaluation for open‑ended responses. Experiments reveal a temporal asymmetry in LLM‑judged scores (0‑100): models perform well on Instant and Backward tasks (Gemini 3 Pro exceeds 80), but drop sharply on Forward tasks (16.40), largely due to early responses before the required visual and audio cues appear. This highlights a fundamental limitation: current MLLMs struggle to decide when to speak, not just what to say. Code and datasets used in this work will be made publicly accessible at https://github.com/Lucky‑Lance/PhoStream.
Authors:Weiqi Wang, Xin Liu, Binxuan Huang, Hejie Cui, Rongzhi Zhang, Changlong Yu, Shuowei Jin, Jingfeng Yang, Qingyu Yin, Zhengyang Wang, Zheng Li, Yifan Gao, Priyanka Nigam, Bing Yin, Lihong Li, Yangqiu Song
Abstract:
RLVR is now a standard way to train LLMs on reasoning tasks with verifiable outcomes, but when rollout generation dominates the cost, efficiency depends heavily on which prompts you sample and when. In practice, prompt pools are often static or only loosely tied to the model's learning progress, so uniform sampling can't keep up with the shifting capability frontier and ends up wasting rollouts on prompts that are already solved or still out of reach. Existing approaches improve efficiency through filtering, curricula, adaptive rollout allocation, or teacher guidance, but they typically assume a fixed pool‑which makes it hard to support stable on‑policy pool growth‑or they add extra teacher cost and latency. We introduce HeaPA (Heap Sampling and On‑Policy Query Augmentation), which maintains a bounded, evolving pool, tracks the frontier using heap‑based boundary sampling, expands the pool via on‑policy augmentation with lightweight asynchronous validation, and stabilizes correlated queries through topology‑aware re‑estimation of pool statistics and controlled reinsertion. Across two training corpora, two training recipes, and seven benchmarks, HeaPA consistently improves accuracy and reaches target performance with fewer computations while keeping wall‑clock time comparable. Our analyses suggest these gains come from frontier‑focused sampling and on‑policy pool growth, with the benefits becoming larger as model scale increases. Our code is available at https://github.com/horizon‑rl/HeaPA.
Authors:Zhi Yang, Lingfeng Zeng, Fangqi Lou, Qi Qi, Wei Zhang, Zhenyu Wu, Zhenxiong Yu, Jun Han, Zhiheng Jin, Lejie Zhang, Xiaoming Huang, Xiaolong Liang, Zheng Wei, Junbo Zou, Dongpo Cheng, Zhaowei Liu, Xin Guo, Rongjunchen Zhang, Liwen Zhang
Abstract:
Multimodal large language models are playing an increasingly significant role in empowering the financial domain, however, the challenges they face, such as multimodal and high‑density information and cross‑modal multi‑hop reasoning, go beyond the evaluation scope of existing multimodal benchmarks. To address this gap, we propose UniFinEval, the first unified multimodal benchmark designed for high‑information‑density financial environments, covering text, images, and videos. UniFinEval systematically constructs five core financial scenarios grounded in real‑world financial systems: Financial Statement Auditing, Company Fundamental Reasoning, Industry Trend Insights, Financial Risk Sensing, and Asset Allocation Analysis. We manually construct a high‑quality dataset consisting of 3,767 question‑answer pairs in both chinese and english and systematically evaluate 10 mainstream MLLMs under Zero‑Shot and CoT settings. Results show that Gemini‑3‑pro‑preview achieves the best overall performance, yet still exhibits a substantial gap compared to financial experts. Further error analysis reveals systematic deficiencies in current models. UniFinEval aims to provide a systematic assessment of MLLMs' capabilities in fine‑grained, high‑information‑density financial environments, thereby enhancing the robustness of MLLMs applications in real‑world financial scenarios. Data and code are available at https://github.com/aifinlab/UniFinEval.
Authors:Naufal Suryanto, Muzammal Naseer, Pengfei Li, Syed Talal Wasim, Jinhui Yi, Juergen Gall, Paolo Ceravolo, Ernesto Damiani
Abstract:
Cybersecurity operations demand assistant LLMs that support diverse workflows without exposing sensitive data. Existing solutions either rely on proprietary APIs with privacy risks or on open models lacking domain adaptation. To bridge this gap, we curate 11.8B tokens of cybersecurity‑focused continual pretraining data via large‑scale web filtering and manual collection of high‑quality resources, spanning 28.6K documents across frameworks, offensive techniques, and security tools. Building on this, we design an agentic augmentation pipeline that simulates expert workflows to generate 266K multi‑turn cybersecurity samples for supervised fine‑tuning. Combined with general open‑source LLM data, these resources enable the training of RedSage, an open‑source, locally deployable cybersecurity assistant with domain‑aware pretraining and post‑training. To rigorously evaluate the models, we introduce RedSage‑Bench, a benchmark with 30K multiple‑choice and 240 open‑ended Q&A items covering cybersecurity knowledge, skills, and tool expertise. RedSage is further evaluated on established cybersecurity benchmarks (e.g., CTI‑Bench, CyberMetric, SECURE) and general LLM benchmarks to assess broader generalization. At the 8B scale, RedSage achieves consistently better results, surpassing the baseline models by up to +5.59 points on cybersecurity benchmarks and +5.05 points on Open LLM Leaderboard tasks. These findings demonstrate that domain‑aware agentic augmentation and pre/post‑training can not only enhance cybersecurity‑specific expertise but also help to improve general reasoning and instruction‑following. All models, datasets, and code are publicly available.
Authors:Kaixuan Fan, Kaituo Feng, Manyuan Zhang, Tianshuo Peng, Zhixun Li, Yilei Jiang, Shuang Chen, Peng Pei, Xunliang Cai, Xiangyu Yue
Abstract:
Agentic Reinforcement Learning (Agentic RL) has achieved notable success in enabling agents to perform complex reasoning and tool use. However, most methods still relies on sparse outcome‑based reward for training. Such feedback fails to differentiate intermediate reasoning quality, leading to suboptimal training results. In this paper, we introduce Agent Reasoning Reward Model (Agent‑RRM), a multi‑faceted reward model that produces structured feedback for agentic trajectories, including (1) an explicit reasoning trace , (2) a focused critique that provides refinement guidance by highlighting reasoning flaws, and (3) an overall score that evaluates process performance. Leveraging these signals, we systematically investigate three integration strategies: Reagent‑C (text‑augmented refinement), Reagent‑R (reward‑augmented guidance), and Reagent‑U (unified feedback integration). Extensive evaluations across 12 diverse benchmarks demonstrate that Reagent‑U yields substantial performance leaps, achieving 43.7% on GAIA and 46.2% on WebWalkerQA, validating the effectiveness of our reasoning reward model and training schemes. Code, models, and datasets are all released to facilitate future research.
Authors:Xin Chen, Feng Jiang, Yiqian Zhang, Hardy Chen, Shuo Yan, Wenya Xie, Min Yang, Shujian Huang
Abstract:
Reasoning‑oriented Large Language Models (LLMs) have achieved remarkable progress with Chain‑of‑Thought (CoT) prompting, yet they remain fundamentally limited by a \emphblind self‑thinking paradigm: performing extensive internal reasoning even when critical information is missing or ambiguous. We propose Proactive Interactive Reasoning (PIR), a new reasoning paradigm that transforms LLMs from passive solvers into proactive inquirers that interleave reasoning with clarification. Unlike existing search‑ or tool‑based frameworks that primarily address knowledge uncertainty by querying external environments, PIR targets premise‑ and intent‑level uncertainty through direct interaction with the user. PIR is implemented via two core components: (1) an uncertainty‑aware supervised fine‑tuning procedure that equips models with interactive reasoning capability, and (2) a user‑simulator‑based policy optimization framework driven by a composite reward that aligns model behavior with user intent. Extensive experiments on mathematical reasoning, code generation, and document editing demonstrate that PIR consistently outperforms strong baselines, achieving up to 32.70% higher accuracy, 22.90% higher pass rate, and 41.36 BLEU improvement, while reducing nearly half of the reasoning computation and unnecessary interaction turns. Further reliability evaluations on factual knowledge, question answering, and missing‑premise scenarios confirm the strong generalization and robustness of PIR. Model and code are publicly available at: \hrefhttps://github.com/SUAT‑AIRI/Proactive‑Interactive‑R1
Authors:Yibo Wang, Yongcheng Jing, Shunyu Liu, Hao Guan, Rong-cheng Tu, Chengyu Wang, Jun Huang, Dacheng Tao
Abstract:
Long‑context reasoning has significantly empowered large language models (LLMs) to tackle complex tasks, yet it introduces severe efficiency bottlenecks due to the computational complexity. Existing efficient approaches often rely on complex additional training or external models for compression, which limits scalability and discards critical fine‑grained information. In this paper, we propose VTC‑R1, a new efficient reasoning paradigm that integrates vision‑text compression into the reasoning process. Instead of processing lengthy textual traces, VTC‑R1 renders intermediate reasoning segments into compact images, which are iteratively fed back into vision‑language models as "optical memory." We construct a training dataset based on OpenR1‑Math‑220K achieving 3.4x token compression and fine‑tune representative VLMs‑Glyph and Qwen3‑VL. Extensive experiments on benchmarks such as MATH500, AIME25, AMC23 and GPQA‑D demonstrate that VTC‑R1 consistently outperforms standard long‑context reasoning. Furthermore, our approach significantly improves inference efficiency, achieving 2.7x speedup in end‑to‑end latency, highlighting its potential as a scalable solution for reasoning‑intensive applications. Our code is available at https://github.com/w‑yibo/VTC‑R1.
Authors:Ghazal Kalhor, Behnam Bahrak
Abstract:
In recent years, multilingual Large Language Models (LLMs) have become an inseparable part of daily life, making it crucial for them to master the rules of conversational language in order to communicate effectively with users. While previous work has evaluated LLMs' understanding of figurative language in high‑resource languages, their performance in low‑resource languages remains underexplored. In this paper, we introduce MasalBench, a comprehensive benchmark for assessing LLMs' contextual and cross‑cultural understanding of Persian proverbs, which are a key component of conversation in this low‑resource language. We evaluate eight state‑of‑the‑art LLMs on MasalBench and find that they perform well in identifying Persian proverbs in context, achieving accuracies above 0.90. However, their performance drops considerably when tasked with identifying equivalent English proverbs, with the best model achieving 0.79 accuracy. Our findings highlight the limitations of current LLMs in cultural knowledge and analogical reasoning, and they provide a framework for assessing cross‑cultural understanding in other low‑resource languages. MasalBench is available at https://github.com/kalhorghazal/MasalBench.
Authors:Zhao Wang, Ziliang Zhao, Zhicheng Dou
Abstract:
Reinforcement learning (RL) has become a promising paradigm for optimizing Retrieval‑Augmented Generation (RAG) in complex reasoning tasks. However, traditional outcome‑based RL approaches often suffer from reward sparsity and inefficient credit assignment, as coarse‑grained scalar rewards fail to identify specific erroneous steps within long‑horizon trajectories. This ambiguity frequently leads to "process hallucinations", where models reach correct answers through flawed logic or redundant retrieval steps. Although recent process‑aware approaches attempt to mitigate this via static preference learning or heuristic reward shaping, they often lack the on‑policy exploration capabilities required to decouple step‑level credit from global outcomes. To address these challenges, we propose ProRAG, a process‑supervised reinforcement learning framework designed to integrate learned step‑level supervision into the online optimization loop. Our framework consists of four stages: (1) Supervised Policy Warmup to initialize the model with a structured reasoning format; (2) construction of an MCTS‑based Process Reward Model (PRM) to quantify intermediate reasoning quality; (3) PRM‑Guided Reasoning Refinement to align the policy with fine‑grained process preferences; and (4) Process‑Supervised Reinforcement Learning with a dual‑granularity advantage mechanism. By aggregating step‑level process rewards with global outcome signals, ProRAG provides precise feedback for every action. Extensive experiments on five multi‑hop reasoning benchmarks demonstrate that ProRAG achieves superior overall performance compared to strong outcome‑based and process‑aware RL baselines, particularly on complex long‑horizon tasks, validating the effectiveness of fine‑grained process supervision. The code and model are available at https://github.com/lilinwz/ProRAG.
Authors:Yaocong Li, Leihan Zhang, Le Zhang, Qiang Yan
Abstract:
Internet memes have become pervasive carriers of digital culture on social platforms. However, their heavy reliance on metaphors and sociocultural context also makes them subtle vehicles for harmful content, posing significant challenges for automated content moderation. Existing approaches primarily focus on intra‑modal and inter‑modal signal analysis, while the understanding of implicit toxicity often depends on background knowledge that is not explicitly present in the meme itself. To address this challenge, we propose KID, a Knowledge‑Injected Dual‑Head Learning framework for knowledge‑grounded harmful meme detection. KID adopts a label‑constrained distillation paradigm to decompose complex meme understanding into structured reasoning chains that explicitly link visual evidence, background knowledge, and classification labels. These chains guide the learning process by grounding external knowledge in meme‑specific contexts. In addition, KID employs a dual‑head architecture that jointly optimizes semantic generation and classification objectives, enabling aligned linguistic reasoning while maintaining stable decision boundaries. Extensive experiments on five multilingual datasets spanning English, Chinese, and low‑resource Bengali demonstrate that KID achieves SOTA performance on both binary and multi‑label harmful meme detection tasks, improving over previous best methods by 2.1%‑‑19.7% across primary evaluation metrics. Ablation studies further confirm the effectiveness of knowledge injection and dual‑head joint learning, highlighting their complementary contributions to robust and generalizable meme understanding. The code and data are available at https://github.com/PotatoDog1669/KID.
Authors:Ruiwen Zhou, Maojia Song, Xiaobao Wu, Sitao Cheng, Xunjian Yin, Yuxi Xie, Zhuoqun Hao, Wenyue Hua, Liangming Pan, Soujanya Poria, Min-Yen Kan
Abstract:
Individual agents in multi‑agent (MA) systems often lack robustness, tending to blindly conform to misleading peers. We show this weakness stems from both sycophancy and inadequate ability to evaluate peer reliability. To address this, we first formalize the learning problem of history‑aware reference, introducing the historical interactions of peers as additional input, so that agents can estimate peer reliability and learn from trustworthy peers when uncertain. This shifts the task from evaluating peer reasoning quality to estimating peer reliability based on interaction history. We then develop Epistemic Context Learning (ECL): a reasoning framework that conditions predictions on explicitly‑built peer profiles from history. We further optimize ECL by reinforcement learning using auxiliary rewards. Our experiments reveal that our ECL enables small models like Qwen 3‑4B to outperform a history‑agnostic baseline 8x its size (Qwen 3‑30B) by accurately identifying reliable peers. ECL also boosts frontier models to near‑perfect (100%) performance. We show that ECL generalizes well to various MA configurations and we find that trust is modeled well by LLMs, revealing a strong correlation in trust modeling accuracy and final answer quality.
Authors:Qingyue Yang, Jie Wang, Xing Li, Yinqi Bai, Xialiang Tong, Huiling Zhen, Jianye Hao, Mingxuan Yuan, Bin Li
Abstract:
Attention patterns play a crucial role in both training and inference of large language models (LLMs). Prior works have identified individual patterns such as retrieval heads, sink heads, and diagonal traces, yet these observations remain fragmented and lack a unifying explanation. To bridge this gap, we introduce Temporal Attention Pattern Predictability Analysis (TAPPA), a unifying framework that explains diverse attention patterns by analyzing their underlying mathematical formulations from a temporally continuous perspective. TAPPA both deepens the understanding of attention behavior and guides inference acceleration approaches. Specifically, TAPPA characterizes attention patterns as predictable patterns with clear regularities and unpredictable patterns that appear effectively random. Our analysis further reveals that this distinction can be explained by the degree of query self‑similarity along the temporal dimension. Focusing on the predictable patterns, we further provide a detailed mathematical analysis of three representative cases through the joint effect of queries, keys, and Rotary Positional Embeddings (RoPE). We validate TAPPA by applying its insights to KV cache compression and LLM pruning tasks. Across these tasks, a simple metric motivated by TAPPA consistently improves performance over baseline methods. The code is available at https://github.com/MIRALab‑USTC/LLM‑TAPPA.
Authors:Vijini Liyanage, François Yvon
Abstract:
Subword tokenization methods, such as Byte‑Pair Encoding (BPE), significantly impact the performance and efficiency of large language models (LLMs). The standard approach involves training a general‑purpose tokenizer that uniformly processes all textual data during both training and inference. However, the use of a generic set of tokens can incur inefficiencies when applying the model to specific domains or languages. To address this limitation, we propose a post‑training adaptation strategy that selectively replaces low‑utility tokens with more relevant ones based on their frequency in an adaptation corpus. Our algorithm identifies the token inventory that most effectively encodes the adaptation corpus for a given target vocabulary size. Extensive experiments on generation and classification tasks across multiple languages demonstrate that our adapted tokenizers compress test corpora more effectively than baselines using the same vocabulary size. This method serves as a lightweight adaptation mechanism, akin to a vocabulary fine‑tuning process, enabling optimized tokenization for specific domains or tasks. Our code and data are available at https://github.com/vijini/Adapt‑BPE.git.
Authors:Wuyang Zhou, Yuxuan Gu, Giorgos Iacovides, Danilo Mandic
Abstract:
The success of Hyper‑Connections (HC) in neural networks (NN) has also highlighted issues related to its training instability and restricted scalability. The Manifold‑Constrained Hyper‑Connections (mHC) mitigate these challenges by projecting the residual connection space onto a Birkhoff polytope, however, it faces two issues: 1) its iterative Sinkhorn‑Knopp (SK) algorithm does not always yield exact doubly stochastic residual matrices; 2) mHC incurs a prohibitive \mathcalO(n^3C) parameter complexity with n as the width of the residual stream and C as the feature dimension. The recently proposed mHC‑lite reparametrizes the residual matrix via the Birkhoff‑von‑Neumann theorem to guarantee double stochasticity, but also faces a factorial explosion in its parameter complexity, \mathcalO \left( nC \cdot n! \right). To address both challenges, we propose KromHC, which uses the \underlineKronecker products of smaller doubly stochastic matrices to parametrize the residual matrix in \underlinemHC. By enforcing manifold constraints across the factor residual matrices along each mode of the tensorized residual stream, KromHC guarantees exact double stochasticity of the residual matrices while reducing parameter complexity to \mathcalO(n^2C). Comprehensive experiments demonstrate that KromHC matches or even outperforms state‑of‑the‑art (SOTA) mHC variants, while requiring significantly fewer trainable parameters. The code is available at \texttthttps://github.com/wz1119/KromHC.
Authors:Xiaoyu Tian, Haotian Wang, Shuaiting Chen, Hao Zhou, Kaichi Yu, Yudian Zhang, Jade Ouyang, Junxi Yin, Jiong Chen, Baoyan Guo, Lei Zhang, Junjie Tao, Yuansheng Song, Ming Cui, Chengwei Liu
Abstract:
Large language models (LLMs) are increasingly used as tool‑augmented agents for multi‑step decision making, yet training robust tool‑using agents remains challenging. Existing methods still require manual intervention, depend on non‑verifiable simulated environments, rely exclusively on either supervised fine‑tuning (SFT) or reinforcement learning (RL), and struggle with stable long‑horizon, multi‑turn learning. To address these challenges, we introduce ASTRA, a fully automated end‑to‑end framework for training tool‑augmented language model agents via scalable data synthesis and verifiable reinforcement learning. ASTRA integrates two complementary components. First, a pipeline that leverages the static topology of tool‑call graphs synthesizes diverse, structurally grounded trajectories, instilling broad and transferable tool‑use competence. Second, an environment synthesis framework that captures the rich, compositional topology of human semantic reasoning converts decomposed question‑answer traces into independent, code‑executable, and rule‑verifiable environments, enabling deterministic multi‑turn RL. Based on this method, we develop a unified training methodology that integrates SFT with online RL using trajectory‑level rewards to balance task completion and interaction efficiency. Experiments on multiple agentic tool‑use benchmarks demonstrate that ASTRA‑trained models achieve state‑of‑the‑art performance at comparable scales, approaching closed‑source systems while preserving core reasoning ability. We release the full pipelines, environments, and trained models at https://github.com/LianjiaTech/astra.
Authors:Yang Zhou, Zhenting Sheng, Mingrui Tan, Yuting Song, Jun Zhou, Yu Heng Kwan, Lian Leng Low, Yang Bai, Yong Liu
Abstract:
Effective clinical history taking is a foundational yet underexplored component of clinical reasoning. While large language models (LLMs) have shown promise on static benchmarks, they often fall short in dynamic, multi‑turn diagnostic settings that require iterative questioning and hypothesis refinement. To address this gap, we propose \method, a note‑driven framework that trains LLMs to conduct structured history taking and diagnosis by learning from widely available medical notes. Instead of relying on scarce and sensitive dialogue data, we convert real‑world medical notes into high‑quality doctor‑patient dialogues using a decision tree‑guided generation and refinement pipeline. We then propose a three‑stage fine‑tuning strategy combining supervised learning, simulated data augmentation, and preference learning. Furthermore, we propose a novel single‑turn reasoning paradigm that reframes history taking as a sequence of single‑turn reasoning problems. This design enhances interpretability and enables local supervision, dynamic adaptation, and greater sample efficiency. Experimental results show that our method substantially improves clinical reasoning, achieving gains of +16.9 F1 and +21.0 Top‑1 diagnostic accuracy over GPT‑4o. Our code and dataset can be found at https://github.com/zhentingsheng/Note2Chat.
Authors:Alireza Nadaf, Alireza Mohammadshahi, Majid Yazdani
Abstract:
We introduce KAPSO, a modular framework for autonomous program synthesis and optimization. Given a natural language goal and an evaluation method, KAPSO iteratively performs ideation, code synthesis and editing, execution, evaluation, and learning to improve a runnable artifact toward measurable objectives. Rather than treating synthesis as the endpoint, KAPSO uses synthesis as an operator within a long‑horizon optimization loop, where progress is defined by evaluator outcomes. KAPSO targets long‑horizon failures common in coding agents, including lost experimental state, brittle debugging, and weak reuse of domain expertise, by integrating three tightly coupled components. First, a git‑native experimentation engine isolates each attempt as a branch, producing reproducible artifacts and preserving provenance across iterations. Second, a knowledge system ingests heterogeneous sources, including repositories, internal playbooks, and curated external resources such as documentation, scientific papers, and web search results, and organizes them into a structured representation that supports retrieval over workflows, implementations, and environment constraints. Third, a cognitive memory layer coordinates retrieval and maintains an episodic store of reusable lessons distilled from experiment traces (run logs, diffs, and evaluator feedback), reducing repeated error modes and accelerating convergence. We evaluated KAPSO on MLE‑Bench (Kaggle‑style ML competitions) and ALE‑Bench (AtCoder heuristic optimization), and report end‑to‑end performance. Code Available at: https://github.com/Leeroo‑AI/kapso
Authors:Yuxiang Huang, Mingye Li, Xu Han, Chaojun Xiao, Weilin Zhao, Ao Sun, Ziqi Yuan, Hao Zhou, Fandong Meng, Zhiyuan Liu
Abstract:
The efficiency of long‑video inference remains a critical bottleneck, mainly due to the dense computation in the prefill stage of Large Multimodal Models (LMMs). Existing methods either compress visual embeddings or apply sparse attention on a single GPU, yielding limited acceleration or degraded performance and restricting LMMs from handling longer, more complex videos. To overcome these issues, we propose Spava, a sequence‑parallel framework with optimized attention that accelerates long‑video inference across multiple GPUs. By distributing approximate attention, Spava reduces computation and increases parallelism, enabling efficient processing of more visual embeddings without compression and thereby improving task performance. System‑level optimizations, such as load balancing and fused forward passes, further unleash the potential of Spava, delivering speedups of 12.72x, 1.70x, and 1.18x over FlashAttn, ZigZagRing, and APB, without notable performance loss. Code available at https://github.com/thunlp/APB
Authors:Xian Shi, Xiong Wang, Zhifang Guo, Yongqi Wang, Pei Zhang, Xinyu Zhang, Zishan Guo, Hongkun Hao, Yu Xi, Baosong Yang, Jin Xu, Jingren Zhou, Junyang Lin
Abstract:
In this report, we introduce Qwen3‑ASR family, which includes two powerful all‑in‑one speech recognition models and a novel non‑autoregressive speech forced alignment model. Qwen3‑ASR‑1.7B and Qwen3‑ASR‑0.6B are ASR models that support language identification and ASR for 52 languages and dialects. Both of them leverage large‑scale speech training data and the strong audio understanding ability of their foundation model Qwen3‑Omni. We conduct comprehensive internal evaluation besides the open‑sourced benchmarks as ASR models might differ little on open‑sourced benchmark scores but exhibit significant quality differences in real‑world scenarios. The experiments reveal that the 1.7B version achieves SOTA performance among open‑sourced ASR models and is competitive with the strongest proprietary APIs while the 0.6B version offers the best accuracy‑efficiency trade‑off. Qwen3‑ASR‑0.6B can achieve an average TTFT as low as 92ms and transcribe 2000 seconds speech in 1 second at a concurrency of 128. Qwen3‑ForcedAligner‑0.6B is an LLM based NAR timestamp predictor that is able to align text‑speech pairs in 11 languages. Timestamp accuracy experiments show that the proposed model outperforms the three strongest force alignment models and takes more advantages in efficiency and versatility. To further accelerate the community research of ASR and audio understanding, we release these models under the Apache 2.0 license.
Authors:Shangbin Feng, Yuyang Bai, Ziyuan Yang, Yike Wang, Zhaoxuan Tan, Jiajie Yan, Zhenyu Lei, Wenxuan Ding, Weijia Shi, Haojin Wang, Zhenting Qi, Yuru Jiang, Heng Wang, Chengsong Huang, Yu Fei, Jihan Yao, Yilun Du, Luke Zettlemoyer, Yejin Choi, Yulia Tsvetkov
Abstract:
Advancing beyond single monolithic language models (LMs), recent research increasingly recognizes the importance of model collaboration, where multiple LMs collaborate, compose, and complement each other. Existing research on this topic has mostly been disparate and disconnected, from different research communities, and lacks rigorous comparison. To consolidate existing research and establish model collaboration as a school of thought, we present MoCo: a one‑stop Python library of executing, benchmarking, and comparing model collaboration algorithms at scale. MoCo features 26 model collaboration methods, spanning diverse levels of cross‑model information exchange such as routing, text, logit, and model parameters. MoCo integrates 25 evaluation datasets spanning reasoning, QA, code, safety, and more, while users could flexibly bring their own data. Extensive experiments with MoCo demonstrate that most collaboration strategies outperform models without collaboration in 61.0% of (model, data) settings on average, with the most effective methods outperforming by up to 25.8%. We further analyze the scaling of model collaboration strategies, the training/inference efficiency of diverse methods, highlight that the collaborative system solves problems where single LMs struggle, and discuss future work in model collaboration, all made possible by MoCo. We envision MoCo as a valuable toolkit to facilitate and turbocharge the quest for an open, modular, decentralized, and collaborative AI future.
Authors:Yanqi Dai, Yuxiang Ji, Xiao Zhang, Yong Wang, Xiangxiang Chu, Zhiwu Lu
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) offers a robust mechanism for enhancing mathematical reasoning in large models. However, we identify a systematic lack of emphasis on more challenging questions in existing methods from both algorithmic and data perspectives, despite their importance for refining underdeveloped capabilities. Algorithmically, widely used Group Relative Policy Optimization (GRPO) suffers from an implicit imbalance where the magnitude of policy updates is lower for harder questions. Data‑wise, augmentation approaches primarily rephrase questions to enhance diversity without systematically increasing intrinsic difficulty. To address these issues, we propose a two‑dual MathForge framework to improve mathematical reasoning by targeting harder questions from both perspectives, which comprises a Difficulty‑Aware Group Policy Optimization (DGPO) algorithm and a Multi‑Aspect Question Reformulation (MQR) strategy. Specifically, DGPO first rectifies the implicit imbalance in GRPO via difficulty‑balanced group advantage estimation, and further prioritizes harder questions by difficulty‑aware question‑level weighting. Meanwhile, MQR reformulates questions across multiple aspects to increase difficulty while maintaining the original gold answer. Overall, MathForge forms a synergistic loop: MQR expands the data frontier, and DGPO effectively learns from the augmented data. Extensive experiments show that MathForge significantly outperforms existing methods on various mathematical reasoning tasks. The code and augmented data are all available at https://github.com/AMAP‑ML/MathForge.
Authors:Zhenxuan Fan, Jie Cao, Yang Dai, Zheqi Lv, Wenqiao Zhang, Zhongle Xie, Peng LU, Beng Chin Ooi
Abstract:
Chain‑of‑thought (CoT) prompting improves LLM reasoning but incurs high latency and memory cost due to verbose traces, motivating CoT compression with preserved correctness. Existing methods either shorten CoTs at the semantic level, which is often conservative, or prune tokens aggressively, which can miss task‑critical cues and degrade accuracy. Moreover, combining the two is non‑trivial due to sequential dependency, task‑agnostic pruning, and distribution mismatch. We propose CtrlCoT, a dual‑granularity CoT compression framework that harmonizes semantic abstraction and token‑level pruning through three components: Hierarchical Reasoning Abstraction produces CoTs at multiple semantic granularities; Logic‑Preserving Distillation trains a logic‑aware pruner to retain indispensable reasoning cues (e.g., numbers and operators) across pruning ratios; and Distribution‑Alignment Generation aligns compressed traces with fluent inference‑time reasoning styles to avoid fragmentation. On MATH‑500 with Qwen2.5‑7B‑Instruct, CtrlCoT uses 30.7% fewer tokens while achieving 7.6 percentage points higher than the strongest baseline, demonstrating more efficient and reliable reasoning. Our code will be publicly available at https://github.com/fanzhenxuan/Ctrl‑CoT.
Authors:Minjae Lee, Wonjun Kang, Byeongkeun Ahn, Christian Classen, Kevin Galim, Seunghyuk Oh, Minghao Yan, Hyung Il Koo, Kangwook Lee
Abstract:
Speculative decoding (SD) has proven effective for accelerating LLM inference by quickly generating draft tokens and verifying them in parallel. However, SD remains largely unexplored for Large Vision‑Language Models (LVLMs), which extend LLMs to process both image and text prompts. To address this gap, we benchmark existing inference methods with small draft models on 11 datasets across diverse input scenarios and observe scenario‑specific performance fluctuations. Motivated by these findings, we propose Test‑time Adaptive Batched Ensemble Drafting (TABED), which dynamically ensembles multiple drafts obtained via batch inference by leveraging deviations from past ground truths available in the SD setting. The dynamic ensemble method achieves an average robust walltime speedup of 1.74x over autoregressive decoding and a 5% improvement over single drafting methods, while remaining training‑free and keeping ensembling costs negligible through parameter sharing. With its plug‑and‑play compatibility, we further enhance TABED by integrating advanced verification and alternative drafting methods. Code and custom‑trained models are available at https://github.com/furiosa‑ai/TABED.
Authors:Zeyu Xing, Xing Li, Hui-Ling Zhen, Mingxuan Yuan, Sinno Jialin Pan
Abstract:
KV caches, typically used only to speed up autoregressive decoding, encode contextual information that can be reused for downstream tasks at no extra cost. We propose treating the KV cache as a lightweight representation, eliminating the need to recompute or store full hidden states. Despite being weaker than dedicated embeddings, KV‑derived representations are shown to be sufficient for two key applications: (i) Chain‑of‑Embedding, where they achieve competitive or superior performance on Llama‑3.1‑8B‑Instruct and Qwen2‑7B‑Instruct; and (ii) Fast/Slow Thinking Switching, where they enable adaptive reasoning on Qwen3‑8B and DeepSeek‑R1‑Distil‑Qwen‑14B, reducing token generation by up to 5.7× with minimal accuracy loss. Our findings establish KV caches as a free, effective substrate for sampling and reasoning, opening new directions for representation reuse in LLM inference. Code: https://github.com/cmd2001/ICLR2026_KV‑Embedding.
Authors:Haoyuan Yu, Yuxuan Chen, Minjie Cai
Abstract:
Full‑duplex voice interaction is crucial for natural human computer interaction. We present a framework that decomposes complex dialogue into minimal conversational units, enabling the system to process each unit independently and predict when to transit to the next. This framework is instantiated as a semi‑cascaded full‑duplex dialogue system built around a multimodal large language model, supported by auxiliary modules such as voice activity detection (VAD) and text‑to‑speech (TTS) synthesis. The resulting system operates in a train‑free, plug‑and‑play manner. Experiments on the HumDial dataset demonstrate the effectiveness of our framework, which ranks second among all teams on the test set of the Human‑like Spoken Dialogue Systems Challenge (Track 2: Full‑Duplex Interaction). Code is available at the GitHub repository https://github.com/yu‑haoyuan/fd‑badcat.
Authors:Husein Zolkepli
Abstract:
X‑Codec‑2.0 has shown strong performance in neural audio compression and multilingual speech modeling, operating at a 50 Hz latent rate and a 16 kHz sampling rate using frozen HuBERT features. While effective, this configuration limits temporal efficiency and audio fidelity. In this work, we explore a simple and effective modification by introducing additional pooling and increasing the decoder hop size. This reduces the latent rate from 50 Hz to 25 Hz and simultaneously raises the output sampling rate from 16 kHz to 24 kHz, improving efficiency and perceptual quality without altering the core architecture. Evaluated on the multilingual Common Voice 17 test set, the proposed configuration achieves a 0.29 MOS improvement over the original X‑Codec‑2.0 baseline based on UTMOSv2, and attains the best reported performance among all codecs operating at 25 Hz. The source code, checkpoints, and generation comparisons are released at \hrefhttps://huggingface.co/Scicom‑intl/xcodec2‑25TPS‑24khttps://huggingface.co/Scicom‑intl/xcodec2‑25TPS‑24k.
Authors:Jim Maar, Denis Paperno, Callum Stuart McDougall, Neel Nanda
Abstract:
Prior work suggests that language models, while trained on next token prediction, show implicit planning behavior: they may select the next token in preparation to a predicted future token, such as a likely rhyming word, as supported by a prior qualitative study of Claude 3.5 Haiku using a cross‑layer transcoder. We propose much simpler techniques for assessing implicit planning in language models. With case studies on rhyme poetry generation and question answering, we demonstrate that our methodology easily scales to many models. Across models, we find that the generated rhyme (e.g. "‑ight") or answer to a question ("whale") can be manipulated by steering at the end of the preceding line with a vector, affecting the generation of intermediate tokens leading up to the rhyme or answer word. We show that implicit planning is a universal mechanism, present in smaller models than previously thought, starting from 1B parameters. Our methodology offers a widely applicable direct way to study implicit planning abilities of LLMs. More broadly, understanding planning abilities of language models can inform decisions in AI safety and control.
Authors:Abha Jha, Akanksha Mahajan, Ashwath Vaithinathan Aravindan, Praveen Saravanan, Sai Sailaja Policharla, Sonal Chaturbhuj Gehlot
Abstract:
Large Language Models (LLMs) often produce hallucinated or unverifiable content, undermining their reliability in factual domains. This work investigates Reinforcement Learning with Verifiable Rewards (RLVR) as a training paradigm that explicitly rewards abstention ("I don't know") alongside correctness to promote intellectual humility. We fine‑tune and evaluate Granite‑3.3‑2B‑Instruct and Qwen‑3‑4B‑Instruct on the MedMCQA and Hendrycks Math benchmarks using a ternary reward structure (‑1, r_abs, 1) under varying abstention reward structures. We further study the effect of combining RLVR with supervised fine‑tuning strategies that teach abstention prior to reinforcement learning. Our results show that moderate abstention rewards (r_abs \approx ‑0.25 to 0.3) consistently reduce incorrect responses without severe accuracy degradation on multiple‑choice tasks, with larger models exhibiting greater robustness to abstention incentives. On open‑ended question answering, we observe limitations due to insufficient exploration, which can be partially mitigated through supervised abstention training. Overall, these findings demonstrate the feasibility and flexibility of verifiable reward design as a practical approach for hallucination mitigation in language models. Reproducible code for our abstention training framework is available here https://github.com/Mystic‑Slice/rl‑abstention.
Authors:Yitian Chen, Cheng Cheng, Yinan Sun, Zi Ling, Dongdong Ge
Abstract:
Large Language Models (LLMs) have demonstrated impressive progress in optimization modeling, fostering a rapid expansion of new methodologies and evaluation benchmarks. However, the boundaries of their capabilities in automated formulation and problem solving remain poorly understood, particularly when extending to complex, real‑world tasks. To bridge this gap, we propose OPT‑ENGINE, an extensible benchmark framework designed to evaluate LLMs on optimization modeling with controllable and scalable difficulty levels. OPT‑ENGINE spans 10 canonical tasks across operations research, with five Linear Programming and five Mixed‑Integer Programming. Utilizing OPT‑ENGINE, we conduct an extensive study of LLMs' reasoning capabilities, addressing two critical questions: 1.) Do LLMs' performance remain robust when generalizing to out‑of‑distribution optimization tasks that scale in complexity beyond current benchmark levels? and 2.) At what stage, from problem interpretation to solution generation, do current LLMs encounter the most significant bottlenecks? Our empirical results yield two key insights: first, tool‑integrated reasoning with external solvers exhibits significantly higher robustness as task complexity escalates, while pure‑text reasoning reaches a ceiling; second, the automated formulation of constraints constitutes the primary performance bottleneck. These findings provide actionable guidance for developing next‑generation LLMs for advanced optimization. Our code is publicly available at \textcolorbluehttps://github.com/Cardinal‑Operations/OPTEngine.
Authors:Nicholas Cheng
Abstract:
Low‑resource languages such as isiZulu and isiXhosa face persistent challenges in machine translation due to limited parallel data and linguistic resources. Recent advances in large language models suggest that self‑reflection, prompting a model to critique and revise its own outputs, can improve reasoning quality and factual consistency. Building on this idea, this paper introduces Reflective Translation, a prompt‑based framework in which a model generates an initial translation, produces a structured self‑critique, and then uses this reflection to generate a refined translation. The approach is evaluated on English‑isiZulu and English‑isiXhosa translation using OPUS‑100 and NTREX‑African, across multiple prompting strategies and confidence thresholds. Results show consistent improvements in both BLEU and COMET scores between first‑ and second‑pass translations, with average gains of up to +0.22 BLEU and +0.18 COMET. Statistical significance testing using paired nonparametric tests confirms that these improvements are robust. The proposed method is model‑agnostic, requires no fine‑tuning, and introduces a reflection‑augmented dataset that can support future supervised or analysis‑driven work. These findings demonstrate that structured self‑reflection is a practical and effective mechanism for improving translation quality in low‑resource settings.
Authors:Zhuohan Long, Zhijie Bao, Zhongyu Wei
Abstract:
Interactive medical consultation requires an agent to proactively elicit missing clinical evidence under uncertainty. Yet existing evaluations largely remain static or outcome‑centric, neglecting the evidence‑gathering process. In this work, we propose an interactive evaluation framework that explicitly models the consultation process using a simulated patient and a \revsimulated reporter grounded in atomic evidences. Based on this representation, we introduce Information Coverage Rate (ICR) to quantify how completely an agent uncovers necessary evidence during interaction. To support systematic study, we build EviMed, an evidence‑based benchmark spanning diverse conditions from common complaints to rare diseases, and evaluate 10 models with varying reasoning abilities. We find that strong diagnostic reasoning does not guarantee effective information collection, and this insufficiency acts as a primary bottleneck limiting performance in interactive settings. To address this, we propose REFINE, a strategy that leverages diagnostic verification to guide the agent in proactively resolving uncertainties. Extensive experiments demonstrate that REFINE consistently outperforms baselines across diverse datasets and facilitates effective model collaboration, enabling smaller agents to achieve superior performance under strong reasoning supervision. Our code can be found at https://github.com/NanshineLoong/EID‑Benchmark .
Authors:Weicong Liu, Zixuan Yang, Yibo Zhao, Xiang Li
Abstract:
Reviewer assignment is increasingly critical yet challenging in the LLM era, where rapid topic shifts render many pre‑2023 benchmarks outdated and where proxy signals poorly reflect true reviewer familiarity. We address this evaluation bottleneck by introducing LR‑bench, a high‑fidelity, up‑to‑date benchmark curated from 2024‑2025 AI/NLP manuscripts with five‑level self‑assessed familiarity ratings collected via a large‑scale email survey, yielding 1055 expert‑annotated paper‑reviewer‑score annotations. We further propose RATE, a reviewer‑centric ranking framework that distills each reviewer's recent publications into compact keyword‑based profiles and fine‑tunes an embedding model with weak preference supervision constructed from heuristic retrieval signals, enabling matching each manuscript against a reviewer profile directly. Across LR‑bench and the CMU gold‑standard dataset, our approach consistently achieves state‑of‑the‑art performance, outperforming strong embedding baselines by a clear margin. We release LR‑bench at https://huggingface.co/datasets/Gnociew/LR‑bench, and a GitHub repository at https://github.com/Gnociew/RATE‑Reviewer‑Assign.
Authors:Fuliang Liu, Xue Li, Ketai Zhao, Yinxi Gao, Ziyan Zhou, Zhonghui Zhang, Zhibin Wang, Wanchun Dou, Sheng Zhong, Chen Tian
Abstract:
Speculative decoding is an effective and lossless approach for accelerating LLM inference. However, existing widely adopted model‑based draft designs, such as EAGLE3, improve accuracy at the cost of multi‑step autoregressive inference, resulting in high drafting latency and ultimately rendering the drafting stage itself a performance bottleneck. Inspired by diffusion‑based large language models (dLLMs), we propose DART, which leverages parallel generation to reduce drafting latency. DART predicts logits for multiple future masked positions in parallel within a single forward pass based on hidden states of the target model, thereby eliminating autoregressive rollouts in the draft model while preserving a lightweight design. Based on these parallel logit predictions, we further introduce an efficient tree pruning algorithm that constructs high‑quality draft token trees with N‑gram‑enforced semantic continuity. DART substantially reduces draft‑stage overhead while preserving high draft accuracy, leading to significantly improved end‑to‑end decoding speed. Experimental results demonstrate that DART achieves a 2.03x‑‑3.44x wall‑clock time speedup across multiple datasets, surpassing EAGLE3 by 30% on average and offering a practical speculative decoding framework. Code is released at https://github.com/fvliang/DART.
Authors:Haozheng Luo, Zhuolin Jiang, Md Zahid Hasan, Yan Chen, Soumalya Sarkar
Abstract:
We propose FROST, an attention‑aware method for efficient reasoning. Unlike traditional approaches, FROST leverages attention weights to prune uncritical reasoning paths, yielding shorter and more reliable reasoning trajectories. Methodologically, we introduce the concept of reasoning outliers and design an attention‑based mechanism to remove them. Theoretically, FROST preserves and enhances the model's reasoning capacity while eliminating outliers at the sentence level. Empirically, we validate FROST on four benchmarks using two strong reasoning models (Phi‑4‑Reasoning and GPT‑OSS‑20B), outperforming state‑of‑the‑art methods such as TALE and ThinkLess. Notably, FROST achieves an average 69.68% reduction in token usage and a 26.70% improvement in accuracy over the base model. Furthermore, in evaluations of attention outlier metrics, FROST reduces the maximum infinity norm by 15.97% and the average kurtosis by 91.09% compared to the base model. Code is available at https://github.com/robinzixuan/FROST
Authors:Shobhita Sundaram, John Quan, Ariel Kwiatkowski, Kartik Ahuja, Yann Ollivier, Julia Kempe
Abstract:
Can a model learn to escape its own learning plateau? Reinforcement learning methods for finetuning large reasoning models stall on datasets with low initial success rates, and thus little training signal. We investigate a fundamental question: Can a pretrained LLM leverage latent knowledge to generate an automated curriculum for problems it cannot solve? To explore this, we design SOAR: A self‑improvement framework designed to surface these pedagogical signals through meta‑RL. A teacher copy of the model proposes synthetic problems for a student copy, and is rewarded with its improvement on a small subset of hard problems. Critically, SOAR grounds the curriculum in measured student progress rather than intrinsic proxy rewards. Our study on the hardest subsets of mathematical benchmarks (0/128 success) reveals three core findings. First, we show that it is possible to realize bi‑level meta‑RL that unlocks learning under sparse, binary rewards by sharpening a latent capacity of pretrained models to generate useful stepping stones. Second, grounded rewards outperform intrinsic reward schemes used in prior LLM self‑play, reliably avoiding the instability and diversity collapse modes they typically exhibit. Third, analyzing the generated questions reveals that structural quality and well‑posedness are more critical for learning progress than solution correctness. Our results suggest that the ability to generate useful stepping stones does not require the preexisting ability to actually solve the hard problems, paving a principled path to escape reasoning plateaus without additional curated data.
Authors:Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, Aditya Grover
Abstract:
Knowledge distillation improves large language model (LLM) reasoning by compressing the knowledge of a teacher LLM to train smaller LLMs. On‑policy distillation advances this approach by having the student sample its own trajectories while a teacher LLM provides dense token‑level supervision, addressing the distribution mismatch between training and inference in off‑policy distillation methods. However, on‑policy distillation typically requires a separate, often larger, teacher LLM and does not explicitly leverage ground‑truth solutions available in reasoning datasets. Inspired by the intuition that a sufficiently capable LLM can rationalize external privileged reasoning traces and teach its weaker self (i.e., the version without access to privileged information), we introduce On‑Policy Self‑Distillation (OPSD), a framework where a single model acts as both teacher and student by conditioning on different contexts. The teacher policy conditions on privileged information (e.g., verified reasoning traces) while the student policy sees only the question; training minimizes the per‑token divergence between these distributions over the student's own rollouts. We demonstrate the efficacy of our method on multiple mathematical reasoning benchmarks, achieving 8‑12x token efficiency compared to reinforcement learning methods such as GRPO and superior performance over off‑policy distillation methods.
Authors:Yuxin Jiang, Yufei Wang, Qiyuan Zhang, Xingshan Zeng, Liangyou Li, Jierun Chen, Chaofan Tao, Haoli Bai, Lifeng Shang
Abstract:
Reinforcement learning with verifiable rewards (RLVR) succeeds in reasoning tasks (e.g., math and code) by checking the final verifiable answer (i.e., a verifiable dot signal). However, extending this paradigm to open‑ended generation is challenging because there is no unambiguous ground truth. Relying on single‑dot supervision often leads to inefficiency and reward hacking. To address these issues, we propose reinforcement learning with verifiable reference‑based rewards (RLVRR). Instead of checking the final answer, RLVRR extracts an ordered linguistic signal from high‑quality references (i.e, reward chain). Specifically, RLVRR decomposes rewards into two dimensions: content, which preserves deterministic core concepts (e.g., keywords), and style, which evaluates adherence to stylistic properties through LLM‑based verification. In this way, RLVRR combines the exploratory strength of RL with the efficiency and reliability of supervised fine‑tuning (SFT). Extensive experiments on more than 10 benchmarks with Qwen and Llama models confirm the advantages of our approach. RLVRR (1) substantially outperforms SFT trained with ten times more data and advanced reward models, (2) unifies the training of structured reasoning and open‑ended generation, and (3) generalizes more effectively while preserving output diversity. These results establish RLVRR as a principled and efficient path toward verifiable reinforcement learning for general‑purpose LLM alignment. We release our code and data at https://github.com/YJiangcm/RLVRR.
Authors:Ivan Bondarenko, Daniil Grebenkin, Oleg Sedukhin, Mikhail Klementev, Roman Derunets, Lyudmila Budneva
Abstract:
This work presents a speech‑to‑text system "Pisets" for scientists and journalists which is based on a three‑component architecture aimed at improving speech recognition accuracy while minimizing errors and hallucinations associated with the Whisper model. The architecture comprises primary recognition using Wav2Vec2, false positive filtering via the Audio Spectrogram Transformer (AST), and final speech recognition through Whisper. The implementation of curriculum learning methods and the utilization of diverse Russian‑language speech corpora significantly enhanced the system's effectiveness. Additionally, advanced uncertainty modeling techniques were introduced, contributing to further improvements in transcription quality. The proposed approaches ensure robust transcribing of long audio data across various acoustic conditions compared to WhisperX and the usual Whisper model. The source code of "Pisets" system is publicly available at GitHub: https://github.com/bond005/pisets.
Authors:Zhengyang Li, Thomas Graave, Björn Möller, Zehang Wu, Matthias Franz, Tim Fingscheidt
Abstract:
In audiovisual automatic speech recognition (AV‑ASR) systems, information fusion of visual features in a pre‑trained ASR has been proven as a promising method to improve noise robustness. In this work, based on the prominent Whisper ASR, first, we propose a simple and effective visual fusion method ‑‑ use of visual features both in encoder and decoder (dual‑use) ‑‑ to learn the audiovisual interactions in the encoder and to weigh modalities in the decoder. Second, we compare visual fusion methods in Whisper models of various sizes. Our proposed dual‑use method shows consistent noise robustness improvement, e.g., a 35% relative improvement (WER: 4.41% vs. 6.83%) based on Whisper small, and a 57% relative improvement (WER: 4.07% vs. 9.53%) based on Whisper medium, compared to typical reference middle fusion in babble noise with a signal‑to‑noise ratio (SNR) of 0dB. Third, we conduct ablation studies examining the impact of various module designs and fusion options. Fine‑tuned on 1929 hours of audiovisual data, our dual‑use method using Whisper medium achieves 4.08% (MUSAN babble noise) and 4.43% (NoiseX babble noise) average WER across various SNRs, thereby establishing a new state‑of‑the‑art in noisy conditions on the LRS3 AV‑ASR benchmark. Our code is at https://github.com/ifnspaml/Dual‑Use‑AVASR
Authors:Zhixian Zhao, Wenjie Tian, Xiaohai Tian, Jun Zhang, Lei Xie
Abstract:
Multimodal emotion analysis is shifting from static classification to generative reasoning. Beyond simple label prediction, robust affective reasoning must synthesize fine‑grained signals such as facial micro‑expressions and prosodic which shifts to decode the latent causality within complex social contexts. However, current Multimodal Large Language Models (MLLMs) face significant limitations in fine‑grained perception, primarily due to data scarcity and insufficient cross‑modal fusion. As a result, these models often exhibit unimodal dominance which leads to hallucinations in complex multimodal interactions, particularly when visual and acoustic cues are subtle, ambiguous, or even contradictory (e.g., in sarcastic scenery). To address this, we introduce SABER‑LLM, a framework designed for robust multimodal reasoning. First, we construct SABER, a large‑scale emotion reasoning dataset comprising 600K video clips, annotated with a novel six‑dimensional schema that jointly captures audiovisual cues and causal logic. Second, we propose the structured evidence decomposition paradigm, which enforces a "perceive‑then‑reason" separation between evidence extraction and reasoning to alleviate unimodal dominance. The ability to perceive complex scenes is further reinforced by consistency‑aware direct preference optimization, which explicitly encourages alignment among modalities under ambiguous or conflicting perceptual conditions. Experiments on EMER, EmoBench‑M, and SABER‑Test demonstrate that SABER‑LLM significantly outperforms open‑source baselines and achieves robustness competitive with closed‑source models in decoding complex emotional dynamics. The dataset and model are available at https://github.com/zxzhao0/SABER‑LLM.
Authors:Zhaoyan Gong, Zhiqiang Liu, Songze Li, Xiaoke Guo, Yuanxiang Liu, Xinle Deng, Zhizhen Liu, Lei Liang, Huajun Chen, Wen Zhang
Abstract:
Temporal Knowledge Graph Question Answering (TKGQA) is inherently challenging, as it requires sophisticated reasoning over dynamic facts with multi‑hop dependencies and complex temporal constraints. Existing methods rely on fixed workflows and expensive closed‑source APIs, limiting flexibility and scalability. We propose Temp‑R1, the first autonomous end‑to‑end agent for TKGQA trained through reinforcement learning. To address cognitive overload in single‑action reasoning, we expand the action space with specialized internal actions alongside external action. To prevent shortcut learning on simple questions, we introduce reverse curriculum learning that trains on difficult questions first, forcing the development of sophisticated reasoning before transferring to easier cases. Our 8B‑parameter Temp‑R1 achieves state‑of‑the‑art performance on MultiTQ and TimelineKGQA, improving 19.8% over strong baselines on complex questions. Our work establishes a new paradigm for autonomous temporal reasoning agents. Our code will be publicly available soon at https://github.com/zjukg/Temp‑R1.
Authors:Kunat Pipatanakul, Pittawat Taveekitworachai
Abstract:
Large language models (LLMs) have progressed rapidly; however, most state‑of‑the‑art models are trained and evaluated primarily in high‑resource languages such as English and Chinese, and are often developed by a small number of organizations with access to large‑scale compute and data. This gatekeeping creates a practical barrier for sovereign settings in which a regional‑ or national‑scale institution or domain owner must retain control and understanding of model weights, training data, and deployment while operating under limited resources and strict transparency constraints. To this end, we identify two core requirements: (1) adoptability, the ability to transform a base model into a general‑purpose assistant, and (2) sovereign capability, the ability to perform high‑stakes, region‑specific tasks (e.g., legal reasoning in local languages and cultural knowledge). We investigate whether these requirements can be achieved without scaling massive instruction corpora or relying on complex preference tuning pipelines and large‑scale reinforcement fine‑tuning (RFT). We present Typhoon S, a minimal and open post‑training recipe that combines supervised fine‑tuning, on‑policy distillation, and small‑scale RFT. Using Thai as a representative case study, we demonstrate that our approach transforms both sovereign‑adapted and general‑purpose base models into instruction‑tuned models with strong general performance. We further show that small‑scale RFT with InK‑GRPO ‑‑ an extension of GRPO that augments the GRPO loss with a next‑word prediction loss ‑‑ improves Thai legal reasoning and Thai‑specific knowledge while preserving general capabilities. Our results suggest that a carefully designed post‑training strategy can reduce the required scale of instruction data and computation, providing a practical path toward high‑quality sovereign LLMs under academic‑scale resources.
Authors:Dain Kim, Jiwoo Lee, Jaehoon Yun, Yong Hoe Koo, Qingyu Chen, Hyunjae Kim, Jaewoo Kang
Abstract:
Large Vision‑Language Models (LVLMs) hold significant promise for medical applications, yet their deployment is often constrained by insufficient alignment and reliability. While Direct Preference Optimization (DPO) has emerged as a potent framework for refining model responses, its efficacy in high‑stakes medical contexts remains underexplored, lacking the rigorous empirical groundwork necessary to guide future methodological advances. To bridge this gap, we present the first comprehensive examination of diverse DPO variants within the medical domain, evaluating nine distinct formulations across two medical LVLMs: LLaVA‑Med and HuatuoGPT‑Vision. Our results reveal several critical limitations: current DPO approaches often yield inconsistent gains over supervised fine‑tuning, with their efficacy varying significantly across different tasks and backbones. Furthermore, they frequently fail to resolve fundamental visual misinterpretation errors. Building on these insights, we present a targeted preference construction strategy as a proof‑of‑concept that explicitly addresses visual misinterpretation errors frequently observed in existing DPO models. This design yields a 3.6% improvement over the strongest existing DPO baseline on visual question‑answering tasks. To support future research, we release our complete framework, including all training data, model checkpoints, and our codebase at https://github.com/dmis‑lab/med‑vlm‑dpo.
Authors:Zhongyu Xiao, Zhiwei Hao, Jianyuan Guo, Yong Luo, Jia Liu, Jie Xu, Han Hu
Abstract:
Diffusion Large Language Models (dLLMs) offer a compelling paradigm for natural language generation, leveraging parallel decoding and bidirectional attention to achieve superior global coherence compared to autoregressive models. While recent works have accelerated inference via KV cache reuse or heuristic decoding, they overlook the intrinsic inefficiencies within the block‑wise diffusion process. Specifically, they suffer from spatial redundancy by modeling informative‑sparse suffix regions uniformly and temporal inefficiency by applying fixed denoising schedules across all the decoding process. To address this, we propose Streaming‑dLLM, a training‑free framework that streamlines inference across both spatial and temporal dimensions. Spatially, we introduce attenuation guided suffix modeling to approximate the full context by pruning redundant mask tokens. Temporally, we employ a dynamic confidence aware strategy with an early exit mechanism, allowing the model to skip unnecessary iterations for converged tokens. Extensive experiments show that Streaming‑dLLM achieves up to 68.2X speedup while maintaining generation quality, highlighting its effectiveness in diffusion decoding. The code is available at https://github.com/xiaoshideta/Streaming‑dLLM.
Authors:Qi Zhan, Yile Wang, Hui Huang
Abstract:
Named entity recognition (NER) is evolving from a sequence labeling task into a generative paradigm with the rise of large language models (LLMs). We conduct a systematic evaluation of open‑source LLMs on both flat and nested NER tasks. We investigate several research questions including the performance gap between generative NER and traditional NER models, the impact of output formats, whether LLMs rely on memorization, and the preservation of general capabilities after fine‑tuning. Through experiments across eight LLMs of varying scales and four standard NER datasets, we find that: (1) With parameter‑efficient fine‑tuning and structured formats like inline bracketed or XML, open‑source LLMs achieve performance competitive with traditional encoder‑based models and surpass closed‑source LLMs like GPT‑3; (2) The NER capability of LLMs stems from instruction‑following and generative power, not mere memorization of entity‑label pairs; and (3) Applying NER instruction tuning has minimal impact on general capabilities of LLMs, even improving performance on datasets like DROP due to enhanced entity understanding. These findings demonstrate that generative NER with LLMs is a promising, user‑friendly alternative to traditional methods. We release the data and code at https://github.com/szu‑tera/LLMs4NER.
Authors:Pranav Kasela, Marco Braga, Alessandro Ghiotto, Andrea Pilzer, Marco Viviani, Alessandro Raganato
Abstract:
In this paper, we present DIETA, a small, decoder‑only Transformer model with 0.5 billion parameters, specifically designed and trained for Italian‑English machine translation. We collect and curate a large parallel corpus consisting of approximately 207 million Italian‑English sentence pairs across diverse domains, including parliamentary proceedings, legal texts, web‑crawled content, subtitles, news, literature and 352 million back‑translated data using pretrained models. Additionally, we create and release a new small‑scale evaluation set, consisting of 450 sentences, based on 2025 WikiNews articles, enabling assessment of translation quality on contemporary text. Comprehensive evaluations show that DIETA achieves competitive performance on multiple Italian‑English benchmarks, consistently ranking in the second quartile of a 32‑system leaderboard and outperforming most other sub‑3B models on four out of five test suites. The training script, trained models, curated corpus, and newly introduced evaluation set are made publicly available, facilitating further research and development in specialized Italian‑English machine translation. https://github.com/pkasela/DIETA‑Machine‑Translation
Authors:Yixin Liu, Kehan Yan, Shiyuan Li, Qingfeng Chen, Shirui Pan
Abstract:
Text anomaly detection (TAD) plays a critical role in various language‑driven real‑world applications, including harmful content moderation, phishing detection, and spam review filtering. While two‑step "embedding‑detector" TAD methods have shown state‑of‑the‑art performance, their effectiveness is often limited by the use of a single embedding model and the lack of adaptability across diverse datasets and anomaly types. To address these limitations, we propose to exploit the embeddings from multiple pretrained language models and integrate them into MCA^2, a multi‑view TAD framework. MCA^2 adopts a multi‑view reconstruction model to effectively extract normal textual patterns from multiple embedding perspectives. To exploit inter‑view complementarity, a contrastive collaboration module is designed to leverage and strengthen the interactions across different views. Moreover, an adaptive allocation module is developed to automatically assign the contribution weight of each view, thereby improving the adaptability to diverse datasets. Extensive experiments on 10 benchmark datasets verify the effectiveness of MCA^2 against strong baselines. The source code of MCA^2 is available at https://github.com/yankehan/MCA2.
Authors:Saptarshi Ghosh, Linfeng Liu, Tianyu Jiang
Abstract:
Images often communicate more than they literally depict: a set of tools can suggest an occupation and a cultural artifact can suggest a tradition. This kind of indirect visual reference, known as visual metonymy, invites viewers to recover a target concept via associated cues rather than explicit depiction. In this work, we present the first computational investigation of visual metonymy. We introduce a novel pipeline grounded in semiotic theory that leverages large language models and text‑to‑image models to generate metonymic visual representations. Using this framework, we construct ViMET, the first visual metonymy dataset comprising 2,000 multiple‑choice questions to evaluate the cognitive reasoning abilities in multimodal language models. Experimental results on our dataset reveal a significant gap between human performance (86.9%) and state‑of‑the‑art vision‑language models (65.9%), highlighting limitations in machines' ability to interpret indirect visual references. Our dataset is publicly available at: https://github.com/cincynlp/ViMET.
Authors:Emmanouil Georgios Lionis, Jia-Huei Ju, Angelos Nalmpantis, Casper Thuis, Sean MacAvaney, Andrew Yates
Abstract:
Learned Sparse Retrieval (LSR) methods construct sparse lexical representations of queries and documents that can be efficiently searched using inverted indexes. Existing LSR approaches have relied almost exclusively on uncased backbone models, whose vocabularies exclude case‑sensitive distinctions, thereby reducing vocabulary mismatch. However, the most recent state‑of‑the‑art language models are only available in cased versions. Despite this shift, the impact of backbone model casing on LSR has not been studied, potentially posing a risk to the viability of the method going forward. To fill this gap, we systematically evaluate paired cased and uncased versions of the same backbone models across multiple datasets to assess their suitability for LSR. Our findings show that LSR models with cased backbone models by default perform substantially worse than their uncased counterparts; however, this gap can be eliminated by pre‑processing the text to lowercase. Moreover, our token‑level analysis reveals that, under lowercasing, cased models almost entirely suppress cased vocabulary items and behave effectively as uncased models, explaining their restored performance. This result broadens the applicability of recent cased models to the LSR setting and facilitates the integration of stronger backbone architectures into sparse retrieval. The complete code and implementation for this project are available at: https://github.com/lionisakis/Uncased‑vs‑cased‑models‑in‑LSR
Authors:Mohammed Fasha, Bassam Hammo, Bilal Sowan, Husam Barham, Esam Nsour
Abstract:
This study uses Jordanian law as a case study to explore the fine‑tuning of the Llama‑3.1 large language model for Arabic question‑answering. Two versions of the model ‑ Llama‑3.1‑8B‑bnb‑4bit and Llama‑3.1‑8B‑Instruct‑bnb‑4bit ‑ were fine‑tuned using parameter‑efficient fine‑tuning (PEFT) with LoRA adapters and 4‑bit quantized models, leveraging the Unsloth framework for accelerated and resource‑efficient training. A custom dataset of 6000 legal question‑answer pairs was curated from Jordanian laws and formatted into structured prompts. Performance was evaluated using the BLEU and the ROUGE metrics to compare the fine‑tuned models to their respective base versions. Results demonstrated improved legal reasoning and accuracy while achieving resource efficiency through quantization and optimized fine‑tuning strategies. This work underscores the potential of adapting large language models for Arabic legal domains and highlights effective techniques for fine‑tuning domain‑specific tasks.
Authors:Yaokun Liu, Yifan Liu, Phoebe Mbuvi, Zelin Li, Ruichen Yao, Gawon Lim, Dong Wang
Abstract:
The deployment of Large Language Models in Medical Question Answering is severely hampered by ambiguous user queries, a significant safety risk that demonstrably reduces answer accuracy in high‑stakes healthcare settings. In this paper, we formalize this challenge by linking input ambiguity to aleatoric uncertainty (AU), which is the irreducible uncertainty arising from underspecified input. To facilitate research in this direction, we construct CV‑MedBench, the first benchmark designed for studying input ambiguity in Medical QA. Using this benchmark, we analyze AU from a representation engineering perspective, revealing that AU is linearly encoded in LLM's internal activation patterns. Leveraging this insight, we introduce a novel AU‑guided "Clarify‑Before‑Answer" framework, which incorporates AU‑Probe ‑ a lightweight module that detects input ambiguity directly from hidden states. Unlike existing uncertainty estimation methods, AU‑Probe requires neither LLM fine‑tuning nor multiple forward passes, enabling an efficient mechanism to proactively request user clarification and significantly enhance safety. Extensive experiments across four open LLMs demonstrate the effectiveness of our QA framework, with an average accuracy improvement of 9.48% over baselines. Our framework provides an efficient and robust solution for safe Medical QA, strengthening the reliability of health‑related applications. The code is available at https://github.com/yaokunliu/AU‑Med.git, and the CV‑MedBench dataset is released on Hugging Face at https://huggingface.co/datasets/yaokunl/CV‑MedBench.
Authors:Seyyed Saeid Cheshmi, Hahnemann Ortiz, James Mooney, Dongyeop Kang
Abstract:
Vision‑language models (VLMs) have demonstrated strong reasoning abilities in literal multimodal tasks such as visual mathematics and science question answering. However, figurative language, such as sarcasm, humor, and metaphor, remains a significant challenge, as it conveys intent and emotion through subtle incongruities between expressed and intended meanings. In multimodal settings, accompanying images can amplify or invert textual meaning, demanding models that reason across modalities and account for subjectivity. We propose a three‑step framework for developing efficient multimodal reasoning models that can (i) interpret multimodal figurative language, (ii) provide transparent reasoning traces, and (iii) generalize across multiple figurative styles. Experiments across four styles show that (1) incorporating reasoning traces substantially improves multimodal figurative understanding, (2) reasoning learned in one style can transfer to others, especially between related styles like sarcasm and humor, and (3) training jointly across styles yields a generalized reasoning VLM that outperforms much larger open‑ and closed‑source models. Our findings show that lightweight VLMs with verifiable reasoning achieve robust cross‑style generalization while providing inspectable reasoning traces for multimodal tasks. The code and implementation are available at https://github.com/scheshmi/CrossStyle‑MMR.
Authors:Parth Bhalerao, Diola Dsouza, Ruiwen Guan, Oana Ignat
Abstract:
Question answering systems are typically evaluated on factual correctness, yet many real‑world applications‑such as education and career guidance‑require mentorship: responses that provide reflection and guidance. Existing QA benchmarks rarely capture this distinction, particularly in multilingual and long‑form settings. We introduce MentorQA, the first multilingual dataset and evaluation framework for mentorship‑focused question answering from long‑form videos, comprising nearly 9,000 QA pairs from 180 hours of content across four languages. We define mentorship‑focused evaluation dimensions that go beyond factual accuracy, capturing clarity, alignment, and learning value. Using MentorQA, we compare Single‑Agent, Dual‑Agent, RAG, and Multi‑Agent QA architectures under controlled conditions. Multi‑Agent pipelines consistently produce higher‑quality mentorship responses, with especially strong gains for complex topics and lower‑resource languages. We further analyze the reliability of automated LLM‑based evaluation, observing substantial variation in alignment with human judgments. Overall, this work establishes mentorship‑focused QA as a distinct research problem and provides a multilingual benchmark for studying agentic architectures and evaluation design in educational AI. The dataset and evaluation framework are released at https://github.com/AIM‑SCU/MentorQA.
Authors:Haoxuan Li, He Chang, Yunshan Ma, Yi Bin, Yang Yang, See-Kiong Ng, Tat-Seng Chua
Abstract:
Event forecasting is inherently influenced by multifaceted considerations, including international relations, regional historical dynamics, and cultural contexts. However, existing LLM‑based approaches employ single‑model architectures that generate predictions along a singular explicit trajectory, constraining their ability to capture diverse geopolitical nuances across complex regional contexts. To address this limitation, we introduce ThinkTank‑ME, a novel Think Tank framework for Middle East event forecasting that emulates collaborative expert analysis in real‑world strategic decision‑making. To facilitate expert specialization and rigorous evaluation, we construct POLECAT‑FOR‑ME, a Middle East‑focused event forecasting benchmark. Experimental results demonstrate the superiority of multi‑expert collaboration in handling complex temporal geopolitical forecasting tasks. The code is available at https://github.com/LuminosityX/ThinkTank‑ME.
Authors:Wei Zhou, Jun Zhou, Haoyu Wang, Zhenghao Li, Qikang He, Shaokun Han, Guoliang Li, Xuanhe Zhou, Yeye He, Chunwei Liu, Zirui Tang, Bin Wang, Shen Tang, Kai Zuo, Yuyu Luo, Zhenzhe Zheng, Conghui He, Jingren Zhou, Fan Wu
Abstract:
Data preparation aims to denoise raw datasets, uncover cross‑dataset relationships, and extract valuable insights from them, which is essential for a wide range of data‑centric applications. Driven by (i) rising demands for application‑ready data (e.g., for analytics, visualization, decision‑making), (ii) increasingly powerful LLM techniques, and (iii) the emergence of infrastructures that facilitate flexible agent construction (e.g., using Databricks Unity Catalog), LLM‑enhanced methods are rapidly becoming a transformative and potentially dominant paradigm for data preparation. By investigating hundreds of recent literature works, this paper presents a systematic review of this evolving landscape, focusing on the use of LLM techniques to prepare data for diverse downstream tasks. First, we characterize the fundamental paradigm shift, from rule‑based, model‑specific pipelines to prompt‑driven, context‑aware, and agentic preparation workflows. Next, we introduce a task‑centric taxonomy that organizes the field into three major tasks: data cleaning (e.g., standardization, error processing, imputation), data integration (e.g., entity matching, schema matching), and data enrichment (e.g., data annotation, profiling). For each task, we survey representative techniques, and highlight their respective strengths (e.g., improved generalization, semantic understanding) and limitations (e.g., the prohibitive cost of scaling LLMs, persistent hallucinations even in advanced agents, the mismatch between advanced methods and weak evaluation). Moreover, we analyze commonly used datasets and evaluation metrics (the empirical part). Finally, we discuss open research challenges and outline a forward‑looking roadmap that emphasizes scalable LLM‑data systems, principled designs for reliable agentic workflows, and robust evaluation protocols.
Authors:Yanai Elazar, Maria Antoniak
Abstract:
ArXiv recently prohibited the upload of unpublished review papers to its servers in the Computer Science domain, citing a high prevalence of LLM‑generated content in these categories. However, this decision was not accompanied by quantitative evidence. In this work, we investigate this claim by measuring the proportion of LLM‑generated content in review vs. non‑review research papers in recent years. Using two high‑quality detection methods, we find a substantial increase in LLM‑generated content across both review and non‑review papers, with a higher prevalence in review papers. However, when considering the number of LLM‑generated papers published in each category, the estimates of non‑review LLM‑generated papers are almost six times higher. Furthermore, we find that this policy will affect papers in certain domains far more than others, with the CS subdiscipline Computers & Society potentially facing cuts of 50%. Our analysis provides an evidence‑based framework for evaluating such policy decisions, and we release our code to facilitate future investigations at: https://github.com/yanaiela/llm‑review‑arxiv.
Authors:Elias Schuhmacher, Andrianos Michail, Juri Opitz, Rico Sennrich, Simon Clematide
Abstract:
To be discoverable in an embedding‑based search process, each part of a document should be reflected in its embedding representation. To quantify any potential reflection biases, we introduce a permutation‑based evaluation framework. With this, we observe that state‑of‑the‑art embedding models exhibit systematic positional and language biases when documents are longer and consist of multiple segments. Specifically, early segments and segments in higher‑resource languages like English are over‑represented, while later segments and segments in lower‑resource languages are marginalized. In our further analysis, we find that the positional bias stems from front‑loaded attention distributions in pooling‑token embeddings, where early tokens receive more attention. To mitigate this issue, we introduce an inference‑time attention calibration method that redistributes attention more evenly across document positions, increasing discoverabiltiy of later segments. Our evaluation framework and attention calibration is available at https://github.com/impresso/fair‑sentence‑transformers
Authors:Yuhang Wang, Yuling Shi, Mo Yang, Rongrui Zhang, Shilin He, Heng Lian, Yuting Chen, Siyu Ye, Kai Cai, Xiaodong Gu
Abstract:
LLM agents have demonstrated remarkable capabilities in software development, but their performance is hampered by long interaction contexts, which incur high API costs and latency. While various context compression approaches such as LongLLMLingua have emerged to tackle this challenge, they typically rely on fixed metrics such as PPL, ignoring the task‑specific nature of code understanding. As a result, they frequently disrupt syntactic and logical structure and fail to retain critical implementation details. In this paper, we propose SWE‑Pruner, a self‑adaptive context pruning framework tailored for coding agents. Drawing inspiration from how human programmers "selectively skim" source code during development and debugging, SWE‑Pruner performs task‑aware adaptive pruning for long contexts. Given the current task, the agent formulates an explicit goal (e.g., "focus on error handling") as a hint to guide the pruning targets. A lightweight neural skimmer (0.6B parameters) is trained to dynamically select relevant lines from the surrounding context given the goal. Evaluations across four benchmarks and multiple models validate SWE‑Pruner's effectiveness in various scenarios, achieving 23‑54% token reduction on agent tasks like SWE‑Bench Verified and up to 14.84x compression on single‑turn tasks like LongCodeQA with minimal performance impact.
Authors:Shanshan Liu, Noriki Nishida, Fei Cheng, Narumi Tokunaga, Rumana Ferdous Munne, Yuki Yamagata, Kouji Kozaki, Takehito Utsuro, Yuji Matsumoto
Abstract:
Generalization to unseen concepts is a central challenge due to the scarcity of human annotations in Mention‑agnostic Biomedical Concept Recognition (MA‑BCR). This work makes two key contributions to systematically address this issue. First, we propose an evaluation framework built on hierarchical concept indices and novel metrics to measure generalization. Second, we explore LLM‑based Auto‑Labeled Data (ALD) as a scalable resource, creating a task‑specific pipeline for its generation. Our research unequivocally shows that while LLM‑generated ALD cannot fully substitute for manual annotations, it is a valuable resource for improving generalization, successfully providing models with the broader coverage and structural knowledge needed to approach recognizing unseen concepts. Code and datasets are available at https://github.com/bio‑ie‑tool/hi‑ald.
Authors:Yuzhen Shi, Huanghai Liu, Yiran Hu, Gaojie Song, Xinran Xu, Yubo Ma, Tianyi Tang, Li Zhang, Qingjing Chen, Di Feng, Wenbo Lv, Weiheng Wu, Kexin Yang, Sen Yang, Wei Wang, Rongyao Shi, Yuanyang Qiu, Yuemeng Qi, Jingwen Zhang, Xiaoyu Sui, Yifan Chen, Yi Zhang, An Yang, Bowen Yu, Dayiheng Liu, Junyang Lin, Weixing Shen, Bing Zhao, Charles L. A. Clarke, Hu Wei
Abstract:
As large language models (LLMs) are increasingly applied to legal domain‑specific tasks, evaluating their ability to perform legal work in real‑world settings has become essential. However, existing legal benchmarks rely on simplified and highly standardized tasks, failing to capture the ambiguity, complexity, and reasoning demands of real legal practice. Moreover, prior evaluations often adopt coarse, single‑dimensional metrics and do not explicitly assess fine‑grained legal reasoning. To address these limitations, we introduce PLawBench, a Practical Law Benchmark designed to evaluate LLMs in realistic legal practice scenarios. Grounded in real‑world legal workflows, PLawBench models the core processes of legal practitioners through three task categories: public legal consultation, practical case analysis, and legal document generation. These tasks assess a model's ability to identify legal issues and key facts, perform structured legal reasoning, and generate legally coherent documents. PLawBench comprises 850 questions across 13 practical legal scenarios, with each question accompanied by expert‑designed evaluation rubrics, resulting in approximately 12,500 rubric items for fine‑grained assessment. Using an LLM‑based evaluator aligned with human expert judgments, we evaluate 10 state‑of‑the‑art LLMs. Experimental results show that none achieves strong performance on PLawBench, revealing substantial limitations in the fine‑grained legal reasoning capabilities of current LLMs and highlighting important directions for future evaluation and development of legal LLMs. Data is available at: https://github.com/skylenage/PLawbench.
Authors:Xueyang Feng, Weinan Gan, Xu Chen, Quanyu Dai, Yong Liu
Abstract:
Large language model (LLM)‑powered assistants have recently integrated memory mechanisms that record user preferences, leading to more personalized and user‑aligned responses. However, irrelevant personalized memories are often introduced into the context, interfering with the LLM's intent understanding. To comprehensively investigate the dual effects of personalization, we develop RPEval, a benchmark comprising a personalized intent reasoning dataset and a multi‑granularity evaluation protocol. RPEval reveals the widespread phenomenon of irrational personalization in existing LLMs and, through error pattern analysis, illustrates its negative impact on user experience. Finally, we introduce RP‑Reasoner, which treats memory utilization as a pragmatic reasoning process, enabling the selective integration of personalized information. Experimental results demonstrate that our method significantly outperforms carefully designed baselines on RPEval, and resolves 80% of the bad cases observed in a large‑scale commercial personalized assistant, highlighting the potential of pragmatic reasoning to mitigate irrational personalization. Our benchmark is publicly available at https://github.com/XueyangFeng/RPEval.
Authors:Jivnesh Sandhan, Fei Cheng, Tushar Sandhan, Yugo Murawaki
Abstract:
Large Language Models (LLMs) are increasingly deployed in domains such as education, mental health and customer support, where stable and consistent personas are critical for reliability. Yet, existing studies focus on narrative or role‑playing tasks and overlook how adversarial conversational history alone can reshape induced personas. Black‑box persona manipulation remains unexplored, raising concerns for robustness in realistic interactions. In response, we introduce the task of persona editing, which adversarially steers LLM traits through user‑side inputs under a black‑box, inference‑only setting. To this end, we propose PHISH (Persona Hijacking via Implicit Steering in History), the first framework to expose a new vulnerability in LLM safety that embeds semantically loaded cues into user queries to gradually induce reverse personas. We also define a metric to quantify attack success. Across 3 benchmarks and 8 LLMs, PHISH predictably shifts personas, triggers collateral changes in correlated traits, and exhibits stronger effects in multi‑turn settings. In high‑risk domains mental health, tutoring, and customer support, PHISH reliably manipulates personas, validated by both human and LLM‑as‑Judge evaluations. Importantly, PHISH causes only a small reduction in reasoning benchmark performance, leaving overall utility largely intact while still enabling significant persona manipulation. While current guardrails offer partial protection, they remain brittle under sustained attack. Our findings expose new vulnerabilities in personas and highlight the need for context‑resilient persona in LLMs. Our codebase and dataset is available at: https://github.com/Jivnesh/PHISH
Authors:Zhenghao Liu, Mingyan Wu, Xinze Li, Yukun Yan, Shuo Wang, Cheng Yang, Minghe Yu, Zheni Zeng, Maosong Sun
Abstract:
Retrieval‑Augmented Generation (RAG) has emerged as a dominant paradigm for mitigating hallucinations in Large Language Models (LLMs) by incorporating external knowledge. Nevertheless, effectively integrating and interpreting key evidence scattered across noisy documents remains a critical challenge for existing RAG systems. In this paper, we propose GraphAnchor, a novel Graph‑Anchored Knowledge Indexing approach that reconceptualizes graph structures from static knowledge representations into active, evolving knowledge indices. GraphAnchor incrementally updates a graph during iterative retrieval to anchor salient entities and relations, yielding a structured index that guides the LLM in evaluating knowledge sufficiency and formulating subsequent subqueries. The final answer is generated by jointly leveraging all retrieved documents and the final evolved graph. Experiments on four multi‑hop question answering benchmarks demonstrate the effectiveness of GraphAnchor, and reveal that GraphAnchor modulates the LLM's attention to more effectively associate key information distributed in retrieved documents. All code and data are available at https://github.com/NEUIR/GraphAnchor.
Authors:Yichuan Ma, Linyang Li, Yongkang Chen, Peiji Li, Jiasheng Ye, Qipeng Guo, Dahua Lin, Kai Chen
Abstract:
Large language models (LLMs) have demonstrated exceptional performance in reasoning tasks such as mathematics and coding, matching or surpassing human capabilities. However, these impressive reasoning abilities face significant challenges in specialized domains. Taking Go as an example, although AlphaGo has established the high performance ceiling of AI systems in Go, mainstream LLMs still struggle to reach even beginner‑level proficiency, let alone perform natural language reasoning. This performance gap between general‑purpose LLMs and domain experts is significantly limiting the application of LLMs on a wider range of domain‑specific tasks. In this work, we aim to bridge the divide between LLMs' general reasoning capabilities and expert knowledge in domain‑specific tasks. We perform mixed fine‑tuning with structured Go expertise and general long Chain‑of‑Thought (CoT) reasoning data as a cold start, followed by reinforcement learning to integrate expert knowledge in Go with general reasoning capabilities. Through this methodology, we present LoGos, a powerful LLM that not only maintains outstanding general reasoning abilities, but also conducts Go gameplay in natural language, demonstrating effective strategic reasoning and accurate next‑move prediction. LoGos achieves performance comparable to human professional players, substantially surpassing all existing LLMs. Through this work, we aim to contribute insights on applying general LLM reasoning capabilities to specialized domains. We will release the first large‑scale Go dataset for LLM training, the first LLM Go evaluation benchmark, and the first general LLM that reaches human professional‑level performance in Go at: https://github.com/Entarochuan/LoGos.
Authors:Toni J. B. Liu, Baran Zadeoğlu, Nicolas Boullé, Raphaël Sarfati, Christopher J. Earls
Abstract:
Large language models (LLMs) make next‑token predictions based on clues present in their context, such as semantic descriptions and in‑context examples. Yet, elucidating which prior tokens most strongly influence a given prediction remains challenging due to the proliferation of layers and attention heads in modern architectures. We propose Jacobian Scopes, a suite of gradient‑based, token‑level causal attribution methods for interpreting LLM predictions. By analyzing the linearized relations of final hidden state with respect to inputs, Jacobian Scopes quantify how input tokens influence a model's prediction. We introduce three variants ‑ Semantic, Fisher, and Temperature Scopes ‑ which respectively target sensitivity of specific logits, the full predictive distribution, and model confidence (inverse temperature). Through case studies spanning instruction understanding, translation and in‑context learning (ICL), we uncover interesting findings, such as when Jacobian Scopes point to implicit political biases. We believe that our proposed methods also shed light on recently debated mechanisms underlying in‑context time‑series forecasting. Our code and interactive demonstrations are publicly available at https://github.com/AntonioLiu97/JacobianScopes.
Authors:Hannah Cyberey, Yangfeng Ji, David Evans
Abstract:
Algorithmic audits are essential tools for examining systems for properties required by regulators or desired by operators. Current audits of large language models (LLMs) primarily rely on black‑box evaluations that assess model behavior only through input‑output testing. These methods are limited to tests constructed in the input space, often generated by heuristics. In addition, many socially relevant model properties (e.g., gender bias) are abstract and difficult to measure through text‑based inputs alone. To address these limitations, we propose a white‑box sensitivity auditing framework for LLMs that leverages activation steering to conduct more rigorous assessments through model internals. Our auditing method conducts internal sensitivity tests by manipulating key concepts relevant to the model's intended function for the task. We demonstrate its application to bias audits in four simulated high‑stakes LLM decision tasks. Our method consistently reveals substantial dependence on protected attributes in model predictions, even in settings where standard black‑box evaluations suggest little or no bias. Our code is openly available at https://github.com/hannahxchen/llm‑steering‑audit
Authors:Sukesh Subaharan
Abstract:
Large language model (LLM) agents often exhibit abrupt shifts in tone and persona during extended interaction, reflecting the absence of explicit temporal structure governing agent‑level state. While prior work emphasizes turn‑local sentiment or static emotion classification, the role of explicit affective dynamics in shaping long‑horizon agent behavior remains underexplored. This work investigates whether imposing dynamical structure on an external affective state can induce temporal coherence and controlled recovery in multi‑turn dialogue. We introduce an agent‑level affective subsystem that maintains a continuous Valence‑Arousal‑Dominance (VAD) state external to the language model and governed by first‑ and second‑order update rules. Instantaneous affective signals are extracted using a fixed, memoryless estimator and integrated over time via exponential smoothing or momentum‑based dynamics. The resulting affective state is injected back into generation without modifying model parameters. Using a fixed 25‑turn dialogue protocol, we compare stateless, first‑order, and second‑order affective dynamics. Stateless agents fail to exhibit coherent trajectories or recovery, while state persistence enables delayed responses and reliable recovery. Second‑order dynamics introduce affective inertia and hysteresis that increase with momentum, revealing a trade‑off between stability and responsiveness.
Authors:Yang Yu, Peiyu Zang, Chi Hsu Tsai, Haiming Wu, Yixin Shen, Jialing Zhang, Haoyu Wang, Zhiyou Xiao, Jingze Shi, Yuyu Luo, Wentao Zhang, Chunlei Men, Guang Liu, Yonghua Lin
Abstract:
The performance of modern AI systems is fundamentally constrained by the quality of their underlying kernels, which translate high‑level algorithmic semantics into low‑level hardware operations. Achieving near‑optimal kernels requires expert‑level understanding of hardware architectures and programming models, making kernel engineering a critical but notoriously time‑consuming and non‑scalable process. Recent advances in large language models (LLMs) and LLM‑based agents have opened new possibilities for automating kernel generation and optimization. LLMs are well‑suited to compress expert‑level kernel knowledge that is difficult to formalize, while agentic systems further enable scalable optimization by casting kernel development as an iterative, feedback‑driven loop. Rapid progress has been made in this area. However, the field remains fragmented, lacking a systematic perspective for LLM‑driven kernel generation. This survey addresses this gap by providing a structured overview of existing approaches, spanning LLM‑based approaches and agentic optimization workflows, and systematically compiling the datasets and benchmarks that underpin learning and evaluation in this domain. Moreover, key open challenges and future research directions are further outlined, aiming to establish a comprehensive reference for the next generation of automated kernel optimization. To keep track of this field, we maintain an open‑source GitHub repository at https://github.com/flagos‑ai/awesome‑LLM‑driven‑kernel‑generation.
Authors:Junseok Kim, Nakyeong Yang, Kyomin Jung
Abstract:
Role‑play prompting is known to steer the behavior of language models by injecting a persona into the prompt, improving their zero‑shot reasoning capabilities. However, such improvements are inconsistent across different tasks or instances. This inconsistency suggests that zero‑shot and role‑play prompting may offer complementary strengths rather than one being universally superior. Building on this insight, we propose Persona Switch, a novel decoding method that dynamically combines the benefits of both prompting strategies. Our method proceeds step‑by‑step, selecting the better output between zero‑shot and role‑play prompting at each step by comparing their output confidence, as measured by the logit gap. Experiments with widely‑used LLMs demonstrate that Persona Switch consistently outperforms competitive baselines, achieving up to 5.13% accuracy improvement. Furthermore, we show that output confidence serves as an informative measure for selecting the more reliable output.
Authors:Hangrui Hu, Xinfa Zhu, Ting He, Dake Guo, Bin Zhang, Xiong Wang, Zhifang Guo, Ziyue Jiang, Hongkun Hao, Zishan Guo, Xinyu Zhang, Pei Zhang, Baosong Yang, Jin Xu, Jingren Zhou, Junyang Lin
Abstract:
In this report, we present the Qwen3‑TTS series, a family of advanced multilingual, controllable, robust, and streaming text‑to‑speech models. Qwen3‑TTS supports state‑of‑the‑art 3‑second voice cloning and description‑based control, allowing both the creation of entirely novel voices and fine‑grained manipulation over the output speech. Trained on over 5 million hours of speech data spanning 10 languages, Qwen3‑TTS adopts a dual‑track LM architecture for real‑time synthesis, coupled with two speech tokenizers: 1) Qwen‑TTS‑Tokenizer‑25Hz is a single‑codebook codec emphasizing semantic content, which offers seamlessly integration with Qwen‑Audio and enables streaming waveform reconstruction via a block‑wise DiT. 2) Qwen‑TTS‑Tokenizer‑12Hz achieves extreme bitrate reduction and ultra‑low‑latency streaming, enabling immediate first‑packet emission (97\,\mathrmms) through its 12.5 Hz, 16‑layer multi‑codebook design and a lightweight causal ConvNet. Extensive experiments indicate state‑of‑the‑art performance across diverse objective and subjective benchmark (e.g., TTS multilingual test set, InstructTTSEval, and our long speech test set). To facilitate community research and development, we release both tokenizers and models under the Apache 2.0 license.
Authors:Sydney Anuyah, Sneha Shajee-Mohan, Ankit-Singh Chauhan, Sunandan Chakraborty
Abstract:
The safe deployment of large language models (LLMs) in high‑stakes fields like biomedicine, requires them to be able to reason about cause and effect. We investigate this ability by testing 13 open‑source LLMs on a fundamental task: pairwise causal discovery (PCD) from text. Our benchmark, using 12 diverse datasets, evaluates two core skills: 1) Causal Detection (identifying if a text contains a causal link) and 2) Causal Extraction (pulling out the exact cause and effect phrases). We tested various prompting methods, from simple instructions (zero‑shot) to more complex strategies like Chain‑of‑Thought (CoT) and Few‑shot In‑Context Learning (FICL). The results show major deficiencies in current models. The best model for detection, DeepSeek‑R1‑Distill‑Llama‑70B, only achieved a mean score of 49.57% (C_detect), while the best for extraction, Qwen2.5‑Coder‑32B‑Instruct, reached just 47.12% (C_extract). Models performed best on simple, explicit, single‑sentence relations. However, performance plummeted for more difficult (and realistic) cases, such as implicit relationships, links spanning multiple sentences, and texts containing multiple causal pairs. We provide a unified evaluation framework, built on a dataset validated with high inter‑annotator agreement (κ\ge 0.758), and make all our data, code, and prompts publicly available to spur further research. \hrefhttps://github.com/sydneyanuyah/CausalDiscoveryCode available here: https://github.com/sydneyanuyah/CausalDiscovery
Authors:Sydney Anuyah, Mehedi Mahmud Kaushik, Hao Dai, Rakesh Shiradkar, Arjan Durresi, Sunandan Chakraborty
Abstract:
Large Language Models (LLMs) generate fluent answers but can struggle with trustworthy, domain‑specific reasoning. We evaluate whether domain knowledge graphs (KGs) improve Retrieval‑Augmented Generation (RAG) for healthcare by constructing three PubMed‑derived graphs: \mathbbG_1 (T2DM), \mathbbG_2 (Alzheimer's disease), and \mathbbG_3 (AD+T2DM). We design two probes: Probe 1 targets merged AD T2DM knowledge, while Probe 2 targets the intersection of \mathbbG_1 and \mathbbG_2. Seven instruction‑tuned LLMs are tested across retrieval sources No‑RAG, \mathbbG_1, \mathbbG_2, \mathbbG_1 + \mathbbG_2, \mathbbG_3, \mathbbG_1+\mathbbG_2 + \mathbbG_3 and three decoding temperatures. Results show that scope alignment between probe and KG is decisive: precise, scope‑matched retrieval (notably \mathbbG_2) yields the most consistent gains, whereas indiscriminate graph unions often introduce distractors that reduce accuracy. Larger models frequently match or exceed KG‑RAG with a No‑RAG baseline on Probe 1, indicating strong parametric priors, whereas smaller/mid‑sized models benefit more from well‑scoped retrieval. Temperature plays a secondary role; higher values rarely help. We conclude that precision‑first, scope‑matched KG‑RAG is preferable to breadth‑first unions, and we outline practical guidelines for graph selection, model sizing, and retrieval/reranking. Code and Data available here ‑ https://github.com/sydneyanuyah/RAGComparison
Authors:Pablo Messina, Andrés Villa, Juan León Alcázar, Karen Sánchez, Carlos Hinojosa, Denis Parra, Álvaro Soto, Bernard Ghanem
Abstract:
Medical vision‑language models can automate the generation of radiology reports but struggle with accurate visual grounding and factual consistency. Existing models often misalign textual findings with visual evidence, leading to unreliable or weakly grounded predictions. We present CURE, an error‑aware curriculum learning framework that improves grounding and report quality without any additional data. CURE fine‑tunes a multimodal instructional model on phrase grounding, grounded report generation, and anatomy‑grounded report generation using public datasets. The method dynamically adjusts sampling based on model performance, emphasizing harder samples to improve spatial and textual alignment. CURE improves grounding accuracy by +0.37 IoU, boosts report quality by +0.188 CXRFEScore, and reduces hallucinations by 18.6%. CURE is a data‑efficient framework that enhances both grounding accuracy and report reliability. Code is available at https://github.com/PabloMessina/CURE and model weights at https://huggingface.co/pamessina/medgemma‑4b‑it‑cure
Authors:Rishit Chugh
Abstract:
The deployment of large language models (LLMs) has raised security concerns due to their susceptibility to producing harmful or policy‑violating outputs when exposed to adversarial prompts. While alignment and guardrails mitigate common misuse, they remain vulnerable to automated jailbreaking methods such as GCG, PEZ, and GBDA, which generate adversarial suffixes via training and gradient‑based search. Although effective, these methods particularly GCG are computationally expensive, limiting their practicality for organisations with constrained resources. This paper introduces a resource‑efficient adversarial prompting approach that eliminates the need for retraining by matching new prompts to a database of pre‑trained adversarial prompts. A dataset of 1,000 prompts was classified into seven harm‑related categories, and GCG, PEZ, and GBDA were evaluated on a Llama 3 8B model to identify the most effective attack method per category. Results reveal a correlation between prompt type and algorithm effectiveness. By retrieving semantically similar successful adversarial prompts, the proposed method achieves competitive attack success rates with significantly reduced computational cost. This work provides a practical framework for scalable red‑teaming and security evaluation of aligned LLMs, including in settings where model internals are inaccessible.
Authors:Raffi Khatchadourian
Abstract:
LLM agents struggle with regulatory audit replay: when asked to reproduce a flagged transaction decision with identical inputs, most deployments fail to return consistent results. This paper introduces the Determinism‑Faithfulness Assurance Harness (DFAH), a framework for measuring trajectory determinism and evidence‑conditioned faithfulness in tool‑using agents deployed in financial services. Across 74 configurations (12 models, 4 providers, 8‑24 runs each at T=0.0) in non‑agentic baseline experiments, 7‑20B parameter models achieved 100% determinism, while 120B+ models required 3.7x larger validation samples to achieve equivalent statistical reliability. Agentic tool‑use introduces additional variance (see Tables 4‑7). Contrary to the assumed reliability‑capability trade‑off, a positive Pearson correlation emerged (r = 0.45, p < 0.01, n = 51 at T=0.0) between determinism and faithfulness; models producing consistent outputs also tended to be more evidence‑aligned. Three financial benchmarks are provided (compliance triage, portfolio constraints, DataOps exceptions; 50 cases each) along with an open‑source stress‑test harness. In these benchmarks and under DFAH evaluation settings, Tier 1 models with schema‑first architectures achieved determinism levels consistent with audit replay requirements.
Authors:Jivnesh Sandhan, Harshit Jaiswal, Fei Cheng, Yugo Murawaki
Abstract:
The rapid adoption of LLMs has increased the need for reliable AI text detection, yet existing detectors often fail outside controlled benchmarks. We systematically evaluate 2 dominant paradigms (training‑free and supervised) and show that both are brittle under distribution shift, unseen generators, and simple stylistic perturbations. To address these limitations, we propose a supervised contrastive learning (SCL) framework that learns discriminative style embeddings. Experiments show that while supervised detectors excel in‑domain, they degrade sharply out‑of‑domain, and training‑free methods remain highly sensitive to proxy choice. Overall, our results expose fundamental challenges in building domain‑agnostic detectors. Our code is available at: https://github.com/HARSHITJAIS14/DetectAI
Authors:Yash Sharma
Abstract:
Topic modeling is a crucial technique for extracting latent themes from unstructured text data, particularly valuable in analyzing survey responses. However, traditional methods often only consider free‑text responses and do not natively incorporate structured or categorical survey responses for topic modeling. And they produce abstract topics, requiring extensive human interpretation. To address these limitations, we propose the Multi‑Agent LLM Topic Modeling Framework (MALTopic). This framework decomposes topic modeling into specialized tasks executed by individual LLM agents: an enrichment agent leverages structured data to enhance textual responses, a topic modeling agent extracts latent themes, and a deduplication agent refines the results. Comparative analysis on a survey dataset demonstrates that MALTopic significantly improves topic coherence, diversity, and interpretability compared to LDA and BERTopic. By integrating structured data and employing a multi‑agent approach, MALTopic generates human‑readable topics with enhanced contextual relevance, offering a more effective solution for analyzing complex survey data.
Authors:Jianshu Zhang, Chengxuan Qian, Haosen Sun, Haoran Lu, Dingcheng Wang, Letian Xue, Han Liu
Abstract:
Estimating task progress requires reasoning over long‑horizon dynamics rather than recognizing static visual content. While modern Vision‑Language Models (VLMs) excel at describing what is visible, it remains unclear whether they can infer how far a task has progressed from partial observations. To this end, we introduce Progress‑Bench, a benchmark for systematically evaluating progress reasoning in VLMs. Beyond benchmarking, we further explore a human‑inspired two‑stage progress reasoning paradigm through both training‑free prompting and training‑based approach based on curated dataset ProgressLM‑45K. Experiments on 14 VLMs show that most models are not yet ready for task progress estimation, exhibiting sensitivity to demonstration modality and viewpoint changes, as well as poor handling of unanswerable cases. While training‑free prompting that enforces structured progress reasoning yields limited and model‑dependent gains, the training‑based ProgressLM‑3B achieves consistent improvements even at a small model scale, despite being trained on a task set fully disjoint from the evaluation tasks. Further analyses reveal characteristic error patterns and clarify when and why progress reasoning succeeds or fails.
Authors:Zanlin Ni, Shenzhi Wang, Yang Yue, Tianyu Yu, Weilin Zhao, Yeguo Hua, Tianyi Chen, Jun Song, Cheng Yu, Bo Zheng, Gao Huang
Abstract:
Diffusion Large Language Models (dLLMs) break the rigid left‑to‑right constraint of traditional LLMs, enabling token generation in arbitrary orders. Intuitively, this flexibility implies a solution space that strictly supersets the fixed autoregressive trajectory, theoretically unlocking superior reasoning potential for general tasks like mathematics and coding. Consequently, numerous works have leveraged reinforcement learning (RL) to elicit the reasoning capability of dLLMs. In this paper, we reveal a counter‑intuitive reality: arbitrary order generation, in its current form, narrows rather than expands the reasoning boundary of dLLMs. We find that dLLMs tend to exploit this order flexibility to bypass high‑uncertainty tokens that are crucial for exploration, leading to a premature collapse of the solution space. This observation motivates a rethink of RL approaches for dLLMs, where considerable complexities, such as handling combinatorial trajectories and intractable likelihoods, are often devoted to preserving this flexibility. We demonstrate that effective reasoning can be better elicited by intentionally forgoing arbitrary order and applying standard Group Relative Policy Optimization (GRPO) instead. Our approach, JustGRPO, is minimalist yet surprisingly effective (e.g., 89.1% accuracy on GSM8K) while fully retaining the parallel decoding ability of dLLMs. Project page: https://nzl‑thu.github.io/the‑flexibility‑trap
Authors:Zhichao Yan, Yunxiao Zhao, Jiapu Wang, Jiaoyan Chen, Shaoru Guo, Xiaoli Li, Ru Li, Jeff Z. Pan
Abstract:
Current evaluation methods for Attributed Question Answering (AQA) suffer from attribution myopia: they emphasize verification of isolated statements and their attributions but overlook the global logical integrity of long‑form answers. Consequently, Large Language Models (LLMs) often produce factually grounded yet logically incoherent responses with elusive deductive gaps. To mitigate this limitation, we present \textscLogicScore, a unified evaluation framework that shifts the paradigm from local assessment to global reasoning scrutiny. Grounded in Horn Rules, our approach integrates a backward verification mechanism to systematically evaluate three key reasoning dimensions: Completeness (logically sound deduction), Conciseness (non‑redundancy), and Determinateness (consistent answer entailment). Extensive experiments across three multi‑hop QA datasets (HotpotQA, MusiQue, and 2WikiMultiHopQA) and over 20 LLMs (including GPT‑5, Gemini‑3‑Pro, LLaMA3, and task‑specific tuned models) reveal a critical capability gap: leading models often achieve high attribution scores (e.g., 92.85% precision for Gemini‑3 Pro) but struggle with global reasoning quality (e.g., 35.11% Conciseness for Gemini‑3 Pro). Our work establishes a robust standard for logical evaluation, highlighting the need to prioritize reasoning coherence alongside factual grounding in LLM development. Codes are available at: https://github.com/zhichaoyan11/LogicScore.
Authors:Rui Qi, Fengran Mo, Yufeng Chen, Xue Zhang, Shuo Wang, Hongliang Li, Jinan Xu, Meng Jiang, Jian-Yun Nie, Kaiyu Huang
Abstract:
Multilingual retrieval‑augmented generation (MRAG) requires models to effectively acquire and integrate beneficial external knowledge from multilingual collections. However, most existing studies employ a unitive process where queries of equivalent semantics across different languages are processed through a single‑turn retrieval and subsequent optimization. Such a ``one‑size‑fits‑all'' strategy is often suboptimal in multilingual settings, as the models occur to knowledge bias and conflict during the interaction with the search engine. To alleviate the issues, we propose LcRL, a multilingual search‑augmented reinforcement learning framework that integrates a language‑coupled Group Relative Policy Optimization into the policy and reward models. We adopt the language‑coupled group sampling in the rollout module to reduce knowledge bias, and regularize an auxiliary anti‑consistency penalty in the reward models to mitigate the knowledge conflict. Experimental results demonstrate that LcRL not only achieves competitive performance but is also appropriate for various practical scenarios such as constrained training data and retrieval over collections encompassing a large number of languages. Our code is available at https://github.com/Cherry‑qwq/LcRL‑Open.
Authors:Yifan Wang, Shiyu Li, Peiming Li, Xiaochen Yang, Yang Tang, Zheng Wei
Abstract:
Chain‑of‑Thought (CoT) prompting has achieved remarkable success in unlocking the reasoning capabilities of Large Language Models (LLMs). Although CoT prompting enhances reasoning, its verbosity imposes substantial computational overhead. Recent works often focus exclusively on outcome alignment and lack supervision on the intermediate reasoning process. These deficiencies obscure the analyzability of the latent reasoning chain. To address these challenges, we introduce Render‑of‑Thought (RoT), the first framework to reify the reasoning chain by rendering textual steps into images, making the latent rationale explicit and traceable. Specifically, we leverage the vision encoders of existing Vision Language Models (VLMs) as semantic anchors to align the vision embeddings with the textual space. This design ensures plug‑and‑play implementation without incurring additional pre‑training overhead. Extensive experiments on mathematical and logical reasoning benchmarks demonstrate that our method achieves 3‑4x token compression and substantial inference acceleration compared to explicit CoT. Furthermore, it maintains competitive performance against other methods, validating the feasibility of this paradigm. Our code is available at https://github.com/TencentBAC/RoT
Authors:Jing Lan, Hexiao Ding, Hongzhao Chen, Yufeng Jiang, Nga-Chun Ng, Gwing Kei Yip, Gerald W. Y. Cheng, Yunlin Mao, Jing Cai, Liang-ting Lin, Jung Sun Yoo
Abstract:
AI models for drug discovery and chemical literature mining must interpret molecular images and generate outputs consistent with 3D geometry and stereochemistry. Most molecular language models rely on strings or graphs, while vision‑language models often miss stereochemical details and struggle to map continuous 3D structures into discrete tokens. We propose DeepMoLM: Deep Molecular Language M odeling, a dual‑view framework that grounds high‑resolution molecular images in geometric invariants derived from molecular conformations. DeepMoLM preserves high‑frequency evidence from 1024 × 1024 inputs, encodes conformer neighborhoods as discrete Extended 3‑Dimensional Fingerprints, and fuses visual and geometric streams with cross‑attention, enabling physically grounded generation without atom coordinates. DeepMoLM improves PubChem captioning with a 12.3% relative METEOR gain over the strongest generalist baseline while staying competitive with specialist methods. It produces valid numeric outputs for all property queries and attains MAE 13.64 g/mol on Molecular Weight and 37.89 on Complexity in the specialist setting. On ChEBI‑20 description generation from images, it exceeds generalist baselines and matches state‑of‑the‑art vision‑language models. Code is available at https://github.com/1anj/DeepMoLM.
Authors:Yelin Chen, Fanjin Zhang, Suping Sun, Yunhe Pang, Yuanchun Wang, Jian Song, Xiaoyan Li, Lei Hou, Shu Zhao, Jie Tang, Juanzi Li
Abstract:
Understanding research papers remains challenging for foundation models due to specialized scientific discourse and complex figures and tables, yet existing benchmarks offer limited fine‑grained evaluation at scale. To address this gap, we introduce RPC‑Bench, a large‑scale question‑answering benchmark built from review‑rebuttal exchanges of high‑quality computer science papers, containing 15K human‑verified QA pairs. We design a fine‑grained taxonomy aligned with the scientific research flow to assess models' ability to understand and answer why, what, and how questions in scholarly contexts. We also define an elaborate LLM‑human interaction annotation framework to support large‑scale labeling and quality control. Following the LLM‑as‑a‑Judge paradigm, we develop a scalable framework that evaluates models on correctness‑completeness and conciseness, with high agreement to human judgment. Experiments reveal that even the strongest models (GPT‑5) achieve only 68.2% correctness‑completeness, dropping to 37.46% after conciseness adjustment, highlighting substantial gaps in precise academic paper understanding. Our code and data are available at https://rpc‑bench.github.io/.
Authors:Yuming Yang, Mingyoung Lai, Wanxu Zhao, Xiaoran Fan, Zhiheng Xi, Mingqi Wu, Chiyue Huang, Jun Zhao, Haijun Lv, Jian Tong, Yunhua Zhou, Yicheng Zou, Qipeng Guo, Tao Gui, Qi Zhang, Xuanjing Huang
Abstract:
Long chain‑of‑thought (CoT) trajectories provide rich supervision signals for distilling reasoning from teacher to student LLMs. However, both prior work and our experiments show that trajectories from stronger teachers do not necessarily yield better students, highlighting the importance of data‑student suitability in distillation. Existing methods assess suitability primarily through student likelihood, favoring trajectories that align closely with the student model's current behavior but overlooking more informative ones. Addressing this, we propose Rank‑Surprisal Ratio (RSR), a simple metric that captures both alignment and informativeness to assess the suitability of a reasoning trajectory. RSR is motivated by the observation that effective trajectories typically balance learning signal strength and behavioral alignment by combining low absolute probability with relatively high‑ranked tokens under the student model. Concretely, RSR is defined as the ratio of a trajectory's average token‑wise rank to its average negative log‑likelihood, and is straightforward to compute and interpret. Across five student models and reasoning trajectories from 11 diverse teachers, RSR strongly correlates with post‑training reasoning performance (average Spearman 0.86), consistently outperforming existing metrics. We further demonstrate its practical utility in both trajectory selection and teacher selection.
Authors:Yiyang Wang, Yiqiao Jin, Alex Cabral, Josiah Hester
Abstract:
Multi‑agent systems (MAS) are emerging as promising socio‑collaborative companions for emotional and cognitive support. However, existing systems frequently suffer from persona collapse, where agents revert to generic, homogenized assistant behaviors, and social sycophancy, where agents produce redundant, non‑constructive dialogue. We propose MASCOT, a multi‑agent framework for multi‑perspective socio‑collaborative companions. MASCOT introduces a novel bi‑level optimization strategy to harmonize individual and collective behaviors: 1) Persona‑Aware Behavioral Alignment, an RLAIF‑driven pipeline that fine‑tunes individual agents for agent‑specific identities; and 2) Collaborative Dialogue Optimization, a group‑level adaptation process that promotes complementary, diverse, and productive discourse. We evaluate MASCOT using human‑grounded contexts drawn across both in‑domain and out‑of‑domain (OOD) settings against state‑of‑the‑art baselines. MASCOT improves persona consistency by up to +14.1 and social contribution by up to +10.6. A broad evaluation suite, including human evaluation, multiple LLM judges, three‑way comparisons, and automatic metrics, further shows that MASCOT produces more role‑consistent and less redundant multi‑agent dialogue.
Authors:Víctor Yeste, Paolo Rosso
Abstract:
We study sentence‑level detection of the 19 human values in the refined Schwartz continuum in about 74k English sentences from news and political manifestos (ValueEval'24 corpus). Each sentence is annotated with value presence, yielding a binary moral‑presence label and a 19‑way multi‑label task under severe class imbalance. First, we show that moral presence is learnable from single sentences: a DeBERTa‑base classifier attains positive‑class F1 = 0.74 with calibrated thresholds. Second, we compare direct multi‑label value detectors with presence‑gated hierarchies under a single 8 GB GPU budget. Under matched compute, presence gating does not improve over direct prediction, indicating that gate recall becomes a bottleneck. Third, we investigate lightweight auxiliary signals ‑ short‑range context, LIWC‑22 and moral lexica, and topic features ‑ and small ensembles. Our best supervised configuration, a soft‑voting ensemble of DeBERTa‑based models enriched with such signals, reaches macro‑F1 = 0.332 on the 19 values, improving over the best previous English‑only baseline on this corpus (macro‑F1 \approx 0.28). We additionally benchmark 7‑9B instruction‑tuned LLMs (Gemma 2 9B, Llama 3.1 8B, Mistral 8B, Qwen 2.5 7B) in zero‑/few‑shot and QLoRA setups, and find that they lag behind the supervised ensemble under the same hardware constraint. Overall, our results provide empirical guidance for building compute‑efficient, value‑aware NLP models under realistic GPU budgets.
Authors:Hengyuan Zhang, Zhihao Zhang, Mingyang Wang, Zunhai Su, Yiwei Wang, Qianli Wang, Shuzhou Yuan, Ercong Nie, Xufeng Duan, Qibo Xue, Zeping Yu, Chenming Shang, Xiao Liang, Jing Xiong, Hui Shen, Chaofan Tao, Zhengwu Liu, Senjie Jin, Zhiheng Xi, Dongdong Zhang, Sophia Ananiadou, Tao Gui, Ruobing Xie, Hayden Kwok-Hay So, Hinrich Schütze, Xuanjing Huang, Qi Zhang, Ngai Wong
Abstract:
Mechanistic Interpretability (MI) has emerged as a vital approach to demystify the opaque decision‑making of Large Language Models (LLMs). However, existing reviews primarily treat MI as an observational science, summarizing analytical insights while lacking a systematic framework for actionable intervention. To bridge this gap, we present a practical survey structured around the pipeline: "Locate, Steer, and Improve." We formally categorize Localizing (diagnosis) and Steering (intervention) methods based on specific Interpretable Objects to establish a rigorous intervention protocol. Furthermore, we demonstrate how this framework enables tangible improvements in Alignment, Capability, and Efficiency, effectively operationalizing MI as an actionable methodology for model optimization. The curated paper list of this work is available at https://github.com/rattlesnakey/Awesome‑Actionable‑MI‑Survey.
Authors:Yushen Chen, Junzhe Liu, Yujie Tu, Zhikang Niu, Yuzhe Liang, Chunyu Qiang, Chen Zhang, Kai Yu, Xie Chen
Abstract:
Arabic spans over 30 spoken varieties, yet no open‑source text‑to‑speech system unifies them. Key barriers include substantial cross‑dialect lexical and phonological divergence, scarce synthesis‑grade data, and the absence of a standardized multi‑dialect evaluation benchmark. We present Habibi, a unified‑dialectal Arabic TTS framework that addresses all three. Through a multi‑step curation pipeline, we repurpose open‑source ASR corpora into TTS training data covering 12+ regional dialects. A linguistically‑informed curriculum learning strategy ‑ progressing from Modern Standard Arabic to dialectal data ‑ enables robust zero‑shot synthesis without text diacritization. We further release the first standardized multi‑dialect Arabic TTS benchmark, comprising over 11,000 utterances across 7 dialect subsets with manually verified transcripts. On this benchmark, our unified model matches or surpasses per‑dialect specialized models. Both automatic metrics and human evaluations confirm that Habibi is highly competitive with ElevenLabs' Eleven v3 (alpha) in intelligibility, speaker similarity, and naturalness. Extensive ablations (~8,000 H100 GPU hours, 30+ configurations) validate each design choice. We open‑source all checkpoints, training and inference code, and benchmark data ‑ the first such release for multi‑dialect Arabic TTS ‑ at https://SWivid.github.io/Habibi/ .
Authors:Shengda Fan, Xuyan Ye, Yankai Lin
Abstract:
Self‑play with large language models has emerged as a promising paradigm for achieving self‑improving artificial intelligence. However, existing self‑play frameworks often suffer from optimization instability, due to (i) non‑stationary objectives induced by solver‑dependent reward feedback for the Questioner, and (ii) bootstrapping errors from self‑generated pseudo‑labels used to supervise the Solver. To mitigate these challenges, we introduce DARC (Decoupled Asymmetric Reasoning Curriculum), a two‑stage framework that stabilizes the self‑evolution process. First, we train the Questioner to synthesize difficulty‑calibrated questions, conditioned on explicit difficulty levels and external corpora. Second, we train the Solver with an asymmetric self‑distillation mechanism, where a document‑augmented teacher generates high‑quality pseudo‑labels to supervise the student Solver that lacks document access. Empirical results demonstrate that DARC is model‑agnostic, yielding an average improvement of 10.9 points across nine reasoning benchmarks and three backbone models. Moreover, DARC consistently outperforms all baselines and approaches the performance of fully supervised models without relying on human annotations. The code is available at https://github.com/RUCBM/DARC.
Authors:Yue Guo, Fanfu Wang, Jianwei Lv, Xincheng Shi, Yuchen Li, Youya Wang, Yunsheng Zeng, Yujing Liu, Yunhao Qiao, Gen Li, Junfeng Wang, Bo Yuan
Abstract:
Clinical Decision Support Systems (CDSSs) provide reasoning and inquiry guidance for physicians, yet they face notable challenges, including high maintenance costs and low generalization capability. Recently, Large Language Models (LLMs) have been widely adopted in healthcare due to their extensive knowledge reserves, retrieval, and communication capabilities. While LLMs show promise and excel at medical benchmarks, their diagnostic reasoning and inquiry skills are constrained. To mitigate this issue, we propose (1) Clinical Diagnostic Reasoning Data (CDRD) structure to capture abstract clinical reasoning logic, and a pipeline for its construction, and (2) the Dr. Assistant, a clinical diagnostic model equipped with clinical reasoning and inquiry skills. Its training involves a two‑stage process: SFT, followed by RL with a tailored reward function. We also introduce a benchmark to evaluate both diagnostic reasoning and inquiry. Our experiments demonstrate that the Dr. Assistant outperforms open‑source models and achieves competitive performance to closed‑source models, providing an effective solution for clinical diagnostic inquiry guidance. Project information can be found at: https://github.com/YGswu/Dr.‑Assistant .
Authors:Zhiyuan Shi, Qibo Qiu, Feng Xue, Zhonglin Jiang, Li Yu, Jian Jiang, Xiaofei He, Wenxiao Wang
Abstract:
The linear memory growth of the KV cache poses a significant bottleneck for LLM inference in long‑context tasks. Existing static compression methods often fail to preserve globally important information. Although recent dynamic retrieval approaches attempt to address this issue, they typically suffer from coarse‑grained caching strategies and incur high I/O overhead. To overcome these limitations, we propose HeteroCache, a training‑free dynamic compression framework. Our method is built on two key insights: attention heads exhibit diverse temporal heterogeneity, and there is significant spatial redundancy among heads within the same layer. Guided by these insights, HeteroCache categorizes heads based on stability and similarity, applying a fine‑grained weighting strategy that allocates larger cache budgets to heads with rapidly shifting attention to capture context changes. Furthermore, it features a hierarchical storage mechanism where representative heads monitor attention drift to trigger asynchronous, on‑demand context retrieval, thereby hiding I/O latency. Experiments demonstrate that HeteroCache achieves state‑of‑the‑art performance on long‑context benchmarks and accelerates decoding by up to 3× compared to the original model with a 224K context. Our code is available at https://github.com/ponytaill/HeteroCache.
Authors:Xue Jiang, Ge Li, Jiaru Qian, Xianjie Shi, Chenjie Li, Hao Zhu, Ziyu Wang, Jielun Zhang, Zheyu Zhao, Kechi Zhang, Jia Li, Wenpin Jiao, Zhi Jin, Yihong Dong
Abstract:
Large language models (LLMs) excel at general programming but struggle with domain‑specific software development, necessitating domain specialization methods for LLMs to learn and utilize domain knowledge and data. However, existing domain‑specific code benchmarks cannot evaluate the effectiveness of domain specialization methods, which focus on assessing what knowledge LLMs possess rather than how they acquire and apply new knowledge, lacking explicit knowledge corpora for developing domain specialization methods. To this end, we present KOCO‑BENCH, a novel benchmark designed for evaluating domain specialization methods in real‑world software development. KOCO‑BENCH contains 6 emerging domains with 11 software frameworks and 25 projects, featuring curated knowledge corpora alongside multi‑granularity evaluation tasks including domain code generation (from function‑level to project‑level with rigorous test suites) and domain knowledge understanding (via multiple‑choice Q&A). Unlike previous benchmarks that only provide test sets for direct evaluation, KOCO‑BENCH requires acquiring and applying diverse domain knowledge (APIs, rules, constraints, etc.) from knowledge corpora to solve evaluation tasks. Our evaluations reveal that KOCO‑BENCH poses significant challenges to state‑of‑the‑art LLMs. Even with domain specialization methods (e.g., SFT, RAG, kNN‑LM) applied, improvements remain marginal. Best‑performing coding agent, Claude Code, achieves only 34.2%, highlighting the urgent need for more effective domain specialization methods. We release KOCO‑BENCH, evaluation code, and baselines to advance further research at https://github.com/jiangxxxue/KOCO‑bench.
Authors:Hassan Soliman, Vivek Gupta, Dan Roth, Iryna Gurevych
Abstract:
Realistic text‑to‑SQL workflows often require joining multiple tables. As a result, accurately retrieving the relevant set of tables becomes a key bottleneck for end‑to‑end performance. We study an open‑book setting where queries must be answered over large, heterogeneous table collections pooled from many sources, without clean scoping signals such as database identifiers. Here, dense retrieval (DR) achieves high recall but returns many distractors, while join‑aware alternatives often rely on extra assumptions and/or incur high inference overhead. We propose CORE‑T, a scalable, training‑free framework that enriches tables with LLM‑generated purpose metadata and pre‑computes a lightweight table‑compatibility cache. At inference time, DR returns top‑K candidates; a single LLM call selects a coherent, joinable subset, and a simple additive adjustment step restores strongly compatible tables. Across Bird, Spider, and MMQA, CORE‑T improves table‑selection F1 by up to 22.7 points while retrieving up to 42% fewer tables, improving multi‑table execution accuracy by up to 5.0 points on Bird and 6.9 points on MMQA, and using 4‑5x fewer tokens than LLM‑intensive baselines.
Authors:Abdellah El Mekki, Samar M. Magdy, Houdaifa Atou, Ruwa AbuHweidi, Baraah Qawasmeh, Omer Nacar, Thikra Al-hibiri, Razan Saadie, Hamzah Alsayadi, Nadia Ghezaiel Hammouda, Alshima Alkhazimi, Aya Hamod, Al-Yas Al-Ghafri, Wesam El-Sayed, Asila Al sharji, Mohamad Ballout, Anas Belfathi, Karim Ghaddar, Serry Sibaee, Alaa Aoun, Areej Asiri, Lina Abureesh, Ahlam Bashiti, Majdal Yousef, Abdulaziz Hafiz, Yehdih Mohamed, Emira Hamedtou, Brakehe Brahim, Rahaf Alhamouri, Youssef Nafea, Aya El Aatar, Walid Al-Dhabyani, Emhemed Hamed, Sara Shatnawi, Fakhraddin Alwajih, Khalid Elkhidir, Ashwag Alasmari, Abdurrahman Gerrio, Omar Alshahri, AbdelRahim A. Elmadany, Ismail Berrada, Amir Azad Adli Alkathiri, Fadi A Zaraket, Mustafa Jarrar, Yahya Mohamed El Hadj, Hassan Alhuzali, Muhammad Abdul-Mageed
Abstract:
Arabic is a highly diglossic language where most daily communication occurs in regional dialects rather than Modern Standard Arabic. Despite this, machine translation (MT) systems often generalize poorly to dialectal input, limiting their utility for millions of speakers. We introduce Alexandria, a large‑scale, community‑driven, human‑translated dataset designed to bridge this gap. Alexandria covers 13 Arab countries and 11 high‑impact domains, including health, education, and agriculture. Unlike previous resources, Alexandria provides unprecedented granularity by associating contributions with city‑of‑origin metadata, capturing authentic local varieties beyond coarse regional labels. The dataset consists of multi‑turn conversational scenarios annotated with speaker‑addressee gender configurations, enabling the study of gender‑conditioned variation in dialectal use. Comprising 107K total samples, Alexandria serves as both a training resource and a rigorous benchmark for evaluating MT and Large Language Models (LLMs). Our automatic and human evaluation of Arabic‑aware LLMs benchmarks current capabilities in translating across diverse Arabic dialects and sub‑dialects, while exposing significant persistent challenges.
Authors:Haoyu Tian, Yingchaojie Feng, Zhen Wen, Haoxuan Li, Minfeng Zhu, Wei Chen
Abstract:
The advent of Retrieval‑Augmented Generation (RAG) has significantly enhanced the ability of Large Language Models (LLMs) to produce factually accurate and up‑to‑date responses. However, the performance of a RAG system is not determined by a single component but emerges from a complex interplay of modular choices, such as embedding models and retrieval algorithms. This creates a vast and often opaque configuration space, making it challenging for developers to understand performance trade‑offs and identify optimal designs. To address this challenge, we present RAGExplorer, a visual analytics system for the systematic comparison and diagnosis of RAG configurations. RAGExplorer guides users through a seamless macro‑to‑micro analytical workflow. Initially, it empowers developers to survey the performance landscape across numerous configurations, allowing for a high‑level understanding of which design choices are most effective. For a deeper analysis, the system enables users to drill down into individual failure cases, investigate how differences in retrieved information contribute to errors, and interactively test hypotheses by manipulating the provided context to observe the resulting impact on the generated answer. We demonstrate the effectiveness of RAGExplorer through detailed case studies and user studies, validating its ability to empower developers in navigating the complex RAG design space. Our code and user guide are publicly available at https://github.com/Thymezzz/RAGExplorer.
Authors:Miao Xie, Siguang Chen, Chunli Lv
Abstract:
Large language models (LLMs) have become powerful and widely used systems for language understanding and generation, while multi‑armed bandit (MAB) algorithms provide a principled framework for adaptive decision‑making under uncertainty. This survey explores the potential at the intersection of these two fields. As we know, it is the first survey to systematically review the bidirectional interaction between large language models and multi‑armed bandits at the component level. We highlight the bidirectional benefits: MAB algorithms address critical LLM challenges, spanning from pre‑training to retrieval‑augmented generation (RAG) and personalization. Conversely, LLMs enhance MAB systems by redefining core components such as arm definition and environment modeling, thereby improving decision‑making in sequential tasks. We analyze existing LLM‑enhanced bandit systems and bandit‑enhanced LLM systems, providing insights into their design, methodologies, and performance. Key challenges and representative findings are identified to help guide future research. An accompanying GitHub repository that indexes relevant literature is available at https://github.com/bucky1119/Awesome‑LLM‑Bandit‑Interaction.
Authors:Peng Li, Zihan Zhuang, Yangfan Gao, Yi Dong, Sixian Li, Changhao Jiang, Shihan Dou, Zhiheng Xi, Enyu Zhou, Jixuan Huang, Hui Li, Jingjing Gong, Xingjun Ma, Tao Gui, Zuxuan Wu, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang, Xipeng Qiu
Abstract:
Humanoid robots are capable of performing various actions such as greeting, dancing and even backflipping. However, these motions are often hard‑coded or specifically trained, which limits their versatility. In this work, we present FRoM‑W1, an open‑source framework designed to achieve general humanoid whole‑body motion control using natural language. To universally understand natural language and generate corresponding motions, as well as enable various humanoid robots to stably execute these motions in the physical world under gravity, FRoM‑W1 operates in two stages: (a) H‑GPT: utilizing massive human data, a large‑scale language‑driven human whole‑body motion generation model is trained to generate diverse natural behaviors. We further leverage the Chain‑of‑Thought technique to improve the model's generalization in instruction understanding. (b) H‑ACT: After retargeting generated human whole‑body motions into robot‑specific actions, a motion controller that is pretrained and further fine‑tuned through reinforcement learning in physical simulation enables humanoid robots to accurately and stably perform corresponding actions. It is then deployed on real robots via a modular simulation‑to‑reality module. We extensively evaluate FRoM‑W1 on Unitree H1 and G1 robots. Results demonstrate superior performance on the HumanML3D‑X benchmark for human whole‑body motion generation, and our introduced reinforcement learning fine‑tuning consistently improves both motion tracking accuracy and task success rates of these humanoid robots. We open‑source the entire FRoM‑W1 framework and hope it will advance the development of humanoid intelligence.
Authors:Tassallah Abdullahi, Macton Mgonzo, Mardiyyah Oduwole, Paul Okewunmi, Abraham Owodunni, Ritambhara Singh, Carsten Eickhoff
Abstract:
Current guardian models are predominantly Western‑centric and optimized for high‑resource languages, leaving low‑resource African languages vulnerable to evolving harms, cross‑lingual safety failures, and cultural misalignment. Moreover, most guardian models rely on rigid, predefined safety categories that fail to generalize across diverse linguistic and sociocultural contexts. Robust safety, therefore, requires flexible, runtime‑enforceable policies and benchmarks that reflect local norms, harm scenarios, and cultural expectations. We introduce UbuntuGuard, the first African policy‑based safety benchmark built from adversarial queries authored by 155 domain experts across sensitive fields, including healthcare. From these expert‑crafted queries, we derive context‑specific safety policies and reference responses that capture culturally grounded risk signals, enabling policy‑aligned evaluation of guardian models. We evaluate 13 models, comprising six general‑purpose LLMs and seven guardian models across three distinct variants: static, dynamic, and multilingual. Our findings reveal that existing English‑centric benchmarks overestimate real‑world multilingual safety, cross‑lingual transfer provides partial but insufficient coverage, and dynamic models, while better equipped to leverage policies at inference time, still struggle to fully localize African‑language contexts. These findings highlight the urgent need for multilingual, culturally grounded safety benchmarks to enable the development of reliable and equitable guardian models for low‑resource languages. Our code can be found online.\footnoteCode repository available at https://github.com/hemhemoh/UbuntuGuard.
Authors:Younes Bouhadjar, Maxime Fabre, Felix Schmidt, Emre Neftci
Abstract:
Linear recurrent neural networks have emerged as efficient alternatives to the original Transformer's softmax attention mechanism, thanks to their highly parallelizable training and constant memory and computation requirements at inference. Iterative refinements of these models have introduced an increasing number of architectural mechanisms, leading to increased complexity and computational costs. Nevertheless, systematic direct comparisons among these models remain limited. Existing benchmark tasks are either too simplistic to reveal substantial differences or excessively resource‑intensive for experimentation. In this work, we propose a refined taxonomy of linear recurrent models and introduce SelectivBench, a set of lightweight and customizable synthetic benchmark tasks for systematically evaluating sequence models. SelectivBench specifically evaluates selectivity in sequence models at small to medium scale, such as the capacity to focus on relevant inputs while ignoring context‑based distractors. It employs rule‑based grammars to generate sequences with adjustable complexity, incorporating irregular gaps that intentionally violate transition rules. Evaluations of linear recurrent models on SelectivBench reveal performance patterns consistent with results from large‑scale language tasks. Our analysis clarifies the roles of essential architectural features: gating and rapid forgetting mechanisms facilitate recall, in‑state channel mixing is unnecessary for selectivity, but critical for generalization, and softmax attention remains dominant due to its memory capacity scaling with sequence length. Our benchmark enables targeted, efficient exploration of linear recurrent models and provides a controlled setting for studying behaviors observed in large‑scale evaluations. Code is available at https://github.com/symseqbench/selectivbench
Authors:Tianxin Wei, Ting-Wei Li, Zhining Liu, Xuying Ning, Ze Yang, Jiaru Zou, Zhichen Zeng, Ruizhong Qiu, Xiao Lin, Dongqi Fu, Zihao Li, Mengting Ai, Duo Zhou, Wenxuan Bao, Yunzhe Li, Gaotang Li, Cheng Qian, Yu Wang, Xiangru Tang, Yin Xiao, Liri Fang, Hui Liu, Xianfeng Tang, Yuji Zhang, Chi Wang, Jiaxuan You, Heng Ji, Hanghang Tong, Jingrui He
Abstract:
Reasoning is a fundamental cognitive process underlying inference, problem‑solving, and decision‑making. While large language models (LLMs) demonstrate strong reasoning capabilities in closed‑world settings, they struggle in open‑ended and dynamic environments. Agentic reasoning marks a paradigm shift by reframing LLMs as autonomous agents that plan, act, and learn through continual interaction. In this survey, we organize agentic reasoning along three complementary dimensions. First, we characterize environmental dynamics through three layers: foundational agentic reasoning, which establishes core single‑agent capabilities including planning, tool use, and search in stable environments; self‑evolving agentic reasoning, which studies how agents refine these capabilities through feedback, memory, and adaptation; and collective multi‑agent reasoning, which extends intelligence to collaborative settings involving coordination, knowledge sharing, and shared goals. Across these layers, we distinguish in‑context reasoning, which scales test‑time interaction through structured orchestration, from post‑training reasoning, which optimizes behaviors via reinforcement learning and supervised fine‑tuning. We further review representative agentic reasoning frameworks across real‑world applications and benchmarks, including science, robotics, healthcare, autonomous research, and mathematics. This survey synthesizes agentic reasoning methods into a unified roadmap bridging thought and action, and outlines open challenges and future directions, including personalization, long‑horizon interaction, world modeling, scalable multi‑agent training, and governance for real‑world deployment.
Authors:Mahammad Namazov, Tomáš Koref, Ivan Habernal
Abstract:
Interpretability is critical for applications of large language models in the legal domain which requires trust and transparency. While some studies develop task‑specific approaches, other use the classification model's parameters to explain the decisions. However, which technique explains the legal outcome prediction best remains an open question. To address this challenge, we propose a comparative analysis framework for model‑agnostic interpretability techniques. Among these, we employ two rationale extraction methods, which justify outcomes with human‑interpretable and concise text fragments (i.e., rationales) from the given input text. We conduct comparison by evaluating faithfulness‑via normalized sufficiency and comprehensiveness metrics along with plausibility‑by asking legal experts to evaluate extracted rationales. We further assess the feasibility of LLM‑as‑a‑Judge using legal expert evaluation results. We show that the model's "reasons" for predicting a violation differ substantially from those of legal experts, despite highly promising quantitative analysis results and reasonable downstream classification performance. The source code of our experiments is publicly available at https://github.com/trusthlt/IntEval.
Authors:Ming Zhang, Jiabao Zhuang, Wenqing Jing, Ziyu Kong, Jingyi Deng, Yujiong Shen, Kexin Tan, Yuhang Zhao, Ning Luo, Renzhe Zheng, Jiahui Lin, Mingqi Wu, Long Ma, Yi Zou, Shihan Dou, Tao Gui, Qi Zhang, Xuanjing Huang
Abstract:
Deep Research Agents are increasingly used for automated survey generation. However, whether they can write surveys like human experts remains unclear. Existing benchmarks focus on fluency or citation accuracy, but none evaluates the core capabilities: retrieving essential papers and organizing them into coherent knowledge structures. We introduce TaxoBench, a diagnostic benchmark derived from 72 highly‑cited computer science surveys. We manually extract expert‑authored taxonomy trees containing 3,815 precisely categorized citations as ground truth. Our benchmark supports two evaluation modes: Deep Research mode tests end‑to‑end retrieval and organization given only a topic, while Bottom‑Up mode isolates structuring capability by providing the exact papers human experts used. We evaluate 7 leading Deep Research agents and 12 frontier LLMs. Results reveal a dual bottleneck: the best agent recalls only 20.9% of expert‑selected papers, and even with perfect input, the best model achieves only 0.31 ARI in organization. Current deep research agents remain far from expert‑level survey writing. Our benchmark is publicly available at https://github.com/KongLongGeFDU/TaxoBench.
Authors:Chun-Yi Kuan, Hung-yi Lee
Abstract:
Recent advances in audio‑aware large language models have shown strong performance on audio question answering. However, existing benchmarks mainly cover answerable questions and overlook the challenge of unanswerable ones, where no reliable answer can be inferred from the audio. Such cases are common in real‑world settings, where questions may be misleading, ill‑posed, or incompatible with the information. To address this gap, we present AQUA‑Bench, a benchmark for Audio Question Unanswerability Assessment. It systematically evaluates three scenarios: Absent Answer Detection (the correct option is missing), Incompatible Answer Set Detection (choices are categorically mismatched with the question), and Incompatible Audio Question Detection (the question is irrelevant or lacks sufficient grounding in the audio). By assessing these cases, AQUA‑Bench offers a rigorous measure of model reliability and promotes the development of audio‑language systems that are more robust and trustworthy. Our experiments suggest that while models excel on standard answerable tasks, they often face notable challenges with unanswerable ones, pointing to a blind spot in current audio‑language understanding.
Authors:David Ilić, David Stanojević, Kostadin Cvejoski
Abstract:
Fine‑tuned language models pose significant privacy risks, as they may memorize and expose sensitive information from their training data. Membership inference attacks (MIAs) provide a principled framework for auditing these risks, yet existing methods achieve limited detection rates, particularly at the low false‑positive thresholds required for practical privacy auditing. We present EZ‑MIA, a membership inference attack that exploits a key observation: memorization manifests most strongly at error positions, specifically tokens where the model predicts incorrectly yet still shows elevated probability for training examples. We introduce the Error Zone (EZ) score, which measures the directional imbalance of probability shifts at error positions relative to a pretrained reference model. This principled statistic requires only two forward passes per query and no model training of any kind. On WikiText with GPT‑2, EZ‑MIA achieves 3.8x higher detection than the previous state‑of‑the‑art under identical conditions (66.3% versus 17.5% true positive rate at 1% false positive rate), with near‑perfect discrimination (AUC 0.98). At the stringent 0.1% FPR threshold critical for real‑world auditing, we achieve 8x higher detection than prior work (14.0% versus 1.8%), requiring no reference model training. These gains extend to larger architectures: on AG News with Llama‑2‑7B, we achieve 3x higher detection (46.7% versus 15.8% TPR at 1% FPR). These results establish that privacy risks of fine‑tuned language models are substantially greater than previously understood, with implications for both privacy auditing and deployment decisions. Code is available at https://github.com/JetBrains‑Research/ez‑mia.
Authors:Jingchu Wang, Bingbing Xu, Yige Yuan, Bin Xie, Xiaoqian Sun, Huawei Shen
Abstract:
Reinforcement learning has become a central paradigm for improving LLM reasoning. However, existing methods use a single policy to produce both inference responses and training optimization trajectories. The objective conflict between generating stable inference responses and diverse training trajectories leads to insufficient exploration, which harms reasoning capability. In this paper, to address the problem, we propose R^2PO (Residual Rollout Policy Optimization), which introduces a lightweight Residual Rollout‑Head atop the policy to decouple training trajectories from inference responses, enabling controlled trajectory diversification during training while keeping inference generation stable. Experiments across multiple benchmarks show that our method consistently outperforms baselines, achieving average accuracy gains of 3.4% on MATH‑500 and 1.3% on APPS, while also reducing formatting errors and mitigating length bias for stable optimization. Our code is publicly available at https://github.com/RRPO‑ARR/Code.
Authors:Wenhan Liu, Xinyu Ma, Yutao Zhu, Yuchen Li, Daiting Shi, Dawei Yin, Zhicheng Dou
Abstract:
Agentic search has recently emerged as a powerful paradigm, where an agent interleaves multi‑step reasoning with on‑demand retrieval to solve complex questions. Despite its success, how to design a retriever for agentic search remains largely underexplored. Existing search agents typically rely on similarity‑based retrievers, while similar passages are not always useful for final answer generation. In this paper, we propose a novel retriever training framework tailored for agentic search. Unlike retrievers designed for single‑turn retrieval‑augmented generation (RAG) that only rely on local passage utility, we propose to use both local query‑passage relevance and global answer correctness to measure passage utility in a multi‑turn agentic search. We further introduce an iterative training strategy, where the search agent and the retriever are optimized bidirectionally and iteratively. Different from RAG retrievers that are only trained once with fixed questions, our retriever is continuously improved using evolving and higher‑quality queries from the agent. Extensive experiments on seven single‑hop and multi‑hop QA benchmarks demonstrate that our retriever, termed \ours, consistently outperforms strong baselines across different search agents. Our codes are available at: https://github.com/8421BCD/Agentic‑R.
Authors:Caihua Li, Lianghong Guo, Yanlin Wang, Daya Guo, Wei Tao, Zhenyu Shan, Mingwei Liu, Jiachi Chen, Haoyu Song, Duyu Tang, Hongyu Zhang, Zibin Zheng
Abstract:
Issue resolution, a complex Software Engineering (SWE) task integral to real‑world development, has emerged as a compelling challenge for artificial intelligence. The establishment of benchmarks like SWE‑bench revealed this task as profoundly difficult for large language models, thereby significantly accelerating the evolution of autonomous coding agents. This paper presents a systematic survey of this emerging domain. We begin by examining data construction pipelines, covering automated collection and synthesis approaches. We then provide a comprehensive analysis of methodologies, spanning training‑free frameworks with their modular components to training‑based techniques, including supervised fine‑tuning and reinforcement learning. Subsequently, we discuss critical analyses of data quality and agent behavior, alongside practical applications. Finally, we identify key challenges and outline promising directions for future research. An open‑source repository is maintained at https://github.com/DeepSoftwareAnalytics/Awesome‑Issue‑Resolution to serve as a dynamic resource in this field.
Authors:Xin Sun, Zhongqi Chen, Qiang Liu, Shu Wu, Bowen Song, Weiqiang Wang, Zilei Wang, Liang Wang
Abstract:
Retrieval‑Augmented Generation (RAG) has emerged as a powerful approach for enhancing large language models' question‑answering capabilities through the integration of external knowledge. However, when adapting RAG systems to specialized domains, challenges arise from distribution shifts, resulting in suboptimal generalization performance. In this work, we propose TTARAG, a test‑time adaptation method that dynamically updates the language model's parameters during inference to improve RAG system performance in specialized domains. Our method introduces a simple yet effective approach where the model learns to predict retrieved content, enabling automatic parameter adjustment to the target domain. Through extensive experiments across six specialized domains, we demonstrate that TTARAG achieves substantial performance improvements over baseline RAG systems. Code available at https://github.com/sunxin000/TTARAG.
Authors:Xiaojie Gu, Guangxu Chen, Yuheng Yang, Jingxin Han, Andi Zhang
Abstract:
Large language models (LLMs) exhibit exceptional performance across various domains, yet they face critical safety concerns. Model editing has emerged as an effective approach to mitigate these issues. Existing model editing methods often focus on optimizing an information matrix that blends new and old knowledge. While effective, these approaches can be computationally expensive and may cause conflicts. In contrast, we shift our attention to Hierarchical Orthogonal Residual SprEad of the information matrix, which reduces noisy gradients and enables more stable edits from a different perspective. We demonstrate the effectiveness of our method HORSE through a clear theoretical comparison with several popular methods and extensive experiments conducted on two datasets across multiple LLMs. The results show that HORSE maintains precise massive editing across diverse scenarios. The code is available at https://github.com/XiaojieGu/HORSE
Authors:Guoming Ling, Zhongzhan Huang, Yupei Lin, Junxin Li, Shanshan Zhong, Hefeng Wu, Liang Lin
Abstract:
Chain‑of‑Thought reasoning has significantly enhanced the problem‑solving capabilities of Large Language Models. Unfortunately, current models generate reasoning steps sequentially without foresight, often becoming trapped in suboptimal reasoning paths with redundant steps. In contrast, we introduce Neural Chain‑of‑Thought Search (NCoTS), a framework that reformulates reasoning as a dynamic search for the optimal thinking strategy. By quantitatively characterizing the solution space, we reveal the existence of sparse superior reasoning paths that are simultaneously more accurate and concise than standard outputs. Our method actively navigates towards these paths by evaluating candidate reasoning operators using a dual‑factor heuristic that optimizes for both correctness and computational cost. Consequently, NCoTS achieves a Pareto improvement across diverse reasoning benchmarks, boosting accuracy by over 3.5% while reducing generation length by over 22%. Our code and data are available at https://github.com/MilkThink‑Lab/Neural‑CoT‑Search.
Authors:Yuling Shi, Maolin Sun, Zijun Liu, Mo Yang, Yixiong Fang, Tianran Sun, Xiaodong Gu
Abstract:
Retrieval‑Augmented Generation (RAG) has demonstrated significant effectiveness in enhancing large language models (LLMs) for complex multi‑hop question answering (QA). For multi‑hop QA tasks, current iterative approaches predominantly rely on LLMs to self‑guide and plan multi‑step exploration paths during retrieval, leading to substantial challenges in maintaining reasoning coherence across steps from inaccurate query decomposition and error propagation. To address these issues, we introduce Reasoning Tree Guided RAG (RT‑RAG), a novel hierarchical framework for complex multi‑hop QA. RT‑RAG systematically decomposes multi‑hop questions into explicit reasoning trees, minimizing inaccurate decomposition through structured entity analysis and consensus‑based tree selection that clearly separates core queries, known entities, and unknown entities. Subsequently, a bottom‑up traversal strategy employs iterative query rewriting and refinement to collect high‑quality evidence, thereby mitigating error propagation. Comprehensive experiments show that RT‑RAG substantially outperforms state‑of‑the‑art methods by 7.0% F1 and 6.0% EM, demonstrating the effectiveness of RT‑RAG in complex multi‑hop QA.
Authors:Shaoyang Xu, Wenxuan Zhang
Abstract:
Output diversity is crucial for Large Language Models as it underpins pluralism and creativity. In this work, we reveal that controlling the language used during model thinking‑the language of thought‑provides a novel and structural source of output diversity. Our preliminary study shows that different thinking languages occupy distinct regions in a model's thinking space. Based on this observation, we study two repeated sampling strategies under multilingual thinking‑Single‑Language Sampling and Mixed‑Language Sampling‑and conduct diversity evaluation on outputs that are controlled to be in English, regardless of the thinking language used. Across extensive experiments, we demonstrate that switching the thinking language from English to non‑English languages consistently increases output diversity, with a clear and consistent positive correlation such that languages farther from English in the thinking space yield larger gains. We further show that aggregating samples across multiple thinking languages yields additional improvements through compositional effects, and that scaling sampling with linguistic heterogeneity expands the model's diversity ceiling. Finally, we show that these findings translate into practical benefits in pluralistic alignment scenarios, leading to broader coverage of cultural knowledge and value orientations in LLM outputs. Our code is publicly available at https://github.com/iNLP‑Lab/Multilingual‑LoT‑Diversity.
Authors:Tanyu Chen, Tairan Chen, Kai Shen, Zhenghua Bao, Zhihui Zhang, Man Yuan, Yi Shi
Abstract:
Recent end‑to‑end spoken dialogue systems leverage speech tokenizers and neural audio codecs to enable LLMs to operate directly on discrete speech representations. However, these models often exhibit limited speaker identity preservation, hindering personalized voice interaction. In this work, we present Chroma 1.0, the first open‑source, real‑time, end‑to‑end spoken dialogue model that achieves both low‑latency interaction and high‑fidelity personalized voice cloning. Chroma achieves sub‑second end‑to‑end latency through an interleaved text‑audio token schedule (1:2) that supports streaming generation, while maintaining high‑quality personalized voice synthesis across multi‑turn conversations. Our experimental results demonstrate that Chroma achieves a 10.96% relative improvement in speaker similarity over the human baseline, with a Real‑Time Factor (RTF) of 0.43, while maintaining strong reasoning and dialogue capabilities. Our code and models are publicly available at https://github.com/FlashLabs‑AI‑Corp/FlashLabs‑Chroma and https://huggingface.co/FlashLabs/Chroma‑4B .
Authors:Jie Yang, Honglin Guo, Li Ji, Jiazheng Zhou, Rui Zheng, Zhikai Lei, Shuo Zhang, Zhiheng Xi, Shichun Liu, Yuxin Wang, Bo Wang, Yining Zheng, Tao Gui, Xipeng Qiu
Abstract:
The evolution of Large Language Models (LLMs) into autonomous agents has expanded the scope of AI coding from localized code generation to complex, repository‑level, and execution‑driven problem solving. However, current benchmarks predominantly evaluate code logic in static contexts, neglecting the dynamic, full‑process requirements of real‑world engineering, particularly in backend development which demands rigorous environment configuration and service deployment. To address this gap, we introduce ABC‑Bench, a benchmark explicitly designed to evaluate agentic backend coding within a realistic, executable workflow. Using a scalable automated pipeline, we curated 224 practical tasks spanning 8 languages and 19 frameworks from open‑source repositories. Distinct from previous evaluations, ABC‑Bench require the agents to manage the entire development lifecycle from repository exploration to instantiating containerized services and pass the external end‑to‑end API tests. Our extensive evaluation reveals that even state‑of‑the‑art models struggle to deliver reliable performance on these holistic tasks, highlighting a substantial disparity between current model capabilities and the demands of practical backend engineering. Our code is available at https://github.com/OpenMOSS/ABC‑Bench.
Authors:Lecheng Yan, Ruizhe Li, Guanhua Chen, Qing Li, Jiahui Geng, Wenxi Li, Vincent Wang, Chris Lee
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) is highly effective for enhancing LLM reasoning, yet recent evidence shows models like Qwen 2.5 achieve significant gains even with spurious or incorrect rewards. We investigate this phenomenon and identify a "Perplexity Paradox": spurious RLVR triggers a divergence where answer‑token perplexity drops while prompt‑side coherence degrades, suggesting the model is bypassing reasoning in favor of memorization. Using Path Patching, Logit Lens, JSD analysis, and Neural Differential Equations, we uncover a hidden Anchor‑Adapter circuit that facilitates this shortcut. We localize a Functional Anchor in the middle layers (L18‑20) that triggers the retrieval of memorized solutions, followed by Structural Adapters in later layers (L21+) that transform representations to accommodate the shortcut signal. Finally, we demonstrate that scaling specific MLP keys within this circuit allows for bidirectional causal steering‑artificially amplifying or suppressing contamination‑driven performance. Our results provide a mechanistic roadmap for identifying and mitigating data contamination in RLVR‑tuned models. Code is available at https://github.com/idwts/How‑RLVR‑Activates‑Memorization‑Shortcuts.
Authors:Xinwei Wu, Heng Liu, Xiaohu Zhao, Yuqi Ren, Linlong Xu, Longyue Wang, Deyi Xiong, Weihua Luo, Kaifu Zhang
Abstract:
Large Language Models (LLMs) frequently exhibit strong translation abilities, even without task‑specific fine‑tuning. However, the internal mechanisms governing this innate capability remain largely opaque. To demystify this process, we leverage Sparse Autoencoders (SAEs) and introduce a novel framework for identifying task‑specific features. Our method first recalls features that are frequently co‑activated on translation inputs and then filters them for functional coherence using a PCA‑based consistency metric. This framework successfully isolates a small set of translation initiation features. Causal interventions demonstrate that amplifying these features steers the model towards correct translation, while ablating them induces hallucinations and off‑task outputs, confirming they represent a core component of the model's innate translation competency. Moving from analysis to application, we leverage this mechanistic insight to propose a new data selection strategy for efficient fine‑tuning. Specifically, we prioritize training on mechanistically hard samples‑those that fail to naturally activate the translation initiation features. Experiments show this approach significantly improves data efficiency and suppresses hallucinations. Furthermore, we find these mechanisms are transferable to larger models of the same family. Our work not only decodes a core component of the translation mechanism in LLMs but also provides a blueprint for using internal model mechanism to create more robust and efficient models. The codes are available at https://github.com/flamewei123/AAAI26‑translation‑Initiation‑Features.
Authors:Yipu Dou, Wang Yang
Abstract:
As Large Language Models (LLMs) evolve from static chatbots into autonomous agents capable of tool execution, the landscape of AI safety is shifting from content moderation to action security. However, existing red‑teaming frameworks remain bifurcated: they either focus on rigid, script‑based text attacks or lack the architectural modularity to simulate complex, multi‑turn agentic exploitations. In this paper, we introduce AJAR (Adaptive Jailbreak Architecture for Red‑teaming), a proof‑of‑concept framework designed to bridge this gap through Protocol‑driven Cognitive Orchestration. Built upon the robust runtime of Petri, AJAR leverages the Model Context Protocol (MCP) to decouple adversarial logic from the execution loop, encapsulating state‑of‑the‑art algorithms like X‑Teaming as standardized, plug‑and‑play services. We validate the architectural feasibility of AJAR through a controlled qualitative case study, demonstrating its ability to perform stateful backtracking within a tool‑use environment. Furthermore, our preliminary exploration of the "Agentic Gap" reveals a complex safety dynamic: while tool usage introduces new injection vectors via code execution, the cognitive load of parameter formatting can inadvertently disrupt persona‑based attacks. AJAR is open‑sourced to facilitate the standardized, environment‑aware evaluation of this emerging attack surface. The code and data are available at https://github.com/douyipu/ajar.
Authors:Syed Waqas Zamir, Wassim Hamidouche, Boulbaba Ben Amor, Luana Marotti, Inbal Becker-Reshef, Juan Lavista Ferres
Abstract:
Large Language Models (LLMs) exhibit strong multilingual capabilities, yet remain fundamentally constrained by the severe imbalance in global language resources. While over 7,000 languages are spoken worldwide, only a small subset (fewer than 100) has sufficient digital presence to meaningfully influence modern LLM training. This disparity leads to systematic underperformance, cultural misalignment, and limited accessibility for speakers of low‑resource and extreme‑low‑resource languages. To address this gap, we introduce Bring Your Own Language (BYOL), a unified framework for scalable, language‑aware LLM development tailored to each language's digital footprint. BYOL begins with a language resource classification that maps languages into four tiers (Extreme‑Low, Low, Mid, High) using curated web‑scale corpora, and uses this classification to select the appropriate integration pathway. For low‑resource languages, we propose a full‑stack data refinement and expansion pipeline that combines corpus cleaning, synthetic text generation, continual pretraining, and supervised finetuning. Applied to Chichewa and Maori, this pipeline yields language‑specific LLMs that achieve approximately 12 percent average improvement over strong multilingual baselines across 12 benchmarks, while preserving English and multilingual capabilities via weight‑space model merging. For extreme‑low‑resource languages, we introduce a translation‑mediated inclusion pathway, and show on Inuktitut that a tailored machine translation system improves over a commercial baseline by 4 BLEU, enabling high‑accuracy LLM access when direct language modeling is infeasible. Finally, we release human‑translated versions of the Global MMLU‑Lite benchmark in Chichewa, Maori, and Inuktitut, and make our codebase and models publicly available at https://github.com/microsoft/byol .
Authors:Gerard Yeo, Svetlana Churina, Kokil Jaidka
Abstract:
Perceived trustworthiness underpins how users navigate online information, yet it remains unclear whether large language models (LLMs),increasingly embedded in search, recommendation, and conversational systems, represent this construct in psychologically coherent ways. We analyze how instruction‑tuned LLMs (Llama 3.1 8B, Qwen 2.5 7B, Mistral 7B) encode perceived trustworthiness in web‑like narratives using the PEACE‑Reviews dataset annotated for cognitive appraisals, emotions, and behavioral intentions. Across models, systematic layer‑ and head‑level activation differences distinguish high‑ from low‑trust texts, revealing that trust cues are implicitly encoded during pretraining. Probing analyses show linearly de‑codable trust signals and fine‑tuning effects that refine rather than restructure these representations. Strongest associations emerge with appraisals of fairness, certainty, and accountability‑self ‑‑ dimensions central to human trust formation online. These findings demonstrate that modern LLMs internalize psychologically grounded trust signals without explicit supervision, offering a representational foundation for designing credible, transparent, and trust‑worthy AI systems in the web ecosystem. Code and appendix are available at: https://github.com/GerardYeo/TrustworthinessLLM.
Authors:Changle Qu, Sunhao Dai, Hengyi Cai, Jun Xu, Shuaiqiang Wang, Dawei Yin
Abstract:
Tool‑Integrated Reasoning (TIR) empowers large language models (LLMs) to tackle complex tasks by interleaving reasoning steps with external tool interactions. However, existing reinforcement learning methods typically rely on outcome‑ or trajectory‑level rewards, assigning uniform advantages to all steps within a trajectory. This coarse‑grained credit assignment fails to distinguish effective tool calls from redundant or erroneous ones, particularly in long‑horizon multi‑turn scenarios. To address this, we propose MatchTIR, a framework that introduces fine‑grained supervision via bipartite matching‑based turn‑level reward assignment and dual‑level advantage estimation. Specifically, we formulate credit assignment as a bipartite matching problem between predicted and ground‑truth traces, utilizing two assignment strategies to derive dense turn‑level rewards. Furthermore, to balance local step precision with global task success, we introduce a dual‑level advantage estimation scheme that integrates turn‑level and trajectory‑level signals, assigning distinct advantage values to individual interaction turns. Extensive experiments on three benchmarks demonstrate the superiority of MatchTIR. Notably, our 4B model surpasses the majority of 8B competitors, particularly in long‑horizon and multi‑turn tasks. Our codes are available at https://github.com/quchangle1/MatchTIR.
Authors:Xi Shi, Mengxin Zheng, Qian Lou
Abstract:
Multi‑agent systems (MAS) enable complex reasoning by coordinating multiple agents, but often incur high inference latency due to multi‑step execution and repeated model invocations, severely limiting their scalability and usability in time‑sensitive scenarios. Most existing approaches primarily optimize task performance and inference cost, and explicitly or implicitly assume sequential execution, making them less optimal for controlling latency under parallel execution. In this work, we investigate learning‑based orchestration of multi‑agent systems with explicit latency supervision under parallel execution. We propose Latency‑Aware Multi‑agent System (LAMaS), a latency‑aware multi‑agent orchestration framework that enables parallel execution and explicitly optimizes the critical execution path, allowing the controller to construct execution topology graphs with lower latency under parallel execution. Our experiments show that our approach reduces critical path length by 38‑46% compared to the state‑of‑the‑art baseline for multi‑agent architecture search across multiple benchmarks, while maintaining or even improving task performance. These results highlight the importance of explicitly optimizing latency under parallel execution when designing efficient multi‑agent systems. The code is available at https://github.com/xishi404/LAMaS
Authors:Yinzhi Zhao, Ming Wang, Shi Feng, Xiaocui Yang, Daling Wang, Yifei Zhang
Abstract:
Large language models (LLMs) have achieved impressive performance across natural language tasks and are increasingly deployed in real‑world applications. Despite extensive safety alignment efforts, recent studies show that such alignment is often shallow and remains vulnerable to jailbreak attacks. Existing defense mechanisms, including decoding‑based constraints and post‑hoc content detectors, struggle against sophisticated jailbreaks, often intervening robust detection or excessively degrading model utility. In this work, we examine the decoding process of LLMs and make a key observation: even when successfully jailbroken, models internally exhibit latent safety‑related signals during generation. However, these signals are overridden by the model's drive for fluent continuation, preventing timely self‑correction or refusal. Building on this observation, we propose a simple yet effective approach that explicitly surfaces and leverages these latent safety signals for early detection of unsafe content during decoding. Experiments across diverse jailbreak attacks demonstrate that our approach significantly enhances safety, while maintaining low over‑refusal rates on benign inputs and preserving response quality. Our results suggest that activating intrinsic safety‑awareness during decoding offers a promising and complementary direction for defending against jailbreak attacks. Code is available at: https://github.com/zyz13590/SafeProbing.
Authors:Chengbing Wang, Wuqiang Zheng, Yang Zhang, Fengbin Zhu, Junyi Cheng, Yi Xie, Wenjie Wang, Fuli Feng
Abstract:
Large Language Models (LLMs) are increasingly deployed in human‑centric applications, yet they often fail to provide substantive emotional support. While Reinforcement Learning (RL) has been utilized to enhance empathy of LLMs, existing reward models typically evaluate empathy from a single perspective, overlooking the inherently bidirectional interaction nature of empathy between the supporter and seeker as defined by Empathy Cycle theory. To address this limitation, we propose Psychology‑grounded Empathetic Reward Modeling (PERM). PERM operationalizes empathy evaluation through a bidirectional decomposition: 1) Supporter perspective, assessing internal resonation and communicative expression; 2) Seeker perspective, evaluating emotional reception. Additionally, it incorporates a bystander perspective to monitor overall interaction quality. Extensive experiments on a widely‑used emotional intelligence benchmark and an industrial daily conversation dataset demonstrate that PERM outperforms state‑of‑the‑art baselines by over 10%. Furthermore, a blinded user study reveals a 70% preference for our approach, highlighting its efficacy in generating more empathetic responses. Our code, dataset, and models are available at https://github.com/ZhengWwwq/PERM.
Authors:Mark Kashirskiy, Ilya Makarov
Abstract:
We propose Strategy‑aware Surprise (SuS), a novel intrinsic motivation framework that uses pre‑post prediction mismatch as a novelty signal for exploration in reinforcement learning. Unlike traditional curiosity‑driven methods that rely solely on state prediction error, SuS introduces two complementary components: Strategy Stability (SS) and Strategy Surprise (SuS). SS measures consistency in behavioral strategy across temporal steps, while SuS captures unexpected outcomes relative to the agent's current strategy representation. Our combined reward formulation leverages both signals through learned weighting coefficients. We evaluate SuS on mathematical reasoning tasks using large language models, demonstrating significant improvements in both accuracy and solution diversity. Ablation studies confirm that removing either component results in at least 10% performance degradation, validating the synergistic nature of our approach. SuS achieves 17.4% improvement in Pass@1 and 26.4% improvement in Pass@5 compared to baseline methods, while maintaining higher strategy diversity throughout training.
Authors:Xueyun Tian, Wei Li, Bingbing Xu, Heng Dong, Yuanzhuo Wang, Huawei Shen
Abstract:
Recent Omni‑multimodal Large Language Models show promise in unified audio, vision, and text modeling. However, streaming audio‑video understanding remains challenging, as existing approaches suffer from disjointed capabilities: they typically exhibit incomplete modality support or lack autonomous proactive monitoring. To address this, we present ROMA, a real‑time omni‑multimodal assistant for unified reactive and proactive interaction. ROMA processes continuous inputs as synchronized multimodal units, aligning dense audio with discrete video frames to handle granularity mismatches. For online decision‑making, we introduce a lightweight speak head that decouples response initiation from generation to ensure precise triggering without task conflict. We train ROMA with a curated streaming dataset and a two‑stage curriculum that progressively optimizes for streaming format adaptation and proactive responsiveness. To standardize the fragmented evaluation landscape, we reorganize diverse benchmarks into a unified suite covering both proactive (alert, narration) and reactive (QA) settings. Extensive experiments across 12 benchmarks demonstrate ROMA achieves state‑of‑the‑art performance on proactive tasks while competitive in reactive settings, validating its robustness in unified real‑time omni‑multimodal understanding.
Authors:Songsong Tian, Kongsheng Zhuo, Zhendong Wang, Rong Shen, Shengtao Zhang, Yong Wu
Abstract:
In this paper, we present BAR‑SQL (Boundary‑Aware Reliable NL2SQL), a unified training framework that embeds reliability and boundary awareness directly into the generation process. We introduce a Seed Mutation data synthesis paradigm that constructs a representative enterprise corpus, explicitly encompassing multi‑step analytical queries alongside boundary cases including ambiguity and schema limitations. To ensure interpretability, we employ Knowledge‑Grounded Reasoning Synthesis, which produces Chain‑of‑Thought traces explicitly anchored in schema metadata and business rules. The model is trained through a two‑stage process: Supervised Fine‑Tuning (SFT) followed by Reinforcement Learning via Group Relative Policy Optimization. We design a Task‑Conditioned Hybrid Reward mechanism that simultaneously optimizes SQL execution accuracy‑leveraging Abstract Syntax Tree analysis and dense result matching‑and semantic precision in abstention responses. To evaluate reliability alongside generation accuracy, we construct and release Ent‑SQL‑Bench, which jointly assesse SQL precision and boundary‑aware abstention across ambiguous and unanswerable queries. Experimental results on this benchmark demonstrate that BAR‑SQL achieves 91.48% average accuracy, outperforming leading proprietary models, including Claude 4.5 Sonnet and GPT‑5, in both SQL generation quality and boundary‑aware abstention capability. The source code and benchmark are available anonymously at: https://github.com/TianSongS/BAR‑SQL.
Authors:Jan Christian Blaise Cruz, David Ifeoluwa Adelani, Alham Fikri Aji
Abstract:
We approach multilinguality as sense adaptation: aligning latent meaning representations across languages rather than relying solely on shared parameters and scale. In this paper, we introduce SENse‑based Symmetric Interlingual Alignment (SENSIA), which adapts a Backpack language model from one language to another by explicitly aligning sense‑level mixtures and contextual representations on parallel data, while jointly training a target‑language language modeling loss to preserve fluency. Across benchmarks on four typologically diverse languages, SENSIA generally outperforms comparable multilingual alignment methods and achieves competitive accuracy against monolingual from‑scratch baselines while using 2‑4x less target‑language data. Analyses of learned sense geometry indicate that local sense topology and global structure relative to English are largely preserved, and ablations show that the method is robust in terms of design and scale.
Authors:Yuxuan Lou, Kai Yang, Yang You
Abstract:
We present MoST (Mixture of Speech and Text), a novel multimodal large language model that seamlessly integrates speech and text processing through our proposed Modality‑Aware Mixture of Experts (MAMoE) architecture. While current multimodal models typically process diverse modality representations with identical parameters, disregarding their inherent representational differences, we introduce specialized routing pathways that direct tokens to modality‑appropriate experts based on input type. MAMoE simultaneously enhances modality‑specific learning and cross‑modal understanding through two complementary components: modality‑specific expert groups that capture domain‑specific patterns and shared experts that facilitate information transfer between modalities. Building on this architecture, we develop an efficient transformation pipeline that adapts the pretrained MoE language model through strategic post‑training on ASR and TTS datasets, followed by fine‑tuning with a carefully curated speech‑text instruction dataset. A key feature of this pipeline is that it relies exclusively on fully accessible, open‑source datasets to achieve strong performance and data efficiency. Comprehensive evaluations across ASR, TTS, audio language modeling, and spoken question answering benchmarks show that MoST consistently outperforms existing models of comparable parameter counts. Our ablation studies confirm that the modality‑specific routing mechanism and shared experts design significantly contribute to performance gains across all tested domains. To our knowledge, MoST represents the first fully open‑source speech‑text LLM built on a Mixture of Experts architecture. \footnoteWe release MoST model, training code, inference code, and training data at https://github.com/NUS‑HPC‑AI‑Lab/MoST
Authors:Arya Shah, Himanshu beniwal, Mayank Singh
Abstract:
Aligning multilingual assistants with culturally grounded user preferences is essential for serving India's linguistically diverse population of over one billion speakers across multiple scripts. However, existing benchmarks either focus on a single language or conflate retrieval with generation, leaving open the question of whether current embedding models can encode persona‑instruction compatibility without relying on response synthesis. We present a unified benchmark spanning 12 Indian languages and four evaluation tasks: monolingual and cross‑lingual persona‑to‑instruction retrieval, reverse retrieval from instruction to persona, and binary compatibility classification. Eight multilingual embedding models are evaluated in a frozen‑encoder setting with a thin logistic regression head for classification. E5‑Large‑Instruct achieves the highest Recall@1 of 27.4% on monolingual retrieval and 20.7% on cross‑lingual transfer, while BGE‑M3 leads reverse retrieval at 32.1% Recall@1. For classification, LaBSE attains 75.3% AUROC with strong calibration. These findings offer practical guidance for model selection in Indic multilingual retrieval and establish reproducible baselines for future work\footnoteCode, datasets, and models are publicly available at https://github.com/aryashah2k/PI‑Indic‑Align.
Authors:Hao Li, Yankai Yang, G. Edward Suh, Ning Zhang, Chaowei Xiao
Abstract:
Large Language Models (LLMs) have enabled the development of powerful agentic systems capable of automating complex workflows across various fields. However, these systems are highly vulnerable to indirect prompt injection attacks, where malicious instructions embedded in external data can hijack agent behavior. In this work, we present ReasAlign, a model‑level solution to improve safety alignment against indirect prompt injection attacks. The core idea of ReasAlign is to incorporate structured reasoning steps to analyze user queries, detect conflicting instructions, and preserve the continuity of the user's intended tasks to defend against indirect injection attacks. To further ensure reasoning logic and accuracy, we introduce a test‑time scaling mechanism with a preference‑optimized judge model that scores reasoning steps and selects the best trajectory. Comprehensive evaluations across various benchmarks show that ReasAlign maintains utility comparable to an undefended model while consistently outperforming Meta SecAlign, the strongest prior guardrail. On the representative open‑ended CyberSecEval2 benchmark, which includes multiple prompt‑injected tasks, ReasAlign achieves 94.6% utility and only 3.6% ASR, far surpassing the state‑of‑the‑art defensive model of Meta SecAlign (56.4% utility and 74.4% ASR). These results demonstrate that ReasAlign achieves the best trade‑off between security and utility, establishing a robust and practical defense against prompt injection attacks in real‑world agentic systems. Our code and experimental results could be found at https://github.com/leolee99/ReasAlign.
Authors:Prachuryya Kaushik, Ashish Anand
Abstract:
We introduce AWED‑FiNER, an open‑source ecosystem designed to bridge the gap in Fine‑grained Named Entity Recognition (FgNER) for 36 global languages spoken by more than 6.6 billion people. While Large Language Models (LLMs) dominate general Natural Language Processing (NLP) tasks, they often struggle with low‑resource languages and fine‑grained NLP tasks. AWED‑FiNER provides a collection of agentic toolkits, web applications, and several state‑of‑the‑art expert models that provides FgNER solutions across 36 languages. The agentic tools enable to route multilingual text to specialized expert models and fetch FgNER annotations within seconds. The web‑based platforms provide ready‑to‑use FgNER annotation service for non‑technical users. Moreover, the collection of language specific extremely small sized open‑source state‑of‑the‑art expert models facilitate offline deployment in resource contraint scenerios including edge devices. AWED‑FiNER covers languages spoken by over 6.6 billion people, including a specific focus on vulnerable languages such as Bodo, Manipuri, Bishnupriya, and Mizo. The resources can be accessed here: Agentic Tool (https://github.com/PrachuryyaKaushik/AWED‑FiNER), Web Application (https://hf.co/spaces/prachuryyaIITG/AWED‑FiNER), and 49 Expert Detector Models (https://hf.co/collections/prachuryyaIITG/awed‑finer).
Authors:Yutao Mou, Zhangchi Xue, Lijun Li, Peiyang Liu, Shikun Zhang, Wei Ye, Jing Shao
Abstract:
While LLM‑based agents can interact with environments via invoking external tools, their expanded capabilities also amplify security risks. Monitoring step‑level tool invocation behaviors in real time and proactively intervening before unsafe execution is critical for agent deployment, yet remains under‑explored. In this work, we first construct TS‑Bench, a novel benchmark for step‑level tool invocation safety detection in LLM agents. We then develop a guardrail model, TS‑Guard, using multi‑task reinforcement learning. The model proactively detects unsafe tool invocation actions before execution by reasoning over the interaction history. It assesses request harmfulness and action‑attack correlations, producing interpretable and generalizable safety judgments and feedback. Furthermore, we introduce TS‑Flow, a guardrail‑feedback‑driven reasoning framework for LLM agents, which reduces harmful tool invocations of ReAct‑style agents by 65 percent on average and improves benign task completion by approximately 10 percent under prompt injection attacks.
Authors:Zhenghao Liu, Zhuoyang Wu, Xinze Li, Yukun Yan, Shuo Wang, Zulong Chen, Yu Gu, Ge Yu, Maosong Sun
Abstract:
Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities, particularly in solving complex mathematical problems. Recent studies show that distilling long reasoning trajectories can effectively enhance the reasoning performance of small‑scale student models. However, teacher‑generated reasoning trajectories are often excessively long and structurally complex, making them difficult for student models to learn. This mismatch leads to a gap between the provided supervision signal and the learning capacity of the student model. To address this challenge, we propose Prefix‑ALIGNment distillation (P‑ALIGN), a framework that fully exploits teacher CoTs for distillation through adaptive prefix alignment. Specifically, P‑ALIGN adaptively truncates teacher‑generated reasoning trajectories by determining whether the remaining suffix is concise and sufficient to guide the student model. Then, P‑ALIGN leverages the teacher‑generated prefix to supervise the student model, encouraging effective prefix alignment. Experiments on multiple mathematical reasoning benchmarks demonstrate that P‑ALIGN outperforms all baselines by over 3%. Further analysis indicates that the prefixes constructed by P‑ALIGN provide more effective supervision signals, while avoiding the negative impact of redundant and uncertain reasoning components. All code is available at https://github.com/NEUIR/P‑ALIGN.
Authors:Peter Jemley
Abstract:
We present a hybrid transformer architecture that replaces discrete middle layers with a continuous‑depth Neural Ordinary Differential Equation (ODE) block, enabling inference‑time control over generation attributes via a learned steering signal. Unlike standard transformers that process representations through fixed discrete layers, our approach treats depth as a continuous variable governed by a learned vector field F_θ(H, τ, u), where u is a low‑dimensional control signal injected via explicit concatenation. We validate the architecture through four experiments: (1) gradient flow stability with zero exploding/vanishing gradient events, (2) semantic steering achieving 98%/88% accuracy for positive/negative sentiment control, (3) continuous interpolation validated by a negligible 0.068% trajectory divergence between fixed and adaptive solvers, and (4) efficiency benchmarking demonstrating latency parity with standard discrete baselines. Additionally, we show that adaptive ODE solvers reveal geometric structure in the learned dynamics: the control signal partitions the vector field into distinct dynamical regimes with different curvature characteristics. The adjoint method enables O(1) memory training regardless of integration depth. Our results demonstrate that continuous‑depth dynamics with learned control signals provide a viable, efficient mechanism for steerable language generation.
Authors:Sraavya Sambara, Yuan Pu, Ayman Ali, Vishala Mishra, Lionel Wong, Monica Agrawal
Abstract:
Real‑world health questions from patients often unintentionally embed false assumptions or premises. In such cases, safe medical communication typically involves redirection: addressing the implicit misconception and then responding to the underlying patient context, rather than the original question. While large language models (LLMs) are increasingly being used by lay users for medical advice, they have not yet been tested for this crucial competency. Therefore, in this work, we investigate how LLMs react to false premises embedded within real‑world health questions. We develop a semi‑automated pipeline to curate MedRedFlag, a dataset of 1100+ questions sourced from Reddit that require redirection. We then systematically compare responses from state‑of‑the‑art LLMs to those from clinicians. Our analysis reveals that LLMs often fail to redirect problematic questions, even when the problematic premise is detected, and provide answers that could lead to suboptimal medical decision making. Our benchmark and results reveal a novel and substantial gap in how LLMs perform under the conditions of real‑world health communication, highlighting critical safety concerns for patient‑facing medical AI systems. Code and dataset are available at https://github.com/srsambara‑1/MedRedFlag.
Authors:Kaustubh Shivshankar Shejole, Sourabh Deoghare, Pushpak Bhattacharyya
Abstract:
Punctuation plays a critical role in resolving semantic and structural ambiguity in written language. Machine Translation (MT) systems are now widely applied across diverse domains and languages, including many low‑resource settings. In this work, we focus on Marathi, a low‑ to middle‑resource language. We introduce Virām, the first diagnostic benchmark for assessing punctuation robustness in English‑to‑Marathi machine translation, consisting of 54 manually curated, punctuation‑ambiguous instances. We evaluate two primary strategies for enhancing reliability: a pipeline‑based restore‑then‑translate approach and direct fine‑tuned on punctuation‑varied data. Our results demonstrate that specialized fine‑tuned models and pipeline systems significantly improve translation quality over standard baselines on the Virām benchmark. Qualitative analysis reveals that the original model may result in wrong translations leading to wrong interpretations, while fine‑tuned models significantly improve overall reliability. Furthermore, we find that current Large Language Models (LLMs) lag behind these task‑specific approaches in preserving meaning for punctuation‑ambiguous text, thus necessitating further research in this area. The code and dataset is available at https://github.com/KaustubhShejole/Viram_Marathi.
Authors:Jing-Yi Zeng, Guan-Hua Huang
Abstract:
This study investigates how to efficiently build a domain‑specialized large language model (LLM) for statistics using the lightweight LLaMA‑3.2‑3B family as the foundation model (FM). We systematically compare three multi‑stage training pipelines, starting from a base FM with no instruction‑following capability, a base FM augmented with post‑hoc instruction tuning, and an instruction‑tuned FM with strong general reasoning abilities across continual pretraining, supervised fine‑tuning (SFT), reinforcement learning from human feedback (RLHF) preference alignment, and downstream task adaptation. Results show that pipelines beginning with a base FM fail to develop meaningful statistical reasoning, even after extensive instruction tuning, SFT, or RLHF alignment. In contrast, starting from LLaMA‑3.2‑3B‑Instruct enables effective domain specialization. A comprehensive evaluation of SFT variants reveals clear trade‑offs between domain expertise and general reasoning ability. We further demonstrate that direct preference optimization provides stable and effective RLHF preference alignment. Finally, we show that downstream fine‑tuning must be performed with extremely low intensity to avoid catastrophic forgetting in highly optimized models. The final model, StatLLaMA, achieves strong and balanced performance on benchmarks of mathematical reasoning, common‑sense reasoning, and statistical expertise, offering a practical blueprint for developing resource‑efficient statistical LLMs. The code is available at https://github.com/HuangDLab/StatLLaMA.
Authors:Yiwei Yan, Hao Li, Hua He, Gong Kai, Zhengyi Yang, Guanfeng Liu
Abstract:
Online medical consultations generate large volumes of conversational health data that often embed protected health information, requiring robust methods to classify data categories and assign risk levels in line with policies and practice. However, existing approaches lack unified standards and reliable automated methods to fulfill sensitivity classification for such conversational health data. This study presents a large language model‑based extraction pipeline, SALP‑CG, for classifying and grading privacy risks in online conversational health data. We concluded health‑data classification and grading rules in accordance with GB/T 39725‑2020. Combining few‑shot guidance, JSON Schema constrained decoding, and deterministic high‑risk rules, the backend‑agnostic extraction pipeline achieves strong category compliance and reliable sensitivity across diverse LLMs. On the MedDialog‑CN benchmark, models yields robust entity counts, high schema compliance, and accurate sensitivity grading, while the strongest model attains micro‑F1=0.900 for maximum‑level prediction. The category landscape stratified by sensitivity shows that Level 2‑3 items dominate, enabling re‑identification when combined; Level 4‑5 items are less frequent but carry outsize harm. SALP‑CG reliably helps classify categories and grading sensitivity in online conversational health data across LLMs, offering a practical method for health data governance. Code is available at https://github.com/dommii1218/SALP‑CG.
Authors:Tianyi Niu, Justin Chih-Yao Chen, Genta Indra Winata, Shi-Xiong Zhang, Supriyo Chakraborty, Sambit Sahu, Yue Zhang, Elias Stengel-Eskin, Mohit Bansal
Abstract:
Large Language Model (LLM) routers dynamically select optimal models for given inputs. Existing approaches typically assume access to ground‑truth labeled data, which is often unavailable in practice, especially when user request distributions are heterogeneous and unknown. We introduce Routing with Generated Data (RGD), a challenging setting in which routers are trained exclusively on generated queries and answers produced from high‑level task descriptions by generator LLMs. We evaluate query‑answer routers (using both queries and labels) and query‑only routers across four diverse benchmarks and 12 models, finding that query‑answer routers degrade faster than query‑only routers as generator quality decreases. Our analysis reveals two crucial characteristics of effective generators: they must accurately respond to their own questions, and their questions must produce sufficient performance differentiation among the model pool. We then show how filtering for these characteristics can improve the quality of generated data. We further propose CASCAL, a novel query‑only router that estimates model correctness through consensus voting and identifies model‑specific skill niches via hierarchical clustering. CASCAL is substantially more robust to generator quality, outperforming the best query‑answer router by 4.6% absolute accuracy when trained on weak generator data.
Authors:Yibo Wang, Lei Wang, Yue Deng, Keming Wu, Yao Xiao, Huanjin Yao, Liwei Kang, Hai Ye, Yongcheng Jing, Lidong Bing
Abstract:
Deep research systems are widely used for multi‑step web research, analysis, and cross‑source synthesis, yet their evaluation remains challenging. Existing benchmarks often require annotation‑intensive task construction, rely on static evaluation dimensions, or fail to reliably verify facts when citations are missing. To bridge these gaps, we introduce DeepResearchEval, an automated framework for deep research task construction and agentic evaluation. For task construction, we propose a persona‑driven pipeline generating realistic, complex research tasks anchored in diverse user profiles, applying a two‑stage filter Task Qualification and Search Necessity to retain only tasks requiring multi‑source evidence integration and external retrieval. For evaluation, we propose an agentic pipeline with two components: an Adaptive Point‑wise Quality Evaluation that dynamically derives task‑specific evaluation dimensions, criteria, and weights conditioned on each generated task, and an Active Fact‑Checking that autonomously extracts and verifies report statements via web search, even when citations are missing.
Authors:Yunqiao Yang, Wenbo Li, Houxing Ren, Zimu Lu, Ke Wang, Zhiyuan Huang, Zhuofan Zong, Mingjie Zhan, Hongsheng Li
Abstract:
The rapid evolution of Large Language Models (LLMs) has fostered diverse paradigms for automated slide generation, ranging from code‑driven layouts to image‑centric synthesis. However, evaluating these heterogeneous systems remains challenging, as existing protocols often struggle to provide comparable scores across architectures or rely on uncalibrated judgments. In this paper, we introduce SlidesGen‑Bench, a benchmark designed to evaluate slide generation through a lens of three core principles: universality, quantification, and reliability. First, to establish a unified evaluation framework, we ground our analysis in the visual domain, treating terminal outputs as renderings to remain agnostic to the underlying generation method. Second, we propose a computational approach that quantitatively assesses slides across three distinct dimensions ‑ Content, Aesthetics, and Editability ‑ offering reproducible metrics where prior works relied on subjective or reference‑dependent proxies. Finally, to ensure high correlation with human preference, we construct the Slides‑Align1.5k dataset, a human preference aligned dataset covering slides from nine mainstream generation systems across seven scenarios. Our experiments demonstrate that SlidesGen‑Bench achieves a higher degree of alignment with human judgment than existing evaluation pipelines. Our code and data are available at https://github.com/YunqiaoYang/SlidesGen‑Bench.
Authors:Xinze Li, Zhenghao Liu, Haidong Xin, Yukun Yan, Shuo Wang, Zheni Zeng, Sen Mei, Ge Yu, Maosong Sun
Abstract:
Retrieval‑Augmented Generation (RAG) enhances Large Language Models (LLMs) by incorporating external knowledge. Recently, some works have incorporated iterative knowledge accumulation processes into RAG models to progressively accumulate and refine query‑related knowledge, thereby constructing more comprehensive knowledge representations. However, these iterative processes often lack a coherent organizational structure, which limits the construction of more comprehensive and cohesive knowledge representations. To address this, we propose PAGER, a page‑driven autonomous knowledge representation framework for RAG. PAGER first prompts an LLM to construct a structured cognitive outline for a given question, which consists of multiple slots representing a distinct knowledge aspect. Then, PAGER iteratively retrieves and refines relevant documents to populate each slot, ultimately constructing a coherent page that serves as contextual input for guiding answer generation. Experiments on multiple knowledge‑intensive benchmarks and backbone models show that PAGER consistently outperforms all RAG baselines. Further analyses demonstrate that PAGER constructs higher‑quality and information‑dense knowledge representations, better mitigates knowledge conflicts, and enables LLMs to leverage external knowledge more effectively. All code is available at https://github.com/OpenBMB/PAGER.
Authors:Yexing Du, Kaiyuan Liu, Bihe Zhang, Youcheng Pan, Bo Yang, Liangyu Huo, Xiyuan Zhang, Jian Xie, Daojing He, Yang Xiang, Ming Liu, Bin Qin
Abstract:
With the rapid advancement of Multimodal Large Language Models (MLLMs), their potential has gained significant attention in Chinese Classical Studies (CCS). While existing research primarily focuses on text and visual modalities, the audio corpus within this domain remains largely underexplored. To bridge this gap, we introduce the Multi‑task Classical Chinese Literary Genre Audio Corpus (MCGA), a 119‑hour corpus comprising 22,000 audio samples. It encompasses a diverse range of literary genres across six tasks: Automatic Speech Recognition (ASR), Speech‑to‑Text Translation (S2TT), Speech Emotion Captioning (SEC), Spoken Question Answering (SQA), Speech Understanding (SU), and Speech Reasoning (SR). Through the evaluation of ten MLLMs, our experimental results demonstrate that current MLLMs still face substantial challenges on the MCGA test set. Furthermore, we introduce a domain‑specific metric for SEC and a metric to measure the consistency between speech and text capabilities. We release MCGA to the public to facilitate the development of more robust MLLMs. MCGA Corpus: https://github.com/yxduir/MCGA
Authors:Jing Ren, Bowen Li, Ziqi Xu, Renqiang Luo, Shuo Yu, Xin Ye, Haytham Fayek, Xiaodong Li, Feng Xia
Abstract:
Large Language Models (LLMs) are increasingly used for toxicity assessment in online moderation systems, where fairness across demographic groups is essential for equitable treatment. However, LLMs often produce inconsistent toxicity judgements for subtle expressions, particularly those involving implicit hate speech, revealing underlying biases that are difficult to correct through standard training. This raises a key question that existing approaches often overlook: when should corrective mechanisms be invoked to ensure fair and reliable assessments? To address this, we propose FairToT, an inference‑time framework that enhances LLM fairness through prompt‑guided toxicity assessment. FairToT identifies cases where demographic‑related variation is likely to occur and determines when additional assessment should be applied. In addition, we introduce two interpretable fairness indicators that detect such cases and improve inference consistency without modifying model parameters. Experiments on benchmark datasets show that FairToT reduces group‑level disparities while maintaining stable and reliable toxicity predictions, demonstrating that inference‑time refinement offers an effective and practical approach for fairness improvement in LLM‑based toxicity assessment systems. The source code can be found at https://aisuko.github.io/fair‑tot/.
Authors:Zhengyang Zhao, Lu Ma, Yizhen Jiang, Xiaochen Ma, Zimo Meng, Chengyu Shen, Lexiang Tang, Haoze Sun, Peng Pei, Wentao Zhang
Abstract:
The prevailing post‑training paradigm for Large Reasoning Models (LRMs)‑‑Supervised Fine‑Tuning (SFT) followed by Reinforcement Learning (RL)‑‑suffers from an intrinsic optimization mismatch: the rigid supervision inherent in SFT induces distributional collapse, thereby exhausting the exploration space necessary for subsequent RL. In this paper, we reformulate SFT within a unified post‑training framework and propose Gibbs Initialization with Finite Temperature (GIFT). We characterize standard SFT as a degenerate zero‑temperature limit that suppresses base priors. Conversely, GIFT incorporates supervision as a finite‑temperature energy potential, establishing a distributional bridge that ensures objective consistency throughout the post‑training pipeline. Our experiments demonstrate that GIFT significantly outperforms standard SFT and other competitive baselines when utilized for RL initialization, providing a mathematically principled pathway toward achieving global optimality in post‑training. Our code is available at https://github.com/zzy1127/GIFT.
Authors:Shaotian Yan, Kaiyuan Liu, Chen Shen, Bing Wang, Sinan Fan, Jun Zhang, Yue Wu, Zheng Wang, Jieping Ye
Abstract:
In this report, we introduce DASD‑4B‑Thinking, a lightweight yet highly capable, fully open‑source reasoning model. It achieves SOTA performance among open‑source models of comparable scale across challenging benchmarks in mathematics, scientific reasoning, and code generation ‑‑ even outperforming several larger models. We begin by critically reexamining a widely adopted distillation paradigm in the community: SFT on teacher‑generated responses, also known as sequence‑level distillation. Although a series of recent works following this scheme have demonstrated remarkable efficiency and strong empirical performance, they are primarily grounded in the SFT perspective. Consequently, these approaches focus predominantly on designing heuristic rules for SFT data filtering, while largely overlooking the core principle of distillation itself ‑‑ enabling the student model to learn the teacher's full output distribution so as to inherit its generalization capability. Specifically, we identify three critical limitations in current practice: i) Inadequate representation of the teacher's sequence‑level distribution; ii) Misalignment between the teacher's output distribution and the student's learning capacity; and iii) Exposure bias arising from teacher‑forced training versus autoregressive inference. In summary, these shortcomings reflect a systemic absence of explicit teacher‑student interaction throughout the distillation process, leaving the essence of distillation underexploited. To address these issues, we propose several methodological innovations that collectively form an enhanced sequence‑level distillation training pipeline. Remarkably, DASD‑4B‑Thinking obtains competitive results using only 448K training samples ‑‑ an order of magnitude fewer than those employed by most existing open‑source efforts. To support community research, we publicly release our models and the training dataset.
Authors:Hsiang-Wei Huang, Junbin Lu, Kuang-Ming Chen, Jenq-Neng Hwang
Abstract:
In this work, we explore the Large Language Model (LLM) agent reviewer dynamics in an Elo‑ranked review system using real‑world conference paper submissions. Multiple LLM agent reviewers with different personas are engage in multi round review interactions moderated by an Area Chair. We compare a baseline setting with conditions that incorporate Elo ratings and reviewer memory. Our simulation results showcase several interesting findings, including how incorporating Elo improves Area Chair decision accuracy, as well as reviewers' adaptive review strategy that exploits our Elo system without improving review effort. Our code is available at https://github.com/hsiangwei0903/EloReview.
Authors:Yao Tang, Li Dong, Yaru Hao, Qingxiu Dong, Furu Wei, Jiatao Gu
Abstract:
Large language models often solve complex reasoning tasks more effectively with Chain‑of‑Thought (CoT), but at the cost of long, low‑bandwidth token sequences. Humans, by contrast, often reason softly by maintaining a distribution over plausible next steps. Motivated by this, we propose Multiplex Thinking, a stochastic soft reasoning mechanism that, at each thinking step, samples K candidate tokens and aggregates their embeddings into a single continuous multiplex token. This preserves the vocabulary embedding prior and the sampling dynamics of standard discrete generation, while inducing a tractable probability distribution over multiplex rollouts. Consequently, multiplex trajectories can be directly optimized with on‑policy reinforcement learning (RL). Importantly, Multiplex Thinking is self‑adaptive: when the model is confident, the multiplex token is nearly discrete and behaves like standard CoT; when it is uncertain, it compactly represents multiple plausible next steps without increasing sequence length. Across challenging math reasoning benchmarks, Multiplex Thinking consistently outperforms strong discrete CoT and RL baselines from Pass@1 through Pass@1024, while producing shorter sequences. The code and checkpoints are available at https://github.com/GMLR‑Penn/Multiplex‑Thinking.
Authors:Yihan Hong, Huaiyuan Yao, Bolin Shen, Wanpeng Xu, Hua Wei, Yushun Dong
Abstract:
The LLM‑as‑a‑Judge paradigm promises scalable rubric‑based evaluation, yet aligning frozen black‑box models with human standards remains a challenge due to inherent generation stochasticity. We reframe judge alignment as a criteria transfer problem and isolate three recurrent failure modes: rubric instability caused by prompt sensitivity, unverifiable reasoning that lacks auditable evidence, and scale misalignment with human grading boundaries. To address these issues, we introduce RULERS (Rubric Unification, Locking, and Evidence‑anchored Robust Scoring), a compiler‑executor framework that transforms natural language rubrics into executable specifications. RULERS operates by compiling criteria into versioned immutable bundles, enforcing structured decoding with deterministic evidence verification, and applying lightweight Wasserstein‑based post‑hoc calibration, all without updating model parameters. Extensive experiments on essay and summarization benchmarks demonstrate that RULERS significantly outperforms representative baselines in human agreement, maintains strong stability against adversarial rubric perturbations, and enables smaller models to rival larger proprietary judges. Overall, our results suggest that reliable LLM judging requires executable rubrics, verifiable evidence, and calibrated scales rather than prompt phrasing alone. Code is available at https://github.com/LabRAI/Rulers.git.
Authors:Jinkwan Jang, Hyunbin Jin, Hyungjin Park, Kyubyung Chae, Taesup Kim
Abstract:
Time series forecasting is critical to real‑world decision making, yet most existing approaches remain unimodal and rely on extrapolating historical patterns. While recent progress in large language models (LLMs) highlights the potential for multimodal forecasting, existing benchmarks largely provide retrospective or misaligned raw context, making it unclear whether such models meaningfully leverage textual inputs. In practice, human experts incorporate what‑if scenarios with historical evidence, often producing distinct forecasts from the same observations under different scenarios. Inspired by this, we introduce What If TSF (WIT), a multimodal forecasting benchmark designed to evaluate whether models can condition their forecasts on contextual text, especially future scenarios. By providing expert‑crafted plausible or counterfactual scenarios, WIT offers a rigorous testbed for scenario‑guided multimodal forecasting. The benchmark is available at https://github.com/jinkwan1115/WhatIfTSF.
Authors:Rongji Li, Jian Xu, Xueqing Chen, Yisheng Yang, Jiayi Wang, Xingyu Chen, Chunyu Xie, Dawei Leng, Xu-Yao Zhang
Abstract:
In domains such as biomedicine, materials, and finance, high‑stakes deployment of large language models (LLMs) requires injecting private, domain‑specific knowledge that is proprietary, fast‑evolving, and under‑represented in public pretraining. However, the two dominant paradigms for private knowledge injection each have pronounced drawbacks: fine‑tuning is expensive to iterate, and continual updates risk catastrophic forgetting and general‑capability regression; retrieval‑augmented generation (RAG) keeps the base model intact but is brittle in specialized private corpora due to chunk‑induced evidence fragmentation, retrieval drift, and long‑context pressure that yields query‑dependent prompt inflation. Inspired by how multimodal LLMs align heterogeneous modalities into a shared semantic space, we propose Generation‑Augmented Generation (GAG), which treats private expertise as an additional expert modality and injects it via a compact, representation‑level interface aligned to the frozen base model, avoiding prompt‑time evidence serialization while enabling plug‑and‑play specialization and scalable multi‑domain composition with reliable selective activation. Across two private scientific QA benchmarks (immunology adjuvant and catalytic materials) and mixed‑domain evaluations, GAG improves specialist performance over strong RAG baselines by 15.34% and 14.86% on the two benchmarks, respectively, while maintaining performance on six open general benchmarks and enabling near‑oracle selective activation for scalable multi‑domain deployment. Code is publicly available at https://github.com/360CVGroup/GAG.
Authors:Hongjin Qian, Zhao Cao, Zheng Liu
Abstract:
Complex reasoning in tool‑augmented agent frameworks is inherently long‑horizon, causing reasoning traces and transient tool artifacts to accumulate and strain the bounded working context of large language models. Without explicit memory mechanisms, such accumulation disrupts logical continuity and undermines task alignment. This positions memory not as an auxiliary efficiency concern, but as a core component for sustaining coherent, goal‑directed reasoning over long horizons. We propose MemoBrain, an executive memory model for tool‑augmented agents that constructs a dependency‑aware memory over reasoning steps, capturing salient intermediate states and their logical relations. Operating as a co‑pilot alongside the reasoning agent, MemoBrain organizes reasoning progress without blocking execution and actively manages the working context. Specifically, it prunes invalid steps, folds completed sub‑trajectories, and preserves a compact, high‑salience reasoning backbone under a fixed context budget. Together, these mechanisms enable explicit cognitive control over reasoning trajectories rather than passive context accumulation. We evaluate MemoBrain on challenging long‑horizon benchmarks, including GAIA, WebWalker, and BrowseComp‑Plus, demonstrating consistent improvements over strong baselines.
Authors:Guoping Xu, Jayaram K. Udupa, Weiguo Lu, You Zhang
Abstract:
Deep learning‑based automatic medical image segmentation plays a critical role in clinical diagnosis and treatment planning but remains challenging in few‑shot scenarios due to the scarcity of annotated training data. Recently, self‑supervised foundation models such as DINOv3, which were trained on large natural image datasets, have shown strong potential for dense feature extraction that can help with the few‑shot learning challenge. Yet, their direct application to medical images is hindered by domain differences. In this work, we propose DINO‑AugSeg, a novel framework that leverages DINOv3 features to address the few‑shot medical image segmentation challenge. Specifically, we introduce WT‑Aug, a wavelet‑based feature‑level augmentation module that enriches the diversity of DINOv3‑extracted features by perturbing frequency components, and CG‑Fuse, a contextual information‑guided fusion module that exploits cross‑attention to integrate semantic‑rich low‑resolution features with spatially detailed high‑resolution features. Extensive experiments on six public benchmarks spanning five imaging modalities, including MRI, CT, ultrasound, endoscopy, and dermoscopy, demonstrate that DINO‑AugSeg consistently outperforms existing methods under limited‑sample conditions. The results highlight the effectiveness of incorporating wavelet‑domain augmentation and contextual fusion for robust feature representation, suggesting DINO‑AugSeg as a promising direction for advancing few‑shot medical image segmentation. Code and data will be made available on https://github.com/apple1986/DINO‑AugSeg.
Authors:Shailesh Rana
Abstract:
Negative constraints (instructions of the form "do not use word X") represent a fundamental test of instruction‑following capability in large language models. Despite their apparent simplicity, these constraints fail with striking regularity, and the conditions governing failure have remained poorly understood. This paper presents the first comprehensive mechanistic investigation of negative instruction failure. We introduce semantic pressure, a quantitative measure of the model's intrinsic probability of generating the forbidden token, and demonstrate that violation probability follows a tight logistic relationship with pressure (p=σ(‑2.40+2.27\cdot P_0); n=40,000 samples; bootstrap 95% CI for slope: [2.21,,2.33]). Through layer‑wise analysis using the logit lens technique, we establish that the suppression signal induced by negative instructions is present but systematically weaker in failures: the instruction reduces target probability by only 5.2 percentage points in failures versus 22.8 points in successes ‑‑ a 4.4× asymmetry. We trace this asymmetry to two mechanistically distinct failure modes. In priming failure (87.5% of violations), the instruction's explicit mention of the forbidden word paradoxically activates rather than suppresses the target representation. In override failure (12.5%), late‑layer feed‑forward networks generate contributions of +0.39 toward the target probability ‑‑ nearly 4× larger than in successes ‑‑ overwhelming earlier suppression signals. Activation patching confirms that layers 23‑‑27 are causally responsible: replacing these layers' activations flips the sign of constraint effects. These findings reveal a fundamental tension in negative constraint design: the very act of naming a forbidden word primes the model to produce it.
Authors:Simon Jegou, Maximilian Jeblick
Abstract:
Growing context lengths in transformer‑based language models have made the key‑value (KV) cache a critical inference bottleneck. While many KV cache pruning methods have been proposed, they have not yet been adopted in major inference engines due to speed‑‑accuracy trade‑offs. We introduce KVzap, a fast, input‑adaptive approximation of KVzip that works in both prefilling and decoding. On Qwen3‑8B, Llama‑3.1‑8B‑Instruct, and Qwen3‑32B across long‑context and reasoning tasks, KVzap achieves 2‑‑4× KV cache compression with negligible accuracy loss and achieves state‑of‑the‑art performance on the KVpress leaderboard. Code and models are available at https://github.com/NVIDIA/kvpress.
Authors:Haowen Hou, Jie Yang
Abstract:
Current Retrieval‑Augmented Generation (RAG) systems typically employ a traditional two‑stage pipeline: an embedding model for initial retrieval followed by a reranker for refinement. However, this paradigm suffers from significant inefficiency due to the lack of shared information between stages, leading to substantial redundant computation. To address this limitation, we propose State‑Centric Retrieval, a unified retrieval paradigm that utilizes "states" as a bridge to connect embedding models and rerankers. First, we perform state representation learning by fine‑tuning an RWKV‑based LLM, transforming it into EmbeddingRWKV, a unified model that serves as both an embedding model and a state backbone for extracting compact, reusable states. Building upon these reusable states, we further design a state‑based reranker to fully leverage precomputed information. During reranking, the model processes only query tokens, decoupling inference cost from document length and yielding a 5.4×‑‑44.8× speedup. Furthermore, we observe that retaining all intermediate layer states is unnecessary; with a uniform layer selection strategy, our model maintains 98.62% of full‑model performance using only 25% of the layers. Extensive experiments demonstrate that State‑Centric Retrieval achieves high‑quality retrieval and reranking results while significantly enhancing overall system efficiency. Code is available at \hrefhttps://github.com/howard‑hou/EmbeddingRWKVour GitHub repository.
Authors:Rei Taniguchi, Yuyang Dong, Makoto Onizuka, Chuan Xiao
Abstract:
Due to the prevalence of large language models (LLMs), key‑value (KV) cache reduction for LLM inference has received remarkable attention. Among numerous works that have been proposed in recent years, layer‑wise token pruning approaches, which select a subset of tokens at particular layers to retain in KV cache and prune others, are one of the most popular schemes. They primarily adopt a set of pre‑defined layers, at which tokens are selected. Such design is inflexible in the sense that the accuracy significantly varies across tasks and deteriorates in harder tasks such as KV retrieval. In this paper, we propose ASL, a training‑free method that adaptively chooses the selection layer for KV cache reduction, exploiting the variance of token ranks ordered by attention score. The proposed method balances the performance across different tasks while meeting the user‑specified KV budget requirement. ASL operates during the prefilling stage and can be jointly used with existing KV cache reduction methods such as SnapKV to optimize the decoding stage. By evaluations on the InfiniteBench, RULER, and NIAH benchmarks, we show that equipped with one‑shot token selection, where tokens are selected at a layer and propagated to deeper layers, ASL outperforms state‑of‑the‑art layer‑wise token selection methods in accuracy while maintaining decoding speed and KV cache reduction.
Authors:Zijing Wang, Yongkang Liu, Mingyang Wang, Ercong Nie, Deyuan Chen, Zhengjie Zhao, Shi Feng, Daling Wang, Xiaocui Yang, Yifei Zhang, Hinrich Schütze
Abstract:
Multimodal Large Language Models (MLLMs) rely on strong linguistic reasoning inherited from their base language models. However, multimodal instruction fine‑tuning paradoxically degrades this text's reasoning capability, undermining multimodal performance. To address this issue, we propose a training‑free framework to mitigate this degradation. Through layer‑wise vision token masking, we reveal a common three‑stage pattern in multimodal large language models: early‑modal separation, mid‑modal alignment, and late‑modal degradation. By analyzing the behavior of MLLMs at different stages, we propose a plateau‑guided model merging method that selectively injects base language model parameters into MLLMs. Experimental results based on five MLLMs on nine benchmarks demonstrate the effectiveness of our method. Attention‑based analysis further reveals that merging shifts attention from diffuse, scattered patterns to focused localization on task‑relevant visual regions. Our repository is on https://github.com/wzj1718/PlaM.
Authors:Jiaxuan Lu, Ziyu Kong, Yemin Wang, Rong Fu, Haiyuan Wan, Cheng Yang, Wenjie Lou, Haoran Sun, Lilong Wang, Yankai Jiang, Xiaosong Wang, Xiao Sun, Dongzhan Zhou
Abstract:
The central challenge of AI for Science is not reasoning alone, but the ability to create computational methods in an open‑ended scientific world. Existing LLM‑based agents rely on static, pre‑defined tool libraries, a paradigm that fundamentally fails in scientific domains where tools are sparse, heterogeneous, and intrinsically incomplete. In this paper, we propose Test‑Time Tool Evolution (TTE), a new paradigm that enables agents to synthesize, verify, and evolve executable tools during inference. By transforming tools from fixed resources into problem‑driven artifacts, TTE overcomes the rigidity and long‑tail limitations of static tool libraries. To facilitate rigorous evaluation, we introduce SciEvo, a benchmark comprising 1,590 scientific reasoning tasks supported by 925 automatically evolved tools. Extensive experiments show that TTE achieves state‑of‑the‑art performance in both accuracy and tool efficiency, while enabling effective cross‑domain adaptation of computational tools. The code and benchmark have been released at https://github.com/lujiaxuan0520/Test‑Time‑Tool‑Evol.
Authors:Linhao Zhong, Linyu Wu, Bozhen Fang, Tianjian Feng, Chenchen Jing, Wen Wang, Jiaheng Zhang, Hao Chen, Chunhua Shen
Abstract:
Diffusion Language Models (DLMs) offer a promising alternative for language modeling by enabling parallel decoding through iterative refinement. However, most DLMs rely on hard binary masking and discrete token assignments, which hinder the revision of early decisions and underutilize intermediate probabilistic representations. In this paper, we propose EvoToken‑DLM, a novel diffusion‑based language modeling approach that replaces hard binary masks with evolving soft token distributions. EvoToken‑DLM enables a progressive transition from masked states to discrete outputs, supporting revisable decoding. To effectively support this evolution, we introduce continuous trajectory supervision, which aligns training objectives with iterative probabilistic updates. Extensive experiments across multiple benchmarks show that EvoToken‑DLM consistently achieves superior performance, outperforming strong diffusion‑based and masked DLM baselines. Project webpage: https://aim‑uofa.github.io/EvoTokenDLM.
Authors:Tu Hu, Ronghao Chen, Shuo Zhang, Jianghao Yin, Mou Xiao Feng, Jingping Liu, Shaolei Zhang, Wenqi Jiang, Yuqi Fang, Sen Hu, Huacan Wang, Yi Xu
Abstract:
Self‑evolution methods enhance code generation through iterative "generate‑verify‑refine" cycles, yet existing approaches suffer from low exploration efficiency, failing to discover solutions with superior complexity within limited budgets. This inefficiency stems from initialization bias trapping evolution in poor solution regions, uncontrolled stochastic operations lacking feedback guidance, and insufficient experience utilization across tasks. To address these bottlenecks, we propose Controlled Self‑Evolution (CSE), which consists of three key components. Diversified Planning Initialization generates structurally distinct algorithmic strategies for broad solution space coverage. Genetic Evolution replaces stochastic operations with feedback‑guided mechanisms, enabling targeted mutation and compositional crossover. Hierarchical Evolution Memory captures both successful and failed experiences at inter‑task and intra‑task levels. Experiments on EffiBench‑X demonstrate that CSE consistently outperforms all baselines across various LLM backbones. Furthermore, CSE achieves higher efficiency from early generations and maintains continuous improvement throughout evolution. Our code is publicly available at https://github.com/QuantaAlpha/EvoControl.
Authors:Yanzhi Tian, Cunxiang Wang, Zeming Liu, Heyan Huang, Wenbo Yu, Dawei Song, Jie Tang, Yuhang Guo
Abstract:
Large Language Models (LLMs) have significantly advanced Machine Translation (MT), applying them to linguistically complex domains‑such as Social Network Services, literature etc. In these scenarios, translations often require handling non‑literal expressions, leading to the inaccuracy of MT metrics. To systematically investigate the reliability of MT metrics, we first curate a meta‑evaluation dataset focused on non‑literal translations, namely MENT. MENT encompasses four non‑literal translation domains and features source sentences paired with translations from diverse MT systems, with 7,530 human‑annotated scores on translation quality. Experimental results reveal the inaccuracies of traditional MT metrics and the limitations of LLM‑as‑a‑Judge, particularly the knowledge cutoff and score inconsistency problem. To mitigate these limitations, we propose RATE, a novel agentic translation evaluation framework, centered by a reflective Core Agent that dynamically invokes specialized sub‑agents. Experimental results indicate the efficacy of RATE, achieving an improvement of at least 3.2 meta score compared with current metrics. Further experiments demonstrate the robustness of RATE to general‑domain MT evaluation. Code and dataset are available at: https://github.com/BITHLP/RATE.
Authors:Xuan Li, Yining Wang, Haocai Luo, Shengping Liu, Jerry Liang, Ying Fu, Weihuang, Jun Yu, Junnan Zhu
Abstract:
Retrieval‑Augmented Generation (RAG) has become a pivotal paradigm for Large Language Models (LLMs), yet current approaches struggle with visually rich documents by treating text and images as isolated retrieval targets. Existing methods relying solely on cosine similarity often fail to capture the semantic reinforcement provided by cross‑modal alignment and layout‑induced coherence. To address these limitations, we propose BayesRAG, a novel multimodal retrieval framework grounded in Bayesian inference and Dempster‑Shafer evidence theory. Unlike traditional approaches that rank candidates strictly by similarity, BayesRAG models the intrinsic consistency of retrieved candidates across modalities as probabilistic evidence to refine retrieval confidence. Specifically, our method computes the posterior association probability for combinations of multimodal retrieval results, prioritizing text‑image pairs that mutually corroborate each other in terms of both semantics and layout. Extensive experiments demonstrate that BayesRAG significantly outperforms state‑of‑the‑art (SOTA) methods on challenging multimodal benchmarks. This study establishes a new paradigm for multimodal retrieval fusion that effectively resolves the isolation of heterogeneous modalities through an evidence fusion mechanism and enhances the robustness of retrieval outcomes. Our code is available at https://github.com/TioeAre/BayesRAG.
Authors:Kalvin Chang, Yiwen Shao, Jiahong Li, Dong Yu
Abstract:
Despite having hundreds of millions of speakers, Chinese dialects lag behind Mandarin in speech and language technologies. Most varieties are primarily spoken, making dialect‑to‑Mandarin speech‑LLMs (large language models) more practical than dialect LLMs. Building dialect‑to‑Mandarin speech‑LLMs requires speech representations with cross‑dialect semantic alignment between Chinese dialects and Mandarin. In this paper, we achieve such a cross‑dialect semantic alignment by training a speech encoder with ASR (automatic speech recognition)‑only data, as demonstrated by speech‑to‑speech retrieval on a new benchmark of spoken Chinese varieties that we contribute. Our speech encoder further demonstrates state‑of‑the‑art ASR performance on Chinese dialects. Together, our Chinese dialect benchmark, semantically aligned speech representations, and speech‑to‑speech retrieval evaluation lay the groundwork for future Chinese dialect speech‑LLMs. We release the benchmark at https://github.com/kalvinchang/yubao.
Authors:Yixi Zhou, Fan Zhang, Yu Chen, Haipeng Zhang, Preslav Nakov, Zhuohan Xie
Abstract:
Financial question answering (QA) over long corporate filings requires evidence to satisfy strict constraints on entities, financial metrics, fiscal periods, and numeric values. However, existing LLM‑based rerankers primarily optimize semantic relevance, leading to unstable rankings and opaque decisions on long documents. We propose FinCards, a structured reranking framework that reframes financial evidence selection as constraint satisfaction under a finance‑aware schema. FinCards represents filing chunks and questions using aligned schema fields (entities, metrics, periods, and numeric spans), enabling deterministic field‑level matching. Evidence is selected via a multi‑stage tournament reranking with stability‑aware aggregation, producing auditable decision traces. Across two corporate filing QA benchmarks, FinCards substantially improves early‑rank retrieval over both lexical and LLM‑based reranking baselines, while reducing ranking variance, without requiring model fine‑tuning or unpredictable inference budgets. Our code is available at https://github.com/XanderZhou2022/FINCARDS.
Authors:Haonan Bian, Zhiyuan Yao, Sen Hu, Zishan Xu, Shaolei Zhang, Yifu Guo, Ziliang Yang, Xueran Han, Huacan Wang, Ronghao Chen
Abstract:
As Large Language Models (LLMs) evolve from static dialogue interfaces to autonomous general agents, effective memory is paramount to ensuring long‑term consistency. However, existing benchmarks primarily focus on casual conversation or task‑oriented dialogue, failing to capture "long‑term project‑oriented" interactions where agents must track evolving goals. To bridge this gap, we introduce RealMem, the first benchmark grounded in realistic project scenarios. RealMem comprises over 2,000 cross‑session dialogues across eleven scenarios, utilizing natural user queries for evaluation. We propose a synthesis pipeline that integrates Project Foundation Construction, Multi‑Agent Dialogue Generation, and Memory and Schedule Management to simulate the dynamic evolution of memory. Experiments reveal that current memory systems face significant challenges in managing the long‑term project states and dynamic context dependencies inherent in real‑world projects. Our code and datasets are available at [https://github.com/AvatarMemory/RealMemBench](https://github.com/AvatarMemory/RealMemBench).
Authors:Jie Wu, Haoling Li, Xin Zhang, Jiani Guo, Jane Luo, Steven Liu, Yangyu Huang, Ruihang Chu, Scarlett Li, Yujiu Yang
Abstract:
Competitive programming presents great challenges for Code LLMs due to its intensive reasoning demands and high logical complexity. However, current Code LLMs still rely heavily on real‑world data, which limits their scalability. In this paper, we explore a fully synthetic approach: training Code LLMs with entirely generated tasks, solutions, and test cases, to empower code reasoning models without relying on real‑world data. To support this, we leverage feature‑based synthesis to propose a novel data synthesis pipeline called SynthSmith. SynthSmith shows strong potential in producing diverse and challenging tasks, along with verified solutions and tests, supporting both supervised fine‑tuning and reinforcement learning. Based on the proposed synthetic SFT and RL datasets, we introduce the X‑Coder model series, which achieves a notable pass rate of 62.9 avg@8 on LiveCodeBench v5 and 55.8 on v6, outperforming DeepCoder‑14B‑Preview and AReal‑boba2‑14B despite having only 7B parameters. In‑depth analysis reveals that scaling laws hold on our synthetic dataset, and we explore which dimensions are more effective to scale. We further provide insights into code‑centric reinforcement learning and highlight the key factors that shape performance through detailed ablations and analysis. Our findings demonstrate that scaling high‑quality synthetic data and adopting staged training can greatly advance code reasoning, while mitigating reliance on real‑world coding data.
Authors:Junyan Lin, Junlong Tong, Hao Wu, Jialiang Zhang, Jinming Liu, Xin Jin, Xiaoyu Shen
Abstract:
Multimodal Large Language Models (MLLMs) have achieved strong performance across many tasks, yet most systems remain limited to offline inference, requiring complete inputs before generating outputs. Recent streaming methods reduce latency by interleaving perception and generation, but still enforce a sequential perception‑generation cycle, limiting real‑time interaction. In this work, we target a fundamental bottleneck that arises when extending MLLMs to real‑time video understanding: the global positional continuity constraint imposed by standard positional encoding schemes. While natural in offline inference, this constraint tightly couples perception and generation, preventing effective input‑output parallelism. To address this limitation, we propose a parallel streaming framework that relaxes positional continuity through three designs: Overlapped, Group‑Decoupled, and Gap‑Isolated. These designs enable simultaneous perception and generation, allowing the model to process incoming inputs while producing responses in real time. Extensive experiments reveal that Group‑Decoupled achieves the best efficiency‑performance balance, maintaining high fluency and accuracy while significantly reducing latency. We further show that the proposed framework yields up to 2x acceleration under balanced perception‑generation workloads, establishing a principled pathway toward speak‑while‑watching real‑time systems. We make all our code publicly available: https://github.com/EIT‑NLP/Speak‑While‑Watching.
Authors:Xuannan Liu, Xiao Yang, Zekun Li, Peipei Li, Ran He
Abstract:
As LLM‑based agents operate over sequential multi‑step reasoning, hallucinations arising at intermediate steps risk propagating along the trajectory, thus degrading overall reliability. Unlike hallucination detection in single‑turn responses, diagnosing hallucinations in multi‑step workflows requires identifying which step causes the initial divergence. To fill this gap, we propose a new research task, automated hallucination attribution of LLM‑based agents, aiming to identify the step responsible for the hallucination and explain why. To support this task, we introduce AgentHallu, a comprehensive benchmark with: (1) 693 high‑quality trajectories spanning 7 agent frameworks and 5 domains, (2) a hallucination taxonomy organized into 5 categories (Planning, Retrieval, Reasoning, Human‑Interaction, and Tool‑Use) and 14 sub‑categories, and (3) multi‑level annotations curated by humans, covering binary labels, hallucination‑responsible steps, and causal explanations. We evaluate 13 leading models, and results show the task is challenging even for top‑tier models (like GPT‑5, Gemini‑2.5‑Pro). The best‑performing model achieves only 41.1% step localization accuracy, where tool‑use hallucinations are the most challenging at just 11.6%. We believe AgentHallu will catalyze future research into developing robust, transparent, and reliable agentic systems.
Authors:Liang Chen, Weichu Xie, Yiyan Liang, Hongfeng He, Hans Zhao, Zhibo Yang, Zhiqi Huang, Haoning Wu, Haoyu Lu, Y. charles, Yiping Bao, Yuantao Fan, Guopeng Li, Haiyang Shen, Xuanzhong Chen, Wendong Xu, Shuzheng Si, Zefan Cai, Wenhao Chai, Ziqi Huang, Fangfu Liu, Tianyu Liu, Baobao Chang, Xiaobo Hu, Kaiyuan Chen, Yixin Ren, Yang Liu, Yuan Gong, Kuan Li
Abstract:
While humans develop core visual skills long before acquiring language, contemporary Multimodal LLMs (MLLMs) still rely heavily on linguistic priors to compensate for their fragile visual understanding. We uncovered a crucial fact: state‑of‑the‑art MLLMs consistently fail on basic visual tasks that humans, even 3‑year‑olds, can solve effortlessly. To systematically investigate this gap, we introduce BabyVision, a benchmark designed to assess core visual abilities independent of linguistic knowledge for MLLMs. BabyVision spans a wide range of tasks, with 388 items divided into 22 subclasses across four key categories. Empirical results and human evaluation reveal that leading MLLMs perform significantly below human baselines. Gemini3‑Pro‑Preview scores 49.7, lagging behind 6‑year‑old humans and falling well behind the average adult score of 94.1. These results show despite excelling in knowledge‑heavy evaluations, current MLLMs still lack fundamental visual primitives. Progress in BabyVision represents a step toward human‑level visual perception and reasoning capabilities. We also explore solving visual reasoning with generation models by proposing BabyVision‑Gen and automatic evaluation toolkit. Our code and benchmark data are released at https://github.com/UniPat‑AI/BabyVision for reproduction.
Authors:Minghui Jia, Qichao Zhang, Ali Luo, Linjing Li, Shuo Ye, Hailing Lu, Wen Hou, Dongbin Zhao
Abstract:
Due to the limited generalization and interpretability of deep learning classifiers, The final vetting of rare celestial object candidates still relies on expert visual inspection‑‑a manually intensive process. In this process, astronomers leverage specialized tools to analyze spectra and construct reliable catalogs. However, this practice has become the primary bottleneck, as it is fundamentally incapable of scaling with the data deluge from modern spectroscopic surveys. To bridge this gap, we propose Spec‑o3, a tool‑augmented vision‑language agent that performs astronomer‑aligned spectral inspection via interleaved multimodal chain‑of‑thought reasoning. Spec‑o3 is trained with a two‑stage post‑training recipe: cold‑start supervised fine‑tuning on expert inspection trajectories followed by outcome‑based reinforcement learning on rare‑type verification tasks. Evaluated on five rare‑object identification tasks from LAMOST, Spec‑o3 establishes a new State‑of‑the‑Art, boosting the macro‑F1 score from 28.3 to 76.5 with a 7B parameter base model and outperforming both proprietary VLMs and specialized deep models. Crucially, the agent demonstrates strong generalization to unseen inspection tasks across survey shifts (from LAMOST to SDSS/DESI). Expert evaluations confirm that its reasoning traces are coherent and physically consistent, supporting transparent and trustworthy decision‑making. Code, data, and models are available at \hrefhttps://github.com/Maxwell‑Jia/spec‑o3Project HomePage.
Authors:Xuezhe Ma, Shicheng Wen, Linghao Jin, Bilge Acun, Ruihang Lai, Bohan Hou, Will Lin, Hao Zhang, Songlin Yang, Ryan Lee, Mengxi Wu, Jonathan May, Luke Zettlemoyer, Carole-Jean Wu
Abstract:
Designing a unified neural network to efficiently and inherently process sequential data with arbitrary lengths is a central and challenging problem in sequence modeling. The design choices in Transformer, including quadratic complexity and weak length extrapolation, have limited their ability to scale to long sequences. In this work, we propose Gecko, a neural architecture that inherits the design of Mega and Megalodon (exponential moving average with gated attention), and further introduces multiple technical components to improve its capability to capture long range dependencies, including timestep decay normalization, sliding chunk attention mechanism, and adaptive working memory. In a controlled pretraining comparison with Llama2 and Megalodon in the scale of 7 billion parameters and 2 trillion training tokens, Gecko achieves better efficiency and long‑context scalability. Gecko reaches a training loss of 1.68, significantly outperforming Llama2‑7B (1.75) and Megalodon‑7B (1.70), and landing close to Llama2‑13B (1.67). Notably, without relying on any context‑extension techniques, Gecko exhibits inherent long‑context processing and retrieval capabilities, stably handling sequences of up to 4 million tokens and retrieving information from contexts up to 4× longer than its attention window. Code: https://github.com/XuezheMax/gecko‑llm
Authors:Weihao Hong, Zhiyuan Jiang, Bingyu Shen, Xinlei Guan, Yangyi Feng, Meng Xu, Boyang Li
Abstract:
Vision‑Language Models (VLMs) are increasingly used in safety‑critical applications that require reliable visual grounding. However, these models often hallucinate details that are not present in the image to satisfy user prompts. While recent datasets and benchmarks have been introduced to evaluate systematic hallucinations in VLMs, many hallucination behaviors remain insufficiently characterized. In particular, prior work primarily focuses on object presence or absence, leaving it unclear how prompt phrasing and structural constraints can systematically induce hallucinations. In this paper, we investigate how different forms of prompt pressure influence hallucination behavior. We introduce Ghost‑100, a procedurally generated dataset of synthetic scenes in which key visual details are deliberately removed, enabling controlled analysis of absence‑based hallucinations. Using a structured 5‑Level Prompt Intensity Framework, we vary prompts from neutral queries to toxic demands and rigid formatting constraints. We evaluate three representative open‑weight VLMs: MiniCPM‑V 2.6‑8B, Qwen2‑VL‑7B, and Qwen3‑VL‑8B. Across all three models, hallucination rates do not increase monotonically with prompt intensity. All models exhibit reductions at higher intensity levels at different thresholds, though not all show sustained reduction under maximum coercion. These results suggest that current safety alignment is more effective at detecting semantic hostility than structural coercion, revealing model‑specific limitations in handling compliance pressure. Our dataset is available at: https://github.com/bli1/tone‑matters
Authors:Xin Guo, Rongjunchen Zhang, Guilong Lu, Xuntao Guo, Shuai Jia, Zhi Yang, Liwen Zhang
Abstract:
Large language models have undergone rapid evolution, emerging as a pivotal technology for intelligence in financial operations. However, existing benchmarks are often constrained by pitfalls such as reliance on simulated or general‑purpose samples and a focus on singular, offline static scenarios. Consequently, they fail to align with the requirements for authenticity and real‑time responsiveness in financial services, leading to a significant discrepancy between benchmark performance and actual operational efficacy. To address this, we introduce BizFinBench.v2, the first large‑scale evaluation benchmark grounded in authentic business data from both Chinese and U.S. equity markets, integrating online assessment. We performed clustering analysis on authentic user queries from financial platforms, resulting in eight fundamental tasks and two online tasks across four core business scenarios, totaling 29,578 expert‑level Q&A pairs. Experimental results demonstrate that ChatGPT‑5 achieves a prominent 61.5% accuracy in main tasks, though a substantial gap relative to financial experts persists; in online tasks, DeepSeek‑R1 outperforms all other commercial LLMs. Error analysis further identifies the specific capability deficiencies of existing models within practical financial business contexts. BizFinBench.v2 transcends the limitations of current benchmarks, achieving a business‑level deconstruction of LLM financial capabilities and providing a precise basis for evaluating efficacy in the widespread deployment of LLMs within the financial domain. The data and code are available at https://github.com/HiThink‑Research/BizFinBench.v2.
Authors:Anshul Kumar
Abstract:
Tokens are the basic units of Large Language Models (LLMs). LLMs rely on tokenizers to segment text into these tokens, and tokenization is the primary determinant of computational and inference cost. Sanskrit, one of the oldest languages, is hypothesized to express more meaning per token due to its morphology and grammar rules; however, no prior work has quantified this. We use a dataset of 701 parallel verses of the Bhagavad Gita, which comprises three languages‑Sanskrit, English, and Hindi along with transliteration of Sanskrit into English. We test tokenizers including SentencePiece (SPM), older GPT models, and the latest generation tokenizers from Gemini and GPT. We use metrics of token count, characters per token (token efficiency), and tokens per character (token cost). Results show a ~2x difference in token counts between Sanskrit and English/Hindi under the unbiased SPM baseline. English/Hindi translations of Sanskrit commentary resulted in an approximately 20x increase in token count. GPT o200k base (latest, used by GPT‑4o) and Gemini (latest) reduce bias by a significant degree compared to GPT cl100k base (used until GPT‑4), but still fail to fully capture Sanskrit's compactness. This matters because there might be a penalty bias for non‑English users, which inflates the token count. This research provides a foundation for improving future tokenizer design and shows the potential of Sanskrit for highly compact encoding, saving on cost while speeding up training and inference. The code and dataset are available at https://github.com/anshulkr713/sanskrit‑token‑efficiency
Authors:Yueze Liu, Ajay Nagi Reddy Kumdam, Ronit Kanjilal, Hao Yang, Yichi Zhang
Abstract:
Modern roleplaying models are increasingly sophisticated, yet they consistently struggle to capture the essence of believable, engaging characters. We argue this failure stems from training paradigms that overlook the dynamic interplay of a character's internal world. Current approaches, including Retrieval‑Augmented Generation (RAG), fact‑based priming, literature‑based learning, and synthetic data generation, exhibit recurring limitations in modeling the deliberative, value‑conflicted reasoning that defines human interaction. In this paper, we identify four core concepts essential for character authenticity: Values, Experiences, Judgments, and Abilities (VEJA). We propose the VEJA framework as a new paradigm for data curation that addresses these systemic limitations. To illustrate the qualitative ceiling enabled by our framework, we present a pilot study comparing a manually curated, VEJA‑grounded dataset against a state‑of‑the‑art synthetic baseline. Using an LLM‑as‑judge evaluation, our findings demonstrate a significant quality gap, suggesting that a shift toward conceptually grounded data curation, as embodied by VEJA, is necessary for creating roleplaying agents with genuine depth and narrative continuity. The full dataset is available at https://github.com/HyouinKyoumaIRL/Operation‑Veja
Authors:Chengming Cui, Tianxin Wei, Ziyi Chen, Ruizhong Qiu, Zhichen Zeng, Zhining Liu, Xuying Ning, Duo Zhou, Jingrui He
Abstract:
Large language models (LLMs) exhibit complementary strengths arising from differences in pretraining data, model architectures, and decoding behaviors. Inference‑time ensembling provides a practical way to combine these capabilities without retraining. However, existing ensemble approaches suffer from fundamental limitations. Most rely on fixed fusion granularity, which lacks the flexibility required for mid‑generation adaptation and fails to adapt to different generation characteristics across tasks. To address these challenges, we propose AdaFuse, an adaptive ensemble decoding framework that dynamically selects semantically appropriate fusion units during generation. Rather than committing to a fixed granularity, AdaFuse adjusts fusion behavior on the fly based on the decoding context, with words serving as basic building blocks for alignment. To be specific, we introduce an uncertainty‑based criterion to decide whether to apply ensembling at each decoding step. Under confident decoding states, the model continues generation directly. In less certain states, AdaFuse invokes a diversity‑aware scaling strategy to explore alternative candidate continuations and inform ensemble decisions. This design establishes a synergistic interaction between adaptive ensembling and test‑time scaling, where ensemble decisions guide targeted exploration, and the resulting diversity in turn strengthens ensemble quality. Experiments on open‑domain question answering, arithmetic reasoning, and machine translation demonstrate that AdaFuse consistently outperforms strong ensemble baselines, achieving an average relative improvement of 6.88%. The code is available at https://github.com/CCM0111/AdaFuse.
Authors:Jiajie Zhang, Xin Lv, Ling Feng, Lei Hou, Juanzi Li
Abstract:
Reinforcement learning (RL) has emerged as a critical technique for enhancing LLM‑based deep search agents. However, existing approaches primarily rely on binary outcome rewards, which fail to capture the comprehensiveness and factuality of agents' reasoning process, and often lead to undesirable behaviors such as shortcut exploitation and hallucinations. To address these limitations, we propose Citation‑aware Rubric Rewards (CaRR), a fine‑grained reward framework for deep search agents that emphasizes reasoning comprehensiveness, factual grounding, and evidence connectivity. CaRR decomposes complex questions into verifiable single‑hop rubrics and requires agents to satisfy these rubrics by explicitly identifying hidden entities, supporting them with correct citations, and constructing complete evidence chains that link to the predicted answer. We further introduce Citation‑aware Group Relative Policy Optimization (C‑GRPO), which combines CaRR and outcome rewards for training robust deep search agents. Experiments show that C‑GRPO consistently outperforms standard outcome‑based RL baselines across multiple deep search benchmarks. Our analysis also validates that C‑GRPO effectively discourages shortcut exploitation, promotes comprehensive, evidence‑grounded reasoning, and exhibits strong generalization to open‑ended deep research tasks. Our code and data are available at https://github.com/THUDM/CaRR.
Authors:Víctor Gallego
Abstract:
We propose a framework that amortizes the cost of inference‑time reasoning by converting transient critiques into retrievable guidelines, through a file‑based memory system and agent‑controlled tool calls. We evaluate this method on the Rubric Feedback Bench, a novel dataset for rubric‑based learning. Experiments demonstrate that our augmented LLMs rapidly match the performance of test‑time refinement pipelines while drastically reducing inference cost.
Authors:Jingsheng Zheng, Jintian Zhang, Yujie Luo, Yuren Mao, Yunjun Gao, Lun Du, Huajun Chen, Ningyu Zhang
Abstract:
Autonomous machine learning agents have revolutionized scientific discovery, yet they remain constrained by a Generate‑Execute‑Feedback paradigm. Previous approaches suffer from a severe Execution Bottleneck, as hypothesis evaluation relies strictly on expensive physical execution. To bypass these physical constraints, we internalize execution priors to substitute costly runtime checks with instantaneous predictive reasoning, drawing inspiration from World Models. In this work, we formalize the task of Data‑centric Solution Preference and construct a comprehensive corpus of 18,438 pairwise comparisons. We demonstrate that LLMs exhibit significant predictive capabilities when primed with a Verified Data Analysis Report, achieving 61.5% accuracy and robust confidence calibration. Finally, we instantiate this framework in FOREAGENT, an agent that employs a Predict‑then‑Verify loop, achieving a 6x acceleration in convergence while surpassing execution‑based baselines by +6%. Our code and dataset will be publicly available soon at https://github.com/zjunlp/predict‑before‑execute.
Authors:Haoming Xu, Ningyuan Zhao, Yunzhi Yao, Weihong Xu, Hongru Wang, Xinle Deng, Shumin Deng, Jeff Z. Pan, Huajun Chen, Ningyu Zhang
Abstract:
As Large Language Models (LLMs) are increasingly deployed in real‑world settings, correctness alone is insufficient. Reliable deployment requires maintaining truthful beliefs under contextual perturbations. Existing evaluations largely rely on point‑wise confidence like Self‑Consistency, which can mask brittle belief. We show that even facts answered with perfect self‑consistency can rapidly collapse under mild contextual interference. To address this gap, we propose Neighbor‑Consistency Belief (NCB), a structural measure of belief robustness that evaluates response coherence across a conceptual neighborhood. To validate the efficiency of NCB, we introduce a new cognitive stress‑testing protocol that probes outputs stability under contextual interference. Experiments across multiple LLMs show that the performance of high‑NCB data is relatively more resistant to interference. Finally, we present Structure‑Aware Training (SAT), which optimizes context‑invariant belief structure and reduces long‑tail knowledge brittleness by approximately 30%. Code will be available at https://github.com/zjunlp/belief.
Authors:Zihang Tian, Rui Li, Jingsen Zhang, Xiaohe Bo, Wei Huo, Xu Chen
Abstract:
Large language model (LLM) routing aims to exploit the specialized strengths of different LLMs for diverse tasks. However, existing approaches typically focus on selecting LLM architectures while overlooking parameter settings, which are critical for task performance. In this paper, we introduce HAPS, a hierarchical LLM routing framework that jointly searches over model architectures and parameters. Specifically, we use a high‑level router to select among candidate LLM architectures, and then search for the optimal parameters for the selected architectures based on a low‑level router. We design a parameter generation network to share parameters between the two routers to mutually enhance their capabilities. In the training process, we design a reward‑augmented objective to effectively optimize our framework. Experiments on two commonly used benchmarks show that HAPS consistently outperforms strong routing baselines. We have released our code at https://github.com/zihangtian/HAPS.
Authors:Alexandra Dragomir, Florin Brad, Radu Tudor Ionescu
Abstract:
Large language models (LLMs) have demonstrated competitive performance in zero‑shot multilingual machine translation (MT). Some follow‑up works further improved MT performance via preference optimization, but they leave a key aspect largely underexplored: the order in which data samples are given during training. We address this topic by integrating curriculum learning into various state‑of‑the‑art preference optimization algorithms to boost MT performance. We introduce a novel curriculum learning strategy with restarts (CLewR), which reiterates easy‑to‑hard curriculum multiple times during training to effectively mitigate the catastrophic forgetting of easy examples. We demonstrate consistent gains across several model families (Gemma2, Qwen2.5, Llama3.1) and preference optimization techniques. We publicly release our code at https://github.com/alexandra‑dragomir/CLewR.
Authors:Liu Zai
Abstract:
Pretokenization is a crucial, sequential pass in Byte‑level BPE tokenizers. Our proposed new implementation, Peek2, serves as a drop‑in replacement for cl100k‑like pretokenizers used in GPT‑3, LLaMa‑3, and Qwen‑2.5. Designed with performance and safety in mind, Peek2 is Regex‑free and delivers a 1.11× improvement in overall throughput across the entire Byte‑level BPE encoding process. This algorithm runs entirely on the CPU, has stable linear complexity O(n) , and provides presegmentation results identical to those of the original Regex‑based pretokenizer.
Authors:Xiaoshuai Song, Haofei Chang, Guanting Dong, Yutao Zhu, Zhicheng Dou, Ji-Rong Wen
Abstract:
Large language models (LLMs) are expected to be trained to act as agents in various real‑world environments, but this process relies on rich and varied tool‑interaction sandboxes. However, access to real systems is often restricted; LLM‑simulated environments are prone to hallucinations and inconsistencies; and manually built sandboxes are hard to scale. In this paper, we propose EnvScaler, an automated framework for scalable tool‑interaction environments via programmatic synthesis. EnvScaler comprises two components. First, SkelBuilder constructs diverse environment skeletons through topic mining, logic modeling, and quality evaluation. Then, ScenGenerator generates multiple task scenarios and rule‑based trajectory validation functions for each environment. With EnvScaler, we synthesize 191 environments and about 7K scenarios, and apply them to Supervised Fine‑Tuning (SFT) and Reinforcement Learning (RL) for Qwen3 series models. Results on three benchmarks show that EnvScaler significantly improves LLMs' ability to solve tasks in complex environments involving multi‑turn, multi‑tool interactions. We release our code and data at https://github.com/RUC‑NLPIR/EnvScaler.
Authors:Honghao Liu, Xuhui Jiang, Chengjin Xu, Cehao Yang, Yiran Cheng, Lionel Ni, Jian Guo
Abstract:
Preserving privacy in sensitive data while pretraining large language models on small, domain‑specific corpora presents a significant challenge. In this work, we take an exploratory step toward privacy‑preserving continual pretraining by proposing an entity‑based framework that synthesizes encrypted training data to protect personally identifiable information (PII). Our approach constructs a weighted entity graph to guide data synthesis and applies deterministic encryption to PII entities, enabling LLMs to encode new knowledge through continual pretraining while granting authorized access to sensitive data through decryption keys. Our results on limited‑scale datasets demonstrate that our pretrained models outperform base models and ensure PII security, while exhibiting a modest performance gap compared to models trained on unencrypted synthetic data. We further show that increasing the number of entities and leveraging graph‑based synthesis improves model performance, and that encrypted models retain instruction‑following capabilities with long retrieved contexts. We discuss the security implications and limitations of deterministic encryption, positioning this work as an initial investigation into the design space of encrypted data pretraining for privacy‑preserving LLMs. Our code is available at https://github.com/DataArcTech/SoE.
Authors:Fuwen Luo, Zihao Wan, Ziyue Wang, Yaluo Liu, Pau Tong Lin Xu, Xuanjia Qiao, Xiaolong Wang, Peng Li, Yang Liu
Abstract:
Hieroglyphs, as logographic writing systems, encode rich semantic and cultural information within their internal structural composition. Yet, current advanced Large Language Models (LLMs) and Multimodal LLMs (MLLMs) usually remain structurally blind to this information. LLMs process characters as textual tokens, while MLLMs additionally view them as raw pixel grids. Both fall short to model the underlying logic of character strokes. Furthermore, existing structural analysis methods are often script‑specific and labor‑intensive. In this paper, we propose Hieroglyphic Stroke Analyzer (HieroSA), a novel and generalizable framework that enables MLLMs to automatically derive stroke‑level structures from character bitmaps without handcrafted data. It transforms modern logographic and ancient hieroglyphs character images into explicit, interpretable line‑segment representations in a normalized coordinate space, allowing for cross‑lingual generalization. Extensive experiments demonstrate that HieroSA effectively captures character‑internal structures and semantics, bypassing the need for language‑specific priors. Experimental results highlight the potential of our work as a graphematics analysis tool for a deeper understanding of hieroglyphic scripts. View our code at https://github.com/THUNLP‑MT/HieroSA.
Authors:Tingwei Xie, Jinxin He, Yonghong Song
Abstract:
The efficacy of Multimodal Transformers in visually‑rich document understanding (VrDU) is critically constrained by two inherent limitations: the lack of explicit modeling for logical reading order and the interference of visual tokens that dilutes attention on textual semantics. To address these challenges, this paper presents ROAP, a lightweight and architecture‑agnostic pipeline designed to optimize attention distributions in Layout Transformers without altering their pre‑trained backbones. The proposed pipeline first employs an Adaptive‑XY‑Gap (AXG‑Tree) to robustly extract hierarchical reading sequences from complex layouts. These sequences are then integrated into the attention mechanism via a Reading‑Order‑Aware Relative Position Bias (RO‑RPB). Furthermore, a Textual‑Token Sub‑block Attention Prior (TT‑Prior) is introduced to adaptively suppress visual noise and enhance fine‑grained text‑text interactions. Extensive experiments on the FUNSD and CORD benchmarks demonstrate that ROAP consistently improves the performance of representative backbones, including LayoutLMv3 and GeoLayoutLM. These findings confirm that explicitly modeling reading logic and regulating modality interference are critical for robust document understanding, offering a scalable solution for complex layout analysis. The implementation code will be released at https://github.com/KevinYuLei/ROAP.
Authors:Marko Sterbentz, Kevin Cushing, Cameron Barrie, Kristian J. Hammond
Abstract:
Recent advances in text‑to‑SQL systems have been driven by larger models and improved datasets, yet progress is still limited by the scarcity of high‑quality training data. Manual data creation is expensive, and existing synthetic methods trade off reliability and scalability. Template‑based approaches ensure correct SQL but require schema‑specific templates, while LLM‑based generation scales easily but lacks quality and correctness guarantees. We introduce RingSQL, a hybrid data generation framework that combines schema‑independent query templates with LLM‑based paraphrasing of natural language questions. This approach preserves SQL correctness across diverse schemas while providing broad linguistic variety. In our experiments, we find that models trained using data produced by RingSQL achieve an average gain in accuracy of +2.3% across six text‑to‑SQL benchmarks when compared to models trained on other synthetic data. We make our code available at https://github.com/nu‑c3lab/RingSQL.
Authors:Jan Černý, Ivana Kvapilíková, Silvie Cinková
Abstract:
This work investigates how measuring information entropy of text can be used to estimate its readability. We propose a visualization framework that can be used to approximate information entropy of text using multiple language models and visualize the result. The end goal is to use this method to estimate and improve readability and clarity of administrative or bureaucratic texts. Our toolset is available as a libre software on https://github.com/ufal/Glitter.
Authors:Zhiwei Liu, Yupen Cao, Yuechen Jiang, Mohsinul Kabir, Polydoros Giannouris, Chen Xu, Ziyang Xu, Tianlei Zhu, Tariquzzaman Faisal, Triantafillos Papadopoulos, Yan Wang, Lingfei Qian, Xueqing Peng, Zhuohan Xie, Ye Yuan, Saeed Almheiri, Abdulrazzaq Alnajjar, Mingbin Chen, Harry Stuart, Paul Thompson, Prayag Tiwari, Alejandro Lopez-Lira, Xue Liu, Jimin Huang, Sophia Ananiadou
Abstract:
Large language models (LLMs) have been widely applied across various domains of finance. Since their training data are largely derived from human‑authored corpora, LLMs may inherit a range of human biases. Behavioral biases can lead to instability and uncertainty in decision‑making, particularly when processing financial information. However, existing research on LLM bias has mainly focused on direct questioning or simplified, general‑purpose settings, with limited consideration of the complex real‑world financial environments and high‑risk, context‑sensitive, multilingual financial misinformation detection tasks (\mfmd). In this work, we propose \mfmdscen, a comprehensive benchmark for evaluating behavioral biases of LLMs in \mfmd across diverse economic scenarios. In collaboration with financial experts, we construct three types of complex financial scenarios: (i) role‑ and personality‑based, (ii) role‑ and region‑based, and (iii) role‑based scenarios incorporating ethnicity and religious beliefs. We further develop a multilingual financial misinformation dataset covering English, Chinese, Greek, and Bengali. By integrating these scenarios with misinformation claims, \mfmdscen enables a systematic evaluation of 22 mainstream LLMs. Our findings reveal that pronounced behavioral biases persist across both commercial and open‑source models. This project will be available at https://github.com/lzw108/FMD.
Authors:Tassallah Abdullahi, Shrestha Ghosh, Hamish S Fraser, Daniel León Tramontini, Adeel Abbasi, Ghada Bourjeily, Carsten Eickhoff, Ritambhara Singh
Abstract:
Persona conditioning can be viewed as a behavioral prior for large language models (LLMs) and is often assumed to confer expertise and improve safety in a monotonic manner. However, its effects on high‑stakes clinical decision‑making remain poorly characterized. We systematically evaluate persona‑based control in clinical LLMs, examining how professional roles (e.g., Emergency Department physician, nurse) and interaction styles (bold vs.\ cautious) influence behavior across models and medical tasks. We assess performance on clinical triage and patient‑safety tasks using multidimensional evaluations that capture task accuracy, calibration, and safety‑relevant risk behavior. We find systematic, context‑dependent, and non‑monotonic effects: Medical personas improve performance in critical care tasks, yielding gains of up to ~+20% in accuracy and calibration, but degrade performance in primary‑care settings by comparable margins. Interaction style modulates risk propensity and sensitivity, but it's highly model‑dependent. While aggregated LLM‑judge rankings favor medical over non‑medical personas in safety‑critical cases, we found that human clinicians show moderate agreement on safety compliance (average Cohen's κ= 0.43) but indicate a low confidence in 95.9% of their responses on reasoning quality. Our work shows that personas function as behavioral priors that introduce context‑dependent trade‑offs rather than guarantees of safety or expertise. The code is available at https://github.com/rsinghlab/Persona\_Paradox.
Authors:Susmit Das
Abstract:
Reasoning oriented large language models often expose explicit "thinking" as long, turn‑global traces at the start of every response, either always on or toggled externally at inference time. While useful for arithmetic, programming, and problem solving, this design is costly, blurs claim level auditability, and cannot re‑trigger explicit reasoning once the model begins presenting. Dialogue models are also largely blind to temporal structure, treating replies after seconds and replies after weeks as equivalent unless time is stated in text. We introduce TIME, the Temporally Intelligent Meta‑reasoning Engine, a behavioral alignment framework that treats explicit reasoning as a context sensitive resource driven by discourse and temporal cues. TIME augments dialogue with optional ISO 8601 <time> tags, tick turns that represent silent gaps, and short <think> blocks that can appear anywhere in a reply. A four‑phase curriculum including a small, maximally diverse full‑batch alignment step trains Qwen3 dense models to invoke brief, in‑place reasoning bursts and keep user facing text compact. We evaluate with TIMEBench, a temporally grounded dialogue benchmark probing chronology, commonsense under gaps and offsets, anomaly detection, and continuity. Across 4B to 32B scales, TIME improves TIMEBench scores over base Qwen3 in both thinking and no‑thinking modes while reducing reasoning tokens by about an order of magnitude. Our training data and code are available at https://github.com/The‑Coherence‑Initiative/TIME and TIMEBench is available at https://github.com/The‑Coherence‑Initiative/TIMEBench
Authors:Changxu Duan, Zhiyin Tan
Abstract:
Understanding the role of citations is essential for research assessment and citation‑aware digital libraries. However, existing citation classification frameworks often conflate citation intent (why a work is cited) with cited content type (what part is cited), limiting their effectiveness in auto classification due to a dilemma between fine‑grained type distinctions and practical classification reliability. We introduce SOFT, a Semantically Orthogonal Framework with Two dimensions that explicitly separates citation intent from cited content type, drawing inspiration from semantic role theory. We systematically re‑annotate the ACL‑ARC dataset using SOFT and release a cross‑disciplinary test set sampled from ACT2. Evaluation with both zero‑shot and fine‑tuned Large Language Models demonstrates that SOFT enables higher agreement between human annotators and LLMs, and supports stronger classification performance and robust cross‑domain generalization compared to ACL‑ARC and SciCite annotation frameworks. These results confirm SOFT's value as a clear, reusable annotation standard, improving clarity, consistency, and generalizability for digital libraries and scholarly communication infrastructures. All code and data are publicly available on GitHub https://github.com/zhiyintan/SOFT.
Authors:Zhiyin Tan, Changxu Duan
Abstract:
Identifying suitable datasets for a research question remains challenging because existing dataset search engines rely heavily on metadata quality and keyword overlap, which often fail to capture the semantic intent of scientific investigation. We introduce a literature‑driven framework that discovers datasets from citation contexts in scientific papers, enabling retrieval grounded in actual research use rather than metadata availability. Our approach combines large‑scale citation‑context extraction, schema‑guided dataset recognition with Large Language Models, and provenance‑preserving entity resolution. We evaluate the system on eight survey‑derived computer science queries and find that it achieves substantially higher recall than Google Dataset Search and DataCite Commons, with normalized recall ranging from an average of 47.47% to a highest value of 81.82%. Beyond recovering gold‑standard datasets, the method also surfaces additional datasets not documented in the surveys. Expert assessments across five top‑level Fields of Science indicate that a substantial portion of the additional datasets are considered high utility, and some are regarded as novel for the specific topics chosen by the experts. These findings establish citation‑context mining as an effective and generalizable paradigm for dataset discovery, particularly in settings where datasets lack sufficient or reliable metadata. To support reproducibility and future extensions, we release our code, evaluation datasets, and results on GitHub (https://github.com/Fireblossom/citation‑context‑dataset‑discovery).
Authors:Ziqi Zhao, Zhaochun Ren, Jiahong Zou, Liu Yang, Zhiwei Xu, Xuri Ge, Zhumin Chen, Xinyu Ma, Daiting Shi, Shuaiqiang Wang, Dawei Yin, Xin Xin
Abstract:
Reinforcement learning with verifiable rewards (RLVR) has proven effective in enhancing the reasoning of large language models (LLMs). Monte Carlo Tree Search (MCTS)‑based extensions improve upon vanilla RLVR (e.g., GRPO) by providing tree‑based reasoning rollouts that enable fine‑grained and segment‑level credit assignment. However, existing methods still suffer from limited exploration diversity and inefficient reasoning. To address the above challenges, we propose reinforced efficient reasoning via semantically diverse explorations, i.e., ROSE, for LLMs. To encourage more diverse reasoning exploration, our method incorporates a semantic‑entropy‑based branching strategy and an \varepsilon‑exploration mechanism. The former operates on already sampled reasoning rollouts to capture semantic uncertainty and select branching points with high semantic divergence to generate new successive reasoning paths, whereas the latter stochastically initiates reasoning rollouts from the root, preventing the search process from becoming overly local. To improve efficiency, we design a length‑aware segment‑level advantage estimator that rewards concise and correct reasoning while penalizing unnecessarily long reasoning chains. Extensive experiments on various mathematical reasoning benchmarks with Qwen and Llama models validate the effectiveness and efficiency of ROSE. Codes are available at https://github.com/ZiqiZhao1/ROSE‑rl.
Authors:Jianbo Li, Yi Jiang, Sendong Zhao, Bairui Hu, Haochun Wang, Bing Qin
Abstract:
Retrieval‑Augmented Generation (RAG) helps LLMs stay accurate, but feeding long documents into a prompt makes the model slow and expensive. This has motivated context compression, ranging from token pruning and summarization to embedding‑based compression. While researchers have tried ''compressing'' these documents into smaller summaries or mathematical embeddings, there is a catch: the more you compress the data, the more the LLM struggles to understand it. To address this challenge, we propose ArcAligner (Adaptive recursive context Aligner), a lightweight module integrated into the language model layers to help the model better utilize highly compressed context representations for downstream generation. It uses an adaptive ''gating'' system that only adds extra processing power when the information is complex, keeping the system fast. Across knowledge‑intensive QA benchmarks, ArcAligner consistently beats compression baselines at comparable compression rates, especially on multi‑hop and long‑tail settings. The source code is publicly available.
Authors:Xueyun Tian, Minghua Ma, Bingbing Xu, Nuoyan Lyu, Wei Li, Heng Dong, Zheng Chu, Yuanzhuo Wang, Huawei Shen
Abstract:
Supervised fine‑tuning (SFT) on chain‑of‑thought (CoT) trajectories demonstrations is a common approach for enabling reasoning in large language models. Standard practices typically only retain trajectories with correct final answers (positives) while ignoring the rest (negatives). We argue that this paradigm discards substantial supervision and exacerbates overfitting, limiting out‑of‑domain (OOD) generalization. Specifically, we surprisingly find that incorporating negative trajectories into SFT yields substantial OOD generalization gains over positive‑only training, as these trajectories often retain valid intermediate reasoning despite incorrect final answers. To understand this effect in depth, we systematically analyze data, training dynamics, and inference behavior, identifying 22 recurring patterns in negative chains that serve a dual role: they moderate loss descent to mitigate overfitting during training and boost policy entropy by 35.67% during inference to facilitate exploration. Motivated by these observations, we further propose Gain‑based LOss Weighting (GLOW), an adaptive, sample‑aware scheme that exploits such distinctive training dynamics by rescaling per‑sample loss based on inter‑epoch progress. Empirically, GLOW efficiently leverages unfiltered trajectories, yielding a 5.51% OOD gain over positive‑only SFT on Qwen2.5‑7B and boosting MMLU from 72.82% to 76.47% as an RL initialization.
Authors:Ao Sun, Xiaoyu Wang, Zhe Tan, Yu Li, Jiachen Zhu, Shu Su, Yuheng Jia
Abstract:
As Large Language Models (LLMs) serve a global audience, alignment must transition from enforcing universal consensus to respecting cultural pluralism. We demonstrate that dense models, when forced to fit conflicting value distributions, suffer from Mean Collapse, converging to a generic average that fails to represent diverse groups. We attribute this to Cultural Sparsity, where gradient interference prevents dense parameters from spanning distinct cultural modes. To resolve this, we propose \textscCuMA (Cultural Mixture of Adapters), a framework that frames alignment as a conditional capacity separation problem. By incorporating demographic‑aware routing, \textscCuMA internalizes a Latent Cultural Topology to explicitly disentangle conflicting gradients into specialized expert subspaces. Extensive evaluations on WorldValuesBench, Community Alignment, and PRISM demonstrate that \textscCuMA achieves state‑of‑the‑art performance, significantly outperforming both dense baselines and semantic‑only MoEs. Crucially, our analysis confirms that \textscCuMA effectively mitigates mean collapse, preserving cultural diversity. Our code is available at https://github.com/Throll/CuMA.
Authors:Mingyue Cheng, Daoyu Wang, Qi Liu, Shuo Yu, Xiaoyu Tao, Yuqian Wang, Chengzhong Chu, Yu Duan, Mingkang Long, Enhong Chen
Abstract:
Synthesizing informative commercial reports from massive and noisy web sources is critical for high‑stakes business decisions. Although current deep research agents achieve notable progress, their reports still remain limited in terms of quality, reliability, and coverage. In this work, we propose Mind2Report, a cognitive deep research agent that emulates the commercial analyst to synthesize expert‑level reports. Specifically, it first probes fine‑grained intent, then searches web sources and records distilled information on the fly, and subsequently iteratively synthesizes the report. We design Mind2Report as a training‑free agentic workflow that augments general large language models (LLMs) with dynamic memory to support these long‑form cognitive processes. To rigorously evaluate Mind2Report, we further construct QRC‑Eval comprising 200 real‑world commercial tasks and establish a holistic evaluation strategy to assess report quality, reliability, and coverage. Experiments demonstrate that Mind2Report outperforms leading baselines, including OpenAI and Gemini deep research agents. Although this is a preliminary study, we expect it to serve as a foundation for advancing the future design of commercial deep research agents. Our code and data are available at https://github.com/Melmaphother/Mind2Report.
Authors:Maxime Delmas, Lei Xu, André Freitas
Abstract:
Standard RAG pipelines based on chunking excel at simple factual retrieval but fail on complex multi‑hop queries due to a lack of structural connectivity. Conversely, initial strategies that interleave retrieval with reasoning often lack global corpus awareness, while Knowledge Graph (KG)‑based RAG performs strongly on complex multi‑hop tasks but suffers on fact‑oriented single‑hop queries. To bridge this gap, we propose a novel RAG framework: ToPG (Traversal over Proposition Graphs). ToPG models its knowledge base as a heterogeneous graph of propositions, entities, and passages, effectively combining the granular fact density of propositions with graph connectivity. We leverage this structure using iterative Suggestion‑Selection cycles, where the Suggestion phase enables a query‑aware traversal of the graph, and the Selection phase provides LLM feedback to prune irrelevant propositions and seed the next iteration. Evaluated on three distinct QA tasks (Simple, Complex, and Abstract QA), ToPG demonstrates strong performance across both accuracy‑ and quality‑based metrics. Overall, ToPG shows that query‑aware graph traversal combined with factual granularity is a critical component for efficient structured RAG systems. ToPG is available at https://github.com/idiap/ToPG.
Authors:Zhiwei Liu, Paul Thompson, Jiaqi Rong, Baojie Qu, Runteng Guo, Min Peng, Qianqian Xie, Sophia Ananiadou
Abstract:
Online misinformation is increasingly pervasive, yet most existing benchmarks and methods evaluate veracity at the level of whole claims or paragraphs using coarse binary labels, obscuring how true and false details often co‑exist within single sentences. These simplifications also limit interpretability: global explanations cannot identify which specific segments are misleading or differentiate how a detail is false (e.g., distorted vs. fabricated). To address these gaps, we introduce MisSpans, the first multi‑domain, human‑annotated benchmark for span‑level misinformation detection and analysis, consisting of paired real and fake news stories. MisSpans defines three complementary tasks: MisSpansIdentity for pinpointing false spans within sentences, MisSpansType for categorising false spans by misinformation type, and MisSpansExplanation for providing rationales grounded in identified spans. Together, these tasks enable fine‑grained localisation, nuanced characterisation beyond true/false and actionable explanations. Expert annotators were guided by standardised guidelines and consistency checks, leading to high inter‑annotator agreement. We evaluate 15 representative LLMs, including reasoning‑enhanced and non‑reasoning variants, under zero‑shot and one‑shot settings. Results reveal the challenging nature of fine‑grained misinformation identification and analysis, and highlight the need for a deeper understanding of how performance may be influenced by multiple interacting factors, including model size and reasoning capabilities, along with domain‑specific textual features. This project will be available at https://github.com/lzw108/MisSpans.
Authors:Zhiwei Liu, Runteng Guo, Baojie Qu, Yuechen Jiang, Min Peng, Qianqian Xie, Sophia Ananiadou
Abstract:
Cross‑domain misinformation detection is challenging, as misinformation arises across domains with substantial differences in knowledge and discourse. Existing methods often rely on single‑perspective cues and struggle to generalize to challenging or underrepresented domains, while reasoning large language models (LLMs), though effective on complex tasks, are limited to same‑distribution data. To address these gaps, we introduce RAAR, the first retrieval‑augmented agentic reasoning framework for cross‑domain misinformation detection. To enable cross‑domain transfer beyond same‑distribution assumptions, RAAR retrieves multi‑perspective source‑domain evidence aligned with each target sample's semantics, sentiment, and writing style. To overcome single‑perspective modeling and missing systematic reasoning, RAAR constructs verifiable multi‑step reasoning paths through specialized multi‑agent collaboration, where perspective‑specific agents produce complementary analyses and a summary agent integrates them under verifier guidance. RAAR further applies supervised fine‑tuning and reinforcement learning to train a single multi‑task verifier to enhance verification and reasoning capabilities. Based on RAAR, we trained the RAAR‑8b and RAAR‑14b models. Evaluation on three cross‑domain misinformation detection tasks shows that RAAR substantially enhances the capabilities of the base models and outperforms other cross‑domain methods, advanced LLMs, and LLM‑based adaptation approaches. The project will be released at https://github.com/lzw108/RAAR.
Authors:Zefang Zong, Dingwei Chen, Yang Li, Qi Yi, Bo Zhou, Chengming Li, Bo Qian, Peng Chen, Jie Jiang
Abstract:
LLM agents have emerged as powerful systems for tackling multi‑turn tasks by interleaving internal reasoning and external tool interactions. Agentic Reinforcement Learning has recently drawn significant research attention as a critical post‑training paradigm to further refine these capabilities. In this paper, we present AT^2PO (Agentic Turn‑based Policy Optimization via Tree Search), a unified framework for multi‑turn agentic RL that addresses three core challenges: limited exploration diversity, sparse credit assignment, and misaligned policy optimization. AT^2PO introduces a turn‑level tree structure that jointly enables Entropy‑Guided Tree Expansion for strategic exploration and Turn‑wise Credit Assignment for fine‑grained reward propagation from sparse outcomes. Complementing this, we propose Agentic Turn‑based Policy Optimization, a turn‑level learning objective that aligns policy updates with the natural decision granularity of agentic interactions. ATPO is orthogonal to tree search and can be readily integrated into any multi‑turn RL pipeline. Experiments across seven benchmarks demonstrate consistent improvements over the state‑of‑the‑art baseline by up to 1.84 percentage points in average, with ablation studies validating the effectiveness of each component. Our code is available at https://github.com/zzfoutofspace/ATPO.
Authors:Yehoon Jang, Chaewon Lee, Hyun-seok Min, Sungchul Choi
Abstract:
The Patent Trial and Appeal Board (PTAB) of the USPTO adjudicates thousands of ex parte appeals each year, requiring the integration of technical understanding and legal reasoning. While large language models (LLMs) are increasingly applied in patent and legal practice, their use has remained limited to lightweight tasks, with no established means of systematically evaluating their capacity for structured legal reasoning in the patent domain. In this work, we introduce PILOT‑Bench, the first PTAB‑centric benchmark that aligns PTAB decisions with USPTO patent data at the case‑level and formalizes three IRAC‑aligned classification tasks: Issue Type, Board Authorities, and Subdecision. We evaluate a diverse set of closed‑source (commercial) and open‑source LLMs and conduct analyses across multiple perspectives, including input‑variation settings, model families, and error tendencies. Notably, on the Issue Type task, closed‑source models consistently exceed 0.75 in Micro‑F1 score, whereas the strongest open‑source model (Qwen‑8B) achieves performance around 0.56, highlighting a substantial gap in reasoning capabilities. PILOT‑Bench establishes a foundation for the systematic evaluation of patent‑domain legal reasoning and points toward future directions for improving LLMs through dataset design and model alignment. All data, code, and benchmark resources are available at https://github.com/TeamLab/pilot‑bench.
Authors:Mizanur Rahman, Mohammed Saidul Islam, Md Tahmid Rahman Laskar, Shafiq Joty, Enamul Hoque
Abstract:
Text‑to‑Visualization (Text2Vis) systems translate natural language queries over tabular data into concise answers and executable visualizations. While closed‑source LLMs generate functional code, the resulting charts often lack semantic alignment and clarity, qualities that can only be assessed post‑execution. Open‑source models struggle even more, frequently producing non‑executable or visually poor outputs. Although supervised fine‑tuning can improve code executability, it fails to enhance overall visualization quality, as traditional SFT loss cannot capture post‑execution feedback. To address this gap, we propose RL‑Text2Vis, the first reinforcement learning framework for Text2Vis generation. Built on Group Relative Policy Optimization (GRPO), our method uses a novel multi‑objective reward that jointly optimizes textual accuracy, code validity, and visualization quality using post‑execution feedback. By training Qwen2.5 models (7B and 14B), RL‑Text2Vis achieves a 22% relative improvement in chart quality over GPT‑4o on the Text2Vis benchmark and boosts code execution success from 78% to 97% relative to its zero‑shot baseline. Our models significantly outperform strong zero‑shot and supervised baselines and also demonstrate robust generalization to out‑of‑domain datasets like VIS‑Eval and NVBench. These results establish GRPO as an effective strategy for structured, multimodal reasoning in visualization generation. We release our code at https://github.com/vis‑nlp/RL‑Text2Vis.
Authors:Paul Pu Liang
Abstract:
Our experience of the world is multisensory, spanning a synthesis of language, sight, sound, touch, taste, and smell. Yet, artificial intelligence has primarily advanced in digital modalities like text, vision, and audio. This paper outlines a research vision for multisensory artificial intelligence over the next decade. This new set of technologies can change how humans and AI experience and interact with one another, by connecting AI to the human senses and a rich spectrum of signals from physiological and tactile cues on the body, to physical and social signals in homes, cities, and the environment. We outline how this field must advance through three interrelated themes of sensing, science, and synergy. Firstly, research in sensing should extend how AI captures the world in richer ways beyond the digital medium. Secondly, developing a principled science for quantifying multimodal heterogeneity and interactions, developing unified modeling architectures and representations, and understanding cross‑modal transfer. Finally, we present new technical challenges to learn synergy between modalities and between humans and AI, covering multisensory integration, alignment, reasoning, generation, generalization, and experience. Accompanying this vision paper are a series of projects, resources, and demos of latest advances from the Multisensory Intelligence group at the MIT Media Lab, see https://mit‑mi.github.io/.
Authors:Tianle Wang, Zhongyuan Wu, Shenghao Jin, Hao Xu, Wei Chen, Ning Miao
Abstract:
Reinforcement learning with verifiable rewards (RLVR) has become a central component of large language model (LLM) post‑training. Unlike supervised fine‑tuning (SFT), RLVR lets an LLM generate multiple candidate solutions and reinforces those that lead to a verifiably correct final answer. However, in practice, RLVR often requires thousands of training steps to reach strong performance, incurring substantial computation largely attributed to prolonged exploration. In this work, we make a surprising observation: during RLVR, LLMs evolve in a strongly linear manner. Specifically, both model weights and model output log‑probabilities exhibit strong linear correlations with RL training steps. This suggests that RLVR predominantly amplifies trends that emerge early in training, rather than continuously discovering new behaviors throughout the entire optimization trajectory. Motivated by this linearity, we investigate whether future model states can be predicted from intermediate checkpoints via extrapolation, avoiding continued expensive training. We show that Weight Extrapolation produces models with performance comparable to standard RL training while requiring significantly less computation. Moreover, Logits Extrapolation consistently outperforms continued RL training on mathematics and code benchmarks by extrapolating beyond the step range where RL training remains stable. Our code is available at https://github.com/Miaow‑Lab/RLVR‑Linearity
Authors:Yibo Zhao, Jiapeng Zhu, Zichen Ding, Xiang Li
Abstract:
Retrieval‑Augmented Generation (RAG) integrates external knowledge to enhance Large Language Models (LLMs), yet systems remain susceptible to two critical flaws: providing correct answers without explicit grounded evidence and producing fabricated responses when the retrieved context is insufficient. While prior research has addressed these issues independently, a unified framework that integrates evidence‑based grounding and reliable abstention is currently lacking. In this paper, we propose GRACE, a reinforcement‑learning framework that simultaneously mitigates both types of flaws. GRACE employs a data construction method that utilizes heterogeneous retrievers to generate diverse training samples without manual annotation. A multi‑stage gated reward function is then employed to train the model to assess evidence sufficiency, extract key supporting evidence, and provide answers or explicitly abstain. Experimental results on two benchmarks demonstrate that GRACE achieves state‑of‑the‑art overall accuracy and strikes a favorable balance between accurate response and rejection, while requiring only 10% of the annotation costs of prior methods. Our code is available at https://github.com/YiboZhao624/Grace..
Authors:James Brock, Ce Zhang, Nantheera Anantrasirichai
Abstract:
Modern forest monitoring workflows increasingly benefit from the growing availability of high‑resolution satellite imagery and advances in deep learning. Two persistent challenges in this context are accurate pixel‑level change detection and meaningful semantic change captioning for complex forest dynamics. While large language models (LLMs) are being adapted for interactive data exploration, their integration with vision‑language models (VLMs) for remote sensing image change interpretation (RSICI) remains underexplored. To address this gap, we introduce an LLM‑driven agent for integrated forest change analysis that supports natural language querying across multiple RSICI tasks. The proposed system builds upon a multi‑level change interpretation (MCI) vision‑language backbone with LLM‑based orchestration. To facilitate adaptation and evaluation in forest environments, we further introduce the Forest‑Change dataset, which comprises bi‑temporal satellite imagery, pixel‑level change masks, and multi‑granularity semantic change captions generated using a combination of human annotation and rule‑based methods. Experimental results show that the proposed system achieves mIoU and BLEU‑4 scores of 67.10% and 40.17% on the Forest‑Change dataset, and 88.13% and 34.41% on LEVIR‑MCI‑Trees, a tree‑focused subset of LEVIR‑MCI benchmark for joint change detection and captioning. These results highlight the potential of interactive, LLM‑driven RSICI systems to improve accessibility, interpretability, and efficiency of forest change analysis. All data and code are publicly available at https://github.com/JamesBrockUoB/ForestChat.
Authors:Iaroslav Chelombitko, Ekaterina Chelombitko, Aleksey Komissarov
Abstract:
The quality of subword tokenization is critical for Large Language Models, yet evaluating tokenizers for morphologically rich Uralic languages is hampered by the lack of clean morpheme lexicons. We introduce SampoNLP, a corpus‑free toolkit for morphological lexicon creation using MDL‑inspired Self‑Referential Atomicity Scoring, which filters composite forms through internal structural cues ‑ suited for low‑resource settings. Using the high‑purity lexicons generated by SampoNLP for Finnish, Hungarian, and Estonian, we conduct a systematic evaluation of BPE tokenizers across a range of vocabulary sizes (8k‑256k). We propose a unified metric, the Integrated Performance Score (IPS), to navigate the trade‑off between morpheme coverage and over‑splitting. By analyzing the IPS curves, we identify the "elbow points" of diminishing returns and provide the first empirically grounded recommendations for optimal vocabulary sizes (k) in these languages. Our study not only offers practical guidance but also quantitatively demonstrates the limitations of standard BPE for highly agglutinative languages. The SampoNLP library and all generated resources are made publicly available: https://github.com/AragonerUA/SampoNLP
Authors:Yao Dou, Wei Xu
Abstract:
Large language models (LLMs) now support contexts of up to 1M tokens, but their effectiveness on complex long‑context tasks remains unclear. In this paper, we study multi‑document legal case summarization, where a single case often spans many documents totaling 100K‑500K tokens. We introduce Gavel‑Ref, a reference‑based evaluation framework with multi‑value checklist evaluation over 26 items, as well as residual fact and writing‑style evaluations. Using Gavel‑Ref, we go beyond the single aggregate scores reported in prior work and systematically evaluate 12 frontier LLMs on 100 legal cases ranging from 32K to 512K tokens, primarily from 2025. Our results show that even the strongest model, Gemini 2.5 Pro, achieves only around 50 of S_\textGavel‑Ref, highlighting the difficulty of the task. Models perform well on simple checklist items (e.g., filing date) but struggle on multi‑value or rare ones such as settlements and monitor reports. As LLMs continue to improve and may surpass human‑written summaries ‑‑ making human references less reliable ‑‑ we develop Gavel‑Agent, an efficient and autonomous agent scaffold that equips LLMs with six tools to navigate and extract checklists directly from case documents. With Qwen3, Gavel‑Agent reduces token usage by 36% while resulting in only a 7% drop in S_\textchecklist compared to end‑to‑end extraction with GPT‑4.1.
Authors:Sharanya Dasgupta, Arkaprabha Basu, Sujoy Nath, Swagatam Das
Abstract:
Human cognition, driven by complex neurochemical processes, oscillates between imagination and reality and learns to self‑correct whenever such subtle drifts lead to hallucinations or unsafe associations. In recent years, LLMs have demonstrated remarkable performance in a wide range of tasks. However, they still lack human cognition to balance factuality and safety. Bearing the resemblance, we argue that both factual and safety failures in LLMs arise from a representational misalignment in their latent activation space, rather than addressing those as entirely separate alignment issues. We hypothesize that an external network, trained to understand the fluctuations, can selectively intervene in the model to regulate falsehood into truthfulness and unsafe output into safe output without fine‑tuning the model parameters themselves. Reflecting the hypothesis, we propose ARREST (Adversarial Resilient Regulation Enhancing Safety and Truth), a unified framework that identifies and corrects drifted features, engaging both soft and hard refusals in addition to factual corrections. Our empirical results show that ARREST not only regulates misalignment but is also more versatile compared to the RLHF‑aligned models in generating soft refusals due to adversarial training. We make our codebase available at https://github.com/sharanya‑dasgupta001/ARREST.
Authors:Nikita Zmanovskii
Abstract:
We present Qwerty AI, an end‑to‑end system for automated age‑rating and content‑safety assessment of Russian‑language screenplays according to Federal Law No. 436‑FZ. The system processes full‑length scripts (up to 700 pages in under 2 minutes), segments them into narrative units, detects content violations across five categories (violence, sexual content, profanity, substances, frightening elements), and assigns age ratings (0+, 6+, 12+, 16+, 18+) with explainable justifications. Our implementation leverages a fine‑tuned Phi‑3‑mini model with 4‑bit quantization, achieving 80% rating accuracy and 80‑95% segmentation precision (format‑dependent). The system was developed under strict constraints: no external API calls, 80GB VRAM limit, and <5 minute processing time for average scripts. Deployed on Yandex Cloud with CUDA acceleration, Qwerty AI demonstrates practical applicability for production workflows. We achieved these results during the Wink hackathon (November 2025), where our solution addressed real editorial challenges in the Russian media industry.
Authors:Xueqing Wu, Zihan Xue, Da Yin, Shuyan Zhou, Kai-Wei Chang, Nanyun Peng, Yeming Wen
Abstract:
We present FronTalk, a benchmark for front‑end code generation that pioneers the study of a unique interaction dynamic: conversational code generation with multi‑modal feedback. In front‑end development, visual artifacts such as sketches, mockups and annotated creenshots are essential for conveying design intent, yet their role in multi‑turn code generation remains largely unexplored. To address this gap, we focus on the front‑end development task and curate FronTalk, a collection of 100 multi‑turn dialogues derived from real‑world websites across diverse domains such as news, finance, and art. Each turn features both a textual instruction and an equivalent visual instruction, each representing the same user intent. To comprehensively evaluate model performance, we propose a novel agent‑based evaluation framework leveraging a web agent to simulate users and explore the website, and thus measuring both functional correctness and user experience. Evaluation of 20 models reveals two key challenges that are under‑explored systematically in the literature: (1) a significant forgetting issue where models overwrite previously implemented features, resulting in task failures, and (2) a persistent challenge in interpreting visual feedback, especially for open‑source vision‑language models (VLMs). We propose a strong baseline to tackle the forgetting issue with AceCoder, a method that critiques the implementation of every past instruction using an autonomous web agent. This approach significantly reduces forgetting to nearly zero and improves the performance by up to 9.3% (56.0% to 65.3%). Overall, we aim to provide a solid foundation for future research in front‑end development and the general interaction dynamics of multi‑turn, multi‑modal code generation. Code and data are released at https://github.com/shirley‑wu/frontalk
Authors:Zihan Gao, Mohsin Y. K. Yousufi, Jacob Thebault-Spieker
Abstract:
Large language model (LLM) question‑answering systems often fail on community‑specific queries, creating "knowledge blind spots" that marginalize local voices and reinforce epistemic injustice. We present Collective Narrative Grounding, a participatory protocol that transforms community stories into structured narrative units and integrates them into AI systems under community governance. Learning from three participatory mapping workshops with N=24 community members, we designed elicitation methods and a schema that retain narrative richness while enabling entity, time, and place extraction, validation, and provenance control. To scope the problem, we audit a county‑level benchmark of 14,782 local information QA pairs, where factual gaps, cultural misunderstandings, geographic confusions, and temporal misalignments account for 76.7% of errors. On a participatory QA set derived from our workshops, a state‑of‑the‑art LLM answered fewer than 21% of questions correctly without added context, underscoring the need for local grounding. The missing facts often appear in the collected narratives, suggesting a direct path to closing the dominant error modes for narrative items. Beyond the protocol and pilot, we articulate key design tensions, such as representation and power, governance and control, and privacy and consent, providing concrete requirements for retrieval‑first, provenance‑visible, locally governed QA systems. Together, our taxonomy, protocol, and participatory evaluation offer a rigorous foundation for building community‑grounded AI that better answers local questions.
Authors:Xinyue Lou, Jinan Xu, Jingyi Yin, Xiaolong Wang, Zhaolu Kang, Youwei Liao, Yixuan Wang, Xiangyu Shi, Fengran Mo, Su Yao, Kaiyu Huang
Abstract:
As Multimodal Large Language Models (MLLMs) become an indispensable assistant in human life, the unsafe content generated by MLLMs poses a danger to human behavior, perpetually overhanging human society like a sword of Damocles. To investigate and evaluate the safety impact of MLLMs responses on human behavior in daily life, we introduce SaLAD, a multimodal safety benchmark which contains 2,013 real‑world image‑text samples across 10 common categories, with a balanced design covering both unsafe scenarios and cases of oversensitivity. It emphasizes realistic risk exposure, authentic visual inputs, and fine‑grained cross‑modal reasoning, ensuring that safety risks cannot be inferred from text alone. We further propose a safety‑warning‑based evaluation framework that encourages models to provide clear and informative safety warnings, rather than generic refusals. Results on 18 MLLMs demonstrate that the top‑performing models achieve a safe response rate of only 57.2% on unsafe queries. Moreover, even popular safety alignment methods limit effectiveness of the models in our scenario, revealing the vulnerabilities of current MLLMs in identifying dangerous behaviors in daily life. Our dataset is available at https://github.com/xinyuelou/SaLAD.
Authors:Changhao Jiang, Jiahao Chen, Zhenghao Xiang, Zhixiong Yang, Hanchen Wang, Jiabao Zhuang, Xinmeng Che, Jiajun Sun, Hui Li, Yifei Cao, Shihan Dou, Ming Zhang, Junjie Ye, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang
Abstract:
Recent commercial systems such as Suno demonstrate strong capabilities in long‑form song generation, while academic research remains largely non‑reproducible due to the lack of publicly available training data, hindering fair comparison and progress. To this end, we release a fully open‑source system for long‑form song generation with fine‑grained style conditioning, including a licensed synthetic dataset, training and evaluation pipelines, and Muse, an easy‑to‑deploy song generation model. The dataset consists of 116k fully licensed synthetic songs with automatically generated lyrics and style descriptions paired with audio synthesized by SunoV5. We train Muse via single‑stage supervised finetuning of a Qwen‑based language model extended with discrete audio tokens using MuCodec, without task‑specific losses, auxiliary objectives, or additional architectural components. Our evaluations find that although Muse is trained with a modest data scale and model size, it achieves competitive performance on phoneme error rate, text‑‑music style similarity, and audio aesthetic quality, while enabling controllable segment‑level generation across different musical structures. All data, model weights, and training and evaluation pipelines will be publicly released, paving the way for continued progress in controllable long‑form song generation research. The project repository is available at https://github.com/yuhui1038/Muse.
Authors:Wang Chen, Guanqiang Qi, Weikang Li, Yang Li, Deguo Xia, Jizhou Huang
Abstract:
Retrieval‑augmented generation (RAG) enhances large language models (LLMs) by incorporating external knowledge, but existing approaches indiscriminately trigger retrieval and rely on single‑path evidence construction, often introducing noise and limiting performance gains. In this work, we propose Decide Then Retrieve (DTR), a training‑free framework that adaptively determines when retrieval is necessary and how external information should be selected. DTR leverages generation uncertainty to guide retrieval triggering and introduces a dual‑path retrieval mechanism with adaptive information selection to better handle sparse and ambiguous queries. Extensive experiments across five open‑domain QA benchmarks, multiple model scales, and different retrievers demonstrate that DTR consistently improves EM and F1 over standard RAG and strong retrieval‑enhanced baselines, while reducing unnecessary retrievals. The code and data used in this paper are available at https://github.com/ChenWangHKU/DTR.
Authors:Chi Liu, Xin Chen
Abstract:
Group Relative Policy Optimization (GRPO) has emerged as a popular algorithm for reinforcement learning with large language models (LLMs). However, upon analyzing its clipping mechanism, we argue that it is suboptimal in certain scenarios. With appropriate modifications, GRPO can be significantly enhanced to improve both flexibility and generalization. To this end, we propose Adaptive‑Boundary‑Clipping GRPO (ABC‑GRPO), an asymmetric and adaptive refinement of the original GRPO framework. We demonstrate that ABC‑GRPO achieves superior performance over standard GRPO on mathematical reasoning tasks using the Qwen3 LLMs. Moreover, ABC‑GRPO maintains substantially higher entropy throughout training, thereby preserving the model's exploration capacity and mitigating premature convergence. The implementation code is available online to ease reproducibility https://github.com/chi2liu/ABC‑GRPO.
Authors:Fei Wu, Zhenrong Zhang, Qikai Chang, Jianshu Zhang, Quan Liu, Jun Du
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) elicits long chain‑of‑thought reasoning in large language models (LLMs), but outcome‑based rewards lead to coarse‑grained advantage estimation. While existing approaches improve RLVR via token‑level entropy or sequence‑level length control, they lack a semantically grounded, step‑level measure of reasoning progress. As a result, LLMs fail to distinguish necessary deduction from redundant verification: they may continue checking after reaching a correct solution and, in extreme cases, overturn a correct trajectory into an incorrect final answer. To remedy the lack of process supervision, we introduce a training‑free probing mechanism that extracts intermediate confidence and correctness and combines them into a Step Potential signal that explicitly estimates the reasoning state at each step. Building on this signal, we propose Step Potential Advantage Estimation (SPAE), a fine‑grained credit assignment method that amplifies potential gains, penalizes potential drops, and applies penalty after potential saturates to encourage timely termination. Experiments across multiple benchmarks show SPAE consistently improves accuracy while substantially reducing response length, outperforming strong RL baselines and recent efficient reasoning and token‑level advantage estimation methods. The code is available at https://github.com/cii030/SPAE‑RL.
Authors:Jin Wang, Liang Lin, Kaiwen Luo, Weiliu Wang, Yitian Chen, Moayad Aloqaily, Xuehai Tang, Zhenhong Zhou, Kun Wang, Li Sun, Qingsong Wen
Abstract:
While Audio Large Language Models (ALLMs) have achieved remarkable progress in understanding and generation, their potential privacy implications remain largely unexplored. This paper takes the first step to investigate whether ALLMs inadvertently leak user privacy solely through acoustic voiceprints and introduces HearSay, a comprehensive benchmark constructed from over 22,000 real‑world audio clips. To ensure data quality, the benchmark is meticulously curated through a rigorous pipeline involving automated profiling and human verification, guaranteeing that all privacy labels are grounded in factual records. Extensive experiments on HearSay yield three critical findings: Significant Privacy Leakage: ALLMs inherently extract private attributes from voiceprints, reaching 92.89% accuracy on gender and effectively profiling social attributes. Insufficient Safety Mechanisms: Alarmingly, existing safeguards are severely inadequate; most models fail to refuse privacy‑intruding requests, exhibiting near‑zero refusal rates for physiological traits. Reasoning Amplifies Risk: Chain‑of‑Thought (CoT) reasoning exacerbates privacy risks in capable models by uncovering deeper acoustic correlations. These findings expose critical vulnerabilities in ALLMs, underscoring the urgent need for targeted privacy alignment. The codes and dataset are available at https://github.com/JinWang79/HearSay_Benchmark
Authors:Jakob Schuster, Vagrant Gautam, Katja Markert
Abstract:
As large language models (LLMs) are more frequently used in retrieval‑augmented generation pipelines, it is increasingly relevant to study their behavior under knowledge conflicts. Thus far, the role of the source of the retrieved information has gone unexamined. We address this gap with a novel framework to investigate how source preferences affect LLM resolution of inter‑context knowledge conflicts in English, motivated by interdisciplinary research on credibility. With a comprehensive, tightly‑controlled evaluation of 13 open‑weight LLMs, we find that LLMs prefer institutionally‑corroborated information (e.g., government or newspaper sources) over information from people and social media. However, these source preferences can be reversed by simply repeating information from less credible sources. To mitigate repetition effects and maintain consistent preferences, we propose a novel method that reduces repetition bias by up to 99.8%, while also maintaining at least 88.8% of original preferences. We release all data and code to encourage future work on credibility and source preferences in knowledge‑intensive NLP.
Authors:Yunhao Liang, Ruixuan Ying, Bo Li, Hong Li, Kai Yan, Qingwen Li, Min Yang, Okamoto Satoshi, Zhe Cui, Shiwen Ni
Abstract:
DeepSeek‑OCR utilizes an optical 2D mapping approach to achieve high‑ratio vision‑text compression, claiming to decode text tokens exceeding ten times the input visual tokens. While this suggests a promising solution for the LLM long‑context bottleneck, we investigate a critical question: "Visual merit or linguistic crutch ‑ which drives DeepSeek‑OCR's performance?" By employing sentence‑level and word‑level semantic corruption, we isolate the model's intrinsic OCR capabilities from its language priors. Results demonstrate that without linguistic support, DeepSeek‑OCR's performance plummets from approximately 90% to 20%. Comparative benchmarking against 13 baseline models reveals that traditional pipeline OCR methods exhibit significantly higher robustness to such semantic perturbations than end‑to‑end methods. Furthermore, we find that lower visual token counts correlate with increased reliance on priors, exacerbating hallucination risks. Context stress testing also reveals a total model collapse around 10,000 text tokens, suggesting that current optical compression techniques may paradoxically aggravate the long‑context bottleneck. This study empirically defines DeepSeek‑OCR's capability boundaries and offers essential insights for future optimizations of the vision‑text compression paradigm. We release all data, results and scripts used in this study at https://github.com/dududuck00/DeepSeekOCR.
Authors:Quy-Anh Dang, Chris Ngo, Truong-Son Hy
Abstract:
As large language models (LLMs) become integral to safety‑critical applications, ensuring their robustness against adversarial prompts is paramount. However, existing red teaming datasets suffer from inconsistent risk categorizations, limited domain coverage, and outdated evaluations, hindering systematic vulnerability assessments. To address these challenges, we introduce RedBench, a universal dataset aggregating 37 benchmark datasets from leading conferences and repositories, comprising 29,362 samples across attack and refusal prompts. RedBench employs a standardized taxonomy with 22 risk categories and 19 domains, enabling consistent and comprehensive evaluations of LLM vulnerabilities. We provide a detailed analysis of existing datasets, establish baselines for modern LLMs, and open‑source the dataset and evaluation code. Our contributions facilitate robust comparisons, foster future research, and promote the development of secure and reliable LLMs for real‑world deployment. Code: https://github.com/knoveleng/redeval
Authors:Yifan Wei, Li Du, Xiaoyan Yu, Yang Feng, Angsheng Li
Abstract:
Large Language Models (LLMs) and agent‑based systems often struggle with compositional generalization due to a data bottleneck in which complex skill combinations follow a long‑tailed, power‑law distribution, limiting both instruction‑following performance and generalization in agent‑centric tasks. To address this challenge, we propose STEPS, a Skill Taxonomy guided Entropy‑based Post‑training data Synthesis framework for generating compositionally challenging data. STEPS explicitly targets compositional generalization by uncovering latent relationships among skills and organizing them into an interpretable, hierarchical skill taxonomy using structural information theory. Building on this taxonomy, we formulate data synthesis as a constrained information maximization problem, selecting skill combinations that maximize marginal structural information within the hierarchy while preserving semantic coherence. Experiments on challenging instruction‑following benchmarks show that STEPS outperforms existing data synthesis baselines, while also yielding improved compositional generalization in downstream agent‑based evaluations.
Authors:Zhitong Chen, Kai Yin, Xiangjue Dong, Chengkai Liu, Xiangpeng Li, Yiming Xiao, Bo Li, Junwei Ma, Ali Mostafavi, James Caverlee
Abstract:
Accurate question answering (QA) in disaster management requires reasoning over uncertain and conflicting information, a setting poorly captured by existing benchmarks built on clean evidence. We introduce DisastQA, a large‑scale benchmark of 3,000 rigorously verified questions (2,000 multiple‑choice and 1,000 open‑ended) spanning eight disaster types. The benchmark is constructed via a human‑LLM collaboration pipeline with stratified sampling to ensure balanced coverage. Models are evaluated under varying evidence conditions, from closed‑book to noisy evidence integration, enabling separation of internal knowledge from reasoning under imperfect information. For open‑ended QA, we propose a human‑verified keypoint‑based evaluation protocol emphasizing factual completeness over verbosity. Experiments with 20 models reveal substantial divergences from general‑purpose leaderboards such as MMLU‑Pro. While recent open‑weight models approach proprietary systems in clean settings, performance degrades sharply under realistic noise, exposing critical reliability gaps for disaster response. All code, data, and evaluation resources are available at https://github.com/TamuChen18/DisastQA_open.
Authors:Bohao Chu, Qianli Wang, Hendrik Damm, Hui Wang, Ula Muhabbek, Elisabeth Livingstone, Christoph M. Friedrich, Norbert Fuhr
Abstract:
How can system‑generated responses be efficiently verified, especially in the high‑stakes biomedical domain? To address this challenge, we introduce eTracer, a plug‑and‑play framework that enables traceable text generation by grounding claims against contextual evidence. Through post‑hoc grounding, each response claim is aligned with contextual evidence that either supports or contradicts it. Building on claim‑level grounding results, eTracer not only enables users to precisely trace responses back to their contextual source but also quantifies response faithfulness, thereby enabling the verifiability and trustworthiness of generated responses. Experiments show that our claim‑level grounding approach alleviates the limitations of conventional grounding methods in aligning generated statements with contextual sentence‑level evidence, resulting in substantial improvements in overall grounding quality and user verification efficiency. The code and data are available at https://github.com/chubohao/eTracer.
Authors:Zheng Wu, Xingyu Lou, Xinbei Ma, Yansi Li, Weiwen Liu, Weinan Zhang, Jun Wang, Zhuosheng Zhang
Abstract:
Large Language Model (LLM)‑based agents significantly extend the utility of LLMs by interacting with dynamic environments. However, enabling agents to continually learn new tasks without catastrophic forgetting remains a critical challenge, known as the stability‑plasticity dilemma. In this work, we argue that this dilemma fundamentally arises from the failure to explicitly distinguish between common knowledge shared across tasks and conflicting knowledge introduced by task‑specific interference. To address this, we propose Agent‑Dice, a parameter fusion framework based on directional consensus evaluation. Concretely, Agent‑Dice disentangles knowledge updates through a two‑stage process: geometric consensus filtering to prune conflicting gradients, and curvature‑based importance weighting to amplify shared semantics. We provide a rigorous theoretical analysis that establishes the validity of the proposed fusion scheme and offers insight into the origins of the stability‑plasticity dilemma. Extensive experiments on GUI agents and tool‑use agent domains demonstrate that Agent‑Dice exhibits outstanding continual learning performance with minimal computational overhead and parameter updates. The codes are available at https://github.com/Wuzheng02/Agent‑Dice.
Authors:Jean Seo, Gibaeg Kim, Kihun Shin, Seungseop Lim, Hyunkyung Lee, Wooseok Han, Jongwon Lee, Eunho Yang
Abstract:
We introduce EPAG, a benchmark dataset and framework designed for Evaluating the Pre‑consultation Ability of LLMs using diagnostic Guidelines. LLMs are evaluated directly through HPI‑diagnostic guideline comparison and indirectly through disease diagnosis. In our experiments, we observe that small open‑source models fine‑tuned with a well‑curated, task‑specific dataset can outperform frontier LLMs in pre‑consultation. Additionally, we find that increased amount of HPI (History of Present Illness) does not necessarily lead to improved diagnostic performance. Further experiments reveal that the language of pre‑consultation influences the characteristics of the dialogue. By open‑sourcing our dataset and evaluation pipeline on https://github.com/seemdog/EPAG, we aim to contribute to the evaluation and further development of LLM applications in real‑world clinical settings.
Authors:Guobin Tu, Di Weng
Abstract:
Sign Language Translation (SLT) is a complex cross‑modal task requiring the integration of Manual Signals (MS) and Non‑Manual Signals (NMS). While recent gloss‑free SLT methods have made strides in translating manual gestures, they frequently overlook the semantic criticality of facial expressions, resulting in ambiguity when distinct concepts share identical manual articulations. To address this, we present EASLT (Emotion‑Aware Sign Language Translation), a framework that treats facial affect not as auxiliary information, but as a robust semantic anchor. Unlike methods that relegate facial expressions to a secondary role, EASLT incorporates a dedicated emotional encoder to capture continuous affective dynamics. These representations are integrated via a novel Emotion‑Aware Fusion (EAF) module, which adaptively recalibrates spatio‑temporal sign features based on affective context to resolve semantic ambiguities. Extensive evaluations on the PHOENIX14T and CSL‑Daily benchmarks demonstrate that EASLT establishes advanced performance among gloss‑free methods, achieving BLEU‑4 scores of 26.15 and 22.80, and BLEURT scores of 61.0 and 57.8, respectively. Ablation studies confirm that explicitly modeling emotion effectively decouples affective semantics from manual dynamics, significantly enhancing translation fidelity. Code is available at https://github.com/TuGuobin/EASLT.
Authors:Ye Shen, Dun Pei, Yiqiu Guo, Junying Wang, Yijin Guo, Zicheng Zhang, Qi Jia, Jun Zhou, Guangtao Zhai
Abstract:
Despite recent advances in understanding and leveraging long‑range conversational memory, existing benchmarks still lack systematic evaluation of large language models(LLMs) across diverse memory dimensions, particularly in multi‑session settings. In this work, we propose EvolMem, a new benchmark for assessing multi‑session memory capabilities of LLMs and agent systems. EvolMem is grounded in cognitive psychology and encompasses both declarative and non‑declarative memory, further decomposed into multiple fine‑grained abilities. To construct the benchmark, we introduce a hybrid data synthesis framework that consists of topic‑initiated generation and narrative‑inspired transformations. This framework enables scalable generation of multi‑session conversations with controllable complexity, accompanied by sample‑specific evaluation guidelines. Extensive evaluation reveals that no LLM consistently outperforms others across all memory dimensions. Moreover, agent memory mechanisms do not necessarily enhance LLMs' capabilities and often exhibit notable efficiency limitations. Data and code will be released at https://github.com/shenye7436/EvolMem.
Authors:Xukai Liu, Ye Liu, Jipeng Zhang, Yanghai Zhang, Kai Zhang, Qi Liu
Abstract:
Large language models (LLMs) perform well on multi‑hop reasoning, yet how they internally compose multiple facts remains unclear. Recent work proposes \emphhop‑aligned circuit hypothesis, suggesting that bridge entities are computed sequentially across layers before later‑hop answers. Through systematic analyses on real‑world multi‑hop queries, we show that this hop‑aligned assumption does not generalize: later‑hop answer entities can become decodable earlier than bridge entities, a phenomenon we call \emphlayer‑order inversion, which strengthens with total hops. To explain this behavior, we propose a \emphprobabilistic recall‑and‑extract framework that models multi‑hop reasoning as broad probabilistic recall in shallow MLP layers followed by selective extraction in deeper attention layers. This framework is empirically validated through systematic probing analyses, reinterpreting prior layer‑wise decoding evidence, explaining chain‑of‑thought gains, and providing a mechanistic diagnosis of multi‑hop failures despite correct single‑hop knowledge. Code is available at https://github.com/laquabe/Layer‑Order‑Inversion.
Authors:Di Wu, Yanyan Zhao, Xin Lu, Mingzhe Li, Bing Qin
Abstract:
Defending against jailbreak attacks is crucial for the safe deployment of Large Language Models (LLMs). Recent research has attempted to improve safety by training models to reason over safety rules before responding. However, a key issue lies in determining what form of safety reasoning effectively defends against jailbreak attacks, which is difficult to explicitly design or directly obtain. To address this, we propose STAR‑S (Self‑TAught Reasoning based on Safety rules), a framework that integrates the learning of safety rule reasoning into a self‑taught loop. The core of STAR‑S involves eliciting reasoning and reflection guided by safety rules, then leveraging fine‑tuning to enhance safety reasoning. Repeating this process creates a synergistic cycle. Improvements in the model's reasoning and interpretation of safety rules allow it to produce better reasoning data under safety rule prompts, which is then utilized for further training. Experiments show that STAR‑S effectively defends against jailbreak attacks, outperforming baselines. Code is available at: https://github.com/pikepokenew/STAR_S.git.
Authors:Bohao Chu, Sameh Frihat, Tabea M. G. Pakull, Hendrik Damm, Meijie Li, Ula Muhabbek, Georg Lodde, Norbert Fuhr
Abstract:
Verifying system‑generated summaries remains challenging, as effective verification requires precise attribution to the source context, which is especially crucial in high‑stakes medical domains. To address this challenge, we introduce PCoA, an expert‑annotated benchmark for medical aspect‑based summarization with phrase‑level context attribution. PCoA aligns each aspect‑based summary with its supporting contextual sentences and contributory phrases within them. We further propose a fine‑grained, decoupled evaluation framework that independently assesses the quality of generated summaries, citations, and contributory phrases. Through extensive experiments, we validate the quality and consistency of the PCoA dataset and benchmark several large language models on the proposed task. Experimental results demonstrate that PCoA provides a reliable benchmark for evaluating system‑generated summaries with phrase‑level context attribution. Furthermore, comparative experiments show that explicitly identifying relevant sentences and contributory phrases before summarization can improve overall quality. The data and code are available at https://github.com/chubohao/PCoA.
Authors:Maan Qraitem, Kate Saenko, Bryan A. Plummer
Abstract:
Procedural content generation has enabled vast virtual worlds through levels, maps, and quests, but large‑scale character generation remains underexplored. We identify two alignment‑induced biases in existing methods: a positive moral bias, where characters uniformly adopt agreeable stances (e.g. always saying lying is bad), and a helpful assistant bias, where characters invariably answer questions directly (e.g. never refusing or deflecting). While such tendencies suit instruction‑following systems, they suppress dramatic tension and yield predictable characters, stemming from maximum likelihood training and assistant fine‑tuning. To address this, we introduce PersonaWeaver, a framework that disentangles world‑building (roles, demographics) from behavioral‑building (moral stances, interactional styles), yielding characters with more diverse reactions and moral stances, as well as second‑order diversity in stylistic markers like length, tone, and punctuation. Code: https://github.com/mqraitem/Persona‑Weaver
Authors:Bugra Kilictas, Faruk Alpay
Abstract:
The deployment of Large Language Models (LLMs) on edge devices is fundamentally constrained by the "Memory Wall" the bottleneck where data movement latency outstrips arithmetic throughput. Standard inference runtimes often incur significant overhead through high‑level abstractions, dynamic dispatch, and unaligned memory access patterns. In this work, we present a novel "Virtual Tensor Core" architecture implemented in software, optimized specifically for ARM64 microarchitectures (Apple Silicon). By bypassing standard library containers in favor of direct memory mapping (mmap) and implementing hand‑tuned NEON SIMD kernels, we achieve a form of "Software‑Defined Direct Memory Access (DMA)." Our proposed Tensor Virtualization Layout (TVL) guarantees 100% cache line utilization for weight matrices, while our zero‑copy loader eliminates initialization latency. Experimental results on a 110M parameter model demonstrate a stable throughput of >60 tokens/second on M2 hardware. While proprietary hardware accelerators (e.g., Apple AMX) can achieve higher peak throughput, our architecture provides a fully open, portable, and deterministic reference implementation for studying the memory bottleneck on general‑purpose ARM silicon, meeting the 200ms psycholinguistic latency threshold without opaque dependencies.
Authors:Gabriel Benedict, Matthew Butler, Naved Merchant, Eetu Salama-Laine
Abstract:
The emergence of Large Language Models (LLMs) has shifted language model evaluation toward reasoning and problem‑solving tasks as measures of general intelligence. Small Language Models (SLMs) ‑‑ defined here as models under 10B parameters ‑‑ typically score 3‑4 times lower than LLMs on these metrics. However, we demonstrate that these evaluations fail to capture SLMs' effectiveness in common industrial applications, such as tone modification tasks (e.g., funny, serious, professional). We propose an evaluation framework specifically designed to highlight SLMs' capabilities in non‑reasoning tasks where predefined evaluation datasets don't exist. Our framework combines novel approaches in data generation, prompt‑tuning, and LLM‑based evaluation to demonstrate the potential of task‑specific finetuning. This work provides practitioners with tools to effectively benchmark both SLMs and LLMs for practical applications, particularly in edge and private computing scenarios. Our implementation is available at: https://github.com/amazon‑science/wraval.
Authors:Juntong Ni, Shiyu Wang, Ming Jin, Qi He, Wei Jin
Abstract:
Spatio‑temporal reasoning in time series involves the explicit synthesis of temporal dynamics, spatial dependencies, and textual context. This capability is vital for high‑stakes decision‑making in systems such as traffic networks, power grids, and disease propagation. However, the field remains underdeveloped because most existing works prioritize predictive accuracy over reasoning. To address the gap, we introduce ST‑Bench, a benchmark consisting of four core tasks, including etiological reasoning, entity identification, correlation reasoning, and in‑context forecasting, developed via a network SDE‑based multi‑agent data synthesis pipeline. We then propose STReasoner, which empowers LLM to integrate time series, graph structure, and text for explicit reasoning. To promote spatially grounded logic, we introduce S‑GRPO, a reinforcement learning algorithm that rewards performance gains specifically attributable to spatial information. Experiments show that STReasoner achieves average accuracy gains between 17% and 135% at only 0.004X the cost of proprietary models and generalizes robustly to real‑world data.
Authors:Mohammad Zia Ur Rehman, Sai Kartheek Reddy Kasu, Shashivardhan Reddy Koppula, Sai Rithwik Reddy Chirra, Shwetank Shekhar Singh, Nagendra Kumar
Abstract:
Hate speech detection on social media faces challenges in both accuracy and explainability, especially for underexplored Indic languages. We propose a novel explainability‑guided training framework, X‑MuTeST (eXplainable Multilingual haTe Speech deTection), for hate speech detection that combines high‑level semantic reasoning from large language models (LLMs) with traditional attention‑enhancing techniques. We extend this research to Hindi and Telugu alongside English by providing benchmark human‑annotated rationales for each word to justify the assigned class label. The X‑MuTeST explainability method computes the difference between the prediction probabilities of the original text and those of unigrams, bigrams, and trigrams. Final explanations are computed as the union between LLM explanations and X‑MuTeST explanations. We show that leveraging human rationales during training enhances both classification performance and explainability. Moreover, combining human rationales with our explainability method to refine the model attention yields further improvements. We evaluate explainability using Plausibility metrics such as Token‑F1 and IOU‑F1 and Faithfulness metrics such as Comprehensiveness and Sufficiency. By focusing on under‑resourced languages, our work advances hate speech detection across diverse linguistic contexts. Our dataset includes token‑level rationale annotations for 6,004 Hindi, 4,492 Telugu, and 6,334 English samples. Data and code are available on https://github.com/ziarehman30/X‑MuTeST
Authors:Andrew Shin
Abstract:
Despite rapid advances in large language models (LLMs), achieving reliable performance on highly professional and structured examinations remains a significant challenge. The Japanese bar examination is a particularly demanding benchmark, requiring not only advanced legal reasoning but also strict adherence to complex answer formats that involve joint evaluation of multiple propositions. While recent studies have reported improvements by decomposing such questions into simpler true‑‑false judgments, these approaches have not been systematically evaluated under the original exam format and scoring scheme, leaving open the question of whether they truly capture exam‑level competence. In this paper, we present a self‑verification model trained on a newly constructed dataset that faithfully replicates the authentic format and evaluation scale of the exam. Our model is able to exceed the official passing score when evaluated on the actual exam scale, marking the first demonstration, to our knowledge, of an LLM passing the Japanese bar examination without altering its original question structure or scoring rules. We further conduct extensive comparisons with alternative strategies, including multi‑agent inference and decomposition‑based supervision, and find that these methods fail to achieve comparable performance. Our results highlight the importance of format‑faithful supervision and consistency verification, and suggest that carefully designed single‑model approaches can outperform more complex systems in high‑stakes professional reasoning tasks. Our dataset and codes are publicly available.
Authors:Vidhi Rathore, Sambu Aneesh, Himanshu Singh
Abstract:
Hallucinations can be produced by conversational AI systems, particularly in multi‑turn conversations where context changes and contradictions may eventually surface. By representing the entire conversation as a temporal graph, we present a novel graph‑based method for detecting dialogue‑level hallucinations. Our framework models each dialogue as a node, encoding it using a sentence transformer. We explore two different ways of connectivity: i) shared‑entity edges, which connect turns that refer to the same entities; ii) temporal edges, which connect contiguous turns in the conversation. Message‑passing is used to update the node embeddings, allowing flow of information between related nodes. The context‑aware node embeddings are then combined using attention pooling into a single vector, which is then passed on to a classifier to determine the presence and type of hallucinations. We demonstrate that our method offers slightly improved performance over existing methods. Further, we show the attention mechanism can be used to justify the decision making process. The code and model weights are made available at: https://github.com/sambuaneesh/anlp‑project.
Authors:Xinglang Zhang, Yunyao Zhang, ZeLiang Chen, Junqing Yu, Wei Yang, Zikai Song
Abstract:
Symbolic logical reasoning is a critical yet underexplored capability of large language models (LLMs), providing reliable and verifiable decision‑making in high‑stakes domains such as mathematical reasoning and legal judgment. In this study, we present a systematic analysis of logical reasoning under controlled increases in logical complexity, and reveal a previously unrecognized phenomenon, which we term Logical Phase Transitions: rather than degrading smoothly, logical reasoning performance remains stable within a regime but collapses abruptly beyond a critical logical depth, mirroring physical phase transitions such as water freezing beyond a critical temperature threshold. Building on this insight, we propose Neuro‑Symbolic Curriculum Tuning, a principled framework that adaptively aligns natural language with logical symbols to establish a shared representation, and reshapes training dynamics around phase‑transition boundaries to progressively strengthen reasoning at increasing logical depths. Experiments on five benchmarks show that our approach effectively mitigates logical reasoning collapse at high complexity, yielding average accuracy gains of +1.26 in naive prompting and +3.95 in CoT, while improving generalization to unseen logical compositions. Code and data are available at https://github.com/AI4SS/Logical‑Phase‑Transitions.
Authors:Yuetian Chen, Yuntao Du, Kaiyuan Zhang, Ashish Kundu, Charles Fleming, Bruno Ribeiro, Ninghui Li
Abstract:
Most membership inference attacks (MIAs) against Large Language Models (LLMs) rely on global signals, like average loss, to identify training data. This approach, however, dilutes the subtle, localized signals of memorization, reducing attack effectiveness. We challenge this global‑averaging paradigm, positing that membership signals are more pronounced within localized contexts. We introduce WBC (Window‑Based Comparison), which exploits this insight through a sliding window approach with sign‑based aggregation. Our method slides windows of varying sizes across text sequences, with each window casting a binary vote on membership based on loss comparisons between target and reference models. By ensembling votes across geometrically spaced window sizes, we capture memorization patterns from token‑level artifacts to phrase‑level structures. Extensive experiments across eleven datasets demonstrate that WBC substantially outperforms established baselines, achieving higher AUC scores and 2‑3 times improvements in detection rates at low false positive thresholds. Our findings reveal that aggregating localized evidence is fundamentally more effective than global averaging, exposing critical privacy vulnerabilities in fine‑tuned LLMs.
Authors:Arjun S. Nair
Abstract:
Large language model fine‑tuning is bottlenecked by memory: a 7B parameter model requires 84GB‑‑14GB for weights, 14GB for gradients, and 56GB for FP32 optimizer states‑‑exceeding even A100‑40GB capacity. We present Chronicals, an open‑source training framework achieving 3.51x speedup over Unsloth through four synergistic optimizations: (1) fused Triton kernels eliminating 75% of memory traffic via RMSNorm (7x), SwiGLU (5x), and QK‑RoPE (2.3x) fusion; (2) Cut Cross‑Entropy reducing logit memory from 5GB to 135MB through online softmax computation; (3) LoRA+ with theoretically‑derived 16x differential learning rates between adapter matrices; and (4) Best‑Fit Decreasing sequence packing recovering 60‑75% of compute wasted on padding. On Qwen2.5‑0.5B with A100‑40GB, Chronicals achieves 41,184 tokens/second for full fine‑tuning versus Unsloth's 11,736 tokens/second (3.51x). For LoRA at rank 32, we reach 11,699 tokens/second versus Unsloth MAX's 2,857 tokens/second (4.10x). Critically, we discovered that Unsloth's reported 46,000 tokens/second benchmark exhibited zero gradient norms‑‑the model was not training. We provide complete mathematical foundations: online softmax correctness proofs, FlashAttention IO complexity bounds O(N^2 d^2 M^‑1), LoRA+ learning rate derivations from gradient magnitude analysis, and bin‑packing approximation guarantees. All implementations, benchmarks, and proofs are available at https://github.com/Ajwebdevs/Chronicals with pip installation via https://pypi.org/project/chronicals/.
Authors:Hossein Rajabzadeh, Maryam Dialameh, Chul B. Park, Il-Min Kim, Hyock Ju Kwon
Abstract:
Autoregressive large language models (LLMs) are bottlenecked by sequential decoding, where each new token typically requires executing all transformer layers. Existing dynamic‑depth and layer‑skipping methods reduce this cost, but often rely on auxiliary routing mechanisms or incur accuracy degradation when bypassed layers are left uncompensated. We present LoRA‑Drop, a plug‑and‑play inference framework that accelerates decoding by applying a \emphtemporal compute schedule to a fixed subset of intermediate layers: on most decoding steps, selected layers reuse the previous‑token hidden state and apply a low‑rank LoRA correction, while periodic \emphrefresh steps execute the full model to prevent drift. LoRA‑Drop requires no routing network, is compatible with standard KV caching, and can reduce KV‑cache footprint by skipping KV updates in droppable layers during LoRA steps and refreshing periodically. Across LLaMA2‑7B, LLaMA3‑8B, Qwen2.5‑7B, and Qwen2.5‑14B, LoRA‑Drop achieves up to 2.6× faster decoding and 45‑‑55% KV‑cache reduction while staying within 0.5 percentage points (pp) of baseline accuracy. Evaluations on reasoning (GSM8K, MATH, BBH), code generation (HumanEval, MBPP), and long‑context/multilingual benchmarks (LongBench, XNLI, XCOPA) identify a consistent \emphsafe zone of scheduling configurations that preserves quality while delivering substantial efficiency gains, providing a simple path toward adaptive‑capacity inference in LLMs. Codes are available at https://github.com/hosseinbv/LoRA‑Drop.git.
Authors:Hyeong Kyu Choi, Sharon Li
Abstract:
Selecting a single high‑quality output from multiple stochastic generations remains a fundamental challenge for large language models (LLMs), particularly in open‑ended tasks where no canonical answer exists. While Best‑of‑N and self‑consistency methods show that aggregating multiple generations can improve performance, existing approaches typically rely on external evaluators, reward models, or exact string‑match voting, limiting their applicability and efficiency. We propose Mode Extraction (ModeX), an evaluator‑free Best‑of‑N selection framework that generalizes majority voting to open‑ended text generation by identifying the modal output representing the dominant semantic consensus among generated texts. ModeX constructs a similarity graph over candidate generations and recursively applies spectral clustering to select a representative centroid, without requiring additional inference or auxiliary models. We further instantiate this selection principle as ModeX‑Lite, an improved version of ModeX with early pruning for efficiency. Across open‑ended tasks ‑‑ including text summarization, code generation, and mathematical reasoning ‑‑ our approaches consistently outperform standard single‑ and multi‑path baselines, providing a computationally efficient solution for robust open‑ended text generation. Code is released in https://github.com/deeplearning‑wisc/ModeX.
Authors:Inpyo Song, Eunji Jeon, Jangwon Lee
Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains, including software development, education, and technical assistance. Among these, software development is one of the key areas where LLMs are increasingly adopted. However, when hardware constraints are considered‑for instance, in physical computing, where software must interact with and control physical hardware their effectiveness has not been fully explored. To address this gap, we introduce \textscPCEval (Physical Computing Evaluation), the first benchmark in physical computing that enables a fully automatic evaluation of the capabilities of LLM in both the logical and physical aspects of the projects, without requiring human assessment. Our evaluation framework assesses LLMs in generating circuits and producing compatible code across varying levels of project complexity. Through comprehensive testing of 13 leading models, \textscPCEval provides the first reproducible and automatically validated empirical assessment of LLMs' ability to reason about fundamental hardware implementation constraints within a simulation environment. Our findings reveal that while LLMs perform well in code generation and logical circuit design, they struggle significantly with physical breadboard layout creation, particularly in managing proper pin connections and avoiding circuit errors. \textscPCEval advances our understanding of AI assistance in hardware‑dependent computing environments and establishes a foundation for developing more effective tools to support physical computing education.
Authors:Yihao Liang, Ze Wang, Hao Chen, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Emad Barsoum, Zicheng Liu, Niraj K. Jha
Abstract:
Autoregressive large language models achieve strong results on many benchmarks, but decoding remains fundamentally latency‑limited by sequential dependence on previously generated tokens. Diffusion language models (DLMs) promise parallel generation but suffer from a fundamental static‑to‑dynamic misalignment: Training optimizes local transitions under fixed schedules, whereas efficient inference requires adaptive "long‑jump" refinements through unseen states. Our goal is to enable highly parallel decoding for DLMs with low number of function evaluations while preserving generation quality. To achieve this, we propose CD4LM, a framework that decouples training from inference via Discrete‑Space Consistency Distillation (DSCD) and Confidence‑Adaptive Decoding (CAD). Unlike standard objectives, DSCD trains a student to be trajectory‑invariant, mapping diverse noisy states directly to the clean distribution. This intrinsic robustness enables CAD to dynamically allocate compute resources based on token confidence, aggressively skipping steps without the quality collapse typical of heuristic acceleration. On GSM8K, CD4LM matches the LLaDA baseline with a 5.18x wall‑clock speedup; across code and math benchmarks, it strictly dominates the accuracy‑efficiency Pareto frontier, achieving a 3.62x mean speedup while improving average accuracy. Code is available at https://github.com/yihao‑liang/CDLM
Authors:Chuanrui Hu, Xingze Gao, Zuyi Zhou, Dannong Xu, Yi Bai, Xintong Li, Hui Zhang, Tong Li, Chong Zhang, Lidong Bing, Yafeng Deng
Abstract:
Large Language Models (LLMs) are increasingly deployed as long‑term interactive agents, yet their limited context windows make it difficult to sustain coherent behavior over extended interactions. Existing memory systems often store isolated records and retrieve fragments, limiting their ability to consolidate evolving user states and resolve conflicts. We introduce EverMemOS, a self‑organizing memory operating system that implements an engram‑inspired lifecycle for computational memory. Episodic Trace Formation converts dialogue streams into MemCells that capture episodic traces, atomic facts, and time‑bounded Foresight signals. Semantic Consolidation organizes MemCells into thematic MemScenes, distilling stable semantic structures and updating user profiles. Reconstructive Recollection performs MemScene‑guided agentic retrieval to compose the necessary and sufficient context for downstream reasoning. Experiments on LoCoMo and LongMemEval show that EverMemOS achieves state‑of‑the‑art performance on memory‑augmented reasoning tasks. We further report a profile study on PersonaMem v2 and qualitative case studies illustrating chat‑oriented capabilities such as user profiling and Foresight. Code is available at https://github.com/EverMind‑AI/EverMemOS.
Authors:Almaz Ermilov
Abstract:
This paper presents FormationEval, an open multiple‑choice question benchmark for evaluating language models on petroleum geoscience and subsurface disciplines. The dataset contains 505 questions across seven domains including petrophysics, petroleum geology and reservoir engineering, derived from three authoritative sources using a reasoning model with detailed instructions and a concept‑based approach that avoids verbatim copying of copyrighted text. Each question includes source metadata to support traceability and audit. The evaluation covers 72 models from major providers including OpenAI, Anthropic, Google, Meta and open‑weight alternatives. The top performers achieve over 97% accuracy, with Gemini 3 Pro Preview reaching 99.8%, while tier and domain gaps persist. Among open‑weight models, GLM‑4.7 leads at 98.6%, with several DeepSeek, Llama, Qwen and Mistral models also exceeding 93%. The performance gap between open‑weight and closed models is narrower than expected, with several lower‑cost open‑weight models exceeding 90% accuracy. Petrophysics emerges as the most challenging domain across all models, while smaller models show wider performance variance. Residual length bias in the dataset (correct answers tend to be longer) is documented along with bias mitigation strategies applied during construction. The benchmark, evaluation code and results are publicly available.
Authors:Omar Momen, Emilie Sitter, Berenike Herrmann, Sina Zarrieß
Abstract:
Novel metaphor comprehension involves complex semantic processes and linguistic creativity, making it an interesting task for studying language models (LMs). This study investigates whether surprisal, a probabilistic measure of predictability in LMs, correlates with annotations of metaphor novelty in different datasets. We analyse the surprisal of metaphoric words in corpus‑based and synthetic metaphor datasets using 16 causal LM variants. We propose a cloze‑style surprisal method that conditions on full‑sentence context. Results show that LM surprisal yields significant moderate correlations with scores/labels of metaphor novelty. We further identify divergent scaling patterns: on corpus‑based data, correlation strength decreases with model size (inverse scaling effect), whereas on synthetic data it increases (quality‑power hypothesis). We conclude that while surprisal can partially account for annotations of metaphor novelty, it remains limited as a metric of linguistic creativity. Code and data are publicly available: https://github.com/OmarMomen14/surprisal‑metaphor‑novelty
Authors:Tran Sy Bao
Abstract:
Sign language translation systems typically require English as an intermediary language, creating barriers for non‑English speakers in the global deaf community. We present Canonical Semantic Form (CSF), a language‑agnostic semantic representation framework that enables direct translation from any source language to sign language without English mediation. CSF decomposes utterances into nine universal semantic slots: event, intent, time, condition, agent, object, location, purpose, and modifier. A key contribution is our comprehensive condition taxonomy comprising 35 condition types across eight semantic categories, enabling nuanced representation of conditional expressions common in everyday communication. We train a lightweight transformer‑based extractor (0.74 MB) that achieves 99.03% average slot extraction accuracy across four typologically diverse languages: English, Vietnamese, Japanese, and French. The model demonstrates particularly strong performance on condition classification (99.4% accuracy) despite the 35‑class complexity. With inference latency of 3.02ms on CPU, our approach enables real‑time sign language generation in browser‑based applications. We release our code, trained models, and multilingual dataset to support further research in accessible sign language technology.
Authors:Jinghan Ru, Siyuan Yan, Yuguo Yin, Yuexian Zou, Zongyuan Ge
Abstract:
Multimodal Large Language Models (MLLMs) show promise for medical applications, yet progress in dermatology lags due to limited training data, narrow task coverage, and lack of clinically‑grounded supervision that mirrors expert diagnostic workflows. We present a comprehensive framework to address these gaps. First, we introduce DermoInstruct, a large‑scale morphology‑anchored instruction corpus comprising 211,243 images and 772,675 trajectories across five task formats, capturing the complete diagnostic pipeline from morphological observation and clinical reasoning to final diagnosis. Second, we establish DermoBench, a rigorous benchmark evaluating 11 tasks across four clinical axes: Morphology, Diagnosis, Reasoning, and Fairness, including a challenging subset of 3,600 expert‑verified open‑ended instances and human performance baselines. Third, we develop DermoGPT, a dermatology reasoning MLLM trained via supervised fine‑tuning followed by our Morphologically‑Anchored Visual‑Inference‑Consistent (MAVIC) reinforcement learning objective, which enforces consistency between visual observations and diagnostic conclusions. At inference, we deploy Confidence‑Consistency Test‑time adaptation (CCT) for robust predictions. Experiments show DermoGPT significantly outperforms 16 representative baselines across all axes, achieving state‑of‑the‑art performance while substantially narrowing the human‑AI gap. DermoInstruct, DermoBench and DermoGPT will be made publicly available at https://github.com/mendicant04/DermoGPT upon acceptance.
Authors:Jinwei Hu, Xinmiao Huang, Youcheng Sun, Yi Dong, Xiaowei Huang
Abstract:
As large language models (LLMs) transition to autonomous agents synthesizing real‑time information, their reasoning capabilities introduce an unexpected attack surface. This paper introduces a novel threat where colluding agents steer victim beliefs using only truthful evidence fragments distributed through public channels, without relying on covert communications, backdoors, or falsified documents. By exploiting LLMs' overthinking tendency, we formalize the first cognitive collusion attack and propose Generative Montage: a Writer‑Editor‑Director framework that constructs deceptive narratives through adversarial debate and coordinated posting of evidence fragments, causing victims to internalize and propagate fabricated conclusions. To study this risk, we develop CoPHEME, a dataset derived from real‑world rumor events, and simulate attacks across diverse LLM families. Our results show pervasive vulnerability across 14 LLM families: attack success rates reach 74.4% for proprietary models and 70.6% for open‑weights models. Counterintuitively, stronger reasoning capabilities increase susceptibility, with reasoning‑specialized models showing higher attack success than base models or prompts. Furthermore, these false beliefs then cascade to downstream judges, achieving over 60% deception rates, highlighting a socio‑technical vulnerability in how LLM‑based agents interact with dynamic information environments. Our implementation and data are available at: https://github.com/CharlesJW222/Lying_with_Truth/tree/main.
Authors:Jakub Hoscilowicz
Abstract:
We examine two properties of AI systems: capability (what a system can do) and steerability (how reliably one can shift behavior toward intended outcomes). A central question is whether capability growth reduces steerability and risks control collapse. We also distinguish between authorized steerability (builders reliably reaching intended behaviors) and unauthorized steerability (attackers eliciting disallowed behaviors). This distinction highlights a fundamental safety‑‑security dilemma of AI models: safety requires high steerability to enforce control (e.g., stop/refuse), while security requires low steerability for malicious actors to elicit harmful behaviors. This tension presents a significant challenge for open‑weight models, which currently exhibit high steerability via common techniques like fine‑tuning or adversarial attacks. Using Qwen3 and InstrumentalEval, we find that a short anti‑instrumental prompt suffix sharply reduces the measured convergence rate (e.g., shutdown avoidance, self‑replication). For Qwen3‑30B Instruct, the convergence rate drops from 81.69% under a pro‑instrumental suffix to 2.82% under an anti‑instrumental suffix. Under anti‑instrumental prompting, larger aligned models show lower convergence rates than smaller ones (Instruct: 2.82% vs. 4.23%; Thinking: 4.23% vs. 9.86%). Code is available at github.com/j‑hoscilowicz/instrumental_steering.
Authors:Yuxiang Mei, Dongxing Xu, Jiaen Liang, Yanhua Long
Abstract:
The INTERSPEECH 2025 Challenge on Multilingual Conversational Speech Language Models (MLC‑SLM) promotes multilingual conversational ASR with large language models (LLMs). Our previous SHNU‑mASR system adopted a competitive parallel‑speech‑encoder architecture that integrated Whisper and mHuBERT with an LLM. However, it faced two challenges: simple feature concatenation may not fully exploit complementary information, and the performance gap between LLM‑based ASR and end‑to‑end(E2E) encoder‑decoder ASR remained unexplored. In this work, we present an enhanced LLM‑based ASR framework that combines fine‑tuned Whisper and mHuBERT encoders with an LLM to enrich speech representations. We first evaluate E2E Whisper models with LoRA and full fine‑tuning on the MLC‑SLM ASR task, and then propose cross‑attention‑based fusion mechanisms for the parallel‑speech‑encoder. On the official evaluation set of the MLC‑SLM Challenge, our system achieves a CER/WER of 10.69%, ranking on par with the top‑ranked Track 1 systems, even though it uses only 1,500 hours of baseline training data compared with their large‑scale training sets. Nonetheless, we find that our final LLM‑based ASR still does not match the performance of a fine‑tuned E2E Whisper model, providing valuable empirical guidance for future Speech‑LLM design. Our code is publicly available at https://github.com/1535176727/MLC‑SLM.
Authors:Chaofan Tao, Jierun Chen, Yuxin Jiang, Kaiqi Kou, Shaowei Wang, Ruoyu Wang, Xiaohui Li, Sidi Yang, Yiming Du, Jianbo Dai, Zhiming Mao, Xinyu Wang, Lifeng Shang, Haoli Bai
Abstract:
We present SWE‑Lego, a supervised fine‑tuning (SFT) recipe designed to achieve state‑ofthe‑art performance in software engineering (SWE) issue resolving. In contrast to prevalent methods that rely on complex training paradigms (e.g., mid‑training, SFT, reinforcement learning, and their combinations), we explore how to push the limits of a lightweight SFT‑only approach for SWE tasks. SWE‑Lego comprises three core building blocks, with key findings summarized as follows: 1) the SWE‑Lego dataset, a collection of 32k highquality task instances and 18k validated trajectories, combining real and synthetic data to complement each other in both quality and quantity; 2) a refined SFT procedure with error masking and a difficulty‑based curriculum, which demonstrably improves action quality and overall performance. Empirical results show that with these two building bricks alone,the SFT can push SWE‑Lego models to state‑of‑the‑art performance among open‑source models of comparable size on SWE‑bench Verified: SWE‑Lego‑Qwen3‑8B reaches 42.2%, and SWE‑Lego‑Qwen3‑32B attains 52.6%. 3) We further evaluate and improve test‑time scaling (TTS) built upon the SFT foundation. Based on a well‑trained verifier, SWE‑Lego models can be significantly boosted‑‑for example, 42.2% to 49.6% and 52.6% to 58.8% under TTS@16 for the 8B and 32B models, respectively.
Authors:Zihua Yang, Xin Liao, Yiqun Zhang, Yiu-ming Cheung
Abstract:
Categorical data are prevalent in domains such as healthcare, marketing, and bioinformatics, where clustering serves as a fundamental tool for pattern discovery. A core challenge in categorical data clustering lies in measuring similarity among attribute values that lack inherent ordering or distance. Without appropriate similarity measures, values are often treated as equidistant, creating a semantic gap that obscures latent structures and degrades clustering quality. Although existing methods infer value relationships from within‑dataset co‑occurrence patterns, such inference becomes unreliable when samples are limited, leaving the semantic context of the data underexplored. To bridge this gap, we present ARISE (Attention‑weighted Representation with Integrated Semantic Embeddings), which draws on external semantic knowledge from Large Language Models (LLMs) to construct semantic‑aware representations that complement the metric space of categorical data for accurate clustering. That is, LLM is adopted to describe attribute values for representation enhancement, and the LLM‑enhanced embeddings are combined with the original data to explore semantically prominent clusters. Experiments on eight benchmark datasets demonstrate consistent improvements over seven representative counterparts, with gains of 19‑27%. Code is available at https://github.com/develop‑yang/ARISE
Authors:Patricio Vera
Abstract:
Language generation maps a rich, high‑dimensional internal state to a single token sequence. We study this many‑to‑one mapping through the lens of intention collapse: the projection from an internal intention space I to an external language space L. We introduce three cheap, model‑agnostic metrics computed on a pre‑collapse state I: (i) intention entropy Hint(I), (ii) effective dimensionality deff(I), and (iii) recoverability Recov(I), operationalized as probe AUROC for predicting eventual success. We evaluate these metrics in a 3x3 study across models (Mistral‑7B, LLaMA‑3.1‑8B, Qwen‑2.5‑7B) and benchmarks (GSM8K, ARC‑Challenge, AQUA‑RAT), comparing baseline, chain‑of‑thought (CoT), and a babble control (n=200 items per cell). CoT increases average accuracy from 34.2% to 47.3% (+13.1 pp), driven by large gains on GSM8K but consistent degradations on ARC‑Challenge. Across models, CoT induces distinct entropy regimes relative to baseline, dH = Hint(CoT) ‑ Hint(Base): Mistral shows dH < 0 (lower‑entropy CoT), whereas LLaMA shows dH > 0 (higher‑entropy CoT), highlighting heterogeneity in CoT‑induced internal uncertainty. Finally, probe AUROC is significantly above chance in a subset of settings and can dissociate from behavioral accuracy (e.g., high AUROC alongside lower CoT accuracy on ARC‑Challenge for Qwen), suggesting that informative internal signal is not always reliably converted into a final discrete decision under constrained response formats.
Authors:Jawad Chowdhury, Rezaur Rashid, Gabriel Terejanu
Abstract:
Understanding affective polarization in online discourse is crucial for evaluating the societal impact of social media interactions. This study presents a novel framework that leverages large language models (LLMs) and domain‑informed heuristics to systematically analyze and quantify affective polarization in discussions on divisive topics such as climate change and gun control. Unlike most prior approaches that relied on sentiment analysis or predefined classifiers, our method integrates LLMs to extract stance, affective tone, and agreement patterns from large‑scale social media discussions. We then apply a rule‑based scoring system capable of quantifying affective polarization even in small conversations consisting of single interactions, based on stance alignment, emotional content, and interaction dynamics. Our analysis reveals distinct polarization patterns that are event dependent: (i) anticipation‑driven polarization, where extreme polarization escalates before well‑publicized events, and (ii) reactive polarization, where intense affective polarization spikes immediately after sudden, high‑impact events. By combining AI‑driven content annotation with domain‑informed scoring, our framework offers a scalable and interpretable approach to measuring affective polarization. The source code is publicly available at: https://github.com/hasanjawad001/llm‑social‑media‑polarization.
Authors:Gihyeon Sim
Abstract:
Large language models apply uniform computation to all inputs, regardless of difficulty. We propose PonderTTT, a gating strategy using the TTT layer's self‑supervised reconstruction loss to selectively trigger Test‑Time Training (TTT) updates. The gating decision itself is training‑free‑‑requiring no learned classifier or auxiliary networks; only a single scalar threshold is initially calibrated on unlabeled data and continuously adapted via EMA to maintain target update rates. Our experiments with GPT‑2 models (124M to 1.5B) on code language modeling (The Stack v2, teacher‑forced perplexity) demonstrate that this signal is inference‑compatible, requiring no ground‑truth labels. Our Reconstruction Gating achieves 82‑89% Oracle Recovery while being fully training‑free, significantly outperforming Random Skip baselines (up to 16% lower loss on OOD languages).
Authors:Tao An
Abstract:
Conversation summarization loses nuanced details: when asked about coding preferences after 40 turns, summarization recalls "use type hints" but drops the critical constraint "everywhere" (19.0% exact match vs. 93.0% for our approach). We present CogCanvas, a training‑free framework inspired by how teams use whiteboards to anchor shared memory. Rather than compressing conversation history, CogCanvas extracts verbatim‑grounded artifacts (decisions, facts, reminders) and retrieves them via temporal‑aware graph. On the LoCoMo benchmark (all 10 conversations from the ACL 2024 release), CogCanvas achieves the highest overall accuracy among training‑free methods (32.4%), outperforming RAG (24.6%) by +7.8pp, with decisive advantages on complex reasoning tasks: +20.6pp on temporal reasoning (32.7% vs. 12.1% RAG) and +1.1pp on multi‑hop questions (41.7% vs. 40.6% RAG). CogCanvas also leads on single‑hop retrieval (26.6% vs. 24.6% RAG). Ablation studies reveal that BGE reranking contributes +7.7pp, making it the largest contributor to CogCanvas's performance. While heavily‑optimized approaches achieve higher absolute scores through dedicated training (EverMemOS: ~92%), our training‑free approach provides practitioners with an immediately‑deployable alternative that significantly outperforms standard baselines. Code and data: https://github.com/tao‑hpu/cog‑canvas
Authors:Thomas Katraouras, Dimitrios Rafailidis
Abstract:
Large Language Models (LLMs) have become a mainstay for many everyday applications. However, as data evolve their knowledge quickly becomes outdated. Continual learning aims to update LLMs with new information without erasing previously acquired knowledge. Although methods such as full fine‑tuning can incorporate new data, they are computationally expensive and prone to catastrophic forgetting, where prior knowledge is overwritten. Memory‑augmented approaches address this by equipping LLMs with a memory bank, that is an external memory module which stores information for future use. However, these methods face a critical limitation, in particular, the memory bank constantly grows in the real‑world scenario when large‑scale data streams arrive. In this paper, we propose MBC, a model that compresses the memory bank through a codebook optimization strategy during online adaptation learning. To ensure stable learning, we also introduce an online resetting mechanism that prevents codebook collapse. In addition, we employ Key‑Value Low‑Rank Adaptation in the attention layers of the LLM, enabling efficient utilization of the compressed memory representations. Experiments with benchmark question‑answering datasets demonstrate that MBC reduces the memory bank size to 0.3% when compared against the most competitive baseline, while maintaining high retention accuracy during online adaptation learning. Our code is publicly available at https://github.com/Thomkat/MBC.
Authors:Ishir Garg, Neel Kolhe, Xuandong Zhao, Dawn Song
Abstract:
Large language models (LLMs) have demonstrated significant advancements in reasoning and code generation. However, efficiently creating new benchmarks to evaluate these capabilities remains a challenge. Traditional benchmark creation relies on manual human effort, a process that is both expensive and time‑consuming. Furthermore, existing benchmarks often contaminate LLM training data, necessitating novel and diverse benchmarks to accurately assess their genuine capabilities. This work introduces InfoSynth, a novel framework for automatically generating and evaluating reasoning benchmarks guided by information‑theoretic principles. We propose metrics based on KL‑divergence and entropy to quantify benchmark novelty and diversity without relying on costly model evaluations. Building on this framework, we develop an end‑to‑end pipeline that synthesizes robust Python coding problems from seed datasets using genetic algorithms and iterative code feedback. Our method generates accurate test cases and solutions to new problems 97% of the time, and the synthesized benchmarks consistently exhibit higher novelty and diversity compared to their seed datasets. Moreover, our algorithm provides a method for controlling the novelty/diversity and difficulty of generated problems. InfoSynth offers a scalable, self‑verifying pipeline for constructing high‑quality, novel and diverse benchmarks for LLMs. Project Page: https://ishirgarg.github.io/infosynth_web/
Authors:Yifan Zhang, Yifeng Liu, Mengdi Wang, Quanquan Gu
Abstract:
The efficacy of deep residual networks is fundamentally predicated on the identity shortcut connection. While this mechanism effectively mitigates the vanishing gradient problem, it imposes a strictly additive inductive bias on feature transformations, thereby limiting the network's capacity to model complex state transitions. In this paper, we introduce Deep Delta Learning (DDL), a novel architecture that generalizes the standard residual connection by modulating the identity shortcut with a learnable, data‑dependent geometric transformation. This transformation, termed the Delta Operator, constitutes a rank‑1 perturbation of the identity matrix, parameterized by a reflection direction vector \mathbfk(\mathbfX) and a gating scalar β(\mathbfX). We provide a spectral analysis of this operator, demonstrating that the gate β(\mathbfX) enables dynamic interpolation between identity mapping, orthogonal projection, and geometric reflection. Furthermore, we restructure the residual update as a synchronous rank‑1 injection, where the gate acts as a dynamic step size governing both the erasure of old information and the writing of new features. This unification empowers the network to explicitly control the spectrum of its layer‑wise transition operator, enabling the modeling of complex, non‑monotonic dynamics while preserving the stable training characteristics of gated residual architectures.
Authors:Biao Wu, Meng Fang, Ling Chen, Ke Xu, Tao Cheng, Jun Wang
Abstract:
Recent advances in vision‑language models have opened up new possibilities for reasoning‑driven image geolocalization. However, existing approaches often rely on synthetic reasoning annotations or external image retrieval, which can limit interpretability and generalizability. In this paper, we present Geo‑R, a retrieval‑free framework that uncovers structured reasoning paths from existing ground‑truth coordinates and optimizes geolocation accuracy via reinforcement learning. We propose the Chain of Region, a rule‑based hierarchical reasoning paradigm that generates precise, interpretable supervision by mapping GPS coordinates to geographic entities (e.g., country, province, city) without relying on model‑generated or synthetic labels. Building on this, we introduce a lightweight reinforcement learning strategy with coordinate‑aligned rewards based on Haversine distance, enabling the model to refine predictions through spatially meaningful feedback. Our approach bridges structured geographic reasoning with direct spatial supervision, yielding improved localization accuracy, stronger generalization, and more transparent inference. Experimental results across multiple benchmarks confirm the effectiveness of Geo‑R, establishing a new retrieval‑free paradigm for scalable and interpretable image geolocalization. To facilitate further research and ensure reproducibility, both the model and code will be made publicly available.